---
title: It's Generalization That Counts
---
## It's Generalization That Counts
							
								
The power of machine learning comes from not having to hard-code or explicitly
define the parameters that describe both your training data and unseen data.
This is the fundamental goal of machine learning: to generalize beyond the
examples the learner was trained on.
							
								
To test how well a learner generalizes, you need a separate test data set that
is not used in any way while training the learner. You can create one either by
splitting your full data set into a training set and a test set, or by
collecting more data. If the learner sees any of the test data during training,
its test score will be optimistically biased: it will appear to perform better
than it would on truly unseen data.
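As a rough illustration of holding out a test set, here is a minimal sketch in plain Python. The helper name `train_test_split`, the 20% test fraction, and the fixed seed are assumptions made for this example, not part of any particular library.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test) lists.

    Hypothetical helper for illustration; the 80/20 default is an assumption.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                      # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)      # seeded shuffle for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows become the test set; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]


examples = list(range(10))                     # stand-in for ten labelled examples
train, test = train_test_split(examples, test_fraction=0.2)
print(len(train), len(test))
```

The important property is that the two lists never overlap, so nothing in `test` can influence training.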
							
								
One way to get a sense of how your learner will do on a test data set is to
perform **cross-validation**. This randomly splits your training data into a
given number of subsets (for example, ten) and holds one subset out while the
learner trains on the rest. Once the learner has been trained, the held-out
subset is used for testing. This cycle of holding one subset out, training, and
testing is repeated as you rotate through the subsets, and the results are
averaged to estimate how well the learner performs.
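The rotation described above can be sketched in plain Python as follows. The function names `k_fold_indices` and `cross_validate`, the toy labels, and the majority-class "learner" are all hypothetical stand-ins chosen for this example; a real project would plug in its own training and scoring code.

```python
import random


def k_fold_indices(n_examples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs, holding out one fold at a time."""
    if not 2 <= k <= n_examples:
        raise ValueError("k must be between 2 and the number of examples")
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)           # random assignment to folds
    folds = [indices[i::k] for i in range(k)]      # k roughly equal subsets
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx


def cross_validate(n_examples, fit_and_score, k=10, seed=0):
    """Average the score returned by `fit_and_score` over all k folds."""
    scores = [fit_and_score(train_idx, test_idx)
              for train_idx, test_idx in k_fold_indices(n_examples, k, seed)]
    return sum(scores) / len(scores)


# Toy data (an assumption for this sketch): 14 examples of class 0, 6 of class 1.
labels = [0] * 14 + [1] * 6


def majority_baseline(train_idx, test_idx):
    """'Train' by picking the most common training label, then score accuracy."""
    train_labels = [labels[i] for i in train_idx]
    guess = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == guess for i in test_idx) / len(test_idx)


mean_accuracy = cross_validate(len(labels), majority_baseline, k=5)
print(f"mean accuracy over 5 folds: {mean_accuracy:.2f}")
```

Each example lands in exactly one held-out fold, so every score is computed on data the learner did not train on in that round.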
							
								
#### More Information:
							
								
- <a href='https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf' target='_blank' rel='nofollow'>A Few Useful Things to Know about Machine Learning</a>
- <a href='https://stats.stackexchange.com/a/153058/132399' target='_blank' rel='nofollow'>"How do you use test data set after Cross-validation"</a>