---
title: Dataset Splitting
---
## Dataset Splitting

Splitting a dataset into Training, Cross Validation, and Test sets is a common best practice.
This lets you tune the parameters of an algorithm without making judgements that conform specifically to the training data.

### Motivation

Dataset Splitting is necessary to reduce bias toward the training data in ML algorithms.
Modifying the parameters of an ML algorithm to best fit the training data commonly results in an overfit algorithm that performs poorly on data it has not seen.
For this reason, we split the dataset into multiple, discrete subsets, each with its own job: fitting the model, choosing its parameters, and measuring its final accuracy.

#### The Training Set

The Training set is used to compute the actual model your algorithm will use when exposed to new data.
This dataset is typically 60%-80% of your entire available data (depending on whether or not you use a Cross Validation set).

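One way to carve out the three subsets can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name `split_dataset`, the 60/20/20 defaults, and the fixed seed are assumptions chosen for the example, and it uses only the standard library.

```python
import random


def split_dataset(rows, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle rows and cut them into (train, cv, test) lists.

    The fractions and the seed are illustrative defaults, not fixed rules.
    Whatever remains after the train and cv slices becomes the test set.
    """
    if train_frac <= 0 or cv_frac < 0 or train_frac + cv_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # shuffle to avoid ordering bias
    n_train = int(len(shuffled) * train_frac)
    n_cv = int(len(shuffled) * cv_frac)
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]
    return train, cv, test


train, cv, test = split_dataset(range(100))
print(len(train), len(cv), len(test))
```

Shuffling before slicing matters when the data is stored in a meaningful order (for example, sorted by label or by date); without it, the subsets may not be representative of each other.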
#### The Cross Validation Set

Cross Validation sets are for model selection (typically ~20% of your data).
Use this dataset to compare different parameter choices for the algorithm, each trained on the Training set.
For example, you can evaluate different model parameters (polynomial degree or lambda, the regularization parameter) on the Cross Validation set to see which is most accurate.

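The selection loop described here might look like the sketch below. It assumes a deliberately tiny model, a one-weight ridge regression `y ≈ w * x` with a closed-form fit, so that it runs without any libraries; the helper names, the toy data, and the candidate lambda values are all hypothetical.

```python
def fit_ridge_1d(points, lam):
    """Fit y ~ w * x with an L2 penalty lam (closed form for a single weight)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + lam)


def mean_squared_error(points, w):
    """Average squared prediction error of weight w on the given points."""
    return sum((y - w * x) ** 2 for x, y in points) / len(points)


def select_lambda(train, cv, candidates):
    """Fit on train for each candidate lambda, keep the one with lowest cv error."""
    best = None
    for lam in candidates:
        w = fit_ridge_1d(train, lam)     # the weight is learned from train only
        err = mean_squared_error(cv, w)  # the cv set decides between candidates
        if best is None or err < best[0]:
            best = (err, lam, w)
    return best[1], best[2]


# Toy (x, y) pairs, invented for illustration.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
cv = [(4.0, 8.0), (5.0, 10.1)]
best_lam, best_w = select_lambda(train, cv, [0.0, 0.1, 1.0, 10.0])
print(best_lam, best_w)
```

Note that the Cross Validation set never influences the fitted weight directly; it is only used to rank the candidates.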
#### The Test Set

The Test set is the final dataset you touch (typically ~20% of your data).
It is the source of truth.
Your accuracy in predicting the Test set is the accuracy you report for your ML algorithm, so evaluate on it only after all parameters have been chosen.

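As a rough sketch of that final measurement, the snippet below scores a classifier on a held-out Test set. The `predict` function is a hypothetical stand-in for a trained model (a simple threshold rule), and the labelled pairs are made up for the example.

```python
def accuracy(predict, test_set):
    """Fraction of (features, label) pairs in test_set that predict gets right."""
    if not test_set:
        raise ValueError("test set is empty")
    correct = sum(1 for features, label in test_set if predict(features) == label)
    return correct / len(test_set)


# Hypothetical stand-in for an already trained classifier.
def predict(x):
    return 1 if x >= 0.5 else 0


# Invented (feature, label) pairs held back until the very end.
test_set = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.7, 0)]
print(accuracy(predict, test_set))
```

If you go back and retune parameters after seeing this number, the Test set has effectively become a second Cross Validation set and the reported accuracy will be optimistic.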
#### More Information:

 - [AWS ML Doc](http://docs.aws.amazon.com/machine-learning/latest/dg/splitting-the-data-into-training-and-evaluation-data.html)
 - [A good stackoverflow post](https://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio)
 - [Academic Paper](https://www.mff.cuni.cz/veda/konference/wds/proc/pdf10/WDS10_105_i1_Reitermanova.pdf)