38 lines
		
	
	
		
			1.9 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
		
		
			
		
	
	
			38 lines
		
	
	
		
			1.9 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
| 
								 | 
							
								---
							 | 
						||
| 
								 | 
							
								title: Dataset Splitting
							 | 
						||
| 
								 | 
							
								---
							 | 
						||
| 
								 | 
							
								## Dataset Splitting
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Splitting up into Training, Cross Validation, and Test sets are common best practices.
							 | 
						||
| 
								 | 
							
								This allows you to tune various parameters of the algorithm without making judgements that specifically conform to training data.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								### Motivation
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Dataset Splitting emerges as a necessity to eliminate bias to training data in ML algorithms.
							 | 
						||
| 
								 | 
							
								Modifying parameters of a ML algorithm to best fit the training data commonly results in an overfit algorithm that performs poorly on actual test data.
							 | 
						||
| 
								 | 
							
								For this reason, we split the dataset into multiple, discrete subsets on which we train different parameters.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								#### The Training Set
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								The Training set is used to compute the actual model your algorithm will use when exposed to new data.
							 | 
						||
| 
								 | 
							
								This dataset is typically 60%-80% of your entire available data (depending on whether or not you use a Cross Validation set).
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								#### The Cross Validation Set
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Cross Validation sets are for model selection (typically ~20% of your data).
							 | 
						||
| 
								 | 
							
								Use this dataset to try different parameters for the algorithm as trained on the Training set.
							 | 
						||
| 
								 | 
							
								For example, you can evaluate differnt model parameters (polynomial degree or lambda, the regularization parameter) on the Cross Validation set to see which may be most accurate.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								#### The Test Set
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								The Test set is the final dataset you touch (typically ~20% of your data).
							 | 
						||
| 
								 | 
							
								It is the source of truth.
							 | 
						||
| 
								 | 
							
								Your accuracy in predicting the test set is the accuracy of your ML algorithm.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								#### More Information:
							 | 
						||
| 
								 | 
							
								 - [AWS ML Doc](http://docs.aws.amazon.com/machine-learning/latest/dg/splitting-the-data-into-training-and-evaluation-data.html)
							 | 
						||
| 
								 | 
							
								 - [A good stackoverflow post](https://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio)
							 | 
						||
| 
								 | 
							
								 - [Academic Paper](https://www.mff.cuni.cz/veda/konference/wds/proc/pdf10/WDS10_105_i1_Reitermanova.pdf)
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								
							 |