---
title: Random Forest
---

## Random Forest
A Random Forest is a group of decision trees that make better decisions as a whole than individually.
### Problem
Decision trees by themselves are prone to **overfitting**. This means that a tree becomes so accustomed to the training data that it has difficulty making good decisions about data it has never seen before.
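A quick way to see this is to compare a single tree's accuracy on its own training data with its accuracy on held-out data. This is only an illustrative sketch: the synthetic dataset and parameters below are made up for the demonstration and are not from the competition discussed later.

```python
# Illustrative sketch: a single, fully grown decision tree tends to
# memorize its training data but generalize poorly to unseen data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42)  # no depth limit
tree.fit(X_train, y_train)

print("train accuracy:", tree.score(X_train, y_train))  # typically ~1.0
print("test accuracy:", tree.score(X_test, y_test))     # noticeably lower
```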
### Solution with Random Forests
Random Forests belong to the category of **ensemble learning** algorithms. This class of algorithms uses many estimators to yield better results, which usually makes Random Forests **more accurate** than plain decision trees. In a Random Forest, many decision trees are created, and each tree is **trained on a random subset of the data and a random subset of the features of that data**. This greatly reduces the chance of the estimators overfitting, because **each of them works on different data and features** than the others. This method of creating many estimators and training them on random subsets of the data is an *ensemble learning* technique called **bagging**, short for *Bootstrap AGGregatING*. To make a prediction, each of the decision trees votes on the correct answer (classification), or the mean of their results is taken (regression).
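As a minimal sketch of this idea (again on illustrative synthetic data), scikit-learn's `RandomForestClassifier` handles the bagging and the voting for us:

```python
# Minimal sketch: a Random Forest built with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 100 trees is trained on a bootstrap sample of the rows,
# and each split considers only a random subset of the features.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# For classification the trees vote, and predict() returns the majority
# class; the test accuracy is usually higher than a single tree's.
print("test accuracy:", forest.score(X_test, y_test))
```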
### Example of Boosting in Python
Boosting is another *ensemble learning* technique: rather than training the estimators independently as bagging does, it trains them one after another, with each new estimator focusing on the examples the previous ones got wrong.

In this competition, we are given a list of collision events and their properties, and must predict whether a τ → 3μ decay happened in a given collision. This decay is currently assumed by scientists not to happen, and the goal of the competition was to discover it occurring more frequently than current physics can explain.
The challenge here was to design a machine learning approach for something no one has ever observed before. Scientists at CERN provided the following dataset to make this possible:

https://www.kaggle.com/c/flavours-of-physics/data
```python
# Data cleaning
import pandas as pd

data_test = pd.read_csv("test.csv")
data_train = pd.read_csv("training.csv")

# Drop the columns that exist only in the training set,
# so the model sees the same features as in test.csv
data_train = data_train.drop(columns=['min_ANNmuon', 'production', 'mass'])

# Cleaned data: 'signal' is the target, everything else is a feature
Y = data_train['signal']
X = data_train.drop(columns=['signal'])

# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

seed = 9001  # this one's over 9000!!!
boosted_tree = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), algorithm="SAMME",
                                  n_estimators=50, random_state=seed)
model = boosted_tree.fit(X, Y)

predictions = model.predict(data_test)
print(predictions)
# Note: we can't really validate these predictions,
# since we don't have an array of "right answers" for the test set

# Gradient boosting (it becomes *stochastic* gradient boosting
# if subsample < 1.0 is passed to the classifier)
from sklearn.ensemble import GradientBoostingClassifier

gradient_boosted_tree = GradientBoostingClassifier(n_estimators=50, random_state=seed)
model2 = gradient_boosted_tree.fit(X, Y)

predictions2 = model2.predict(data_test)
print(predictions2)
```
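For comparison with the bagging approach this article is about, a Random Forest can be trained on the same cleaned data in exactly the same way. This sketch continues from the variables defined above, and the hyperparameters are illustrative rather than tuned:

```python
# Sketch: the Random Forest counterpart to the boosted models above,
# reusing X, Y, data_test and seed from the previous block.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=50, random_state=seed)
model3 = forest.fit(X, Y)

predictions3 = model3.predict(data_test)
print(predictions3)
```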
#### More Information:
- <a href='https://www.wikiwand.com/en/Random_forest' target='_blank' rel='nofollow'>Random Forests (Wikipedia)</a>
- <a href='https://www.analyticsvidhya.com/blog/2014/06/introduction-random-forest-simplified/' target='_blank' rel='nofollow'>Introduction to Random Forests - Simplified</a>
- <a href='https://www.youtube.com/watch?v=loNcrMjYh64' target='_blank' rel='nofollow'>How Random Forest algorithm works (Video)</a>