fix(guide): simplify directory structure

This commit is contained in:
Mrugesh Mohapatra
2018-10-16 21:26:13 +05:30
parent f989c28c52
commit da0df12ab7
35752 changed files with 0 additions and 317652 deletions

View File

@ -0,0 +1,62 @@
---
title: Backpropagation
---
## Backpropagation
Backpropagation is a subtopic of [neural networks](../neural-networks/index.md).
**Purpose:** It is an algorithm/process with the aim of minimizing the cost function (in other words, the error) of parameters in a neural network.
**Method:** This is done by calculating the gradients of each node in the network. These gradients measure the "error" each node contributes to the output layer, so in training a neural network, these gradients are minimized.
Backpropagation can be thought of as using the chain rule to compute gradients with respect to different parameters in a neural network in order to perform iterative updates to those parameters.
Note: Backpropagation, and machine learning in general, requires significant familiarity with linear algebra and matrix manipulation. Coursework or reading on this topic is highly recommended before trying to understand the contents of this article.
### Computation
The process of backpropagation can be explained in three steps.
Given the following:
- m training examples (x, y) on a neural network of L layers
- g = the sigmoid function
- Theta(i) = the transition matrix from the ith to the (i+1)th layer
- a(l) = g(z(l)); an array of the values of the nodes in layer l based on one training example
- z(l) = Theta(l-1)a(l-1)
- Delta = a set of L matrices that accumulate the gradient contributions between the ith and (i+1)th layers
- d(l) = the array of the gradients for layer l for one training example
- D = a set of L matrices with the final gradients for each node
- lambda = the regularization parameter for the network
In this case, for a matrix M, M' will denote the transpose of M.
1. Set all entries of Delta(i), for i from 1 to L, to zero.
2. For each training example t from 1 to m, perform the following:
- perform forward propagation on the example to compute a(l) and z(l) for each layer
- compute d(L) = a(L) - y(t)
- compute d(l) = (Theta(l)' • d(l+1)) .* g'(z(l)) for l from L-1 to 1, where .* denotes the element-wise product and g'(z) = g(z)(1 - g(z)) is the derivative of the sigmoid
- increment Delta(l) by d(l+1) • a(l)'
3. Plug the Delta matrices into our partial derivative matrices:
D(l) = (1/m)(Delta(l) + lambda • Theta(l)), if l ≠ 0
D(l) = (1/m) • Delta(l), if l = 0
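To make these steps concrete, here is a minimal NumPy sketch of one forward and backward pass on a tiny, hypothetical network (the layer sizes, random data, and sigmoid activation are assumptions chosen to mirror the notation above; this is an illustration, not a reference implementation):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

# Hypothetical 2-layer network: 3 inputs -> 4 hidden -> 1 output.
rng = np.random.default_rng(0)
Theta = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
x = rng.normal(size=3)
y = np.array([1.0])

# Forward propagation: compute z(l) and a(l) for each layer.
a = [x]
zs = []
for W in Theta:
    z = W @ a[-1]
    zs.append(z)
    a.append(sigmoid(z))

# Backward pass: d(L) = a(L) - y, then apply the chain rule layer by layer.
d = a[-1] - y
Delta = [np.zeros_like(W) for W in Theta]
for l in range(len(Theta) - 1, -1, -1):
    Delta[l] += np.outer(d, a[l])  # increment Delta(l) by d(l+1) . a(l)'
    if l > 0:
        # d(l) = (Theta(l)' . d(l+1)) .* g'(z(l))
        d = (Theta[l].T @ d) * sigmoid_grad(zs[l - 1])
```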
This article should only be understood in the greater context of neural networks and machine learning. Please read the attached references for a better understanding of the topic as a whole.
### More Information
**High-Level:**
* Siraj Raval - [Backpropagation in 5 Minutes](https://www.youtube.com/watch?v=q555kfIFUCM)
* [Backprop on Wikipedia](https://en.wikipedia.org/wiki/Backpropagation)
**In-depth:**
* Lecture 4 CS231n [Introduction to Neural Networks](https://youtu.be/d14TUNcbn1k?t=354)
* [In depth, wiki style article](https://brilliant.org/wiki/backpropagation/)
* [Article on computation graphs](http://colah.github.io/posts/2015-08-Backprop/)
* [A Step by Step Backpropagation Example](https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/)
* [Andrew Ng's ML Course](https://www.coursera.org/learn/machine-learning/)
If you'd like to learn how to implement a full-blown single (hidden) layer neural network in Python, whilst learning more about the math behind the algorithms used, you can register for [Andrew Ng's Deep Learning Specialization](https://www.coursera.org/specializations/deep-learning).

View File

@ -0,0 +1,19 @@
---
title: Bayes Classifier
---
The Bayes classifier is based on applying Bayes' theorem to update its belief about the probability of an event occurring.
![Bayes Theorem](https://github.com/Cheungo/bayes_theorem_image/blob/master/CodeCogsEqn.gif?raw=true)
Bayes Theorem allows you to compute the probability of B given A, provided you have probabilities for A given B, A, and B.
Bayes Classifier models uncertainty by keeping probabilities for each of the possible scenarios. As more information becomes available, the probabilities are updated to more accurately reflect what is now known about the given situation.
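As a quick, hedged illustration, here is a single Bayesian update in Python; the disease-test numbers are invented for the example:
```python
# One Bayesian update: P(B|A) = P(A|B) * P(B) / P(A).
# Hypothetical numbers: a test for a condition affecting 1% of people.
p_b = 0.01              # prior belief: P(condition)
p_a_given_b = 0.95      # P(positive test | condition)
p_a_given_not_b = 0.05  # P(positive test | no condition)

# Total probability of observing a positive test.
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

# Posterior: the updated belief after seeing the evidence.
p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)  # ~0.16, updated from the 1% prior
```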
#### Suggested Reading:
<!-- Please add any articles you think might be helpful to read before writing the article -->
- [A practical explanation of a Naive Bayes classifier](https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/)
- [Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
- [How Naive Bayes Classifier Works 1/2](https://youtu.be/XcwH9JGfZOU)
- [How Naive Bayes Classifier Works 2/2](https://youtu.be/k2diLn5Nqbs)

View File

@ -0,0 +1,10 @@
---
title: Brownian Motion
---
## Brownian Motion
Brownian motion or pedesis (from Ancient Greek: πήδησις /pɛ̌ːdɛːsis/ "leaping") is the random motion of particles suspended in a fluid (a liquid or a gas) resulting from their collision with the fast-moving atoms or molecules in the gas or liquid.
This transport phenomenon is named after the botanist Robert Brown. In 1827, while looking through a microscope at particles trapped in cavities inside pollen grains in water, he noted that the particles moved through the water; but he was not able to determine the mechanisms that caused this motion. Atoms and molecules had long been theorized as the constituents of matter, and Albert Einstein published a paper in 1905 that explained in precise detail how the motion that Brown had observed was a result of the pollen being moved by individual water molecules, making one of his first big contributions to science. This explanation of Brownian motion served as convincing evidence that atoms and molecules exist, and was further verified experimentally by Jean Perrin in 1908. Perrin was awarded the Nobel Prize in Physics in 1926 "for his work on the discontinuous structure of matter" (Einstein had received the award five years earlier "for his services to theoretical physics" with specific citation of different research). The direction of the force of atomic bombardment is constantly changing, and at different times the particle is hit more on one side than another, leading to the seemingly random nature of the motion.
Brownian motion is among the simplest of the continuous-time stochastic (or probabilistic) processes, and it is a limit of both simpler and more complicated stochastic processes (see random walk and Donsker's theorem). This universality is closely related to the universality of the normal distribution. In both cases, it is often mathematical convenience, rather than the accuracy of the models, that motivates their use.

View File

@ -0,0 +1,127 @@
---
title: Clustering Algorithms
---
# Clustering Algorithms
Clustering is the process of dividing data into separated groups (clusters), while ensuring that:
- Each cluster contains similar objects
- Objects which do not belong to the same clusters are not similar
Clustering algorithms help find structure in a collection of unlabelled data and fall in the category of unsupervised learning.
The difficulty lies in the definition of a similarity measure that can separate the data in the way you want. For instance, a group of people can be separated by gender, hair color, weight, race, etc.
Clustering algorithms have the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis. It's used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
Some applications of clustering algorithms include:
* Grouping consumers according to their purchase patterns
* Grouping photos of animals of the same kind together
* Classification of different types of documents
## Types of Clustering Algorithms:
1. Connectivity-based clustering (hierarchical clustering)
2. Centroid-based or point assignment clustering (k-means clustering)
3. Distribution-based clustering
4. Density-based clustering
Some examples of clustering algorithms are:
1. Agglomerative clustering
2. K-means clustering
3. K-medoids clustering
4. Partition Clustering
### Hierarchical Clustering
There are methods for clustering that only use similarities of instances, without any other requirement on the data; the aim is to find groups such that instances in a group are more similar to each other than instances in different groups. This is the approach taken by hierarchical clustering.
This needs the use of a similarity, or equivalently a distance, measure defined between instances. Generally Euclidean distance is used, where one has to make sure that all attributes have the same scale.
There are two main types of Hierarchical clustering which are used:
1. Agglomerative Clustering - This algorithm starts with a set of individual clusters and a proximity matrix. Here, the individual clusters are single points, and the matrix holds the distance between each point and every other point. The algorithm finds the closest pair of clusters, combines them into one cluster, then updates the proximity matrix with the new cluster and removes the two combined clusters. This step is repeated until a single cluster is left. The most important part of this algorithm is the proximity matrix and how it is updated.
2. Divisive Clustering - This algorithm can be seen as the opposite of agglomerative clustering in how it approaches the problem. It starts with a single cluster and then divides it into multiple clusters. It maintains a similarity matrix between points, similarity here being how close the clusters are to each other. The algorithm tries to divide a cluster into two based on how dissimilar a cluster or a point is from the rest. This continues until there are multiple individual clusters.
### Point Assignment
This method maintains a set of clusters and assigns each point to the nearest cluster.
## Specific Clustering Algorithms
### K-Means Clustering
The K-means algorithm is a popular clustering algorithm since it is relatively simple and fast compared to other clustering algorithms. The algorithm is defined as follows:
1. Decide input parameter k (number of clusters)
2. Pick k random data points to use as centroids
3. Compute distances for all data points to each k centroids, and assign each data point to a cluster containing the closest centroid
4. Once all data points have been classified, compute the midpoint of all points for each cluster and assign as new centroid
5. Repeat steps 3 and 4 until the centroids converge upon certain k points.
Since we only need to calculate k × n distances (rather than the n(n-1) distances of the kNN algorithm), this algorithm is quite scalable.
Here's a clustering example in Python that uses the [Iris Dataset](https://www.kaggle.com/uciml/iris)
```python
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn import cluster

iris = pd.read_csv('Iris.csv')
del iris['Id']
del iris['SepalLengthCm']
del iris['SepalWidthCm']

# k is the input parameter set to the number of species
k = len(iris['Species'].unique())

for i in iris['Species'].unique():
    # select only the applicable rows
    ds = iris[iris['Species'] == i]
    # plot the points
    plt.plot(ds[['PetalLengthCm']], ds[['PetalWidthCm']], 'o')
plt.title("Original Iris by Species")
plt.show()

del iris['Species']
kmeans = cluster.KMeans(n_clusters=k, n_init=10, max_iter=300, algorithm='auto')
kmeans.fit(iris)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

for i in range(k):
    # select only data observations from the applicable cluster
    ds = iris.iloc[np.where(labels == i)[0]]
    # plot the data observations
    plt.plot(ds['PetalLengthCm'], ds['PetalWidthCm'], 'o')
    # plot the centroids
    lines = plt.plot(centroids[i, 0], centroids[i, 1], 'kx')
    # make the centroid x's bigger
    plt.setp(lines, ms=15.0)
    plt.setp(lines, mew=2.0)
plt.title("Iris by K-Means Clustering")
plt.show()
```
Since the data points usually belong to a high-dimensional space, the similarity measure is often defined as a distance between two vectors (Euclidean, Manhattan, cosine, Mahalanobis, ...).
### Mixture Density
We can write *mixture density* as:
![mixture density](https://latex.codecogs.com/gif.latex?p%28x%29%20%3D%20%5Csum_%7Bi%3D1%7D%5E%7Bk%7Dp%28x%7CG_%7Bi%7D%29p%28G_%7Bi%7D%29)
where G_i are the mixture components, also called groups or clusters. p(x|G_i) are the component densities and p(G_i) are the mixture proportions. The number of components, k, is a hyperparameter and must be specified beforehand.
### Expectation-Maximization (EM)
This approach is probabilistic: we look for the component density parameters that maximize the likelihood of the sample.
The EM algorithm is an efficient iterative procedure to compute the Maximum Likelihood (ML) estimate in the presence of missing or hidden data. In ML estimation, we wish to estimate the model parameter(s) for which the observed data are the most likely.
Each iteration of the EM algorithm consists of two processes: The E-step, and the M-step.
1. In the expectation, or E-step, the missing data are estimated given the observed data and current estimate of the model parameters. This is achieved using the conditional expectation, explaining the choice of terminology.
2. In the M-step, the likelihood function is maximized under the assumption that the missing data are known. The estimates of the missing data from the E-step are used in lieu of the actual missing data.
Convergence is assured since the algorithm is guaranteed to increase the likelihood at each iteration.
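As a hedged sketch, scikit-learn's GaussianMixture fits a mixture density with EM internally; the synthetic one-dimensional data and component count below are assumptions for illustration:
```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data drawn from two made-up components.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1.0, 300),
                    rng.normal(3, 0.5, 200)]).reshape(-1, 1)

# k (n_components) is a hyperparameter and must be specified beforehand;
# fit() alternates E-steps and M-steps until convergence.
gm = GaussianMixture(n_components=2, max_iter=100).fit(X)

print(gm.means_.ravel())  # estimated component means
print(gm.weights_)        # estimated mixture proportions p(G_i)
```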
## More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->
* [Wikipedia Cluster Analysis article](https://en.wikipedia.org/wiki/Cluster_analysis)
* [Introduction to Clustering and related algorithms](https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/)
* [Clustering Algorithms-Stanford University Slides](https://web.stanford.edu/class/cs345a/slides/12-clustering.pdf)
* [Clustering Algorithms: From Start To State Of The Art](https://www.toptal.com/machine-learning/clustering-algorithms)
* [Cluster Analysis: Basic Concepts and Algorithms](https://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf)
* [K-means Clustering](https://www.datascience.com/blog/k-means-clustering)
* [Expectation-Maximization Algorithm](https://www.cs.utah.edu/~piyush/teaching/EM_algorithm.pdf)
* [Using K-Means Clustering with Python](https://code.likeagirl.io/finding-dominant-colour-on-an-image-b4e075f98097)

View File

@ -0,0 +1,37 @@
---
title: Dataset Splitting
---
## Dataset Splitting
Splitting the data into Training, Cross Validation, and Test sets is a common best practice.
This allows you to tune various parameters of the algorithm without making judgements that specifically conform to training data.
### Motivation
Dataset Splitting emerges as a necessity to eliminate bias to training data in ML algorithms.
Modifying parameters of an ML algorithm to best fit the training data commonly results in an overfit algorithm that performs poorly on actual test data.
For this reason, we split the dataset into multiple, discrete subsets on which we train different parameters.
#### The Training Set
The Training set is used to compute the actual model your algorithm will use when exposed to new data.
This dataset is typically 60%-80% of your entire available data (depending on whether or not you use a Cross Validation set).
#### The Cross Validation Set
Cross Validation sets are for model selection (typically ~20% of your data).
Use this dataset to try different parameters for the algorithm as trained on the Training set.
For example, you can evaluate different model parameters (polynomial degree or lambda, the regularization parameter) on the Cross Validation set to see which may be most accurate.
#### The Test Set
The Test set is the final dataset you touch (typically ~20% of your data).
It is the source of truth.
Your accuracy in predicting the test set is the accuracy of your ML algorithm.
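A minimal sketch of a 60/20/20 split using scikit-learn follows; since `train_test_split` does a two-way split, two calls are chained, and the names and ratios are illustrative:
```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # stand-in features
y = np.arange(100)                 # stand-in labels

# Carve out the 20% test set first, then split the remainder 75/25
# so the final ratios are 60% train / 20% cross validation / 20% test.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_cv, y_train, y_cv = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_cv), len(X_test))  # 60 20 20
```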
#### More Information:
- [AWS ML Doc](http://docs.aws.amazon.com/machine-learning/latest/dg/splitting-the-data-into-training-and-evaluation-data.html)
- [A good stackoverflow post](https://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio)
- [Academic Paper](https://www.mff.cuni.cz/veda/konference/wds/proc/pdf10/WDS10_105_i1_Reitermanova.pdf)

View File

@ -0,0 +1,29 @@
---
title: Gradient Descent
---
## Gradient Descent
Gradient descent is an optimization algorithm for finding the minimum of a function. In deep learning this optimization algorithm is very useful when the parameters cannot be calculated analytically.
![Gradient Descent](https://upload.wikimedia.org/wikipedia/commons/6/68/Gradient_descent.jpg)
What you want to do is repeatedly update the value of the parameters theta until you minimize the cost function J(θ), bringing it as close to 0 as possible.
### Learning Rate
The size of each step is called the learning rate. A larger learning rate makes iterating faster, but it might overshoot the global minimum, the value we are looking for. We could prevent this overshooting by decreasing the learning rate, but beware: the smaller you make the learning rate, the more computationally intensive the process gets. This can either prolong the computation unnecessarily, or you may not arrive at the global minimum at all.
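A minimal NumPy sketch of these update steps on a least-squares cost may help; the linear model, data, learning rate, and iteration count are assumptions for illustration:
```python
import numpy as np

# Made-up data roughly following y = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.5, 50)

theta0, theta1 = 0.0, 0.0
alpha = 0.01  # learning rate: too large overshoots, too small is slow
for _ in range(2000):
    pred = theta0 + theta1 * x
    # Gradients of the cost J = (1/2m) * sum((pred - y)^2).
    grad0 = (pred - y).mean()
    grad1 = ((pred - y) * x).mean()
    # Step downhill, scaled by the learning rate.
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)  # should approach 1 and 2
```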
### Feature Scaling
A deep learning problem may require you to use multiple features to generate a predictive model. If, for example, you are building a predictive model for house pricing, you will have to deal with features like the price itself, number of rooms, lot area, etc. These features might differ extremely in range: while the lot area might be between 0 and 2000 square feet, other features like the number of rooms might be between 1 and 9.
This is where feature scaling, also called normalization, comes in handy, to make sure that your machine learning algorithm works properly.
### Stochastic Gradient Descent
Machine learning problems usually require computations over a sample size in the millions, which can be very computationally intensive.
In stochastic gradient descent you update the parameters using the cost gradient of a single example rather than the sum of the cost gradients of all the examples. You can arrive at a set of good parameters after only a few passes through the training examples, so the learning is faster as well.
### Further Reading
* [A guide to Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/)
* [Gradient Descent For Machine Learning](https://machinelearningmastery.com/gradient-descent-for-machine-learning/)
* [Difference between Batch Gradient Descent and Stochastic Gradient Descent](https://towardsdatascience.com/difference-between-batch-gradient-descent-and-stochastic-gradient-descent-1187f1291aa1)

View File

@ -0,0 +1,52 @@
---
title: Deep Learning
---
## Deep Learning
Deep Learning refers to a technique in Machine Learning where you have many artificial neural networks stacked together in some architecture.
To the uninitiated, an artificial neuron is basically a mathematical function of some sort, and neural nets are neurons connected to each other. So in deep learning, you have lots of mathematical functions stacked on top of (or beside) each other in some architecture. Each of the mathematical functions may have its own parameters (for instance, the equation of a line `y = mx + c` has 2 parameters, `m` and `c`) which need to be learned during training. Once learned for a given task (say, classifying cats and dogs), this stack of mathematical functions (neurons) is ready to do its work of classifying images of cats and dogs.
![Cat or a dog?](https://image.slidesharecdn.com/deeplearningfromanoviceperspective-150811155203-lva1-app6891/95/deep-learning-from-a-novice-perspective-3-638.jpg?cb=1439308391)
### Why is it a big deal?
Coming up with a set of rules manually for some tasks can be very tricky (though theoretically possible). For instance, if you try to write a manual set of rules to classify an image (basically a bunch of pixel values) as either a cat or a dog, you'll see why it is tricky. Add to that the fact that dogs and cats come in a variety of shapes, sizes, and colors, and, not to mention, the images can have different backgrounds. You can quickly understand why coding such a simple problem can be problematic.
Deep Learning helps tackle this problem of figuring out the set of rules that can classify an image as that of a cat or a dog, automatically! All it needs is a bunch of images that are already correctly labeled as cat or dog, and it will learn the required set of rules. Magic!
It turns out that there are a lot of problems out there which are not image-related (like voice recognition) where finding the set of rules is very tricky. Deep Learning can help with that, provided there is a lot of labelled data already present.
### How to train a deep learning model?
Training a deep neural network (a.k.a. our stack of mathematical functions arranged in some architecture) is basically an art with a lot of hyper-parameters. Hyper-parameters are things such as which mathematical functions to use, or which architecture to use, that you need to figure out manually until your network is able to successfully classify cats and dogs. In order to train, you need lots of labelled data (in this case, lots of images already labeled as cat or dog) and lots of computing power and patience!
In order to train, you provide a neural network with a loss function, which measures how different the results of the neural network are from the correct answers. Depending on the value of the loss function, you change the parameters of the mathematical functions in such a way that the next time your network tries to classify the same image, the value of the loss function is lower. You keep finding the value of the loss function and updating the parameters again and again across the entire training data set until the loss function values are within reasonable margins. Your massive neural network is now ready!
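A hedged sketch of that training loop, using PyTorch purely as an illustration (the tiny network, random stand-in data, and hyper-parameters are all assumptions):
```python
import torch
import torch.nn as nn

# Stand-in data: 100 samples, 4 features, binary labels.
X = torch.randn(100, 4)
y = torch.randint(0, 2, (100,))

# A small stack of mathematical functions (layers) in some architecture.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()  # how different the outputs are from the correct answers
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(20):
    opt.zero_grad()
    loss = loss_fn(model(X), y)  # evaluate the loss on the training data
    loss.backward()              # compute gradients of the loss w.r.t. the parameters
    opt.step()                   # update the parameters so the next loss is lower
```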
### Some standard Neural Network architectures
Over the past few years, some of the models (i.e. the combination of the mathematical functions, the architecture, and the parameters) have become standard for certain tasks. For instance, a model called Resnet-152 won the Imagenet Challenge in 2015 which involves trying to classify images into 1000 categories (cats and dogs included). If you are planning to do similar tasks, then the recommendation is to start with such standard models and tweak them if they don't meet your requirements.
A Resnet-152 model looks like this (don't worry if you don't understand it; it's just a bunch of mathematical functions stacked on top of each other in some interesting fashion):
![Resnet-152 Model](https://adeshpande3.github.io/assets/ResNet.gif)
Google has its own neural network architecture that won the ImageNet challenge in 2014; it can be seen in a <a href="https://adeshpande3.github.io/assets/GoogleNet.gif">gif here in more detail</a>.
### How to implement your own?
These days there are a variety of deep learning frameworks that allow you to specify which mathematical functions you want to use, which architecture for your functions, and which loss function to use for training. Since the training of such a model is very computationally intensive, most of these frameworks generate code optimized for whatever hardware you may have. Some of the well-known frameworks are:
* <a href="https://mxnet.incubator.apache.org/">Apache MXNet</a>
* <a href="https://www.tensorflow.org/">Google's Tensorflow</a>
* <a href="http://pytorch.org//">Pytorch</a>
* <a href="https://keras.io/">Keras</a>
* <a href="https://caffe2.ai/">Caffe2</a>
* <a href="https://github.com/gluon-api/gluon-api/">Gluon</a>
* <a href="http://deeplearning.net/software/theano/">Theano</a>
### More Information:
* <a href="http://www.deeplearningbook.org">Deep Learning Textbook</a>
* <a href="https://en.wikipedia.org/wiki/Deep_learning">Deep Learning</a>
* <a href="https://github.com/freeCodeCamp/guides/blob/master/src/pages/machine-learning/neural-networks/index.md">FreeCodeCamp Guide to Neural Networks</a>
* <a href="http://image-net.org/">Imagenet</a>
* <a href="https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner's-Guide-To-Understanding-Convolutional-Neural-Networks/">A Beginner's Guide To Understanding Convolutional Neural Networks</a>
* <a href="https://www.youtube.com/playlist?list=PLjJh1vlSEYgvGod9wWiydumYl8hOXixNu">Deep Learning SIMPLIFIED - DeepLearning.TV</a>
* <a href="http://neuralnetworksanddeeplearning.com"> Neural Networks and Deep Learning</a>

View File

@ -0,0 +1,20 @@
---
title: Music Classification
---
## Music Classification
Music classification is yet another field where deep learning strategies can be applied to attain higher classification accuracies than traditional machine learning methods. Deep neural networks, which were originally used for image recognition and computer vision tasks, can be employed for music classification through the use of spectrograms. A spectrogram is simply a visual representation of the spectrum of frequencies present in the music over a period of time. In other words, a music signal, which is a resultant of frequencies, can be separated into its spectrum of frequencies, and the loudness in dB can be visually represented for each frequency. This image can be used to train a neural network that classifies such spectrograms. A great use-case is genre recognition.
### The following are examples of various spectrograms:
![Spectrogram1](http://deepsound.io/images/new_blues_00.png)
The above spectrogram is of a song from the blues genre. Frequencies are along the y-axis and time is on the x-axis. Brighter colors indicate that the sound at that frequency is loud, whereas darker colors indicate that it is soft at those particular points in time. Such an image, containing so much data, can be used to train a neural network. We generally use a mel-scaled spectrogram for genre recognition, which is based on a scale of pitches as judged by listeners, i.e., how we perceive such frequencies, to distinguish between the components of various genres.
**Fourier transforms**
An important detail to know is that such spectrograms are created with the help of a mathematical concept known as Fourier transforms. The Fourier transform decomposes a function of time into the frequencies that make it up.
#### More information
If you are using Python, there are many libraries for signal processing. [Librosa](https://librosa.github.io/librosa/) is a popular one; another is [scipy](https://scipy.org/), which can also be used for other scientific purposes. Mel spectrograms can be created by leveraging these libraries, as sketched below.
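As a hedged sketch (the file name is a placeholder), computing and plotting a mel-scaled spectrogram with librosa might look like this:
```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load('song.wav')  # 'song.wav' is a placeholder path

# Mel-scaled spectrogram, converted to decibels for visualization.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_dB = librosa.power_to_db(S, ref=np.max)

librosa.display.specshow(S_dB, sr=sr, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel spectrogram')
plt.show()
```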
##### Please take a look at the following links for more info on the above topic:
- [Finding Genre](https://hackernoon.com/finding-the-genre-of-a-song-with-deep-learning-da8f59a61194)
- [Deepsound](http://deepsound.io/music_genre_recognition.html)

View File

@ -0,0 +1,15 @@
---
title: Optimization Algorithms for Gradient Descent
---
## Optimization Algorithms for Gradient Descent
This is a stub. <a href='https://github.com/freecodecamp/guides/tree/master/src/pages/machine-learning/deep-learning/optimization-algorithms-for-gradient-descent/index.md' target='_blank' rel='nofollow'>Help our community expand it</a>.
<a href='https://github.com/freecodecamp/guides/blob/master/README.md' target='_blank' rel='nofollow'>This quick style guide will help ensure your pull request gets accepted</a>.
<!-- The article goes here, in GitHub-flavored Markdown. Feel free to add YouTube videos, images, and CodePen/JSBin embeds -->
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->

View File

@ -0,0 +1,33 @@
---
title: Dimension Reduction
---
## Dimension Reduction
Dealing with a lot of dimensions can be painful for machine learning algorithms. High dimensionality will increase the computational complexity, increase the risk of overfitting (as your algorithm has more degrees of freedom) and the sparsity of the data will grow. Hence, dimensionality reduction will project the data in a space with less dimension to limit these phenomena.
## Why is dimensionality reduction useful?
- Projection into two dimensions is often used to facilitate the visualization of high dimensional data sets.
- When the dimensions can be given a meaningful interpretation, projection along that dimension can be used to explain certain behaviors.
- In the supervised learning case, dimensionality reduction can be used to reduce the dimension of the features, potentially leading to better performance for the learning algorithm.
## Dimensionality Reduction Techniques
- Linear Discriminant Analysis [LDA](http://scikit-learn.org/stable/modules/lda_qda.html)
- Principal Components Analysis [PCA](http://setosa.io/ev/principal-component-analysis/)
- Kernel PCA
- Graph-based kernel PCA
- t-Distributed Stochastic Neighbor Embedding [t-SNE](https://lvdmaaten.github.io/tsne/)
- [Auto Encoders](https://medium.com/towards-data-science/reducing-dimensionality-from-dimensionality-reduction-techniques-f658aec24dfe)
- Generalized discriminant analysis (GDA)
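As a minimal sketch of one of these techniques, here is PCA via scikit-learn; the random data with an injected redundant feature is a stand-in:
```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 200 samples with 10 features, one made redundant on purpose.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Project onto the 2 directions of greatest variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```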
#### More Information:
* [A step by step tutorial to Principal Component Analysis](https://plot.ly/ipython-notebooks/principal-component-analysis/#introduction)
* [Dimensionality Reduction Techniques](https://medium.com/towards-data-science/reducing-dimensionality-from-dimensionality-reduction-techniques-f658aec24dfe)
* [Dimensionality Reduction Techniques: Where to Begin](https://blog.treasuredata.com/blog/2016/03/25/dimensionality-reduction-techniques-where-to-begin)

View File

@ -0,0 +1,9 @@
---
title: Expectation Maximization Algorithm
---
#### Suggested Reading:
<!-- Please add any articles you think might be helpful to read before writing the article -->
- [Expectation–maximization algorithm (Wikipedia)](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
- [What is the expectation maximization algorithm? (Nature Biotechnology)](http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html)
- [The Expectation Maximization Algorithm (Dellaert)](http://www.cc.gatech.edu/~dellaert/em-paper.pdf)

View File

@ -0,0 +1,85 @@
---
title: Feature Engineering
---
## Feature Engineering
Machine Learning works best with well-formed data. Feature engineering describes certain techniques to make sure we're working with the best possible representation of the data we collected.
## Why is feature engineering useful?
* The quantity and quality of features impacts the predictive power of the model: more high-quality features result in a better model.
* Build better models by taking the data you have and augmenting it with additional subject-relevant information obtained elsewhere.
* New features can lead to 'breakthroughs' in the model's ability to predict a robust outcome.
## Caveats to feature engineering
* Creating new features from known features can lead to multicollinearity, a situation where two features are linearly related. This amounts to 'double dipping' in a model and can lead to overfitting.
* More is not always better. Adding features with poor predictive capabilities can increase computational time without adding benefits to the model.
## Examples of feature engineering:
* If you have a 'date' feature, try subsetting it to 'day of the week', 'week of the year', or 'month of the year'. Similarly, create an AM/PM feature from 'time of day'.
* Perform a data reduction like PCA then add the vectors from the PCA to the data as new features.
* Produce new features by numerically transforming current features. Examples would be log transforming data or encoding categorical features as numbers (convert low/medium/high to 1/2/3).
* Use census data to create new features (such as average income), assuming your data set contains location information (city, state, county, etc.).
Following are two techniques of feature engineering: scaling and selection.
### Feature Scaling
Let's assume your data contains the weight and height of people. The raw numbers of these two features have a high difference (e.g. 80 kg and 180 cm, or 175 lbs vs 5.9 ft), which could influence the outcome of certain Machine Learning algorithm. This is especially the case for algorithms that use [distance functions](https://en.wikipedia.org/wiki/Euclidean_distance).
To fix this issue, we map the raw numbers into a 0 to 1 range. We can achieve this using the formula: `(x - xMin) / (xMax - xMin)`.
Using this formula, we need to pay special attention to outliers, as these can affect the outcome drastically by pushing up xMax and pushing down xMin. That's why outliers are often eliminated prior to scaling.
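A short NumPy sketch of that formula, with made-up weight and height values:
```python
import numpy as np

def min_max_scale(x):
    # (x - xMin) / (xMax - xMin): maps values into the 0 to 1 range.
    return (x - x.min()) / (x.max() - x.min())

weights_kg = np.array([55.0, 80.0, 62.0, 95.0, 70.0])
heights_cm = np.array([160.0, 180.0, 165.0, 190.0, 175.0])

print(min_max_scale(weights_kg))  # both features now share the same 0-1 scale
print(min_max_scale(heights_cm))
```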
### Feature Selection
It's all about identifying the subset of features that are responsible for the trends we observe in our data.
Why should we care? The [Curse of Dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) is a big enemy in the age of Big Data. We can't use all of our tens to hundreds of features: this would not only send the dimensionality of our data through the roof (2^n, where n is the number of features), but many features also just don't make sense for a specific use case. Imagine wanting to predict tomorrow's weather: the weather trend of the last few days is likely far more important in this scenario than the number of babies born in the last few days, so you could easily just eliminate the babies feature.
But forget babies for now, let's dive into more detail.
#### Filtering & Wrapping
Here we describe two general approaches. Filtering methods act independently of your chosen learning algorithm, while wrapping methods incorporate your learner.
Filter methods select the subset of features before injecting the data into your ML algorithm. They use, e.g., the correlation with the to-be-predicted variable to identify which subset of features to choose. These methods are relatively fast to compute, but they don't take advantage of the [bias of the learner](https://en.wikipedia.org/wiki/Inductive_bias) because filtering happens independently of your chosen ML model.
Wrapping search algorithms do take advantage of the learning bias, as they incorporate your chosen ML model. These methods function by removing the feature that has the lowest change in score when removed and repeating this process until the score changes significantly. This means running your learning algorithm over and over again, which can lead to significant computation times. These methods also have the danger of overfitting, as you're basically optimizing the feature set based on your chosen ML model.
#### Relevance
Another way of selecting features is using the [BOC (Bayes Optimal Classifier)](https://scholar.google.de/scholar?q=Bayes+Optimal+Classifier&hl=en&as_sdt=0&as_vis=1&oi=scholart&sa=X&ved=0ahUKEwiO16X0tIbXAhXiKsAKHbGrBzoQgQMIJjAA). The rules of thumb here are:
* a feature is strongly relevant if removing it degrades the BOC
* a feature is weakly relevant if it is not strongly relevant & adding it in combination with other features improves the BOC
* otherwise a feature is irrelevant
## Indicator Variables
Can't the algorithm just learn what matters on its own? Well, not always. It depends on the amount of data you have and the strength of competing signals. You can help your algorithm "focus" on what's important by highlighting it beforehand.
* Indicator variable from thresholds: Let's say you're studying alcohol preferences by U.S. consumers and your dataset has an age feature. You can create an indicator variable for age >= 21 to distinguish subjects who were over the legal drinking age.
* Indicator variable from multiple features: You're predicting real-estate prices and you have the features n_bedrooms and n_bathrooms. If houses with 2 beds and 2 baths command a premium as rental properties, you can create an indicator variable to flag them.
* Indicator variable for special events: You're modeling weekly sales for an e-commerce site. You can create two indicator variables for the weeks of Black Friday and Christmas.
* Indicator variable for groups of classes: You're analyzing website conversions and your dataset has the categorical feature traffic_source. You could create an indicator variable for paid_traffic by flagging observations with traffic source values of "Facebook Ads" or "Google Adwords".
## Interaction Features
The next type of feature engineering involves highlighting interactions between two or more features.
Have you ever heard the phrase, "the whole is greater than the sum of its parts"? Well, some features can be combined to provide more information than they would individually.
Specifically, look for opportunities to take the sum, difference, product, or quotient of multiple features (a short pandas sketch follows the list below).
*Note: We don't recommend using an automated loop to create interactions for all your features. This leads to "feature explosion."*
* Sum of two features: Let's say you wish to predict revenue based on preliminary sales data. You have the features sales_blue_pens and sales_black_pens. You could sum those features if you only care about overall sales_pens.
* Difference between two features: You have the features house_built_date and house_purchase_date. You can take their difference to create the feature house_age_at_purchase.
* Product of two features: You're running a pricing test, and you have the feature price and an indicator variable conversion. You can take their product to create the feature earnings.
* Quotient of two features: You have a dataset of marketing campaigns with the features n_clicks and n_impressions. You can divide clicks by impressions to create click_through_rate, allowing you to compare across campaigns of different volume.
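Here is the hedged pandas sketch of a few of the transformations above; every column name is invented for the example:
```python
import pandas as pd

df = pd.DataFrame({
    'sale_date': pd.to_datetime(['2018-01-15', '2018-06-03']),
    'age': [19, 34],
    'sales_blue_pens': [120, 80],
    'sales_black_pens': [200, 150],
    'n_clicks': [30, 45],
    'n_impressions': [1000, 900],
})

# Subset a date feature into day-of-week and month.
df['day_of_week'] = df['sale_date'].dt.dayofweek
df['month'] = df['sale_date'].dt.month

# Indicator variable from a threshold.
df['over_21'] = (df['age'] >= 21).astype(int)

# Sum of two features.
df['sales_pens'] = df['sales_blue_pens'] + df['sales_black_pens']

# Quotient of two features.
df['click_through_rate'] = df['n_clicks'] / df['n_impressions']
```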
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->
* [Paper exploring "Feature Engineering for Text Classification"](https://pdfs.semanticscholar.org/6e51/8946c59c8c5d005054af319783b3eba128a9.pdf)
* [Article "Discover Feature Engineering, How to Engineer Features and How to Get Good at It"](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)
* [A comprehensive guide to data analysis](https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/)
* [Data transformations](https://onlinecourses.science.psu.edu/stat501/node/318)
* [Feature engineering in data science](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/create-features)

View File

@ -0,0 +1,12 @@
---
title: Gaussian Process
---
## Gaussian Process
In probability theory and statistics, a Gaussian process is a particular kind of statistical model where observations occur in a continuous domain, e.g. time or space. In a Gaussian process, every point in some continuous input space is associated with a normally distributed random variable. Moreover, every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed. The distribution of a Gaussian process is the joint distribution of all those (infinitely many) random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.
Viewed as a machine-learning algorithm, a Gaussian process uses lazy learning and a measure of the similarity between points (the kernel function) to predict the value for an unseen point from training data. The prediction is not just an estimate for that point, but also has uncertainty information—it is a one-dimensional Gaussian distribution (which is the marginal distribution at that point).
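A hedged scikit-learn sketch of that idea; the RBF kernel choice and the toy sine data are assumptions:
```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy training data sampled from a smooth function.
X_train = np.linspace(0, 5, 8).reshape(-1, 1)
y_train = np.sin(X_train).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gp.fit(X_train, y_train)

# The prediction at an unseen point comes with uncertainty information.
X_new = np.array([[2.5]])
mean, std = gp.predict(X_new, return_std=True)
print(mean, std)  # estimate and the spread of its Gaussian marginal
```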
#### More Information:
- [Gaussian Processes for Dummies](http://katbailey.github.io/post/gaussian-processes-for-dummies/)

View File

@ -0,0 +1,43 @@
---
title: Glossary
---
## Glossary
A quick one or two sentences describing common terms. See individual pages for
more details.
- **A/B testing** - A statistical way of comparing two (or more) techniques, typically an incumbent against a new rival. A/B testing aims to determine not only which technique performs better but also to understand whether the difference is statistically significant. A/B testing usually considers only two techniques using one measurement, but it can be applied to any finite number of techniques and measures.
- **Machine Learning** - Intersection of statistics and computer science in
order to teach computers to perform tasks without explicitly being programmed.
- **Deep Learning** - An umbrella term for machine learning methods based on learning data representations as opposed to algorithms based on fulfilling a given task. It includes architectures such as deep neural networks, deep belief networks and recurrent neural networks.
- **Neuroevolution** - An umbrella term for machine learning methods based on generating neural networks through weight, bias, and architecture through random mutations of the network. The most common forms of neuroevolution are Neuroevolution of Augmenting Topologies([NEAT](https://en.wikipedia.org/wiki/Neuroevolution_of_augmenting_topologies)) and Interactively Constrained Neuro-Evolution ([ICONE](http://ikw.uni-osnabrueck.de/~neurokybernetik/media/pdf/2012-1.pdf)).
- **Statistical Learning** - the use of machine learning with the goal of
statistical inference, whereby you make conclusions of the data rather than
focus on prediction accuracy
- **Supervised Learning** - Using historical data to predict the future. Example: Using historical data of prices at which houses were sold to predict the price in which your house will be sold. Regression and Classification come under supervised learning.
- **Unsupervised Learning** - Finding patterns in unlabelled data. Example: Grouping customers by purchasing behaviour. Clustering comes under unsupervised learning.
- **Reinforcement learning** - Using a simulated or real environment in which a machine learning algorithm is given input and sparse rewards to build a model to predict actions. Reinforcement learning has been used [to train virtual robots to balance themselves](https://blog.openai.com/competitive-self-play/) and [to beat games designed for humans](https://blog.openai.com/openai-baselines-dqn/).
- **Regression** - A machine learning technique used to predict continuous values. Linear Regression is one of the most popular regression algorithms.
- **Classification** - A machine learning technique used to predict discrete values. Logistic Regression is one of the most popular classification algorithms.
- **Association Rule learning** - A rule-based machine learning method for discovering interesting relations between variables in large databases.
```
f: x -> y
Here 'f' is a function that takes 'x' as input and produces 'y' as output.
If the output value 'y' is a real number / continuous value then the function
is a regression technique.
If the output value 'y' is a discrete / categorical value then the function is a classification technique.
```
- **Clustering** - Grouping of unlabelled data. Identifying patterns using statistics.
- **Dimensionality Reduction** - Reducing the number of random variables in the data to get more accurate predictions.
- **Random forests** - Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees.
- **Bayesian networks** - A Bayesian network is a probabilistic graphical model which relates a set of random variables with their conditional independencies via a directed acyclic graph (DAG). Put simply, it relates random variables with their conditional independencies for event prediction. It plays a crucial role in clues-to-cause relations.
- **Bias-variance tradeoff** - Bias helps us determine the average difference between predicted values and actual values, whereas variance helps us determine how much predictions on the same dataset differ from each other. If bias increases, the model has a high error in its predictions, which makes the model underperform. A high variance makes the model overfit, as the model trains itself so closely on the given dataset that it performs poorly on data it hasn't seen yet. Finding a balance between bias and variance is the key to making a good model.
### More Information:
- [Glossary of Terms - Robotics](http://robotics.stanford.edu/~ronnyk/glossary.html)
- [Glossary of Terms - Machine Learning, Statistics and Data Science](https://www.analyticsvidhya.com/glossary-of-common-statistics-and-machine-learning-terms/)

View File

@ -0,0 +1,60 @@
---
title: Machine Learning
---
## Machine Learning
Arthur Samuel, a pioneer in artificial intelligence, defined Machine Learning in 1959 as "the field of study that gives computers the ability to learn without being explicitly programmed."
A more formal definition of Machine Learning is provided by Prof Tom Mitchell of CMU:
> "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."
Consider the example of a Machine Learning algorithm that plays chess. In this example, `E` refers to the experience of playing chess, `T` is the task of playing chess, and `P` denotes the probability that the program will win the next game of chess.
Machine learning is similar to how a human being learns. For example, if a human wants to learn how to play poker, they will first learn the rules, then gain experience by playing the game. For a machine, this experience is a huge data set, using which it can make intelligent decisions regarding the proposed problem.
In general, machine learning problems can be classified into supervised learning, and unsupervised learning. In supervised learning, you have the input and the labeled output, and you suspect that a relationship exists between the input and the labeled output. When you know neither what the labeled output is nor if a relationship exists, unsupervised learning will help you find structure in your data if there is one.
We've covered two main categories of machine learning, but there are four broad categories of machine learning:
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised Learning
4. Reinforcement Learning
### Supervised learning
Supervised learning is the machine learning task of inferring a function from supervised training data. The training
data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal). Further, supervised learning can be divided into two paradigms: classification and regression.
#### Basic flowchart/steps for supervised learning
1. Collect training set.
2. Divide training set into input object (features) and output object (classes or value)
3. Decide what you will be applying: regression or classification
4. Decide which algorithm you will be applying: SVM, deep net, etc.
5. Run the algorithm on the training set and use the model for predictions (a minimal sketch follows this list)
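Here is the minimal sketch mentioned above, walking through those steps with scikit-learn; the Iris data and the SVM choice are illustrative assumptions:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 1-2. Collect the training set and split features from classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 3-4. This is a classification task; choose an algorithm (here, an SVM).
clf = SVC(kernel='rbf')

# 5. Run the algorithm on the training set and use the model for predictions.
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # held-out accuracy
```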
#### Courses:
1. <a href='https://www.udacity.com/course/intro-to-machine-learning--ud120?autoenroll=true' target='_blank' rel='nofollow'>Intro to Machine Learning</a>
2. <a href='https://www.coursera.org/learn/machine-learning' target='_blank' rel='nofollow'>Machine Learning - Taught by: Andrew Ng</a>
3. <a href='https://www.udemy.com/data-science-and-machine-learning-with-python-hands-on/' target='_blank' rel='nofollow'>Data Science and Machine Learning with Python - Hands On!</a>
4. <a href='http://ciml.info/' target='_blank' rel='nofollow'>Machine Learning</a>
5. <a href='https://www.edx.org/course/the-analytics-edge' target='_blank' rel='nofollow'>The Analytics Edge - Taught by: MIT</a>
#### Video Resources:
1. <a href="https://www.youtube.com/channel/UCWN3xxRkmTPmbKwht9FuE5A" target="_blank">Siraj Raval's Youtube channel</a>
2. <a href="https://www.youtube.com/channel/UCfzlCWGWYyIQ0aLC5w48gBQ" target="_blank">Sentdex's Youtube channel</a>
#### More Information:
1. <a href='https://en.wikipedia.org/wiki/Machine_learning' target='_blank' rel='nofollow'>Machine Learning on Wikipedia</a>
2. <a href='https://www.youtube.com/watch?v=83uAOzhzs-U' target='_blank' rel='nofollow'>Machine Learning Demystified:Youtube</a>
3. If you want a brief introduction of machine learning, and you prefer videos, try this <a href='https://youtu.be/cKxRvEZd3Mw' target='_blank' rel='nofollow'>machine learning introduction video</a>
4. If you want to know how to proceed with learning machine learning, take a look at this <a href='https://youtu.be/nKW8Ndu7Mjw' target='_blank' rel='nofollow'> video</a>
## Lab
<a href="https://github.com/Microsoft/computerscience/blob/master/Labs/AI%20and%20Machine%20Learning/Azure%20Machine%20Learning/Azure%20Machine%20Learning%20(Node).md">Building Smart Apps with Azure Machine Learning Studio</a>

View File

@ -0,0 +1,99 @@
---
title: Latent Dirichlet Allocation
---
In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics. LDA is an example of a topic model.
Suppose you have the following set of sentences:
1. I ate a banana and spinach smoothie for breakfast.
2. I like to eat broccoli and bananas.
3. Chinchillas and kittens are cute.
4. My sister adopted a kitten yesterday.
5. Look at this cute hamster munching on a piece of broccoli.
Latent Dirichlet allocation is a way of automatically discovering topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like:
- Sentences 1 and 2: 100% Topic A
- Sentences 3 and 4: 100% Topic B
- Sentence 5: 60% Topic A, 40% Topic B
- Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, ... (at which point, you could interpret topic A to be about food)
- Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, ... (at which point, you could interpret topic B to be about cute animals)
The question, of course, is: how does LDA perform this discovery?
### LDA Model
In more detail, LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion: when writing each document, you
1. Decide on the number of words N the document will have (say, according to a Poisson distribution).
2. Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have the two food and cute animal topics above, you might choose the document to consist of 1/3 food and 2/3 cute animals.
3. Generate each word in the document by:
   - First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
   - Then using the topic to generate the word itself (according to the topic's multinomial distribution). For instance, the food topic might output the word "broccoli" with 30% probability, "bananas" with 15% probability, and so on.
Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.
### Example
Let's make an example. According to the above process, when generating some particular document D, you might:
1. Decide that D will be 1/2 about food and 1/2 about cute animals.
2. Pick 5 to be the number of words in D.
3. Pick the first word to come from the food topic, which then gives you the word "broccoli".
4. Pick the second word to come from the cute animals topic, which gives you "panda".
5. Pick the third word to come from the cute animals topic, giving you "adorable".
6. Pick the fourth word to come from the food topic, giving you "cherries".
7. Pick the fifth word to come from the food topic, giving you "eating".
So the document generated under the LDA model will be "broccoli panda adorable cherries eating" (note that LDA is a bag-of-words model).
### Learning
So now suppose you have a set of documents. You've chosen some fixed number of K topics to discover, and want to use LDA to learn the topic representation of each document and the words associated to each topic. How do you do this? One way (known as collapsed Gibbs sampling) is the following:
1. Go through each document, and randomly assign each word in the document to one of the K topics.
2. Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).
3. So to improve on them, for each document d:
   - Go through each word w in d:
     - For each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w. Reassign w a new topic, where you choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word's topic with this probability). (Also, I'm glossing over a couple of things here, such as the use of priors/pseudocounts in these probabilities.)
     - In other words, in this step, we're assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
4. After repeating the previous step a large number of times, you'll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).
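As a hedged sketch of topic discovery on the five sentences above, scikit-learn's LatentDirichletAllocation can be used (note it fits LDA with variational inference rather than the collapsed Gibbs sampler described here, and on such a tiny corpus the output is illustrative only):
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "I ate a banana and spinach smoothie for breakfast.",
    "I like to eat broccoli and bananas.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

# Bag-of-words counts, then LDA with K = 2 topics.
vec = CountVectorizer(stop_words='english')
counts = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

print(lda.transform(counts))  # per-document topic mixtures
print(lda.components_)        # per-topic word weights
```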
### Layman's Explanation
In case the discussion above was a little eye-glazing, here's another way to look at LDA in a different domain.
Suppose you've just moved to a new city. You're a hipster and an anime fan, so you want to know where the other hipsters and anime geeks tend to hang out. Of course, as a hipster, you know you can't just ask, so what do you do?
Here's the scenario: you scope out a bunch of different establishments (documents) across town, making note of the people (words) hanging out in each of them (e.g., Alice hangs out at the mall and at the park, Bob hangs out at the movie theater and the park, and so on). Crucially, you don't know the typical interest groups (topics) of each establishment, nor do you know the different interests of each person.
So you pick some number K of categories to learn (i.e., you want to learn the K most important kinds of categories people fall into), and start by making a guess as to why you see people where you do. For example, you initially guess that Alice is at the mall because people with interests in X like to hang out there; when you see her at the park, you guess it's because her friends with interests in Y like to hang out there; when you see Bob at the movie theater, you randomly guess it's because the Z people in this city really like to watch movies; and so on.
Of course, your random guesses are very likely to be incorrect (they're random guesses, after all!), so you want to improve on them. One way of doing so is to:
- Pick a place and a person (e.g., Alice at the mall).
- Why is Alice likely to be at the mall? Probably because other people at the mall with the same interests sent her a message telling her to come.
- In other words, the more people with interests in X there are at the mall and the stronger Alice is associated with interest X (at all the other places she goes to), the more likely it is that Alice is at the mall because of interest X.
- So make a new guess as to why Alice is at the mall, choosing an interest with some probability according to how likely you think it is.
Go through each place and person over and over again. Your guesses keep getting better and better (after all, if you notice that lots of geeks hang out at the bookstore, and you suspect that Alice is pretty geeky herself, then it's a good bet that Alice is at the bookstore because her geek friends told her to go there; and now that you have a better idea of why Alice is probably at the bookstore, you can use this knowledge in turn to improve your guesses as to why everyone else is where they are), and so eventually you can stop updating. Then take a snapshot (or multiple snapshots) of your guesses, and use it to get all the information you want:
- For each category, you can count the people assigned to that category to figure out what people have this particular interest. By looking at the people themselves, you can interpret the category as well (e.g., if category X contains lots of tall people wearing jerseys and carrying around basketballs, you might interpret X as the "basketball players" group).
- For each place P and interest category C, you can compute the proportions of people at P because of C (under the current set of assignments), and these give you a representation of P. For example, you might learn that the people who hang out at Barnes & Noble consist of 10% hipsters, 50% anime fans, 10% jocks, and 30% college students.
### Real-World Example
Finally, let's go through a real-world example. I applied LDA to a set of Sarah Palin's emails a little while ago (see http://blog.echen.me/2011/06/27/... for a blog post, or http://sarah-palin.heroku.com/ for an app that allows you to browse through the emails by the LDA-learned topics), so here are some of the topics that the algorithm learned:
- **Trig/Family/Inspiration:** family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, ...
- **Wildlife/BP Corrosion:** game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, denby, fishing, ...
- **Energy/Fuel/Oil/Mining:** energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, ...
- **Gas:** gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, ...
- **Education/Waste:** school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, ...
- **Presidential Campaign/Elections:** mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o, ...
#### Suggested Reading:
<!-- Please add any articles you think might be helpful to read before writing the article -->
- [Latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
- [Introduction to Latent Dirichlet Allocation](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/)

View File

@ -0,0 +1,65 @@
---
title: Linear Regression
---
## Linear Regression
Linear regression is one of several regression techniques used to find the best-fitting line for a given set of points in a dataset.
Linear regression helps us predict the score of a variable Y from the scores on one or more variables X. When the data points are plotted, linear regression finds the best-fitting straight line through them. The best-fitting line is called a regression line.
This is done by taking a line equation, comparing its predictions with the actual points, and calibrating the line so that the difference/distance between the points and the line, or simply the error, is kept to a minimum. This way of calibrating is called the least squares method.
[Online linear regression simulator](https://www.mladdict.com/linear-regression-simulator)
In Python:
```py
# Price of wheat/kg and the average price of bread
wheat_and_bread = [[0.5, 5], [0.6, 5.5], [0.8, 6], [1.1, 6.8], [1.4, 7]]

def step_gradient(b_current, m_current, points, learning_rate):
    # One step of gradient descent on the squared error of y = m*x + b
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i][0]
        y = points[i][1]
        b_gradient += -(2 / N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2 / N) * x * (y - ((m_current * x) + b_current))
    new_b = b_current - (learning_rate * b_gradient)
    new_m = m_current - (learning_rate * m_gradient)
    return [new_b, new_m]

def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        b, m = step_gradient(b, m, points, learning_rate)
    return [b, m]

print(gradient_descent_runner(wheat_and_bread, 1, 1, 0.01, 100))
```
Code example is from <a href='http://blog.floydhub.com/coding-the-history-of-deep-learning/' target='_blank' rel='nofollow'>this article</a>. It also explains gradient descent and other essential concepts for deep learning.
It is important to note that not all linear regression is done with gradient descent. The normal equation can also be used to find the linear regression coefficients; however, it relies on matrix operations (including a matrix inversion) and can therefore become very time consuming for datasets with more than 100,000 or 1,000,000 instances.
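As a rough sketch (not from the article the first code example came from), the normal equation θ = (XᵀX)⁻¹Xᵀy can be evaluated directly with NumPy on the same wheat-and-bread data:
```python
# A sketch of the normal equation with NumPy, reusing the wheat-and-bread data above.
import numpy as np

data = np.array([[0.5, 5], [0.6, 5.5], [0.8, 6], [1.1, 6.8], [1.4, 7]])
X = np.column_stack([np.ones(len(data)), data[:, 0]])  # prepend a bias column of ones
y = data[:, 1]

# Solving the linear system is numerically more stable than explicitly inverting X^T X.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [intercept, slope]
```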
In Python:
The scikit-learn library can be applied directly, making linear regression easy to use even on large datasets.
```py
import pandas as pd
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
from sklearn.linear_model import LinearRegression as lr

train = pd.read_csv('../input/train.csv')

X = train.iloc[:, 0:4].values  # first four columns as features
y = train.iloc[:, 4].values    # fifth column as the target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = lr()
model.fit(X_train, y_train)
print(model.score(X_train, y_train))  # R^2 on the training set
print(model.coef_)
print(model.intercept_)

y_pred = model.predict(X_test)

# Evaluate on the held-out set; for regression, use error metrics such as MSE
# (accuracy_score only applies to classification)
from sklearn import metrics
print(metrics.mean_squared_error(y_test, y_pred))
```

View File

@ -0,0 +1,67 @@
---
title: Logistic Regression
---
## Logistic Regression
![Logistic Function](https://qph.fs.quoracdn.net/main-qimg-7c9b7670c90b286160a88cb599d1b733)<br>
Logistic regression is very similar to linear regression in that it attempts to predict a response variable Y given a set of X input variables. It's a form of supervised learning, which tries to predict the responses of unlabeled, unseen data by first training with labeled data, a set of observations of both independent (X) and dependent (Y) variables. But while <a href='https://guide.freecodecamp.org/machine-learning/linear-regression' target='_blank'>Linear Regression</a> assumes that the response variable (Y) is quantitative or continuous, logistic regression is used specifically when the response variable is qualitative, or discrete.<br>
![Linear vs Logistic](http://www.saedsayad.com/images/LogReg_1.png)
#### How it Works
Logistic regression models the probability that Y, the response variable, belongs to a certain category. In many cases, the response variable will be a binary one, so logistic regression will model a function y = f(x) that outputs a normalized value ranging from, say, 0 to 1 for all values of X, corresponding to the two possible values of Y. It does this by using the logistic function, defined below.
Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, the logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
![Cost Function](https://cdn-images-1.medium.com/max/800/1*wHtYmENzug_W6fIE9xY8aw.jpeg)
<br>
Logistic regression is used to solve classification problems, where the output is of the form y ∈ {0,1}. Here, 0 is a negative class and 1 is a positive class. Say we have a hypothesis hθ(x), where x is our dataset (a matrix) of length m and θ is the parameter matrix. We require: 0 ≤ hθ(x) ≤ 1.
In logistic regression, hθ(x) is a sigmoid function, thus hθ(x) = g(θ'x), where
g(θ'x) = 1 / (1 + e^(-θ'x))
Note: θ' is θ transpose.
#### Cost function
The cost function used for logistic regression is:
J(θ) = (1/m) ∑ Cost(hθ(x(i)), y(i)), where the summation runs from i = 1 to m
Cost(hθ(x), y) = −log(hθ(x)) if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x)) if y = 0
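As a small sketch, the hypothesis and cost above can be written in a few lines of NumPy; the three-example dataset here is made up for illustration:
```python
# A sketch of the sigmoid hypothesis and logistic regression cost with NumPy.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    h = sigmoid(X @ theta)
    m = len(y)
    return -(1 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])  # first column is the bias term
y = np.array([0.0, 0.0, 1.0])
print(cost(np.zeros(2), X, y))  # at theta = 0, h = 0.5 everywhere, so J = log(2) ≈ 0.693
```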
#### Predictions using logistic regression:
Logistic regression models the probability of the default class (i.e., the first class).
You can classify results given by:
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
into two classes. As with the sigmoid function, 0.5 is set as the decision boundary: all x for which y ≥ 0.5 are classified as class A, and those for which y < 0.5 are classified as class B.
#### Multi class logistic regression:
Although logistic regression is usually used for binary classification, you can also use it to classify into multiple classes by:
##### one vs all method:
Here a binary classifier is created for each class separately, and the classifier with the highest score is taken as the output.
##### one vs one method:
Here multiple binary classifiers (N(N−1)/2, where N = number of classes, one for each pair of classes) are built, and the output is obtained by comparing their scores.
#### Applications of logistic regression:
1) To classify mail as spam or not spam.<br>
2) To determine the presence or absence of a certain disease, like cancer, based on symptoms and other medical data.<br>
3) Classify images based on pixel data.
#### Logistic Regression Assumptions
- Binary logistic regression requires the dependent variable to be binary.
- For a binary regression, the factor level 1 of the dependent variable should represent the desired outcome.
- Only the meaningful variables should be included.
- The independent variables should be independent of each other. That is, the model should have little or no multicollinearity.
- The independent variables are linearly related to the log odds.
- Logistic regression requires quite large sample sizes.
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->
For further reading on building logistic regression step by step:
- Click <a href="https://medium.com/towards-data-science/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8" target='_blank' rel='nofollow'>here</a> for an article about building a logistic regression in Python.
- Click <a href="http://nbviewer.jupyter.org/gist/justmarkham/6d5c061ca5aee67c4316471f8c2ae976" target='_blank' rel='nofollow'>here</a> for another article on building a logistic regression.
- Click <a href="http://nbviewer.jupyter.org/gist/justmarkham/6d5c061ca5aee67c4316471f8c2ae976" target='_blank' rel='nofollow'>here</a> for another article on the mathematics and intuition behind logistic regression.

View File

@ -0,0 +1,20 @@
---
title: Monte Carlo
---
## Monte Carlo
Monte Carlo methods are a class of simulation techniques that let you explore the solution space of a problem whose inputs can take on multiple values. By running simulations with randomized inputs and model parameters, you can observe outcomes resulting from inputs that might not otherwise have been tested. The method is useful for solving problems that may be too difficult to solve analytically. It is not an exact method, but a heuristic one, typically using randomness and statistics to get a result. The algorithm terminates with an answer that is correct with some probability.
It is a computational process that uses random numbers to produce outcomes. Instead of having fixed inputs, probability distributions are assigned to some or all of the inputs. This generates a probability distribution for the output once the simulation is run.
For example, a Monte Carlo algorithm can be used to estimate the value of π. The area within a quarter-circle of radius 1 depends on the value of π, so the probability that a randomly chosen point in the unit square lands inside that quarter-circle depends on π as well. A Monte Carlo algorithm randomly places points in the square and uses the percentage of points falling inside the circle to estimate π. This is an effective way of making approximations.
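A minimal sketch of that estimate in Python:
```python
# Estimate pi by sampling random points in the unit square.
import random

def estimate_pi(n_points=100_000):
    inside = 0
    for _ in range(n_points):
        x, y = random.random(), random.random()  # random point in the unit square
        if x * x + y * y <= 1.0:                 # inside the quarter-circle of radius 1
            inside += 1
    return 4.0 * inside / n_points               # the quarter-circle's area is pi/4

print(estimate_pi())  # approaches 3.14159... as n_points grows
```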
In modern communication systems, the quality of information exchange is determined by the presence of noise in the channel. Since the major source of noise, Additive White Gaussian Noise (AWGN), is random in nature, Monte Carlo methods can be used to characterize it when simulating a communications system.
### More Information:
- [Wikipedia](https://en.wikipedia.org/wiki/Monte_Carlo_method)
- [Wolfram MathWorld](http://mathworld.wolfram.com/MonteCarloMethod.html)
- [Minitab article - Monte Carlo is not as difficult as you think](http://blog.minitab.com/blog/understanding-statistics/monte-carlo-is-not-as-difficult-as-you-think)
- [Monte Carlo Algorithm (4:41)](https://www.youtube.com/watch?v=Q2-FH36LuT0)

View File

@ -0,0 +1,49 @@
---
title: Natural Language Processing
---
## Natural Language Processing(NLP)
As Wikipedia puts it, "Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data."
In simpler terms, it is the process by which computers make sense of natural language generated by humans.
### Challenges in NLP
#### 1. Easy or mostly solved
* Spam detection
* Part of speech tagging
* Named entity recognition
#### 2. Intermediate or making good progress
* Sentiment analysis
* Coreference resolution
* Word sense disambiguation
* Parsing
* Machine translation
* Information extraction
#### 3. Hard or still needing a lot of work
* Text summarization
* Machine dialog systems
### Common Techniques
* Structure extraction
* Identify and mark sentence, phrase, and paragraph boundaries
* Language identification
* Tokenization
* Acronym normalization and tagging
* Lemmatization / stemming
* Entity extraction
* Phrase extraction
### Popularly Used Libraries
* NLTK, the most widely-mentioned NLP library for Python.
* SpaCy, an industrial-strength NLP library built for performance.
* Gensim, a library for document similarity analysis.
* TextBlob, a user-friendly and intuitive NLTK interface.
* CoreNLP, from the Stanford NLP Group.
* PolyGlot, a natural language pipeline that supports massive multilingual applications.
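As a quick illustration, here is what tokenization looks like with NLTK, the first library above (assuming `pip install nltk`; newer NLTK versions may ask for the `punkt_tab` resource instead of `punkt`):
```python
# Sentence and word tokenization with NLTK.
import nltk

nltk.download("punkt")  # one-time download of the tokenizer models
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Natural language processing is fun. Computers can make sense of text!"
print(sent_tokenize(text))  # splits the text into sentences
print(word_tokenize(text))  # splits the text into words and punctuation tokens
```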
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->
For further reading:
- Click <a href="https://medium.com/@gon.esbuyo/get-started-with-nlp-part-i-d67ca26cc828" target='_blank' rel='nofollow'>here</a> for an introductory article about NLP.
- Click <a href="https://en.wikipedia.org/wiki/Natural_language_processing" target='_blank' rel='nofollow'>here</a> for the Wikipedia reference.

View File

@ -0,0 +1,11 @@
---
title: Convolutional Neural Networks
---
Convolutional Neural Networks (ConvNets or CNNs) are a category of neural networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects, and traffic signs, as well as powering vision in robots and self-driving cars.
### Suggested links:
- Stanford CS231n [Lecture 5 Convolutional Neural Networks](https://www.youtube.com/watch?v=bNb2fEVKeEo)
- Stanford CS231n [Lecture 9 CNN Architectures](https://www.youtube.com/watch?v=DAOcjicFr1Y&t=2384s)
- Udacity Deep Learning: [Convolutional networks](https://www.youtube.com/watch?v=jajksuQW4mc)
- Andrew Ng's DeepLearning.ai: [Convolutional Neural Networks](https://www.coursera.org/learn/convolutional-neural-networks/)

View File

@ -0,0 +1,19 @@
---
title: Generative Adversarial Networks
---
## Generative Adversarial Networks
## Overview
Generative adversarial networks (GANs) are a class of [artificial intelligence](https://en.wikipedia.org/wiki/Artificial_intelligence) algorithms used in [unsupervised machine learning](https://en.wikipedia.org/wiki/Unsupervised_machine_learning), implemented by a system of two [neural networks](https://en.wikipedia.org/wiki/Neural_network) contesting with each other in a zero-sum game framework. They were introduced by Ian Goodfellow et al. in 2014. This technique can generate photographs that look at least superficially authentic to human observers, having many realistic characteristics (though in tests people can tell real from generated in many cases).
## Method
One network generates candidates (generative) and the other [evaluates them](https://en.wikipedia.org/wiki/Turing_test) (discriminative). Typically, the generative network learns to map from a [latent space](https://en.wikipedia.org/wiki/Latent_variable) to a particular data distribution of interest, while the discriminative network discriminates between instances from the true data distribution and candidates produced by the generator. The generative network's training objective is to increase the error rate of the discriminative network (i.e., "fool" the discriminator network by producing novel synthesised instances that appear to have come from the true data distribution).
In practice, a known dataset serves as the initial training data for the discriminator. Training the discriminator involves presenting it with samples from the dataset, until it reaches some level of accuracy. Typically the generator is seeded with a randomized input that is sampled from a predefined latent space (e.g. a [multivariate normal distribution](https://en.wikipedia.org/wiki/Multivariate_normal_distribution)). Thereafter, samples synthesized by the generator are evaluated by the discriminator. [Backpropagation](https://en.wikipedia.org/wiki/Backpropagation) is applied in both networks so that the generator produces better images, while the discriminator becomes more skilled at flagging synthetic images. The generator is typically a deconvolutional neural network, and the discriminator is a [convolutional neural network](https://en.wikipedia.org/wiki/Convolutional_neural_network).
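To make the two-player loop concrete, here is a minimal, hypothetical sketch in PyTorch (an assumed dependency); the toy data distribution, network sizes, and hyperparameters are all our own choices, not a reference implementation:
```python
# A minimal GAN training loop: G makes candidates, D scores them.
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 2, 64

G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(batch, data_dim) + 3.0   # samples from a toy "true" distribution
    fake = G(torch.randn(batch, latent_dim))    # candidates from the generator

    # Discriminator step: label real samples 1 and generated samples 0.
    d_loss = (bce(D(real), torch.ones(batch, 1)) +
              bce(D(fake.detach()), torch.zeros(batch, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make the discriminator output 1 on generated samples.
    g_loss = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```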
The idea to infer models in a competitive setting (model versus discriminator) was proposed by Li, Gauci and Gross in 2013. Their method is used for behavioral inference. It is termed Turing Learning, as the setting is akin to that of a [Turing test](https://en.wikipedia.org/wiki/Turing_test). Turing Learning is a generalization of GANs. Models other than neural networks can be considered. Moreover, the discriminators are allowed to influence the processes from which the datasets are obtained, making them active interrogators as in the Turing test. The idea of adversarial training can also be found in earlier works, such as Schmidhuber in 1992.
## Application
GANs have been used to produce samples of [photorealistic](https://en.wikipedia.org/wiki/Photorealistic) images for the purposes of visualizing new interior/industrial design, shoes, bags and clothing items or items for computer games' scenes. These networks were reported to be used by Facebook. Recently, GANs have modeled patterns of motion in video. They have also been used to reconstruct 3D models of objects from images and to improve astronomical images. In 2017 a fully convolutional feedforward GAN was used for image enhancement using automated texture synthesis in combination with perceptual loss. The system focused on realistic textures rather than pixel-accuracy. The result was a higher image quality at high magnification.

View File

@ -0,0 +1,66 @@
---
title: Neural Networks
---
## Neural Networks
![Feed-forward neural network](http://ufldl.stanford.edu/tutorial/images/SingleNeuron.png)
An artificial neural network is a computing system inspired by the biological neural networks that constitute animal brains.
To train a neural network, we need an input vector and a corresponding output vector.
The training works by minimizing an error term. This error can be the squared difference between the predicted output and the original output.
The basic principle which underlies the remarkable success of neural networks is the Universal Approximation Theorem. It has been mathematically proven that neural networks are universal approximation machines, capable of approximating essentially any continuous function between the given input and output.
Neural networks initially became popular in the 1980s, but limitations in computational power prohibited their widespread acceptance until the past decade.
Innovations in CPU size and power allow for neural network implementation at scale, though other machine learning paradigms still outrank neural networks in terms of efficiency.
The most basic element of a neural network is a neuron. Its input is a vector, say `x`, and its output is a real-valued variable, say `y`. Thus, we can conclude that the neuron acts as a mapping between the vector `x` and a real number `y`.
Neural networks perform regression iteratively across multiple layers, resulting in a more nuanced prediction model.
A single node in a neural network computes the exact same function as [logistic regression](../logistic-regression/index.md).
All these layers, aside from the input and output, are hidden, that is, the specific traits represented by these layers are not chosen or modified by the programmer.
![Four Layered Neural Network](http://cs231n.github.io/assets/nn1/neural_net2.jpeg)
In any given layer, each node takes all values stored in the previous layer as input and makes predictions on them based on a logistic regression analysis.
The power of neural networks lies in their ability to "discover" patterns and traits unseen by programmers.
As mentioned earlier, the middle layers are "hidden," meaning the weights given to the transitions are determined exclusively by the training of the algorithm.
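As a sketch of this idea, a single forward pass through one hidden layer can be written in a few lines of NumPy; the random weights here are stand-ins for trained parameters:
```python
# A forward pass through a tiny network: 3 inputs -> 4 hidden neurons -> 1 output.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
x = rng.rand(3)                         # input vector with 3 features
W1, b1 = rng.rand(4, 3), rng.rand(4)    # hidden layer: 4 neurons
W2, b2 = rng.rand(1, 4), rng.rand(1)    # output layer: 1 neuron

hidden = sigmoid(W1 @ x + b1)   # each hidden node acts like a logistic regression unit
y = sigmoid(W2 @ hidden + b2)   # the output node does the same on the hidden values
print(y)
```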
Neural networks are used on a variety of tasks. These include computer vision, speech recognition, translation, social network filtering, playing video games, and medical diagnosis among other things.
### Visualization
There's an awesome tool to help you grasp the idea of neural networks without any hard math: <a href='http://playground.tensorflow.org' target='_blank' rel='nofollow'>TensorFlow Playground</a>, a web app that lets you play with a real neural network running in your browser and click buttons and tweak parameters to see how it works.
### Problems solved using Neural Networks
- Classification
- Clustering
- Regression
- Anomaly detection
- Association rules
- Reinforcement learning
- Structured prediction
- Feature engineering
- Feature learning
- Learning to rank
- Grammar induction
- Weather prediction
- Generating images
### Common Neural Network Systems
The most common Neural Networks used today fall into the [deep learning](https://github.com/freeCodeCamp/guides/blob/master/src/pages/machine-learning/deep-learning/index.md) category. Deep learning is the process of chaining multiple layers of neurons to allow a network to create increasingly abstract mappings between input and output vectors. Deep neural networks will most commonly use [backpropagation](https://github.com/freeCodeCamp/guides/blob/master/src/pages/machine-learning/backpropagation/index.md) in order to converge upon the most accurate mapping.
The second most common form of neural networks is neuroevolution. In this system, multiple neural networks are randomly generated as initial guesses. Then, over multiple generations, the most accurate networks are combined and randomly permuted to converge upon a more accurate mapping.
### Types of Neural Networks
- Recurrent Neural Network (RNN)
- Long-short Term Memory (LSTM), a type of RNN
- Convolutional Neural Network (CNN)
### More Information:
- [Neural Networks - Wikipedia](https://en.wikipedia.org/wiki/Artificial_neural_network#Components_of_an_artificial_neural_network)
- [Daniel Shiffman's Nature of Code](http://natureofcode.com/book/chapter-10-neural-networks/)
- [Stanford University, Multilayer Neural Networks](http://ufldl.stanford.edu/tutorial/supervised/MultiLayerNeuralNetworks/)
- [3Blue1Brown, YouTube channel with neural network content](https://youtu.be/aircAruvnKk)
- [Siraj Raval, YouTube channel with neural network content](https://youtu.be/h3l4qz76JhQ)
- [Neuroevolution - Wikipedia](https://en.wikipedia.org/wiki/Neuroevolution)

View File

@ -0,0 +1,15 @@
---
title: Multi Layer Perceptron
---
## Multi Layer Perceptron
This is a stub. <a href='https://github.com/freecodecamp/guides/tree/master/src/pages/machine-learning/neural-networks/multi-layer-perceptron/index.md' target='_blank' rel='nofollow'>Help our community expand it</a>.
<a href='https://github.com/freecodecamp/guides/blob/master/README.md' target='_blank' rel='nofollow'>This quick style guide will help ensure your pull request gets accepted</a>.
<!-- The article goes here, in GitHub-flavored Markdown. Feel free to add YouTube videos, images, and CodePen/JSBin embeds -->
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->

View File

@ -0,0 +1,13 @@
---
title: Perceptron
---
## Perceptron
This is a stub. <a href='https://github.com/freecodecamp/guides/tree/master/src/pages/machine-learning/neural-networks/perceptron/index.md' target='_blank' rel='nofollow'>Help our community expand it</a>.
<a href='https://github.com/freecodecamp/guides/blob/master/README.md' target='_blank' rel='nofollow'>This quick style guide will help ensure your pull request gets accepted</a>.
<!-- The article goes here, in GitHub-flavored Markdown. Feel free to add YouTube videos, images, and CodePen/JSBin embeds -->
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->

View File

@ -0,0 +1,15 @@
---
title: Recurrent Neural Networks
---
## Recurrent Neural Networks
This is a stub. <a href='https://github.com/freecodecamp/guides/tree/master/src/pages/machine-learning/neural-networks/recurrent-neural-networks/index.md' target='_blank' rel='nofollow'>Help our community expand it</a>.
<a href='https://github.com/freecodecamp/guides/blob/master/README.md' target='_blank' rel='nofollow'>This quick style guide will help ensure your pull request gets accepted</a>.
<!-- The article goes here, in GitHub-flavored Markdown. Feel free to add YouTube videos, images, and CodePen/JSBin embeds -->
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->

View File

@ -0,0 +1,21 @@
---
title: One-Shot Learning
---
# One-Shot Learning
Humans learn new concepts with very little need for repetition; e.g., a child can generalize the concept of a “monkey” from a single picture in a book, yet our best deep learning systems need hundreds or thousands of examples to learn any object class to even a decent accuracy. This motivates the setting we are interested in: “one-shot” learning, which consists of learning a class from a single (or very few) labelled examples.
There are various approaches to one-shot learning, such as [similarity functions](https://www.coursera.org/lecture/convolutional-neural-networks/one-shot-learning-gjckG) and [Bayes' probability theorem](https://www.youtube.com/watch?v=FIjy3lV_KJU); DeepMind has even come up with its own version of neural networks using the one-shot learning approach!
### More information:
* [Siraj Raval on YouTube](https://www.youtube.com/watch?v=FIjy3lV_KJU&feature=youtu.be)
* [Andrew Ng (Deeplearning.ai)](https://www.coursera.org/lecture/convolutional-neural-networks/one-shot-learning-gjckG)
* [Scholarly article](http://web.mit.edu/cocosci/Papers/Science-2015-Lake-1332-8.pdf)
* [Wikipedia](https://en.wikipedia.org/wiki/One-shot_learning)

View File

@ -0,0 +1,35 @@
---
title: Eigen Faces
---
## Eigen Faces
### Outline
* Problem
* Solution Approach
* Dataset
* Mathematical Analysis
* Image Reconstruction
### Problem
We typically use the eigenvalues and eigenvectors of the covariance matrix of the data to compute our principal components. But what if you are not able to calculate the covariance matrix due to memory issues?
### Solution Approach
We now use a trick. Instead of building the covariance matrix over the image dimensions, we build it over the number of images. This opens up another advantage: once we have the feature vectors of all our images, all we need are these m images to be able to reconstruct (approximately) any face image.
### Defining the dataset
Suppose we have m greyscale images of size n x n, where m is of the order of 100 and n is of the order of 10,000. Our goal is to select k components which correctly represent all the features of the images.
We now create a matrix X in which we store the flattened images (n² x 1 each) column-wise. Therefore X is of dimension n² x m.
### Mathematical Analysis
Computing the covariance of this matrix is where things get interesting.
The covariance of a matrix X is defined as dot(X, X.T), the dimension of which is n² x n². This will obviously go out of memory for such a large dataset.
Now consider the following set of equations:
dot(X.T, X) V = λ V, where V is an eigenvector of dot(X.T, X) and λ is the corresponding eigenvalue.
Pre-multiplying both sides with X:
dot(dot(X, X.T), dot(X, V)) = λ dot(X, V)
Thus we find that an eigenvector of the covariance matrix is simply the dot product of the image matrix and the eigenvector of dot(X.T, X).
We therefore compute dot(X.T, X), whose dimension is just m x m, and use the eigenvectors of this matrix to construct the eigenvectors of the original covariance matrix.
The m eigenvalues of dot(X.T, X) (along with their corresponding eigenvectors) correspond to the m largest eigenvalues of dot(X, X.T) (along with their corresponding eigenvectors). Our required eigenvectors are just the first k of these, ordered by decreasing eigenvalue. We now compute a matrix of eigenfaces, which is nothing but the images weighted against their eigenvectors. The weights of every image against the first k eigenfaces will then be dot(X.T, eigenfaces(first k values)).
### Image Reconstruction
This method helps us represent any image using just k weights against features derived from the m images, and any image can be reconstructed using these weights.
To get image i:
Image(i) = dot(eigenfaces(k), weights[i, :].T)
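The whole trick can be sketched in NumPy as follows; random noise stands in for real face images here, and the sizes are scaled down so the sketch runs quickly:
```python
# A NumPy sketch of the eigenfaces trick described above.
import numpy as np

m, n_sq = 100, 10000                  # m images, each flattened to n^2 pixels
X = np.random.rand(n_sq, m)           # columns are flattened images (stand-in data)
X -= X.mean(axis=1, keepdims=True)    # mean-center each pixel across images

# Eigendecomposition of the small m x m matrix instead of the n^2 x n^2 covariance.
eigvals, V = np.linalg.eigh(X.T @ X)  # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]     # re-sort in descending order
k = 20
V_k = V[:, order[:k]]                 # top-k eigenvectors of dot(X.T, X)

# Map back: each eigenvector of dot(X, X.T) is X dotted with one of these.
eigenfaces = X @ V_k                  # n^2 x k
eigenfaces /= np.linalg.norm(eigenfaces, axis=0)

weights = eigenfaces.T @ X            # k weights for each of the m images

# Reconstruct image i from its k weights.
i = 0
reconstruction = eigenfaces @ weights[:, i]
print(reconstruction.shape)           # (10000,) -- a flattened n x n image
```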

View File

@ -0,0 +1,65 @@
---
title: Principal Component Analysis
---
## Principal Component Analysis (PCA)
### What is it?
Principal Component Analysis (PCA) is an algorithm used in machine learning to reduce the dimensions of a dataset. You might reduce a dataset containing hundreds of different features to a new dataset containing only two.
For example, imagine you want to measure a pilot's ability. There are many different factors involved in this. Two relevant features to take into account might be the pilot's skill and the pilot's enjoyment. This would be a two-dimensional dataset, since it contains two features. PCA could reduce these features into one by fusing them together. You might call this new feature "pilot aptitude". This new feature gives you a simpler metric to measure a pilot's ability.
If you plot pilot skill against pilot enjoyment, you might get something like this:
![Plotting pilot skill versus pilot enjoyment](https://github.com/DHDaniel/guides/blob/master/src/pages/machine-learning/principal-component-analysis/plot-skill-vs-enjoyment.png?raw=true)
Intuitively, what PCA does is it finds the best straight line (or plane, in higher dimensional situations) on which to project these two features. Projection is done by drawing a perpendicular line between the point and the line. You can see an illustration of this below:
![Projection onto line](https://github.com/DHDaniel/guides/blob/master/src/pages/machine-learning/principal-component-analysis/projection.png?raw=true)
You can think of PCA as finding the best line for the dataset so that each point's information is better preserved. It does this by minimizing the sum of the squared projection errors of each point. The projection error is the length of the perpendicular lines projecting each point onto the line. By minimizing these, you ensure that the chosen straight line is the best combination of these two features.
Below are examples of a good line on which to project the data, and a bad one. The good line's resulting projections are more representative of the original data, and have smaller errors. The bad line's resulting projections are clearly a worse representation, and the projection errors are much larger.
![Good versus bad projection of points](https://github.com/DHDaniel/guides/blob/master/src/pages/machine-learning/principal-component-analysis/good-vs-bad-projection.png?raw=true)
**Important**: It is worth noting that PCA is different from [linear regression](https://en.wikipedia.org/wiki/Linear_regression). Their optimization objectives (what they aim to minimize) are different.
If you run PCA on the pilot dataset, you may get a new feature, "pilot aptitude", that looks something like this:
![Transforming the pilot dataset using PCA](https://github.com/DHDaniel/guides/blob/master/src/pages/machine-learning/principal-component-analysis/PCA-on-dataset.png?raw=true)
The mathematics behind PCA is somewhat complicated, but you don't have to be an expert on it to be able to use it. Even though there is a lot of linear algebra behind it, using it is relatively easy. This is because there are plenty of code libraries with ready-made PCA implementations. A few examples include:
- [A JavaScript PCA implementation](https://github.com/mljs/pca).
- [Python scikit-learn implementation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
- [MATLAB implementation](https://www.mathworks.com/help/stats/pca.html).
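For instance, a minimal sketch with the scikit-learn implementation above, using made-up "pilot" data in the spirit of the earlier example:
```python
# Reduce two correlated features to one principal component with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

# Two correlated features: pilot skill and pilot enjoyment (synthetic data).
rng = np.random.RandomState(0)
skill = rng.normal(size=100)
enjoyment = skill + 0.3 * rng.normal(size=100)
X = np.column_stack([skill, enjoyment])

pca = PCA(n_components=1)             # reduce two features to one
aptitude = pca.fit_transform(X)       # the new "pilot aptitude" feature

print(pca.explained_variance_ratio_)  # how much of the original variance survives
print(aptitude[:5])
```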
### Why use it?
There are many reasons to use the PCA algorithm. One very important one is to visualize data. It is easy to visualize 1-D, 2-D, and even 3-D data, but beyond that, it becomes hard. In machine learning, it is often very useful to visualize the data before beginning to work on it, but a high-dimensional dataset is very hard to visualize. By using PCA, a hundred-dimensional dataset might be reduced to a two-dimensional one.
This is especially useful in real-world situations, where datasets are often high-dimensional. For example, you might be able to combine economic performance metrics like GDP, HDI, etc., into a single feature. This enables you to make better comparisons between countries, and it also allows you to visualize the data using a graph.
Another reason for using the PCA algorithm is to make your dataset smaller. For problems involving hundreds of thousands of features (like image processing), machine learning algorithms can take a long time to run. By reducing the number of features, you might improve the speed of these algorithms without sacrificing accuracy. You might also save a lot of disk space, especially if you are working with huge datasets.
### Limitations
Since you are basically simplifying a dataset when you run PCA, some details may be lost in the process. This is especially the case with data points that are very spread out and do not have a very strong correlation.
#### Suggested Reading:
<!-- Please add any articles you think might be helpful to read before writing the article -->
- https://www.reddit.com/r/datascience/comments/668pp1
- https://en.wikipedia.org/wiki/Principal_component_analysis
- http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
- http://setosa.io/ev/principal-component-analysis/ (Interactive)
Wikipedia says, "Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (or sometimes, principal modes of variation)."
Principal component analysis (PCA) is a statistical technique used to examine the interrelations among a set of variables in order to identify the underlying structure of those variables. PCA usually reduces the number of features from N dimensions to k dimensions, where k < N.
PCA has the following applications:
1) Compression: increases computational speed and reduces storage space
2) Visualization: PCA can reduce the data to two or three dimensions for visualization purposes

View File

@ -0,0 +1,23 @@
---
title: Correlation Does not Imply Causation
---
## Correlation Does not Imply Causation
<!-- The article goes here, in GitHub-flavored Markdown. Feel free to add YouTube videos, images, and CodePen/JSBin embeds -->
Many fitness- and health-related websites often miss this point about the research that tends to happen in these fields. They report scientific research as causation rather than what it really is: correlation. For example, researchers found that early risers have a lower BMI and are less likely to be obese. This correlation can be misrepresented as 'waking up early can reduce the chances of obesity'. We do not know that waking up early 'caused' the outcome of lower obesity; what we have found here is a correlation.
An informal definition of correlation goes: when event A happens, event B also tends to happen, and vice versa. Or: people who wake up early tend to be towards the lower end of the weight spectrum. The two events tend to happen together, but it is not necessary that one event caused the other.
Causality means that event A 'caused' or led to the happening of event B. For example, if I stand in the sun, I will get tanned. Here the second event occurs because of the first.
In statistics, there is a lot of talk about **correlated variables**. A correlation is a relationship between two variables. **Causation** refers to a relationship where a change in one variable **is responsible for** the change of another variable. This is also known as a **causal relationship**.
When there is a causal relationship between two variables, there is also a correlation between them. But, a correlation between two variables does not imply a causal relationship between them. This is a <a href='https://en.wikipedia.org/wiki/Formal_fallacy' target='_blank' rel='nofollow'>logical fallacy</a>.
This is because a correlation between two variables can be explained by many reasons:
- One variable influences the other. This _would_ be a causal relationship. For example, there is a correlation between household salary and number of cars owned.
- Both variables influence each other. This _would_ be a two-way causal relationship. For example, a correlation between education level and the wealth of a person.
- There is another variable that is influencing both variables under examination. This would _not_ be a causal relationship. For example, number of cars owned and size of the house may be correlated, but these two variables are influenced by another variable: salary. An increase in the number of cars owned does not influence the size of the house.
- Correlation could be a random accident. This would _not_ be a causal relationship. A classic example is the close correlation between margarine consumption and the divorce rate in Maine.
In machine learning, correlation suffices for making a predictive model. However, just because two variables are correlated does not mean one variable influences the other. In other words, although machine learning may help find a relationship between two variables, it does not necessarily help find the reason for the relationship.

View File

@ -0,0 +1,33 @@
---
title: Data Alone Is not Enough
---
## Data Alone Is not Enough
Without accounting for changing machine learning algorithms or other aspects of
training the model, data alone is not enough to help your learner do better.
> Every learner must embody some knowledge or assumptions beyond the data it's
> given in order to generalize beyond it (Domingos, 2012).
What this statement is essentially saying is that if you blindly choose a
learner just because you've heard it does well, collecting more data won't
necessarily help you in your machine learning goals.
For example, say you have data which depends on time (e.g. time series data)
and you want to use a binary classifier (e.g. logistic regression). Collecting
more time series data might not be the best way to help your learner, because
a plain binary classifier isn't designed for time series.
This is not to say that once you've chosen the best machine learning algorithm
based on your problem that adding more data does you no good. In this case, it
will help you.
> Machine learning is not magic; it can't get something from nothing. What it
> does is get more from less...Learners combine knowledge with data to grow
> programs (Domingos, 2012).
#### More Information:
- <a href='https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf' target='_blank' rel='nofollow'>A Few Useful Things to Know about Machine Learning</a>
- <a href='http://www.kdnuggets.com/2015/06/machine-learning-more-data-better-algorithms.html' target='_blank' rel='nofollow'>In Machine Learning, What is Better: More Data or better Algorithms?</a>
- <a href='https://www.quora.com/In-machine-learning-is-more-data-always-better-than-better-algorithms/answer/Xavier-Amatriain?srid=Tds3' target='_blank' rel='nofollow'>In machine learning, is more data always better than better algorithms?</a>

View File

@ -0,0 +1,15 @@
---
title: Feature Engineering Is the Key
---
## Feature Engineering Is the Key
This is a stub. <a href='https://github.com/freecodecamp/guides/tree/master/src/pages/machine-learning/principles/feature-engineering-is-the-key/index.md' target='_blank' rel='nofollow'>Help our community expand it</a>.
<a href='https://github.com/freecodecamp/guides/blob/master/README.md' target='_blank' rel='nofollow'>This quick style guide will help ensure your pull request gets accepted</a>.
<!-- The article goes here, in GitHub-flavored Markdown. Feel free to add YouTube videos, images, and CodePen/JSBin embeds -->
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->

View File

@ -0,0 +1,15 @@
---
title: Principles
---
## Principles
This is a stub. <a href='https://github.com/freecodecamp/guides/tree/master/src/pages/machine-learning/principles/index.md' target='_blank' rel='nofollow'>Help our community expand it</a>.
<a href='https://github.com/freecodecamp/guides/blob/master/README.md' target='_blank' rel='nofollow'>This quick style guide will help ensure your pull request gets accepted</a>.
<!-- The article goes here, in GitHub-flavored Markdown. Feel free to add YouTube videos, images, and CodePen/JSBin embeds -->
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->

View File

@ -0,0 +1,30 @@
---
title: Intuition Fails in High Dimensions
---
## Intuition Fails in High Dimensions
#### Imagine
A 2D plane with `X` and `Y` axes. On it you mark the points `(1,0)` and `(0,1)`, and through them you draw a straight line. Even without looking at the image below, one can get an idea of how the graph would look.
![X-Y plane with your imaginary line](https://ka-perseus-graphie.s3.amazonaws.com/466568bad0126c402380ff2ea57aad004f36172b.svg)
Now let's imagine a 3D space with `X`, `Y`, and `Z` axes. Through this 3D structure passes a plane that intersects the `X` axis at `(2, 0, 0)`, the `Y` axis at `(0, 3, 0)`, and the `Z` axis at `(0, 0, 6)`. Such a plane is tough to imagine in our heads, but if we try, we would end up with something that looks like this.
![X-Y-Z with our plane](http://tutorial.math.lamar.edu/Classes/CalcIII/SurfaceArea_files/image001.gif)
With that, we get to step into the next higher dimension. Planes in dimensions higher than `3` are referred to as hyperplanes. But first, let us address where the fourth axis even points. Let's call it the `W` axis. As in the previous cases where a new axis was created, this `W` axis would be perpendicular to the pre-existing axes (`X`, `Y`, and `Z`), just like `Z` was perpendicular to the `X` and `Y` axes.
![X-Y-Z-W Axes](http://eusebeia.dyndns.org/4d/vis/4d-axes.png)
> It is important to understand that the W-axis as depicted here is perpendicular to all of the other coordinate axes. We may be tempted to try to point in the direction of W, but this is impossible because we are confined to 3-dimensional space.
Because we live in a world that is 3-dimensional, it's difficult for us to comprehend a world that has more than 3 dimensions. This is why our intuition and imagination are of limited help in higher dimensions.
<!-- The article goes here, in GitHub-flavored Markdown. Feel free to add YouTube videos, images, and CodePen/JSBin embeds -->
#### More Information:
* <a href="http://eusebeia.dyndns.org/4d/vis/01-intro">4D Visualization and Why It Matters</a>
<!-- Please add any articles you think might be helpful to read before writing the article -->

View File

@ -0,0 +1,28 @@
---
title: Its Generalization That Counts
---
## Its Generalization That Counts
The power of machine learning comes from not having to hard code or explicitly
define the parameters that describe your training data and unseen data. This is
the essential goal of machine learning: to generalize a learner's findings.
To test a learner's generalizability, you'll want to have a separate test data
set that is not used in any way in training the learner. This can be created by
either splitting your entire data set into a training and a test set, or by
collecting more data. If the learner were to use data found in the test data
set, this would create a sort of bias in your learner, making it appear to do
better than it would in reality.
One method to get a sense of how your learner will do on a test data set is to
perform what is called **cross-validation**. This randomly splits your
training data into a given number of subsets (for example, ten subsets) and
leaves one subset out while the learner trains on the rest. Once the
learner has been trained, the left-out subset is used for testing. This
training, leaving one subset out, and testing is repeated as you rotate through
the subsets.
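A quick sketch of this with scikit-learn (the dataset and model here are illustrative choices):
```python
# Ten-fold cross-validation: train on 9 subsets, test on the held-out one, rotate.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=10)  # one score per held-out subset
print(scores.mean(), scores.std())
```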
#### More Information:
- <a href='https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf' target='_blank' rel='nofollow'>A Few Useful Things to Know about Machine Learning</a>
- <a href='https://stats.stackexchange.com/a/153058/132399' target='_blank' rel='nofollow'>"How do you use test data set after Cross-validation"</a>

View File

@ -0,0 +1,15 @@
---
title: Learn Many Models not Just One
---
## Learn Many Models not Just One
This is a stub. <a href='https://github.com/freecodecamp/guides/tree/master/src/pages/machine-learning/principles/learn-many-models-not-just-one/index.md' target='_blank' rel='nofollow'>Help our community expand it</a>.
<a href='https://github.com/freecodecamp/guides/blob/master/README.md' target='_blank' rel='nofollow'>This quick style guide will help ensure your pull request gets accepted</a>.
<!-- The article goes here, in GitHub-flavored Markdown. Feel free to add YouTube videos, images, and CodePen/JSBin embeds -->
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->

View File

@ -0,0 +1,31 @@
---
title: Learning Equals Representation Evaluation Optimization
---
## Learning Equals Representation Evaluation Optimization
The field of machine learning has exploded in recent years and researchers have
developed an enormous number of algorithms to choose from. Despite this great
variety of models to choose from, they can all be distilled into three
components.
The three components that make up a machine learning model are representation,
evaluation, and optimization. These three are most directly related to
supervised learning, but they can be related to unsupervised learning as well.
**Representation** - this describes how you want to look at your data.
Sometimes you may want to think of your data in terms of individuals (like in
k-nearest neighbors) or like in a graph (like in Bayesian networks).
**Evaluation** - for supervised learning purposes, you'll need to evaluate or
put a score on how well your learner is doing so it can improve. This
evaluation is done using an evaluation function (also known as an *objective
function* or *scoring function*). Examples include accuracy and squared error.
**Optimization** - using the evaluation function from above, you need to find
the learner with the best score from this evaluation function using a choice of
optimization technique. Examples are a greedy search and gradient descent.
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->
- <a href='https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf' target='_blank' rel='nofollow'>A Few Useful Things to Know about Machine Learning</a>

View File

@ -0,0 +1,15 @@
---
title: More Data Beats a Cleverer Algorithm
---
## More Data Beats a Cleverer Algorithm
This is a stub. <a href='https://github.com/freecodecamp/guides/tree/master/src/pages/machine-learning/principles/more-data-beats-a-cleverer-algorithm/index.md' target='_blank' rel='nofollow'>Help our community expand it</a>.
<a href='https://github.com/freecodecamp/guides/blob/master/README.md' target='_blank' rel='nofollow'>This quick style guide will help ensure your pull request gets accepted</a>.
<!-- The article goes here, in GitHub-flavored Markdown. Feel free to add YouTube videos, images, and CodePen/JSBin embeds -->
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->

View File

@ -0,0 +1,39 @@
---
title: Overfitting Has Many Faces
---
## Overfitting Has Many Faces
If a learning algorithm fits a given training set well, that does not by itself indicate a good hypothesis. Overfitting occurs when the hypothesis fits the training set too closely, having high variance and low error on the training set while having a high test error on any other data.
In other words, overfitting occurs when the error of the hypothesis, as measured on the data set used to train the parameters, happens to be lower than the error on any other data set.
### Choosing an Optimal Polynomial Degree
Choosing the right degree of polynomial for the hypothesis function is important in avoiding overfitting. This can be achieved by testing each degree of polynomial and observing the effect on the error result over various parts of the data set. Hence, we can break down our data set into 3 parts that can be used in optimizing the hypothesis' theta and polynomial degree.
A good break-down ratio of the data set is:
- Training set: 60%
- Cross validation: 20%
- Test set: 20%
The three error values can thus be calculated by the following method (a sketch in code follows the list):<sup>1</sup>
1. Use the training set for each polynomial degree in order to optimize the parameters in `Θ`
2. Use the cross validation set to find the polynomial degree with the lowest error
3. Use the test set to estimate the generalization error
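Here is a sketch of that selection procedure with scikit-learn; the synthetic data, the 60/20/20 split, and the range of degrees are all illustrative choices:
```python
# Pick a polynomial degree on the cross-validation set, then report test error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

# 60% training, 20% cross-validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_degree, best_err = None, np.inf
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)                              # step 1: fit theta
    err = mean_squared_error(y_cv, model.predict(X_cv))      # step 2: score on CV set
    if err < best_err:
        best_degree, best_err = degree, err

final = make_pipeline(PolynomialFeatures(best_degree), LinearRegression())
final.fit(X_train, y_train)
test_err = mean_squared_error(y_test, final.predict(X_test))  # step 3: generalization
print(best_degree, test_err)
```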
### Ways to Fix Overfitting
These are some of the ways to address overfitting:
1. Getting more training examples
2. Trying a smaller set of features
3. Increasing the regularization parameter `λ` (lambda)
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->
[Coursera's Machine Learning Course](https://www.coursera.org/learn/machine-learning)
### Sources
1. [Ng, Andrew. "Machine Learning". *Coursera* Accessed January 29, 2018](https://www.coursera.org/learn/machine-learning)

View File

@ -0,0 +1,15 @@
---
title: Representable Does not Imply Learnable
---
## Representable Does not Imply Learnable
This is a stub. <a href='https://github.com/freecodecamp/guides/tree/master/src/pages/machine-learning/principles/representable-does-not-imply-learnable/index.md' target='_blank' rel='nofollow'>Help our community expand it</a>.
<a href='https://github.com/freecodecamp/guides/blob/master/README.md' target='_blank' rel='nofollow'>This quick style guide will help ensure your pull request gets accepted</a>.
<!-- The article goes here, in GitHub-flavored Markdown. Feel free to add YouTube videos, images, and CodePen/JSBin embeds -->
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->

View File

@ -0,0 +1,15 @@
---
title: Simplicity Does not Imply Accuracy
---
## Simplicity Does not Imply Accuracy
This is a stub. <a href='https://github.com/freecodecamp/guides/tree/master/src/pages/machine-learning/principles/simplicity-does-not-imply-accuracy/index.md' target='_blank' rel='nofollow'>Help our community expand it</a>.
<a href='https://github.com/freecodecamp/guides/blob/master/README.md' target='_blank' rel='nofollow'>This quick style guide will help ensure your pull request gets accepted</a>.
<!-- The article goes here, in GitHub-flavored Markdown. Feel free to add YouTube videos, images, and CodePen/JSBin embeds -->
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->

View File

@ -0,0 +1,15 @@
---
title: Theoretical Guarantees Are not What They Seem
---
## Theoretical Guarantees Are not What They Seem
This is a stub. <a href='https://github.com/freecodecamp/guides/tree/master/src/pages/machine-learning/principles/theoretical-guarantees-are-not-what-they-seem/index.md' target='_blank' rel='nofollow'>Help our community expand it</a>.
<a href='https://github.com/freecodecamp/guides/blob/master/README.md' target='_blank' rel='nofollow'>This quick style guide will help ensure your pull request gets accepted</a>.
<!-- The article goes here, in GitHub-flavored Markdown. Feel free to add YouTube videos, images, and CodePen/JSBin embeds -->
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->

View File

@ -0,0 +1,55 @@
---
title: Random Forest
---
## Random Forest
A Random Forest is a group of decision trees that make better decisions as a whole than individually.
### Problem
Decision trees by themselves are prone to **overfitting**. This means that the tree becomes so used to the training data that it has difficulty making decisions for data it has never seen before.
### Solution with Random Forests
Random Forests belong to the category of **ensemble learning** algorithms. This class of algorithms uses many estimators to yield better results, which makes Random Forests usually **more accurate** than plain decision trees. In a Random Forest, a bunch of decision trees are created, and each tree is **trained on a random subset of the data and a random subset of the features of that data**. This way, the possibility of the estimators getting used to the data (overfitting) is greatly reduced, because **each of them works on different data and features** than the others. This method of creating a bunch of estimators and training them on random subsets of data is an *ensemble learning* technique called **bagging**, or *Bootstrap AGGregatING*. To get the prediction, each of the decision trees votes on the correct prediction (classification), or their results are averaged (regression). A minimal sketch is shown below.
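A minimal random forest sketch with scikit-learn, on a toy dataset:
```python
# Train a random forest classifier and score it on a held-out set.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample and a random subset of features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy on unseen data
```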
### Example of Boosting in Python
In the following Kaggle competition, we are given a list of collision events and their properties, and must predict whether a τ → 3μ decay happened in each collision. This τ → 3μ decay is currently assumed by scientists not to happen, and the goal of the competition was to discover it happening more frequently than scientists currently expect.
The challenge here was to design a machine learning problem for something no one has ever observed before. Scientists at CERN developed the following designs to achieve the goal.
https://www.kaggle.com/c/flavours-of-physics/data
```python
# Data cleaning
import pandas as pd
data_test = pd.read_csv("test.csv")
data_train = pd.read_csv("training.csv")
data_train = data_train.drop('min_ANNmuon', axis=1)
data_train = data_train.drop('production', axis=1)
data_train = data_train.drop('mass', axis=1)

# Cleaned data
Y = data_train['signal']
X = data_train.drop('signal', axis=1)

# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
seed = 9001  # this one's over 9000!!!
boosted_tree = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), algorithm="SAMME",
                                  n_estimators=50, random_state=seed)
model = boosted_tree.fit(X, Y)
predictions = model.predict(data_test)
print(predictions)
# Note: we can't really validate this data since we don't have an array of "right answers"

# Stochastic gradient boosting
from sklearn.ensemble import GradientBoostingClassifier
gradient_boosted_tree = GradientBoostingClassifier(n_estimators=50, random_state=seed)
model2 = gradient_boosted_tree.fit(X, Y)
predictions2 = model2.predict(data_test)
print(predictions2)
```
#### More Information:
- <a href='https://www.wikiwand.com/en/Random_forest' target='_blank' rel='nofollow'>Random Forests (Wikipedia)</a>
- <a href='https://www.analyticsvidhya.com/blog/2014/06/introduction-random-forest-simplified/' target='_blank' rel='nofollow'>Introduction to Random Forests - Simplified</a>
- <a href='https://www.youtube.com/watch?v=loNcrMjYh64' target='_blank' rel='nofollow'>How Random Forest algorithm works (Video)</a>

View File

@ -0,0 +1,31 @@
---
title: Reinforcement Learning
---
#### Suggested Reading:
<!-- Please add any articles you think might be helpful to read before writing the article -->
- [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html)
#### Reinforcement Learning
<!-- Please add your working draft below in GitHub-flavored Markdown -->
Reinforcement Learning is a field of Machine Learning in which an agent learns by being rewarded or punished for its actions. This enables gradual learning and simplifies training an agent for tasks where a proper error value cannot be defined.
Example:
A bot is given the task of playing Space Invaders. It tries to learn the game by interacting with it and, in return, receives a reward based on the points it scored at the end of the game. The greater the reward, the more likely the bot is to repeat similar gameplay. In this way, it learns how to play the game and performs as well as it can.
In industry, a robot can use deep reinforcement learning to pick a device from one box and put it in a container. Whether it succeeds or fails, it memorizes the object, gains knowledge, and trains itself to do the job with increasing speed and precision. Learning on its own like this is a form of reinforcement learning, provided the feedback is positive.
The best-known example, and one you will hear a lot about in this field, is AlphaGo, developed by Google. It used reinforcement learning to learn the patterns, rules, and semantics of the board game Go. AlphaGo defeated the top professional Go player Lee Sedol 4-1 in a five-game series, a landmark victory for a computer program over a player of that caliber. This was a huge win for AI and reignited interest in the field of reinforcement learning.
## List of Common Algorithms
- Q-Learning (a minimal sketch follows below)
- Temporal Difference (TD)
- Deep Adversarial Networks
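To make the first of these concrete, here is a minimal tabular Q-learning sketch. The 5-state chain environment and all hyperparameter values are invented purely for illustration.
```python
import numpy as np

# Tabular Q-learning on a hypothetical 5-state chain, invented for
# illustration: the agent moves left (0) or right (1), and only the
# rightmost state yields a reward.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration

for episode in range(200):
    state = 0
    while state != n_states - 1:
        # Epsilon-greedy action selection (random while the row is still all zeros)
        if np.random.rand() < epsilon or not Q[state].any():
            action = np.random.randint(n_actions)
        else:
            action = np.argmax(Q[state])
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: nudge Q(s, a) toward reward + gamma * max Q(s', .)
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print(Q)  # the learned action-values favor moving right
```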
## Use cases:
Some applications of reinforcement learning algorithms are computer-played board games (Chess, Go), robotic hands, and self-driving cars.
## More Information:
* [David Silver's RL course](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html)

View File

@ -0,0 +1,15 @@
---
title: Stochastic Process
---
## Stochastic Process
A Stochastic Process is a process that is non-deterministic. Note that it may follow a probability distribution rather than being completely random. Examples include the process of flipping a 2-sided _unfair_ coin again and again, or the direction a drop of water takes on a roughly flat surface (the non-determinism here arises from occasional roughness that makes the droplet's path hard to determine).
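As a minimal sketch, the unfair-coin process can be simulated in a few lines of Python (the 0.7 bias is an arbitrary choice for illustration):
```python
import random

# Simulate an unfair coin that lands heads with probability 0.7 (an
# arbitrary bias). Each run yields a different sequence: the process is
# non-deterministic, yet governed by a fixed probability distribution.
def flip_unfair_coin(n, p_heads=0.7):
    return ['H' if random.random() < p_heads else 'T' for _ in range(n)]

print(flip_unfair_coin(10))
```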
Stochastic processes are widely used as mathematical models of systems and phenomena that appear to vary in a random manner. They have applications in many disciplines including the natural sciences such as biology, chemistry and physics as well as technology and engineering fields such as image processing, signal processing, information theory, computer science, cryptography and telecommunications. Furthermore, seemingly random changes in financial markets have motivated the extensive use of stochastic processes in finance.
### More Information:
* <a href="https://en.wikipedia.org/wiki/Stochastic_process">Stochastic Process</a>

View File

@ -0,0 +1,38 @@
---
title: Supervised Learning
---
## Supervised Learning
In supervised learning, we know what the correct output should be. Supervised learning problems can be categorized into regression and classification. A regression problem is where you map input to a continuous output. A classification problem, in contrast, is where you map (group) inputs into discrete categories.
### Regression
Given data about used cars, such as their mileage, you can predict their market prices. Since price is a continuous variable, this is a regression problem. In another example, Microsoft released a web app that predicts age from a picture. Again, since age is continuous rather than discrete or categorical, this is also a regression problem.
### Classification
The regression problems above can be turned into classification problems. Suppose you want to look for a used car under X dollars; then the output would be whether a given car fits within the price limit you set. Similarly, age prediction becomes a classification problem if we predict whether a submitted picture belongs to someone under 18, who therefore should not be allowed to buy cigarettes.
<!-- Discussion points:
- What is special about supervised learning?
- In what scenarios would you use it?
- Caveats or traps to think about?
- What are some example models?
-->
#### Example 1:
> Given data about the size of houses on the real estate market, try to predict their price.
Price as a function of size is a continuous output, so this is a regression problem.
#### Example 2:
(a) Regression - for continuous response values. For example, given a picture of a person, predict their age from the picture.
(b) Classification - for categorical response values, where the data can be separated into specific “classes”. For example, given a patient with a tumor, predict whether the tumor is malignant or benign. A minimal sketch of this case follows.
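Here is a hedged sketch of example (b), assuming scikit-learn is installed and using its bundled breast cancer dataset; the choice of logistic regression as the model is arbitrary for this illustration.
```python
# A minimal supervised classification sketch: scikit-learn is assumed,
# and logistic regression is an arbitrary choice of model.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labeled data: tumor features plus the known correct output (malignant/benign).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)            # learn from input-output pairs
print(clf.score(X_test, y_test))     # accuracy on unseen examples
```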
#### Suggested Reading:
- https://en.wikipedia.org/wiki/Supervised_learning
- https://stackoverflow.com/a/1854449/6873133

View File

@ -0,0 +1,175 @@
---
title: Support Vector Machine
---
## Support Vector Machine
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. It does this by maximizing the margin between the hyperplane and the data points nearest to it.
![SVM vs logistic regression](https://i.imgur.com/KUeOSK3.png)
An SVM cost function seeks to approximate the logistic function with a piecewise-linear one. This ML algorithm is used for classification problems and belongs to the subset of supervised learning algorithms.
### The Cost Function
![SVM Cost Function](https://i.imgur.com/SOhv2jZ.png)
The cost function is used to train the SVM. By minimizing the value of J(theta), we ensure that the SVM is as accurate as possible. In the equation, the functions cost1 and cost0 refer to the cost for an example where y=1 and the cost for an example where y=0, respectively. For SVMs, cost is determined by kernel (similarity) functions.
### Kernels
- Polynomial features can be computationally expensive and may slow down training on large datasets.
- Rather than adding more polynomial features, add "landmarks" against which you test the proximity of other datapoints.
- Each member of the training set is a landmark.
- A kernel is the "similarity function" that measures how close an input is to a certain landmark.
### Large Margin Classifier
An SVM will find the line (or hyperplane in the more general case) that splits the data with the largest margin.
While outliers may sway the line in one direction, a small enough C value will enforce regularization.
This regularization works the same way as 1/λ in linear and logistic regression, but here it scales the cost component.
#### More Information:
[Andrew Ng's ML Course](https://www.coursera.org/learn/machine-learning/)
[Standalone Video Lecture](https://www.youtube.com/watch?v=1NxnPkZM9bc)
[SVM on Wikipedia](https://en.wikipedia.org/wiki/Support_vector_machine)
The following code trains an SVM, makes predictions, and computes accuracy in Python. It is implemented with NumPy; the same workflow can be written with scikit-learn in just a few function calls (see the sketch after the code).
```Python
import numpy as np
class Svm(object):
    """ Svm classifier """

    def __init__(self, inputDim, outputDim):
        self.W = None
        # Generate a random svm weight matrix to compute loss,
        # with standard normal distribution and standard deviation = 0.01.
        sigma = 0.01
        self.W = sigma * np.random.randn(inputDim, outputDim)

    def calLoss(self, x, y, reg):
        """
        Svm loss function
        D: Input dimension.
        C: Number of Classes.
        N: Number of example.

        Inputs:
        - x: A numpy array of shape (batchSize, D).
        - y: A numpy array of shape (N,) where value < C.
        - reg: (float) regularization strength.

        Returns a tuple of:
        - loss as single float.
        - gradient with respect to weights self.W (dW) with the same shape of self.W.
        """
        loss = 0.0
        dW = np.zeros_like(self.W)
        # - Compute the svm loss and store it in the loss variable.
        # - Compute the gradient and store it in the dW variable.
        # - Use L2 regularization.

        # Calculate the score matrix
        s = x.dot(self.W)
        # Score of the correct class yi for each example
        s_yi = s[np.arange(x.shape[0]), y]
        # Find the margins (delta)
        delta = s - s_yi[:, np.newaxis] + 1
        # Hinge loss for each sample
        loss_i = np.maximum(0, delta)
        loss_i[np.arange(x.shape[0]), y] = 0
        loss = np.sum(loss_i) / x.shape[0]
        # Loss with regularization
        loss += reg * np.sum(self.W * self.W)
        # Calculate ds, the gradient of the loss with respect to the scores
        ds = np.zeros_like(delta)
        ds[delta > 0] = 1
        ds[np.arange(x.shape[0]), y] = 0
        ds[np.arange(x.shape[0]), y] = -np.sum(ds, axis=1)
        dW = (1 / x.shape[0]) * (x.T).dot(ds)
        dW = dW + (2 * reg * self.W)

        return loss, dW

    def train(self, x, y, lr=1e-3, reg=1e-5, iter=100, batchSize=200, verbose=False):
        """
        Train this Svm classifier using stochastic gradient descent.
        D: Input dimension.
        C: Number of Classes.
        N: Number of example.

        Inputs:
        - x: training data of shape (N, D)
        - y: output data of shape (N, ) where value < C
        - lr: (float) learning rate for optimization.
        - reg: (float) regularization strength.
        - iter: (integer) total number of iterations.
        - batchSize: (integer) number of example in each batch running.
        - verbose: (boolean) Print log of loss and training accuracy.

        Outputs:
        A list containing the value of the loss at each training iteration.
        """
        # Run stochastic gradient descent to optimize W.
        lossHistory = []
        for i in range(iter):
            # - Sample batchSize examples from the training data into xBatch and yBatch.
            # - After sampling, xBatch has shape (batchSize, D) and yBatch (batchSize, ).
            # - Use that sample for one gradient descent step,
            #   updating the weights with the gradient and the learning rate.
            num_train = np.random.choice(x.shape[0], batchSize)
            xBatch = x[num_train]
            yBatch = y[num_train]
            loss, dW = self.calLoss(xBatch, yBatch, reg)
            self.W = self.W - lr * dW
            lossHistory.append(loss)
            # Print loss every 100 iterations
            if verbose and i % 100 == 0 and len(lossHistory) > 0:
                print('Loop {0} loss {1}'.format(i, lossHistory[i]))

        return lossHistory

    def predict(self, x):
        """
        Predict the y output.

        Inputs:
        - x: training data of shape (N, D)

        Returns:
        - yPred: output data of shape (N, ) where value < C
        """
        # Store the predicted output in yPred
        s = x.dot(self.W)
        yPred = np.argmax(s, axis=1)
        return yPred

    def calAccuracy(self, x, y):
        # Calculate the accuracy of the predictions as a percentage
        yPred = self.predict(x)
        acc = np.mean(y == yPred) * 100
        return acc
```
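As noted before the code, scikit-learn reduces this workflow to a few calls. Here is a minimal sketch, assuming scikit-learn is installed; the dataset and parameter values are arbitrary illustrative choices.
```python
# A minimal SVM sketch with scikit-learn (assumed installed).
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = svm.SVC(kernel='rbf', C=1.0)  # C plays the regularization role discussed above
clf.fit(X_train, y_train)           # train
print(clf.predict(X_test[:5]))      # predict
print(clf.score(X_test, y_test))    # accuracy
```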
#### More Information:
<!-- Please add any articles you think might be helpful to read before writing the article -->
<a href='http://scikit-learn.org/stable/modules/svm.html' target='_blank' rel='nofollow'>Scikit-learn SVM</a>

View File

@ -0,0 +1,15 @@
---
title: Machine Learning using Tensorflow
---
#### What is TensorFlow?
"TensorFlow is an open-source machine learning library for research and production. TensorFlow offers APIs for beginners and experts to develop for desktop, mobile, web, and cloud."
TensorFlow lets users create dataflow graphs: structures that describe how data moves through a graph of connected processing nodes. Each node represents a mathematical operation, and each connection between nodes represents a multidimensional data array, or tensor.
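Here is a minimal sketch of such a dataflow graph, assuming the TensorFlow 1.x API that was current when this guide was written:
```python
import tensorflow as tf  # assumes the TensorFlow 1.x API

# Build a tiny dataflow graph: two constant tensors feeding an add node.
a = tf.constant(3.0, name='a')
b = tf.constant(4.0, name='b')
total = tf.add(a, b, name='total')

# Nothing is computed until the graph is run inside a session.
with tf.Session() as sess:
    print(sess.run(total))  # 7.0
```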
TensorFlow allows developers to concentrate on the logic of the application rather than getting stuck in complex algorithms or figuring out the most optimal way to implement them. When creating a deep network, it can also become tedious to determine which node needs to connect to which; TensorFlow makes it easy to wire up the layers. Developers can thus concentrate on making the application better.
#### More Information:
* [TensorFlow](https://www.tensorflow.org)
* [TensorFlow GitHub Repository](https://github.com/tensorflow)
* [Wikipedia—TensorFlow](https://en.wikipedia.org/wiki/TensorFlow)

View File

@ -0,0 +1,39 @@
---
title: Unsupervised Learning
---
#### Suggested Reading:
<!-- Please add any articles you think might be helpful to read before writing the article -->
- https://en.wikipedia.org/wiki/Unsupervised_learning
- https://stackoverflow.com/a/1854449/6873133
- http://mlg.eng.cam.ac.uk/zoubin/papers/ul.pdf
#### Draft of Article:
<!-- Please add your working draft below in GitHub-flavored Markdown -->
<!--
Discussion points:
- Unsupervised learning doesn't have a correct answer i.e. you can't be more or less accurate in the output
- Learn "hidden" structure in data
- Clustering is classical example
- Group like things together
- Example use case: movie database with people's preferences, you want to cluster and see different types of people
- Example use case: grouping documents or articles of similar content
-->
#### What is Unsupervised Learning?
Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.
Types:
**Clustering:** Used for exploratory data analysis to find hidden patterns or groupings in data. For example, take a collection of 1,000,000 different genes and find a way to automatically group them into sets that are somehow similar or related by different variables, such as lifespan, location, and role.
Approaches to unsupervised learning include:
- Clustering: k-means, mixture models, hierarchical clustering
- Anomaly detection
- Neural networks: Hebbian learning, generative adversarial networks
- Approaches for learning latent variable models, such as the expectation-maximization (EM) algorithm and the method of moments
A few more examples:
Suppose you have data for an e-commerce site: a list of people and the things they ordered online last week. You can use clustering algorithms to find patterns in the data, predict buying trends, and formulate business strategy around those trends, as in the sketch below.
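Here is a minimal clustering sketch for this kind of scenario, assuming scikit-learn is available; the purchase counts below are invented purely for illustration.
```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data, invented for illustration: rows are customers,
# columns are last week's purchase counts in three product categories.
purchases = np.array([
    [5, 0, 1],
    [4, 1, 0],
    [0, 6, 1],
    [1, 5, 0],
    [0, 1, 7],
    [1, 0, 6],
])

# No labels are given; k-means discovers the grouping on its own.
kmeans = KMeans(n_clusters=3, random_state=0).fit(purchases)
print(kmeans.labels_)  # cluster assignment for each customer
```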

View File

@ -0,0 +1,12 @@
---
title: YOLO
---
## YOLO - You Only Look Once method for real-time object detection
YOLO is a combined classification and detection framework capable of making predictions in real time, on par with state-of-the-art detection frameworks.
#### More Information:
- YOLO [paper](https://arxiv.org/abs/1506.02640)
- Author's [website](https://pjreddie.com/darknet/yolo/)
- Demo [video](https://www.youtube.com/watch?v=VOC3huqHrss)