fix(guide): update latest copy from guide repo

This commit is contained in:
Mrugesh Mohapatra
2018-10-12 04:30:38 +05:30
committed by mrugesh mohapatra
parent 73a97354e1
commit 7a860204af
833 changed files with 21403 additions and 3264 deletions

---
title: Clustering Algorithms
---
# Clustering Algorithms
Clustering is the process of dividing data into separate groups (clusters) while ensuring that:
- Each cluster contains similar objects
- Objects in different clusters are dissimilar
Clustering algorithms help find structure in a collection of unlabelled data and fall in the category of unsupervised learning.
The difficulty lies in defining a similarity measure that separates the data in the way you want. For instance, a group of people can be separated by gender, hair color, weight, race, etc.
Clustering algorithms have the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis. It's used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
Some applications of clustering algorithms include:
* Grouping consumers according to their purchase patterns
* Grouping photos of animals of the same kind together
* Classification of different types of documents
## Types of Clustering Algorithms:
1. Connectivity-based clustering (hierarchical clustering)
2. Centroid-based or point assignment clustering (k-means clustering)
3. Distribution-based clustering
4. Density-based clustering
## Examples
Some examples of clustering algorithms are:
1. Agglomerative clustering
2. K-means clustering
3. K-medoids clustering
4. Partition clustering
### Hierarchical Clustering
There are clustering methods that use only the similarities between instances, without any other requirement on the data; the aim is to find groups such that instances within a group are more similar to each other than to instances in different groups. This is the approach taken by hierarchical clustering.
It requires a similarity measure (or, equivalently, a distance measure) defined between instances. Euclidean distance is generally used, in which case one has to make sure that all attributes have the same scale.
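As a sketch of the idea, hierarchical clustering can be run with SciPy's `scipy.cluster.hierarchy` module (the toy data below is hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups in 2-D
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Build the hierarchy bottom-up using Euclidean distance;
# average linkage merges the two closest clusters at each step
tree = linkage(points, method='average', metric='euclidean')

# Cut the tree into two flat clusters
labels = fcluster(tree, t=2, criterion='maxclust')
print(labels)
```

Cutting the tree at different levels yields coarser or finer clusterings, which is why hierarchical methods do not require fixing the number of clusters up front.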
### Point Assignment
This method maintains a set of clusters and assigns each point to its nearest cluster.
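A minimal NumPy sketch of the point-assignment step (the centroids and points below are hypothetical):

```python
import numpy as np

# Two fixed cluster centres (hypothetical)
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
points = np.array([[1.0, 2.0], [9.0, 11.0], [0.5, -0.5]])

# Distance from every point to every centroid: shape (n_points, n_centroids)
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point is assigned to the index of its nearest centroid
assignments = distances.argmin(axis=1)
print(assignments)  # [0 1 0]
```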
## Specific Clustering Algorithms
### K-Means Clustering
The k-means algorithm is a popular clustering algorithm because it is relatively simple and fast compared to other clustering algorithms. The algorithm proceeds as follows:
1. Choose the number of clusters, k
2. Pick k random points as the initial centroids
3. Assign each data point to its nearest centroid
4. Once all data points have been classified, compute the mean of all points in each cluster and assign it as the cluster's new centroid
5. Repeat steps 3 and 4 until the k centroids converge
Since each iteration only requires computing k × n distances (rather than the n(n-1)/2 pairwise distances needed by agglomerative clustering), the algorithm is quite scalable.
Here's a clustering example in Python that uses the [Iris Dataset](https://www.kaggle.com/uciml/iris)
```python
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load the Iris dataset (Iris.csv from the Kaggle link above)
iris = pd.read_csv('Iris.csv')
x = iris[['PetalLengthCm', 'PetalWidthCm']].values

# Fit k-means with k = 3, since the dataset contains three species
kmeans = KMeans(n_clusters=3).fit(x)

# Plot the points coloured by cluster, with the centroids marked
plt.scatter(x[:, 0], x[:, 1], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='x', color='red')
plt.xlabel('Petal length (cm)')
plt.ylabel('Petal width (cm)')
plt.show()
```
Since the data points usually belong to a high-dimensional space, the similarity measure is often defined as a distance between two vectors (Euclidean, Manhattan, cosine, Mahalanobis, etc.).
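These distances are available off the shelf; a minimal sketch using `scipy.spatial.distance` (the two vectors below are hypothetical):

```python
from scipy.spatial.distance import euclidean, cityblock, cosine

a = [1.0, 0.0, 2.0]
b = [0.0, 1.0, 2.0]

print(euclidean(a, b))  # sqrt(1 + 1 + 0), about 1.414
print(cityblock(a, b))  # |1| + |1| + |0| = 2  (Manhattan distance)
print(cosine(a, b))     # 1 - cos(angle between a and b) = 0.2 here
```

Which distance is appropriate depends on the data: Euclidean assumes comparable scales, Manhattan is less sensitive to outliers, and cosine ignores vector magnitude entirely.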
### Mixture Density
We can write *mixture density* as:
![mixture density](https://latex.codecogs.com/gif.latex?p%28x%29%20%3D%20%5Csum_%7Bi%3D1%7D%5E%7Bk%7Dp%28x%7CG_%7Bi%7D%29p%28G_%7Bi%7D%29)
where the Gi are the mixture components, also called groups or clusters; p(x|Gi) are the component densities and p(Gi) are the mixture proportions. The number of components, k, is a hyperparameter and must be specified beforehand.
### Expectation-Maximization (EM)
This approach is probabilistic: we look for the component density parameters that maximize the likelihood of the sample.
The EM algorithm is an efficient iterative procedure to compute the Maximum Likelihood (ML) estimate in the presence of missing or hidden data. In ML estimation, we wish to estimate the model parameter(s) for which the observed data are the most likely.
Each iteration of the EM algorithm consists of two processes: The E-step, and the M-step.
1. In the expectation, or E-step, the missing data are estimated given the observed data and current estimate of the model parameters. This is achieved using the conditional expectation, explaining the choice of terminology.
2. In the M-step, the likelihood function is maximized under the assumption that the missing data are known. The estimates of the missing data from the E-step are used in lieu of the actual missing data.
Convergence is assured since the algorithm is guaranteed to increase the likelihood at each iteration.
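In practice, libraries run this E-step/M-step loop for you; a small sketch using scikit-learn's `GaussianMixture`, which fits a Gaussian mixture by EM (synthetic data, assumed k = 2):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 1-D sample drawn from two well-separated Gaussians
data = np.concatenate([rng.normal(0, 1, 200),
                       rng.normal(6, 1, 200)]).reshape(-1, 1)

# Fit k = 2 components; each EM iteration alternates an E-step and an M-step
gm = GaussianMixture(n_components=2, random_state=0).fit(data)

print(sorted(gm.means_.ravel()))  # roughly [0, 6]
print(gm.weights_)                # mixture proportions, roughly [0.5, 0.5]
```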
## More Information:
* [Wikipedia Cluster Analysis article](https://en.wikipedia.org/wiki/Cluster_analysis)
* [Introduction to Clustering and related algorithms](https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/)
* [Clustering using scikit-learn](http://fromdatawithlove.thegovans.us/2013/05/clustering-using-scikit-learn.html)
* [Clustering Algorithms-Stanford University Slides](https://web.stanford.edu/class/cs345a/slides/12-clustering.pdf)
* [Clustering Algorithms: From Start To State Of The Art](https://www.toptal.com/machine-learning/clustering-algorithms)
* [Cluster Analysis: Basic Concepts and Algorithms](https://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf)
* [K-means Clustering](https://www.datascience.com/blog/k-means-clustering)
* [Expectation-Maximization Algorithm](https://www.cs.utah.edu/~piyush/teaching/EM_algorithm.pdf)
* [Using K-Means Clustering with Python](https://code.likeagirl.io/finding-dominant-colour-on-an-image-b4e075f98097)