K Means Clustering

Clustering is a technique that partitions a data set into groups of similar observations.  The resultant clusters are homogeneous within each group and heterogeneous between groups.

These two world maps were created using K-means clustering. Each colour represents a specific cluster. The top map was created using 2 clusters and the bottom map was created using 5 clusters. The data set contained 9 different attributes. These included data such as GDP growth, gender equality, number of years of school and median age.

Unsupervised Learning

Clustering is an example of unsupervised learning, where there is no specific outcome attached to each observation.  The algorithm simply seeks to find patterns of similarity among the observations based on their attributes.

Euclidean Distance

The K-means algorithm typically uses Euclidean distance as a measure of similarity and finds k centroids such that the total squared deviation of the observations from their assigned centroids is minimised.  Therefore, the algorithm attempts to minimise the following cost:

\text{cost} = \frac{1}{m}\sum\limits_{i = 1}^{m} \|x_i - \mu_{c_i}\|^2

In the equation above, \mu_{c_i} denotes the centroid of the cluster to which observation x_i is assigned and m is the number of observations.  It follows that the total cost (deviation from the centroids) can be reduced simply by adding clusters.  At the limit, where the number of clusters equals the number of observations, the deviation from the centroids will be zero.
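
As a rough sketch of this cost in R (using a toy matrix rather than the world data set; the names x and fit are illustrative, not from the original post), the quantity above can be computed directly from a fitted model, and kmeans() reports the unaveraged version as tot.withinss:

# toy data: 100 observations with 2 features
set.seed(1)
x <- matrix(rnorm(100 * 2), ncol = 2)
fit <- kmeans(x, centers = 3, nstart = 20)
# squared Euclidean deviation of each observation from its assigned centroid
sq.dev <- rowSums((x - fit$centers[fit$cluster, ])^2)
cost <- mean(sq.dev)
# kmeans() reports the same quantity, before averaging, as tot.withinss
all.equal(cost * nrow(x), fit$tot.withinss)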

Cohesiveness

In the map above, the blue cluster (i.e. Western Europe, the United States, etc.) has the smallest average deviation from its cluster centroid, which implies that this cluster is the most cohesive.
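
As a quick sketch (reusing the illustrative fit object from above), the relative cohesiveness of each cluster can be compared via the average squared deviation within each cluster:

# average squared deviation from the centroid, per cluster;
# the smallest value marks the most cohesive cluster
fit$withinss / fit$size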

Interpreting Clusters

The K-means algorithm results in the data set being partitioned into K clusters, but the cluster number assigned to each observation has no ordinal meaning.  This means that cluster number 1 is not necessarily better or worse than cluster number 5.  However, the algorithm will return centroids for each cluster and, using some judgment and knowledge of the data set, the analyst can then assign some meaning to each respective cluster.
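
In R, the centroids are returned in the centers component of the fitted object. A small sketch, again using the illustrative fit from above: assuming scaled inputs (as in the implementation section below), each centre is expressed in standard-deviation units, which makes the clusters easier to label:

# each row is a cluster centroid; with scaled data, a value of +1 for an
# attribute means the cluster sits one standard deviation above the mean
round(fit$centers, 2)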

This image focuses on Europe only. The map on the left has 2 clusters, while the map on the right has 5 clusters. Although latitude / longitude information was not used, the clusters seem to represent a north / south, east / west split. It's difficult to see Switzerland but, interestingly, in the right diagram it is a member of the purple cluster.

Benefits of Clustering

Breaking a large data set into more homogeneous groups has uses in marketing, where a large number of customers are grouped into similar segments and slightly different products and / or advertising messages are created for each segment.  Clustering is also used in disease classification, data compression and outlier / anomaly / fraud detection.

How Many Clusters?

When using K-means clustering, the number of clusters is pre-determined before running the algorithm.  If the analyst decides that the data set is best partitioned into 3 clusters, then the algorithm will return 3 clusters.  Sometimes the economics of a situation will determine the number of clusters. For example, if a dress maker can only afford to manufacture 4 different dress sizes, then this puts a ceiling on the number of clusters.

Using the Cost Function

The total cost (see equation above) will decrease as more clusters are added.  But sometimes this cost shows a pronounced reduction up to a particular number of clusters before flattening out, producing an "elbow" in the curve.  The graph below shows the reduction in variance as more clusters are added to the world data set.

As the number of clusters increases, the total variation decreases. In the graph above, the steepness of the curve decreases after the first few clusters; however, there is no radical change in slope between any specific pair of cluster counts.
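
A sketch of how such a curve might be produced in R, assuming the scaled data.Mat matrix from the implementation section below:

# total within-cluster sum of squares for k = 1 to 10
wss <- sapply(1:10, function(k)
  kmeans(data.Mat, centers = k, nstart = 50)$tot.withinss)
# plot the cost against the number of clusters
plot(1:10, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")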

Implementation Using R

The kmeans() function takes a matrix of observations x features as its input.  This input should first be scaled to zero mean and unit variance to ensure that the units of measurement are standardized.  The number of clusters is provided as the second argument and, finally, the number of random starts is specified using “nstart”.

# normalise the data to zero mean, unit variance
data.Mat <- scale(data.Mat)
# run k-means with 5 clusters and 100 random starts
kmOutput <- kmeans(data.Mat, 5, nstart = 100)
# output the cluster numbers for items 40..45
kmOutput$cluster[40:45]
ARG HRV BHR BHS URY KWT
  3   5   1   4   5   1

Non-Deterministic Algorithm

The K-means algorithm is non-deterministic, which means that it can provide a different solution each time it is run, because the initial centroids are chosen at random. To ensure consistent results, the algorithm is typically run many times (100 in the example above) and the solution that provides the lowest cost is kept as the best solution. This is controlled by the nstart argument shown above.
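
A sketch of this behaviour, again assuming the scaled data.Mat from above: single random starts can land in different local minima, while a large nstart keeps the best of many runs:

# two single-start runs may converge to different local minima...
set.seed(2)
run1 <- kmeans(data.Mat, 5, nstart = 1)$tot.withinss
set.seed(3)
run2 <- kmeans(data.Mat, 5, nstart = 1)$tot.withinss
# ...whereas nstart = 100 keeps the lowest-cost solution of 100 runs
best <- kmeans(data.Mat, 5, nstart = 100)$tot.withinss
c(run1, run2, best)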

GitHub Code and Data

The code and data from the above examples are available on GitHub here.
