Clustering is a technique that partitions the data set into groups of similar observations. The resultant clusters are more homogenous within each group and heterogeneous between each group.
Clustering is an example of unsupervised learning where there is no specific outcome attached to each observation. The algorithm simply seeks to derive patterns of similarity between the observations and their respective attributes.
The K-means algorithm typically uses Euclidean distance as a measure of similarity and finds k centroids where the total deviation from this centroid is minimised. Therefore, the algorithm attempts to minimise the following cost:
The equation above means that the total cost (deviation from the centroid) can be reduced simply by adding clusters. At the limit, where the (number of clusters) = (number of observations), the deviation from the centroids will be zero.
In the map above, the blue clusters (i.e. Western Europe and the United States…etc) have the smallest average deviation from their cluster centroid which implies that this cluster has the most cohesiveness.
The K-means algorithm results in the data set being partitioned into K-clusters but the cluster number assigned to each observation has no ordinal meaning. This means that cluster number 1 is not necessarily better or worse than cluster number 5. However, the algorithm will return centroids for each cluster and using some judgment and knowledge of the data set the analyst can then assign some meaning each respective cluster.
Benefits of Clustering
Breaking a large dataset into more homogenous groups has uses in marketing where a large number of customers are grouped into similar segments and slightly different products and / or advertising messages and created for each segment. Clustering is also used in disease classification, data compression and outlier / anomaly / fraud detection.
How many clusters
When using K-means clustering, the number of clusters is pre-determined before running the algorithm. If the analyst decides that the data set is best partitioned into 3 clusters, then the algorithm will return 3 clusters. Sometimes the economics of a situation will determine the number of clusters. For example, if a dress maker can only afford to manufacture 4 different dress sizes, then this puts a ceiling on the number of clusters.
Using the cost function
The total cost (see equation above) will decrease as more clusters are added. But sometime this cost may show a pronounced reduction between two different cluster sizes before flattening. The graph below shows the reduction in variance when more clusters are added to the world data set.
Implementation using R
The kmeans() function is implemented by supplying a matrix of observations x features as an input. This input should be first scaled to mean zero, unit variance to ensure that the units of measurement are standardized. The number of clusters are provided as a second argument and finally the number of iterations is specified using “nstart”.
# normalise the data data.Mat <- scale(data.Mat) # run k-mans with 5 clusters and 100 iterations kmOutput <- kmeans(data.Mat, 5, nstart = 100) #output cluster numbers for items 40..45 kmOutput$cluster[40:45] ARG HRV BHR BHS URY KWT 3 5 1 4 5 1
Non Deterministic Algorithm
The K-means algorithm is non-deterministic which means that it can provide a different solution each time that it is run. So to ensure some consistent results, the algorithm is typically run about 50 times and the solution that provides the lowest cost is used as the best solution. This is shown above by the second parameter.
Github Code and Data
The code and data from the above examples is available on Github here.