You can edit almost every page by Creating an account and confirming your email.

K-modes clustering

From EverybodyWiki Bios & Wiki

k-modes clustering is a modified version of the standard k-means clustering process optimized to cluster categorical data. It does so by using the simple matching dissimilarity measure, also referred to as the Hamming distance, instead of the Euclidean distance to calculate the distance between two objects. Furthermore, it uses modes instead of means to represent the cluster centroids.

Reasoning

While the k-means algorithm is a very popular choice when clustering numerical data, it performs poorly when applied to categorical data. The reason is that in order to cluster categorical data, the categorical values first have to be transformed into numerical values, which distorts the clustering due to the usage of the Euclidean distance that leads the k-means algorithm to consider two distant values as close, simply based on the proximity of their numerical representations. One solution to this problem, although not suitable for high dimensional categorical data, is to convert the different categories to dummy binary variables and assign a 1 if the categorical value is present and a 0 otherwise. A solution that works even with high dimensional categorical data is k-modes clustering.

Algorithm

Simple Matching Dissimilarity Measure

The simple matching dissimilarity measure, also referred to as the Hamming distance, calculates the distance between two objects X and Y defined by m categorical attributes with

d(X,Y)=i=1mδ(xi,yi)

where

δ(xi,yi)={0,xi=yi1,xiyi

and xi and yi are the categorical values of attribute i in X and Y. The more mismatches there are between X and Y, the bigger the Hamming distance and thus the bigger the dissimilarity between the two objects. Applying this distance function to k-modes clustering, one obtains a new function that measures the distance and therefore the dissimilarity between an object X and a cluster centroid Z1 defined as

δ(xi,zi)={1n1rn1,xi=zi1,xizi

where zi is the categorical value of attribute i in Z1, n1 is the number of objects in cluster 1 and nir is the number of objects in cluster 1 whose attribute value is r. Hence, if X and Z1 have the same categorical value, meaning xi=zi, the distance between the two depends on the number of objects in cluster 1 containing this value.

Modes as Cluster Centroids

A cluster centroid in k-modes clustering is represented by the vector of modes of categorical attributes. The mode is a statistical term that stands for the most frequently occurring value inside a set of values. For example, the mode of a given set {3, 4, 6, 4, 9, 4} is the value 4. Using the vector of modes to represent the cluster centroid minimizes the sum of distances between each object and the centre of the corresponding cluster. It minimizes their dissimilarities.

Clustering Procedure

The first step when clustering a categorical data set into k clusters using k-modes is to transform the categorical values into numerical values or dummy binary variables. The next step is to randomly select k different objects out of the data set as the initial cluster centroids. Thirdly, we have to calculate the distance between each object and centroid and assign each object to the cluster containing its closest centroid. Lastly, we select the new mode of each cluster centroid and compare it with the previous one. If the two modes differ, we repeat the last two steps. Thus, the modes get updated with each iteration of the k-modes clustering process.

References

[1]


This article "K-modes clustering" is from Wikipedia. The list of its authors can be seen in its historical and/or the page Edithistory:K-modes clustering. Articles copied from Draft Namespace on Wikipedia could be seen on the Draft Namespace of Wikipedia and not main one.

  1. Huang, Joshua Z. "Clustering Categorical Data with k-Modes". Encyclopedia of Data Warehousing and Mining, Hershey PA: Information Science Reference, 2008, p.246-250, ISBN 978-1-60566-010-3 Search this book on .