# K-modes clustering

** k-modes clustering** is a modified version of the standard k-means clustering process optimized to cluster categorical data. It does so by using the simple matching dissimilarity measure also referred to as the Hamming distance instead of the Euclidean distance to calculate the distance between two objects. Furthermore, it uses modes instead of means to represent the cluster centroids.

## Contents

## Reasoning[edit]

While the *k*-means algorithm is a very popular choice when clustering numerical data, it performs poorly when applied to categorical data. The reason is that in order to cluster categorical data, the categorical values first have to be transformed into numerical values which distorts the clustering due to the usage of the Euclidean distance that leads the *k*-means algorithm to consider two distant values as close, simply based on the proximity of their numerical representations. One solution to this problem, although not suitable for high dimensional categorical data, is to convert the different categories to dummy binary variables and assign a 1 if the categorical value is present and a 0 else. A solution that works even with high dimensional categorical data is *k*-modes clustering.

## Algorithm[edit]

### Simple Matching Dissimilarity Measure[edit]

The simple matching dissimilarity measure also referred to as the Hamming distance calculates the distance between two objects *X* and *Y* defined by *m* categorical attributes with

where

and * and ** are the categorical values of attribute **i* in *X* and *Y*. The more mismatches there are between *X* and *Y*, the bigger the Hamming distance and thus the bigger the dissimilarity between the two objects. Applying this distance function to *k*-modes clustering, one obtains a new function that measures the distance and therefore the dissimilarity between an object *X* and a cluster centroid * defined as
*

where * is the categorical value of attribute **i* in *, ** is the number of objects in cluster 1 and ** is the number of objects in cluster 1 whose attribute value is **r*. Hence, if *X* and * have the same categorical value, meaning **, the distance between the two depends on the number of objects in cluster 1 containing this value.
*

### Modes as Cluster Centroids[edit]

A cluster centroid in *k*-modes clustering is represented by the vector of modes of categorical attributes. The mode is a statistical term that stands for the most frequent occurring value inside a set of values. For example, the mode of a given set {3, 4, 6, 4, 9, 4} is the value 4. Using the vector of modes to represent the cluster centroid minimizes the sum of distances between each object and the centre of the corresponding cluster. It minimizes their dissimilarities.

### Clustering Procedure[edit]

The first step when clustering a categorical data set into *k* clusters using *k*-modes is to transform the categorical values into numerical values or dummy binary variables. The next step is to randomly select *k* different objects out of the data set as the initial cluster centroids. Thirdly, we have to calculate the distance between each object and centroid and assign each object to the cluster containing its closest centroid. Lastly, we select the new mode of each cluster centroid and compare it with the previous one. If the two modes differ, we repeat the last two steps. Thus, the modes get updated with each iteration of the *k*-modes clustering process.

## References[edit]

^{[1]}

This article "K-modes clustering" is from Wikipedia. The list of its authors can be seen in its historical and/or the page Edithistory:K-modes clustering. Articles copied from Draft Namespace on Wikipedia could be seen on the Draft Namespace of Wikipedia and not main one.

- ↑ Huang, Joshua Z. "Clustering Categorical Data with k-Modes".
*Encyclopedia of Data Warehousing and Mining*, Hershey PA: Information Science Reference, 2008, p.246-250, ISBN 978-1-60566-010-3 Search this book on .