# K-modes clustering

k-modes clustering is a modified version of the standard k-means clustering process, adapted to cluster categorical data. It does so by using the simple matching dissimilarity measure, also referred to as the Hamming distance, instead of the Euclidean distance to calculate the distance between two objects. Furthermore, it uses modes instead of means to represent the cluster centroids.

## Reasoning

While the k-means algorithm is a very popular choice for clustering numerical data, it performs poorly when applied to categorical data. To apply k-means, the categorical values first have to be transformed into numerical values, and measuring the Euclidean distance between these arbitrary codes distorts the clustering: the algorithm may treat two unrelated categories as close simply because their numerical representations happen to be near each other. One solution to this problem, although not suitable for high-dimensional categorical data, is to convert each category into a dummy binary variable that is 1 if the categorical value is present and 0 otherwise. A solution that works even with high-dimensional categorical data is k-modes clustering.
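The dummy-variable encoding mentioned above can be sketched in a few lines of plain Python; the function name `one_hot` and the example values are illustrative, not part of any standard API.

```python
def one_hot(values):
    """Encode a list of categorical values as 0/1 dummy binary variables.

    Each value becomes a vector with a 1 in the position of its category
    and 0 everywhere else.
    """
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "blue", "red", "green"]
# categories sorted alphabetically: ['blue', 'green', 'red']
encoded = one_hot(colors)
# "red" -> [0, 0, 1], "blue" -> [1, 0, 0], "green" -> [0, 1, 0]
```

Note how the vector length grows with the number of distinct categories, which is exactly why this encoding becomes impractical for high-dimensional categorical data.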

## Algorithm

### Simple Matching Dissimilarity Measure

The simple matching dissimilarity measure, also referred to as the Hamming distance, calculates the distance between two objects X and Y described by m categorical attributes as

$d(X,Y)=\sum _{i=1}^{m}\delta (x_{i},y_{i})$ where

$\delta (x_{i},y_{i})={\begin{cases}0,&x_{i}=y_{i}\\1,&x_{i}\neq y_{i}\\\end{cases}}$ and $x_{i}$ and $y_{i}$ are the categorical values of attribute i in X and Y, respectively. The more mismatches there are between X and Y, the greater the Hamming distance and thus the greater the dissimilarity between the two objects. Applying this distance function to k-modes clustering, one obtains a new function that measures the distance, and therefore the dissimilarity, between an object X and a cluster centroid $Z_{1}$ defined as

$\delta (x_{i},z_{i})={\begin{cases}1-{\frac {n_{1}^{r}}{n_{1}}},&x_{i}=z_{i}\\1,&x_{i}\neq z_{i}\end{cases}}$ where $z_{i}$ is the categorical value of attribute i in $Z_{1}$ , $n_{1}$ is the number of objects in cluster 1 and $n_{1}^{r}$ is the number of objects in cluster 1 that share the value $z_{i}$ (the r-th category) of attribute i. Hence, if X and $Z_{1}$ have the same categorical value, meaning $x_{i}=z_{i}$ , the distance between the two depends on how many objects in cluster 1 contain this value: the more frequent the value, the smaller the distance.
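The two distance functions above can be sketched in Python as follows; the helper names `matching_dissimilarity` and `frequency_dissimilarity` are our own, not from a standard library.

```python
def matching_dissimilarity(x, y):
    """Simple matching (Hamming) distance: the number of mismatched attributes."""
    return sum(xi != yi for xi, yi in zip(x, y))

def frequency_dissimilarity(x, cluster, mode):
    """Distance between object x and a cluster's mode, where a match on
    attribute i is discounted by how frequent the mode's value is in the
    cluster (the 1 - n_1^r / n_1 term above)."""
    n = len(cluster)
    d = 0.0
    for i, (xi, zi) in enumerate(zip(x, mode)):
        if xi != zi:
            d += 1.0                             # a mismatch counts fully
        else:
            n_r = sum(1 for obj in cluster if obj[i] == zi)
            d += 1.0 - n_r / n                   # frequent values count less
    return d

a = ("red", "small", "round")
b = ("red", "large", "square")
matching_dissimilarity(a, b)  # -> 2 (two of the three attributes differ)
```

Objects are represented as tuples of categorical values, one entry per attribute, matching the notation X = (x_1, ..., x_m) above.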

### Modes as Cluster Centroids

A cluster centroid in k-modes clustering is represented by the vector of modes of the categorical attributes. The mode is a statistical term for the most frequently occurring value inside a set of values. For example, the mode of the set {3, 4, 6, 4, 9, 4} is the value 4. Using the vector of modes to represent the cluster centroid minimizes the sum of distances between each object and the centre of its cluster, i.e. their total dissimilarity.

### Clustering Procedure

The first step when clustering a categorical data set into k clusters using k-modes is, if the implementation requires it, to encode the categorical values as numerical labels; this encoding is purely for convenience, since the distance used is still the simple matching measure. The next step is to randomly select k different objects out of the data set as the initial cluster centroids. Thirdly, we calculate the distance between each object and each centroid and assign every object to the cluster with the closest centroid. Lastly, we compute the new mode of each cluster and compare it with the previous one. If any mode has changed, we repeat the last two steps. Thus, the modes get updated with each iteration of the k-modes clustering process until they no longer change.
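The procedure above can be sketched as a short Python function; this is a minimal illustration of the loop described in this section, with our own function names, not a production implementation (which would add multiple restarts and smarter initialization).

```python
import random
from collections import Counter

def matching_dissimilarity(x, y):
    """Simple matching (Hamming) distance between two categorical objects."""
    return sum(xi != yi for xi, yi in zip(x, y))

def k_modes(data, k, max_iter=100, seed=0):
    """Cluster a list of equal-length tuples of categorical values into k clusters."""
    rng = random.Random(seed)
    modes = rng.sample(data, k)  # step: k random objects as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # step: assign each object to the cluster with the closest mode
        clusters = [[] for _ in range(k)]
        for obj in data:
            nearest = min(range(k),
                          key=lambda c: matching_dissimilarity(obj, modes[c]))
            clusters[nearest].append(obj)
        # step: recompute each cluster's mode (keep the old one if empty)
        new_modes = [
            tuple(Counter(o[i] for o in cl).most_common(1)[0][0]
                  for i in range(len(data[0]))) if cl else modes[j]
            for j, cl in enumerate(clusters)
        ]
        # step: stop once no mode changed between iterations
        if new_modes == modes:
            break
        modes = new_modes
    return modes, clusters
```

Because the initial centroids are chosen at random, different seeds can produce different clusterings, just as with standard k-means.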