Robust clustering using k-Means

ℹ️

The texts in this article were partly generated by artificial intelligence and corrected and revised by us.

Classification

Clustering is relevant to many organizations because it is an important discipline of data mining and is used to analyze data sets. The goal of cluster analysis is to identify new groups in data, as opposed to classification, where data is assigned to existing classes and given labels.

The k-means algorithm is a vector quantization method that is also used in cluster analysis. This algorithm is introduced below.

Basics of the k-Means algorithm

Originally derived from signal processing, $$k$$-Means aims at dividing $$n \in \mathbb{N}$$ (different) observations into $$k \in \mathbb{N}$$ partitions, which should be as homogeneous and compact as possible. For the partitioning, so-called centroids are used, which represent the (weighted) center of a partition.

In the context of unsupervised learning, the partitions created here are called clusters.

The resulting partitioning ranges can also be visualized, e.g. using a Veronoi diagram.

The new groups created here in the respective clusters can eventually be used, for example, for automatic classification, pattern recognition in image processing, or market segmentation (or any other process that relies on such prior knowledge). ¹ ²

Mathematical problem behind k-Means

When assigning $$n$$ observations to $$k$$ clusters, we want to satisfy two different criteria as well as possible:

the similarity of the data points within each cluster should be as high as possible.
the similarity of the data points between the clusters should be as low as possible.

These two criteria are optimized so that the data points are actually assigned to different partitions. The basic idea behind k-means is to assign each observation $$x_i \in \mathbb{R}^d$$ with the dimension of the dataset $$d \in \mathbb{N}$$ to a cluster using centroids with coordinates $$m_j \in \mathbb{R}^d, j \in \{ 1, \dots, k\}$$. To achieve the cluster assignment, we use the coordinates of the centroids $$m_1, \dots, m_k$$ and compute the centroid with the minimum distance for each data point $$x_i$$:

$$ \min_{l \in \{ 1, \dots, k \}} {\Vert x_i - m_l \Vert}_2. $$

Here, $${\Vert \cdot \Vert}_2$$ denotes the euclidean distance.

In principle, other distance measures are possible to calculate the distance between points, but the Euclidean distance is the most commonly used.

These steps should be applied to all points. This results in the following minimization problem:

$$ J = \sum_{i=1}^{k} \sum_{j = 1}^{n} {\Vert x_j - m_i \Vert}_2^2. $$

Where $$J$$ is the total value of the cost function to be minimized. This cost function is based on the least squares method. To minimize this cost function algorithmically as simply as possible, we use the following steps of the k-means algorithm:

randomly assign each observation $$x_i, i \in \{ 1, \dots, n \}$$ to a cluster.
for each of the clusters, compute the centroid $$m_j, j \in \{ 1, \dots, k\}$$.
assign each observation to the cluster whose centroid it is closest to. Repeat steps 2 and 3 if there are any changes in the current iteration of step 3.

In this way, we iteratively obtain a solution that approaches a better solution as the number of iterations increases. Once the termination criterion is satisfied, we obtain a solution based on the currently used parameter $$k$$. Note that this may be a local optimum instead of a global optimum, so the algorithm should be run several times with the same value of $$k$$. Among these solutions, the one with the lowest value of the cost function should be used. ¹ ³

Application example

To perform a simple cluster analysis using the k-means algorithm, I generated a series of random data points. Around each of these data points, a total of $$300$$ data points were randomly generated with small spacing. The result looks like this:

In this visualization, it is easy to see the different clusters that result from modeling the randomly generated data. By applying the cluster analysis with $$k=3$$, it is easy to see that the clusters are formed by a group of data points around the starting points.

We can also change our choice of $$k$$, for example to $$k=4$$. Here, for example, it would be possible for a centroid to form in the center of the plane, possibly taking some points from all clusters into the cluster, or a number of other possibilities.

The visualization shows that the previous cluster $$3$$ has been split into two groups. It is not entirely clear why this is the case, which is a good example of how unsupervised learning solutions are not always obvious or understandable. Therefore, when applying the k-means algorithm, background knowledge should be incorporated into the choice of the parameter $$k$$, in order to eliminate solutions with little realism in advance.

Advantages and disadvantages of k-Means

Advantages

The great advantage of k-means in this context is the simplicity of the algorithm. Both the individual steps and the entire algorithm are not very complex, both in conception and implementation, which explains the popularity of the algorithm. Moreover, its simplicity makes it applicable to many unsupervised learning problems. ² ³

Disadvantages

The k-means algorithm has a number of drawbacks that need to be considered: ¹ ³

It is necessary to determine the number of clusters kk in advance to perform clustering, although the optimal number of clusters is not necessarily known in advance.

To find a good number of clusters experimentally, the elbow method is often used. This is a method based on the fact that the global cost function for a certain number of clusters shows a smaller decrease in the cost function than for a higher number of clusters. Since I have never worked with this method myself, I refer to a good article that discusses this topic in more detail: stackabuse.com.

Sometimes preprocessing of the data is necessary to find meaningful results in the data. However, it is not always clear in advance whether this is necessary and how it should be done.
In its basic form, k-means randomly assigns the initial clusters to the individual data points in the starting step. This is often inefficient and leads to a higher number of iterations until the termination criterion is reached.

An extension to k-means that specifically improves this situation is k-Means++. This is included in most standard implementations (such as Scikit-Learn). ⁴

In particular, the results of unsupervised learning do not always match expectations. This makes it difficult to interpret the results of unsupervised learning algorithms.

Areas of application

There are a number of good examples to which the algorithm can be applied: ² ⁵ ³

Customer segmentation: This is where we segment customer data according to characteristics that distinguish them from other groups. This enables us to deliver targeted advertising, for example, while increasing both the cost and efficiency of specified efforts.
Geodata optimization: The algorithm can be used in combination with Ant Colony Optimization to find the optimal number of starting points for stores.
Tournament analysis: To analyze tournaments in any sport or eSports, the corresponding data can be divided into clusters for better individual analysis.

These are just a few examples, in the real world (and in the sources) you can find many more examples of the application.