Robust clustering using k-Means

To identify groups of similar data points in unlabeled data and to reduce complexity, clustering can be used. Here we describe the k-means algorithm, which can be used for clustering.

Henrik Bartsch


The texts in this article were partly composed with the help of artificial intelligence and corrected and revised by us.

Overview

Clustering is an important discipline of data mining and is used to analyze data sets, which makes it relevant to many organizations. In contrast to classification, where data points are assigned to existing, labeled classes, the goal of cluster analysis is to identify new groups in the data.

The k-means algorithm is a vector quantization method that is also used in cluster analysis. This algorithm is introduced below.

Basics of the k-Means algorithm

Originally derived from signal processing, k-Means aims at dividing $n \in \mathbb{N}$ (different) observations into $k \in \mathbb{N}$ partitions, which should be as homogeneous and compact as possible. For the partitioning, so-called centroids are used, which represent the (weighted) center of a partition.

In the context of unsupervised learning, the partitions created here are called clusters.

The resulting partitions can also be visualized, e.g. using a Voronoi diagram.

The new groups identified in the respective clusters can then be used, for example, for automatic classification, pattern recognition in image processing, or market segmentation (or any other process that relies on such prior knowledge). [1] [2]

Mathematical problem behind k-Means

When assigning $n$ observations to $k$ clusters, we want to satisfy two different criteria as well as possible:

  1. The similarity of the data points within each cluster should be as high as possible.
  2. The similarity of the data points between different clusters should be as low as possible.

These two criteria are optimized so that the data points are actually assigned to different partitions. The basic idea behind k-Means is to assign each observation $x_i \in \mathbb{R}^d$, where $d \in \mathbb{N}$ is the dimension of the data set, to a cluster using centroids with coordinates $m_j \in \mathbb{R}^d$, $j \in \{1, \dots, k\}$. To achieve the cluster assignment, we use the coordinates of the centroids $m_1, \dots, m_k$ and compute the centroid with the minimum distance for each data point $x_i$:

$$\min_{l \in \{ 1, \dots, k \}} {\Vert x_i - m_l \Vert}_2.$$

Here, ${\Vert \cdot \Vert}_2$ denotes the Euclidean distance.

In principle, other distance measures are possible to calculate the distance between points, but the Euclidean distance is the most commonly used.
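To make the assignment step concrete, here is a small sketch using SciPy. The arrays `X` and `centroids` are illustrative stand-ins for the observations and centroid coordinates, and the `metric` argument of `cdist` shows where another distance measure could be plugged in:

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.random.rand(300, 2)        # n = 300 observations in d = 2 dimensions
centroids = np.random.rand(3, 2)  # coordinates of k = 3 centroids

# Distance of every observation to every centroid; "euclidean" could be
# replaced by e.g. "cityblock" for the Manhattan distance.
distances = cdist(X, centroids, metric="euclidean")
nearest = distances.argmin(axis=1)  # index of the closest centroid per point
```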

These steps should be applied to all points. This results in the following minimization problem:

$$J = \sum_{i=1}^{k} \sum_{x_j \in C_i} {\Vert x_j - m_i \Vert}_2^2.$$

Here, $J$ is the total value of the cost function to be minimized, and $C_i$ denotes the set of observations currently assigned to cluster $i$. This cost function is based on the least squares method. To minimize it algorithmically as simply as possible, we use the following steps of the k-means algorithm:

  1. Randomly assign each observation $x_i$, $i \in \{1, \dots, n\}$, to a cluster.
  2. For each of the clusters, compute the centroid $m_j$, $j \in \{1, \dots, k\}$.
  3. Assign each observation to the cluster whose centroid it is closest to. Repeat steps 2 and 3 if any assignment changed in the current iteration of step 3.

In this way, we iteratively obtain a solution that improves as the number of iterations increases. Once the termination criterion is satisfied, we obtain a solution based on the currently used parameter $k$. Note that this may be a local optimum instead of a global optimum, so the algorithm should be run several times with the same value of $k$. Among these solutions, the one with the lowest value of the cost function should be used. [1] [3]
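For illustration, here is a minimal NumPy sketch of these steps, including the restart logic just described. The function name `kmeans`, its parameters, and the handling of empty clusters are my own additions, not part of any particular library:

```python
import numpy as np

def kmeans(X, k, n_iter=100, n_restarts=10, seed=0):
    """Basic k-means with random initial assignment, following the three
    steps above. Keeps the best of several restarts, since a single run
    may end in a local optimum. Returns (cost J, centroids, labels)."""
    rng = np.random.default_rng(seed)
    best = (np.inf, None, None)
    for _ in range(n_restarts):
        # Step 1: randomly assign each observation to one of the k clusters.
        labels = rng.integers(0, k, size=len(X))
        for _ in range(n_iter):
            # Step 2: compute the centroid of each cluster; if a cluster
            # happens to be empty, re-seed it with a random observation.
            centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j)
                else X[rng.integers(len(X))]
                for j in range(k)
            ])
            # Step 3: re-assign each observation to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break  # no assignment changed -> converged
            labels = new_labels
        # Cost J: sum of squared distances to the assigned centroids.
        cost = np.sum(np.linalg.norm(X - centroids[labels], axis=1) ** 2)
        if cost < best[0]:
            best = (cost, centroids, labels)
    return best
```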

Application example

To perform a simple cluster analysis using the k-means algorithm, I generated a few random center points. Around each of these centers, a total of 300 data points were randomly generated with small spacing. The result looks like this:

In this visualization, the different clusters that result from the randomly generated data are easy to see. Applying the cluster analysis with $k = 3$ shows that the clusters form around the original starting points.
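Such an experiment can be reproduced with scikit-learn. The following is a minimal sketch, assuming that `make_blobs` with the parameters below approximates the data generation described above (the exact values of the original data are not known):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 data points scattered closely around 3 random centers,
# similar to the setup described above.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=15)
plt.scatter(*kmeans.cluster_centers_.T, marker="x", color="red", s=100)
plt.show()
```

Setting `n_clusters=4` instead reproduces the second experiment below.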

We can also change our choice of $k$, for example to $k = 4$. Here it would be possible, for example, for a centroid to form in the center of the plane and pull some points from all of the previous clusters into its own cluster, among a number of other possibilities.

The visualization shows that the previous cluster 3 has been split into two groups. It is not entirely clear why this is the case, which is a good example of how unsupervised learning solutions are not always obvious or understandable. Therefore, when applying the k-means algorithm, background knowledge should be incorporated into the choice of the parameter $k$ in order to rule out implausible solutions in advance.

Advantages and disadvantages of k-Means

Advantages

The great advantage of k-Means is the simplicity of the algorithm. Both the individual steps and the algorithm as a whole are not very complex, in either conception or implementation, which explains the algorithm's popularity. Moreover, its simplicity makes it applicable to many unsupervised learning problems. [2] [3]

Disadvantages

The k-means algorithm has a number of drawbacks that need to be considered: [1] [3]

  1. It is necessary to determine the number of clusters $k$ in advance to perform the clustering, although the optimal number of clusters is not necessarily known beforehand.

To find a good number of clusters experimentally, the elbow method is often used. The idea is to plot the cost function $J$ against the number of clusters and to choose the value of $k$ beyond which a further increase yields only a small additional decrease in $J$. Since I have never worked with this method myself, I refer to a good article that discusses this topic in more detail: stackabuse.com. A small sketch of the method can be found after this list.

  2. Sometimes preprocessing of the data is necessary to find meaningful results in the data. However, it is not always clear in advance whether this is necessary and how it should be done.

  3. In its basic form, k-Means randomly assigns the individual data points to initial clusters in the starting step. This is often inefficient and leads to a higher number of iterations until the termination criterion is reached.

An extension of k-Means that specifically improves this situation is k-Means++. It is included in most standard implementations (such as Scikit-Learn); a short snippet follows after this list. [4]

  4. In particular, the results of unsupervised learning do not always match expectations, which can make the results of such algorithms difficult to interpret.
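As announced under the first drawback, here is a minimal sketch of how an elbow curve could be produced with scikit-learn, assuming the data set `X` from the application example (`inertia_` holds the value of the cost function $J$ after fitting):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
costs = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    costs.append(model.inertia_)  # cost J for this choice of k

plt.plot(ks, costs, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("cost J")
plt.show()  # choose k at the "elbow" where the curve flattens
```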
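Regarding the third drawback: in Scikit-Learn, the improved k-Means++ initialization is the default and can be requested explicitly via the `init` parameter. A short sketch, again assuming the data set `X` from above:

```python
from sklearn.cluster import KMeans

# "k-means++" spreads the initial centroids out across the data, which
# typically reduces the number of iterations until convergence;
# init="random" would instead pick random observations as starting centroids.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10).fit(X)
```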

Areas of application

There are a number of good examples to which the algorithm can be applied: [2] [5] [3]

  1. Customer segmentation: Here we segment customer data according to characteristics that distinguish one group from another. This makes it possible, for example, to deliver targeted advertising while increasing the cost efficiency of such efforts.
  2. Geodata optimization: The algorithm can be used in combination with Ant Colony Optimization to find the optimal number of starting points for stores.
  3. Tournament analysis: To analyze tournaments in any sport or eSports, the corresponding data can be divided into clusters for better individual analysis.

These are just a few examples; in the real world (and in the sources) you can find many more applications of the algorithm.

Sources

  1. wikipedia.org

  2. datascientest.com

  3. towardsmachinelearning.org

  4. towardsdatascience.com

  5. towardsdatascience.com