Classification via k-nearest-neighbor classifier

ℹ️

The texts in this article were partly generated by artificial intelligence and corrected and revised by us.

Classification

Classification is a relevant task area within machine learning and supervised learning. In this area, methods such as decision trees or random forests are frequently used. However, there is also another algorithm: the K-Nearest-Neighbors classifier.

k-Nearest-Neighbor is often also referred to as kNN or k-NN.

This blog post explains how it works, how the algorithm classifies data points and how to use it.

Basics

The k-NN algorithm is a method from the supervised learning class. The basic idea of the method is to find the $$k \in \mathbb{N}$$ neighbours of a data point to be classified using the data points from the training data set and a distance measure and to perform a classification by a majority decision using the classes of the neighbours. Here, the algorithm assumes that similar data can usually be found in the vicinity. What is special about this procedure is that no training phase is necessary. The data of the training data set are simply loaded into the algorithm instead of having to process them further and calculate corresponding parameters (iteratively).

Since the model does not contain a training phase and no parameters that have to be calculated by training, this method is also called a non-parametric method.

k-NN thus belongs to the “lazy learning” models, which only store training data instead of processing them and carrying out appropriate training with the data.

Due to the property that no training has to be carried out, the corresponding phase of loading the data is correspondingly fast. However, this also means that all calculations are only carried out when a prediction has to be made about a class. This is especially problematic with large training data sets.

For large data sets, an approximate version of k-NN can be used to reduce the calculation time slightly. This is accompanied by a small loss of accuracy, but this is acceptable in practice. Many later versions of the algorithm, which were presented as improvements, mainly try to reduce the number of evaluations for the distance measure.

In principle, k-NN can also be used for regression problems, but this is not dealt with here. A brief insight into this can be found here. ¹ ²

Application examples

The k-NN method is often used in the following areas:

recommender systems
pattern recognition
data mining
predictions for the financial market
intruder detection

and many other fields of application.

Selecting the parameters

Selecting the distance metric

Since k-NN is looking for the $$k \in \mathbb{N}$$ nearest neighbours, it must also be possible to measure them. For this purpose, a mathematical distance measure is used. ² These distance measures can later be used to create decision boundaries for the given training data set, which assigns different regions to different classes. ¹

In principle, there are many different ways to use distance measures. Classically, the Euclidean norm (also called $$2$$-norm) is used. This is defined on the $$m$$-dimensional space $$\mathbb{R}^m$$ as follows:

$${\Vert x \Vert}_2 := \sqrt{\sum_{i=1}^{m} x_i^2}.$$

A generalisation can be achieved through the $$p$$ norm:

$${\Vert x \Vert}_p := \sqrt[p]{\sum_{i=1}^{m} \vert x_i \vert^p}.$$

Here, the $$2$$ from above was replaced by a number $$p \in \mathbb{R}, \enspace p \geq 1$$. This offers the advantage that assignment regions are shaped differently regarding the distance. This can have advantages in practice and is usually introduced into projects through trial and error or experience. An example of this can affect the shaping can be seen here.

A list of possible distance measures can be found here.

Choice of k

Like many other machine learning algorithms, this algorithm has a numerical parameter that controls our quality of results. In this case, it is our parameter $$k \in \mathbb{N}$$. As a simple example, we can understand the influence as follows:

for $$k = 1$$ the classification is carried out with the nearest point. Here the problem arises that random noise or disturbances in the data can falsify the classification.

$$k = 1$$ also denotes the nearest-neighbour algorithm. This algorithm suffers from strong instability problems and should - if possible - be avoided.

for $$k \rightarrow \infty$$, all points in the dataset are included in the majority decision. In this example, the result will always be the same; each point (regardless of its input data) will be mapped to the class that occurs most frequently in the data set.

It turns out that the right choice of $$k$$ is a relevant factor that influences our results in a relevant way. The best choice of $$k$$ depends on the data that are given. In practice, the optimal $$k$$ is determined by heuristic methods such as hyperparameter optimisation or by using bootstrapping. ² ¹

It is advisable to choose $$k$$ odd. This avoids two classes reaching a draw in the majority decision. This is a relevant factor especially in binary classification problems (with only two classes). This reduces or completely avoids the situation of a tie in the majority decision.

It is recommended not to choose $$k$$ too small (Overfitting](https://www.ibm.com/topics/overfitting)), as otherwise the class allocation can fluctuate strongly with small changes in the input data.

It is recommended not to choose $$k$$ too large (Underfitting), otherwise random background noise will be taken from the data.

Example results of a k-NN implementation gave the following accuracies for the dataset imbd-reviews:

Even though this is only a simple example, you can see that the parameter $$k$$ has an influence on the accuracy of the algorithm.

Advantages and disadvantages of the algorithm

Advantages

The k-NN algorithm offers a number of advantages: ¹ ² ³ ⁴

Simple implementation: The idea behind this algorithm is very simple and requires little prior knowledge of machine learning or mathematics. Coupled with the property that k-NN produces comparatively good results, this is a good algorithm to start with classification using classical machine learning algorithms.
Changes in the data can be made quickly, as the algorithm does not require a training phase.
k-NN only needs the $$k$$ parameter as a relevant hyperparameter and a distance metric to be chosen. This is a very small number, especially in comparison with neural networks. this is a very small number and can be tried out relatively quickly if necessary.
Theoretically, k-NN is a very accurate algorithm. If the amount of data tends towards infinity, it can theoretically be shown that a very small error rate can be achieved. ²

Disadvantages

The k-NN algorithm has a number of practical disadvantages: ¹ ² ³ ⁴

Since k-NN has no training phase, the algorithm scales very poorly and requires correspondingly more memory and CPU resources. This quickly becomes noticeable with large data sets.

In principle, this problem has been improved by modified data structures such as Ball-Tree. Nevertheless, with larger datasets - depending on the the application - it is better to switch to other algorithms.

Like many other “machine learning”-algorithms, k-NN also suffers from the curse of dimensionality. Large dimensions in the input data may distort the performance of the algorithm (sometimes severely).

To reduce this problem, a dimensionality reduction can be carried out in the preprocessing phase in order to sort out input parameters with low importance and not to consider them further. This is a comparison scenario between the speed of the algoithm (fewer parameters) vs. its accuracy (more parameters), since reducing the data sets inputs often removes information from the data set.

Just as often in various other algorithms, the problem of overfitting or underfitting comes into play when $$k$$ is chosen incorrectly or the data set is chosen too small.
If a class is particularly frequently represented in the training data set, this can strongly distort the accuracy of the algorithm. This is because it can happen under certain circumstances that corresponding data points of the class (if they have a certain noise in the data) occur too frequently over the space of data points and thus form a neighbour more frequently.

A relatively simple solution to this is to use a weighted majority vote, which takes into account the distance to the surrounding neighbours.

Because this is a “lazy loading” model, predictions for different data points can take a long time to complete. This problem is exacerbated as the size of the training data set.