The texts in this article were partly generated by artificial intelligence and corrected and revised by us. The following services were used for the generation:
Classification
Machine learning has played an increasingly important role in recent years. One key capability is dividing data relatively simply and reliably into learned classes, so-called classification. Neural networks in particular are becoming more and more important, because they are versatile and flexible and are also able to map strong nonlinearities in data. This post gives an introduction to classification as a task in supervised learning and presents a simple implementation using artificial neural networks.
Classification as a task
Let two sets be given: the set of input variables $I$ and the set of classes $C$. We are looking for a function $f$ that maps the input data to suitable classes:
$f: I \rightarrow C.$
For the task, both the input variables and the corresponding classes must be defined, and a sufficient base of data must be available; otherwise, convergence of the algorithm cannot be expected. The required number of data points varies with the complexity of the task and the number of input variables.
Loss metric
As with almost all machine learning models, a loss metric is used during training to measure the deviation between the available training data and the predictions of the model. For classification, the categorical cross-entropy can be used, which captures deviations between classes well.
For artificial neural networks, we usually use the discrete formulation. For two discrete probability distributions $p, q$ it reads as follows:
$H(p, q) := - \sum_{x \in X} p(x) \log(q(x)).$
Accordingly, the goal of our model is to minimize the deviation between the predictions and data, which is equivalent to minimizing the value of the loss function.
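To make the loss concrete, the discrete cross-entropy $H(p, q) = - \sum_{x} p(x) \log(q(x))$ can be evaluated by hand for a one-hot target. The following is a minimal sketch in plain Python; the function name and the small constant eps (which only guards against $\log(0)$) are assumptions of this example and not part of the original post.

```python
import math


def cross_entropy(p, q, eps=1e-12):
    """Discrete cross-entropy H(p, q) = -sum_x p(x) * log(q(x))."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # eps is an assumption of this sketch: it avoids log(0) for zero predictions.
    return -sum(p_x * math.log(max(q_x, eps)) for p_x, q_x in zip(p, q))


# One-hot target: the true class is index 1 of 3.
target = [0.0, 1.0, 0.0]
good_prediction = [0.1, 0.8, 0.1]
bad_prediction = [0.7, 0.2, 0.1]

loss_good = cross_entropy(target, good_prediction)  # equals -log(0.8)
loss_bad = cross_entropy(target, bad_prediction)    # equals -log(0.2)
print(loss_good, loss_bad)
```

With a one-hot target only the predicted probability of the true class contributes, so a confident correct prediction gives a small loss and a confident wrong one a large loss.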
Activation functions
In a neural network, the activation functions are essential for obtaining a meaningful result. In intermediate layers, the Rectified Linear Unit (ReLU) is usually a sensible choice; the choice of the last activation function matters most.
Since this is a classification task, it makes sense to obtain a probability distribution as output, i.e. a vector $\vec{v} = (v_1, ..., v_m)^T$ with non-negative entries, which satisfies
$\sum_{i = 1}^m v_i = 1.$
There are two popular approaches to guarantee this:
 a simple transformation to a unit vector by means of the vector norm. This is the approach that is typically taught in high school mathematics.
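$\vec{v}^{'} \rightarrow v_i^{'} := \frac{v_i}{\sqrt{\sum_{j = 1}^m v_j^2}} = \frac{v_i}{\Vert \vec{v} \Vert}$
In this approach $\Vert \cdot \Vert$ represents the Euclidean norm. Strictly speaking, the result $\vec{v}^{'}$ then has unit length, i.e. $\sum_{i = 1}^m (v_i^{'})^2 = 1$, which differs from the condition defined above. For $\vec{v}^{'}$ to represent a probability distribution, the entries must be non-negative and divided by their sum $\sum_{j = 1}^m v_j$ instead of the Euclidean norm. ^{1}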
 transformation to a probability distribution using the softmax activation function: In this approach the softmax activation function $\sigma(\cdot)$ ^{2} is used:
$\vec{v}^{'} \rightarrow v_i^{'} = \sigma(v_i) := \frac{e^{v_i}}{\sum_{j = 1}^m e^{v_j}}$
This also yields a probability distribution $\vec{v}^{'}$. The advantage of this function is that differences between the entries are not scaled proportionally, but amplified by the exponential function, which tends to speed up the training process. Negative inputs are handled as well.
As another option, an arbitrary base $b \in \mathbb{R}$ with $b > 1$ can be used instead of $e$ in order to choose a suitable scaling oneself.
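The effect of the base can be illustrated with a small plain-Python sketch. The function names, the restriction to bases greater than $1$ and the subtraction of the maximum (a common trick against overflow) are assumptions of this example; plain division by the sum is included only for comparison.

```python
import math


def softmax(values, base=math.e):
    """Map arbitrary real scores to a probability distribution."""
    if base <= 1.0:
        raise ValueError("this sketch only supports bases greater than 1")
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    shift = max(values)
    powers = [base ** (v - shift) for v in values]
    total = sum(powers)
    return [p / total for p in powers]


def sum_normalize(values):
    """Divide non-negative scores by their sum (proportional scaling)."""
    if any(v < 0 for v in values):
        raise ValueError("sum normalization needs non-negative values")
    total = sum(values)
    if total == 0:
        raise ValueError("at least one value must be positive")
    return [v / total for v in values]


scores = [1.0, 2.0, 3.0]
probs_e = softmax(scores)             # base e
probs_2 = softmax(scores, base=2.0)   # base 2: [1/7, 2/7, 4/7]
probs_linear = sum_normalize(scores)  # [1/6, 1/3, 1/2]
print(probs_e, probs_2, probs_linear)
```

A larger base puts more weight on the largest score, while proportional scaling keeps the original ratios.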
ReLU can also be chosen as the last activation function, but then the values have to be converted into a probability distribution before further use in order to train with them. For predicting a class, the typical argmax() is sufficient.
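The following plain-Python sketch shows why argmax() alone is enough for a prediction: dividing non-negative activations by their sum does not change which entry is the largest. The helper names and the uniform fallback for an all-zero output are assumptions of this example.

```python
def relu(values):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, v) for v in values]


def to_distribution(values):
    """Convert non-negative activations into a probability distribution."""
    total = sum(values)
    if total == 0.0:
        # Assumption: fall back to the uniform distribution if all activations are zero.
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def argmax(values):
    """Index of the largest entry (the first one in case of ties)."""
    return max(range(len(values)), key=lambda i: values[i])


raw_outputs = [-1.5, 0.5, 2.0, 1.5]
activations = relu(raw_outputs)               # [0.0, 0.5, 2.0, 1.5]
probabilities = to_distribution(activations)  # [0.0, 0.125, 0.5, 0.375]
predicted_class = argmax(activations)
print(probabilities, predicted_class)
```

For training, the distribution is needed to evaluate the loss; for prediction, the index of the largest activation already identifies the class.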
Implementation using TensorFlow
Generation of the dataset
In this post we use the CIFAR10 dataset, which lets us classify various objects and animals. Each input is a $(32, 32, 3)$ array representing the pixel intensities of a $32 \times 32$ RGB image.
The necessary imports are as follows:
The dataset can be downloaded easily via the TensorFlow Datasets API:
In this code, we first create a builder, which provides an interface for the download. Then a train-test split of $80 \%$ to $20 \%$ is performed.
Next, we bring the data into a format that can be used in the rest of the process. A check shows that we have $50000$ data points in the training dataset and $10000$ in the test dataset.
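A possible way to load and prepare the data with TensorFlow Datasets is sketched below. It is not the listing of the original post: the function name load_cifar10, the batch size and the shuffle buffer are hypothetical choices, and the sketch uses the predefined train and test splits of the dataset, which contain the $50000$ and $10000$ data points mentioned above. It assumes that the packages tensorflow and tensorflow_datasets are installed.

```python
NUM_CLASSES = 10  # CIFAR10 distinguishes ten classes
BATCH_SIZE = 64   # hypothetical value, not taken from the post


def load_cifar10(batch_size=BATCH_SIZE):
    """Return batched (train, test) tf.data.Dataset objects for CIFAR10."""
    # Imported inside the function so that the constants above can be used
    # without TensorFlow being installed.
    import tensorflow as tf
    import tensorflow_datasets as tfds

    builder = tfds.builder("cifar10")  # interface for download and access
    builder.download_and_prepare()     # downloads the data on first use
    train_ds, test_ds = builder.as_dataset(
        split=["train", "test"], as_supervised=True
    )

    def preprocess(image, label):
        # Scale the pixel intensities to [0, 1] and one-hot encode the label,
        # as required by the categorical cross-entropy.
        image = tf.cast(image, tf.float32) / 255.0
        label = tf.one_hot(label, NUM_CLASSES)
        return image, label

    train_ds = train_ds.map(preprocess).shuffle(10_000).batch(batch_size)
    test_ds = test_ds.map(preprocess).batch(batch_size)
    return train_ds, test_ds
```

Calling load_cifar10() returns two datasets of image batches with shape $(32, 32, 3)$ per image and one-hot labels of length $10$.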
Looking at the dataset we can see images like the following image:
For classification, a Convolutional Neural Network is used here, which is characterized by the use of Conv2D, MaxPool2D and Flatten layers. The convolution layers apply learned filters to the image and the pooling layers aggregate neighboring values, which gives good results in practice.
Note: The softmax activation is used as the activation function of the last layer.
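A small network of this kind could look like the following sketch. The number of layers, the filter counts, the optimizer and all function names are assumptions of this example and not the architecture of the original post. The helper functions compute by hand how the spatial size shrinks through the Conv2D and MaxPool2D layers before Flatten is applied.

```python
INPUT_SHAPE = (32, 32, 3)  # height, width, RGB channels
NUM_CLASSES = 10


def conv_output_size(size, kernel=3):
    """Spatial size after a convolution without padding and with stride 1."""
    return size - kernel + 1


def pool_output_size(size, pool=2):
    """Spatial size after max pooling with equal pool size and stride."""
    return size // pool


def flattened_features(size=32, filters=(32, 64)):
    """Number of values that Flatten produces for the model below."""
    for _ in filters:
        size = pool_output_size(conv_output_size(size))
    return size * size * filters[-1]


def build_cnn():
    """Build and compile a small CNN with a softmax output layer."""
    # Imported inside the function so that the shape helpers above
    # can be used without TensorFlow being installed.
    import tensorflow as tf

    layers = tf.keras.layers
    model = tf.keras.Sequential([
        layers.Input(shape=INPUT_SHAPE),
        layers.Conv2D(32, 3, activation="relu"),  # 32x32 -> 30x30
        layers.MaxPool2D(2),                      # 30x30 -> 15x15
        layers.Conv2D(64, 3, activation="relu"),  # 15x15 -> 13x13
        layers.MaxPool2D(2),                      # 13x13 -> 6x6
        layers.Flatten(),                         # 6 * 6 * 64 = 2304 values
        layers.Dense(64, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["categorical_accuracy"],
    )
    return model
```

The softmax output layer matches the categorical cross-entropy loss, which expects one-hot encoded labels.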
The training is then executed using the model.fit() method:
After only 20 epochs, a categorical accuracy of $\approx 70 \%$ is achieved on the training set and $\approx 65 \%$ on the test set. Longer training improves the accuracy on the training set, but also increases overfitting. The network architecture can be a further source of error; it can be improved by hyperparameter optimization.
The history of the loss function can be seen here
and the history of the categorical accuracy can be seen here:
Both histories show the values across the epochs during training.
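For reference, the categorical accuracy is the fraction of samples for which the class with the highest predicted probability matches the one-hot label. The following plain-Python sketch uses made-up labels and predictions purely for illustration; they are not results from the training described here.

```python
def categorical_accuracy(one_hot_labels, predictions):
    """Fraction of samples whose predicted class matches the label."""
    if not predictions or len(one_hot_labels) != len(predictions):
        raise ValueError("need equally many labels and predictions")

    def argmax(values):
        return max(range(len(values)), key=lambda i: values[i])

    hits = sum(
        argmax(label) == argmax(prediction)
        for label, prediction in zip(one_hot_labels, predictions)
    )
    return hits / len(predictions)


# Illustrative values only: four samples, three classes.
labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
predictions = [
    [0.7, 0.2, 0.1],  # correct
    [0.1, 0.6, 0.3],  # correct
    [0.5, 0.3, 0.2],  # wrong: class 0 predicted, class 2 expected
    [0.2, 0.5, 0.3],  # correct
]
accuracy = categorical_accuracy(labels, predictions)
print(accuracy)
```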
Notes

As an alternative to model.fit(), a custom training loop can be programmed, which works with automatic differentiation via tensorflow.GradientTape. See tensorflow.com or the examples from Deep Q Learning.
For any kind of neural network application, it is useful to normalize input and output data.
Normalization here means transforming the values of the corresponding variable to the value range $[0, 1]$. This can be done simply and efficiently with a linear transformation; see, for example, microsoft.com.
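Such a linear transformation to $[0, 1]$ can be sketched in plain Python as follows. The function names are hypothetical, and the inverse transformation is included because normalized outputs usually have to be mapped back to the original range.

```python
def fit_min_max(values):
    """Determine the range of the data used for the normalization."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("all values are equal, the range is empty")
    return low, high


def normalize(values, low, high):
    """Linear transformation of [low, high] to [0, 1]."""
    return [(v - low) / (high - low) for v in values]


def denormalize(values, low, high):
    """Inverse transformation from [0, 1] back to [low, high]."""
    return [v * (high - low) + low for v in values]


pixels = [0, 51, 102, 255]
low, high = fit_min_max(pixels)
scaled = normalize(pixels, low, high)  # approximately [0.0, 0.2, 0.4, 1.0]
restored = denormalize(scaled, low, high)
print(scaled, restored)
```

The range should be determined on the training data only and then reused for the test data, so that both are transformed in the same way.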