Graphics from Pawel Czerwinski

Classification using neural networks

There are several classical algorithms to perform a classification. Here I describe an implementation using neural networks and my experiences with it.

Henrik Bartsch

The texts in this article were partly generated by artificial intelligence and corrected and revised by us. The following services were used for the generation:


Machine learning has played an increasingly important role for several years. One key capability is classification: dividing data simply and reliably into learned classes. Neural networks in particular are becoming more and more important here because they are versatile and flexible, and because they can also model strong non-linearities in data. This post gives an introduction to classification as a supervised learning task and presents a simple implementation using artificial neural networks.

Classification as a task

Given are two sets: the set of input variables $I$ and the set of classes $C$. We are looking for a function $f$ that maps the input data to suitable classes:

$$f: I \rightarrow C.$$

For this task, both the input variables and the corresponding classes must be defined, and a certain basic set of data must be available; otherwise, convergence of the algorithm cannot be expected. The number of data points required varies with the complexity of the task and the number of input variables.

Loss metric

As with almost all machine learning models, a loss metric is used during training to calculate the deviation between the available training data and the predictions of the model. For a classification, the categorical cross-entropy can be used, which represents deviations between classes well.

For artificial neural networks, we usually use the discrete formulation. For two discrete probability distributions $p, q$ it reads as follows:

$$H(p, q) := - \sum_{x \in X} p(x) \log(q(x)).$$

Accordingly, the goal of our model is to minimize the deviation between the predictions and data, which is equivalent to minimizing the value of the loss function.
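The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the TensorFlow implementation: the function name, the example vectors and the small constant `eps` (which guards against `log(0)`) are assumptions made for this example.

```python
import math


def categorical_crossentropy(p, q, eps=1e-12):
    """Discrete cross-entropy H(p, q) = -sum_x p(x) * log(q(x)).

    p is the true distribution (e.g. a one-hot label), q the predicted one.
    eps is an assumed small constant that avoids evaluating log(0).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return -sum(p_x * math.log(max(q_x, eps)) for p_x, q_x in zip(p, q))


label = [0.0, 1.0, 0.0]            # one-hot label: the true class is index 1
good_prediction = [0.1, 0.8, 0.1]  # most probability mass on the true class
bad_prediction = [0.7, 0.2, 0.1]   # most probability mass on a wrong class

loss_good = categorical_crossentropy(label, good_prediction)
loss_bad = categorical_crossentropy(label, bad_prediction)
print(loss_good, loss_bad)
```

With a one-hot label, only the predicted probability of the true class contributes to the sum, so the loss shrinks as the model assigns more probability to the correct class.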

Activation functions

In a neural network, the activation functions are relevant for obtaining a meaningful result. In intermediate layers, the Rectified Linear Unit (ReLU) is usually a sensible choice, but the choice of the last activation function is usually the most relevant.

Since this is a classification, it makes sense to obtain a probability distribution as output, i.e. a vector $\vec{v} = (v_1, ..., v_m)^T$ with non-negative entries, which satisfies

$$\sum_{i = 1}^m v_i = 1.$$

There are two popular approaches to guarantee this:

  1. A simple transformation to a unit vector by means of the vector norm. This is the approach that is typically taught in high school mathematics.

$$\vec{v}^{'} \rightarrow v_i^{'} := \frac{v_i}{\sqrt{\sum_{j = 1}^m v_j^2}} = \frac{v_i}{\Vert \vec{v} \Vert}$$

In this approach $\Vert \cdot \Vert$ represents the Euclidean norm. Strictly speaking, the result $\vec{v}^{'}$ then has length one, i.e. its squared entries sum to one. For the entries themselves to sum to one as required above, they have to be non-negative and divided by their sum (the 1-norm) instead. 1

  2. A transformation to a probability distribution using the softmax activation function. In this approach the softmax activation function $\sigma(\cdot)$ 2 is used:

$$\vec{v}^{'} \rightarrow v_i^{'} = \sigma(v_i) := \frac{e^{v_i}}{\sum_{j = 1}^m e^{v_j}}$$

This also yields a probability distribution $\vec{v}^{'}$. The advantage of this function is that small deviations are not scaled proportionally but are amplified by the exponential function, which can speed up the training process.

As another option, an arbitrary base $b \in \mathbb{R}$ with $b > 0$ can be used instead of $e$ in order to choose a suitable scaling oneself.

ReLU can also be chosen as the last activation function, but then the values have to be converted into a probability distribution before they can be used for training. For predicting a class, the usual argmax() is sufficient.
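The softmax transformation and the final argmax step can be sketched in plain Python as follows. This is an illustration under assumptions: the function names, the `base` parameter (standing in for the base $b$ mentioned above) and the example values are chosen for this sketch, and subtracting the maximum is a common numerical safeguard that does not change the result.

```python
import math


def softmax(v, base=math.e):
    """Map a vector of real values to a probability distribution.

    base=e gives the standard softmax; other positive bases change the scaling.
    """
    if base <= 0:
        raise ValueError("base must be positive")
    # Shifting by the maximum leaves the result unchanged but avoids overflow.
    v_max = max(v)
    powers = [base ** (v_i - v_max) for v_i in v]
    total = sum(powers)
    return [power / total for power in powers]


def argmax(v):
    """Index of the largest entry, i.e. the predicted class."""
    return max(range(len(v)), key=lambda i: v[i])


outputs = [2.0, 1.0, -1.0]                # raw outputs of the last layer
probs = softmax(outputs)                  # standard softmax with base e
probs_base2 = softmax(outputs, base=2.0)  # flatter distribution with base 2
predicted_class = argmax(probs)
print(probs, probs_base2, predicted_class)
```

A larger base concentrates more probability on the largest entry, while a base closer to one yields a flatter distribution. The predicted class is the same in both cases as long as the base is greater than one.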

Implementation using Tensorflow

Generation of the dataset

In this post we use the CIFAR-10 dataset, which lets us classify various objects and animals. Each input sample is a $(32, 32, 3)$ array representing the pixel intensities of a $32 \times 32$ color image.

The necessary imports are as follows:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds

from collections import deque
# Use the public Keras API; tensorflow.python.keras is a private module
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer, Conv2D, MaxPool2D, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import sparse_categorical_crossentropy, categorical_crossentropy
from tensorflow.keras.metrics import sparse_categorical_accuracy, categorical_accuracy

The dataset can be downloaded easily via the Tensorflow Datasets-API:
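builder = tfds.builder('cifar10')

# Dataset metadata, e.g. the number of classes
info = builder.info

# Download and prepare the data before the datasets are built
builder.download_and_prepare()

train_ds, test_ds = builder.as_dataset(split=['train', 'test'], shuffle_files=True, as_supervised=True)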

In this code, we first create a builder, which provides an interface for the download. Then the predefined training and test splits of the dataset are loaded, which corresponds to a ratio of roughly $83 \%$ to $17 \%$.
train_x = [example.numpy() for example, label in train_ds]
train_y = [label.numpy() for example, label in train_ds]

test_x = [example.numpy() for example, label in test_ds]
test_y = [label.numpy() for example, label in test_ds]

With the code above, we convert the data into a format that can be used meaningfully in the rest of the process. Checking the result, we can see that we have $50000$ data points in the training dataset and $10000$ in the test dataset.

Looking at the dataset, we can see images like the following:

Airplane from the training dataset

The model is defined as follows:
model = Sequential([
    Conv2D(filters=64, kernel_size=(3, 3), activation="relu"),
    MaxPool2D(pool_size=(3, 3)),
    Conv2D(filters=64, kernel_size=(3, 3), activation="relu"),
    MaxPool2D(pool_size=(3, 3)),
    # Flatten the feature maps so that the dense layers receive a vector
    Flatten(),
    Dense(units=128, activation="relu"),
    Dense(units=64, activation="relu"),
    Dense(units=info.features["label"].num_classes, activation="softmax")
])

# model.summary()

optimizer = Adam(learning_rate=1e-4)

For classification, a Convolutional Neural Network is used here, which is characterized by the use of Conv2D, MaxPool2D and Flatten layers. A filtering operation (convolution) and a pooling operation are applied to the image, which gives good results in practice.

Note: The softmax activation is used as the activation function of the last layer.
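To see how the convolution and pooling layers shrink the image, the spatial sizes can be worked out by hand. The sketch below assumes the Keras defaults (no padding, stride 1 for Conv2D, and a pooling stride equal to the pool size); the helper names are made up for this illustration.

```python
def conv_output_size(size, kernel_size, stride=1):
    """Spatial output size of a convolution without padding ("valid")."""
    return (size - kernel_size) // stride + 1


def pool_output_size(size, pool_size):
    """Spatial output size of max pooling without padding, stride = pool size."""
    return (size - pool_size) // pool_size + 1


size = 32  # CIFAR-10 images are 32 x 32 pixels
sizes = [size]
for layer in ("conv", "pool", "conv", "pool"):
    if layer == "conv":
        size = conv_output_size(size, kernel_size=3)
    else:
        size = pool_output_size(size, pool_size=3)
    sizes.append(size)

filters = 64  # number of filters in the last convolution layer
flattened_features = size * size * filters
print(sizes, flattened_features)
```

Under these assumptions the spatial size goes from 32 to 30, 10, 8 and finally 2, so flattening the last feature maps yields a vector of 2 * 2 * 64 = 256 values for the dense layers.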

The training is then executed using the model.fit method:
model.compile(optimizer=optimizer, loss=categorical_crossentropy, metrics=[categorical_accuracy])

train_x_ds = tf.convert_to_tensor(train_x, dtype=tf.float64) / 256
train_y_ds = tf.one_hot(tf.convert_to_tensor(train_y, dtype=tf.int64), depth=info.features["label"].num_classes)

history = model.fit(x=train_x_ds, y=train_y_ds, epochs=50)

After only 20 epochs, a categorical accuracy of $\approx 70 \%$ is achieved on the training set and $\approx 65 \%$ on the test set. Longer training improves the accuracy, but also increases overfitting. The network architecture can be another source of error; it can be improved by hyperparameter optimization.
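The categorical accuracy is the fraction of samples for which the class with the highest predicted probability matches the one-hot encoded label. A minimal plain-Python sketch of this metric, with made-up example data and helper names that are assumptions for this illustration:

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def categorical_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class.

    y_true holds one-hot labels, y_pred holds predicted distributions.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_pred = [
    [0.7, 0.2, 0.1],  # correct: class 0
    [0.1, 0.6, 0.3],  # correct: class 1
    [0.5, 0.3, 0.2],  # wrong: predicts class 0 instead of class 2
    [0.2, 0.5, 0.3],  # correct: class 1
]

accuracy = categorical_accuracy(y_true, y_pred)
print(accuracy)
```

Comparing this value on the training data and on held-out test data is what reveals overfitting: the training accuracy keeps rising while the test accuracy stagnates.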

The history of the loss function can be seen here:

[Figure: history of the loss function]

and the history of the categorical accuracy here:

[Figure: history of the categorical accuracy]

Both histories show the values across the epochs during training.


  1. As an alternative, a custom training loop can be programmed, which works with automatic differentiation via tensorflow.GradientTape. See here or the examples from Deep Q Learning.

  2. For any kind of neural network application, it is useful to normalize input and output data.

Normalization means transforming the values of the corresponding variable to the value range $[0, 1]$. This can be done simply and efficiently with a linear transformation, for example, see