Graphics from Jan Huber –

Dimensionality Reduction with Random Projections

Large amounts of data can be challenging for many reasons. Today, I will present an algorithm that can be used to reduce the size of a data set: Random Projections.

Henrik Bartsch

Henrik Bartsch

The texts in this article were partly generated by artificial intelligence and corrected and revised by us. The following services were used for the generation:


Whether ChatGPT (with the underlying GPT4) or research projects on autonomous driving, these projects have one thing in common: large amounts of data. Large amounts of data pose several challenges to developers of deep learning models, such as enormous hardware requirements, difficult or impossible visualization, (extremely) long training times, or poor training results due to the curse of dimensionality. In order to reduce the dimensionality of a dataset and still obtain relevant information, dimensionality reduction has been developed using various algorithms. In this article I would like to introduce the principle of random projections.

Mathematical basics

To talk about the mathematical foundations of random projections, let’s assume we have data from a high-dimensional Euclidean space, say Rm\mathbb{R}^m, where mNm \in \mathbb{N}.

The Euclidean property of vector space is necessary to compute distances between two (or more) points in our vector space.

We further assume that mm is large enough and we want to bring data from Rm\mathbb{R}^m into a space of small dimension, i.e. Rn\mathbb{R}^n with n<mn < m and nNn \in \mathbb{N}. In this case we can use the Johnson-Lindenstrauss lemma.

The Johnson-Lindenstrauss lemma tells us that we can implement the process described above and then there exists such a nNn \in \mathbb{N} that can embed the points from Rm\mathbb{R}^m in Rn\mathbb{R}^n and approximately retain the distances between the points from Rm\mathbb{R}^m approximately in Rn\mathbb{R}^n. 1 2 3 4

Preserving the distances between the points is very important here, as this describes the differences between the data points. If the distances were not (roughly) preserved, two (potentially) very different data points could appear very similar in Rn\mathbb{R}^n, even though they are not objectively similar.

The remaining step in reducing the data set is to select the data points. This is done with Random Projections by generating a random matrix RRN×nR \in \mathbb{R}^{N \times n}, where NNN \in \mathbb{N} is the number of data points in the data set.

In short, a random matrix RR is a matrix whose elements rijr_{ij} are random variables.

Many approaches can be used to generate the elements rijr_{ij}. Often rijr_{ij} are assumed to be elements of a Gaussian Normal Distribution. There are efficient methods to generate these elements. Alternatively, it has been shown that the generation can be done by a much simpler distribution: 1 5 6 7

rij=3{1, with Probability 160, with Probability 231, with Probability 16r_{ij} = \sqrt{3} * \begin{cases} 1, & \text{ with Probability } \frac{1}{6} \\ 0, & \text{ with Probability } \frac{2}{3} \\ -1, & \text{ with Probability } \frac{1}{6} \end{cases}

With the random method presented here, greater efficiency can be achieved in terms of time, even if the results may be slightly worse.

Application example

As an application example, we will use a data set that can ideally have a relatively large number of input dimensions. Images are ideal for this because they already have a high number of input dimensions. In the context of this work, we have chosen to work with the colorectal_histology dataset. This dataset contains images of different tissue types, with a focus on colorectal cancer. An algorithm is used to classify this dataset into a total of eight different tissue types. 8

The images in the dataset are 150150 pixels high, 150150 pixels wide, and distributed over three different color channels, i.e. they are available as RBG images. This gives us a dataset with a total input dimension of 6750067500.

Our goal is to show that random projections can improve classification. To write an example, we start with all the necessary imports:
import tensorflow_datasets as tfds
import tensorflow as tf
import numpy as np

import math

from sklearn.random_projection import GaussianRandomProjection

from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import InputLayer, Dense
from tensorflow.python.keras.optimizer_v2.adam import Adam
from tensorflow.python.keras.losses import sparse_categorical_crossentropy

Note: This script runs on Tensorflow version 2.12.0.

We then load the dataset from the Tensorflow dataset and process it so that we have all the images and all the labels:
hist_data = tfds.load("colorectal_histology")

df_train = tfds.as_dataframe(hist_data["train"])
images = df_train["image"].values

temp = []
for element in images:
    element = np.reshape(element, [-1])

images = np.array(temp)

labels = df_train["label"].values

To get interesting results, the next step is to create a train test split with a ratio of 80%:20%80\%:20\%.
train, test = 0.8, 0.2
amount_train = math.ceil(images.shape[0] * train)
amount_test = math.floor(images.shape[0] * test)

# Randomly Sample Elements from the images Array
indices_train = np.random.choice(images.shape[0], amount_train, replace=False)
indices_train = np.sort(indices_train)

indices_test = []
for i in range(images.shape[0]):
    if ((i in indices_train) == False):

indices_test = np.array(indices_test)

# Generate all necessary Variables to store Training Information
train_x, train_y, test_x, test_y = [], [], [], []

train_x = images[indices_train]
train_y = labels[indices_train]

train_x, train_y = tf.convert_to_tensor(train_x), tf.convert_to_tensor(train_y)

test_x = images[indices_test]
test_y = labels[indices_test]

test_x, test_y = tf.convert_to_tensor(test_x), tf.convert_to_tensor(test_y)

Note: The tag replace=False within np.random.choice(...) prevents the duplication of indices. Without this addition, we will not get a correct train-test split.

The next step is to create a model that uses all the input data to perform a classification based on the data set. This is not a specially optimized model, but a simple neural network to illustrate the process. In reality, this would be a model optimized for the task.
model = Sequential([
    Dense(units=128, activation="relu"),
    Dense(units=128, activation="relu"),
    Dense(units=64, activation="relu"),
    Dense(units=8, activation="softmax")

optimizer = Adam(learning_rate=10e-4)

We then train the model. The model can be automatically evaluated.
history =, y=train_y, 
                    epochs=20, batch_size=64, 
                    validation_data=(test_x, test_y))

From this graph we can see that our training is unfortunately not very stable. There are several reasons for this:

  • Incorrect setting of the training epochs,
  • Incorrect setting of the batch size,
  • Incorrect architecture of the neural network,
  • A data set that is too small or
  • Input dimensions of the data set are too high.

Ultimately, unfortunately, this can be due to many factors. However, we can influence the input dimensions of the data set very well. To do this, we now use the Random Projections already discussed.
proj_gauss = GaussianRandomProjection(random_state=0)
reduced_images = proj_gauss.fit_transform(images)

A closer look at the data of the reduced dataset shows that the number of features could be reduced from a total of 6750067500 to 73007300 features. This corresponds to a reduction to 11\approx 11% of the data. To demonstrate the effect of dimensionality reduction, we use a neural network with identical features and retrain it completely on the reduced dataset:
reduced_train_x = reduced_images[indices_train]
reduced_test_x = reduced_images[indices_test]
model = Sequential([
    Dense(units=128, activation="relu"),
    Dense(units=128, activation="relu"),
    Dense(units=64, activation="relu"),
    Dense(units=8, activation="softmax")

optimizer = Adam(learning_rate=10e-4)
model.compile(optimizer=optimizer, loss=sparse_categorical_crossentropy, metrics=["accuracy"])
history =, y=train_y, 
                    epochs=20, batch_size=64, 
                    validation_data=(reduced_test_x, test_y))

The new course of the training can now be visualized as follows:

We will then be able to compare all the results produced here:

NameFull data setReduced data set
Maximum training accuracy49.72%49.72\%69.75%69.75\%
Final training accuracy43.13%43.13\%67.12%67.12\%
Maximum test accuracy46.8%46.8\%54.3%54.3\%
Final test accuracy40.3%40.3\%51.8%51.8\%
Number of input pixels675006750073007300
Execution time-2m\approx 2m
Training time240s\approx 240s23s23s

All computations have been done on the Google Colab hardware. Experiments can be repeated there, although e.g. the execution time may vary due to the availability of hardware at Google Colab.

The parameter Execution time specifies the amount of time required for the dimensionality reduction with the random projections algorithm.

All in all, it can be seen that dimensionality reduction in the context of preprocessing can very well help to improve the accuracy of a model. In principle, this is not limited to neural networks, but can also be helpful for other machine learning techniques. Equally important is the reduction of training time, which saves costs.

The improvement in training and testing behavior when training with the reduced data set can be attributed to the curse of dimensionality.

Advantages and disadvantages of random projections

Random projections have a number of advantages and disadvantages due to the stochastic nature of the algorithm. Principal Component Analysis (PCA)]( is often mentioned as a counterpart to this algorithm for comparison.


The process has a number of positive attributes: 1 7 9

  1. Random projections are incredibly fast. The time complexity of random projections is O(Nnm)O(Nnm), which is significantly lower than that of principal component analysis.

  2. Random projections are not as affected by outliers. This is probably due to the stochastic nature of the method.


Compared to principal component analysis, random projections have a number of disadvantages: 9 10.

  1. Random projections are not as efficient as principal component analysis. In theory, this is the optimal method for reducing the dimension of a data set.

Areas of application

Random projections are already used in some areas of big data. Here are some examples: 9 11

  1. Speech and language processing: In this area, sound recordings can introduce noise into the input data. Projecting the data onto a random lower-dimensional subspace yields results comparable to conventional dimension reduction methods such as Principal Component Analysis.

  2. Image and video processing: Both image and video processing deal with very large amounts of data, where the efficiency of Random Projections is advantageous.

  3. Bioinformatics: Bioinformatics: Gene expression profiles can be viewed as matrices with more than tens of thousands of continuous values. High dimensionality leads to computational burden and dimensionality curse problems. Therefore, dimensionality reduction techniques are often used in machine learning tasks to mitigate these problems.



  1. 2 3






  7. 2


  9. 2 3