Dimensionality Reduction with Random Projections

ℹ️

The texts in this article were partly generated by artificial intelligence and corrected and revised by us.

Classification

Whether ChatGPT (with the underlying GPT4) or research projects on autonomous driving, these projects have one thing in common: large amounts of data. Large amounts of data pose several challenges to developers of deep learning models, such as enormous hardware requirements, difficult or impossible visualization, (extremely) long training times, or poor training results due to the curse of dimensionality. In order to reduce the dimensionality of a dataset and still obtain relevant information, dimensionality reduction has been developed using various algorithms. In this article I would like to introduce the principle of random projections.

Mathematical basics

To talk about the mathematical foundations of random projections, let’s assume we have data from a high-dimensional Euclidean space, say $$\mathbb{R}^m$$, where $$m \in \mathbb{N}$$.

The Euclidean property of vector space is necessary to compute distances between two (or more) points in our vector space.

We further assume that $$m$$ is large enough and we want to bring data from $$\mathbb{R}^m$$ into a space of small dimension, i.e. $$\mathbb{R}^n$$ with $$n < m$ and $n \in \mathbb{N}$$. In this case we can use the Johnson-Lindenstrauss lemma.

The Johnson-Lindenstrauss lemma tells us that we can implement the process described above and then there exists such a $$n \in \mathbb{N}$$ that can embed the points from $$\mathbb{R}^m$$ in $$\mathbb{R}^n$$ and approximately retain the distances between the points from $$\mathbb{R}^m$$ approximately in $$\mathbb{R}^n$$. ¹ ² ³ ⁴

Preserving the distances between the points is very important here, as this describes the differences between the data points. If the distances were not (roughly) preserved, two (potentially) very different data points could appear very similar in $$\mathbb{R}^n$$, even though they are not objectively similar.

The remaining step in reducing the data set is to select the data points. This is done with Random Projections by generating a random matrix $$R \in \mathbb{R}^{N \times n}$$, where $$N \in \mathbb{N}$$ is the number of data points in the data set.

In short, a random matrix $R$ is a matrix whose elements $$r_{ij}$$ are random variables.

Many approaches can be used to generate the elements $$r_{ij}$$. Often the elements $$r_{ij}$$ are assumed to be elements of a Gaussian Normal Distribution. There are efficient methods to generate these elements. Alternatively, it has been shown that the generation can be done by a much simpler distribution: ¹ ⁵ ⁶ ⁷

$$r_{ij} = \sqrt{3} * \begin{cases} 1, & \text{ with Probability } \frac{1}{6} \\ 0, & \text{ with Probability } \frac{2}{3} \\ -1, & \text{ with Probability } \frac{1}{6} \end{cases}$$

With the random method presented here, greater efficiency can be achieved in terms of time, even if the results may be slightly worse.

Application example

As an application example, we will use a data set that can ideally have a relatively large number of input dimensions. Images are ideal for this because they already have a high number of input dimensions. In the context of this work, we have chosen to work with the colorectal_histology dataset. This dataset contains images of different tissue types, with a focus on colorectal cancer. An algorithm is used to classify this dataset into a total of eight different tissue types. ⁸

The images in the dataset are $$150$$ pixels high, $$150$$ pixels wide, and distributed over three different color channels, i.e. they are available as RBG images. This gives us a dataset with a total input dimension of $$67500$$.

Our goal is to show that random projections can improve classification. To write an example, we start with all the necessary imports:

import tensorflow_datasets as tfds
import tensorflow as tf
import numpy as np

import math

from sklearn.random_projection import GaussianRandomProjection

from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import InputLayer, Dense
from tensorflow.python.keras.optimizer_v2.adam import Adam
from tensorflow.python.keras.losses import sparse_categorical_crossentropy

random-projections.py

Note: This script runs on Tensorflow version 2.12.0.

We then load the dataset from the Tensorflow dataset and process it so that we have all the images and all the labels:

hist_data = tfds.load("colorectal_histology")

df_train = tfds.as_dataframe(hist_data["train"])
images = df_train["image"].values

temp = []
for element in images:
    element = np.reshape(element, [-1])
    temp.append(element)

images = np.array(temp)

labels = df_train["label"].values

random-projections.py

To get interesting results, the next step is to create a train test split with a ratio of $$80\%:20\%$$.

train, test = 0.8, 0.2
amount_train = math.ceil(images.shape[0] * train)
amount_test = math.floor(images.shape[0] * test)

# Randomly Sample Elements from the images Array
indices_train = np.random.choice(images.shape[0], amount_train, replace=False)
indices_train = np.sort(indices_train)

indices_test = []
for i in range(images.shape[0]):
    if ((i in indices_train) == False):
      indices_test.append(i)

indices_test = np.array(indices_test)

# Generate all necessary Variables to store Training Information
train_x, train_y, test_x, test_y = [], [], [], []

train_x = images[indices_train]
train_y = labels[indices_train]

train_x, train_y = tf.convert_to_tensor(train_x), tf.convert_to_tensor(train_y)

test_x = images[indices_test]
test_y = labels[indices_test]

test_x, test_y = tf.convert_to_tensor(test_x), tf.convert_to_tensor(test_y)

random-projections.py

Note: The tag replace=False within np.random.choice(...) prevents the duplication of indices. Without this addition, we will not get a correct train-test split.

The next step is to create a model that uses all the input data to perform a classification based on the data set. This is not a specially optimized model, but a simple neural network to illustrate the process. In reality, this would be a model optimized for the task.

model = Sequential([
    InputLayer(input_shape=(67500,)),
    Dense(units=128, activation="relu"),
    Dense(units=128, activation="relu"),
    Dense(units=64, activation="relu"),
    Dense(units=8, activation="softmax")
])

optimizer = Adam(learning_rate=10e-4)
model.compile(optimizer=optimizer, 
              loss=sparse_categorical_crossentropy, 
              metrics=["accuracy"])
model.summary()

random-projections.py

We then train the model. The model can be automatically evaluated.

history = model.fit(x=train_x, y=train_y, 
                    epochs=20, batch_size=64, 
                    validation_data=(test_x, test_y))

random-projections.py

From this graph we can see that our training is unfortunately not very stable. There are several reasons for this:

Incorrect setting of the training epochs,
Incorrect setting of the batch size,
Incorrect architecture of the neural network,
A data set that is too small or
Input dimensions of the data set are too high.

Ultimately, unfortunately, this can be due to many factors. However, we can influence the input dimensions of the data set very well. To do this, we now use the Random Projections already discussed.

proj_gauss = GaussianRandomProjection(random_state=0)
reduced_images = proj_gauss.fit_transform(images)

random-projections.py

A closer look at the data of the reduced dataset shows that the number of features could be reduced from a total of $$67500$$ to $$7300$$ features. This corresponds to a reduction to $$\approx 11\%$$ of the data. To demonstrate the effect of dimensionality reduction, we use a neural network with identical features and retrain it completely on the reduced dataset:

reduced_train_x = reduced_images[indices_train]
reduced_test_x = reduced_images[indices_test]

random-projections.py

model = Sequential([
    InputLayer(input_shape=(7300,)),
    Dense(units=128, activation="relu"),
    Dense(units=128, activation="relu"),
    Dense(units=64, activation="relu"),
    Dense(units=8, activation="softmax")
])

optimizer = Adam(learning_rate=10e-4)
model.compile(optimizer=optimizer, loss=sparse_categorical_crossentropy, metrics=["accuracy"])
model.summary()

random-projections.py

history = model.fit(x=reduced_train_x, y=train_y, 
                    epochs=20, batch_size=64, 
                    validation_data=(reduced_test_x, test_y))

random-projections.py

The new course of the training can now be visualized as follows:

We will then be able to compare all the results produced here:

Name	Full data set	Reduced data set
Maximum training accuracy	$$49.72%$$	$$69.75%$$
Final training accuracy	$$43.13%$$	$$67.12%$$
Maximum test accuracy	$$46.8%$$	$$54.3%$$
Final test accuracy	$$40.3%$$	$$51.8%$$
Number of input pixels	$$67500$$	$$7300$$
Execution time	$$-$$	$$\approx 2m$$
Training time	$$\approx 240s$$	$$23s$$

All computations have been done on the Google Colab hardware. Experiments can be repeated there, although e.g. the execution time may vary due to the availability of hardware at Google Colab.

The parameter Execution time specifies the amount of time required for the dimensionality reduction with the random projections algorithm.

All in all, it can be seen that dimensionality reduction in the context of preprocessing can very well help to improve the accuracy of a model. In principle, this is not limited to neural networks, but can also be helpful for other machine learning techniques. Equally important is the reduction of training time, which saves costs.

The improvement in training and testing behavior when training with the reduced data set can be attributed to the curse of dimensionality.

Advantages and disadvantages of random projections

Random projections have a number of advantages and disadvantages due to the stochastic nature of the algorithm. [Principal Component Analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) is often mentioned as a counterpart to this algorithm for comparison.

Advantages

The process has a number of positive attributes: ¹ ⁷ ⁹

Random projections are incredibly fast. The time complexity of random projections is $$O(Nnm)$$, which is significantly lower than that of principal component analysis.
Random projections are not as affected by outliers. This is probably due to the stochastic nature of the method.

Disadvantages

Compared to principal component analysis, random projections have a number of disadvantages: ⁹ ¹⁰.

Random projections are not as efficient as principal component analysis. In theory, this is the optimal method for reducing the dimension of a data set.

Areas of application

Random projections are already used in some areas of big data. Here are some examples: ⁹ ¹¹

Speech and language processing: In this area, sound recordings can introduce noise into the input data. Projecting the data onto a random lower-dimensional subspace yields results comparable to conventional dimension reduction methods such as Principal Component Analysis.
Image and video processing: Both image and video processing deal with very large amounts of data, where the efficiency of Random Projections is advantageous.
Bioinformatics: Bioinformatics: Gene expression profiles can be viewed as matrices with more than tens of thousands of continuous values. High dimensionality leads to computational burden and dimensionality curse problems. Therefore, dimensionality reduction techniques are often used in machine learning tasks to mitigate these problems.

Dimensionality Reduction with Random Projections

Large amounts of data can be challenging for many reasons. Today, I will present an algorithm that can be used to reduce the size of a data set: Random Projections.

Classification

Mathematical basics

Application example

Advantages and disadvantages of random projections

Advantages

Disadvantages

Areas of application

Sources