# Enabling the Power of Dimensionality Reduction: An Introduction to Principal Component Analysis

Dimensionality reduction is an increasingly important part of the machine learning workflow as data sets grow larger. Today we look at a linear transformation method that maps data into lower-dimensional vector spaces and can thereby improve the learning process: Principal Component Analysis.

Henrik Bartsch

The texts in this article were partly composed with the help of artificial intelligence and corrected and revised by us.

## Introduction

Dimensionality reduction is an important step in data analysis and machine learning that can reveal hidden patterns and relationships in data. One of the most popular and effective techniques for unsupervised dimensionality reduction is Principal Component Analysis (PCA). PCA is a linear transformation that projects the original data into a lower-dimensional space spanned by eigenvectors of the covariance matrix. This allows PCA to capture the most important features of the data while reducing its dimensionality, which makes it an ideal tool for data visualization, feature extraction, and data compression. In addition, PCA is a deterministic algorithm: it always produces the same results given the same input data. This makes its results easier to reproduce and validate, which is not always the case with other unsupervised dimensionality reduction techniques. In this blog entry, we want to introduce this algorithm.

## Mathematical Basics

The basic idea of Principal Component Analysis is relatively simple: we try to find out whether certain variables in our input data are redundant, i.e. whether a part of the data can be represented by another part and is therefore “hidden” in it. This property is measured by the correlation. We also examine how much information an input variable contains. We measure this by the variance, which allows us to eliminate variables that barely deviate from their mean and therefore have a low information content. 1 2

To understand these mathematical properties, they are briefly introduced here. 1 Suppose we have two (or more) data points of the form $x_1, x_2 \in \mathbb{R}^n$ with $n \in \mathbb{N}$. In this case, the covariance of the two data points can be defined as follows:

$\kappa(x_1, x_2) := \frac{1}{n} (x_1 - \bar{x}_1)^T (x_2 - \bar{x}_2).$

The expression $\bar{x}$ for a vector $x \in \mathbb{R}^n$ represents the mean of the vector over all its components; in $x - \bar{x}$ this mean is subtracted from every component.

With this definition, we can now introduce the standard deviation $\sigma(x)$ with $x \in \mathbb{R}^n$.

$\sigma(x) := \sqrt{\kappa(x, x)}.$

We have used the term variance above. The variance is the square of our standard deviation, i.e. $\sigma(x)^2 = \kappa(x, x)$. 3 With the definitions used here, the variance is therefore obtained directly from the covariance, while the standard deviation requires the additional square root and is what we need to normalize the covariance in the next step.

In the next step, we still need the correlation between two data points, which we can introduce using the last two definitions:

$\chi(x_1, x_2) := \frac{\kappa(x_1, x_2)}{\sigma(x_1)\sigma(x_2)}.$
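The three definitions above can be tried out numerically. The following is a minimal sketch using NumPy; the function names `kappa`, `sigma` and `chi` simply mirror the symbols used here, and the example vectors are made up for illustration.

```python
import numpy as np

def kappa(x1, x2):
    """Covariance as defined above: dot product of the centered vectors, divided by n."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n = x1.shape[0]
    return float((x1 - x1.mean()) @ (x2 - x2.mean()) / n)

def sigma(x):
    """Standard deviation: square root of the covariance of x with itself."""
    return float(np.sqrt(kappa(x, x)))

def chi(x1, x2):
    """Correlation: covariance normalized by both standard deviations."""
    return kappa(x1, x2) / (sigma(x1) * sigma(x2))

# Illustrative vectors (assumptions, not data from the article).
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.5])    # almost a multiple of a, so nearly redundant
c = np.array([1.0, -1.0, -1.0, 1.0])  # no linear relationship with a

print(chi(a, b))  # close to 1
print(chi(a, c))  # approximately 0
```

A correlation close to $1$ or $-1$ indicates that one variable can largely be represented by the other, which is exactly the redundancy PCA tries to exploit.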

To find uncorrelated principal components, a number of properties from linear algebra are needed. We start with the properties of eigenvalues and eigenvectors: 1 4

A vector $v \in \mathbb{R}^m$ with $m \in \mathbb{N}$ is said to be an eigenvector of a matrix $A \in \mathbb{R}^{m \times m}$ if there exists a scalar value $\lambda \in \mathbb{R}$ such that

$Av = \lambda v$

is satisfied. In this case, we denote $\lambda$ as the eigenvalue of $A$ for the eigenvector $v$.

Put simply, this means that multiplying the matrix (which in general both stretches and rotates a vector) by one of its eigenvectors merely stretches that vector by the factor $\lambda$.

Such a definition can be used to perform what is known as an eigenspace decomposition. This can be described as follows:

For a given matrix $A \in \mathbb{R}^{m \times m}$, let $v_1, \dots, v_m \in \mathbb{R}^m$ be its eigenvectors and $\lambda_1, \dots, \lambda_m$ the corresponding eigenvalues. In this case we can collect them in the matrices $V \in \mathbb{R}^{m \times m}$ and $\Lambda \in \mathbb{R}^{m \times m}$:

$V = (v_1, \dots, v_m),$

$\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_m).$

In this instance, we can decompose the eigenspace with

$A V = V \Lambda.$
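The relation $A V = V \Lambda$ can be checked numerically. The sketch below uses NumPy's `np.linalg.eigh`, which is intended for symmetric matrices such as covariance matrices; the $2 \times 2$ matrix is an arbitrary example chosen for illustration.

```python
import numpy as np

# A small symmetric example matrix (an assumption for illustration only).
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# eigh returns real eigenvalues in ascending order and the
# corresponding eigenvectors as the columns of V.
eigenvalues, V = np.linalg.eigh(A)
Lambda = np.diag(eigenvalues)

# Check A V = V Lambda up to floating point error.
print(np.allclose(A @ V, V @ Lambda))

# Each column individually satisfies A v = lambda v.
for lam, v in zip(eigenvalues, V.T):
    print(lam, np.allclose(A @ v, lam * v))
```

For symmetric matrices the eigenvectors returned this way are also orthonormal, which is the property that later makes the principal components orthogonal.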

This is a useful property for principal component analysis. After all, we are looking for input variables that are correlated with each other. Such variables can in principle be represented by other variables and thus represent redundancy in our data. An eigenspace decomposition of the covariance matrix reveals this structure for all of our data at once. Eigenvectors with particularly high eigenvalues point in directions along which the data has a high variance and thus a high information content, which is what we are interested in. Eigenvectors with low eigenvalues correspond to directions of low variance, which have a low impact on the information in the data set. The eigenvectors are the principal components here: they span a transformed coordinate system in which we can represent our data more efficiently.

By searching for linear combinations of the input variables, the algorithm is limited to linear dependencies between the variables and therefore cannot capture non-linear relationships.

To get all this information, we need to do an eigenspace decomposition on our covariance matrix. We can represent our covariance matrix as follows:

$K = \begin{pmatrix} \kappa(x_1, x_1) & \cdots & \kappa(x_1, x_m) \\ \vdots & \ddots & \vdots \\ \kappa(x_m, x_1) & \cdots & \kappa(x_m, x_m) \end{pmatrix}.$

We can now apply an eigenspace decomposition to this matrix. Then, by sorting the eigenvalues by size (and possibly fixing the number of features to keep in advance), we can use the eigenvectors belonging to the $l \in \mathbb{N}$ largest eigenvalues, with $l \le m$, to represent our reduced data. We call these eigenvectors principal components to reflect the connection between them and the algorithm. The principal components are orthogonal to each other and the data projected onto them is uncorrelated, i.e., they are not redundant and cannot be represented by a combination of other principal components. 1 5
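The procedure described in this section can be sketched in a few lines of NumPy. This is a minimal illustration and not the implementation used later in the article: the function name `pca_fit_transform`, the synthetic data and the choice $l = 2$ are assumptions made for the example.

```python
import numpy as np

def pca_fit_transform(X, l):
    """Project the rows of X (samples x features) onto the l leading principal components."""
    X_centered = X - X.mean(axis=0)
    # Covariance matrix K of the features, normalized by the number of samples.
    K = X_centered.T @ X_centered / X.shape[0]
    eigenvalues, eigenvectors = np.linalg.eigh(K)
    # eigh sorts ascending, so reverse the order to get the largest eigenvalues first.
    order = np.argsort(eigenvalues)[::-1][:l]
    components = eigenvectors[:, order]
    return X_centered @ components, components, eigenvalues[order]

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Three features, two of which are nearly redundant copies of the first.
X = np.column_stack([
    t,
    2.0 * t + 0.01 * rng.normal(size=200),
    -t + 0.01 * rng.normal(size=200),
])

Z, components, variances = pca_fit_transform(X, l=2)
print(Z.shape)    # reduced data: 200 samples, 2 features
print(variances)  # variance captured by each principal component
```

In this synthetic example almost all of the variance ends up in the first principal component, because the three input features are nearly linear functions of each other.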

## Application Example

In the following, we use training data similar to that in our blog post on the random projection algorithm to allow a comparison between the two methods. Again, we specifically use version $2.12.0$ of TensorFlow to avoid errors in the code. The first step is to import all the necessary libraries:

Next, we download the colorectal_histology dataset. It contains image data with an image size of $150 \times 150 \times 3$, i.e. $67500$ values per image. We want to reduce the size of this input data in the hope of improving the accuracy of the neural networks we train for classification. The image data is flattened directly so that all features of an image are contained in a single vector. This simplifies the later dimensionality reduction.

The next step is to set the proportions for the train-test split. In this case, we use $80\%$ of the data for training and $20\%$ for testing, which are common values.

In connection with this, it is necessary to extract the corresponding training and test data from all the data. This is done in the following code section:
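The original code for this step is not reproduced here. As a rough sketch of what flattening the images and splitting them $80\%$/$20\%$ can look like, the following NumPy example may help; the helper name `train_test_split_flat`, the random stand-in images and the label range are assumptions for illustration, not the article's actual data pipeline.

```python
import numpy as np

def train_test_split_flat(images, labels, train_fraction=0.8, seed=0):
    """Flatten each image to one vector, shuffle, and split into training and test parts."""
    flat = images.reshape(len(images), -1).astype(np.float32)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(flat))
    n_train = int(train_fraction * len(flat))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return flat[train_idx], labels[train_idx], flat[test_idx], labels[test_idx]

# Stand-in data: 50 random "images" of size 8 x 8 x 3 with arbitrary integer labels.
rng = np.random.default_rng(1)
images = rng.integers(0, 256, size=(50, 8, 8, 3))
labels = rng.integers(0, 8, size=50)

x_train, y_train, x_test, y_test = train_test_split_flat(images, labels)
print(x_train.shape, x_test.shape)
```

Shuffling before splitting avoids a split that depends on the order in which the data set happens to be stored.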

We then define a neural network to classify our data.

In the next step, we train this with our training data and display the results during training.

Finally, we can perform a specific test run of the neural network on our test data.

From this graph we can see that our training is unfortunately not very stable. There could be several reasons for this:

• Incorrect training epoch setting,
• Incorrect batch size setting,
• Faulty neural network architecture,
• A data set that is too small or
• Data set input dimensions too high.

Now we get to the really interesting part: Dimensionality reduction using principal component analysis. We can implement this in a few lines of code using scikit-learn:
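The original code is not reproduced here. The following is a minimal sketch of how such a reduction can look with scikit-learn's `PCA` class. The synthetic stand-in data, the variable names and `n_components=20` are assumptions for illustration, and fitting on the training data only is a design choice of this sketch rather than a confirmed detail of the article's code.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the flattened image vectors: 100 training and 25 test samples
# with 300 features each, generated from 5 hidden factors plus a little noise.
mixing = rng.normal(size=(5, 300))
x_train = rng.normal(size=(100, 5)) @ mixing + 0.01 * rng.normal(size=(100, 300))
x_test = rng.normal(size=(25, 5)) @ mixing + 0.01 * rng.normal(size=(25, 300))

# n_components=20 is an arbitrary illustrative choice, not the article's setting.
pca = PCA(n_components=20)
x_train_reduced = pca.fit_transform(x_train)  # learn the components on the training data
x_test_reduced = pca.transform(x_test)        # reuse the same components for the test data

print(x_train_reduced.shape, x_test_reduced.shape)
# Share of the total variance captured by the first five components.
print(pca.explained_variance_ratio_[:5].sum())
```

The attribute `explained_variance_ratio_` is a convenient way to judge how many components are actually needed for a given data set.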

The reduced output data is then used to create the reduced training and test data sets.

Now we need to redefine our neural network so that it can correctly process the reduced input data and not use the results of the last training run, which could distort the final result.

Finally, we train and test the neural network again to get the data we need.

As an illustration, we have the training accuracy and the test accuracy:

We can compare all the results generated here after applying the above code.

| Name | Complete Dataset | Reduced Dataset |
| --- | --- | --- |
| Maximum Training Accuracy | $59\%$ | $99.2\%$ |
| Final Training Accuracy | $58.38\%$ | $97\%$ |
| Maximum Test Accuracy | $53.7\%$ | $57\%$ |
| Final Test Accuracy | $42.4\%$ | $56.5\%$ |
| Amount of Pixels | $67500$ | $5000$ |
| Execution Time | $-$ | $\approx 13m$ |
| Training Time | $\approx 240s$ | $13s$ |

All computations were performed on hardware at Google Colab. Experiments can be repeated there, although the execution time may vary depending on the availability of hardware at Google Colab.

The Execution Time parameter indicates how long the dimensionality reduction using Principal Component Analysis took.

Overall, it can be seen that dimensionality reduction as part of preprocessing can be very helpful in improving the accuracy of a model. In principle, this is not limited to neural networks, but can also be helpful for other machine learning techniques. Equally important is the reduction of training time, which leads to cost savings.

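The improvement in training and testing behavior when training with the reduced data set can be attributed to the curse of dimensionality: the more input dimensions a model has to cover, the more sparsely a fixed number of samples fills the input space, which makes it harder to learn patterns that generalize. Reducing the dimension mitigates this effect.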

In the following, we examine the advantages and disadvantages of the method used here. In particular, we compare Principal Component Analysis with Random Projections, which is also a popular algorithm in this area.

1. Distortions: Because Principal Component Analysis projects onto the orthogonal eigenvectors of the covariance matrix, it does not introduce additional distortions into the data. Random projections are only approximately orthogonal, which can introduce bias. 6

This property only holds if the dependencies between the individual features are indeed linear. If non-linearities occur, it is not necessarily met.

1. Optimality: Principal Component Analysis is optimal among linear methods in the sense that, for a given target dimension, its projection retains the largest possible share of the variance. Random projections rely on approximations instead. This usually means that the dimension of the resulting data is smaller than it would be with random projections. 6

1. Runtime: Although Principal Component Analysis provides better results, this comes at the cost of a longer runtime. 6 7 8 In our example, Principal Component Analysis requires a runtime that is $\approx 6.5$ times longer than Random Projections. This is mainly due to the numerically expensive computation of eigenvalues and eigenvectors, which can become a problem as data sets grow. Random projections are therefore more suitable for very high-dimensional data.

In principle, it is possible to find a compromise between optimal results and computation time by approximating the eigenvalues and eigenvectors with iterative algorithms under certain conditions. Examples of such algorithms are the power iteration method or the Lanczos method. 6

1. Similarity: In some cases, random projections are better than principal component analysis at preserving the similarity between data points when transforming to a lower-dimensional vector space. 6

2. Linearity: Principal component analysis assumes that the relationships between the variables are linear. If they are not, the algorithm cannot produce good results.

3. Outliers: The presence of outliers in the data can make the results significantly worse than they would be without them. This is because the principal components are altered by the outliers and no longer necessarily reflect the actual structure of the data set. 9

As a workaround, outlier analysis can be applied to the data set, or the variables can be normalized. Normalizing the variables is recommended for Principal Component Analysis. 10 11
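To make the runtime compromise mentioned above more concrete, the following is a minimal sketch of the power iteration method, which approximates only the largest eigenvalue and its eigenvector instead of computing the full decomposition. The function name `power_iteration`, the fixed iteration count and the synthetic data are assumptions for illustration; practical implementations typically use a convergence criterion instead of a fixed number of steps.

```python
import numpy as np

def power_iteration(K, num_iterations=200, seed=0):
    """Approximate the dominant eigenvalue and eigenvector of a symmetric matrix K."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=K.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iterations):
        # Repeated multiplication amplifies the component along the dominant eigenvector.
        w = K @ v
        v = w / np.linalg.norm(w)
    # The Rayleigh quotient gives the eigenvalue estimate for the normalized vector v.
    return float(v @ K @ v), v

# Synthetic data with clearly different variances per feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
X_centered = X - X.mean(axis=0)
K = X_centered.T @ X_centered / X.shape[0]

eigenvalue, v = power_iteration(K)
exact = np.linalg.eigh(K)[0][-1]  # largest eigenvalue from the full decomposition
print(eigenvalue, exact)
```

The method converges quickly when the largest eigenvalue is well separated from the second largest, as in this example, and slowly when the two are close.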

## Fields of Application

Principal component analysis can theoretically be used in many areas. Here are some examples of where this algorithm is used: 6 9

1. Marketing: When analyzing customer data, PCA can be used to filter out irrelevant factors from data sets that only have a minor influence on purchasing behaviour. This allows advertising to be more targeted and sales to be increased.

2. Statistical Analysis: In a statistical model, the complexity caused by a large number of features can reduce the quality of the model. Reducing the number of features through PCA can increase the quality and the probability of correct predictions.

3. Image Processing: When analyzing satellite imagery, there are often large data sets that are difficult to analyze in their raw form. Reducing the data using PCA helps to simplify the analysis and improve the results.

4. Text Analysis: Due to the high dimensionality of text data, reduction is sometimes necessary. PCA can perform such a reduction.