Graphic by Milad Fakurian – https://unsplash.com/de/@fakurian

Random Forests - An upgrade for Decision Trees

Decision trees alone are often not sufficient to perform a meaningful classification. Random forests represent an improvement over decision trees. I present this method in the following post.

Henrik Bartsch


The texts in this article were partly generated by artificial intelligence and corrected and revised by us.

Introduction

Random forests are a powerful machine learning technique that has gained popularity in recent years due to its versatility and effectiveness. This algorithm is used to solve both regression and classification problems and is based on the concept of ensemble learning. In this article, we will explore the basics of random forests, how they work and their applications in various fields.

The foundation - Ensemble learning

Basically, every machine learning model is based on assumptions made in advance: for example, that a certain input parameter is important, or that certain trends can be represented in a simplified way while others require a more complex representation. Even for assumptions that classically give relatively good results in a particular problem domain, finding a truly good set of assumptions is often difficult, and the space of possible hypotheses can become quite large. In ensembles, several hypotheses (and the associated models) are combined and trained together to find a potentially better solution. In classification problems, for example, a majority vote over the individual models is used to identify the most likely class.

Using multiple individual models in an ensemble automatically increases the computation time for predictions compared to simpler models. In principle, ensemble learning can be understood as letting weaker learning algorithms perform better by investing additional computation; alternatively, one could spend the extra computation on a single non-ensemble algorithm to achieve comparable results.

Ensemble systems, however, have the advantage that their overall accuracy increases more efficiently with additional computing resources than that of non-ensemble systems. Moreover, ensemble methods produce (significantly) better results than non-ensemble systems when the individual models differ strongly from one another, a property that many ensemble learning algorithms deliberately try to promote. In addition, ensemble systems (strongly) reduce the variance of the predictions by combining many classifiers. 1 2

There are a number of different ensemble methods; examples include Bootstrap Aggregation and Boosting.
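To make the majority-vote idea concrete, the following is a minimal sketch using scikit-learn's VotingClassifier; the toy dataset and the choice of base models are illustrative assumptions, not taken from the text above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative toy dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three deliberately different base models; "hard" voting takes the majority
# over the predicted classes of the individual models.
ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
print("Ensemble accuracy:", ensemble.score(X_test, y_test))
```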


Random Forests

The basic idea

The idea behind Random Forests is to train many different, uncorrelated decision trees and to combine them into an ensemble model. This is achieved with the help of two techniques:

  1. Feature Randomness and

  2. Bootstrap Aggregation.

Feature Randomness means that each decision tree is trained on a randomly drawn subset of the input variables. This is an important difference from a single decision tree, which is trained using all input variables. No pruning is applied to the resulting decision trees.

Author’s note: Personally, instead of this method I have worked with an approach that draws random data points from the training dataset (with replacement) to generate a training set for each individual decision tree. This also produces good results.

Bootstrap Aggregation is a deliberate perturbation of the training dataset for each classifier: for every classifier, a training set is created by drawing a sample from the original dataset, where individual data points may be used multiple times. This (strongly) reduces the variance. 3 4

Feature Randomness and Bootstrap Aggregation are both important, since the accuracy of the random forest depends mainly on the strength of the individual decision trees and the correlation between them. 4
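As a rough illustration of how these two mechanisms could be combined, here is a small NumPy sketch that draws a bootstrap sample of data points (with replacement) and a random subset of features for one tree; the function name and the sizes are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def draw_tree_training_set(X, y, n_features_per_tree):
    """Draw the training data for a single tree: a bootstrap sample of the
    rows plus a randomly chosen subset of the feature columns."""
    n_samples, n_features = X.shape
    # Bootstrap Aggregation: sample row indices with replacement.
    rows = rng.integers(0, n_samples, size=n_samples)
    # Feature Randomness: pick a subset of columns without replacement.
    cols = rng.choice(n_features, size=n_features_per_tree, replace=False)
    return X[rows][:, cols], y[rows], cols

X = rng.normal(size=(100, 8))        # toy inputs
y = rng.integers(0, 2, size=100)     # toy class labels
X_tree, y_tree, used_features = draw_tree_training_set(X, y, n_features_per_tree=3)
print("Features used by this tree:", used_features)
```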

In addition, each bootstrap sample leaves out roughly one third of the data; this out-of-bag data acts as a test set and can be used, for example, to tune the parameters of the algorithm by means of Cross-Validation. 2

The parameters

For the base algorithm, the following parameters must be set before training:

  1. The number of features/data points used to train each individual decision tree.

  2. The number of decision trees to be trained.

Usually, several hundred to several thousand decision trees are trained, depending on the complexity of the problem. 5
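In scikit-learn's RandomForestClassifier these parameters correspond roughly to n_estimators and max_features (plus max_samples for the bootstrap sample size); the following sketch uses purely illustrative values and also shows the out-of-bag estimate mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,      # number of decision trees in the ensemble
    max_features="sqrt",   # features considered per split (Feature Randomness)
    max_samples=None,      # bootstrap sample size (None = size of the dataset)
    oob_score=True,        # evaluate on the out-of-bag data
    random_state=0,
)
forest.fit(X, y)
print("Out-of-bag accuracy estimate:", forest.oob_score_)
```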

Applications

The random forest method can now be used in the following ways: 2 5 6

  1. Classification: a prediction of a data point's class membership is made by a majority vote over the decision trees.

  2. Regression: each decision tree predicts a numerical value, and the predictions are averaged over all trees to produce the regression result.
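The following sketch (again with scikit-learn and an illustrative toy dataset) shows the regression case: the forest's prediction is simply the average of the individual trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, y)

x_new = X[:1]  # a single query point
# Average the predictions of the individual trees ...
tree_predictions = np.array([tree.predict(x_new)[0] for tree in forest.estimators_])
print("Mean over all trees:", tree_predictions.mean())
# ... which matches the prediction of the forest itself.
print("Forest prediction:  ", forest.predict(x_new)[0])
```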

Advantages and Disadvantages of Random Forests

Advantages

Random forest models offer a number of practically relevant advantages: 2 4

  1. The risk of overfitting is (significantly) reduced compared to single decision trees, provided a sufficiently large number of decision trees is used; the risk decreases further as the number of trees increases. The classifier does not fit the training data too closely, because averaging over uncorrelated trees reduces the overall variance and thus the prediction error.

  2. The random forest method is very flexible because it can handle both classification and regression problems. Thanks to Feature Randomness and Bootstrap Aggregation, random forests can also be used to estimate missing values, since accuracy does not suffer much when part of the data is missing.

  3. Random forests make it relatively easy to determine the importance of a particular parameter for the performance of the model.

The contribution of a parameter can be evaluated via the Gini Importance (Mean Decrease in Impurity), while the Mean Decrease in Accuracy measures how much the accuracy drops when the values of that parameter are randomly permuted; a short sketch of both follows after this list. Further information on this topic can be found here.

  4. Overfitting is a problem that plays a much smaller role in random forests. This can be shown using the strong law of large numbers; furthermore, it can be shown that additional trees reduce the prediction error. 4
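As a sketch of the two importance measures mentioned above: scikit-learn exposes the mean decrease in impurity via feature_importances_ and the permutation-based mean decrease in accuracy via permutation_importance; the dataset and parameter values here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Gini importance / mean decrease in impurity, computed during training.
print("Mean decrease in impurity:", forest.feature_importances_)

# Mean decrease in accuracy: randomly permute each feature on held-out data
# and measure how much the accuracy drops.
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
print("Mean decrease in accuracy:", result.importances_mean)
```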

Disadvantages

Random forest models, however, also have a number of disadvantages: 2

  1. Training a random forest is a time-consuming process, since many different decision trees have to be trained. The result is usually a more accurate classifier, but one that also needs more time for predictions.

  2. The use of many classifiers requires more computing resources for processing and (intermediate) storage of the corresponding data.

This problem can be (partially) compensated, at least with respect to computation time, by parallelization, as sketched after this list. 4

  3. When using a random forest instead of a single decision tree, the interpretability of the decision tree is lost, which can be helpful for debugging.
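A minimal sketch of this parallelization with scikit-learn, where n_jobs controls how many CPU cores are used to train the trees (the dataset is illustrative and timings will of course vary):

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

for n_jobs in (1, -1):  # 1 = a single core, -1 = all available cores
    start = perf_counter()
    RandomForestClassifier(n_estimators=300, n_jobs=n_jobs, random_state=0).fit(X, y)
    print(f"n_jobs={n_jobs}: {perf_counter() - start:.2f} s")
```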

Fields of application of Random Forests

Random Forests can be used in a number of practical application areas: 2

  1. In finance, this method can be used to assess customers with a high credit risk, to detect fraud, or for pricing. The most important factors here are the shortened execution time and the reduced need for data preprocessing.

  2. In the healthcare sector, random forests are used in computational biology, where the large amounts of available data can be used to predict how well medications will work for particular patients.

  3. In e-commerce, models such as random forests are used to build recommendation engines that increase customer satisfaction.

Sources

Footnotes

  1. wikipedia.org

  2. ibm.com

  3. wikipedia.org

  4. "Random Forests" by Leo Breiman (2001) via link.springer.com

  5. wikipedia.org

  6. towardsdatascience.com