# Logistic Regression - Prediction of Event Occurrences

Event occurrence probabilities are often a complex problem to model. In this article, a basic method for this is presented: Logistic Regression.

Henrik Bartsch

The texts in this article were partly generated by artificial intelligence and corrected and revised by us.

## Classification

Machine learning has many application areas. These often deal with classification, clustering, or other problems; regressions also play a role. In this article, a special type of regression is considered: logistic regression.

## Definition of logistic regression

Logistic regression is a statistical model used to predict the occurrence of an event based on a linear combination of different variables. Classically, a binary dependent variable is modeled by a set of independent variables. The independent variables can be binary (taking only the values $0$ or $1$) or continuous (taking arbitrary real values). 1 2 3

In simple terms, a linear combination is a weighted sum, which also works over elements of higher dimensions.

### Requirements for the variables

There are two relevant requirements that the variables involved should satisfy in order for logistic regression to be successful: 4 5

1. The dependent variable must be a binary variable,
2. the independent variables should not be multi-collinear.

Multicollinearity describes the situation in which two or more stochastic variables are correlated with each other, i.e. can be expressed linearly in terms of one another.

Put simply, multicollinearity means that no information is gained from one or more variables, because they can also be represented by one or more of the other variables. Such a relationship would later be reflected in the argument of the logistic function.
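Whether two candidate variables are (multi)collinear can be checked, for example, with the Pearson correlation coefficient. A minimal sketch in Python (the variable names and the synthetic data are purely illustrative):

```python
import random

random.seed(0)
n = 100
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
x2 = [2.0 * v + 1.0 for v in x1]            # perfectly collinear with x1
x3 = [random.gauss(0.0, 1.0) for _ in range(n)]  # independent of x1

def pearson(a, b):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    m = len(a)
    ma, mb = sum(a) / m, sum(b) / m
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

# A correlation near +/-1 signals multicollinearity: x2 adds no
# information beyond x1, while x3 is (nearly) uncorrelated with x1.
print(pearson(x1, x2))  # close to 1
print(pearson(x1, x3))  # close to 0
```

In practice, one of a pair of strongly correlated independent variables would simply be dropped before fitting the model.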

### Mathematical basis of the model

The basis of the statistical model is the logistic function, defined by

$p(x) := \frac{1}{1 + \exp(-(x - \mu)/s)}.$

This definition contains the parameter $\mu$, which must be chosen so that $p(\mu) = \frac{1}{2}$, and the scale parameter $s$. The function $p(x)$ represents the probability with which the corresponding event occurs. Since this is a probability, the function is bounded, $p(x) \in [0, 1]$, as an event cannot have a probability of occurrence greater than $100\,\%$. 4

Assumption for this model: there is only a single independent variable $x$.

The condition $p(\mu) = \frac{1}{2}$ means graphically that $\mu$ marks the center of the logistic curve.

By redefining the parameters, we obtain a classical linear function in the argument of the exponential function:

$p(x) := \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x))}.$

The new parameters $\beta_0, \beta_1$ arise here as follows: $\beta_0 = -\frac{\mu}{s}, \beta_1 = \frac{1}{s}.$
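This reparametrization can be checked numerically: expanding $-(x - \mu)/s$ gives $\beta_0 = -\mu/s$ and $\beta_1 = 1/s$, so both forms of the logistic function must agree. A small sketch in Python:

```python
import math

def logistic_mu_s(x, mu, s):
    # Original parametrization: p(x) = 1 / (1 + exp(-(x - mu)/s))
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))

def logistic_beta(x, beta0, beta1):
    # Linear parametrization: p(x) = 1 / (1 + exp(-(beta0 + beta1*x)))
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

mu, s = 2.0, 0.5
beta0, beta1 = -mu / s, 1.0 / s   # beta0 = -mu/s, beta1 = 1/s

# p(mu) = 1/2 as required, and both forms give the same values.
assert abs(logistic_mu_s(mu, mu, s) - 0.5) < 1e-12
for x in (-1.0, 0.0, 2.0, 3.5):
    assert abs(logistic_mu_s(x, mu, s) - logistic_beta(x, beta0, beta1)) < 1e-12
```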

The task of the mathematical model is now to find corresponding parameters $\beta_0, \beta_1$ so that the most accurate prediction possible can be made.

### Training the model

Basically, the two parameters in the expression cannot be chosen arbitrarily; optimal parameters must be determined. However, due to the nonlinearity of the exponential function, it is not possible to determine the parameters directly through a simple closed-form computation, as was the case with linear regression.

Classically, iterative numerical methods such as gradient descent are used. These minimize a loss function, which acts as a measure of the error of the statistical model on the training data.

Gradient descent is based on the idea that a function can be minimized by repeatedly choosing a descent direction and following it step by step. To do this, the gradient (the multidimensional derivative) is calculated, and a descent direction is obtained from the gradient information. 6

### The loss function

In the example above, a cross entropy loss is usually used. For a data set $\{ x_i, y_i \}_{i=1}^N$ with input variables $x_i$, corresponding labels $y_i$, and a total of $N$ data points, the binary cross entropy is defined as follows: 7

$L := -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(p(x_i)) + (1-y_i)\log(1 - p(x_i)) \right].$

Note: The factor $\frac{1}{N}$ does not change the location of the minimum; it only serves as a normalization factor.
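The binary cross entropy can be sketched directly from this definition; the `eps` clipping below is an assumption added to guard against $\log(0)$:

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross entropy; eps guards against log(0)."""
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip into (0, 1)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / n

# A confident, correct prediction yields a small loss ...
low = binary_cross_entropy([1, 0], [0.9, 0.1])
# ... while a confident, wrong prediction yields a large one.
high = binary_cross_entropy([1, 0], [0.1, 0.9])
assert low < high
```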

This function is then minimized to find optimal parameters. Such a procedure is also called Maximum Likelihood Estimation. 1
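Putting the pieces together, gradient descent on the binary cross entropy can be sketched as follows. For logistic regression the gradient of the loss takes the simple form $(p(x_i) - y_i) \cdot (1, x_i)$; the toy data, learning rate, and iteration count are illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny 1-D toy data set: the event tends to occur for larger x.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0,   0,   0,   1,   1,   1]

beta0, beta1 = 0.0, 0.0
lr = 0.5  # learning rate (assumed; needs tuning in practice)

# Gradient descent on the mean binary cross entropy.
for _ in range(2000):
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        err = sigmoid(beta0 + beta1 * x) - y  # p(x_i) - y_i
        g0 += err
        g1 += err * x
    beta0 -= lr * g0 / len(xs)
    beta1 -= lr * g1 / len(xs)

# The fitted curve separates the two classes roughly at x = 2.25.
print(beta0, beta1)
```

Note that on a perfectly separable data set like this one, the coefficients keep growing with more iterations; regularization is commonly added in practice to prevent this.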

## Decision boundary

To classify whether the event occurs or not, a decision boundary must be defined. This boundary is used to decide whether the result $p(x_i)$ should be mapped to $0$ or $1$.

As an example, in spam detection the message $x_i$ can be classified as spam if $p(x_i) \geq 0.5$. There are different choices for such decision boundaries; they have to be adapted to the application and the user's requirements. 3
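Such a threshold rule can be sketched in a few lines; the stricter threshold of $0.9$ is an illustrative assumption:

```python
def classify(p, threshold=0.5):
    """Map a predicted probability to a hard 0/1 label."""
    return 1 if p >= threshold else 0

probs = [0.15, 0.48, 0.51, 0.97]

# Default threshold of 0.5:
labels = [classify(p) for p in probs]                 # [0, 0, 1, 1]

# A stricter spam filter might require more confidence:
strict = [classify(p, threshold=0.9) for p in probs]  # [0, 0, 0, 1]
```

Raising the threshold trades false positives for false negatives, which is exactly the kind of adjustment that depends on the application.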

The following properties speak for an implementation of logistic regression for some use cases: 5

1. logistic regression is easier to implement than some other methods.
2. logistic regression works well when the data set is linearly separable.
3. logistic regression provides information about the importance of a particular independent variable. The importance is reflected in the magnitude of the corresponding coefficient. In addition, the sign indicates whether the independent and dependent variables are positively or negatively correlated.

Linear separability in this case means that when visualizing the data set, the classes can be separated by a line.

Note: Correlation does not always indicate a causal relationship, especially if the corresponding sample is very small!

In addition to the advantages of logistic regression, there are also a number of disadvantages that come with its implementation: 5

1. logistic regression cannot be used to predict a continuous variable, only a binary variable.
2. logistic regression assumes a linear relationship between the independent variables and the log-odds of the dependent variable. This is a limitation because our environment is often non-linear; while a linear function often approximates such relationships well, it also often introduces error and thus cannot necessarily make perfect predictions.
3. like all machine learning methods, it requires a sufficiently large set of data points. The necessary amount depends on various properties of the data set, such as the number of input variables.

## Versions of logistic regression

Logistic regression can be used for several different kinds of classification. These are characterized in the following. 2 3

### Binary logistic regression

In binary logistic regression, the dependent variable can only take on the values $0$ and $1$.

### Multinomial logistic regression

A multinomial logistic regression involves a dependent variable with several categories that have no natural order.

Example: Preference of food types vegan, vegetarian or non-vegetarian.

### Ordinal logistic regression

An ordinal logistic regression involves a dependent variable with several categories that are related to each other and have a natural order.

Example: Rating of a video from one to five stars.

## Application examples of logistic regression

Logistic regression has many real world applications: 8 9

1. medicine: understanding the causes when patients suffer from various diseases,
2. text editing: editing documents to transform them into a consistent format,
3. hotel bookings: predicting which hotels are likely to be booked by a particular user.