The texts in this article were partly composed with the help of artificial intelligence and subsequently corrected and revised by us.
Introduction
In today’s society, we often work in digital social media applications or communicate via digital communication interfaces such as email. This generates large amounts of data that can often be classified or from which information can be extracted. For this purpose, special text classifiers are used, such as the Naive Bayes classifier. In this article I would like to talk about it and introduce this algorithm.
Mathematical Basics
The Naive Bayes classifier is based on results from probability theory. Our first assumption, which gives the algorithm part of its name, comes directly from this field: the algorithm assumes that all features in the data set contribute to the label independently of one another. Such an assumption is called “naive”, which is why this algorithm is known as the Naive Bayes classifier. ^{1} ^{2}
This assumption is rarely true in reality. Despite this limitation, the algorithm achieves good results in many applications.
Bayes’ theorem is the other part of the algorithm that gave it its name. It states that we can calculate the probability of event $A$, given that event $B$ has occurred, as follows:
$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}.$
For the theorem to be valid, $P(B)>0$ must be satisfied. ^{3} ^{4}
For the classification with the Naive Bayes Classifier, we further assume that we have a set of classes $C := \{ c_1, c_2, \dots \}$ and our individual data points as a feature vector $X \in \mathbb{R}^n$. The number of classes in our data set is denoted as $\vert C \vert$. The algorithm first computes the probability for each class from the feature values in $X$, represented as $P(c \vert X)$. This probability is computed using Bayes’ theorem:
$P(c \mid X) = \frac{P(X \mid c) \cdot P(c)}{P(X)}.$
In this equation, we have the following probabilities:

$P(X \mid c)$ is the likelihood of observing the feature vector $X$ given the class $c \in C$,

$P(c)$ is the prior probability of class $c$, and

$P(X)$ is a normalization constant, ensuring that summing $P(c \mid X)$ over all classes in $C$ yields a total of $1$.
To calculate the likelihood $P(X \mid c)$, we now invoke the naive assumption that all features are conditionally independent of each other given the class. With this assumption, the likelihood factorizes as
$P(X \mid c) = \prod_{i=1}^n P(x_i \mid c).$
We multiply the individual feature probabilities to obtain the total likelihood of the feature vector. In this equation, the $i$-th component of the feature vector $X$ is denoted by $x_i$.
An alternative approach to discrete feature classification is based on a multinomial distribution to model the probabilities of the feature values. In this case, we also need the value $k_i$, which represents the number of occurrences of the $i$th feature value, and the binomial coefficient. By utilizing this approach, we are left with
$P(X \mid c) = \prod_{i=1}^n \binom{x_i}{k_i} \cdot P(k_i \mid c).$
For the a priori probability $P(c)$ with $c \in C$, we generally assume that the classes are equally distributed. Under this uniform prior, the probability of each class is
$P(c) = \frac{1}{\vert C \vert}.$
To be able to calculate $P(c \mid X)$ in the next step, we still need the normalization constant $P(X)$. This can be calculated with the following equation:
$P(X) = \sum_{c \in C} P(X \mid c) \cdot P(c).$
Finally, to make a prediction about the class membership of a feature vector $X^∗$, we simply perform the following calculation and obtain our most likely class $c$: ^{1} ^{5}
$c = \argmax_{c' \in C} P(c' \mid X^*).$
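To make these formulas concrete, here is a tiny hand-computed sketch with two invented classes and two binary features; all numbers are made up purely for illustration:

```python
# Assumed priors P(c) and per-feature likelihoods P(x_i = 1 | c)
# for two hypothetical classes. All values are invented.
priors = {"spam": 0.5, "ham": 0.5}
likelihoods = {
    "spam": [0.8, 0.1],  # P(x_1 = 1 | spam), P(x_2 = 1 | spam)
    "ham":  [0.2, 0.7],
}

x = [1, 0]  # the observed feature vector X*

# P(X | c): product over the features (naive independence assumption)
def class_likelihood(c):
    p = 1.0
    for xi, pi in zip(x, likelihoods[c]):
        p *= pi if xi == 1 else (1 - pi)
    return p

# Normalization constant P(X) = sum over classes of P(X | c) * P(c)
evidence = sum(class_likelihood(c) * priors[c] for c in priors)

# Posterior P(c | X) via Bayes' theorem, then the argmax over classes
posteriors = {c: class_likelihood(c) * priors[c] / evidence for c in priors}
prediction = max(posteriors, key=posteriors.get)
```

With these invented numbers the posterior mass concentrates on "spam", because the first feature is both present in $X^*$ and far more likely under that class.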
Application example
In the following, I would like to introduce the Naive Bayes classifier using a natural language processing example. We use the dataset ag_news_subset, which contains news articles from ~2000 news sources and divides them into four different classes. The classification is to be performed by an algorithm trained by us with the highest possible accuracy. ^{6} For this purpose we use a Naive Bayes classifier.
We start with all relevant imports for the following code:
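A minimal set of imports for the pipeline sketched below, assuming scikit-learn for the vectorizer and classifier (the dataset itself would come from TensorFlow Datasets):

```python
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# The ag_news_subset dataset is available via TensorFlow Datasets:
# import tensorflow_datasets as tfds
```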
Then we download the dataset. Immediately after the download, we can edit the dataset to access all the features and labels.
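A sketch of this step, assuming the `ag_news_subset` build from TensorFlow Datasets. Because the download is environment-dependent, the `tfds` calls are shown as comments, and a tiny invented stand-in corpus with the same structure (four classes, labels 0–3) is used in the snippets that follow:

```python
# With TensorFlow Datasets installed, the download would look like:
# import tensorflow_datasets as tfds
# train_ds, test_ds = tfds.load("ag_news_subset",
#                               split=["train", "test"],
#                               as_supervised=True)
# train_texts = [t.numpy().decode("utf-8") for t, _ in train_ds]
# train_labels = [int(l.numpy()) for _, l in train_ds]

# Tiny invented stand-in corpus (labels mirror the four AG News classes):
train_texts = [
    "government announces new trade policy",
    "team wins the championship final",
    "company reports record quarterly profit",
    "researchers discover new particle",
]
train_labels = [0, 1, 2, 3]
test_texts = ["stocks rise after profit report"]
test_labels = [2]
```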
In the next step, we use a vectorizer to convert all the documents in the data set into a numerical representation. This is necessary because machine learning algorithms cannot process words directly. Instead, the documents are embedded into high-dimensional vector spaces by a vectorizer and can be processed in this form.
It is important to note that we fit the vectorizer on all documents. The reason is that the test dataset may contain words that do not occur in the training dataset; depending on the implementation, this would either lead to errors or cause potentially interesting data points to be ignored. Since we want to avoid this, we first combine all data and then pass it to the
vectorizer.fit_transform([...])
function.
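This step can be sketched as follows with scikit-learn's CountVectorizer, fitted on the combined documents as described; the documents here are invented placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented placeholder documents
train_texts = ["good game tonight", "markets fall sharply", "new science results"]
test_texts = ["markets rally", "great game"]

vectorizer = CountVectorizer()

# Fit on ALL documents so no test-set word is missing from the vocabulary,
# then slice the resulting matrix back into train and test parts.
all_docs = train_texts + test_texts
matrix = vectorizer.fit_transform(all_docs)
X_train = matrix[: len(train_texts)]
X_test = matrix[len(train_texts):]
```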
More information regarding the CountVectorizer can be found in the scikit-learn documentation.
Before training the classifier, the features are converted to an array representation. Due to its implementation, the vectorizer always returns sparse matrices. However, since our classifier does not accept sparse matrices, it is necessary to convert them to a dense matrix.
In many cases, returning a sparse matrix makes sense: the results are often very high-dimensional and would otherwise require large amounts of memory. Sparse matrices avoid this by, in principle, storing only the non-zero elements together with their positions instead of storing all elements.
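The conversion can be sketched like this: fit_transform returns a scipy.sparse matrix, and toarray() produces the dense representation the classifier expects (the documents are invented placeholders):

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

docs = ["one two two", "three one"]  # invented placeholder documents

# CountVectorizer returns a scipy sparse matrix: only the non-zero
# counts and their positions are stored.
sparse_matrix = CountVectorizer().fit_transform(docs)

# A classifier that cannot consume sparse input needs the dense form:
dense_matrix = sparse_matrix.toarray()
```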
In the second to last step, we train the classifier on the training data set. This can be done in the following way:
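A sketch of the training call, assuming a GaussianNB classifier (consistent with the remark above that the classifier needs dense input); texts and labels are invented placeholders:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Invented placeholder training data
train_texts = ["good game tonight", "markets fall sharply"]
y_train = np.array([1, 2])

# Vectorize and densify, as in the previous steps
X_train = CountVectorizer().fit_transform(train_texts).toarray()

# Fit the Naive Bayes classifier on the training data
clf = GaussianNB()
clf.fit(X_train, y_train)
```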
In the last step, the accuracy of the classifier can be determined for both the training and the test data set.
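The evaluation step can be sketched end-to-end as follows, again on an invented placeholder corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Invented placeholder corpus and labels
train_texts = ["good game tonight", "markets fall sharply", "team wins final"]
test_texts = ["markets sharply down"]
y_train = np.array([1, 2, 1])
y_test = np.array([2])

# Fit the vectorizer on all documents, then split and densify
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(train_texts + test_texts)
X_train = matrix[: len(train_texts)].toarray()
X_test = matrix[len(train_texts):].toarray()

clf = GaussianNB().fit(X_train, y_train)

# Accuracy on both the training and the test set
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
```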
When the code was executed, the accuracy was $98.63\,\%$ on the training data set and $78.75\,\%$ on the test data set. This shows that the algorithm can also handle more complex tasks.
Readers running the identical script will likely obtain different results. Due to the limitations of our hardware, we could only perform the classification on part of the dataset; memory in particular is the limiting factor for processing the entire dataset.
Advantages and disadvantages of the Naive Bayes classifier
There are a number of advantages and disadvantages to consider when using a Naive Bayes classifier.
Advantages
The advantages of the Naive Bayes classifier can be described as follows: ^{2}

Efficient training: The algorithm’s low computational complexity allows it to process large amounts of data quickly. By assuming that each feature is independent, the amount of training data required can theoretically be reduced.

Fast execution: Due to the simplicity of the algorithm, class membership can be predicted quickly after training.

Scalability: The Naive Bayes algorithm is well suited to handle and perform well on high-dimensional data. It can also handle both discrete and continuous data.

Robustness: The Naive Bayes classifier is an algorithm that is not as affected by outliers as other algorithms.
Disadvantages
The great advantage of Naive Bayes lies in the simplicity and efficiency of the learning process, which rests on the assumption that all features are independent of each other. Although this often yields robust results in practice, it is far from optimal, especially in areas that deal with natural language, where word order is relevant. Treating the features independently is possible in this sense, but will generally not lead to an optimal solution.
If complex correlations in the data are to be learned, it may make sense to switch to other algorithms. Examples of algorithms with potentially better solutions include: ^{2} ^{7}

Decision Trees

Random Forests

Support Vector Machines or

Neural Networks.
Areas of application
The Naive Bayes classifier can be used in data-driven environments in various domains. Some examples in this regard are:

spam detection,

message classification, or

sentiment analysis.
Due to the low complexity of the algorithm, it quickly finds good solutions in the data. ^{1} ^{2}