Graphics from Cristofer Maximilian - https://unsplash.com/de/@cristofer

Text Classification with the Naive Bayes Text Classifier

The Naive Bayes classifier is a simple and efficient algorithm for classification tasks that assumes independence of the features. This post aims to introduce this algorithm to the reader.

Henrik Bartsch

The texts in this article were partly composed with the help of artificial intelligence and were corrected and revised by us.

Introduction

In today’s society, we frequently interact on social media platforms or communicate via digital channels such as email. This generates large amounts of data that can often be classified or from which information can be extracted. Special text classifiers, such as the Naive Bayes classifier, are used for this purpose. In this article I would like to introduce this algorithm.

Mathematical Basics

The Naive Bayes classifier is based on results from probability theory. Our first assumption, which gives the algorithm part of its name, comes directly from this field: the algorithm assumes that all features in the data set contribute to the label independently of each other. Such an assumption is called “naive”, which is why the algorithm is also known as the Naive Bayes classifier. 1 2

This assumption is rarely true in reality. Despite this limitation, the algorithm achieves good results in many applications.

Bayes’ theorem is the other part of the algorithm that gave it its name. It states that we can calculate the probability of event A, given that event B has occurred, as follows:

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}.

For the theorem to be valid, P(B) > 0 must be satisfied. 3 4

For classification with the Naive Bayes classifier, we further assume that we have a set of classes C := \{ c_1, c_2, \dots \} and that each individual data point is given as a feature vector X \in \mathbb{R}^n. The number of classes in our data set is denoted by \vert C \vert. The algorithm first computes the probability of each class from the feature values in X, written as P(c \vert X). This probability is computed using Bayes’ theorem:

P(c|X) = \frac{P(X|c) \cdot P(c)}{P(X)}.

In this equation, we have the following probabilities:

  1. P(X|c) is the probability of observing the feature vector X given that the data point belongs to class c \in C (the likelihood),

  2. P(c) is the prior probability of class c, and

  3. P(X) is a normalization constant which ensures that summing P(c|X) over all classes \{ c_1, \dots, c_{\vert C \vert} \} yields a total value of 1.

To calculate the probability P(X \vert c), we now use the assumption that all features are conditionally independent of each other given the class. With this assumption, the likelihood factorizes as

P(X|c) = \prod_{i=1}^n P(x_i|c).

We multiply the individual feature probabilities to obtain the total probability of the feature vector. In this equation, the i-th component of the feature vector X is denoted by x_i.

An alternative approach for discrete features is based on a multinomial distribution to model the probabilities of the feature values. In this case, we also need the value k_i, which represents the number of occurrences of the i-th feature value, and the binomial coefficient. With this approach, we obtain

P(X|c) = \prod_{i=1}^n \binom{x_i}{k_i} \cdot P(k_i|c).

For the prior probability P(c) with c \in C, we assume here that the classes are equally distributed. Under this assumption, each class has the same probability, so we can calculate P(c) as

P(c) = \frac{1}{|C|}.

To calculate P(c|X) in the next step, we still need the normalization constant P(X). It can be computed with the following equation:

P(X) = \sum_{c \in C} P(X|c) \cdot P(c).

Finally, to make a prediction about the class membership of a feature vector X^*, we simply perform the following calculation and obtain the most likely class c: 1 5

c = \argmax_{c' \in C} P(c'|X^*)
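
As a small illustration of these formulas, here is a minimal NumPy sketch with made-up numbers: two classes, three binary features, and hand-picked likelihoods P(x_i = 1 | c). All values are purely illustrative and not taken from any real data set.

import numpy as np

# Hypothetical likelihoods P(x_i = 1 | c) for |C| = 2 classes and n = 3 binary features
likelihood_one = np.array([[0.8, 0.1, 0.6],   # class c_1
                           [0.2, 0.7, 0.3]])  # class c_2

x = np.array([1, 0, 1])  # observed feature vector X*

# P(X|c) = prod_i P(x_i|c), with P(x_i = 0 | c) = 1 - P(x_i = 1 | c)
p_x_given_c = np.prod(np.where(x == 1, likelihood_one, 1.0 - likelihood_one), axis=1)

p_c = np.full(2, 1 / 2)              # uniform prior P(c) = 1 / |C|
p_x = np.sum(p_x_given_c * p_c)      # normalization constant P(X)
posterior = p_x_given_c * p_c / p_x  # P(c|X) via Bayes' theorem

print(posterior)             # [0.96 0.04]
print(np.argmax(posterior))  # 0, i.e. c_1 is the most likely class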

Application example

In the following, I would like to introduce the Naive Bayes classifier using a natural language processing example. We use the dataset ag_news_subset, which contains news articles from roughly 2000 news sources and divides them into four different classes. Our goal is to train an algorithm that performs this classification with the highest possible accuracy. 6 For this purpose we use a Naive Bayes classifier.

We start with all relevant imports for the following code:

naive_bayes.py
import tensorflow_datasets as tfds
import tensorflow as tf
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

Then we download the dataset. Immediately after the download, we can edit the dataset to access all the features and labels.

naive_bayes.py
text_data = tfds.load("ag_news_subset")

# Convert each split to a dataframe once and extract the article texts and labels
train_df, test_df = tfds.as_dataframe(text_data["train"]), tfds.as_dataframe(text_data["test"])

train_text, train_y = train_df["description"].values, train_df["label"].values
test_text, test_y = test_df["description"].values, test_df["label"].values

In the next step, we use a vectorizer to convert all the documents in the data set into a numerical representation. This is necessary because machine learning algorithms cannot process words directly. Instead, texts are embedded into high-dimensional vector spaces by a vectorizer and processed in this form.

It is important to note that we fit the vectorizer on all documents. The reason is that the test dataset may contain words that do not appear in the training dataset; depending on the implementation, this would either lead to errors or cause potentially interesting data points to be ignored. Since we do not want this, we first combine all data and then pass it to the vectorizer.fit_transform([...]) function.

naive_bayes.py
text_total = np.concatenate([train_text, test_text])
vectorizer = CountVectorizer()
tokens = vectorizer.fit_transform(text_total)

More information regarding the CountVectorizer can be found in the scikit-learn documentation.
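
To see what the CountVectorizer actually produces, here is a small, self-contained toy example; the two sentences are invented and independent of the news dataset.

from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = [
    "the match was won in the last minute",
    "the stock price fell after the report",
]

toy_vectorizer = CountVectorizer()
toy_tokens = toy_vectorizer.fit_transform(toy_corpus)

print(toy_vectorizer.get_feature_names_out())  # learned vocabulary
print(toy_tokens.toarray())                    # word counts, one row per document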

Before training the classifier, the features are converted to a dense array representation. Due to the implementation of the vectorizer, the result is always a sparse matrix; however, since our classifier does not accept sparse matrices, it is necessary to convert it to a dense (non-sparse) matrix. At this point we also split the combined matrix back into its training and test parts.

In many cases it makes sense to return a sparse matrix, because the resulting document-term matrices are very high-dimensional and would otherwise require large amounts of memory. Sparse matrices avoid this by storing, in principle, only the non-zero elements together with their positions instead of storing all elements.

naive_bayes.py
# Split the combined document-term matrix back into its training and test parts
train_x = tokens[:len(train_text)].toarray()
test_x = tokens[len(train_text):].toarray()
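
Should the dense conversion above exceed the available memory, one possible alternative (a sketch, not part of the original script) is scikit-learn's MultinomialNB, which accepts the sparse count matrix directly and therefore skips the .toarray() step; it mirrors the training and evaluation steps shown further below.

from sklearn.naive_bayes import MultinomialNB

# MultinomialNB works directly on the sparse count matrix,
# so no conversion with .toarray() is required.
train_x_sparse = tokens[:len(train_text)]
test_x_sparse = tokens[len(train_text):]

multinomialNB = MultinomialNB().fit(train_x_sparse, train_y)
print(multinomialNB.score(test_x_sparse, test_y))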

In the second to last step, we train the classifier on the training data set. This can be done in the following way:

naive_bayes.py
gaussianNB = GaussianNB()
gaussianNB = gaussianNB.fit(train_x, train_y)

In the last step, the accuracy of the classifier can be determined for both the training and the test data set.

naive_bayes.py
print(gaussianNB.score(train_x, train_y))
print(gaussianNB.score(test_x, test_y))

When we executed the code, the accuracy was 98.63% on the training data set and 78.75% on the test data set. This shows that the algorithm can also handle more complex tasks.
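
Beyond the raw accuracy, a per-class breakdown can be informative. The following is a possible sketch using scikit-learn's classification_report; it is not part of the original script.

from sklearn.metrics import classification_report

# Per-class precision, recall and F1 score on the test split
test_pred = gaussianNB.predict(test_x)
print(classification_report(test_y, test_pred))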

If you run the identical script yourself, you will likely obtain different results: due to the limitations of our hardware, it was only possible to perform the classification on part of the dataset. Memory in particular is the limiting factor for processing the entire dataset.
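
One possible way to work around this memory limitation is to train on a random subset of the training documents. The following is only a sketch; the subset size n_subset is a hypothetical parameter that would have to be tuned to the available hardware.

import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the subset is reproducible
n_subset = 20_000                # hypothetical size, adjust to the available memory

# Draw a random subset of the training rows before densifying them
subset_idx = rng.choice(len(train_text), size=n_subset, replace=False)
train_x_small = tokens[:len(train_text)][subset_idx].toarray()
train_y_small = train_y[subset_idx]

gaussianNB_small = GaussianNB().fit(train_x_small, train_y_small)
print(gaussianNB_small.score(test_x, test_y))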

Advantages and disadvantages of the Naive Bayes classifier

There are a number of advantages and disadvantages to consider when using a Naive Bayes classifier.

Advantages

The advantages of the Naive Bayes classifier can be described as follows: 2

  1. Efficient training: The algorithm’s low computational complexity allows it to process large amounts of data quickly. By assuming that each feature is independent, the amount of training data required can theoretically be reduced.

  2. Fast execution: Due to the simplicity of the algorithm, class membership can be predicted quickly after training.

  3. Scalability: The Naive Bayes algorithm scales well to high-dimensional data and performs well on it. It can also handle both discrete and continuous data.

  4. Robustness: The Naive Bayes classifier is less affected by outliers than many other algorithms.

Disadvantages

The great advantage of Naive Bayes lies in the simplicity and efficiency of the learning process, which rests on the assumption that all features are independent of each other. Although this often leads to robust results in practice, it is far from optimal, especially in domains such as natural language processing where word order is relevant. Treating the features independently in this sense is possible, but will generally not yield an optimal solution.

If complex correlations in the data are to be learned, it may make sense to switch to other algorithms. Examples of algorithms with potentially better solutions include: 2 7

  1. Decision Trees

  2. Random Forests

  3. Support Vector Machines or

  4. Neural Networks.

Areas of application

The Naive Bayes classifier can be used in data-driven environments in various domains. Some examples in this regard are:

  1. spam detection,

  2. message classification, or

  3. sentiment analysis.

Due to its low computational complexity, the algorithm quickly finds good solutions on such data. 1 2

Sources

Footnotes

  1. wikipedia.org

  2. guru99.com

  3. wikipedia.org

  4. wolfram.com

  5. arxiv.org

  6. di.unipt.it

  7. opengenus.org