Graphics from Shubham Dhage – https://unsplash.com/@theshubhamdhage

Introduction to Semi Supervised Learning

Semi supervised learning is one of the four main branches of machine learning. This post gives an introduction to the topic.

Henrik Bartsch

Classification

Machine learning has attracted much attention in recent years. Whether the task is classification or regression, machine learning is characterized by its ability to handle large amounts of data. Supervised learning often relies on large amounts of labeled data for this purpose. 1

Labels are the categories or target values assigned to the individual data points.

But what if large amounts of labeled data are not available, or generating the labels is too expensive?


Definition and tasks of semi supervised learning

Semi supervised learning refers to models that are trained on a combination of labeled and unlabeled data. It therefore sits between supervised learning, which uses only labeled data, and unsupervised learning, which uses only unlabeled data, to produce a result or a prediction. 2 3

The task of a semi supervised learning algorithm can thus be interpreted as transferring the label structure found in the labeled data set to a possibly much larger set of unlabeled data.

Semi supervised learning is used where generating labels is expensive or where only a few labels exist in the first place. Unlike unsupervised learning, where the model has no labels and must discover patterns or relationships in the data on its own, a semi supervised model can use the few available labels as a guide for the unlabeled data. 4
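The distinction between the three settings can be made concrete with a small sketch. The data, the class names "low" and "high", the helper `learning_setting` and the use of `None` as a marker for a missing label are assumptions made for illustration only.

```python
def learning_setting(labels):
    """Name the learning setting from how many labels are known (None = unlabeled)."""
    known = sum(label is not None for label in labels)
    if known == len(labels):
        return "supervised"
    if known == 0:
        return "unsupervised"
    return "semi supervised"


# Toy data set: each entry is (feature, label); only four of ten points are labeled.
data = [
    (0.9, "low"), (1.1, "low"), (4.8, "high"), (5.2, "high"),
    (1.0, None), (1.3, None), (4.9, None), (5.5, None), (0.7, None), (5.1, None),
]

# Split the data into the part with labels and the part without.
labeled = [(x, y) for x, y in data if y is not None]
unlabeled = [x for x, y in data if y is None]

print(learning_setting([y for _, y in data]))  # semi supervised
print(len(labeled), "labeled,", len(unlabeled), "unlabeled")
```

A semi supervised algorithm would train on `labeled` and then use `unlabeled` to refine the model.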


Advantages of semi supervised learning

Semi supervised learning is useful when a large amount of labeled data is difficult or expensive to obtain, but a significant amount of unlabeled data is available to improve model performance. Because most of the data does not have to be labeled, preparing the training data is less expensive than for classical supervised learning problems. 5

Disadvantages and challenges of semi supervised learning

However, a semi supervised model may be more prone to errors or overfitting than a model trained on a fully labeled data set.

Some of the challenges in semi supervised learning are: 5 6

  1. Selecting appropriate labeled and unlabeled data: The labeled and unlabeled data used for training must be selected carefully, because their quality and relevance significantly affect the performance of the model. With poorly chosen data, a supervised model trained on a fully labeled data set will perform significantly better than a model trained with pseudo-labeling.

  2. Lack of supervision: Since only part of the training data carries labels, the model may be more prone to errors or overfitting than a model trained on labeled data only.

  3. Label noise: In some cases, the labeled data contain incorrect or noisy labels that negatively affect the performance of the model.


Assumptions on the data

A semi supervised learning algorithm makes the following assumptions on the data: 7

  1. Continuity Assumption: Data points that are close to each other have a high probability of sharing the same label.
  2. Cluster Assumption: The data set can be divided into discrete clusters. Points in the same cluster are more likely to have the same label.
  3. Manifold Assumption: The data lie approximately on a manifold of much lower dimension than the input space.

The manifold assumption makes it possible to define distance and density measures on the manifold itself.
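The continuity assumption can be illustrated with a nearest-neighbor rule. The following is a minimal sketch rather than a full semi supervised algorithm: the points, the labels "A" and "B", the helper `nearest_label` and the `max_distance` cutoff are all made up for illustration.

```python
import math

# Four labeled points forming two groups, plus three unlabeled points.
labeled = [((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((3.0, 3.0), "B"), ((3.1, 2.8), "B")]
unlabeled = [(0.1, 0.3), (2.9, 3.2), (1.5, 1.5)]


def nearest_label(point, labeled_points, max_distance):
    """Return the label of the closest labeled point, or None if it is too far away."""
    neighbor, label = min(labeled_points, key=lambda item: math.dist(point, item[0]))
    if math.dist(point, neighbor) <= max_distance:
        return label
    return None


guesses = [nearest_label(p, labeled, max_distance=1.0) for p in unlabeled]
print(guesses)  # ['A', 'B', None]
```

The third point lies between the two groups and is not close to any labeled point, so the sketch leaves it unlabeled instead of guessing.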


Sequence of the learning process

Models in semi supervised learning are commonly trained with a procedure known as self-training. It exploits the fact that part of the data is labeled from the start. The training can be divided into three phases:

  1. Training of a supervised learning model on the already labeled data.
  2. Application of pseudo-labeling: the partially trained model makes predictions for the unlabeled data points.
  3. The most confident predictions are added to the labeled data set as pseudo-labels.

The resulting algorithm can be iterative, i.e. executed multiple times, or it can assign an accepted pseudo-label to all unlabeled data in a single pass. In general, iterative algorithms are used, and their performance typically increases with each iteration. 5

The labels generated in the second step are called pseudo-labels because they are predicted by a model that was trained on the labeled data set rather than assigned by an annotator. Since that model is limited by the labeled data it has seen, the pseudo-labels are not guaranteed to be correct.
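The three phases can be sketched in plain Python. This is only an illustration under stated assumptions: a nearest-centroid classifier on 1-D data stands in for the supervised model, the confidence score is a simple distance ratio, and the names `fit_centroids`, `predict_with_confidence`, `self_train`, `threshold` and `max_rounds` are hypothetical.

```python
def fit_centroids(labeled):
    """Phase 1: 'train' a tiny supervised model, here one mean value per class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_confidence(centroids, x):
    """Phase 2: pseudo-label x; confidence lies between 0.5 and 1.0 (needs >= 2 classes)."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    d_best = abs(x - centroids[ranked[0]])
    d_next = abs(x - centroids[ranked[1]])
    total = d_best + d_next
    confidence = d_next / total if total > 0 else 0.5
    return ranked[0], confidence


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=10):
    """Repeat the three phases until no confident prediction is left."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(labeled)
        confident, remaining = [], []
        for x in unlabeled:
            label, confidence = predict_with_confidence(centroids, x)
            if confidence >= threshold:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:
            break
        labeled.extend(confident)  # Phase 3: extend the labeled set with pseudo-labels.
        unlabeled = remaining
    return fit_centroids(labeled), labeled, unlabeled


seed = [(1.0, "low"), (5.0, "high")]
pool = [1.2, 0.8, 4.7, 5.4, 2.0, 3.0]
model, final_labeled, still_unlabeled = self_train(seed, pool)
print(still_unlabeled)  # [2.0, 3.0]
```

Points that fall between the two classes never reach the confidence threshold and stay unlabeled, which is the intended behavior of the threshold in this sketch.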

It can be problematic if the classes in a classification task are not evenly distributed, because the performance of the model can then suffer. This is difficult to rule out for unlabeled data, since their class distribution is usually only roughly known.
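One simple way to notice such a shift is to compare the class shares of the original labels with those of the pseudo-labels. The sketch below assumes that the labeled set reflects the intended class balance; the helpers `class_shares` and `drifted_classes`, the `tolerance` value and the "cat"/"dog" data are hypothetical.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of the data set that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def drifted_classes(original_labels, pseudo_labels, tolerance=0.2):
    """List classes whose share differs by more than `tolerance` between the two sets."""
    before = class_shares(original_labels)
    after = class_shares(pseudo_labels)
    classes = set(before) | set(after)
    return sorted(
        label for label in classes
        if abs(after.get(label, 0.0) - before.get(label, 0.0)) > tolerance
    )


original = ["cat"] * 5 + ["dog"] * 5   # balanced labeled set
pseudo = ["cat"] * 18 + ["dog"] * 2    # pseudo-labels heavily favor one class
print(drifted_classes(original, pseudo))  # ['cat', 'dog']
```

A non-empty result is only a warning sign: it may indicate a biased model, or it may reflect a genuinely different class balance in the unlabeled data.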


Applications of semi supervised learning

Semi supervised learning is widely applicable because it places low requirements on the amount of labeled data. Examples of applications can be found in the following areas, among others:

  1. natural language processing (NLP)
  2. computer vision
  3. anomaly detection

Furthermore, semi supervised learning can support many supervised learning tasks by making their labeled training data cheaper and faster to produce. 6


The text in this post was generated in parts by OpenAI’s ChatGPT and corrected and revised by us.

Sources

Footnotes

  1. wikipedia.org

  2. wikipedia.org

  3. wikipedia.org

  4. blog.roboflow.com

  5. machinelearningpro.org

  6. deepai.org

  7. geeksforgeeks.org