An Introduction to Unsupervised Learning

Context

Machine Learning has attracted much attention in recent years. Whether the task is classification or regression, machine learning is known for handling large amounts of data well. Supervised Learning typically relies on large amounts of labeled data for this purpose. ¹

Labels are class assignments or other target values that are attached to each data point.

But what if labels are not available in large quantities, or generating appropriate labels is too expensive?

Definition of Unsupervised Learning

Unsupervised Learning is a category of machine learning in which relations are to be derived from completely unlabeled data. The idea is to build an accurate representation of the environment by imitation, which is also an important part of the learning process in humans. During training, the algorithm is supposed to recognize patterns and relations in the data on its own; the “training” in such algorithms occurs without any supervision. ² ³

Advantages of Unsupervised Learning

Unsupervised Learning has many application areas, especially in exploratory data analysis. Its advantages are: ⁴

unlabeled data is significantly easier to obtain than manually labeled data,
it finds previously unknown patterns in arbitrary data sets,
it helps the user find new criteria for classification, or identify unimportant features so that the dimension of a dataset can be reduced without losing much information,
when applied to a dataset, training takes place in real time rather than having to happen beforehand.

Disadvantages of Unsupervised Learning

Besides its many advantages, Unsupervised Learning also has some disadvantages:

higher complexity due to large amounts of data,
long training times,
higher risk of inaccurate results,
results may need to be validated, and the training adjusted, if they are not usable as they are,
lack of transparency about how results were generated.

Ultimately, Unsupervised Learning algorithms are powerful tools for determining relationships within datasets. However, the absence of data labels creates difficulties that must be considered. ⁵

Tasks of Unsupervised Learning

Unsupervised Learning can be used to solve a number of tasks. A list of such tasks follows.

Clustering

In Clustering, the goal is to divide data with unknown groupings into groups such that the differences between elements of different groups are as large as possible, while the elements within each group are as similar as possible. Since these are not “classes” in the actual sense but only assumed relations, each resulting group is called a cluster. The individual clusters are not given in advance; they emerge dynamically at run time.

It should be clear to the user that the algorithm creates these relationships itself and that they are not necessarily easy to interpret. This is a clear difference from Supervised Learning, where the user specifies the expected result in advance. ² ⁶

Given both dog and cat photos, the algorithm may form two clusters: one for dog photos and one for cat photos. However, this is not guaranteed; the algorithm may just as well cluster the photos by coat color.
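
To make the idea concrete, here is a minimal sketch of one well-known clustering algorithm, k-means, written with the Python standard library only. The choice of k-means, the toy 2D points, and the use of the first k points as initial centroids are assumptions made for illustration; the text above does not prescribe a particular algorithm.

```python
import math


def kmeans(points, k, iterations=20):
    """Naive k-means on a list of equal-length numeric tuples."""
    # Assumption: the first k points are used as initial centroids. This keeps
    # the sketch deterministic but is not a good general-purpose initialization.
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return assignments, centroids


# Hypothetical toy data: two visibly separated groups of 2D points.
points = [(1.0, 1.1), (5.0, 5.2), (0.9, 1.0), (5.1, 4.9), (1.2, 0.8), (4.8, 5.0)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

Note that the number of clusters k is supplied by the user, while the cluster membership itself is found by the algorithm. The cluster indices carry no meaning of their own, which matches the point above that clusters are assumed relations rather than predefined classes.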

Association

Another method is Association. Here, data points are linked to other data points via certain attributes. The task of the algorithm is to find objects that are related to each other, even though they need not be of the same kind. To return to the dog photo example: in Association, the Unsupervised Learning algorithm would not group all dogs together, but would associate, for example, a leash with the dog. ⁶
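
A common way to quantify such associations is through the support and confidence of item combinations. The following sketch assumes hypothetical shopping baskets and helper names (`support`, `confidence`) chosen for illustration; it is not a full association rule mining algorithm such as Apriori.

```python
from itertools import combinations

# Hypothetical transactions: each basket is a set of items bought together.
baskets = [
    {"dog food", "leash"},
    {"dog food", "leash", "toy"},
    {"cat food", "toy"},
    {"dog food", "toy"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated share of baskets with the antecedent that also hold the consequent.

    Assumes the antecedent occurs at least once; otherwise this divides by zero.
    """
    return support(antecedent | consequent, transactions) / support(
        antecedent, transactions
    )


# Support of every item pair that can be formed from the observed items.
items = sorted(set().union(*baskets))
pair_support = {
    pair: support(set(pair), baskets) for pair in combinations(items, 2)
}

# How often does a basket containing a leash also contain dog food?
leash_to_dog_food = confidence({"leash"}, {"dog food"}, baskets)
print(pair_support[("dog food", "leash")], leash_to_dog_food)
```

In this toy data, every basket that contains a leash also contains dog food, so the rule "leash implies dog food" has high confidence even though the two items are not the same kind of object.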

Dimensionality Reduction

Many analyses and datasets, for example in supervised learning, work with high-dimensional data. Datasets with many input variables often yield better results, but they also suffer from slower training.

The goal of Dimensionality Reduction is to remove input variables from the data set that have little or no information content. As a result, the higher dimensional data set is reduced to a lower dimensional data set, which is easier to handle. It is important not to remove too many input variables, so that the data set still contains the most important information and the actual task is not made harder.

For more information regarding the problem of large dimensions of input variables in machine learning, see wikipedia.org or builtin.com.

Additionally, reducing the dimensionality makes the dataset easier to visualize. ² ⁵ ⁷
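
One very simple way to remove input variables with little information content is a low-variance filter: columns whose values barely change are dropped. The sketch below is an illustration under assumptions (the threshold, the toy rows, and the function name `drop_low_variance` are hypothetical). Variance depends on the scale of each column, so in practice features would usually be normalized first, and methods such as PCA are often used instead.

```python
from statistics import pvariance


def drop_low_variance(rows, threshold):
    """Keep only the columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [tuple(row[i] for i in keep) for row in rows]
    return reduced, keep


# Hypothetical data set: the middle column is almost constant.
rows = [
    (1.0, 0.50, 10.0),
    (2.0, 0.50, 30.0),
    (3.0, 0.51, 20.0),
    (4.0, 0.50, 40.0),
]
reduced, kept = drop_low_variance(rows, threshold=0.01)
print(kept)
print(reduced)
```

Here the three-dimensional data set is reduced to two dimensions, which could then be shown directly in a scatter plot.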

Application examples of Unsupervised Learning

Because it finds relationships in data independently, Unsupervised Learning is used in many domains and for a variety of tasks. Examples include: ⁶ ⁴ ⁵

marketing: clustering can be used to compile groups of people who are distinguished from other groups by various characteristics. This is particularly useful for finding target groups. Customer recommendations can also be made.
speech recognition: based on speech input, speech processing can be increasingly specialized and adapted more precisely to individual users.
speech processing: unsupervised learning can be used to detect toxic speech on the Internet, so that appropriate measures can be taken after the speech has been analyzed.
anomaly detection: many data streams (for example, transactions) deliver a large amount of data every day. Deviations from the norm can be measured in real time and then checked separately.
purchase associations: purchase histories reveal patterns in people’s shopping baskets. Marketing strategies and product placements can be derived from such data to increase sales.
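
As an illustration of the anomaly detection use case, the following sketch flags transaction amounts that deviate strongly from the rest using a z-score. The threshold of 2.0, the sample amounts, and the function name `flag_anomalies` are assumptions for this example. It processes a fixed batch rather than a live stream, and real systems typically use more robust methods.

```python
from statistics import mean, stdev


def flag_anomalies(values, z_threshold=2.0):
    """Return the indices of values more than z_threshold std devs from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values are identical, so nothing deviates from the norm.
        return []
    return [
        i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold
    ]


# Hypothetical transaction amounts with one obvious outlier.
amounts = [20.0, 22.5, 19.0, 21.0, 18.5, 250.0, 20.5, 23.0]
suspicious = flag_anomalies(amounts)
print(suspicious)
```

No labels such as "fraud" or "normal" are needed here: the norm is estimated from the data itself, and the flagged entries can then be checked separately.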