Ethics and Bias in Machine Learning


The texts in this article were partly generated by artificial intelligence and corrected and revised by us.

The basic problem

Machine learning algorithms are already used in many areas of our lives to make automated decisions. Research has shown that such models can unintentionally adopt human stereotypes. A good example comes from natural language processing, where models tend to associate the job of secretary with women and the job of manager with men, even though they were never designed to do so. ¹ Issues like these need to be addressed, and in this article we aim to shed light on the ethical issues of machine learning.

What is bias?

In machine learning, we speak of bias when a model tends to make inaccurate or unfair predictions. This is often due to systematic errors in the model or in the training data.

Bias in machine learning can be caused by a variety of factors. Some common causes include:

Insufficient amount of training data
Incorporation of human stereotypes into the training data
Choice of a model that is inappropriate for the task or does not capture the data well enough ²

There are many different types of bias. In this article, however, we focus on the basic problem of bias in practical applications; more information on the different types of bias can be found here.

Two Examples of Unfair A.I. Models

For users and developers of machine learning models, it is important that these models are fair. This ensures that no one is disadvantaged or even harmed by the product. Biased models can also pose business risks. ³ In general, ethical issues should be considered as early as possible in the development process. Below, we look at two well-known examples in which fairness was not sufficiently analyzed.

Amazon Recruiting Software

In 2014, Amazon began using large amounts of collected data to develop software that would automate the hiring process based on data already available within the company.

“Everyone wanted this holy grail,” one of the people said. “They literally wanted it to be an engine where I’m going to give you 100 resumes, it will spit out the top five, and we’ll hire those.” ⁴

At first glance, feeding 100 applications into the software and filtering out the five best seemed like a very good idea and a big time saver, especially for very large companies. However, when the software was analyzed in 2015, it turned out that it was not working as intended: applications with characteristics generally associated with women were rated negatively. The reason was that the model had been trained on past applications, and because the technology sector is male-dominated, most of those applications came from men. ¹ ⁴

U.S. Justice System

Traditionally, sentencing in the U.S. justice system was based on various written rules. Over time, however, the administration of justice shifted toward the use of predictive algorithms. The goal of these algorithms was to predict the likelihood that offenders would reoffend and to help determine an appropriate sentence. This approach is also known as predictive justice. What made it problematic was its use of many different aspects of an individual’s life, such as marital status, criminal history, age, and especially ethnicity.

However, critics instead raise robust concerns over this system of codified justice and labeled the current actuarial assessments tools as “unreliable, controversial, and unconstitutional”; as being created and trained with biased data “produced through histories of exclusion and discrimination” […]. ⁵

Analyses of this software have found that it reinforces many existing social inequalities within the justice system. ⁵ A well-known example is the historical discrimination against African-American U.S. citizens. ⁶

Sensitive characteristics and important aspects

These two examples illustrate that AI systems must be subject to some level of control to ensure that their use actually improves the end result or decision making. Therefore, it is important that the following sensitive characteristics are not included in a dataset:

Cultural affiliation
Gender
Age
Ethnicity

Author’s note: Such characteristics are sometimes relevant, e.g. in the medical field (as in the medical datasets available on Kaggle.com). In general, the goal should be to cleanse the dataset of any discriminatory or distorting features, and for many applications such characteristics should not be included.
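The removal step can be sketched in Python. This is a minimal illustration that assumes the data is a list of records stored as dictionaries. The column names and applicant data are hypothetical. The keep parameter reflects the author’s note that some characteristics, such as age in medical data, may be relevant for a particular application.

```python
# Hypothetical column names for the sensitive characteristics listed above.
SENSITIVE_FEATURES = {"cultural_affiliation", "gender", "age", "ethnicity"}


def drop_sensitive_features(records, sensitive=SENSITIVE_FEATURES, keep=()):
    """Return copies of the records without sensitive columns.

    Columns named in `keep` are retained, e.g. age in a medical dataset.
    """
    to_drop = set(sensitive) - set(keep)
    return [
        {key: value for key, value in record.items() if key not in to_drop}
        for record in records
    ]


# Hypothetical applicant records.
applicants = [
    {"years_experience": 5, "degree": "BSc", "gender": "f", "age": 34},
    {"years_experience": 2, "degree": "MSc", "gender": "m", "age": 27},
]

print(drop_sensitive_features(applicants))
# [{'years_experience': 5, 'degree': 'BSc'}, {'years_experience': 2, 'degree': 'MSc'}]
print(drop_sensitive_features(applicants, keep={"age"})[0])
# {'years_experience': 5, 'degree': 'BSc', 'age': 34}
```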

When working with an A.I. model, there are also a number of other points to consider: ³ ⁵

Reliability and security: Models should be robust and resistant to manipulation by attackers.
Transparency: Users should be able to understand how the model’s decisions were made. However, this is not always possible with neural networks, which are now widely used, because their decisions are highly complex and cannot always be presented in an understandable way.
Responsibility: There should be clear responsibilities for the development and use of machine learning models.
Privacy: In an increasingly digital world, it is important to protect all data from unauthorized access and thereby safeguard the rights of each individual.

Prevention options

If the features described above cannot be removed from the data set, we have several options to avoid an unfair model as much as possible: ⁷ ⁸ ⁹

If the number of features and the size of the dataset allow it, we should analyze the data as thoroughly as possible.
If there are imbalances or unfair relationships in the dataset, the associated data or relationships should be removed.
By choosing an appropriate fairness metric, we can check whether our model makes fair predictions, both during training and afterwards.
If these steps are not sufficient, hyperparameter optimization of the model may help.
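The first two steps, understanding the data and spotting imbalances, can start with something simple. We count how many records each group has and how often each group received a positive label. The sketch below assumes records stored as Python dictionaries. The field names and the small hiring history are invented for illustration.

```python
from collections import defaultdict


def group_summary(records, group_key, label_key):
    """Count records and the share of positive (1) labels for each group."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        group = record[group_key]
        counts[group] += 1
        positives[group] += int(record[label_key])
    return {
        group: {"count": counts[group], "positive_rate": positives[group] / counts[group]}
        for group in counts
    }


# Hypothetical historical hiring data: hired = 1 means the applicant was hired.
history = [
    {"gender": "m", "hired": 1},
    {"gender": "m", "hired": 1},
    {"gender": "m", "hired": 0},
    {"gender": "m", "hired": 1},
    {"gender": "f", "hired": 0},
    {"gender": "f", "hired": 1},
]

for group, stats in group_summary(history, "gender", "hired").items():
    print(group, stats)
# m {'count': 4, 'positive_rate': 0.75}
# f {'count': 2, 'positive_rate': 0.5}
```

In this made-up data, there are twice as many records for men as for women, and men were hired more often (75% vs. 50%). A model trained on such data could learn this pattern, much as in the Amazon example.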

Examples of Fairness Metrics

To illustrate the fairness metrics mentioned above, here are two examples that can be used to evaluate the fairness of a model. ¹⁰ ¹¹

Demographic Parity: Demographic parity requires that the rate of positive predictions be (approximately) equal across groups defined by a sensitive characteristic, such as ethnicity or gender. In the U.S. justice system example, African Americans should not be predicted to reoffend (and thus receive higher sentences) significantly more often than U.S. citizens of European descent.
Equalized Odds Metric: Equalized odds aims to make a machine learning model perform equally well for different groups. In practice, this metric requires that members of different sensitive groups experience similar error rates, i.e. (approximately) equal true positive rates and equal false positive rates.
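The two definitions above can also be written more formally. The notation is not taken from the original sources. Ŷ denotes the model’s prediction, Y the true outcome, and A the sensitive characteristic with two groups a and b.

```latex
\text{Demographic parity:}\quad P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = b)

\text{Equalized odds:}\quad P(\hat{Y} = 1 \mid A = a, Y = y) = P(\hat{Y} = 1 \mid A = b, Y = y) \quad \text{for } y \in \{0, 1\}
```

For y = 1, the equalized odds condition compares true positive rates, and for y = 0 it compares false positive rates. In practice, "approximately equal" is usually checked by measuring the gap between the groups.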

There are many other fairness metrics. They are often designed for specific situations, so the right choice depends on the application.
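Both example metrics can be computed from a model’s predictions with a few lines of Python. This is a minimal sketch that assumes binary labels and predictions (1 = predicted to reoffend) and a list naming each person’s group. The function names and the example data are hypothetical. Each function reports the largest gap between any two groups, where 0 means perfect parity.

```python
def rate(values):
    """Share of 1s in a non-empty list of 0/1 values."""
    return sum(values) / len(values)


def demographic_parity_difference(y_pred, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = [
        rate([p for p, g in zip(y_pred, groups) if g == group])
        for group in set(groups)
    ]
    return max(rates) - min(rates)


def equalized_odds_differences(y_true, y_pred, groups):
    """Largest between-group gaps in true positive rate and false positive rate.

    Assumes every group contains people with true label 1 and with true label 0.
    """
    tprs, fprs = [], []
    for group in set(groups):
        members = [i for i, g in enumerate(groups) if g == group]
        # TPR: share predicted 1 among members whose true label is 1.
        tprs.append(rate([y_pred[i] for i in members if y_true[i] == 1]))
        # FPR: share predicted 1 among members whose true label is 0.
        fprs.append(rate([y_pred[i] for i in members if y_true[i] == 0]))
    return max(tprs) - min(tprs), max(fprs) - min(fprs)


# Hypothetical data: 1 = (predicted to) reoffend, two groups "a" and "b".
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(demographic_parity_difference(y_pred, groups))       # 0.5
print(equalized_odds_differences(y_true, y_pred, groups))  # (0.5, 0.5)
```

In this made-up example, group a is predicted to reoffend three times as often as group b (75% vs. 25%). Group a also has both a higher true positive rate and a higher false positive rate, so the model would fail both criteria.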

TL;DR

Bias in machine learning occurs when a model makes systematically inaccurate or unfair predictions. This can pose both personal and business risks, so it is necessary to reduce bias. A major source of bias is the dataset, which is why it should be examined closely before (and even after) training. Certain characteristics, such as age, are irrelevant for most applications and should therefore generally not be included. Various options (e.g. fairness metrics) can be used to detect and reduce bias in a model and thus avoid ethical problems as much as possible.