The texts in this article were partly composed with the help of artificial intelligence and corrected and revised by us. The following services were used for the generation:
How we use machine learning to create our articles
Introduction
Images play an increasingly important role in our digital lives. Whether on social networks, in newsletters and messages, or in a Google search, we constantly come into contact with images in digital form. With so many potentially useful image files around, Question Answering becomes valuable: it lets us get specific questions answered from an image or a series of images. This can help with saving time, accessibility, data analysis, or any other task the user has in mind.
In today’s article, we want to introduce a neural network that has been trained for this very task: Matcha-Quarta from Google.
Acknowledgments
At this point, we would like to draw your attention to the work of the Our World in Data team. This team provides high-quality statistics that draw attention to difficulties and problems and enable a more detailed analysis.
To make progress against the pressing problems the world faces, we need to be informed by the best research and data. Our World in Data makes this knowledge accessible and understandable, to empower those working to build a better world.
By publishing various statistics, the team wants to show that not all problems in the world are getting worse, but that many situations are continuously improving. 1 2 We would like to thank the team for making their statistics publicly available, which allows us to use them on this website.
Question Answering
To establish a common ground, we would first like to provide a definition of the Question Answering task in the context of Matcha-Quarta before we get into the actual use and results of the model.
In Question Answering, we give the model two related things:
- An image and
- a question that relates to the image.
The task of the model is to use both components to determine an answer to the question by trying to understand the content and context of the image. The accuracy of the answer depends on the quality of the training data and on the model parameters. Although there are other forms and variations of question answering, in this section we will focus on what is known as extractive question answering. 2 3 4
Now let’s look at a brief example of this problem. For this, we provide an image that we obtained from this source.
A possible question we could ask about this picture would be, for example, “What percentage of all people have formal basic education?” A look at the source tells us that this value is exactly in 2020. Depending on the required accuracy, a value like might be sufficient. We could also ask questions about the slope of the graph of the proportion with primary education, for example, or ask for an explanation of what exactly is meant by the term “primary education” here.
Using the Model
We only need 9 lines to implement the model. Let’s start with the imports:
We can then download both the preprocessor and the neural network using Huggingface’s Transformers API.
In the next step, we can load the image into the script via PIL and assign a question to a variable.
In this step, we can start preprocessing the image and the question using the processor. We can then pass the processed inputs to the model. In the final step, we use the processor again to process the raw data from the model and convert it into a readable form.
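The steps described above can be sketched roughly as follows. This is a minimal sketch, not the original nine lines: we assume that the model discussed here corresponds to the `google/matcha-chartqa` checkpoint on the Hugging Face Hub, which is loaded through the Pix2Struct classes of Transformers. The function names `load_matcha` and `answer_question` and the file name `chart.png` are hypothetical and only serve the illustration.

```python
def load_matcha(checkpoint="google/matcha-chartqa"):
    """Download the preprocessor and the neural network.

    Assumption: the checkpoint name is our guess for the model in this article.
    Requires the `transformers` and `torch` packages.
    """
    # Imported lazily so that answer_question() stays usable without them.
    from transformers import (
        Pix2StructForConditionalGeneration,
        Pix2StructProcessor,
    )

    processor = Pix2StructProcessor.from_pretrained(checkpoint)
    model = Pix2StructForConditionalGeneration.from_pretrained(checkpoint)
    return processor, model


def answer_question(image, question, processor, model, max_new_tokens=512):
    """Preprocess image and question, run the model, decode the answer."""
    # Step 1: the processor turns image + question into model inputs.
    inputs = processor(images=image, text=question, return_tensors="pt")
    # Step 2: the model generates the raw token ids of the answer.
    predictions = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Step 3: the processor converts the token ids into readable text.
    return processor.decode(predictions[0], skip_special_tokens=True)


# Usage (downloads the weights on first call; "chart.png" is a hypothetical file):
#   from PIL import Image
#   processor, model = load_matcha()
#   question = "Which country had the highest population in the year 2000?"
#   print(answer_question(Image.open("chart.png"), question, processor, model))
```

Separating loading from answering means the processor and the model are downloaded once and can then be reused for several questions about the same or different images.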
Application examples
In the following, we show what Matcha-Quarta can do. We use different graphics and ask questions of varying difficulty. We also link to the source of each statistic so that readers can access it directly.
Question: Which country had the highest population in the year 2000?
Answer: China
Question: Which country had the highest population growth in the year 2000?
Answer: China
This example shows that basic questions about individual values and their ranking are answered correctly, whereas questions about changes in values are not always answered correctly. The correct answer to the second question would have been India.
Question: In which year did the United Kingdom hit 5t the first time?
Answer: 1800
Question: Which country has the lowest emissions in the year 2022?
Answer: India
This example again shows that basic rankings are analyzed correctly. The year, however, is far from the actual value: the answer to the first question should be closer to 1860.
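How far off a numeric answer may be before we count it as wrong depends on the required accuracy. The following sketch shows one possible way to check this automatically. The helper `within_tolerance` and the tolerance of ten years are our own assumptions for illustration; the values 1800 and 1860 are taken from the example above.

```python
def within_tolerance(predicted: str, expected: float, abs_tol: float) -> bool:
    """Return True if the text answer parses as a number close to `expected`.

    Hypothetical helper: `abs_tol` is an absolute tolerance chosen by the user.
    """
    try:
        # Drop thousands separators and a trailing percent sign before parsing.
        value = float(predicted.replace(",", "").strip().rstrip("%"))
    except ValueError:
        return False  # non-numeric answers never count as close
    return abs(value - expected) <= abs_tol


# The model answered "1800", the chart suggests roughly 1860.
print(within_tolerance("1800", 1860, abs_tol=10))  # False: off by 60 years
```

We use an absolute rather than a relative tolerance here because a relative tolerance of a few percent would accept almost any year in this range.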
Question: Which country had the steepest decline in renewable freshwater resources between 1980 and 1990?
Answer: Brazil
With the next image, we want to show that Matcha-Quarta is also able to infer information that is not directly given in the image.
Question: What amount of renewable freshwater reserves did India have in the year 2019?
Answer: 1.131
Although the picture never identifies India as part of “South Asia”, Matcha-Quarta was able to make this connection.
Question: Which country achieved a higher womens political empowerement index in the year 1950?
Answer: Germany
Question: Which country achieved a higher womens political empowerement index in the year 2000?
Answer: Germany
Compared to the previous results, this picture shows that Matcha-Quarta has more problems with classification at the beginning of a time interval, while it often achieves (significantly) better results at the end of the time interval. The answer for 1950 should be “United States” (or similar).
Now we want to show that Matcha-Quarta is also able to handle relatively complex data. We will use the following image:
Question: How many times more CO2 is produced by coal compared to nuclear energy?
Answer: 160
Question: What type of energy source is the safest source of energy?
Answer: [Solar, Greenhouse gas]
Question: What type of energy source is the cleanest source of energy?
Answer: [Solar, Greenhouse gas]
We can see that the first answer is correct. For the other two questions, however, we get two answers each, although it is clear from the picture that only one answer can be correct. For the second question, the first of the two answers is solar energy, which is correct. For the third question, both answers are wrong.
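If such multi-part answers are to be processed further, they first have to be split into individual candidates. The following is a minimal sketch, assuming that the bracketed notation shown above reflects the decoded string as returned by the model; the helper `split_answers` is hypothetical.

```python
def split_answers(raw: str) -> list[str]:
    """Split a decoded answer such as "[Solar, Greenhouse gas]" into candidates."""
    text = raw.strip()
    # Remove the surrounding brackets if the model returned a list-like string.
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    # Split on commas and drop empty fragments.
    return [part.strip() for part in text.split(",") if part.strip()]


print(split_answers("[Solar, Greenhouse gas]"))  # ['Solar', 'Greenhouse gas']
```

Note that this simple split would also break up numbers written with a thousands separator, so it is only suitable for textual answers.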
Faulty results
Although we have already shown some examples with errors above, in this section we want to show queries that have particularly noticeable errors. We will start with these statistics:
We asked the following questions about this image:
Question: Which country had the highest percentage of nuclear energy consumed?
Answer: Mexico
Question: Which country had the highest percentage of nuclear energy consumed across europe?
Answer: China
Question: Please describe the “substitution method” in a single sentence.
Answer: 2
We did not necessarily see these answers coming. The answer “2” to the last question makes no sense, and the answers to the two variants of the first question (with and without the restriction to Europe) are also far from the truth. In particular, the answer “China” is completely wrong in the context of Europe.
We achieved a similar quality of results with the next picture.
Question: Which country had the highest share of renewable energy generation in europe?
Answer: Germany
Again, the result is wrong, but unlike the previous answer, Germany is at least in Europe. The correct answer would have been Norway.
In our experience, Matcha-Quarta is not yet able to draw reliable conclusions from maps.
Areas of application
The ability to automatically answer questions about images can be useful to both individuals and organizations for a variety of reasons:
- Time savings: Automated image analysis can be performed faster than manual analysis, saving time. 5
- Scalability: It enables the processing of large volumes of images that would be impractical to process manually. 5
- Accessibility: It can help people who have difficulty seeing or understanding images. For example, blind or visually impaired people can be assisted by the description of image content. 5
- Data analytics: Companies can use this technology to gain insights from images that are relevant to their business goals. For example, retailers can use automated image analysis to optimize product placement in their stores. 6
- Security and surveillance: In security and surveillance applications, automated image analysis can help detect suspicious activity and trigger appropriate alerts.
- Healthcare: In healthcare, automated image analysis can help diagnose diseases and monitor treatments by analyzing medical images. 7
This capability opens up a wide range of potential applications and can add significant value. However, it is important to note that the accuracy and usefulness of automated image analysis depend on the quality of the algorithms and data used.
TL;DR
The basic idea behind Matcha-Quarta is relatively simple: even at this early stage of efficient “question answering” models, a relatively small neural network with a tokenizer can automatically answer questions about images. Even though only relatively simple questions are answered reliably, Matcha-Quarta can be used to automate everyday tasks with images or to make humans more efficient at such tasks. For more complex questions, we showed that the answers often have nothing to do with the actual question or the image.
In the (near) future, we expect to have models that will provide much better answers for this task. We will let you know when exactly.