The texts in this article were partly composed with the help of artificial intelligence and corrected and revised by us. The following services were used for the generation:
Introduction
Whether in e-commerce, healthcare, social media, or self-driving cars, the ability to identify and localize objects in images is valuable. It plays a critical role in today’s digital world and enables a wide range of applications. It is also a very complex task that requires a high level of accuracy and detail.
In this article I would like to introduce a pre-trained transformer for this task: Kosmos-2 from Microsoft.
Image Segmentation
Image segmentation is a technique used in digital image processing and machine vision. It involves dividing a digital image into several segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. More specifically, image segmentation is the process of assigning a label to each pixel in an image so that pixels with the same label share certain properties. The result of image segmentation is a set of segments that together cover the entire image, or a set of contours extracted from the image. All pixels in a region are similar with respect to some characteristic or computed property such as color, intensity, or texture. 1 2 3
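To make the idea of per-pixel labels concrete, here is a minimal sketch of the simplest form of segmentation, a brightness threshold. The tiny grayscale image and the threshold value are made up for illustration. Kosmos-2 does not work this way.

```python
def segment_by_threshold(image, threshold):
    """Assign label 1 to pixels brighter than `threshold`, label 0 otherwise."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]


# Hypothetical 4x4 grayscale image (0 = black, 255 = white).
image = [
    [10, 12, 200, 210],
    [11, 13, 205, 220],
    [9, 14, 198, 215],
    [10, 10, 12, 11],
]

# Every pixel gets a label; pixels with the same label form a segment.
mask = segment_by_threshold(image, threshold=128)
for row in mask:
    print(row)
```

All pixels labeled 1 share the property "brighter than the threshold", which is exactly the kind of shared characteristic described above.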
Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. When applied to an image stack (a series of images), as is common in medical imaging, the contours resulting from image segmentation can be used to create 3D reconstructions using geometric reconstruction algorithms. 1 2
Kosmos-2
Kosmos-2 is a Multimodal Large Language Model (MLLM) developed by Microsoft Research. It is designed to perceive object descriptions (e.g., bounding boxes) and to ground text in the visual world. This means that spans of text can be linked to the corresponding regions of an image.
Kosmos-2 represents referring expressions as links in Markdown, i.e. "[text span](bounding boxes)", where object descriptions are sequences of location tokens. A large dataset of grounded image-text pairs (called GrIT) is used for training. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and in-context learning), Kosmos-2 brings this grounding capability to downstream applications. It is evaluated on a wide range of tasks, including multimodal grounding, multimodal referring, perception-language tasks, and language understanding and generation.
This work lays the foundation for the development of embodiment AI and demonstrates the convergence of language, multimodal perception, action, and world modeling, which is an important step toward Artificial General Intelligence (AGI). 4 5 6 7
Using the Model
As usual, we start by loading all the necessary imports.
We can then use the Hugging Face Transformers library to download both the actual neural network and the processor.
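A minimal sketch of these two steps could look as follows. It assumes the `transformers` package (with PyTorch) is installed and that the checkpoint `microsoft/kosmos-2-patch14-224` is the one being used; the helper name `load_kosmos2` is only for illustration.

```python
MODEL_ID = "microsoft/kosmos-2-patch14-224"  # assumed checkpoint name


def load_kosmos2(model_id=MODEL_ID):
    """Download (or load from the local cache) the processor and the model."""
    # Imported inside the function so that the heavy dependencies are only
    # required when the model is actually loaded.
    from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = Kosmos2ForConditionalGeneration.from_pretrained(model_id)
    model.eval()  # we only run inference, no training
    return processor, model
```

Calling `processor, model = load_kosmos2()` triggers the download on first use, so expect it to take a while.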
We can now take a closer look at the two main applications of this model: Image Description and Image Segmentation. At this point, we will start with the textual description use case due to its simplicity and familiarity.
Image Description
For a text description it is necessary to select an image that we want to describe. We have chosen this image as an example:
We can now load this image with PIL.
We can then pass the image to the model for processing, together with a text prompt. As the output below shows, the prompt is "<grounding> An image of".
The generated_text variable now contains the raw output, which looks like this:
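The generation step might be sketched as below. The prompt is taken from the output shown in this article; the function name `describe_image` and the `max_new_tokens` default are assumptions, and `processor` and `model` are expected to be the Kosmos-2 objects from the Transformers library.

```python
PROMPT = "<grounding> An image of"  # prompt as it appears in the output


def describe_image(image, processor, model, prompt=PROMPT, max_new_tokens=64):
    """Run Kosmos-2 on one image and return the raw decoded output text."""
    # The processor turns prompt and image into model-ready tensors.
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        pixel_values=inputs["pixel_values"],
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image_embeds=None,
        image_embeds_position_mask=inputs["image_embeds_position_mask"],
        use_cache=True,
        max_new_tokens=max_new_tokens,
    )
    # Decode the first (and only) sequence of the batch back into text.
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

With Pillow, the image could be opened with `Image.open(...)` on a local file and passed as `image`, for example `generated_text = describe_image(image, processor, model)`.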
<image>. the, to and of as in I that' for is was- on it with The as at bet he have from by are " you his “ this said not has an ( but had we her they will my or were their): up about out who one all been she can more would It</image><grounding> An image of<phrase> a forest with fog</phrase><object><patch_index_0224><patch_index_1023></object>
This text is not immediately clear to us, but it contains the relevant information. Fortunately, Microsoft directly provides the appropriate post-processing so that we can get a more readable text:
<grounding> An image of<phrase> a forest with fog</phrase><object><patch_index_0224><patch_index_1023></object>
Here we also get a text description containing model tokens. This can be useful if users want to know more about the structure of the output. If users do not want this, we also have the corresponding part of the code that gives us a readable text and the position and size of all objects in the image:
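Both post-processing variants can be sketched with the processor's `post_process_generation` method. The wrapper function `postprocess` is only an illustrative helper, and `processor` is assumed to be the Kosmos-2 processor from the Transformers library.

```python
def postprocess(processor, generated_text):
    """Return the text with grounding tokens, the clean caption and the entities."""
    # Variant 1: drop the image placeholder tokens but keep the grounding tokens.
    with_tokens = processor.post_process_generation(
        generated_text, cleanup_and_extract=False
    )
    # Variant 2 (default): plain caption plus a list of
    # (phrase, (start, end), [bounding boxes]) entities.
    caption, entities = processor.post_process_generation(generated_text)
    return with_tokens, caption, entities
```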
An image of a forest with fog
[('a forest with fog', (12, 29), [(0.015625, 0.234375, 0.984375, 0.984375)])]
Now that we’ve seen that this model can produce image descriptions, let’s move on to image segmentation. We will use the last line of the output for this, i.e. the list of phrases with their bounding boxes.
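To see where these numbers come from, the following sketch rebuilds the result from the token string in plain Python. It assumes a 32 x 32 grid of location tokens (patch_index_0000 to patch_index_1023) and uses the centre of the two referenced grid cells as box corners. It is a simplification that handles one box per phrase and ignores the edge cases the library covers.

```python
import re

GRID = 32  # assumed: 32 x 32 = 1024 location tokens

ENTITY = re.compile(
    r"<phrase>(.*?)</phrase><object><patch_index_(\d+)><patch_index_(\d+)></object>"
)


def patch_center(index, grid=GRID):
    """Return the normalized (x, y) centre of the grid cell with this index."""
    row, col = divmod(index, grid)
    return (col + 0.5) / grid, (row + 0.5) / grid


def extract_entities(grounded_text):
    """Return (caption, [(phrase, (start, end), [box])]) for a grounded text."""
    # Drop the image placeholder part and the <grounding> marker.
    text = grounded_text.split("</image>")[-1].replace("<grounding>", "").strip()
    caption, entities, cursor = "", [], 0
    for match in ENTITY.finditer(text):
        caption += text[cursor:match.start()]  # plain text before the phrase
        raw_phrase = match.group(1)
        phrase = raw_phrase.strip()
        # Character span of the phrase inside the clean caption.
        start = len(caption) + len(raw_phrase) - len(raw_phrase.lstrip())
        caption += raw_phrase
        # The two tokens mark the upper-left and lower-right cells of the box.
        x1, y1 = patch_center(int(match.group(2)))
        x2, y2 = patch_center(int(match.group(3)))
        entities.append((phrase, (start, start + len(phrase)), [(x1, y1, x2, y2)]))
        cursor = match.end()
    return caption + text[cursor:], entities


sample = (
    "<grounding> An image of<phrase> a forest with fog</phrase>"
    "<object><patch_index_0224><patch_index_1023></object>"
)
caption, entities = extract_entities(sample)
print(caption)
print(entities)
```

Token 0224 is the first cell of row 7 and token 1023 is the last cell of the grid, which is why the box spans almost the full width of the image and starts about a quarter of the way down.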
Image Segmentation
For the image segmentation we use the source code from the example referenced above 5. We first define the functions and then apply them to our example image.
In this image, we now see a bounding box around our forest labeled “Forest with Fog” - a fitting description of the contents of this image.
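The referenced source code is not reproduced here. As a rough substitute, the sketch below shows how the normalized boxes could be drawn with Pillow. The function names, the color and the line width are assumptions, and `draw` is expected to behave like a `PIL.ImageDraw.ImageDraw` object.

```python
def to_pixel_box(box, width, height):
    """Scale a normalized (x1, y1, x2, y2) box to integer pixel coordinates."""
    x1, y1, x2, y2 = box
    return int(x1 * width), int(y1 * height), int(x2 * width), int(y2 * height)


def draw_entities(draw, entities, width, height, color="red"):
    """Draw one labeled rectangle per bounding box of every entity."""
    for phrase, _span, boxes in entities:
        for box in boxes:
            x1, y1, x2, y2 = to_pixel_box(box, width, height)
            draw.rectangle([x1, y1, x2, y2], outline=color, width=3)
            # Put the phrase just inside the upper-left corner of the box.
            draw.text((x1 + 5, y1 + 5), phrase, fill=color)
```

With Pillow, `draw` could be created via `ImageDraw.Draw(image)`, and `width, height` taken from `image.size`; the annotated image can then be shown or saved as usual.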
Results of the model
Here are some of our experiments with this model.
Original Image | Modified Image |
---|---|
In our opinion, this model delivers good results. However, there are a few exceptions that we would have liked to see differently:
- In three of the pictures, we would have liked the mountain to have its own box as well.
- In this picture, we would have liked the houses to have been boxed individually.
Areas of application
Some practical uses for segmenting images include: 1 2 3
- content-based image search
- computer vision
- medical imaging, including volume-rendered images from computed tomography, magnetic resonance imaging, and volume electron microscopy
  - localization of tumors and other pathologies
  - tissue volume measurement
  - diagnosis, examination of anatomical structure
  - surgical planning
  - simulation of virtual surgery
  - navigation during surgery
  - radiation therapy
- object recognition
TL;DR
Image segmentation techniques range from simple, intuitive heuristic analysis to state-of-the-art deep learning implementations. Traditional image segmentation algorithms process low-level visual features of each pixel, such as color or brightness, to identify object boundaries and background areas. Machine learning approaches train models on specialized datasets to accurately classify the specific types of objects and regions contained in an image. Kosmos-2 is one such model, and in our experiments it performs well. However, we would like to see a little more detail here and there, as some objects are not identified on their own.