Graphics from Sebastian Svenson – https://unsplash.com/@sebastiansvenson

Image Segmentation: Exploring Use Cases of Microsoft's Pretrained Transformer Kosmos-2

Image segmentation is crucial for object recognition and advanced image processing, and the Kosmos-2 neural network enhances this by grounding text in the visual world, perceiving object descriptions, and associating them with their respective image regions. Today we will explore use cases of such a model.

Henrik Bartsch


The texts in this article were partly composed with the help of artificial intelligence and then corrected and revised by us. Details on the services we use for generation can be found here:

How we use machine learning to create our articles

Introduction

Whether in e-commerce, healthcare, social media, or self-driving cars, identifying and localizing objects in images plays a critical role in today's digital world and enables a wide range of applications. It is also a complex task, requiring a high level of accuracy and detail.

In this article I would like to introduce a pre-trained transformer for this task: Kosmos-2 from Microsoft.

Image Segmentation

Image segmentation is a technique used in digital image processing and machine vision. It involves dividing a digital image into several segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. More specifically, image segmentation is the process of assigning a label to each pixel in an image so that pixels with the same label share certain properties. The result of image segmentation is a set of segments that together cover the entire image, or a set of contours extracted from the image. All pixels in a region are similar with respect to some characteristic or computed property, such as color, intensity, or texture. 1 2 3
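
To make the idea of per-pixel labels concrete, here is a minimal toy sketch (independent of Kosmos-2, and using an example file path) that segments an image into "bright" and "dark" regions by thresholding:

import numpy as np
from PIL import Image

# Load an image and convert it to grayscale (the path is just an example).
img = np.array(Image.open("example.jpeg").convert("L"))

# Assign every pixel a label: 1 if it is brighter than the mean, 0 otherwise.
labels = (img > img.mean()).astype(np.uint8)

# Each label forms a segment; count how many pixels fall into each one.
print("pixels per segment:", np.bincount(labels.ravel()))

Models such as Kosmos-2 go far beyond such heuristics, but the underlying goal of associating image regions with meaningful labels is the same.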

Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. When applied to an image stack (a series of images), as is common in medical imaging, the contours resulting from image segmentation can be used to create 3D reconstructions using geometric reconstruction algorithms. 1 2

Kosmos-2

Kosmos-2 is a Multimodal Large Language Model (MLLM) developed by Microsoft Research. It is designed to generate object descriptions (e.g. bounding boxes) and to ground text in the visual world. This means that text descriptions can be linked to the corresponding image regions.

Kosmos-2 represents referring expressions as links in Markdown, where object descriptions are sequences of location tokens. For training, an extensive dataset of grounded image-text pairs (called GrIT) is used. In addition to the existing capabilities of MLLMs (e.g., general modality perception, instruction following, and in-context learning), Kosmos-2 can be integrated into downstream applications. The model has been evaluated on a wide range of tasks, including multimodal grounding, multimodal referring, perception-language tasks, and language understanding and generation.
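
As an illustration of this format, a grounded phrase in the model output we will see later in this article looks like this:

<phrase> a forest with fog</phrase><object><patch_index_0224><patch_index_1023></object>

The two patch_index tokens are the location tokens: they mark, roughly speaking, the top-left and bottom-right patches of the bounding box on the model's internal patch grid, which is how a piece of text is tied to a region of the image.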

This work lays the foundation for the development of embodiment AI and demonstrates the convergence of language, multimodal perception, action, and world modeling, which is an important step toward Artificial General Intelligence (AGI). 4 5 6 7

Using the Model

As usual, we start by loading all the necessary imports.

kosmos2.py
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

import cv2
import numpy as np

We can then use the Hugging Face API to download both the actual neural network and the processor.

kosmos2.py
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
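
Optionally, and assuming a CUDA-capable GPU is available, the model can be moved to the GPU to speed up inference. This is only a sketch; the rest of the article runs fine on CPU as well:

import torch

# Optional: run on GPU if available (assumption: torch was installed with CUDA support).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Note: if a GPU is used, the tensors returned by the processor further below
# must also be moved with .to(device) before calling model.generate.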

We can now take a closer look at the two main applications of this model: Image Description and Image Segmentation. At this point, we will start with the textual description use case due to its simplicity and familiarity.

Image Description

For a text description it is necessary to select an image that we want to describe. We have chosen this image as an example:

Graphics from Jay Mantri - https://unsplash.com/@jaymantri

We can now load this image through PIL.

kosmos2.py
image = Image.open("Kosmos-Images/test-image.jpeg")
prompt = "<grounding>An image of"

We can then pass the image, together with the prompt defined above, to the model for processing. The <grounding> tag at the start of the prompt tells the model to generate location tokens for the objects it describes.

kosmos2.py
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds=None,
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    use_cache=True,
    max_new_tokens=128,
)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

The generated_text variable now contains the raw model output, which looks like this:

<image>. the, to and of as in I that' for is was- on it with The as at bet he have from by are " you his “ this said not has an ( but had we her they will my or were their): up about out who one all been she can more would It</image><grounding> An image of<phrase> a forest with fog</phrase><object><patch_index_0224><patch_index_1023></object>

This text is not immediately clear to us, but it contains the relevant information. Fortunately, Microsoft directly provides the appropriate post-processing so that we can get a more readable text:

kosmos2.py
processed_text = processor.post_process_generation(generated_text, cleanup_and_extract=False)

print(processed_text)

<grounding> An image of<phrase> a forest with fog</phrase><object><patch_index_0224><patch_index_1023></object>

Here we still get a text description containing model tokens, which can be useful if we want to inspect the structure of the output. If we do not need this, the same post-processing call can also return a readable text together with the position and size of all detected objects in the image:

kosmos2.py
processed_text, entities = processor.post_process_generation(generated_text)

print(processed_text)
print(entities)

An image of a forest with fog

[('a forest with fog', (12, 29), [(0.015625, 0.234375, 0.984375, 0.984375)])]
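
Each entry in entities is a tuple consisting of the phrase, its character span (start, end) in the cleaned text, and a list of bounding boxes in normalized (x1, y1, x2, y2) coordinates. As a small sketch, such a box can be converted to pixel coordinates of our example image like this:

# Take the first detected entity and its first bounding box.
entity_name, (start, end), boxes = entities[0]
x1_norm, y1_norm, x2_norm, y2_norm = boxes[0]

# Scale the normalized coordinates by the image size (PIL: image.size = (width, height)).
w, h = image.size
box_px = (int(x1_norm * w), int(y1_norm * h), int(x2_norm * w), int(y2_norm * h))
print(entity_name, box_px)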

Now that we’ve seen that this model can produce image descriptions, let’s move on to image segmentation. We will use the entities from the last line of the output for this.

Image Segmentation

For the image segmentation, we use the helper code from the model card mentioned above 5. We first define the necessary functions and then apply them to our example image.

kosmos2.py
def is_overlapping(rect1, rect2):
    # Check whether two axis-aligned rectangles given as (x1, y1, x2, y2) overlap.
    x1, y1, x2, y2 = rect1
    x3, y3, x4, y4 = rect2
    return not (x2 < x3 or x1 > x4 or y2 < y3 or y1 > y4)
kosmos2.py
def draw_entity_boxes_on_image(image, entities, show=False, save_path=None):
    """Draw labeled bounding boxes for the detected entities onto the image.

    Args:
        image: PIL image to draw on
        entities: list of (entity_name, (start, end), bboxes) tuples, as returned
            by processor.post_process_generation, with normalized bounding boxes
        show: display the resulting image if True
        save_path: optional path to save the resulting image to
    """

    image_h = image.height
    image_w = image.width
    # convert the PIL image (RGB) into a numpy array in BGR channel order for OpenCV
    image = np.array(image)[:, :, [2, 1, 0]]

    if len(entities) == 0:
        return image

    new_image = image.copy()
    previous_bboxes = []
    text_size = 1
    text_line = 1
    box_line = 3
    text_spaces = 3

    (c_width, text_height), _ = cv2.getTextSize("F", cv2.FONT_HERSHEY_COMPLEX, text_size, text_line)
    base_height = int(text_height * 0.675)
    text_offset_original = text_height - base_height

    for entity_name, (start, end), bboxes in entities:
        for (x1_norm, y1_norm, x2_norm, y2_norm) in bboxes:
            orig_x1, orig_y1, orig_x2, orig_y2 = int(x1_norm * image_w), int(y1_norm * image_h), int(x2_norm * image_w), int(y2_norm * image_h)

            # random color
            color = tuple(np.random.randint(0, 255, size=3).tolist())
            new_image = cv2.rectangle(new_image, (orig_x1, orig_y1), (orig_x2, orig_y2), color, box_line)

            l_o, r_o = box_line // 2 + box_line % 2, box_line // 2 + box_line % 2 + 1

            x1 = orig_x1 - l_o
            y1 = orig_y1 - l_o

            if y1 < text_height + text_offset_original + 2 * text_spaces:
                y1 = orig_y1 + r_o + text_height + text_offset_original + 2 * text_spaces
                x1 = orig_x1 + r_o

            # add text background
            (text_width, text_height), _ = cv2.getTextSize(f"  {entity_name}", cv2.FONT_HERSHEY_COMPLEX, text_size, text_line)
            text_bg_x1, text_bg_y1, text_bg_x2, text_bg_y2 = x1, y1 - (text_height + text_offset_original + 2 * text_spaces), x1 + text_width, y1

            for prev_bbox in previous_bboxes:
                while is_overlapping((text_bg_x1, text_bg_y1, text_bg_x2, text_bg_y2), prev_bbox):
                    text_bg_y1 += (text_height + text_offset_original + 2 * text_spaces)
                    text_bg_y2 += (text_height + text_offset_original + 2 * text_spaces)
                    y1 += (text_height + text_offset_original + 2 * text_spaces)

                    if text_bg_y2 >= image_h:
                        text_bg_y1 = max(0, image_h - (text_height + text_offset_original + 2 * text_spaces))
                        text_bg_y2 = image_h
                        y1 = image_h
                        break

            alpha = 0.5
            for i in range(text_bg_y1, text_bg_y2):
                for j in range(text_bg_x1, text_bg_x2):
                    if i < image_h and j < image_w:
                        if j < text_bg_x1 + 1.35 * c_width:
                            # original color
                            bg_color = color
                        else:
                            # white
                            bg_color = [255, 255, 255]
                        new_image[i, j] = (alpha * new_image[i, j] + (1 - alpha) * np.array(bg_color)).astype(np.uint8)

            cv2.putText(
                new_image, f"  {entity_name}", (x1, y1 - text_offset_original - 1 * text_spaces), cv2.FONT_HERSHEY_COMPLEX, text_size, (0, 0, 0), text_line, cv2.LINE_AA
            )
            # previous_locations.append((x1, y1))
            previous_bboxes.append((text_bg_x1, text_bg_y1, text_bg_x2, text_bg_y2))

    pil_image = Image.fromarray(new_image[:, :, [2, 1, 0]])
    
    if save_path:
        pil_image.save(save_path)
    if show:
        pil_image.show()

    return new_image
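
With the helper functions defined, we can now apply them to our example image and save the result:

kosmos2.py
draw_entity_boxes_on_image(image, entities, show=False, save_path="Kosmos-Images/test-image-modified.jpeg")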

Based on graphic by Jay Mantri - https://unsplash.com/@jaymantri

In this image, we now see a box around our forest labeled “a forest with fog” - a fitting description of the contents of this image.

Results of the model

Here are some of our experiments with this model.

Each example shows the original image next to the modified image with the generated bounding boxes.

Graphics from 3darts renders - https://unsplash.com/@3d_arts
Graphics from Dan Asaki - https://unsplash.com/@danasaki
Graphics from Daniel Ramírez - https://unsplash.com/@danramirez1998
Graphics from Filip Mroz - https://unsplash.com/@mroz
Graphics from Francesco Ungaro - https://unsplash.com/@francesco_ungaro
Graphics from Icons8 Team - https://unsplash.com/@icons8
Graphics from Igor Omilaev - https://unsplash.com/@omilaev
Graphics from Jaromír Kavan - https://unsplash.com/@jerrykavan
Graphics from kameli̯ə - https://unsplash.com/@camelieinpic
Graphics from Maksym Mazur - https://unsplash.com/@withmazur
Graphics from Maria Teneva - https://unsplash.com/@miteneva
Graphics from Nadzeya Matskevich - https://unsplash.com/@nadzeya1104
Graphics from Olga Deeva - https://unsplash.com/@loniel
Graphics from Red Zeppelin - https://unsplash.com/@redzeppelin
Graphics from Shubham Dhage - https://unsplash.com/@theshubhamdhage
Graphics from Sophie Gerrie - https://unsplash.com/@sophiegerrie
Graphics from Volodymyr M - https://unsplash.com/@huzhewseh

In our opinion, this model delivers good results. However, there are a few exceptions that we would have liked to see handled differently:

  1. In three of the pictures above, we would have liked the mountains to have their own boxes as well.
  2. In one of the pictures, we would have liked the houses to be boxed individually.

Areas of application

Some practical uses for segmenting images include: 1 2 3

  1. content-based image search
  2. computer vision
  3. medical imaging, including volume-rendered images from computed tomography, magnetic resonance imaging, and volume electron microscopy
  4. localization of tumors and other pathologies
  5. tissue volume measurement
  6. diagnosis and examination of anatomical structures
  7. surgical planning
  8. simulation of virtual surgery
  9. navigation during surgery
  10. radiation therapy
  11. object recognition

TL;DR

Image segmentation techniques range from simple, intuitive heuristic analysis to state-of-the-art deep learning implementations. Traditional image segmentation algorithms process visual features of each pixel, such as color or brightness, to identify object boundaries and background areas. Machine learning, using specialized datasets, is used to train models to accurately classify the specific types of objects and regions contained in an image. Kosmos-2 is one such model that performs well. However, we would like to see a little more detail here and there, as some objects are not identified individually.

Sources

Footnotes

  1. ibm.com
  2. wikipedia.org
  3. huggingface.co
  4. arxiv.org
  5. huggingface.co
  6. github.com
  7. medium.com