Introduction
Text-to-image synthesis has emerged as a groundbreaking field of artificial intelligence research over the past few years, captivating enthusiasts and experts alike. Pioneered by models such as DALL-E in 2021, this technology enables users to generate vivid images from text prompts, sparking immense interest and creative applications across various industries. Amidst these developments, Stable Diffusion XL 1.0 has reshaped the landscape by building upon its predecessor, Stable Diffusion, with significant advancements in efficiency, quality, and versatility. By leveraging cutting-edge architectures and incorporating substantial improvements, Stable Diffusion XL has become a significant milestone among text-to-image generation models.
Today, we will take a closer look at how to generate images with Stable Diffusion XL and at the quality of the results.
Throughout this article, we will use the abbreviation “SDXL” for Stable Diffusion XL.
Text-to-Image Generation
Before we dive into image generation with SDXL and the quality of its outputs, we first want to lay the groundwork and define what text-to-image generation means.
Text-to-image generation is a process in generative machine learning that enables the creation of images from textual descriptions. Here’s an overview of the key steps involved:
- Text Preprocessing: The input text is preprocessed to extract relevant information and convert it into a numerical representation that the model can work with. This mainly involves tokenization, although other techniques may be applied to normalize the text (see the short tokenization sketch after this list). These techniques draw on concepts from Natural Language Processing and aim to capture the semantic meaning of the prompt and the relationships between its words.
- Image Generation: A generative model, typically based on Generative Adversarial Networks, Variational Autoencoders, or diffusion models, is used to generate an initial image from the processed prompt. This step amounts to sampling from a probability distribution over images that match the input text description.
- Image Refining: The initial image may undergo post-processing, where filters, adjustments, or other transformations are applied to enhance its quality or its consistency with the original text prompt.
In the end, we are left with a refined image that we can extract from the model's output.
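To make the text-preprocessing step more concrete, the following sketch tokenizes a prompt with the CLIP tokenizer that SDXL's first text encoder builds on; the model name and the example prompt are illustrative choices on our part, not code taken from SDXL itself.
from transformers import CLIPTokenizer

# Load the tokenizer belonging to the CLIP model that SDXL's first text encoder is based on
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# Turn the prompt into a fixed-length sequence of token ids (padded/truncated to 77 tokens)
tokens = tokenizer(
    "a steaming hot cup of coffee standing on a small table",
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
print(tokens.input_ids.shape)  # torch.Size([1, 77])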
In the context of Stable Diffusion XL, this process combines transformer-based models and a Variational Autoencoder to generate high-quality images from textual descriptions. The model leverages a range of techniques, including:
- Text encoding: Using neural text encoders (two CLIP-based encoders in SDXL) to encode the input prompt into fixed-size vector representations.
- Image synthesis: Employing a Variational Autoencoder to map between pixel space and a compact latent space, in which the image is actually generated and from which the final image is decoded.
- Diffusion process: Applying a series of iterative denoising steps to refine the generated latent image and improve its quality.
By combining these techniques, Stable Diffusion XL has achieved state-of-the-art results in text-to-image generation, enabling users to create visually stunning images from textual descriptions.
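These building blocks are also visible in the diffusers implementation we use later in this article. As a rough orientation (the attribute names below follow the diffusers pipeline, and loading it downloads several gigabytes of weights), the pipeline exposes the text encoder, the autoencoder, and the denoising UNet as separate components:
from diffusers import DiffusionPipeline

# Inspect the main components of the SDXL pipeline
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
print(type(pipe.text_encoder).__name__)  # text encoding (CLIP-based)
print(type(pipe.vae).__name__)           # Variational Autoencoder used for image synthesis
print(type(pipe.unet).__name__)          # UNet performing the iterative denoising steps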
Utilizing Stable Diffusion XL
Why Stable Diffusion XL?
While newer models like Flux-Dev and Stable Diffusion 3 offer impressive capabilities, our decision to use SDXL stems from its usability in scenarios where the VRAM requirements of those newer models are simply too high. Their immense computational demands can quickly become a bottleneck, making them impractical to deploy on the lower-end hardware typically found in private households.
In contrast, SDXL’s more modest VRAM requirements make it a viable option for those seeking to leverage the benefits of generative image modeling without being constrained by limited system resources.
As an example: you can run SDXL on NVIDIA graphics cards with 16 GB of VRAM or less, depending on whether you use offloading or not.
Although SDXL uses an older architecture, this guide series aims to provide insights and guidelines that carry over to newer architectures once their hardware requirements are within reach. By providing actionable advice on image generation best practices, we hope to empower users to harness the full potential of generative image models, regardless of their chosen architecture or hardware configuration.
Example Code
We can use the model through the Hugging Face diffusers library. For this, we need two imports:
from diffusers import DiffusionPipeline
import torch
sdxl.py
Next, we download the model and transfer it to the GPU for significantly faster inference.
Running this model on CPU alone is not recommended. While inference with a GPU usually takes a matter of seconds (with 50-80 inference steps per image), the same process can take up to half an hour with CPU only, depending on your CPU.
# Download the SDXL base weights in half precision ("fp16") and move the pipeline to the GPU
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
).to("cuda")
sdxl.py
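As a small side note: the snippet above assumes a CUDA-capable GPU and will fail otherwise. If you are unsure whether one is available, you can choose the device dynamically instead of hard-coding "cuda"; the device variable below is our own addition and not part of the original snippet.
# Hypothetical variant: choose the device at runtime instead of hard-coding "cuda"
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipe.to(device)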
Next, we define a small wrapper function that generates an image from a prompt and saves it to the local file system.
def generateImage(prompt, inferenceSteps=50):
    # run the pipeline and take the first image of the returned batch
    image = pipe(prompt, num_inference_steps=inferenceSteps).images[0]
    # save the image, reusing the prompt as the file name
    image.save(prompt + ".jpeg")
sdxl.py
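A call to this wrapper then looks as follows; the prompt is one of the examples from the next section. Note that the prompt is reused as the file name, so prompts containing characters such as "/" would need to be sanitized first.
# Generates and saves "photorealistic shot, a sandwich laying on a white plate.jpeg"
generateImage("photorealistic shot, a sandwich laying on a white plate")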
For users with limited VRAM, there is a tradeoff between speed and memory requirements. The line pipe.enable_sequential_cpu_offload() lowers the VRAM needed for image generation at the cost of slower inference, by offloading parts of the model's parameters to the CPU. You can read more about this here.
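As a rough sketch (assuming the same model and pipeline as above), the offloading variant of the loading code looks like this; diffusers then moves the submodules between CPU and GPU on demand, so the pipeline should not be moved to the GPU manually with .to("cuda").
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
# Offload submodules to the CPU and load each onto the GPU only while it is needed
pipe.enable_sequential_cpu_offload()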
With that out of the way, let's get started!
Example Images
In this section, we showcase examples created using Stable Diffusion XL, demonstrating its capabilities through carefully crafted pairs of images. For each prompt, two images are generated to illustrate the model’s output, with their corresponding prompts placed above them for easy comparison.
cartoonish style, a church sitting atop a cliff, night time, the moon shining into the church


cartoonish style, a steaming hot cup of coffee standing on a small table, inside a wooden house, light shining down from above


concept art, a futuristic space ship, entering the atmosphere of a vulcanic planet, black and white image


digital painting, a futuristic car entering an outdated city


photorealistic shot, a person strolling at the beach, sunset in the background, camera facing onto the ocean


photorealistic shot, a sandwich laying on a white plate


photorealistic shot, a tram going down a densely populated street, all people are blurred out, nighttime, lanterns are shining dimly


photorealistic shot, a large amber that encloses a fire


photorealistic shot, a metal key, glowing due to heat


photorealistic shot, a stash of treasures in the corner of a room, dark fortress background with dark brown bricks


digital painting, a person sitting in front of a computer, programming a website


In our generated images, we’ve observed several characteristics:
- Reducing photorealism: Photorealistic shots can sometimes appear overly clean, particularly when creating images of everyday subjects like our sandwich example. To mitigate this, we recommend adding further guiding details to the prompt, for example more background elements.
- Capturing human postures: While photorealistic shots of humans can look realistic, capturing them from the front often results in less convincing outcomes. However, guiding the generation of posture can lead to more lifelike scenes, as seen in our beach scene image.
Image generation models of SDXL's generation typically struggle with human faces because of the level of detail required. Here, we recommend either very tightly guiding prompts or switching to a different model.
- Object and landscape rendering: Photorealistic images of objects and landscapes can be stunning, with varying levels of success depending on the complexity of the subject.
- Cartoonish quality: Cartoonish images generated by SDXL consistently turned out high quality, with no notable shortcomings in our testing. This is likely because cartoonish images typically don't suffer from small random artifacts (such as not-quite-straight lines) as much as other types of digital content do.
- Digital paintings with depth blur: One of the benefits of using SDXL for digital paintings is the ability to achieve a natural-looking depth blur without requiring explicit specification. This makes it ideal for creating abstract images with a strong sense of depth.
- Avoiding overdetailing: It’s essential to balance image complexity with the level of detail required. Overloading the image generator with too many details can result in an overall decrease in quality, as seen in our tram image example.
TL;DR
In this blog post, we investigated and evaluated basic concepts of text-to-image generation using Stable Diffusion XL.
- Photorealistic shots can sometimes appear overly clean, particularly when creating images of everyday subjects like our sandwich example. To mitigate this, we recommend adding further guiding details to the prompt, for example more background elements.
- While photorealistic shots of humans can look realistic, capturing them from the front often results in less convincing outcomes. However, guiding the generation of posture can lead to more lifelike scenes, as seen in our beach scene image. Photorealistic images of objects and landscapes, on the other hand, can be stunning, with varying levels of success depending on the complexity of the subject.
- Cartoonish images generated by SDXL consistently produce high-quality results, without any notable shortcomings in our testing.
- One of the benefits of using SDXL for digital paintings is the ability to achieve a natural-looking depth blur without requiring explicit specification. This makes it ideal for creating abstract images with a strong sense of depth.
- It’s essential to balance image complexity with the level of detail required. Overloading the image generator with too many details can result in an overall decrease in quality, as seen in our tram image example.
In our next article in this series, we are going to focus on what a good prompt should incorporate and how that is correlated with the quality of our images.