Recap and Outlook
In our previous blog post, we explored the basics of text-to-image neural network technology. We delved into its rise to prominence, defined the core components of a text-to-image pipeline, and provided an overview using the Stable Diffusion XL model as a case study.
Now, we’re building on that foundation by advancing to more sophisticated techniques for improving image quality during generation. In this section, we’ll outline established baseline steps that can enhance image outcomes, followed by a discussion on where to find additional resources for refining your results. Finally, we’ll dive into an example of how these improvements manifest in real-world applications.
Advanced Prompt Engineering for Text-to-Image Tasks
Before delving into what makes our prompts effective, it’s worth exploring what aspects of a prompt matter most when building them. Fortunately, a scientific study has already distilled the key parameters to consider. Here’s a summary of the most crucial bullet points: 1
• When picking the prompt, focus on subject and style keywords instead of connecting words. Rephrasings using the same keywords do not make a significant difference in the quality of the generation, as no prompt permutation consistently succeeds over the rest.
• When generating, use between $$3$$ and $$9$$ different seeds to get a representative idea of what a prompt can return (see the code sketch below). Generations may differ significantly owing to the stochastic nature of hyperparameters such as random seeds and initializations. Returning multiple results acknowledges this stochastic nature to users.
• When generating, for fast iteration, shorter optimization lengths of between $$100$$ and $$500$$ iterations are sufficient. The study found that the number of iterations and the length of optimization did not significantly correlate with user satisfaction with the generation.
• When choosing the style of the generation, feel free to try any style, no matter how niche or broad. The models capture an impressive breadth of style information and can be surprisingly good even for niche styles. However, avoid style keywords that may be prone to misinterpretation.
• When picking the subject of the generation, pick subjects that complement the chosen style in their level of abstractness. This can be done by matching how abstract or concrete the subject and style are, or by pairing subjects that are easily interpretable or highly relevant to the style.
[…]
The key takeaway is that while these findings may not be universally applicable due to variations across models and generation topics, they provide a theoretical foundation for improving generation quality and offer valuable insight into neural network capabilities.
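To make the multi-seed guideline concrete, here is a minimal sketch using the Hugging Face diffusers library. The checkpoint and the number of seeds are illustrative choices on our part, not prescriptions from the study:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load SDXL; assumes a CUDA GPU with enough VRAM for fp16 inference.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a medieval castle, digital painting"

# Guideline: sample a handful of seeds (3 to 9) to see the range of
# outputs a prompt can produce before judging or refining it.
for seed in range(5):
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"castle_seed_{seed}.png")
```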
With this background on prompt improvement in mind, we can look at a concrete template that serves as a syntactic foundation. The anatomy of a good prompt can look like this: 2
[1] Subject, [2] Detailed Imagery, [3] Environment Description, [4] Mood/Atmosphere Description, [5] Style, [6] Style Execution
The elements of the anatomy can be put into perspective in the following way:
- Subject: A detailed description of what the image should display. This focuses on what the image centers on; finer details can be added later.
- Detailed Imagery: A description that adds detail to the elements already introduced by the subject. This can include the texture of items, the clothing of people, the perspective of the point of view, or other details.
- Environment Description: Elements added to the image that are not a direct part of the subject.
- Mood/Atmosphere Description: A description mostly focused on capturing feelings and tension. This might also include brightness, darkness, or contrast parameters.
- Style: A description of the genre the image should resemble.
- Style Execution: A description of the overall drawing style of the image. Examples include the illustration technique, the camera to simulate, or the lighting.
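As a small illustration, the six slots of this anatomy can be filled in and joined mechanically. The example values below are ours and purely illustrative:

```python
# The six anatomy slots, filled with illustrative values.
anatomy = {
    "subject": "a medieval castle on a rocky hill",
    "detailed_imagery": "weathered stone walls, banners fluttering in the wind",
    "environment": "pine forest below, a river winding past",
    "mood": "calm and slightly melancholic, soft morning light",
    "style": "fantasy concept art",
    "style_execution": "digital painting, wide-angle shot, volumetric lighting",
}

# Joining the slots in order yields a complete prompt.
prompt = ", ".join(anatomy.values())
print(prompt)
```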
Note: While we invested a lot of time investigating how to improve prompts for such generation tasks, it's impossible to cover everything. Therefore, we are giving you a list of resources you can use to investigate further, should this information become outdated or insufficient.
On the following pages, you can find resources on how to engineer advanced prompts.
Showcase - An Example
In order to showcase how poorly executed prompt engineering can be distinguished from advanced prompt engineering, we are going to look at how a castle can be generated and improved upon iteratively.
Note: To keep the comparison as consistent as possible, we used the same seed for each generation.
We'll now present each prompt - which was used to generate the image - alongside the resulting image. This lets us observe how the images change depending on the inputs we supply to the neural network.
Before we dive in, we want to give some basic parameters used for the generation process:
- Number of Inference Steps: $$50$$
- Guidance Scale: $$0.8$$
- Negative Prompt: None
- Seed: $$10$$
Unless specified otherwise, these standard parameters apply to each image generated.
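In code, these baseline parameters map directly onto a diffusers call. A minimal sketch, assuming the SDXL base checkpoint (the post does not pin an exact one):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a medieval castle",
    num_inference_steps=50,  # Number of Inference Steps
    guidance_scale=0.8,      # Guidance Scale
    negative_prompt=None,    # Negative Prompt: None
    generator=torch.Generator("cuda").manual_seed(10),  # Seed
).images[0]
image.save("castle.png")
```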
a medieval castle

a medieval environment centered around a castle

a medieval environment centered around a castle, from a human perspective

a medieval environment centered around a castle, from a human perspective, image from outside the castle walls

a medieval environment centered around a castle, from a human perspective, image from outside the castle walls, castle behind a lake, a small village on the left of the castle

a medieval environment centered around a castle, from a human perspective, image from outside the castle walls, castle behind a lake, a small village on the left of the castle, busy atmosphere

Styles
a medieval environment centered around a castle, from a human perspective, image from outside the castle walls, castle behind a lake, a small village on the left of the castle, busy atmosphere, photorealistic style

a medieval environment centered around a castle, from a human perspective, image from outside the castle walls, castle behind a lake, a small village on the left of the castle, busy atmosphere, digital painting

Number of Inference Steps
a medieval environment centered around a castle, from a human perspective, image from outside the castle walls, castle behind a lake, a small village on the left of the castle, busy atmosphere, photorealistic style
Note: This image was generated using $$10$$ inference steps.

a medieval environment centered around a castle, from a human perspective, image from outside the castle walls, castle behind a lake, a small village on the left of the castle, busy atmosphere, photorealistic style
Note: This image was generated using $$15$$ inference steps.

Guidance Scale
a medieval environment centered around a castle, from a human perspective, image from outside the castle walls, castle behind a lake, a small village on the left of the castle, busy atmosphere, photorealistic style
Note: This image was generated using a guidance scale of $$0.2$$.

a medieval environment centered around a castle, from a human perspective, image from outside the castle walls, castle behind a lake, a small village on the left of the castle, busy atmosphere, photorealistic style
Note: This image was generated using a guidance scale of $$2.0$$.

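The step and guidance sweeps shown above can be reproduced by varying one argument at a time while keeping the seed fixed. A sketch, reusing `pipe` from the baseline example:

```python
# Reuses `pipe` from the baseline sketch above.
prompt = (
    "a medieval environment centered around a castle, from a human "
    "perspective, image from outside the castle walls, castle behind a "
    "lake, a small village on the left of the castle, busy atmosphere, "
    "photorealistic style"
)

# Sweep the number of inference steps with everything else fixed.
for steps in (10, 15, 50):
    image = pipe(
        prompt,
        num_inference_steps=steps,
        guidance_scale=0.8,
        generator=torch.Generator("cuda").manual_seed(10),
    ).images[0]
    image.save(f"castle_steps_{steps}.png")

# Sweep the guidance scale with everything else fixed.
for scale in (0.2, 0.8, 2.0):
    image = pipe(
        prompt,
        num_inference_steps=50,
        guidance_scale=scale,
        generator=torch.Generator("cuda").manual_seed(10),
    ).images[0]
    image.save(f"castle_guidance_{scale}.png")
```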
Negative Prompt
a medieval environment centered around a castle, from a human perspective, image from outside the castle walls, castle behind a lake, a small village on the left of the castle, busy atmosphere, photorealistic style
Note: This image was generated using the negative prompt of “mountains in the background”.

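The negative prompt is just one more argument on the same call. A sketch, again reusing `pipe` and `prompt` from the sweeps above:

```python
# Steer the sampler away from unwanted content via the negative prompt.
image = pipe(
    prompt,
    num_inference_steps=50,
    guidance_scale=0.8,
    negative_prompt="mountains in the background",
    generator=torch.Generator("cuda").manual_seed(10),
).images[0]
image.save("castle_no_mountains.png")
```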
As a last step, to showcase the diffusion process of text-to-image models, we have generated an animation. Enjoy!

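For the curious, one way to capture such an animation is to decode the intermediate latents at each denoising step. This is a sketch assuming a recent diffusers version that supports `callback_on_step_end`; in practice, SDXL's fp16 VAE can produce artifacts on intermediate decodes, so an fp16-fixed VAE or fp32 upcasting may be needed:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

frames = []

def grab_frame(pipeline, step, timestep, callback_kwargs):
    # Decode the current latents into an image and keep it as one frame.
    latents = callback_kwargs["latents"]
    with torch.no_grad():
        decoded = pipeline.vae.decode(
            latents.to(pipeline.vae.dtype) / pipeline.vae.config.scaling_factor
        ).sample
    frames.append(pipeline.image_processor.postprocess(decoded)[0])
    return callback_kwargs

pipe(
    "a medieval castle",
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(10),
    callback_on_step_end=grab_frame,
    callback_on_step_end_tensor_inputs=["latents"],
)

# Stitch the per-step frames into a GIF.
frames[0].save("diffusion.gif", save_all=True,
               append_images=frames[1:], duration=100, loop=0)
```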
TL;DR
Effective prompt engineering is essential for realizing the full potential of Stable Diffusion XL (or any text-to-image model). A well-designed prompt can elevate a mundane image into a breathtaking work of art, while a poorly crafted one may result in nonsensical or even unsettling output. By thoughtfully selecting specific words, phrases, and parameters, users can harness the power of the model to generate images that not only captivate the eye but also resonate with their artistic vision.