Training-Free Layout Control with Cross-Attention Guidance

Visual Geometry Group, University of Oxford




We present a method for controlling the layout of images generated by large pre-trained text-to-image models by spatially guiding the cross-attention patterns the model produces. Our method requires no further training or fine-tuning.


Recent diffusion-based generators can produce high-quality images based only on textual prompts. However, they do not correctly interpret instructions that specify the spatial layout of the composition. We propose a simple approach that can achieve robust layout control without requiring training or fine-tuning the image generator. Our technique, which we call layout guidance, manipulates the cross-attention layers that the model uses to interface textual and visual information and steers the reconstruction in the desired direction given, e.g., a user-specified layout. In order to determine how to best guide attention, we study the role of different attention maps when generating images and experiment with two alternative strategies, forward and backward guidance. We evaluate our method quantitatively and qualitatively with several experiments, validating its effectiveness. We further demonstrate its versatility by extending layout guidance to the task of editing the layout and context of a given real image.



We present two types of guidance for controlling the image layout: forward and backward. In forward guidance, we use a smooth windowing function to bias the original cross-attention map of a specific token towards a user-specified region (such as a bounding box), "forcing" the generated image to conform to the desired layout. In backward guidance, we compute a loss between the cross-attention map and the bounding box to measure whether the attention follows the desired pattern, then update the latent via back-propagation so that the attention focuses on the specified region.
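The two strategies can be illustrated on a toy attention map. The sketch below is a minimal illustration, not the released implementation: the function names (`box_mask`, `forward_guidance`, `backward_loss`), the `strength` parameter, and the use of a hard binary box in place of a smooth window are all assumptions made for brevity. The backward part only evaluates a loss of the kind described above; in a real pipeline the gradient of such a loss with respect to the latent would be obtained with an autodiff framework.

```python
# Toy sketch of forward and backward guidance on a single token's attention map.
# All names and the blending rule are illustrative assumptions.

def box_mask(height, width, box):
    """Binary mask for box = (x0, y0, x1, y1); x1 and y1 are exclusive.

    Assumption: a hard box stands in for the smooth windowing function.
    """
    x0, y0, x1, y1 = box
    return [[1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0
             for x in range(width)]
            for y in range(height)]


def forward_guidance(attn, mask, strength=0.8):
    """Blend the attention map towards the box, keeping its total mass."""
    total = sum(map(sum, attn))
    mask_total = sum(map(sum, mask))
    return [[(1.0 - strength) * a + strength * m * total / mask_total
             for a, m in zip(attn_row, mask_row)]
            for attn_row, mask_row in zip(attn, mask)]


def backward_loss(attn, mask):
    """Squared fraction of attention mass that falls outside the box."""
    inside = sum(a * m
                 for attn_row, mask_row in zip(attn, mask)
                 for a, m in zip(attn_row, mask_row))
    total = sum(map(sum, attn))
    return (1.0 - inside / total) ** 2


# Uniform 4x4 attention map; the target box is the top-left 2x2 block.
attn = [[1.0 / 16] * 4 for _ in range(4)]
mask = box_mask(4, 4, (0, 0, 2, 2))

loss_before = backward_loss(attn, mask)
guided = forward_guidance(attn, mask, strength=0.8)
loss_after = backward_loss(guided, mask)
print(loss_before, loss_after)
```

In this toy setting the loss is zero only when all attention mass lies inside the box, so lowering it (by a gradient step on the latent in backward guidance, or by directly reweighting the map in forward guidance) moves the token's attention towards the requested region.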


Comparison to Text-to-Image Generation Methods

Our method, built on top of Stable Diffusion, achieves image generation with correct spatial relationships. Some examples are from here.


Real Image Editing

We achieve real image editing based on DreamBooth and Textual Inversion. Specifically, we can change the context, location, and size of objects in the original image.



      title={Training-Free Layout Control with Cross-Attention Guidance},
      author={Minghao Chen and Iro Laina and Andrea Vedaldi},
      journal={arXiv preprint arXiv:2304.03373},


This research is supported by ERC-CoG UNION 101001212.