Abstract

One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images that faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that achieve state-of-the-art performance. (1) First, we find that current vision-language datasets do not represent spatial relationships well enough; to alleviate this bottleneck, we create SPRIGHT, the first spatially focused, large-scale dataset, by re-captioning 6 million images from 4 widely used vision datasets. Through a 3-fold evaluation and analysis pipeline, we find that SPRIGHT largely improves upon existing datasets in capturing spatial relationships. To demonstrate its efficacy, we leverage only ~0.25% of SPRIGHT and achieve a 22% improvement in generating spatially accurate images, while also improving the FID and CMMD scores. (2) Second, we find that training on images containing a large number of objects results in substantial improvements in spatial consistency. Notably, we attain state-of-the-art performance on T2I-CompBench, with a spatial score of 0.2133, by fine-tuning on fewer than 500 images. (3) Finally, through a set of controlled experiments and ablations, we document multiple findings that we believe will enhance the understanding of factors that affect spatial consistency in text-to-image models. We publicly release our dataset and model to foster further research in this area.

The SPRIGHT Dataset

SPRIGHT (SPatially RIGHT) is the first spatially focused, large-scale vision-language dataset. It was built by re-captioning ~6 million images from 4 widely used datasets (CC-12M, Segment Anything, COCO validation, and LAION Aesthetics).

SPRIGHT captions, generated by prompting LLaVA-1.5-13B, show a significant increase in the occurrence of spatial phrases compared to web-scraped captions (LAION), captions from captioning models (CoCa on CC-12M and Segment Anything), and even human-annotated captions (COCO).


Figure: Increased occurrence of spatial keywords in SPRIGHT captions compared to the original captions.
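
For illustration, the snippet below sketches how an image could be re-captioned with a spatially focused prompt using the publicly available LLaVA-1.5-13B checkpoint on Hugging Face (llava-hf/llava-1.5-13b-hf). The instruction text and generation settings are placeholders and not necessarily the exact configuration used to build SPRIGHT.

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-13b-hf"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Illustrative spatially focused instruction; the exact SPRIGHT prompt may differ.
INSTRUCTION = (
    "Describe this image in detail, focusing on the spatial relationships "
    "between objects (left/right, above/below, in front of/behind, near/far)."
)

def recaption(image_path: str, max_new_tokens: int = 256) -> str:
    image = Image.open(image_path).convert("RGB")
    prompt = f"USER: <image>\n{INSTRUCTION} ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    text = processor.decode(output_ids[0], skip_special_tokens=True)
    # Keep only the assistant's reply as the new caption.
    return text.split("ASSISTANT:")[-1].strip()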

SPRIGHT's synthetic captions achieve high correctness scores in automated evaluations with FAITHScore and GPT-4(V). In addition, human annotators judge 66.57% of the captions to contain no errors.


Figure: Evaluation of SPRIGHT captions with FAITHScore and GPT-4(V).
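
As a rough sketch of what an automated correctness check might look like, the snippet below asks a GPT-4 vision model to rate a caption against its image via the OpenAI Python SDK. The model name, rating rubric, and prompt wording here are assumptions and not the paper's exact evaluation protocol.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rate_caption(image_path: str, caption: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    # Hypothetical rubric; the actual GPT-4(V) evaluation prompt may differ.
    question = (
        "On a scale of 1 to 4, how factually correct is the following caption "
        "for this image, paying particular attention to spatial relationships? "
        f"Answer with a single number.\n\nCaption: {caption}"
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # any GPT-4 model with vision support
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=10,
    )
    return response.choices[0].message.content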

Training Methodology

We fine-tune our model on 444 images, each containing a large number of objects, paired with their corresponding SPRIGHT captions. We hypothesize that (a) images capturing a large number of objects inherently contain multiple spatial relationships, and (b) training on such images optimizes the model to consistently generate a large number of objects when given a prompt containing spatial relationships, which is a current failure mode of existing T2I models.
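
A minimal sketch of one way such object-rich images could be selected, assuming COCO-style instance annotations; the actual selection criteria and threshold behind the 444 training images are not specified here, so the function name and cutoff below are illustrative.

import json
from collections import Counter

MIN_OBJECTS = 10  # hypothetical threshold for "a large number of objects"

def select_object_rich_images(annotation_file: str, min_objects: int = MIN_OBJECTS):
    """Return image records whose annotated instance count meets the threshold."""
    with open(annotation_file) as f:
        coco = json.load(f)
    counts = Counter(ann["image_id"] for ann in coco["annotations"])
    keep = {image_id for image_id, n in counts.items() if n >= min_objects}
    return [img for img in coco["images"] if img["id"] in keep]

# e.g. images = select_object_rich_images("instances_val2017.json")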

Quantitative Evaluations

We benchmark our model on VISOR, T2I-CompBench, and GenEval, and find that it significantly improves upon existing methods in terms of spatial consistency. Furthermore, our fine-tuning method also improves image fidelity, as reflected in lower FID and CMMD scores.


Figure: Benchmark tables comparing our model against other T2I models.
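
For reference, FID (where lower is better) can be computed with torchmetrics, which requires the torch-fidelity extra to be installed. The toy tensors below stand in for batches of real and generated images; real evaluations use many thousands of images.

import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # Inception-v3 pooled features

# Toy uint8 batches of shape (N, 3, H, W); replace with real and generated images.
real_images = torch.randint(0, 256, (32, 3, 299, 299), dtype=torch.uint8)
generated_images = torch.randint(0, 256, (32, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print(f"FID: {fid.compute().item():.2f}")  # lower is better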

BibTeX

@misc{chatterjee2024getting,
      title={Getting it Right: Improving Spatial Consistency in Text-to-Image Models}, 
      author={Agneet Chatterjee and Gabriela Ben Melech Stan and Estelle Aflalo and Sayak Paul and Dhruba Ghosh and Tejas Gokhale and Ludwig Schmidt and Hannaneh Hajishirzi and Vasudev Lal and Chitta Baral and Yezhou Yang},
      year={2024},
      eprint={2404.01197},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}