Getting it Right: Improving Spatial Consistency in Text-to-Image Models

Agneet Chatterjee¹,*,‡, Gabriela Ben Melech Stan²,*, Estelle Aflalo², Sayak Paul³, Dhruba Ghosh⁴, Tejas Gokhale⁵, Ludwig Schmidt⁴, Hannaneh Hajishirzi⁴, Vasudev Lal², Chitta Baral¹, Yezhou Yang¹

¹Arizona State University, ²Intel Labs, ³Hugging Face, ⁴University of Washington, ⁵University of Maryland, Baltimore County
*Equal contribution.
‡Corresponding author.

ECCV 2024

October 2, 4:30 PM CEST (#213)

In a lush, green meadow, a large, colorful hot air balloon is preparing to ascend, positioned on the far right. On the left, a group of small rabbits, each no bigger than a balloon basket, curiously watches from a safe distance.

In the foreground, a grand piano, about three times the size of a cat, is positioned at the center. Behind it, slightly to the right, a window reveals a bright, full moon in the dark night sky, casting a gentle glow on the piano.

Above, a massive, dark storm cloud looms, filling the top half of the image with its ominous presence. Below, a small, winding river flows, while to the right a small house stands alone.

A cozy cabin nestled in the woods, with a stream flowing in front and a fire burning in the fireplace inside.

A cat sitting on a chair with a lamp to the right and a window above, casting shadows on the floor below.

A garden with rows of vegetables growing, with a scarecrow standing guard to the left and a greenhouse next to the flowers, on the right.

A large, full moon dominates the top right corner of the image, casting a soft glow on a small, abandoned house below, situated in the center of a barren field. In the foreground, a twisted, gnarled tree leans towards the house.

A telescope pointed at the stars, with planets orbiting in the distance and a moon shining brightly overhead.

A giant, open book lies in the center of a wooden table, with tiny, detailed illustrations of mythical creatures scattered across its pages. In the background, a small, glowing lamp casts light over the scene, with small dragons flying above.

A person standing on a hill, with a rainbow stretching across the sky behind them and a valley spreading out below.

Inside a cozy living room, a fireplace crackles with warmth. To the left of the fireplace, a bookshelf stands filled with well-loved novels, while to the right, a comfortable sofa beckons relaxation.

A yellow sun shining in the sky above a green meadow, with a river winding its way below, and a mountain range towering in the distance.

A train traveling along tracks, with mountains towering to the right and fields of flowers blooming to the left.

A vibrant coral reef occupies the bottom half of the image, with a large sea turtle swimming above it towards the right. In the distant background, a small school of fish forms a swirling pattern, with the sunlight filtering through the water from the top left corner, illuminating the scene.

An airplane above a bench.

An apple above a cow.

A horse above a pizza.

A surfboard above a motorcycle.

A TV above a toilet.

A cup below a bus.

A cell phone below a cow.

A bicycle below a teddy bear.

A chair below a tie.

A bicycle below a traffic light.

An elephant to the left of an apple.

A bear to the left of an umbrella.

A fire hydrant to the left of a dining table.

A refrigerator to the left of a toilet.

A zebra to the left of a toaster.

A fire hydrant to the right of an airplane.

A bus to the right of a donut.

A carrot to the right of a cat.

A giraffe to the right of a truck.

A hair drier to the right of a wine glass.

Abstract

One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images that faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that achieve state-of-the-art performance. (1) First, we find that current vision-language datasets do not represent spatial relationships well enough; to alleviate this bottleneck, we create SPRIGHT, the first spatially focused, large-scale dataset, by re-captioning 6 million images from 4 widely used vision datasets. Through a 3-fold evaluation and analysis pipeline, we find that SPRIGHT largely improves upon existing datasets in capturing spatial relationships. To demonstrate its efficacy, we leverage only ~0.25% of SPRIGHT and achieve a 22% improvement in generating spatially accurate images while also improving the FID and CMMD scores. (2) Second, we find that training on images containing a large number of objects results in substantial improvements in spatial consistency. Notably, we attain state-of-the-art on T2I-CompBench with a spatial score of 0.2133, by fine-tuning on fewer than 500 images. (3) Finally, through a set of controlled experiments and ablations, we document multiple findings that we believe will enhance the understanding of factors that affect spatial consistency in text-to-image models. We publicly release our dataset and model to foster further research in this area.

The SPRIGHT Dataset

SPRIGHT (SPatially RIGHT) is the first spatially focused, large-scale vision-language dataset. It was built by re-captioning ~6 million images from 4 widely used datasets (CC12M, Segment Anything, COCO validation, and LAION Aesthetics).

a couple of cars parked next to each other on the street

The image features two old-fashioned cars parked next to each other, with one being a vintage black car and the other being a purple car. The black car is parked in front of the purple car, and both cars are positioned next to a building. The black car is larger than the purple car, and the purple car is parked behind the black car.

The red and black purses are placed next to each other on a glass shelf. The red purse is on the left side and the black purse is on the right side.

The image features a bucket filled with ice and three cans of beer placed on top of it. The cans are positioned in a triangular arrangement, with one can on the left side, one on the right side, and the third one in the center. The bucket is larger than the cans, and the cans are placed on top of the bucket, creating a visually appealing and refreshing scene.

a white and red sports car driving down a road

The image features a white and red sports car driving on a race track. The car is positioned on the left side of the image, and it is relatively small compared to the vast empty space surrounding it. The car is also driving on a road that is adjacent to a mountain, which adds to the overall scenery of the image.

a red brick building sitting in the middle of a field

The image shows a large red building with a sign in front of it. The building is situated next to a road, and there are two potted plants on the sidewalk in front of the building. The potted plants are relatively small in comparison to the large building, and they are positioned on the left and right sides of the building. The sign in front of the building is also relatively small compared to the building and the potted plants.

a red and yellow truck with a large light bar on top

The image features a large red truck parked on a dirt road, with a mountain in the background. The truck is positioned in the foreground, occupying a significant portion of the image, while the mountain is in the background, providing a sense of scale and depth to the scene.

looking up at a construction site with scaffolding

The image shows a large structure with a lot of steel beams and scaffolding. The structure is being built inside a large building, and the scaffolding is being used to support the construction process. The steel beams are of various sizes, with some being larger and more prominent than others. The scaffolding is also of different sizes, with some being larger and more prominent than others. The overall scene is a complex network of steel beams and scaffolding, creating a sense of depth and complexity in the image.

a group of houses sitting on top of a green hillside

The image shows a group of houses situated on a hillside, with some houses being closer to the foreground and others further back. The houses are surrounded by trees, creating a picturesque and serene landscape. The houses are of various sizes, with some being larger and more prominent than others. The overall scene is a mix of natural and man-made elements, showcasing the harmony between the houses and their environment.

a close - up of the back of a blue vivo phone

The blue cell phone is sitting on a white surface, with the back of the phone facing the camera. The phone is large compared to the white surface it is placed on.

a castle on top of a hill in the middle of the day

The image features a large castle situated on top of a hill, with a village of houses located below it. The castle is significantly taller than the houses, and the houses are spread out in the valley below the castle, creating a picturesque scene with the castle towering over the village.

The image shows a group of boats docked together in a harbor, with some boats being larger and others smaller. The boats are parked in a line, with some boats being closer to the foreground and others further in the background. The boats are positioned in a way that they are adjacent to each other, creating a sense of unity and organization within the harbor.

SPRIGHT captions, generated by prompting LLaVA-1.5-13B, show a significant increase in spatial phrase occurrences compared to web-scraped captions (LAION), captioning models (CoCa on CC12M and Segment Anything), and even human annotators (COCO).

table showing increased occurrence of spatial keywords in SPRIGHT captions
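
The comparison above rests on counting how often spatial phrases occur in a set of captions. A minimal sketch of such a count is shown below. The phrase list and the function names `count_spatial_phrases` and `spatial_coverage` are assumptions made for illustration; they are not the keyword list or tooling used to build the SPRIGHT table.

```python
import re
from collections import Counter

# Assumed phrase list, for illustration only (not the list used for SPRIGHT).
SPATIAL_PHRASES = [
    "left", "right", "above", "below", "behind",
    "in front of", "next to", "on top of", "under",
]


def count_spatial_phrases(caption, phrases=SPATIAL_PHRASES):
    """Count whole-word occurrences of each spatial phrase in one caption."""
    text = caption.lower()
    counts = Counter()
    for phrase in phrases:
        # \b anchors keep "left" from matching inside words such as "cleft".
        pattern = r"\b" + re.escape(phrase) + r"\b"
        counts[phrase] = len(re.findall(pattern, text))
    return counts


def spatial_coverage(captions, phrases=SPATIAL_PHRASES):
    """Fraction of captions that contain at least one spatial phrase."""
    if not captions:
        return 0.0
    hits = sum(
        1 for caption in captions
        if sum(count_spatial_phrases(caption, phrases).values()) > 0
    )
    return hits / len(captions)


if __name__ == "__main__":
    original = "a white and red sports car driving down a road"
    recaption = (
        "The red purse is on the left side and the black purse "
        "is on the right side."
    )
    print(spatial_coverage([original, recaption]))
```

Plain keyword matching is a rough proxy: it cannot tell the direction "left" from the verb "left", so a careful analysis would likely need a more precise matcher.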

SPRIGHT synthetic captions have high correctness scores according to automated evaluations using FAITHScore and GPT-4(V). In addition, human annotators determine that 66.57% of the captions contain no errors.

SPRIGHT evals with FAITHScore and GPT-4(V)

Training Methodology

We fine-tune our model on 444 images and their corresponding SPRIGHT captions, where each image contains a large number of objects. We hypothesize that (a) images that capture a large number of objects inherently contain multiple spatial relationships, and (b) training on such images optimizes the model to consistently generate a large number of objects when given a prompt containing spatial relationships. Failing to generate all of the mentioned objects is a current failure mode of existing T2I models.
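
The selection step described above can be sketched as a simple filter over per-image object annotations. Everything in this sketch is hypothetical: the `ImageRecord` type, the `select_object_dense` helper, and the idea that object labels come from an upstream detector or existing annotations are assumptions for illustration. The object-count threshold is not stated on this page, so it is left as a parameter.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ImageRecord:
    """Hypothetical record: one image, its detected objects, and its caption."""
    image_id: str
    object_labels: List[str]  # assumed to come from an upstream detector
    caption: str


def select_object_dense(
    records: List[ImageRecord],
    min_objects: int,
    max_images: Optional[int] = None,
) -> List[ImageRecord]:
    """Keep images with at least `min_objects` objects, densest first."""
    dense = [r for r in records if len(r.object_labels) >= min_objects]
    dense.sort(key=lambda r: len(r.object_labels), reverse=True)
    return dense if max_images is None else dense[:max_images]


if __name__ == "__main__":
    records = [
        ImageRecord("a", ["car", "car", "building"], "two cars next to a building"),
        ImageRecord("b", ["phone"], "a phone on a white surface"),
        ImageRecord("c", ["boat"] * 5, "boats docked in a line"),
    ]
    subset = select_object_dense(records, min_objects=2)
    print([r.image_id for r in subset])
```

The resulting subset would then be paired with its SPRIGHT captions for fine-tuning; the fine-tuning loop itself is outside the scope of this sketch.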

Quantitative Evaluations

We benchmark our model on VISOR, T2I-CompBench, and GenEval, and find that it significantly improves upon existing methods in terms of spatial consistency. Furthermore, our fine-tuning method significantly improves image fidelity, as quantified by better FID and CMMD scores (lower is better for both).

tables showing performance of our model vs other T2I models
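
To give some intuition for how spatial consistency can be scored automatically, the sketch below checks a relation such as "left of" by comparing the centroids of two detected bounding boxes. It is a simplified illustration in the spirit of detector-based spatial benchmarks, not the exact protocol of VISOR, T2I-CompBench, or GenEval. The function names and the `(x_min, y_min, x_max, y_max)` box format are assumptions, and the boxes are assumed to come from an object detector run on the generated image.

```python
def centroid(box):
    """Center of a box given as (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def relation_holds(box_a, box_b, relation):
    """Return True if object A stands in `relation` to object B."""
    ax, ay = centroid(box_a)
    bx, by = centroid(box_b)
    if relation == "left of":
        return ax < bx
    if relation == "right of":
        return ax > bx
    # Image coordinates: y grows downward, so "above" means a smaller y.
    if relation == "above":
        return ay < by
    if relation == "below":
        return ay > by
    raise ValueError(f"unsupported relation: {relation}")


def spatial_accuracy(samples):
    """Fraction of (box_a, box_b, relation) samples whose relation holds."""
    if not samples:
        return 0.0
    correct = sum(1 for a, b, rel in samples if relation_holds(a, b, rel))
    return correct / len(samples)


if __name__ == "__main__":
    # Hypothetical detections for the prompt "An apple above a cow."
    apple = (10, 10, 50, 50)
    cow = (100, 60, 200, 160)
    print(relation_holds(apple, cow, "above"))
```

A full evaluation would also need to handle images in which one or both objects are not detected at all; those cases are left out of this sketch.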

BibTeX

@misc{chatterjee2024getting,
      title={Getting it Right: Improving Spatial Consistency in Text-to-Image Models}, 
      author={Agneet Chatterjee and Gabriela Ben Melech Stan and Estelle Aflalo and Sayak Paul and Dhruba Ghosh and Tejas Gokhale and Ludwig Schmidt and Hannaneh Hajishirzi and Vasudev Lal and Chitta Baral and Yezhou Yang},
      year={2024},
      eprint={2404.01197},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}