GPS as a Control Signal for Image Generation

TL;DR: We can generate images (GPS-to-image) conditioned on GPS coordinates and text prompts in a compositional manner or extract 3D models (GPS-to-3D) from unordered GPS-tagged images.

New York City (Generation)


Abstract

We show that the GPS tags contained in photo metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city. In particular, we train a diffusion model to generate images conditioned on both GPS and text. The learned model generates images that capture the distinctive appearance of different neighborhoods, parks, and landmarks. We also extract 3D models from 2D GPS-to-image models through score distillation sampling, using GPS conditioning to constrain the appearance of the reconstruction from each viewpoint. Our evaluations suggest that our GPS-conditioned models successfully learn to generate images that vary based on location, and that GPS conditioning improves estimated 3D structure.


Method

(a) After downloading geotagged photos, we train GPS-to-image diffusion models conditioned on GPS tags and image captions. GPS tags are extracted from the image EXIF metadata, and captions are generated by BLIP-3. The resulting GPS-to-image diffusion model can generate images using both conditioning signals (GPS and text) in a compositional manner. (b) We can also extract 3D models from landmark-specific angle-to-image diffusion models using score distillation sampling. The "+" in the figure denotes concatenation of the GPS embeddings and text embeddings.


(a) GPS-to-image generation (b) GPS-to-3D reconstruction
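The conditioning scheme above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the Fourier-feature GPS encoder, the coordinate normalization, and the zero-padding of the GPS token to the text-embedding width are all assumptions made for the sketch; only the overall idea (embed the GPS tag, concatenate it with the text embeddings) comes from the method description.

```python
import numpy as np

def gps_fourier_embedding(lat, lon, num_freqs=8):
    """Encode a GPS tag with sinusoidal Fourier features.

    Latitude/longitude are normalized to [-1, 1]; the frequency ladder and
    count are illustrative choices, not the paper's exact encoder.
    """
    coords = np.array([lat / 90.0, lon / 180.0])        # normalize to [-1, 1]
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi       # geometric frequencies
    angles = coords[:, None] * freqs[None, :]           # (2, num_freqs)
    # sin/cos pairs for each coordinate -> (2 * 2 * num_freqs,)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).reshape(-1)

def concat_conditioning(gps_emb, text_emb):
    """Prepend the GPS embedding to the text token sequence (the '+' above).

    text_emb has shape (seq_len, dim); the GPS features are zero-padded to
    the token width so they can occupy one extra token slot.
    """
    dim = text_emb.shape[-1]
    gps_token = np.zeros(dim)
    gps_token[: gps_emb.size] = gps_emb                 # pad GPS features to width
    return np.vstack([gps_token[None, :], text_emb])    # (1 + seq_len, dim)
```

The concatenated sequence would then be consumed by the diffusion model's cross-attention layers in place of the plain text embedding.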


GPS-to-Image Generation

Our GPS-to-image model composes GPS coordinates and (optional) text prompts as conditions to generate corresponding images. Here, we show generated results compared with a pretrained text-to-image model (Stable Diffusion v1.4) given location text prompts for two well-known areas: New York City and Paris.
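One common way to combine two conditioning signals at sampling time is classifier-free-guidance-style composition of the noise estimates. The sketch below assumes this recipe; the guidance weights, the `eps_model` signature, and the use of independent guidance terms for GPS and text are illustrative assumptions, not the paper's confirmed sampler.

```python
import numpy as np

def composed_eps(eps_model, x_t, t, gps, text, w_gps=7.5, w_text=7.5):
    """Compose GPS and text guidance on the noise prediction.

    eps_model(x_t, t, gps, text) is a hypothetical interface where either
    condition may be None (unconditional). Each condition contributes its own
    guidance term relative to the unconditional prediction.
    """
    e_base = eps_model(x_t, t, None, None)   # unconditional estimate
    e_gps = eps_model(x_t, t, gps, None)     # GPS-only estimate
    e_txt = eps_model(x_t, t, None, text)    # text-only estimate
    return e_base + w_gps * (e_gps - e_base) + w_text * (e_txt - e_base)
```

Setting either weight to zero recovers single-condition guidance, which is one way a model could handle the optional text prompt.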



Average Images

We apply our GPS-to-image models to the problem of obtaining images that are representative of a given concept over a large geographic area. Specifically, we generate a single image that has high probability under all GPS locations within a user-specified area, as measured by our diffusion model. To do this, following prior work on compositional generation, we simultaneously estimate noise vectors for a large number of evenly sampled GPS locations and average them during each step of the reverse diffusion process.
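The averaging step above can be sketched as a single reverse-diffusion update whose noise estimate is the mean over many GPS conditions. This is a minimal sketch: the DDPM-style mean update (with the stochastic noise term omitted) and the `eps_model` interface are assumptions; the averaging of per-location noise predictions is the mechanism described in the text.

```python
import numpy as np

def averaged_reverse_step(x_t, t, gps_samples, eps_model, alpha_t, alpha_bar_t):
    """One reverse-diffusion step with a GPS-averaged noise estimate.

    gps_samples: GPS locations evenly sampled over the target area.
    eps_model(x_t, t, gps) is a hypothetical per-location noise predictor.
    """
    # average the noise predictions over all sampled locations
    eps = np.mean([eps_model(x_t, t, g) for g in gps_samples], axis=0)
    # standard DDPM posterior mean (stochastic term omitted for brevity)
    return (x_t - (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_t)
```

Running this update at every timestep yields one image that is simultaneously plausible at all sampled locations, rather than an image of any single place.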


GPS-to-3D Reconstruction


Through score distillation sampling from our angle-to-image diffusion models, trained on unordered collections of geotagged photos, we obtain better 3D models of scenes than models that use only text conditioning.
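A score distillation sampling (SDS) update for one rendered view can be sketched as below. The angle condition stands in for the viewpoint-dependent GPS/azimuth signal described above. The `(1 - alpha_bar)` weighting, the `eps_model` signature, and the timestep sampling range are common choices assumed for this sketch, not the paper's exact settings.

```python
import numpy as np

def sds_gradient(rendered, angle, eps_model, alphas_bar, rng):
    """SDS gradient w.r.t. a rendered view, conditioned on viewing angle.

    rendered: image rendered from the current 3D model at this viewpoint.
    eps_model(x_t, t, angle) is a hypothetical angle-conditioned noise
    predictor; alphas_bar is the cumulative noise schedule.
    """
    t = rng.integers(1, len(alphas_bar))                  # random timestep
    a_bar = alphas_bar[t]
    noise = rng.standard_normal(rendered.shape)
    # forward-noise the render to timestep t
    x_t = np.sqrt(a_bar) * rendered + np.sqrt(1.0 - a_bar) * noise
    w = 1.0 - a_bar                                       # common SDS weighting
    # gradient pushes the render toward the model's denoising direction
    return w * (eps_model(x_t, t, angle) - noise)
```

Backpropagating this gradient through the renderer into the 3D representation, with the angle condition set per viewpoint, is what constrains the reconstruction's appearance from each direction.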

[Figure: qualitative 3D comparisons (columns: Reference Image, DreamFusion, Ours) for the Leaning Tower of Pisa, Arc de Triomphe, Stonehenge, Space Needle, Statue of Liberty, and Washington Monument]

BibTeX

@article{feng2025gps,
  author = {Feng, Chao and Chen, Ziyang and Holynski, Aleksander and Efros, Alexei A and Owens, Andrew},
  title = {GPS as a Control Signal for Image Generation},
  year = {2025},
}