Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion
SIGGRAPH 2024
[Paper (arXiv)] [Paper (HighRes, 70MB)]
TL;DR: We build a system to generate Streetscapes, long sequences of views through an on-the-fly synthesized, city-scale scene, controlled by layout maps and text. Our system builds on an n-frame (n=2 or 4) video diffusion model and uses autoregressive generation at inference time to produce hundreds of frames. The generated video can be further reconstructed into a NeRF.
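For readers curious how a short n-frame model can yield hundreds of frames, here is a minimal sketch of the autoregressive rollout idea, assuming a hypothetical sampling interface (`sample_n_frames`, the conditioning arguments, and the resolution are placeholders, not the paper's actual API):

```python
import numpy as np

def sample_n_frames(past_frames, g_buffers, text_prompt, n=2):
    """Stand-in for one call to the n-frame video diffusion model.

    Conditions on recently generated frames (for temporal consistency),
    the rendered G-buffers for the next n camera poses, and an optional
    text prompt, and returns n new RGB frames. Here it only returns
    placeholder arrays.
    """
    h, w = 256, 256  # placeholder resolution
    return [np.zeros((h, w, 3), dtype=np.float32) for _ in range(n)]

def generate_streetscape(g_buffer_sequence, text_prompt, n=2, context=2):
    """Autoregressively generate a long video, n frames at a time."""
    frames = []
    for start in range(0, len(g_buffer_sequence), n):
        past = frames[-context:]                    # condition on recent output
        gbufs = g_buffer_sequence[start:start + n]  # per-frame layout conditioning
        frames.extend(sample_n_frames(past, gbufs, text_prompt, n=len(gbufs)))
    return frames  # hundreds of frames along a long camera path
```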
Our system generates large-scale, realistic street scenes in the form of a video. You can control which scene is generated by choosing a layout map and a camera path, along with an optional text prompt. Key to this control is rendering geometry buffers (G-buffers) into screen space as the conditioning signal for video generation.
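As a rough illustration of what that conditioning could look like, the sketch below assembles per-frame G-buffers along a camera path; the rasterizer `render_gbuffer` and the specific channels are assumptions for illustration, not the paper's renderer:

```python
import numpy as np

def render_gbuffer(layout_map, camera_pose, h=256, w=256):
    """Project the city layout (e.g. building footprints and heights) into
    screen space for one camera pose, producing per-pixel geometric channels.
    Placeholder arrays stand in for a real rasterizer's output."""
    return {
        "depth":    np.zeros((h, w), dtype=np.float32),     # per-pixel depth
        "normals":  np.zeros((h, w, 3), dtype=np.float32),  # surface normals
        "semantic": np.zeros((h, w), dtype=np.int32),       # road/building/sky labels
    }

def gbuffers_along_path(layout_map, camera_path):
    """One G-buffer per frame along the user-chosen camera path; this
    sequence conditions the video diffusion model frame by frame."""
    return [render_gbuffer(layout_map, pose) for pose in camera_path]
```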
We can also use Streetscapes to interpolate low-frame-rate, real-world street view captures into a smooth, steady ride along your favourite streets.
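One way to picture this interpolation is to treat the real captures as fixed keyframes and let a conditional video model fill in the frames between them. The sketch below is only an assumed interface (the `sample_fn` signature and keyframe conditioning are hypothetical), not the paper's exact procedure:

```python
def interpolate_captures(real_frames, g_buffer_sequence, frames_between, sample_fn):
    """Insert `frames_between` generated frames between consecutive real captures."""
    video = []
    for a, b in zip(real_frames[:-1], real_frames[1:]):
        video.append(a)
        # Condition generation on the two surrounding real frames plus the
        # G-buffers for the intermediate camera poses (hypothetical signature).
        video.extend(sample_fn(keyframes=(a, b),
                               g_buffers=g_buffer_sequence,
                               num_frames=frames_between))
    video.append(real_frames[-1])
    return video
```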
The text prompt allows us to control the style of Streetscapes, for example by specifying the time of day or the weather. We can even have heavy snow in Barcelona!
We know What Makes Paris Look Like Paris. But Streetscapes knows What Makes Paris Look Like New York City, or London, or Barcelona. That is, we can take the map of a real Paris neighbourhood but make it look like another city using the text prompt.
Video generated from our Streetscapes system (top) compared with videos from a related work, InfiniCity (bottom). We find our results have notably better photorealistic image quality.
@inproceedings{deng2024streetscapes,
  title     = {Streetscapes: Large-scale Consistent Street View Generation
               Using Autoregressive Video Diffusion},
  author    = {Deng, Boyang and Tucker, Richard and Li, Zhengqi
               and Guibas, Leonidas and Snavely, Noah and Wetzstein, Gordon},
  booktitle = {SIGGRAPH 2024 Conference Papers},
  year      = {2024}
}
Acknowledgements: Thanks to Thomas Funkhouser, Kyle Genova, Andrew Liu, Lucy Chai, David Salesin, David Fleet, Jonathan Barron, Qianqian Wang, Shiry Ginosar, Luming Tang, Hansheng Chen, and Guandao Yang for their comments and constructive discussions; to William Freeman and John Quintero for helping review our draft; and to all anonymous reviewers for their helpful suggestions. G.W. was in part supported by Google, Samsung, and Stanford HAI. B.D. was supported by a Meta PhD Research Fellowship. The initial idea for this project was partly inspired by the Star Guitar video created by Michel Gondry and The Chemical Brothers.
Disclaimers:
Google Maps Street View images used with permission from Google.
Results on this page are not real Street View images, but generated scenes that do not exist. The only real street view images are those in the Street Views column of Interpolating Street Views.