Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion

SIGGRAPH 2024

Boyang Deng2✧ Richard Tucker1 Zhengqi Li1 Leonidas Guibas1,2
Noah Snavely1✩ Gordon Wetzstein2✩
1 Google DeepMind        2 Stanford University
✩Equal Contributions
✧Part of this work done as a student researcher at Google

[Paper (arXiv)]        [Paper (HighRes, 70MB)]

Showreel (w/ 🔈, best in HD)

Overview

Method Overview.

TL;DR: We build a system that generates Streetscapes, long sequences of views through a city-scale scene synthesized on the fly, controlled by layout maps and text. Our system builds on an n-frame (n=2 or 4) video diffusion model and uses autoregressive generation at inference to produce hundreds of frames. The generated video can then be reconstructed as a NeRF.
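To make the autoregressive idea concrete, here is a minimal sketch of how a model that only sees n frames at a time could be rolled out over a long camera path. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `overlap` parameter, and the string stand-ins for frames and per-frame conditions are all hypothetical, and the real system's way of keeping consecutive windows consistent may differ.

```python
from typing import Callable, List, Sequence

Frame = str  # stand-in for an image; a real system would use arrays/tensors


def autoregressive_rollout(
    conditions: Sequence[str],
    generate_window: Callable[[List[Frame], List[str]], List[Frame]],
    n: int = 4,
    overlap: int = 1,
) -> List[Frame]:
    """Produce one frame per condition with a model limited to n frames.

    `generate_window(known, window_conditions)` is a hypothetical callable
    that returns len(window_conditions) frames, the first len(known) of
    which reproduce the already generated frames in `known`.
    """
    if not 0 < overlap < n:
        raise ValueError("overlap must satisfy 0 < overlap < n")
    frames: List[Frame] = []
    start = 0
    while len(frames) < len(conditions):
        known = frames[start:]  # previously generated frames inside this window
        window_conditions = list(conditions[start:start + n])
        window = generate_window(known, window_conditions)
        frames.extend(window[len(known):])  # keep only the new frames
        start = len(frames) - overlap  # next window re-uses the last frames
    return frames


def toy_generate_window(known: List[Frame], window_conditions: List[str]) -> List[Frame]:
    """Toy stand-in for the diffusion model: labels frames by their condition."""
    new = [f"frame<{c}>" for c in window_conditions[len(known):]]
    return list(known) + new


# Usage: ten per-frame conditions (e.g. rendered G-buffers along a camera path).
conditions = [f"gbuf{i}" for i in range(10)]
video = autoregressive_rollout(conditions, toy_generate_window, n=4, overlap=1)
```

With n=4 and an overlap of one frame, each window after the first contributes three new frames, so the number of model calls grows linearly with the length of the camera path.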

Generating Streetscapes

NYC_0 Barcelona_0 London_Night Paris_Snow London_0 NYC_1 London_1

Our system generates large-scale, realistic street scenes in the form of a video. You control which scene is generated by picking a layout map and a camera path, along with an optional text prompt. Key to this control is rendering geometry buffers (G-buffers) in screen space, which serve as the conditioning for video generation.

Interpolating Street Views

NYC_0 London_0 Barcelona_0 Paris_1 Barcelona_1 London_1 Paris_0 NYC_1

We can also use Streetscapes to interpolate low frame-rate real-world street view captures into a nice and steady ride along your favourite streets.

Style and Text Prompt

Weather and Time of Day

Evening Sunrise Rain Sun Snow

The text prompt allows us to control the style of Streetscapes, for example by specifying the time of day or the weather. We can even have heavy snow in Barcelona!

What Makes 🥐Paris🥐 Look Like 🥯New York City🥯?

Paris-to-NYC Paris-to-London Paris-to-Barcelona

We know What Makes Paris Look Like Paris. But Streetscapes knows What Makes Paris Look Like New York City, or London, or Barcelona. That is, we can take the map of a real Paris neighbourhood but make it look like another city using the text prompt.

Comparison with InfiniCity



Video generated by our Streetscapes system (top) compared with videos from a related work, InfiniCity (bottom). We find our results to be notably more photorealistic.

Citation

@inproceedings{deng2024streetscapes,
  title = {Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion},
  author = {Deng, Boyang and Tucker, Richard and Li, Zhengqi and Guibas, Leonidas and Snavely, Noah and Wetzstein, Gordon},
  booktitle = {SIGGRAPH 2024 Conference Papers},
  year = {2024}
}

Acknowledgements: Thanks to Thomas Funkhouser, Kyle Genova, Andrew Liu, Lucy Chai, David Salesin, David Fleet, Jonathon Barron, Qianqian Wang, Shiry Ginosar, Luming Tang, Hansheng Chen, and Guandao Yang for their comments and constructive discussions; to William Freeman and John Quintero for helping review our draft; and to all anonymous reviewers for their helpful suggestions. G.W. was in part supported by Google, Samsung, and Stanford HAI. B.D. was supported by a Meta PhD Research Fellowship. The initial idea of this project was partly inspired by the Star Guitar video created by Michel Gondry and The Chemical Brothers.

Disclaimers: Google Maps Street View images used with permission from Google. Results on this page are not real Street View images, but generated scenes that do not exist. The only real Street View images on this page are those in the Street Views column of Interpolating Street Views.