Visual Chronicles:
Using Multimodal LLMs to Analyze Massive Collections of Images

Boyang Deng1,✧ Songyou Peng2,✩ Kyle Genova2,✩
Gordon Wetzstein1 Noah Snavely2 Leonidas Guibas1,2 Thomas Funkhouser2
1 Stanford University        2 Google DeepMind
✧Part of this work done as a student researcher at Google DeepMind
✩Equal contributions. Order decided by a random number generator.

Summary

TL;DR: Visual Chronicles is the first use of MLLMs to analyze massive collections of images to answer open-ended queries such as “what are the trending changes in a city?”.
We build a system that breaks the massive-scale analysis down into two stages: local analysis and global aggregation. We design effective and scalable solutions for each stage using MLLMs. The results of the analysis are trends described in text, along with visual evidence of the trending changes.
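The two-stage split can be illustrated with a toy sketch. This is not the authors' implementation: the `Observation` type, the `local_changes` and `global_trends` functions, and the use of text captions in place of images are all hypothetical stand-ins. In particular, the string comparison in `local_changes` and the exact-match counting in `global_trends` only mark where the system would use MLLMs.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    location_id: str
    year: int
    caption: str  # hypothetical stand-in for an actual image


def local_changes(observations):
    """Stage 1 (local analysis): describe changes at ONE location over time.

    Placeholder logic: a real system would ask an MLLM to compare images;
    here we only compare captions of consecutive observations.
    """
    ordered = sorted(observations, key=lambda o: o.year)
    changes = []
    for before, after in zip(ordered, ordered[1:]):
        if before.caption != after.caption:
            changes.append(f"{before.caption} -> {after.caption}")
    return changes


def global_trends(changes_by_location, min_count=2):
    """Stage 2 (global aggregation): find changes seen at many locations.

    Placeholder logic: exact string matching instead of MLLM-based grouping.
    Each location contributes at most once per distinct change.
    """
    counts = Counter()
    for changes in changes_by_location.values():
        counts.update(set(changes))
    return [(c, n) for c, n in counts.most_common() if n >= min_count]


# Toy data: two locations gain a red crosswalk, one gains a fence.
data = {
    "loc_a": [Observation("loc_a", 2018, "plain crosswalk"),
              Observation("loc_a", 2021, "red crosswalk")],
    "loc_b": [Observation("loc_b", 2017, "plain crosswalk"),
              Observation("loc_b", 2022, "red crosswalk")],
    "loc_c": [Observation("loc_c", 2018, "empty lot"),
              Observation("loc_c", 2021, "fenced lot")],
}
changes = {loc: local_changes(obs) for loc, obs in data.items()}
trends = global_trends(changes)
```

With this toy data, only the crosswalk change passes the `min_count` threshold, mirroring how a trend is reported together with the number of times it was seen.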

Method Overview.

Interactive Demo

The crosswalk had its marking changed to red. (seen 519 times)
Red Patches Added to Crosswalks | Juice Shops Opened | Solar Panels Added to Rooftops

Select a trend above to view it. For each trend, click a dark-colored icon on the map to view a specific before/after image pair (click the image for a larger view). Light-colored icons show change locations only. Switch to full screen (top-right button) to spot small-scale changes. Changes are sub-sampled for clearer visualization.

Trending Changes

San Francisco

A green bike lane was added to the street in front of a building. (seen 754 times)
Bike Lane | Outdoor Dining | Bus Lane | Solar Panel | Bike Rack

New York City

The parking lot in front of the building now has a fence enclosing it. (seen 509 times)
Fenced Parking | New Café | Zebra Crossing | Security Camera | Wooden Overpass Plank

Conditional Search for Trends

Temporal Condition

Temporal-conditioned Search.

We can search for trending changes that happened within a specific temporal window, e.g., 2020-2022.
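As a rough illustration of a temporal condition, the sketch below keeps only changes whose before and after timestamps both fall inside the window, then counts them. The `Change` record, the `trends_in_window` function, and the choice to require both timestamps inside the window are assumptions made for this example, not details taken from the system.

```python
from collections import Counter
from typing import NamedTuple


class Change(NamedTuple):
    description: str
    year_before: int  # year of the "before" image (hypothetical field)
    year_after: int   # year of the "after" image (hypothetical field)


def trends_in_window(changes, start_year, end_year):
    """Count change descriptions whose image pair lies inside the window."""
    in_window = [
        c for c in changes
        if start_year <= c.year_before and c.year_after <= end_year
    ]
    return Counter(c.description for c in in_window).most_common()


detected = [
    Change("outdoor dining added", 2020, 2021),
    Change("outdoor dining added", 2020, 2022),
    Change("bike lane added", 2016, 2018),
]
recent = trends_in_window(detected, 2020, 2022)
```

Here `recent` contains only the outdoor dining change, since the bike lane change falls outside 2020-2022.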

Semantic Condition

Semantic-conditioned Search.

We can also search for trending changes relevant to a specific semantic concept, e.g., retail stores.
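A semantic condition can be pictured in a similar way. The sketch below uses plain keyword matching as a crude stand-in for judging relevance to a concept; the `trends_for_concept` function and the keyword list are hypothetical, and a system built on MLLMs would not need to rely on literal substring matches.

```python
from collections import Counter


def trends_for_concept(change_descriptions, concept_terms):
    """Count only the change descriptions that mention a concept term."""
    terms = [t.lower() for t in concept_terms]
    relevant = [
        d for d in change_descriptions
        if any(t in d.lower() for t in terms)
    ]
    return Counter(relevant).most_common()


descriptions = [
    "A juice shop opened.",
    "A juice shop opened.",
    "A new café opened.",
    "A bike rack was added.",
]
retail = trends_for_concept(descriptions, ["shop", "café", "store"])
```

In this toy run the bike rack change is filtered out, and the remaining retail-related changes are ranked by how often they occur.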

Socioeconomic Connections

The support of the overpass was painted blue. (seen 481 times) [News Source]
Blue Overpass in SF | Red Crosswalk Patch in NYC

We can connect discoveries in Visual Chronicles (left) to socioeconomic events or policies (right).

Citation

@misc{deng2025visualchronicles,
  title={Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images},
  author={Boyang Deng and Songyou Peng and Kyle Genova and Gordon Wetzstein and Noah Snavely and Leonidas Guibas and Thomas Funkhouser},
  year={2025},
  eprint={2504.08727},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.08727},
}

Acknowledgements: Thanks to Jiahui Lei, Anh Thai, Jiapeng Tang, Linyi Jin, Luming Tang, Rundi Wu, Ian Huang, Colton Stearns, Francis Engelman, Manu Gopakumar, Suyeon Choi, Haley So, Richard Tucker, Abhijit Kundu, Jonathan Barron, Glenn Entis, and David Salesin for their comments and constructive discussions, and to Abhijit Kundu, William Freeman, and John Quintero for helping review our draft. G.W. was in part supported by Google, Samsung, and Stanford HAI. B.D. was in part supported by a Qualcomm Innovation Fellowship. This project page is adapted from the Streetscapes project page designed by Richard Tucker.

Disclaimers: Google Maps Street View images used with permission from Google.