

Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images

April 11, 2025
作者: Boyang Deng, Songyou Peng, Kyle Genova, Gordon Wetzstein, Noah Snavely, Leonidas Guibas, Thomas Funkhouser
cs.AI

Abstract

We present a system using Multimodal LLMs (MLLMs) to analyze a large database with tens of millions of images captured at different times, with the aim of discovering patterns in temporal changes. Specifically, we aim to capture frequent co-occurring changes ("trends") across a city over a certain period. Unlike previous visual analyses, our analysis answers open-ended queries (e.g., "what are the frequent types of changes in the city?") without any predetermined target subjects or training labels. These properties render prior learning-based or unsupervised visual analysis tools unsuitable. We identify MLLMs as a novel tool for their open-ended semantic understanding capabilities. Yet, our datasets are four orders of magnitude too large for an MLLM to ingest as context. We therefore introduce a bottom-up procedure that decomposes the massive visual analysis problem into more tractable sub-problems, and we carefully design MLLM-based solutions to each sub-problem. In experiments and ablation studies with our system, we find it significantly outperforms baselines and is able to discover interesting trends from images captured in large cities (e.g., "addition of outdoor dining", "overpass was painted blue", etc.). See more results and interactive demos at https://boyangdeng.com/visual-chronicles.
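To make the bottom-up decomposition concrete, here is a minimal Python sketch of the general "detect local changes, then aggregate into trends" pattern the abstract describes. It is not the paper's implementation: the `query_mllm` callable, the prompt wording, the `min_support` threshold, and the exact-string aggregation are all assumptions made for illustration (a real system would cluster semantically similar change descriptions, e.g. with text embeddings, rather than counting identical phrases).

```python
# Sketch of a bottom-up pipeline: an MLLM describes per-location changes from
# before/after image pairs, then frequent co-occurring changes are surfaced as
# candidate city-wide "trends". All names here are illustrative assumptions.

from collections import Counter
from typing import Callable, Iterable, List, Tuple


def detect_local_changes(
    image_pairs: Iterable[Tuple[bytes, bytes]],
    query_mllm: Callable[[bytes, bytes, str], str],
) -> List[str]:
    """Ask the MLLM to describe the change between two captures of the same place.

    `query_mllm` is a user-supplied wrapper around whatever multimodal LLM
    endpoint is available; it takes two images and a text prompt.
    """
    prompt = "Describe, in one short phrase, what changed between these two photos."
    changes = []
    for before, after in image_pairs:
        description = query_mllm(before, after, prompt).strip().lower()
        if description and description != "no visible change":
            changes.append(description)
    return changes


def aggregate_trends(changes: List[str], min_support: int = 50) -> List[Tuple[str, int]]:
    """Keep change descriptions that recur often enough to count as a trend.

    Exact string counting is only a stand-in for semantic clustering.
    """
    counts = Counter(changes)
    return [(phrase, n) for phrase, n in counts.most_common() if n >= min_support]


# Usage, given a list of (before, after) image pairs and a query_mllm callable:
# trends = aggregate_trends(detect_local_changes(pairs, query_mllm))
# for phrase, n in trends[:10]:
#     print(f"{n:6d}  {phrase}")
```

The point of the decomposition is that each MLLM call only ever sees a handful of images, so a dataset far too large to fit in one model context can still be analyzed end to end.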
