Harnessing Webpage UIs for Text-Rich Visual Understanding
October 17, 2024
Authors: Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
cs.AI
Abstract
Text-rich visual understanding-the ability to process environments where
dense textual content is integrated with visuals-is crucial for multimodal
large language models (MLLMs) to interact effectively with structured
environments. To enhance this capability, we propose synthesizing general
multimodal instructions from webpage UIs using text-based large language models
(LLMs). Despite lacking direct visual input, text-based LLMs are able to
process structured text representations from webpage accessibility trees. These
instructions are then paired with UI screenshots to train multimodal models. We
introduce MultiUI, a dataset containing 7.3 million samples from 1 million
websites, covering diverse multimodal tasks and UI layouts. Models trained on
MultiUI not only excel in web UI tasks-achieving up to a 48% improvement on
VisualWebBench and a 19.1% boost in action accuracy on the web agent dataset
Mind2Web-but also generalize surprisingly well to non-web UI tasks and even to
non-UI domains, such as document understanding, OCR, and chart interpretation.
These results highlight the broad applicability of web UI data for advancing
text-rich visual understanding across various scenarios.
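The abstract describes a pipeline in which a text-only LLM reads a webpage's accessibility tree to synthesize instructions, which are then paired with a screenshot of the same page as multimodal training data. The snippet below is a minimal sketch of that idea under stated assumptions, not the MultiUI release code: the Playwright-based tree extraction, the prompt wording, the gpt-4o-mini model name, and the synthesize_sample helper are all illustrative choices that may differ from the paper's actual pipeline.

```python
# Sketch of accessibility-tree-based instruction synthesis (illustrative only).
import json
from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()  # assumes OPENAI_API_KEY is set; any text-based LLM could be used


def synthesize_sample(url: str, out_image: str) -> dict:
    """Return one screenshot + synthesized instruction pair for a webpage."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Structured text view of the UI: one way to obtain an accessibility tree.
        ax_tree = page.accessibility.snapshot()
        # Visual view of the same UI: a full-page screenshot.
        page.screenshot(path=out_image, full_page=True)
        browser.close()

    # Ask a text-based LLM (no visual input) to write a question/answer pair
    # grounded only in the accessibility tree. Prompt wording is an assumption.
    prompt = (
        "You are given the accessibility tree of a webpage. Write one question "
        "a user might ask about this page when looking at its screenshot, and "
        "the correct answer, as JSON with keys 'question' and 'answer'.\n\n"
        + json.dumps(ax_tree)[:20000]  # truncate very large trees
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the paper's choice of LLM may differ
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    qa = json.loads(response.choices[0].message.content)

    # The multimodal training sample: UI screenshot + synthesized instruction.
    return {"image": out_image, "question": qa["question"], "answer": qa["answer"]}


if __name__ == "__main__":
    sample = synthesize_sample("https://example.com", "example.png")
    print(sample)
```

Pairing the LLM's text-only output with the screenshot afterwards is the key design point the abstract emphasizes: the instruction generator never sees pixels, yet the resulting samples train a model that does.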