Harnessing Webpage UIs for Text-Rich Visual Understanding
October 17, 2024
Authors: Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
cs.AI
Abstract
Text-rich visual understanding-the ability to process environments where
dense textual content is integrated with visuals-is crucial for multimodal
large language models (MLLMs) to interact effectively with structured
environments. To enhance this capability, we propose synthesizing general
multimodal instructions from webpage UIs using text-based large language models
(LLMs). Despite lacking direct visual input, text-based LLMs are able to
process structured text representations from webpage accessibility trees. These
instructions are then paired with UI screenshots to train multimodal models. We
introduce MultiUI, a dataset containing 7.3 million samples from 1 million
websites, covering diverse multimodal tasks and UI layouts. Models trained on
MultiUI not only excel in web UI tasks-achieving up to a 48% improvement on
VisualWebBench and a 19.1% boost in action accuracy on the web agent dataset
Mind2Web-but also generalize surprisingly well to non-web UI tasks and even to
non-UI domains, such as document understanding, OCR, and chart interpretation.
These results highlight the broad applicability of web UI data for advancing
text-rich visual understanding across various scenarios.
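The abstract describes a pipeline in which a text-only LLM reads a webpage's accessibility tree to synthesize instructions, which are then paired with a screenshot of the same page as multimodal training data. The snippet below is a minimal sketch of that idea under stated assumptions, not the MultiUI release code: the Playwright-based tree extraction, the prompt wording, the gpt-4o-mini model name, and the synthesize_sample helper are all illustrative choices that may differ from the paper's actual pipeline.

```python
# Sketch of accessibility-tree-based instruction synthesis (illustrative only).
import json
from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()  # assumes OPENAI_API_KEY is set; any text-based LLM could be used


def synthesize_sample(url: str, out_image: str) -> dict:
    """Return one screenshot + synthesized instruction pair for a webpage."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Structured text view of the UI: one way to obtain an accessibility tree.
        ax_tree = page.accessibility.snapshot()
        # Visual view of the same UI: a full-page screenshot.
        page.screenshot(path=out_image, full_page=True)
        browser.close()

    # Ask a text-based LLM (no visual input) to write a question/answer pair
    # grounded only in the accessibility tree. Prompt wording is an assumption.
    prompt = (
        "You are given the accessibility tree of a webpage. Write one question "
        "a user might ask about this page when looking at its screenshot, and "
        "the correct answer, as JSON with keys 'question' and 'answer'.\n\n"
        + json.dumps(ax_tree)[:20000]  # truncate very large trees
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the paper's choice of LLM may differ
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    qa = json.loads(response.choices[0].message.content)

    # The multimodal training sample: UI screenshot + synthesized instruction.
    return {"image": out_image, "question": qa["question"], "answer": qa["answer"]}


if __name__ == "__main__":
    sample = synthesize_sample("https://example.com", "example.png")
    print(sample)
```

Pairing the LLM's text-only output with the screenshot afterwards is the key design point the abstract emphasizes: the instruction generator never sees pixels, yet the resulting samples train a model that does.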