Harnessing Webpage UIs for Text-Rich Visual Understanding

Abstract

Text-rich visual understanding, the ability to process environments where dense textual content is integrated with visuals, is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Although they lack direct visual input, text-based LLMs can process structured text representations derived from webpage accessibility trees. The synthesized instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on MultiUI excel in web UI tasks, achieving up to a 48% improvement on VisualWebBench and a 19.1% boost in action accuracy on the web agent dataset Mind2Web. They also generalize surprisingly well to non-web UI tasks and even to non-UI domains such as document understanding, OCR, and chart interpretation. These results highlight the broad applicability of web UI data for advancing text-rich visual understanding across various scenarios.

