ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
January 9, 2025
Authors: Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, Cha Zhang
cs.AI
Abstract
Structured image understanding, such as interpreting tables and charts,
requires strategically refocusing across various structures and texts within an
image, forming a reasoning sequence to arrive at the final answer. However,
current multimodal large language models (LLMs) lack this multihop selective
attention capability. In this work, we introduce ReFocus, a simple yet
effective framework that equips multimodal LLMs with the ability to generate
"visual thoughts" by performing visual editing on the input image through code,
shifting and refining their visual focuses. Specifically, ReFocus enables
multimodal LLMs to generate Python code to call tools and modify the input
image, sequentially drawing boxes, highlighting sections, and masking out
areas, thereby enhancing the visual reasoning process. We experiment on a
wide range of structured image understanding tasks involving tables and charts.
ReFocus substantially improves performance on all tasks over GPT-4o without visual
editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart
tasks. We present an in-depth analysis of the effects of different visual
edits, and reasons why ReFocus can improve the performance without introducing
additional information. Further, we collect a 14k training set using ReFocus,
and show that such visual chain-of-thought data with intermediate information
offers better supervision than standard VQA data, reaching an 8.0% average
gain over the same model trained with QA pairs and 2.6% over CoT.
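
The abstract names three kinds of visual edits: drawing boxes, highlighting sections, and masking out areas. The sketch below illustrates what such edit tools might look like on a toy image stored as nested lists of RGB tuples. It is not the authors' implementation: the function names (`draw_box`, `highlight`, `mask_out`), the box convention, and the colors are all assumptions made for illustration, and a real pipeline would more likely operate on actual image files through an imaging library.

```python
from typing import List, Tuple

# Hypothetical types for this sketch (not taken from the paper).
Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def blank_image(width: int, height: int, color: Pixel = (255, 255, 255)) -> Image:
    """Create a width x height image filled with a single color."""
    return [[color for _ in range(width)] for _ in range(height)]


def draw_box(img: Image, box: Box, color: Pixel = (255, 0, 0)) -> Image:
    """Return a copy of img with a 1-pixel outline drawn around box."""
    left, top, right, bottom = box
    out = [row[:] for row in img]
    for x in range(left, right):
        out[top][x] = color
        out[bottom - 1][x] = color
    for y in range(top, bottom):
        out[y][left] = color
        out[y][right - 1] = color
    return out


def highlight(img: Image, box: Box, tint: Pixel = (255, 255, 0),
              alpha: float = 0.5) -> Image:
    """Return a copy of img with box blended toward tint by factor alpha."""
    left, top, right, bottom = box
    out = [row[:] for row in img]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = tuple(
                round((1 - alpha) * c + alpha * t)
                for c, t in zip(out[y][x], tint)
            )
    return out


def mask_out(img: Image, box: Box, fill: Pixel = (255, 255, 255)) -> Image:
    """Return a copy of img with box overwritten by a solid fill color."""
    left, top, right, bottom = box
    out = [row[:] for row in img]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = fill
    return out


# Sequential edits, in the spirit of a multi-step "visual thought":
# first hide an irrelevant region, then outline the region of interest.
table = blank_image(8, 6, color=(100, 100, 100))
step1 = mask_out(table, (5, 0, 8, 6))   # mask the right-hand columns
step2 = draw_box(step1, (0, 1, 5, 4))   # box the rows of interest
```

Each function returns a new image rather than mutating its input, so every intermediate edit can be kept and shown to the model as a separate step in the reasoning sequence.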