ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

January 9, 2025
Authors: Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, Cha Zhang
cs.AI

Abstract

Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multi-hop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual editing on the input image through code, shifting and refining their visual focus. Specifically, ReFocus enables multimodal LLMs to generate Python code to call tools that modify the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving tables and charts. ReFocus substantially improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits and of why ReFocus can improve performance without introducing additional information. Further, we collect a 14k-example training set using ReFocus and show that such visual chain-of-thought data with intermediate information offers better supervision than standard VQA data, reaching an 8.0% average gain over the same model trained with QA pairs and a 2.6% gain over CoT.
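The abstract describes ReFocus emitting Python code that calls tools to draw boxes, highlight sections, and mask out areas of the input image. Below is a minimal sketch of what such tools could look like, assuming Pillow-based helpers; the function names (draw_box, highlight_section, mask_out) and their signatures are illustrative assumptions, not the paper's actual API.

```python
# A minimal sketch of code-based visual editing tools, assuming Pillow.
# These names and signatures are illustrative, not the paper's actual API.
from PIL import Image, ImageDraw

Box = tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def draw_box(img: Image.Image, box: Box, color: str = "red", width: int = 3) -> Image.Image:
    """Outline a region to direct attention to it."""
    out = img.copy()
    ImageDraw.Draw(out).rectangle(box, outline=color, width=width)
    return out


def highlight_section(img: Image.Image, box: Box, tint=(255, 255, 0, 80)) -> Image.Image:
    """Overlay a translucent tint on a region, e.g. a table column."""
    out = img.convert("RGBA")
    overlay = Image.new("RGBA", out.size, (0, 0, 0, 0))
    ImageDraw.Draw(overlay).rectangle(box, fill=tint)
    return Image.alpha_composite(out, overlay)


def mask_out(img: Image.Image, box: Box, fill: str = "white") -> Image.Image:
    """Blank out a distracting region so the model can ignore it."""
    out = img.copy()
    ImageDraw.Draw(out).rectangle(box, fill=fill)
    return out


# A generated "visual thought" might chain such edits (coordinates are made up):
# img = Image.open("table.png")
# img = mask_out(img, (400, 0, 800, 600))           # hide irrelevant columns
# img = highlight_section(img, (0, 120, 400, 160))  # emphasize the target row
```

Chaining edits like this mirrors the sequential refocusing the abstract describes: each edited image becomes the visual state for the next reasoning step.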
