ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

Abstract

Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multi-hop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual editing on the input image through code, shifting and refining their visual focus. Specifically, ReFocus enables multimodal LLMs to generate Python code that calls tools and modifies the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving tables and charts. ReFocus substantially improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits and of the reasons why ReFocus can improve performance without introducing additional information. Further, we collect a 14k training set using ReFocus and show that such a visual chain of thought with intermediate information offers better supervision than standard VQA data, reaching an 8.0% average gain over the same model trained with QA pairs and 2.6% over CoT.
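
To make the three edit types named above (drawing boxes, highlighting sections, masking out areas) concrete, the following is a minimal, dependency-free sketch. It is not the ReFocus toolset: the function names (`draw_box`, `highlight`, `mask_out`, `blank_image`), the nested-list image representation, and the box convention are all assumptions chosen for illustration. A practical implementation would more likely rely on an imaging library.

```python
from typing import List, Tuple

# Hypothetical types for illustration only (not from the paper).
Pixel = Tuple[int, int, int]       # (R, G, B), each in 0..255
Image = List[List[Pixel]]          # indexed as image[y][x]
Box = Tuple[int, int, int, int]    # (left, top, right, bottom); right/bottom exclusive


def blank_image(width: int, height: int, color: Pixel = (255, 255, 255)) -> Image:
    """Create a width x height image filled with a single color."""
    return [[color for _ in range(width)] for _ in range(height)]


def _clip(image: Image, box: Box) -> Box:
    """Clamp a box to the image bounds."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = box
    return max(0, left), max(0, top), min(width, right), min(height, bottom)


def draw_box(image: Image, box: Box, color: Pixel = (255, 0, 0)) -> Image:
    """Return a copy with a 1-pixel outline around the (clipped) box."""
    out = [row[:] for row in image]
    left, top, right, bottom = _clip(image, box)
    if left >= right or top >= bottom:
        return out  # empty box: nothing to draw
    for x in range(left, right):
        out[top][x] = color
        out[bottom - 1][x] = color
    for y in range(top, bottom):
        out[y][left] = color
        out[y][right - 1] = color
    return out


def highlight(image: Image, box: Box, color: Pixel = (255, 255, 0),
              alpha: int = 102) -> Image:
    """Return a copy with the box tinted; alpha is an integer in 0..255."""
    out = [row[:] for row in image]
    left, top, right, bottom = _clip(image, box)
    for y in range(top, bottom):
        for x in range(left, right):
            r, g, b = (
                (c * (255 - alpha) + h * alpha) // 255
                for c, h in zip(out[y][x], color)
            )
            out[y][x] = (r, g, b)
    return out


def mask_out(image: Image, box: Box, fill: Pixel = (255, 255, 255)) -> Image:
    """Return a copy with the box overwritten by a solid fill color."""
    out = [row[:] for row in image]
    left, top, right, bottom = _clip(image, box)
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = fill
    return out


# Sequential edits on a toy 8x6 "table": hide an irrelevant column,
# tint the row of interest, then box the target cell.
table = blank_image(8, 6, color=(200, 200, 200))
step1 = mask_out(table, (0, 0, 2, 6))
step2 = highlight(step1, (2, 2, 8, 4))
step3 = draw_box(step2, (4, 2, 6, 4))
```

Each function returns a new image rather than editing in place, so the intermediate images (`step1`, `step2`, `step3`) remain available as a sequence of edited views, which loosely mirrors the idea of a chain of visual edits described in the abstract.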
