Distill Visual Chart Reasoning Ability from LLMs to MLLMs

Abstract

Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs). Recent studies highlight that these abilities consist of two main parts: recognizing key information from visual inputs and reasoning over it. A promising approach to enhancing MLLMs is therefore to construct training data that targets both aspects. However, collecting and annotating complex charts and questions is costly and time-consuming, and ensuring the quality of annotated answers remains a challenge. In this paper, we propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient, and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs. Code serves as an intermediary that translates visual chart representations into textual representations, enabling LLMs to understand cross-modal information. Specifically, we employ text-based synthesizing techniques to construct chart-plotting code and produce ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs, to enhance both recognition and reasoning abilities. Experiments show that models fine-tuned on our data not only perform well on chart-related benchmarks but also demonstrate improved multimodal reasoning abilities on general mathematical benchmarks such as MathVista. The code and dataset are publicly available at https://github.com/hewei2001/ReachQA.
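The code-as-intermediary idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names (make_plot_code, build_qa_prompt, ChartQAExample), the prompt wording, and the sample data are all assumptions made for illustration, and no LLM is actually called. The sketch only shows the data flow: the chart exists as plotting code, a text-only LLM reads that code to write Q&A, and the training example pairs the rendered image with the Q&A.

```python
from dataclasses import dataclass


@dataclass
class ChartQAExample:
    """One synthetic training item (hypothetical structure)."""
    image_path: str  # rendered chart; the visual input for the MLLM
    question: str
    answer: str


def make_plot_code(title, labels, values, image_path):
    """Return chart-plotting code as plain text (the textual intermediary)."""
    return "\n".join([
        "import matplotlib.pyplot as plt",
        f"labels = {labels!r}",
        f"values = {values!r}",
        "plt.bar(labels, values)",
        f"plt.title({title!r})",
        f"plt.savefig({image_path!r})",
    ])


def build_qa_prompt(plot_code):
    """Wrap the plotting code in an instruction for a text-only LLM."""
    return (
        "The following Python script draws a chart.\n"
        "Write one question that requires reasoning over the chart, "
        "then answer it step by step.\n\n"
        + plot_code
    )


# Illustrative data only; running plot_code separately would render chart.png.
plot_code = make_plot_code(
    "Units sold per quarter", ["Q1", "Q2", "Q3", "Q4"], [12, 18, 15, 24], "chart.png"
)
prompt = build_qa_prompt(plot_code)

# Placeholder Q&A standing in for the LLM's reply to `prompt`.
example = ChartQAExample(
    image_path="chart.png",
    question="By how much do Q4 sales exceed the average of Q1 to Q3?",
    answer="The Q1 to Q3 average is (12 + 18 + 15) / 3 = 15, so Q4 exceeds it by 24 - 15 = 9.",
)
```

Because the LLM sees the exact data values in the code rather than pixels, it can write reasoning-heavy questions and answers without any visual perception; the MLLM is later trained on the image and the Q&A only.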
