Distill Visual Chart Reasoning Ability from LLMs to MLLMs
October 24, 2024
作者: Wei He, Zhiheng Xi, Wanxu Zhao, Xiaoran Fan, Yiwen Ding, Zifei Shan, Tao Gui, Qi Zhang, Xuanjing Huang
cs.AI
Abstract
Solving complex chart Q&A tasks requires advanced visual reasoning abilities
in multimodal large language models (MLLMs). Recent studies highlight that
these abilities consist of two main parts: recognizing key information from
visual inputs and conducting reasoning over it. Thus, a promising approach to
enhance MLLMs is to construct relevant training data focusing on the two
aspects. However, collecting and annotating complex charts and questions is
costly and time-consuming, and ensuring the quality of annotated answers
remains a challenge. In this paper, we propose Code-as-Intermediary Translation
(CIT), a cost-effective, efficient and easily scalable data synthesis method
for distilling visual reasoning abilities from LLMs to MLLMs. The code serves
as an intermediary that translates visual chart representations into textual
representations, enabling LLMs to understand cross-modal information.
Specifically, we employ text-based synthesizing techniques to construct
chart-plotting code and produce ReachQA, a dataset containing 3k
reasoning-intensive charts and 20k Q&A pairs to enhance both recognition and
reasoning abilities. Experiments show that when fine-tuned with our data,
models not only perform well on chart-related benchmarks, but also demonstrate
improved multimodal reasoning abilities on general mathematical benchmarks like
MathVista. The code and dataset are publicly available at
https://github.com/hewei2001/ReachQA.
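To make the Code-as-Intermediary Translation idea concrete, the sketch below shows how chart-plotting code can serve as the textual intermediary: the same code both renders the chart image (the MLLM's visual input) and exposes the underlying data so a text-only model can derive a reasoning-intensive Q&A pair without seeing the image. This is a minimal illustration under assumed names, data, and question template; it is not the authors' actual ReachQA synthesis pipeline.

```python
# Minimal CIT-style sketch: plotting code as the bridge between the rendered
# chart (visual modality) and the data a text-only LLM can reason over.
# All values, filenames, and the question below are illustrative assumptions.
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt

# 1) Chart-plotting code: in the paper's setting this text would be
#    synthesized by an LLM; here it is written out directly.
years = [2019, 2020, 2021, 2022, 2023]
revenue = [1.2, 1.8, 2.5, 3.9, 5.1]   # arbitrary example values (billions USD)
costs = [0.9, 1.1, 1.6, 2.2, 2.8]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(years, revenue, marker="o", label="Revenue")
ax.plot(years, costs, marker="s", label="Costs")
ax.set_xlabel("Year")
ax.set_ylabel("Billions (USD)")
ax.set_title("Revenue vs. Costs")
ax.legend()
fig.savefig("chart.png", dpi=150)     # the visual half of the training pair

# 2) Q&A synthesis: because the data lives in the code, the answer can be
#    computed (or generated by a text-only LLM) without the rendered image.
profit = [r - c for r, c in zip(revenue, costs)]
best_year = years[max(range(len(profit)), key=profit.__getitem__)]
qa_pair = {
    "image": "chart.png",
    "question": "In which year shown is the gap between revenue and costs largest?",
    "answer": f"{best_year}, with a gap of {max(profit):.1f} billion USD.",
}
print(qa_pair)
```

In the abstract's framing, such (image, question, answer) triples would then be used to fine-tune an MLLM on both recognition and reasoning.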