Distill Visual Chart Reasoning Ability from LLMs to MLLMs
October 24, 2024
作者: Wei He, Zhiheng Xi, Wanxu Zhao, Xiaoran Fan, Yiwen Ding, Zifei Shan, Tao Gui, Qi Zhang, Xuanjing Huang
cs.AI
Abstract
Solving complex chart Q&A tasks requires advanced visual reasoning abilities
in multimodal large language models (MLLMs). Recent studies highlight that
these abilities consist of two main parts: recognizing key information from
visual inputs and conducting reasoning over it. Thus, a promising approach to
enhance MLLMs is to construct relevant training data focusing on the two
aspects. However, collecting and annotating complex charts and questions is
costly and time-consuming, and ensuring the quality of annotated answers
remains a challenge. In this paper, we propose Code-as-Intermediary Translation
(CIT), a cost-effective, efficient and easily scalable data synthesis method
for distilling visual reasoning abilities from LLMs to MLLMs. The code serves
as an intermediary that translates visual chart representations into textual
representations, enabling LLMs to understand cross-modal information.
Specifically, we employ text-based synthesizing techniques to construct
chart-plotting code and produce ReachQA, a dataset containing 3k
reasoning-intensive charts and 20k Q&A pairs to enhance both recognition and
reasoning abilities. Experiments show that when fine-tuned with our data,
models not only perform well on chart-related benchmarks, but also demonstrate
improved multimodal reasoning abilities on general mathematical benchmarks like
MathVista. The code and dataset are publicly available at
https://github.com/hewei2001/ReachQA.
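To make the Code-as-Intermediary Translation idea concrete, the sketch below shows how chart-plotting code can serve as the textual intermediary: the same code both renders the chart image (the MLLM's visual input) and exposes the underlying data so a text-only model can derive a reasoning-intensive Q&A pair without seeing the image. This is a minimal illustration under assumed names, data, and question template; it is not the authors' actual ReachQA synthesis pipeline.

```python
# Minimal CIT-style sketch: plotting code as the bridge between the rendered
# chart (visual modality) and the data a text-only LLM can reason over.
# All values, filenames, and the question below are illustrative assumptions.
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt

# 1) Chart-plotting code: in the paper's setting this text would be
#    synthesized by an LLM; here it is written out directly.
years = [2019, 2020, 2021, 2022, 2023]
revenue = [1.2, 1.8, 2.5, 3.9, 5.1]   # arbitrary example values (billions USD)
costs = [0.9, 1.1, 1.6, 2.2, 2.8]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(years, revenue, marker="o", label="Revenue")
ax.plot(years, costs, marker="s", label="Costs")
ax.set_xlabel("Year")
ax.set_ylabel("Billions (USD)")
ax.set_title("Revenue vs. Costs")
ax.legend()
fig.savefig("chart.png", dpi=150)     # the visual half of the training pair

# 2) Q&A synthesis: because the data lives in the code, the answer can be
#    computed (or generated by a text-only LLM) without the rendered image.
profit = [r - c for r, c in zip(revenue, costs)]
best_year = years[max(range(len(profit)), key=profit.__getitem__)]
qa_pair = {
    "image": "chart.png",
    "question": "In which year shown is the gap between revenue and costs largest?",
    "answer": f"{best_year}, with a gap of {max(profit):.1f} billion USD.",
}
print(qa_pair)
```

In the abstract's framing, such (image, question, answer) triples would then be used to fine-tune an MLLM on both recognition and reasoning.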