

Improve Vision Language Model Chain-of-thought Reasoning

October 21, 2024
Authors: Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, Yiming Yang
cs.AI

Abstract

Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes lack robust CoT reasoning data, relying on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLMs on short answers does not generalize well to reasoning tasks that require more detailed responses. To address this, we propose a two-fold approach. First, we distill rationales from the GPT-4o model to enrich the training data and fine-tune VLMs, boosting their CoT performance. Second, we apply reinforcement learning to further calibrate reasoning quality. Specifically, we construct positive (correct) and negative (incorrect) pairs of model-generated reasoning chains by comparing their predictions with annotated short answers. Using this pairwise data, we apply the Direct Preference Optimization algorithm to refine the model's reasoning abilities. Our experiments demonstrate significant improvements in CoT reasoning on benchmark datasets and better generalization to direct answer prediction as well. This work emphasizes the importance of incorporating detailed rationales in training and leveraging reinforcement learning to strengthen the reasoning capabilities of VLMs.
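
The pairwise preference step described above lends itself to a short illustration. The sketch below is not the authors' released code; it assumes hypothetical helpers (`sample_fn` to draw a reasoning chain from the fine-tuned VLM, `extract_answer_fn` to pull the final short answer out of a chain) and shows one way such positive/negative pairs could be assembled and scored with the standard DPO objective.

```python
import torch.nn.functional as F

def build_preference_pairs(questions, short_answers, sample_fn, extract_answer_fn, n_samples=8):
    """Sample several reasoning chains per question and pair a correct chain
    (chosen) with an incorrect one (rejected), using only the annotated short
    answer as the label. `sample_fn` and `extract_answer_fn` are hypothetical
    placeholders, not part of the paper's released code."""
    pairs = []
    for question, gold in zip(questions, short_answers):
        chains = [sample_fn(question) for _ in range(n_samples)]
        correct = [c for c in chains if extract_answer_fn(c) == gold]
        wrong = [c for c in chains if extract_answer_fn(c) != gold]
        if correct and wrong:
            pairs.append({"prompt": question, "chosen": correct[0], "rejected": wrong[0]})
    return pairs

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective: increase the policy's preference for the chosen
    chain over the rejected one, relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()
```

Pairing a correct and an incorrect chain sampled for the same question is what lets the existing short-answer annotations supervise long-form reasoning without additional human labels.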
