SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
April 10, 2025
作者: Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, Lijuan Wang
cs.AI
Abstract
In this paper, we present an effective method to enhance visual reasoning
with significantly fewer training samples, relying purely on self-improvement
with no knowledge distillation. Our key insight is that the difficulty of
training data during reinforcement fine-tuning (RFT) is critical. Appropriately
challenging samples can substantially boost reasoning capabilities even when
the dataset is small. While this insight is intuitive, the main challenge lies
in accurately quantifying sample difficulty to enable effective data filtering. To
this end, we propose a novel way of repurposing Monte Carlo Tree Search (MCTS)
to achieve this. Starting from our curated set of 70k open-source training samples, we
introduce an MCTS-based selection method that quantifies sample difficulty
based on the number of iterations required by the VLMs to solve each problem.
The explicit step-by-step reasoning in MCTS forces the model to think longer
and thus better identifies genuinely challenging samples. We filter and
retain 11k samples to perform RFT on Qwen2.5-VL-7B-Instruct, resulting in our
final model, ThinkLite-VL. Evaluation results on eight benchmarks show that
ThinkLite-VL improves the average performance of Qwen2.5-VL-7B-Instruct by 7%,
using only 11k training samples with no knowledge distillation. This
significantly outperforms all existing 7B-level reasoning VLMs, as well as our
directly comparable baselines that use classic selection methods such as
accuracy-based filtering. Notably, on MathVista, ThinkLite-VL-7B achieves a SoTA accuracy of
75.1, surpassing Qwen2.5-VL-72B, GPT-4o, and O1. Our code, data, and model are
available at https://github.com/si0wang/ThinkLite-VL.
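The selection criterion described above — keep samples whose MCTS iteration count marks them as moderately challenging — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `solve_iterations` stub, the threshold values, and the sample format are all assumptions for the sake of a runnable example.

```python
# Sketch of MCTS-iteration-based difficulty filtering. Thresholds and the
# iteration counts below are illustrative assumptions, not the paper's values.

def filter_by_difficulty(samples, solve_iterations, min_iters=4, max_iters=50):
    """Keep samples whose MCTS iteration count falls in a target band:
    very few iterations means the VLM finds the problem trivial, while
    exhausting the search budget suggests the problem may be unsolvable."""
    kept = []
    for sample in samples:
        iters = solve_iterations(sample)
        if min_iters <= iters <= max_iters:
            kept.append(sample)
    return kept

# Stub "solver": pretend each sample carries a precomputed iteration count
# (in practice this would come from running MCTS with the VLM as the policy).
samples = [{"id": i, "iters": i * 7 % 60} for i in range(10)]
kept = filter_by_difficulty(samples, lambda s: s["iters"])
print([s["id"] for s in kept])  # samples 0, 8, 9 fall outside the band
```

In this sketch the band's lower bound discards easy samples and the upper bound discards samples the model never solves within budget, matching the abstract's notion of retaining "appropriately challenging" data.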