SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
April 10, 2025
作者: Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, Lijuan Wang
cs.AI
Abstract
In this paper, we present an effective method to enhance visual reasoning
with significantly fewer training samples, relying purely on self-improvement
with no knowledge distillation. Our key insight is that the difficulty of
training data during reinforcement fine-tuning (RFT) is critical. Appropriately
challenging samples can substantially boost reasoning capabilities even when
the dataset is small. Despite being intuitive, the main challenge remains in
accurately quantifying sample difficulty to enable effective data filtering. To
this end, we propose a novel way of repurposing Monte Carlo Tree Search (MCTS)
to achieve that. Starting from our curated 70k open-source training samples, we
introduce an MCTS-based selection method that quantifies sample difficulty
based on the number of iterations required by the VLMs to solve each problem.
The explicit step-by-step reasoning in MCTS forces the model to think longer, which in turn better identifies samples that are genuinely challenging. We filter and
retain 11k samples to perform RFT on Qwen2.5-VL-7B-Instruct, resulting in our
final model, ThinkLite-VL. Evaluation results on eight benchmarks show that
ThinkLite-VL improves the average performance of Qwen2.5-VL-7B-Instruct by 7%,
using only 11k training samples with no knowledge distillation. This
significantly outperforms all existing 7B-level reasoning VLMs, as well as our directly comparable baselines built with classic selection methods such as accuracy-based filtering. Notably, on MathVista, ThinkLite-VL-7B achieves a SoTA accuracy of
75.1, surpassing Qwen2.5-VL-72B, GPT-4o, and O1. Our code, data, and model are
available at https://github.com/si0wang/ThinkLite-VL.
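The iteration-based difficulty filtering described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `try_solve` interface, the `Sample` type, the iteration budget, and the thresholds are all hypothetical stand-ins for the actual MCTS expansions and rollouts performed with the VLM.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    question: str
    answer: str


def mcts_difficulty(sample, try_solve, max_iters=50):
    """Return the number of search iterations needed to reach the
    reference answer, or None if the budget is exhausted.
    `try_solve(question, i)` is a hypothetical stand-in for the i-th
    MCTS expansion/rollout of the VLM's step-by-step reasoning."""
    for i in range(1, max_iters + 1):
        if try_solve(sample.question, i) == sample.answer:
            return i
    return None  # unsolved within the budget: hardest bucket


def select_hard_samples(samples, try_solve, min_iters=5, max_iters=50):
    """Keep only samples that need many iterations, or that remain
    unsolved -- the 'appropriately challenging' subset retained for RFT."""
    kept = []
    for s in samples:
        d = mcts_difficulty(s, try_solve, max_iters)
        if d is None or d >= min_iters:
            kept.append(s)
    return kept
```

Under this sketch, easy problems (solved in the first few iterations) are discarded, while problems that demand many iterations, or that never succeed within the budget, are retained for reinforcement fine-tuning.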