Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1
March 31, 2025
Authors: Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, Xihui Liu
cs.AI
Abstract
Recent advancements in Chain of Thought (COT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks requiring both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions, requiring sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy: in-distribution, cross-environment, and cross-environment-task scenarios, equipped with a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks, even outperforming SFT on general video understanding benchmarks like LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.
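The abstract emphasizes that the training data comes with easily verifiable ground-truth answers, which is what makes rule-based rewards practical for RL post-training on multiple-choice video QA. As a minimal illustrative sketch (not the paper's actual implementation), such a reward can be computed by exact-matching the option letter extracted from the model's response; the `<answer>` tag convention and the function name below are assumptions introduced here for illustration.

```python
import re


def multiple_choice_reward(response: str, ground_truth: str) -> float:
    """Exact-match reward for multiple-choice answers (illustrative sketch).

    Assumes the model is prompted to wrap its final choice in
    <answer>...</answer> tags; this tag format is an assumption,
    not something stated in the abstract.
    """
    match = re.search(r"<answer>\s*([A-D])\s*</answer>", response)
    if match is None:
        return 0.0  # unparseable responses earn no reward
    return 1.0 if match.group(1) == ground_truth.strip().upper() else 0.0


# Example: a correctly tagged, correct choice scores 1.0; anything else 0.0.
print(multiple_choice_reward("<think>...</think> <answer>B</answer>", "B"))  # 1.0
print(multiple_choice_reward("The answer is B.", "B"))                       # 0.0
```

Because such a reward needs no learned reward model, a policy-gradient method (e.g., a PPO- or GRPO-style algorithm) can be trained directly against verified answers, which is consistent with the data efficiency the abstract reports for RL over SFT.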