video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model

Abstract

Recent advances in reasoning optimization have significantly enhanced the capabilities of large language models (LLMs), yet existing efforts to improve reasoning have been limited to solving mathematical problems and to visual graphical inputs, neglecting broader applications in general video understanding. This paper proposes video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. To enhance its reasoning abilities, we develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also propose process direct preference optimization (pDPO), which leverages contrastive step selection to achieve efficient step-level reward modelling tailored for multimodal inputs. Additionally, we introduce RivaBench, the first reasoning-intensive video understanding benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs across scenarios such as standup comedy, academic presentations, and synthetic video detection. video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision baseline across different video reasoning benchmarks. In addition, pDPO achieves 6-8% improvements over the supervised fine-tuning model on RivaBench. Enhanced reasoning gives video-SALMONN-o1 zero-shot synthetic video detection capabilities.

