video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
February 17, 2025
Authors: Guangzhi Sun, Yudong Yang, Jimin Zhuang, Changli Tang, Yixuan Li, Wei Li, Zejun MA, Chao Zhang
cs.AI
Abstract
While recent advancements in reasoning optimization have significantly
enhanced the capabilities of large language models (LLMs), existing efforts to
improve reasoning have been limited to solving mathematical problems and
handling visual graphical inputs, neglecting broader applications in general
video understanding. This paper proposes video-SALMONN-o1, the first open-source
reasoning-enhanced audio-visual LLM designed for general video understanding
tasks. To enhance its reasoning abilities, we develop a reasoning-intensive
dataset featuring challenging audio-visual questions with step-by-step
solutions. We also propose process direct preference optimization (pDPO), which
leverages contrastive step selection to achieve efficient step-level reward
modelling tailored for multimodal inputs. Additionally, we introduce RivaBench,
the first reasoning-intensive video understanding benchmark, featuring over
4,000 high-quality, expert-curated question-answer pairs across scenarios such
as standup comedy, academic presentations, and synthetic video detection.
video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision
baseline across different video reasoning benchmarks. In addition, pDPO achieves
6-8% improvements over the supervised fine-tuning model on RivaBench.
Enhanced reasoning also gives video-SALMONN-o1 zero-shot synthetic video
detection capabilities.