VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

April 17, 2025
作者: Haojian Huang, Haodong Chen, Shengqiong Wu, Meng Luo, Jinlan Fu, Xinya Du, Hanwang Zhang, Hao Fei
cs.AI

Abstract

Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks covering video hallucination, video question answering, and captioning demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at https://github.com/HaroldChen19/VistaDPO.
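
The abstract does not spell out how the three levels are combined, so the following is only a rough illustration: it applies the standard DPO loss (Rafailov et al., 2023) independently at each level and sums the terms with assumed scalar weights. The function names, the weighting scheme, and the assumption that sequence-level log-probabilities for chosen/rejected responses are precomputed under the policy and a frozen reference model are all hypothetical, not the authors' implementation.

```python
# Minimal sketch of a hierarchical DPO objective in the spirit of VistaDPO.
# Everything here is illustrative; only the inner dpo_term is the standard
# DPO loss, and the per-level combination is an assumption.
import torch
import torch.nn.functional as F

def dpo_term(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss on summed sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin)  # shape: (batch,)

def vistadpo_loss(levels, weights=(1.0, 1.0, 1.0), beta=0.1):
    """Hypothetical combination: one DPO term per hierarchy level
    (instance, temporal, perceptive), weighted and averaged over the batch.
    `levels` maps level name -> (logp_c, logp_r, ref_logp_c, ref_logp_r)."""
    total = 0.0
    for w, (_name, logps) in zip(weights, levels.items()):
        total = total + w * dpo_term(*logps, beta=beta)
    return total.mean()

if __name__ == "__main__":
    # Toy usage with random stand-ins for log-probs, batch of 4 pairs.
    torch.manual_seed(0)
    fake = lambda: tuple(torch.randn(4) for _ in range(4))
    levels = {"instance": fake(), "temporal": fake(), "perceptive": fake()}
    print(vistadpo_loss(levels).item())
```

In a real setup, the instance-level term would score the full response against the whole video, while the temporal and perceptive terms would restrict the log-probabilities to grounded segments and object regions respectively; how VistaDPO computes those restricted likelihoods is not described in this abstract.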

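The abstract describes each VistaDPO-7k entry as a QA pair carrying chosen and rejected responses plus spatial-temporal grounding. A plausible record layout, with entirely hypothetical field names and values, might look like:

```python
# Hypothetical shape of a single VistaDPO-7k record. The abstract only states
# that entries include chosen/rejected responses, timestamps, keyframes, and
# bounding boxes; the field names and formats below are assumptions.
example_record = {
    "video": "example_clip.mp4",                   # placeholder path
    "question": "What does the person pick up?",
    "chosen": "The person picks up a red mug.",
    "rejected": "The person picks up a book.",     # hallucinated answer
    "grounding": {
        "timestamps": [3.2, 5.8],                  # event start/end, seconds
        "keyframes": [96, 174],                    # frame indices
        "bounding_boxes": [[0.41, 0.22, 0.58, 0.47]],  # normalized xyxy
    },
}
```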