VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
April 17, 2025
作者: Haojian Huang, Haodong Chen, Shengqiong Wu, Meng Luo, Jinlan Fu, Xinya Du, Hanwang Zhang, Hao Fei
cs.AI
Abstract
Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown
promise in video understanding but often suffer from misalignment with human
intuition and video hallucination issues. To address these challenges, we
introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal
Direct Preference Optimization. VistaDPO enhances text-video preference
alignment across three hierarchical levels: i) Instance Level, aligning overall
video content with responses; ii) Temporal Level, aligning video temporal
semantics with event descriptions; and iii) Perceptive Level, aligning spatial
objects with language tokens. Given the lack of datasets for fine-grained
video-language preference alignment, we construct VistaDPO-7k, a dataset of
7.2K QA pairs annotated with chosen and rejected responses, along with
spatial-temporal grounding information such as timestamps, keyframes, and
bounding boxes. Extensive experiments on benchmarks covering video
hallucination, video QA, and captioning tasks demonstrate that VistaDPO
significantly improves the performance of existing LVMs, effectively
mitigating video-language misalignment and hallucination. The code and data are
available at https://github.com/HaroldChen19/VistaDPO.
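All three hierarchical levels described above build on direct preference optimization. As background, here is a minimal sketch of the standard DPO objective over one chosen/rejected response pair; the function name and scalar inputs are illustrative, and the hierarchical spatial-temporal weighting specific to VistaDPO is not reproduced here.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair (illustrative sketch).

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model.
    """
    # Implicit rewards: policy log-ratio relative to the reference model
    chosen_reward = logp_chosen - ref_logp_chosen
    rejected_reward = logp_rejected - ref_logp_rejected
    # Preference margin scaled by the temperature beta
    margin = beta * (chosen_reward - rejected_reward)
    # Negative log-sigmoid of the margin: small when the policy
    # prefers the chosen response more strongly than the reference does
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss is log 2; it decreases as the policy assigns relatively more probability to the chosen response than the reference model does.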