Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
April 16, 2025
Authors: Pritam Sarkar, Ali Etemad
cs.AI
Abstract
Despite recent advances in Large Video Language Models (LVLMs), they still
struggle with fine-grained temporal understanding, hallucinate, and often make
basic mistakes on even simple video question-answering tasks, all of which
pose significant challenges to their safe and reliable deployment in real-world
applications. To address these limitations, we propose a self-alignment
framework that enables LVLMs to learn from their own errors. Our proposed
framework first obtains a training set of preferred and non-preferred response
pairs, where non-preferred responses are generated by incorporating common
error patterns that often occur due to inadequate spatio-temporal
understanding, spurious correlations between co-occurring concepts, and
over-reliance on linguistic cues while neglecting the vision modality, among
others. To facilitate self-alignment of LVLMs with the constructed preferred
and non-preferred response pairs, we introduce Refined Regularized Preference
Optimization (RRPO), a novel preference optimization method that utilizes
sub-sequence-level refined rewards and token-wise KL regularization to address
the limitations of Direct Preference Optimization (DPO). We demonstrate that
RRPO achieves more precise alignment and more stable training compared to DPO.
Our experiments and analysis validate the effectiveness of our approach across
diverse video tasks, including video hallucination, short- and long-video
understanding, and fine-grained temporal reasoning.
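The contrast with DPO that the abstract draws can be made concrete: DPO optimizes a single sequence-level implicit-reward margin between the preferred and non-preferred response, while RRPO additionally uses finer-grained rewards and a token-wise KL penalty toward the reference model. A minimal illustrative sketch of such a loss, in Python; this is not the paper's exact formulation, and the hyperparameters `beta`, `alpha` and the per-token log-ratio approximation of the KL term are assumptions for illustration:

```python
import math

def logsigmoid(x):
    # numerically stable log(sigmoid(x))
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_margin(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    # DPO-style implicit reward margin between the preferred (w)
    # and non-preferred (l) response, from sequence log-probabilities
    return beta * ((policy_w - ref_w) - (policy_l - ref_l))

def rrpo_style_loss(policy_w, policy_l, ref_w, ref_l,
                    policy_tok, ref_tok, beta=0.1, alpha=0.01):
    """Preference loss = -log sigmoid(margin) + alpha * token-wise KL,
    with the KL approximated by the mean per-token log-ratio between
    policy and reference on the preferred response (an assumption)."""
    pref = -logsigmoid(dpo_margin(policy_w, policy_l, ref_w, ref_l, beta))
    kl = sum(p - r for p, r in zip(policy_tok, ref_tok)) / len(policy_tok)
    return pref + alpha * kl
```

The token-wise regularizer penalizes per-token drift from the reference model rather than only the sequence total, which is one way a finer-grained KL term can stabilize training relative to plain DPO.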