

The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

March 6, 2025
作者: Aoxiong Yin, Kai Shen, Yichong Leng, Xu Tan, Xinyu Zhou, Juncheng Li, Siliang Tang
cs.AI

Abstract

Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a ∼14,000× compression ratio; (2) a language model that generates semantic tokens with high-level semantic relationships; (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source model Hunyuan Video (13B) as well as commercial models such as Sora, Keling, and Hailuo. Our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this area. Our demo can be viewed at https://landiff.github.io/.
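To make the coarse-to-fine flow concrete, below is a minimal sketch of the three-stage pipeline the abstract describes: discretize visual features into a compact 1D token sequence, generate semantic tokens autoregressively from text, then refine them into frames with a streaming (chunk-by-chunk) diffusion stage. Every name, shape, and method here (SemanticTokenizer, ToyLM.generate, ToyStreamingDiffusion.refine, the codebook and chunk sizes) is a hypothetical stand-in for illustration, not LanDiff's actual architecture or API.

```python
# Minimal, self-contained sketch of a coarse-to-fine T2V pipeline,
# based only on the abstract. All modules below are toy stand-ins.
import torch
import torch.nn as nn

class SemanticTokenizer(nn.Module):
    """Compress 3D visual features into a 1D discrete token sequence
    via nearest-neighbor codebook lookup (a VQ-style stand-in; the
    paper reports a ~14,000x compression ratio for its tokenizer)."""
    def __init__(self, codebook_size: int = 8192, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, feats_3d: torch.Tensor) -> torch.Tensor:
        # feats_3d: (T, H, W, C) video features -> (N,) token ids.
        flat = feats_3d.reshape(-1, feats_3d.shape[-1])
        return torch.cdist(flat, self.codebook.weight).argmin(dim=-1)

class ToyLM(nn.Module):
    """Stage 1 (coarse): an autoregressive LM emits semantic tokens
    capturing high-level, causally ordered content for the prompt."""
    def __init__(self, vocab: int = 8192):
        super().__init__()
        self.vocab = vocab

    @torch.no_grad()
    def generate(self, prompt: str, n_tokens: int = 16) -> torch.Tensor:
        # Placeholder sampling; a real model conditions on `prompt`.
        return torch.randint(self.vocab, (n_tokens,))

class ToyStreamingDiffusion(nn.Module):
    """Stage 2 (fine): refine coarse semantic tokens into high-fidelity
    frames, processing the token stream chunk by chunk."""
    @torch.no_grad()
    def refine(self, tokens: torch.Tensor, chunk: int = 4,
               hw: tuple = (32, 32)) -> torch.Tensor:
        frames = []
        for i in range(0, len(tokens), chunk):
            # A real model would run iterative denoising conditioned on
            # tokens[i:i+chunk]; we emit random frames as a stand-in.
            frames.append(torch.randn(chunk, 3, *hw))
        return torch.cat(frames)

# End-to-end flow: text -> semantic tokens -> video frames.
lm, diff = ToyLM(), ToyStreamingDiffusion()
tokens = lm.generate("a cat surfing at sunset")
video = diff.refine(tokens)
print(tokens.shape, video.shape)  # torch.Size([16]) torch.Size([16, 3, 32, 32])
```

The chunked refinement loop is what makes the diffusion stage "streaming": frames can be synthesized as semantic tokens arrive rather than after the full sequence is generated, which is consistent with the abstract's emphasis on long video generation.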

