
Loong: Generating Minute-level Long Videos with Autoregressive Language Models

October 3, 2024
Authors: Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, Xihui Liu
cs.AI

Abstract

Generating content-rich long videos at the scale of minutes is desirable but challenging. Autoregressive large language models (LLMs) have achieved great success in generating coherent, long token sequences in natural language processing, whereas the exploration of autoregressive LLMs for video generation has been limited to short videos of several seconds. In this work, we conduct a deep analysis of the challenges that prevent autoregressive LLM-based video generators from generating long videos. Based on these observations and analysis, we propose Loong, a new autoregressive LLM-based video generator that can generate minute-long videos. Specifically, we model text tokens and video tokens as a unified sequence for autoregressive LLMs and train the model from scratch. We propose progressive short-to-long training with a loss re-weighting scheme to mitigate the loss imbalance problem in long video training. We further investigate inference strategies, including video token re-encoding and sampling strategies, to diminish error accumulation during inference. Loong can be trained on 10-second videos and extended to generate minute-level long videos conditioned on text prompts, as demonstrated by the results. More samples are available at: https://epiphqny.github.io/Loong-video.
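
The abstract names two ideas that are easier to picture with a toy example: a unified sequence of text and video tokens, and a re-weighted training loss. The sketch below is illustrative only and is not the authors' implementation. The function names, the vocabulary-offset convention, the choice to skip text positions in the loss, and the weight values (including which frames receive extra weight) are all assumptions made for the example; the abstract does not specify them.

```python
def build_unified_sequence(text_tokens, video_frames, text_vocab_size):
    """Concatenate text tokens and per-frame video tokens into one sequence.

    Video token ids are shifted by text_vocab_size so that both modalities
    share one vocabulary (an assumed convention, not taken from the paper).
    Returns the sequence and, for each position, its frame index (-1 = text).
    """
    sequence = list(text_tokens)
    frame_index = [-1] * len(text_tokens)
    for f, frame in enumerate(video_frames):
        for tok in frame:
            sequence.append(tok + text_vocab_size)
            frame_index.append(f)
    return sequence, frame_index


def reweighted_loss(token_losses, frame_index, early_frames=1, early_weight=2.0):
    """Weighted mean of per-token losses over video positions.

    Tokens in the first `early_frames` frames get `early_weight`; every other
    video token gets weight 1.0. Both values are hypothetical placeholders.
    """
    total, norm = 0.0, 0.0
    for loss, f in zip(token_losses, frame_index):
        if f < 0:
            continue  # assumption: text positions act as conditioning only
        w = early_weight if f < early_frames else 1.0
        total += w * loss
        norm += w
    return total / norm


# Toy usage: 2 text tokens, 2 frames of 2 video tokens each.
seq, idx = build_unified_sequence([5, 7], [[0, 1], [2, 3]], text_vocab_size=100)
loss = reweighted_loss([9.0, 9.0, 4.0, 4.0, 1.0, 1.0], idx)
```

In this toy setup, a plain mean over the four video tokens would be 2.5, while the re-weighted mean is 3.0, so the frames given extra weight contribute more to the training signal than they would under uniform averaging.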
