Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
December 5, 2024
Authors: Yuying Ge, Yizhuo Li, Yixiao Ge, Ying Shan
cs.AI
Abstract
In recent years, there has been a significant surge of interest in unifying
image comprehension and generation within Large Language Models (LLMs). This
growing interest has prompted us to explore extending this unification to
videos. The core challenge lies in developing a versatile video tokenizer that
captures both the spatial characteristics and temporal dynamics of videos to
obtain representations for LLMs, representations that can in turn be decoded
into realistic video clips to enable video generation. In this work, we
introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the
diffusion process for self-supervised video representation learning. We posit
that if a video diffusion model can effectively de-noise video clips by taking
the features of a video tokenizer as the condition, then the tokenizer has
successfully captured robust spatial and temporal information. Additionally,
the video diffusion model inherently functions as a de-tokenizer, decoding
videos from their representations. Building upon the Divot tokenizer, we
present Divot-Vicuna, which performs video-to-text autoregression and
text-to-video generation by modeling the distributions of continuous-valued
Divot features with a Gaussian Mixture Model. Experimental results demonstrate that our
diffusion-based video tokenizer, when integrated with a pre-trained LLM,
achieves competitive performance across various video comprehension and
generation benchmarks. The instruction-tuned Divot-Vicuna also excels in video
storytelling, generating interleaved narratives and corresponding videos.
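To illustrate what modeling continuous-valued features with a Gaussian Mixture Model can look like, here is a minimal sketch in NumPy. It is not the authors' implementation: the function names `gmm_nll` and `gmm_sample`, the diagonal covariance, and the parameterization by mixture logits, means, and log-variances are all assumptions made for illustration. The idea is that a model predicts mixture parameters for each feature vector, is trained by minimizing the negative log-likelihood of the target feature, and samples from the mixture at generation time.

```python
import numpy as np


def gmm_nll(x, logits, means, log_vars):
    """Negative log-likelihood of x under a diagonal-covariance GMM.

    Hypothetical parameterization (an assumption, not from the paper):
      x:        (D,)   target feature vector
      logits:   (K,)   unnormalized mixture weights
      means:    (K, D) component means
      log_vars: (K, D) component log-variances
    """
    # Log-softmax over the mixture logits gives log mixture weights.
    log_pi = logits - np.logaddexp.reduce(logits)
    # Log density of x under each diagonal Gaussian component.
    log_prob = -0.5 * np.sum(
        log_vars + (x - means) ** 2 / np.exp(log_vars) + np.log(2 * np.pi),
        axis=1,
    )
    # Log-sum-exp over components, negated to form a loss.
    return -np.logaddexp.reduce(log_pi + log_prob)


def gmm_sample(rng, logits, means, log_vars):
    """Draw one feature vector: pick a component, then sample from it."""
    pi = np.exp(logits - np.logaddexp.reduce(logits))
    k = rng.choice(len(pi), p=pi)
    noise = rng.standard_normal(means.shape[1])
    return means[k] + np.exp(0.5 * log_vars[k]) * noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, D = 4, 8  # illustrative sizes only
    logits = rng.standard_normal(K)
    means = rng.standard_normal((K, D))
    log_vars = np.zeros((K, D))
    feature = gmm_sample(rng, logits, means, log_vars)
    print("sampled feature shape:", feature.shape)
    print("NLL of the sample:", gmm_nll(feature, logits, means, log_vars))
```

A mixture rather than a single Gaussian lets the predicted distribution be multimodal, which matters when one text prompt is compatible with many different videos. The number of components and the covariance structure used here are placeholders.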