InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

December 12, 2024
Authors: Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng Yang, Chaoyou Fu, Xiang Li, Jian Yang, Ying Tai
cs.AI

Abstract

Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient detail, hallucinations, and imprecise motion depiction, affecting the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level and fine-grained video captioning for the first time. Based on this scheme, we design an auxiliary model cluster that converts the original video into instances to enhance instance fidelity. The video instances are further used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, we curate a 22K-video InstanceVid dataset for training and propose an enhancement pipeline tailored to the InstanceCap structure for inference. Experimental results demonstrate that our proposed InstanceCap significantly outperforms previous models, ensuring high fidelity between captions and videos while reducing hallucinations.
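To make the idea of an instance-aware structured caption concrete, here is a minimal sketch of what such a schema might look like in Python. The class and field names (InstanceDescription, StructuredCaption, appearance, motion, camera, etc.) are illustrative assumptions, not the paper's actual format; the sketch only shows how per-instance appearance and motion fields plus global scene attributes could be flattened into a concise prompt.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InstanceDescription:
    """One instance in the video (hypothetical per-instance record)."""
    category: str    # object class, e.g. "dog"
    appearance: str  # fine-grained appearance phrase
    motion: str      # precise motion phrase, e.g. "runs left to right"


@dataclass
class StructuredCaption:
    """Hypothetical instance-aware structured caption for a video clip."""
    instances: List[InstanceDescription] = field(default_factory=list)
    background: str = ""  # global scene description
    camera: str = ""      # camera movement, e.g. "slow pan right"

    def to_prompt(self) -> str:
        """Flatten the structured fields into one concise text prompt."""
        parts = [f"{i.appearance} {i.category} {i.motion}" for i in self.instances]
        if self.background:
            parts.append(f"background: {self.background}")
        if self.camera:
            parts.append(f"camera: {self.camera}")
        return "; ".join(parts)


# Example usage with made-up content:
cap = StructuredCaption(
    instances=[
        InstanceDescription(
            category="dog",
            appearance="a golden retriever with a red collar",
            motion="runs from left to right across the lawn",
        )
    ],
    background="a sunlit garden",
    camera="static shot",
)
print(cap.to_prompt())
```

Structuring captions this way keeps each instance's appearance and motion separate from global attributes, which is consistent with the paper's stated goal of concise yet precise, instance-level descriptions.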
