MotiF: Making Text Count in Image Animation with Motion Focal Loss

December 20, 2024
Authors: Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, Xi Yin
cs.AI

Abstract

Text-Image-to-Video (TI2V) generation aims to generate a video from an image following a text description, which is also referred to as text-guided image animation. Most existing methods struggle to generate videos that align well with the text prompts, particularly when motion is specified. To overcome this limitation, we introduce MotiF, a simple yet effective approach that directs the model's learning to the regions with more motion, thereby improving text alignment and motion generation. We use optical flow to generate a motion heatmap and weight the loss according to the intensity of the motion. This modified objective leads to noticeable improvements and complements existing methods that utilize motion priors as model inputs. Additionally, due to the lack of a diverse benchmark for evaluating TI2V generation, we propose TI2V Bench, a dataset consisting of 320 image-text pairs for robust evaluation. We present a human evaluation protocol that asks annotators to select an overall preference between two videos and then justify their choice. Through a comprehensive evaluation on TI2V Bench, MotiF outperforms nine open-source models, achieving an average preference of 72%. TI2V Bench is released at https://wang-sj16.github.io/motif/.
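To make the core idea concrete, the sketch below weights a per-pixel denoising loss by a motion heatmap derived from optical flow magnitude, as the abstract describes. This is a minimal illustrative sketch, not the paper's implementation: the tensor shapes, the per-frame normalization, the `1 + alpha * heatmap` weighting, and the `alpha` hyperparameter are all assumptions.

```python
import torch

def motion_heatmap(flow: torch.Tensor) -> torch.Tensor:
    # flow: (B, T, 2, H, W) optical flow between consecutive frames.
    mag = torch.linalg.vector_norm(flow, dim=2)           # (B, T, H, W)
    # Normalize per frame so the heatmap lies in [0, 1].
    return mag / (mag.amax(dim=(-2, -1), keepdim=True) + 1e-6)

def motion_focal_loss(pred_noise, true_noise, flow, alpha=1.0):
    # Standard per-pixel MSE between predicted and target noise.
    per_pixel = (pred_noise - true_noise) ** 2            # (B, T, C, H, W)
    # Up-weight pixels with more motion; `alpha` is an assumed
    # hyperparameter controlling how strongly motion regions dominate.
    weight = 1.0 + alpha * motion_heatmap(flow).unsqueeze(2)
    return (weight * per_pixel).mean()
```

With `alpha = 0` this reduces to the usual unweighted objective, so the weighting can be ablated by a single hyperparameter.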
