Free^2Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models
November 26, 2024
Authors: Jaemin Kim, Bryan S Kim, Jong Chul Ye
cs.AI
Abstract
Diffusion models have achieved impressive results in generative tasks like
text-to-image (T2I) and text-to-video (T2V) synthesis. However, achieving
accurate text alignment in T2V generation remains challenging due to the
complex temporal dependency across frames. Existing reinforcement learning
(RL)-based approaches to enhance text alignment often require differentiable
reward functions or are constrained to limited prompts, hindering their
scalability and applicability. In this paper, we propose Free^2Guide, a novel
gradient-free framework for aligning generated videos with text prompts without
requiring additional model training. Leveraging principles from path integral
control, Free^2Guide approximates guidance for diffusion models using
non-differentiable reward functions, thereby enabling the integration of
powerful black-box Large Vision-Language Models (LVLMs) as reward models.
Additionally, our framework supports the flexible ensembling of multiple reward
models, including large-scale image-based models, to synergistically enhance
alignment without incurring substantial computational overhead. We demonstrate
that Free^2Guide significantly improves text alignment across various
dimensions and enhances the overall quality of generated videos.
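
For intuition, below is a minimal, hypothetical sketch of path-integral-style gradient-free guidance as described in the abstract: at a denoising step, several candidate latents are sampled, each is scored by a non-differentiable (black-box) reward, and one candidate is resampled in proportion to exp(reward / temperature). All names here (`denoise_step`, `decode_preview`, `reward_fn`, `num_candidates`, `temperature`) are illustrative assumptions, not the paper's actual API or hyperparameters.

```python
# Hypothetical sketch of gradient-free, path-integral-style guidance.
# The callables passed in (denoise_step, decode_preview, reward_fn) are
# placeholders standing in for a diffusion sampler step, a latent-to-frame
# decoder, and a black-box reward model (e.g. an LVLM scorer).
import numpy as np

def guided_denoise_step(x_t, t, denoise_step, decode_preview, reward_fn,
                        num_candidates=4, temperature=0.1, rng=None):
    """Draw several candidate next latents, score each with a
    non-differentiable reward, and resample one candidate with
    probability proportional to exp(reward / temperature)."""
    rng = rng or np.random.default_rng()

    # Sample multiple stochastic continuations of the current latent.
    candidates = [denoise_step(x_t, t, rng) for _ in range(num_candidates)]

    # Score each candidate with a black-box reward; no gradients are needed.
    rewards = np.array([reward_fn(decode_preview(x)) for x in candidates])

    # Softmax over rewards (max-subtracted for numerical stability).
    weights = np.exp((rewards - rewards.max()) / temperature)
    weights /= weights.sum()

    # Resample one candidate according to the reward-derived weights.
    idx = rng.choice(num_candidates, p=weights)
    return candidates[idx]
```

In the same spirit, the flexible ensembling mentioned in the abstract could, under these assumptions, be realized by having `reward_fn` sum or average the scores of several reward models (e.g., an LVLM plus a large-scale image-based model) before the exponential weighting, again without requiring any of them to be differentiable.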