Free^2Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models

November 26, 2024
Authors: Jaemin Kim, Bryan S Kim, Jong Chul Ye
cs.AI

Abstract

Diffusion models have achieved impressive results in generative tasks like text-to-image (T2I) and text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging due to the complex temporal dependencies across frames. Existing reinforcement learning (RL)-based approaches to enhance text alignment often require differentiable reward functions or are constrained to limited prompts, hindering their scalability and applicability. In this paper, we propose Free^2Guide, a novel gradient-free framework for aligning generated videos with text prompts without requiring additional model training. Leveraging principles from path integral control, Free^2Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward models. Additionally, our framework supports the flexible ensembling of multiple reward models, including large-scale image-based models, to synergistically enhance alignment without incurring substantial computational overhead. We demonstrate that Free^2Guide significantly improves text alignment across various dimensions and enhances the overall quality of generated videos.
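
The abstract does not spell out the algorithm, so the following is only a toy sketch of the general idea of gradient-free, reward-weighted guidance, not the authors' method. It assumes a hypothetical `black_box_reward` (a rounded negative squared distance standing in for an LVLM scorer), a 3-dimensional vector in place of video latents, and made-up values for `n_candidates`, `noise_scale`, and `temperature`. At each step, several noisy candidates are drawn, scored by calling the reward only as a function (no gradients), and combined with exponential weights, which is the flavor of update that path integral control suggests.

```python
import numpy as np


def black_box_reward(x, target):
    # Hypothetical stand-in for a non-differentiable scorer (e.g. an LVLM).
    # Rounding makes it piecewise constant; it is only called, never differentiated.
    return -float(np.round(np.sum((x - target) ** 2), 3))


def guided_step(x, target, rng, n_candidates=8, noise_scale=0.5, temperature=0.1):
    # Draw candidate next states by perturbing the current state (assumed toy dynamics).
    noise = rng.standard_normal((n_candidates,) + x.shape)
    candidates = x + noise_scale * noise
    # Score every candidate with the black-box reward.
    rewards = np.array([black_box_reward(c, target) for c in candidates])
    # Exponential (softmax) weights; subtracting the max keeps exp() stable.
    weights = np.exp((rewards - rewards.max()) / temperature)
    weights /= weights.sum()
    # The reward-weighted average of the candidates is the guided update.
    return np.tensordot(weights, candidates, axes=1), weights


def run(steps=50, seed=0):
    rng = np.random.default_rng(seed)
    target = np.array([1.0, -2.0, 0.5])  # assumed "well-aligned" point
    x = rng.standard_normal(3)           # random starting state
    start = black_box_reward(x, target)
    for t in range(steps):
        # Anneal the perturbation size, loosely mimicking a noise schedule.
        scale = 0.5 * (1 - t / steps) + 0.01
        x, _ = guided_step(x, target, rng, noise_scale=scale)
    return start, black_box_reward(x, target)


if __name__ == "__main__":
    start, end = run()
    print("reward before:", start, "reward after:", end)
```

Several reward models could be ensembled in this sketch by summing or averaging their scores inside `black_box_reward`; how the paper actually combines them is not stated in the abstract.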
