Free^2Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models
November 26, 2024
Authors: Jaemin Kim, Bryan S Kim, Jong Chul Ye
cs.AI
Abstract
Diffusion models have achieved impressive results in generative tasks like
text-to-image (T2I) and text-to-video (T2V) synthesis. However, achieving
accurate text alignment in T2V generation remains challenging due to the
complex temporal dependency across frames. Existing reinforcement learning
(RL)-based approaches to enhance text alignment often require differentiable
reward functions or are constrained to limited prompts, hindering their
scalability and applicability. In this paper, we propose Free^2Guide, a novel
gradient-free framework for aligning generated videos with text prompts without
requiring additional model training. Leveraging principles from path integral
control, Free^2Guide approximates guidance for diffusion models using
non-differentiable reward functions, thereby enabling the integration of
powerful black-box Large Vision-Language Models (LVLMs) as reward models.
Additionally, our framework supports the flexible ensembling of multiple reward
models, including large-scale image-based models, to synergistically enhance
alignment without incurring substantial computational overhead. We demonstrate
that Free^2Guide significantly improves text alignment across various
dimensions and enhances the overall quality of generated videos.
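
For intuition, below is a minimal, hypothetical sketch of path-integral-style gradient-free guidance as described in the abstract: at a denoising step, several candidate latents are sampled, each is scored by a non-differentiable (black-box) reward, and one candidate is resampled in proportion to exp(reward / temperature). All names here (`denoise_step`, `decode_preview`, `reward_fn`, `num_candidates`, `temperature`) are illustrative assumptions, not the paper's actual API or hyperparameters.

```python
# Hypothetical sketch of gradient-free, path-integral-style guidance.
# The callables passed in (denoise_step, decode_preview, reward_fn) are
# placeholders standing in for a diffusion sampler step, a latent-to-frame
# decoder, and a black-box reward model (e.g. an LVLM scorer).
import numpy as np

def guided_denoise_step(x_t, t, denoise_step, decode_preview, reward_fn,
                        num_candidates=4, temperature=0.1, rng=None):
    """Draw several candidate next latents, score each with a
    non-differentiable reward, and resample one candidate with
    probability proportional to exp(reward / temperature)."""
    rng = rng or np.random.default_rng()

    # Sample multiple stochastic continuations of the current latent.
    candidates = [denoise_step(x_t, t, rng) for _ in range(num_candidates)]

    # Score each candidate with a black-box reward; no gradients are needed.
    rewards = np.array([reward_fn(decode_preview(x)) for x in candidates])

    # Softmax over rewards (max-subtracted for numerical stability).
    weights = np.exp((rewards - rewards.max()) / temperature)
    weights /= weights.sum()

    # Resample one candidate according to the reward-derived weights.
    idx = rng.choice(num_candidates, p=weights)
    return candidates[idx]
```

In the same spirit, the flexible ensembling mentioned in the abstract could, under these assumptions, be realized by having `reward_fn` sum or average the scores of several reward models (e.g., an LVLM plus a large-scale image-based model) before the exponential weighting, again without requiring any of them to be differentiable.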