Free^2Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models

November 26, 2024
Authors: Jaemin Kim, Bryan S Kim, Jong Chul Ye
cs.AI

Abstract

Diffusion models have achieved impressive results in generative tasks such as text-to-image (T2I) and text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging because of the complex temporal dependencies across frames. Existing reinforcement learning (RL)-based approaches to enhancing text alignment often require differentiable reward functions or are constrained to a limited set of prompts, which hinders their scalability and applicability. In this paper, we propose Free^2Guide, a novel gradient-free framework for aligning generated videos with text prompts without requiring additional model training. Leveraging principles from path integral control, Free^2Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward models. Additionally, our framework supports the flexible ensembling of multiple reward models, including large-scale image-based models, to synergistically enhance alignment without incurring substantial computational overhead. We demonstrate that Free^2Guide significantly improves text alignment across various dimensions and enhances the overall quality of generated videos.
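
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea behind gradient-free, path-integral-style guidance, not the authors' implementation. It assumes that several candidate next states are sampled at a denoising step, each is scored by one or more black-box reward functions (no gradients needed), and one candidate is resampled with probability proportional to exp(reward / temperature). All names here (`path_integral_weights`, `ensemble_reward`, `guided_step`, `propose`, `coeffs`) are hypothetical, and the "state" is a plain float standing in for a video latent.

```python
import math
import random


def path_integral_weights(rewards, temperature=1.0):
    """Return weights proportional to exp(reward / temperature), summing to 1."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_reward(candidate, reward_fns, coeffs):
    """Combine several black-box rewards as a weighted sum (an assumed rule)."""
    return sum(c * fn(candidate) for fn, c in zip(reward_fns, coeffs))


def guided_step(state, propose, reward_fns, coeffs,
                num_candidates=8, temperature=0.1, rng=None):
    """One gradient-free guidance step on a toy state.

    propose(state, rng) stands in for a stochastic denoising step; the reward
    functions are only ever called, never differentiated.
    """
    rng = rng or random.Random(0)
    candidates = [propose(state, rng) for _ in range(num_candidates)]
    rewards = [ensemble_reward(c, reward_fns, coeffs) for c in candidates]
    weights = path_integral_weights(rewards, temperature)
    # Resample one candidate in proportion to its exponentiated reward.
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Toy usage: two hypothetical rewards that both prefer states near 3.0.
    propose = lambda s, rng: s + rng.gauss(0.0, 1.0)
    near_three = lambda x: -abs(x - 3.0)
    squared = lambda x: -(x - 3.0) ** 2
    state = 0.0
    for _ in range(20):
        state = guided_step(state, propose, [near_three, squared], [0.5, 0.5])
    print(state)
```

A lower `temperature` concentrates the weights on the highest-reward candidate, while a higher one stays closer to unguided sampling. The weighted sum in `ensemble_reward` is just one simple way to combine reward models and is an assumption of this sketch.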

November 29, 2024