

DRAGON: Distributional Rewards Optimize Diffusion Generative Models

April 21, 2025
Authors: Yatong Bai, Jonah Casebeer, Somayeh Sojoudi, Nicholas J. Bryan
cs.AI

Abstract

We present Distributional RewArds for Generative OptimizatioN (DRAGON), a versatile framework for fine-tuning media generation models towards a desired outcome. Compared with traditional reinforcement learning with human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO), DRAGON is more flexible. It can optimize reward functions that evaluate either individual examples or distributions of them, making it compatible with a broad spectrum of instance-wise, instance-to-distribution, and distribution-to-distribution rewards. Leveraging this versatility, we construct novel reward functions by selecting an encoder and a set of reference examples to create an exemplar distribution. When cross-modality encoders such as CLAP are used, the reference examples may be of a different modality (e.g., text versus audio). Then, DRAGON gathers online and on-policy generations, scores them to construct a positive demonstration set and a negative set, and leverages the contrast between the two sets to maximize the reward. For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 different reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Frechet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. Moreover, reward functions based on exemplar sets indeed enhance generations and are comparable to model-based rewards. With an appropriate exemplar set, DRAGON achieves a 60.95% human-voted music quality win rate without training on human preference annotations. As such, DRAGON exhibits a new approach to designing and optimizing reward functions for improving human-perceived quality. Sound examples at https://ml-dragon.github.io/web.
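The abstract describes two core mechanisms: constructing a reward from an encoder plus a reference exemplar set, and splitting scored on-policy generations into positive and negative demonstration sets whose contrast is used to maximize the reward. The sketch below is a minimal illustration of those ideas, assuming NumPy/SciPy and pre-computed embeddings; it is not the authors' implementation, and the function names (`instance_to_dist_reward`, `dist_to_dist_reward`, `split_demonstrations`) are hypothetical.

```python
# Illustrative sketch (not the DRAGON codebase) of exemplar-based rewards and
# the positive/negative demonstration split described in the abstract.
import numpy as np
from scipy.linalg import sqrtm


def instance_to_dist_reward(gen_emb: np.ndarray, ref_embs: np.ndarray) -> float:
    """Instance-to-distribution reward: cosine similarity between one
    generation embedding (d,) and the mean of the reference embeddings (n, d),
    in the spirit of a CLAP-score-style reward."""
    ref_mean = ref_embs.mean(axis=0)
    denom = np.linalg.norm(gen_emb) * np.linalg.norm(ref_mean) + 1e-8
    return float(gen_emb @ ref_mean / denom)


def dist_to_dist_reward(gen_embs: np.ndarray, ref_embs: np.ndarray) -> float:
    """Distribution-to-distribution reward: negative Frechet distance between
    Gaussian fits of the generated (m, d) and reference (n, d) embedding sets,
    in the spirit of an FAD-style reward (higher is better)."""
    mu_g, mu_r = gen_embs.mean(axis=0), ref_embs.mean(axis=0)
    cov_g = np.cov(gen_embs, rowvar=False)
    cov_r = np.cov(ref_embs, rowvar=False)
    covmean = sqrtm(cov_g @ cov_r)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical imaginary residue
    fad = float(np.sum((mu_g - mu_r) ** 2) + np.trace(cov_g + cov_r - 2.0 * covmean))
    return -fad


def split_demonstrations(generations, rewards, k):
    """Rank on-policy generations by reward; return the top-k as the positive
    demonstration set and the bottom-k as the negative set."""
    order = np.argsort(rewards)
    negatives = [generations[i] for i in order[:k]]
    positives = [generations[i] for i in order[-k:]]
    return positives, negatives
```

In the framework as described, the contrast between the positive and negative sets would then drive a preference-style fine-tuning objective on the diffusion model; the ranking step above only illustrates how the demonstration sets might be formed from scored on-policy samples.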

