TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

Abstract

We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms such as the verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. We open-source all code and models to support further research in TTA generation.
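The abstract does not spell out how CRPO builds its preference pairs. As a rough illustration only, the sketch below assumes that several candidate audios are generated per prompt, each is scored by its similarity to the prompt in a shared text-audio embedding space, and the highest- and lowest-scoring candidates are taken as the preferred and rejected samples. The function names (`cosine`, `build_preference_pair`) and the toy embeddings are hypothetical stand-ins for a real CLAP encoder, not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_preference_pair(
    text_emb: Sequence[float],
    candidate_embs: List[Sequence[float]],
) -> Tuple[int, int, List[float]]:
    """Rank candidates by similarity to the prompt embedding.

    Returns (winner_index, loser_index, scores). The embeddings are assumed
    to come from a CLAP-style encoder; here they are plain lists of floats.
    """
    if len(candidate_embs) < 2:
        raise ValueError("need at least two candidates to form a pair")
    scores = [cosine(text_emb, emb) for emb in candidate_embs]
    winner = max(range(len(scores)), key=scores.__getitem__)
    loser = min(range(len(scores)), key=scores.__getitem__)
    return winner, loser, scores


# Toy example: three candidate "audio" embeddings for one prompt.
prompt_embedding = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
winner, loser, scores = build_preference_pair(prompt_embedding, candidates)
```

In an iterative setup, pairs built this way would feed a preference-optimization step, and the updated model would then generate the candidates for the next round.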
