TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

December 30, 2024
Authors: Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Rafael Valle, Bryan Catanzaro, Soujanya Poria
cs.AI

Abstract

We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. We open source all code and models to support further research in TTA generation.
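The abstract does not spell out how CRPO turns generated audio into preference pairs, so the following is only a minimal sketch of the general idea of ranking candidates by a text-audio similarity score. It assumes that several candidate clips are generated per prompt, that each clip and the prompt are already embedded in a shared space, and that the highest- and lowest-scoring candidates form the (preferred, rejected) pair. The function names `cosine` and `build_preference_pair` are hypothetical, and plain cosine similarity on toy vectors stands in for a real CLAP model.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_preference_pair(
    prompt_emb: Vector, candidate_embs: List[Vector]
) -> Tuple[int, int, List[float]]:
    """Rank candidates by similarity to the prompt embedding.

    Returns (winner_index, loser_index, scores). Choosing the top- and
    bottom-ranked candidates is an assumption made for this sketch.
    """
    if len(candidate_embs) < 2:
        raise ValueError("need at least two candidates to form a pair")
    scores = [cosine(prompt_emb, c) for c in candidate_embs]
    indices = range(len(scores))
    winner = max(indices, key=scores.__getitem__)
    loser = min(indices, key=scores.__getitem__)
    return winner, loser, scores


# Toy embeddings standing in for a prompt and three generated clips.
prompt = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
winner, loser, scores = build_preference_pair(prompt, candidates)
print(winner, loser)
```

In an iterative setup, pairs built this way from the current model's own samples would feed the next round of preference optimization, after which fresh candidates are generated and ranked again.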
