DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion
March 3, 2025
Authors: Ziqian Ning, Huakang Chen, Yuepeng Jiang, Chunbo Hao, Guobin Ma, Shuai Wang, Jixun Yao, Lei Xie
cs.AI
Abstract
Recent advancements in music generation have garnered significant attention,
yet existing approaches face critical limitations. Some current generative
models can only synthesize either the vocal track or the accompaniment track.
While some models can generate combined vocal and accompaniment, they typically
rely on meticulously designed multi-stage cascading architectures and intricate
data pipelines, hindering scalability. Additionally, most systems are
restricted to generating short musical segments rather than full-length songs.
Furthermore, widely used language model-based methods suffer from slow
inference speeds. To address these challenges, we propose DiffRhythm, the first
latent diffusion-based song generation model capable of synthesizing complete
songs of up to 4m45s, with both vocals and accompaniment, in only ten seconds
while maintaining high musicality and intelligibility. Despite its
remarkable capabilities, DiffRhythm is designed to be simple and elegant: it
eliminates the need for complex data preparation, employs a straightforward
model structure, and requires only lyrics and a style prompt during inference.
Additionally, its non-autoregressive structure ensures fast inference speeds.
This simplicity guarantees the scalability of DiffRhythm. Moreover, we release
the complete training code along with the pre-trained model on large-scale data
to promote reproducibility and further research.
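
To put the headline figures in perspective, the short sketch below converts the stated maximum song length (4m45s) and generation time (ten seconds) into a speedup over real-time playback. It assumes "ten seconds" means exactly 10 s and that the longest song is being generated; the abstract does not specify the hardware, and the function and variable names are illustrative only, not part of the released code.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds duration into total seconds."""
    return 60 * minutes + seconds


# Figures quoted in the abstract.
song_seconds = to_seconds(4, 45)   # maximum song length: 4m45s
generation_seconds = 10.0          # assumed: "ten seconds" taken as exactly 10 s

# How many seconds of audio are produced per second of compute.
speedup_over_real_time = song_seconds / generation_seconds

# Real-time factor: compute time per second of audio (lower is faster).
real_time_factor = generation_seconds / song_seconds

print(f"song length: {song_seconds} s")
print(f"speedup over real time: {speedup_over_real_time:.1f}x")
print(f"real-time factor: {real_time_factor:.3f}")
```

Under these assumptions, a 285-second song generated in 10 seconds corresponds to roughly 28.5 seconds of audio per second of compute.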