YuE: Scaling Open Foundation Models for Long-Form Music Generation
March 11, 2025
Authors: Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, Yongyi Zang, Haohe Liu, Yiming Liang, Wenye Ma, Xingjian Du, Xinrun Du, Zhen Ye, Tianyu Zheng, Yinghao Ma, Minghao Liu, Zeyue Tian, Ziya Zhou, Liumeng Xue, Xingwei Qu, Yizhi Li, Shangda Wu, Tianhao Shen, Ziyang Ma, Jun Zhan, Chunhui Wang, Yatian Wang, Xiaowei Chi, Xinyue Zhang, Zhenzhu Yang, Xiangzhou Wang, Shansong Liu, Lingrui Mei, Peng Li, Junjie Wang, Jianwei Yu, Guojian Pang, Xu Li, Zihao Wang, Xiaohuan Zhou, Lijun Yu, Emmanouil Benetos, Yong Chen, Chenghua Lin, Xie Chen, Gus Xia, Zhaoxiang Zhang, Chao Zhang, Wenhu Chen, Xinyu Zhou, Xipeng Qiu, Roger Dannenberg, Jiaheng Liu, Jian Yang, Wenhao Huang, Wei Xue, Xu Tan, Yike Guo
cs.AI
Abstract
We tackle the task of long-form music generation--particularly the
challenging lyrics-to-song problem--by introducing YuE, a family of
open foundation models based on the LLaMA2 architecture. Specifically, YuE
scales to trillions of tokens and generates up to five minutes of music while
maintaining lyrical alignment, coherent musical structure, and engaging vocal
melodies with appropriate accompaniment. It achieves this through (1)
track-decoupled next-token prediction to overcome dense mixture signals, (2)
structural progressive conditioning for long-context lyrical alignment, and (3)
a multitask, multiphase pre-training recipe to converge and generalize. In
addition, we redesign the in-context learning technique for music generation,
enabling versatile style transfer (e.g., converting Japanese city pop into an
English rap while preserving the original accompaniment) and bidirectional
generation. Through extensive evaluation, we demonstrate that YuE matches or
even surpasses some of the proprietary systems in musicality and vocal agility.
In addition, fine-tuning YuE enables additional controls and enhanced support
for tail languages. Furthermore, beyond generation, we show that YuE's learned
representations can perform well on music understanding tasks, where the
results of YuE match or exceed state-of-the-art methods on the MARBLE
benchmark. Keywords: lyrics2song, song generation, long-form, foundation model,
music generation
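The track-decoupled next-token prediction mentioned in the abstract can be pictured as modeling separate vocal and accompaniment token streams in one language-model sequence, rather than a single dense mixture stream. The sketch below is a minimal illustration under assumed conventions (frame-aligned codec tokens, vocal-then-accompaniment interleaving, invented function names); it is not the paper's actual implementation.

```python
# Hypothetical sketch of track-decoupled sequence layout: instead of one
# dense "mixture" token per frame, each frame contributes a vocal token
# and an accompaniment token, interleaved for next-token prediction.
# The interleaving order and helper names are assumptions for illustration.

def interleave_tracks(vocal_tokens, accomp_tokens):
    """Zip frame-aligned vocal/accompaniment tokens into one LM sequence."""
    assert len(vocal_tokens) == len(accomp_tokens), "tracks must be frame-aligned"
    seq = []
    for v, a in zip(vocal_tokens, accomp_tokens):
        seq.extend([v, a])  # per frame: vocal token first, then accompaniment
    return seq

def split_tracks(seq):
    """Invert the interleaving to recover the two per-track streams."""
    return seq[0::2], seq[1::2]
```

With this layout, a decoder-only model trained with ordinary next-token prediction still sees both tracks, but each prediction targets a single track's token, sidestepping the dense mixed-signal target.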