
CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages

February 14, 2025
Authors: Shangda Wu, Zhancheng Guo, Ruibin Yuan, Junyan Jiang, Seungheon Doh, Gus Xia, Juhan Nam, Xiaobing Li, Feng Yu, Maosong Sun
cs.AI

Abstract

CLaMP 3 is a unified framework developed to address challenges of cross-modal and cross-lingual generalization in music information retrieval. Using contrastive learning, it aligns all major music modalities--including sheet music, performance signals, and audio recordings--with multilingual text in a shared representation space, enabling retrieval across unaligned modalities with text as a bridge. It features a multilingual text encoder adaptable to unseen languages, exhibiting strong cross-lingual generalization. Leveraging retrieval-augmented generation, we curated M4-RAG, a web-scale dataset consisting of 2.31 million music-text pairs. This dataset is enriched with detailed metadata that represents a wide array of global musical traditions. To advance future research, we release WikiMT-X, a benchmark comprising 1,000 triplets of sheet music, audio, and richly varied text descriptions. Experiments show that CLaMP 3 achieves state-of-the-art performance on multiple MIR tasks, significantly surpassing previous strong baselines and demonstrating excellent generalization in multimodal and multilingual music contexts.
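
The idea of aligning each music modality with text by contrastive learning, then retrieving across modalities that were never paired with each other, can be illustrated with a minimal NumPy sketch. This is not the CLaMP 3 implementation: the function names, the temperature of 0.07, and the random toy embeddings standing in for encoder outputs are all assumptions made for illustration. It shows a symmetric InfoNCE-style loss over a batch of paired embeddings and a cosine-similarity lookup in the shared space.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(music_emb, text_emb, temperature=0.07):
    """Contrastive loss for a batch where row i of each array is a matched pair.

    The temperature value is an illustrative assumption, not taken from the paper.
    """
    m = l2_normalize(music_emb)
    t = l2_normalize(text_emb)
    logits = m @ t.T / temperature          # (batch, batch) scaled cosine similarities
    idx = np.arange(len(m))                 # matched pairs sit on the diagonal
    loss_m2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # music -> text
    loss_t2m = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> music
    return 0.5 * (loss_m2t + loss_t2m)


def retrieve(query_emb, candidate_embs, k=1):
    """Return indices of the k candidates most similar to the query (cosine)."""
    scores = l2_normalize(candidate_embs) @ l2_normalize(query_emb)
    return np.argsort(-scores)[:k]


# Toy stand-ins for encoder outputs (hypothetical data, not real embeddings).
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 64))
# Sheet music and audio are each placed near their text description only;
# no sheet-audio pair is ever used to construct them.
sheet = text + 0.05 * rng.normal(size=text.shape)
audio = text + 0.05 * rng.normal(size=text.shape)

print("sheet-text loss:", symmetric_info_nce(sheet, text))
print("audio retrieved for sheet 3:", retrieve(sheet[3], audio, k=1))
```

Because both modalities are pulled toward the same text embeddings, a sheet-music query lands near its corresponding audio even though the two were never directly aligned, which is the sense in which text acts as a bridge.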
