Kimi-Audio Technical Report
April 25, 2025
Authors: KimiTeam, Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, Zhengtao Wang, Chu Wei, Yifei Xin, Xinran Xu, Jianwei Yu, Yutao Zhang, Xinyu Zhou, Y. Charles, Jun Chen, Yanru Chen, Yulun Du, Weiran He, Zhenxing Hu, Guokun Lai, Qingcheng Li, Yangyang Liu, Weidong Sun, Jianzhou Wang, Yuzhi Wang, Yuefeng Wu, Yuxin Wu, Dongchao Yang, Hao Yang, Ying Yang, Zhilin Yang, Aoxiong Yin, Ruibin Yuan, Yutong Zhang, Zaida Zhou
cs.AI
Abstract
We present Kimi-Audio, an open-source audio foundation model that excels in
audio understanding, generation, and conversation. We detail the practices in
building Kimi-Audio, including model architecture, data curation, training
recipe, inference deployment, and evaluation. Specifically, we leverage a
12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous
features as input and discrete tokens as output, and develop a chunk-wise
streaming detokenizer based on flow matching. We curate a pre-training dataset
that consists of more than 13 million hours of audio data covering a wide range
of modalities including speech, sound, and music, and build a pipeline to
construct high-quality and diverse post-training data. Initialized from a
pre-trained LLM, Kimi-Audio is continually pre-trained on both audio and text
data with several carefully designed tasks, and then fine-tuned to support a
diverse set of audio-related tasks. Extensive evaluation shows that Kimi-Audio
achieves state-of-the-art performance on a range of audio benchmarks including
speech recognition, audio understanding, audio question answering, and speech
conversation. We release the code, model checkpoints, as well as the
evaluation toolkit at https://github.com/MoonshotAI/Kimi-Audio.
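The abstract's design choices imply some simple arithmetic: a 12.5 Hz tokenizer produces 12.5 discrete tokens per second of audio, and a chunk-wise streaming detokenizer can begin synthesizing as soon as each fixed-size group of tokens is available. The sketch below illustrates only that token budgeting and chunking; the function names and chunk size are illustrative assumptions, not the Kimi-Audio API.

```python
# Illustrative sketch of 12.5 Hz token budgeting and chunk-wise grouping.
# Names and the chunk size are hypothetical; they are not from the report's code.

TOKEN_RATE_HZ = 12.5  # discrete audio tokens per second, per the abstract


def num_tokens(duration_s: float) -> int:
    """Number of discrete audio tokens for a clip of the given duration."""
    return round(duration_s * TOKEN_RATE_HZ)


def chunk_tokens(tokens: list, chunk_size: int) -> list:
    """Split a token stream into fixed-size chunks, so each chunk can be
    handed to a streaming detokenizer (e.g., a flow-matching decoder)
    without waiting for the full utterance to finish."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


tokens = list(range(num_tokens(4.0)))        # 4 s of audio -> 50 tokens
chunks = chunk_tokens(tokens, chunk_size=25)  # two chunks of ~2 s each
```

At this low token rate, even long dialogues stay within an LLM's context budget (an hour of audio is 45,000 tokens), which is one reason a coarse tokenizer pairs well with an LLM backbone.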