Kimi-Audio Technical Report
April 25, 2025
Authors: KimiTeam, Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, Zhengtao Wang, Chu Wei, Yifei Xin, Xinran Xu, Jianwei Yu, Yutao Zhang, Xinyu Zhou, Y. Charles, Jun Chen, Yanru Chen, Yulun Du, Weiran He, Zhenxing Hu, Guokun Lai, Qingcheng Li, Yangyang Liu, Weidong Sun, Jianzhou Wang, Yuzhi Wang, Yuefeng Wu, Yuxin Wu, Dongchao Yang, Hao Yang, Ying Yang, Zhilin Yang, Aoxiong Yin, Ruibin Yuan, Yutong Zhang, Zaida Zhou
cs.AI
Abstract
We present Kimi-Audio, an open-source audio foundation model that excels in
audio understanding, generation, and conversation. We detail the practices in
building Kimi-Audio, including model architecture, data curation, training
recipe, inference deployment, and evaluation. Specifically, we leverage a
12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous
features as input and discrete tokens as output, and develop a chunk-wise
streaming detokenizer based on flow matching. We curate a pre-training dataset
that consists of more than 13 million hours of audio data covering a wide range
of modalities including speech, sound, and music, and build a pipeline to
construct high-quality and diverse post-training data. Initialized from a
pre-trained LLM, Kimi-Audio is continually pre-trained on both audio and text
data with several carefully designed tasks, and then fine-tuned to support a
diverse set of audio-related tasks. Extensive evaluation shows that Kimi-Audio
achieves state-of-the-art performance on a range of audio benchmarks including
speech recognition, audio understanding, audio question answering, and speech
conversation. We release the code, model checkpoints, as well as the
evaluation toolkit at https://github.com/MoonshotAI/Kimi-Audio.
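The abstract's design choices imply some simple arithmetic: a 12.5 Hz tokenizer produces 12.5 discrete tokens per second of audio, and a chunk-wise streaming detokenizer can begin synthesizing as soon as each fixed-size group of tokens is available. The sketch below illustrates only that token budgeting and chunking; the function names and chunk size are illustrative assumptions, not the Kimi-Audio API.

```python
# Illustrative sketch of 12.5 Hz token budgeting and chunk-wise grouping.
# Names and the chunk size are hypothetical; they are not from the report's code.

TOKEN_RATE_HZ = 12.5  # discrete audio tokens per second, per the abstract


def num_tokens(duration_s: float) -> int:
    """Number of discrete audio tokens for a clip of the given duration."""
    return round(duration_s * TOKEN_RATE_HZ)


def chunk_tokens(tokens: list, chunk_size: int) -> list:
    """Split a token stream into fixed-size chunks, so each chunk can be
    handed to a streaming detokenizer (e.g., a flow-matching decoder)
    without waiting for the full utterance to finish."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


tokens = list(range(num_tokens(4.0)))        # 4 s of audio -> 50 tokens
chunks = chunk_tokens(tokens, chunk_size=25)  # two chunks of ~2 s each
```

At this low token rate, even long dialogues stay within an LLM's context budget (an hour of audio is 45,000 tokens), which is one reason a coarse tokenizer pairs well with an LLM backbone.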