Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
October 20, 2024
Authors: Alan Dao, Dinh Bach Vu, Huy Hoang Ha
cs.AI
Abstract
Large Language Models (LLMs) have revolutionized natural language processing,
but their application to speech-based tasks remains challenging due to the
complexities of integrating audio and text modalities. This paper introduces
Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of
speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes
speech into discrete tokens and employs a uniform transformer-based
architecture for both speech and text modalities. This method enables joint
reasoning and generation across modalities without the need for separate
adapters. We present a comprehensive training methodology, including
pre-training on multilingual speech recognition datasets and fine-tuning on a
curated instruction dataset. Ichigo demonstrates state-of-the-art performance
on speech question-answering benchmarks, outperforming existing open-source
speech language models and achieving comparable results to cascaded systems.
Notably, Ichigo exhibits a latency of just 111 ms to first token generation,
significantly lower than current models. Our approach not only advances the
field of multimodal AI but also provides a framework for smaller research teams
to contribute effectively to open-source speech-language models.
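The core mechanism the abstract describes, tokenized early fusion, can be illustrated in a few lines: speech is quantized into discrete codebook indices, those indices are offset into the language model's vocabulary, and the result is interleaved with text tokens into a single sequence for one transformer. The sketch below is a hypothetical illustration only; the vocabulary sizes, marker tokens, and function names are assumptions, not Ichigo's actual implementation.

```python
# Minimal sketch of tokenized early fusion. All constants and names here
# (vocabulary sizes, marker tokens, function names) are illustrative
# assumptions, not Ichigo's real code.

from typing import List

TEXT_VOCAB_SIZE = 32_000        # assumed size of the base text tokenizer
SPEECH_CODEBOOK_SIZE = 512      # assumed size of the speech quantizer codebook
SOUND_START = TEXT_VOCAB_SIZE + SPEECH_CODEBOOK_SIZE  # hypothetical delimiters
SOUND_END = SOUND_START + 1


def speech_to_token_ids(codebook_indices: List[int]) -> List[int]:
    """Map discrete quantizer indices into the extended LLM vocabulary.

    Speech IDs are offset past the text vocabulary so a single embedding
    table and output head cover both modalities.
    """
    return [TEXT_VOCAB_SIZE + idx for idx in codebook_indices]


def build_interleaved_sequence(text_ids: List[int],
                               codebook_indices: List[int]) -> List[int]:
    """Interleave speech and text into one token stream.

    The unified transformer consumes a single sequence; no adapter
    separates the modalities, only delimiter tokens marking the speech span.
    """
    speech_ids = speech_to_token_ids(codebook_indices)
    return [SOUND_START] + speech_ids + [SOUND_END] + text_ids


# Example: a made-up quantized utterance followed by a text instruction.
sequence = build_interleaved_sequence(
    text_ids=[101, 2054, 2003],          # placeholder text token IDs
    codebook_indices=[17, 402, 9, 230],  # placeholder quantizer outputs
)
print(sequence)  # one stream: [marker, speech tokens, marker, text tokens]
```

Offsetting speech token IDs past the text vocabulary is what lets one transformer jointly reason over and generate both modalities, which is the property the abstract credits with removing the need for separate adapters.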