Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
October 20, 2024
Authors: Alan Dao, Dinh Bach Vu, Huy Hoang Ha
cs.AI
Abstract
Large Language Models (LLMs) have revolutionized natural language processing,
but their application to speech-based tasks remains challenging due to the
complexities of integrating audio and text modalities. This paper introduces
Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of
speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes
speech into discrete tokens and employs a uniform transformer-based
architecture for both speech and text modalities. This method enables joint
reasoning and generation across modalities without the need for separate
adapters. We present a comprehensive training methodology, including
pre-training on multilingual speech recognition datasets and fine-tuning on a
curated instruction dataset. Ichigo demonstrates state-of-the-art performance
on speech question-answering benchmarks, outperforming existing open-source
speech language models and achieving comparable results to cascaded systems.
Notably, Ichigo exhibits a latency of just 111 ms to first token generation,
significantly lower than current models. Our approach not only advances the
field of multimodal AI but also provides a framework for smaller research teams
to contribute effectively to open-source speech-language models.
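The core mechanism the abstract describes, tokenized early fusion, can be illustrated in a few lines: speech is quantized into discrete codebook indices, those indices are offset into the language model's vocabulary, and the result is interleaved with text tokens into a single sequence for one transformer. The sketch below is a hypothetical illustration only; the vocabulary sizes, marker tokens, and function names are assumptions, not Ichigo's actual implementation.

```python
# Minimal sketch of tokenized early fusion. All constants and names here
# (vocabulary sizes, marker tokens, function names) are illustrative
# assumptions, not Ichigo's real code.

from typing import List

TEXT_VOCAB_SIZE = 32_000        # assumed size of the base text tokenizer
SPEECH_CODEBOOK_SIZE = 512      # assumed size of the speech quantizer codebook
SOUND_START = TEXT_VOCAB_SIZE + SPEECH_CODEBOOK_SIZE  # hypothetical delimiters
SOUND_END = SOUND_START + 1


def speech_to_token_ids(codebook_indices: List[int]) -> List[int]:
    """Map discrete quantizer indices into the extended LLM vocabulary.

    Speech IDs are offset past the text vocabulary so a single embedding
    table and output head cover both modalities.
    """
    return [TEXT_VOCAB_SIZE + idx for idx in codebook_indices]


def build_interleaved_sequence(text_ids: List[int],
                               codebook_indices: List[int]) -> List[int]:
    """Interleave speech and text into one token stream.

    The unified transformer consumes a single sequence; no adapter
    separates the modalities, only delimiter tokens marking the speech span.
    """
    speech_ids = speech_to_token_ids(codebook_indices)
    return [SOUND_START] + speech_ids + [SOUND_END] + text_ids


# Example: a made-up quantized utterance followed by a text instruction.
sequence = build_interleaved_sequence(
    text_ids=[101, 2054, 2003],          # placeholder text token IDs
    codebook_indices=[17, 402, 9, 230],  # placeholder quantizer outputs
)
print(sequence)  # one stream: [marker, speech tokens, marker, text tokens]
```

Offsetting speech token IDs past the text vocabulary is what lets one transformer jointly reason over and generate both modalities, which is the property the abstract credits with removing the need for separate adapters.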