Samba-ASR: State-of-the-Art Speech Recognition Leveraging Structured State-Space Models

January 6, 2025
Authors: Syed Abdul Gaffar Shakhadri, Kruthika KR, Kartik Basavaraj Angadi
cs.AI

Abstract

We propose Samba ASR, the first state-of-the-art Automatic Speech Recognition (ASR) model to use the novel Mamba architecture as both encoder and decoder, built on state-space models (SSMs). Unlike transformer-based ASR models, which rely on self-attention to capture dependencies, Samba ASR models both local and global temporal dependencies through efficient state-space dynamics, achieving remarkable performance gains. By addressing limitations of transformers, such as quadratic scaling with input length and difficulty in handling long-range dependencies, Samba ASR achieves superior accuracy and efficiency.

Experimental results demonstrate that Samba ASR surpasses existing open-source transformer-based ASR models across various standard benchmarks, establishing it as the new state of the art in ASR. Extensive evaluations on benchmark datasets show significant improvements in Word Error Rate (WER), with competitive performance even in low-resource scenarios. Furthermore, the computational efficiency and parameter optimization of the Mamba architecture make Samba ASR a scalable and robust solution for diverse ASR tasks.

Our contributions include:

- A new Samba ASR architecture demonstrating the superiority of SSMs over transformer-based models for speech sequence processing.
- A comprehensive evaluation on public benchmarks showcasing state-of-the-art performance.
- An analysis of computational efficiency, robustness to noise, and sequence generalization.

This work highlights the viability of Mamba SSMs as a transformer-free alternative for efficient and accurate ASR. By leveraging advances in state-space modeling, Samba ASR sets a new benchmark for ASR performance and future research.
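
To make the scaling contrast in the abstract concrete, the following is a minimal sketch of a discrete state-space recurrence, written in plain Python. It is not the authors' implementation and is far simpler than Mamba, which additionally makes its parameters depend on the input (a selective SSM) and relies on a hardware-aware scan. The function name `ssm_scan` and the parameters `a`, `b`, and `c` are hypothetical, and the sketch assumes a diagonal, time-invariant state transition. The point is only that each time step updates a fixed-size state, so the cost grows linearly with sequence length rather than quadratically as in full self-attention.

```python
from typing import List, Sequence


def ssm_scan(
    x: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run a discrete, diagonal, time-invariant state-space recurrence.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the state dimension)
    y_t = sum(c * h_t)

    All names are illustrative assumptions, not taken from the paper.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("a, b and c must share the same state dimension")

    h = [0.0] * len(a)  # hidden state, initialised to zero
    y: List[float] = []
    for x_t in x:  # single pass over the sequence: O(len(x) * len(a))
        # Update every state channel from its previous value and the input.
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        # Read out one scalar output per time step.
        y.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return y


if __name__ == "__main__":
    # An impulse input shows the state decaying geometrically over time.
    impulse = [1.0, 0.0, 0.0, 0.0]
    print(ssm_scan(impulse, a=[0.5], b=[1.0], c=[1.0]))
    # prints [1.0, 0.5, 0.25, 0.125]
```

With `a` below 1 in magnitude, the influence of an early input fades gradually instead of being cut off at a fixed window, which is the intuition behind capturing both local and long-range dependencies with a single recurrent state.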

