Samba-ASR: State-of-the-Art Speech Recognition Leveraging Structured State-Space Models
January 6, 2025
Authors: Syed Abdul Gaffar Shakhadri, Kruthika KR, Kartik Basavaraj Angadi
cs.AI
Abstract
We propose Samba ASR, the first state-of-the-art Automatic Speech Recognition
(ASR) model leveraging the novel Mamba architecture as both encoder and
decoder, built on the foundation of state-space models (SSMs). Unlike
transformer-based ASR models, which rely on self-attention mechanisms to
capture dependencies, Samba ASR effectively models both local and global
temporal dependencies using efficient state-space dynamics, achieving
remarkable performance gains. By addressing the limitations of transformers,
such as quadratic scaling with input length and difficulty in handling
long-range dependencies, Samba ASR achieves superior accuracy and efficiency.
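The efficiency claim rests on the state-space recurrence itself: each step updates a fixed-size hidden state, so a sequence of length L costs O(L), whereas self-attention compares every pair of positions and costs O(L^2). Below is a minimal sketch of a discretized linear state-space scan, h_t = A h_{t-1} + B x_t, y_t = C h_t. This is an illustrative toy, not the Samba ASR implementation: Mamba additionally makes the SSM parameters input-dependent ("selective") and uses a hardware-aware parallel scan, and all matrices here are invented for the example.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear-time state-space recurrence over a 1-D input sequence.

    h_t = A @ h_{t-1} + B * x_t   (fixed-size state update)
    y_t = C @ h_t                 (scalar readout)

    Cost is O(L) in sequence length L, versus O(L^2) for self-attention,
    since each step touches only the N-dimensional state, never past inputs.
    """
    N = A.shape[0]
    h = np.zeros(N)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = A @ h + B * x_t   # fold the new input into the hidden state
        y[t] = C @ h          # emit an output from the current state
    return y

# Toy single-channel example with an arbitrary stable state matrix.
rng = np.random.default_rng(0)
N = 4
A = 0.9 * np.eye(N)            # spectral radius < 1 keeps the state bounded
B = rng.standard_normal(N)
C = rng.standard_normal(N)
out = ssm_scan(rng.standard_normal(16), A, B, C)
print(out.shape)               # one output per input step
```

Because the recurrence is linear in h, it can also be unrolled as a convolution or computed with a parallel scan, which is what makes SSM layers fast to train on long audio sequences.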
Experimental results demonstrate that Samba ASR surpasses existing
open-source transformer-based ASR models across various standard benchmarks,
establishing it as the new state of the art in ASR. Extensive evaluations on
benchmark datasets show significant improvements in Word Error Rate (WER), with
competitive performance even in low-resource scenarios. Furthermore, the
computational efficiency and parameter optimization of the Mamba architecture
make Samba ASR a scalable and robust solution for diverse ASR tasks.
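Word Error Rate, the metric reported above, is the word-level edit distance between reference and hypothesis transcripts, normalized by the reference length: WER = (substitutions + deletions + insertions) / N. A minimal reference implementation (the paper does not specify its scoring code; this is the standard Levenshtein formulation):

```python
def wer(ref: str, hyp: str) -> float:
    """Word Error Rate via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                     # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                     # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(r)

score = wer("the cat sat on the mat", "the cat sat on mat")
print(score)  # 1 deletion over 6 reference words
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, since the numerator is not bounded by the reference length.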
Our contributions include:
- A new Samba ASR architecture demonstrating the superiority of SSMs over transformer-based models for speech sequence processing.
- A comprehensive evaluation on public benchmarks showcasing state-of-the-art performance.
- An analysis of computational efficiency, robustness to noise, and sequence generalization.

This work highlights the viability of Mamba SSMs as a transformer-free alternative for efficient and accurate ASR. By leveraging state-space modeling advancements, Samba ASR sets a new benchmark for ASR performance and future research.