Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs

March 9, 2025
Authors: Umberto Cappellazzo, Minsu Kim, Stavros Petridis
cs.AI

Abstract

Audio-Visual Speech Recognition (AVSR) leverages both audio and visual modalities to enhance speech recognition robustness, particularly in noisy environments. Recent advancements in Large Language Models (LLMs) have demonstrated their effectiveness in speech recognition, including AVSR. However, due to the significant length of speech representations, direct integration with LLMs imposes substantial computational costs. Prior approaches address this by compressing speech representations before feeding them into LLMs. However, higher compression ratios often lead to performance degradation, necessitating a trade-off between computational efficiency and recognition accuracy. To address this challenge, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR, which enables flexible adaptation of the audio-visual token allocation based on specific computational constraints while preserving high performance. Our approach, inspired by Matryoshka Representation Learning, encodes audio-visual representations at multiple granularities within a single model, eliminating the need to train separate models for different compression levels. Moreover, to efficiently fine-tune the LLM, we introduce three LoRA-based Matryoshka strategies using global and scale-specific LoRA modules. Extensive evaluations on the two largest AVSR datasets demonstrate that Llama-MTSK achieves state-of-the-art results, matching or surpassing models trained independently at fixed compression levels.
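The core idea described above — encoding audio-visual tokens at multiple granularities within a single model so that one compression level can be chosen at inference time — can be sketched minimally as follows. This is an illustrative sketch only, assuming average pooling as the compression operator and hypothetical compression rates; the paper's actual compressor, rates, and LoRA-based fine-tuning strategies may differ.

```python
import numpy as np

def compress_tokens(tokens: np.ndarray, rate: int) -> np.ndarray:
    """Average-pool a (T, D) token sequence by a factor of `rate`.

    Sketch of Matryoshka-style compression: the same sequence is
    pooled at several rates, yielding coarser views of one encoding.
    Pads with zeros so T need not be divisible by `rate`.
    """
    T, D = tokens.shape
    pad = (-T) % rate
    if pad:
        tokens = np.vstack([tokens, np.zeros((pad, D))])
    return tokens.reshape(-1, rate, D).mean(axis=1)

# Hypothetical setup: 16 fused audio-visual tokens of dimension 8.
rng = np.random.default_rng(0)
av_tokens = rng.standard_normal((16, 8))

# One model, several granularities: during training all scales are
# supervised; at inference a single rate is picked to match the
# available compute budget.
rates = [1, 2, 4, 8]  # hypothetical compression rates
multi_scale = {r: compress_tokens(av_tokens, r) for r in rates}

for r, toks in multi_scale.items():
    print(f"rate {r}: {toks.shape[0]} tokens fed to the LLM")
```

At rate 1 all 16 tokens reach the LLM; at rate 8 only 2 do, trading accuracy for compute without retraining a separate model per level, which is the flexibility the abstract claims.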

