ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

Abstract

Recent advances in multimodal fusion have driven the remarkable success of vision-language (VL) models, which excel in multimodal applications such as image captioning and visual question answering. However, building VL models requires substantial hardware resources, and efficiency is restricted by two key factors: extending the language model's input sequence with vision features demands more computation, and a large number of additional learnable parameters increases memory complexity. These challenges significantly restrict the broader applicability of such models. To bridge this gap, we propose ADEM-VL, an efficient vision-language method that tunes VL models based on pretrained large language models (LLMs) by adopting a parameter-free cross-attention mechanism for similarity measurement in multimodal fusion. This approach only requires embedding vision features into the language space, which significantly reduces the number of trainable parameters and accelerates both training and inference. To enhance representation learning in the fusion module, we introduce an efficient multiscale feature generation scheme that requires only a single forward pass through the vision encoder. Moreover, we propose an adaptive fusion scheme that dynamically discards less relevant visual information for each text token based on its attention score, so that the fusion process prioritizes the most pertinent visual features. Through experiments on various tasks, including visual question answering, image captioning, and instruction following, we demonstrate that our framework outperforms existing approaches. Specifically, our method improves average accuracy on the ScienceQA dataset by 0.77% over existing methods while reducing training and inference latency. The code is available at https://github.com/Hao840/ADEM-VL.
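The three ideas named in the abstract (multiscale features from a single encoder pass, parameter-free cross-attention, and per-token dropping of weakly attended visual features) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `multiscale` and `adaptive_fusion`, the use of average pooling, the scaled softmax similarity, the fixed `drop_ratio`, and the residual addition are all assumptions made for illustration. The linked repository is the reference for the actual method.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def multiscale(patches, grid, pool_sizes=(1, 2)):
    """Build multiscale visual tokens from ONE set of encoder outputs.

    `patches` is a row-major list of grid*grid feature vectors. Each pool
    size p averages non-overlapping p x p windows (assumes grid % p == 0),
    so coarser scales need no extra pass through the vision encoder.
    """
    d = len(patches[0])
    tokens = []
    for p in pool_sizes:
        for r in range(0, grid, p):
            for c in range(0, grid, p):
                window = [patches[(r + i) * grid + (c + j)]
                          for i in range(p) for j in range(p)]
                tokens.append([sum(v[k] for v in window) / len(window)
                               for k in range(d)])
    return tokens

def adaptive_fusion(text, visual, drop_ratio=0.5):
    """Cross-attention with no learned Q/K/V projections (assumed form).

    Text states act directly as queries; visual tokens, assumed to be
    already embedded in the language space, act as keys and values.
    """
    d = len(text[0])
    n_drop = int(len(visual) * drop_ratio)
    fused = []
    for x in text:
        scores = softmax([sum(a * b for a, b in zip(x, v)) / math.sqrt(d)
                          for v in visual])
        # Zero the n_drop lowest scores for this text token (no renormalization).
        order = sorted(range(len(visual)), key=lambda i: scores[i])
        for i in order[:n_drop]:
            scores[i] = 0.0
        # Add the attention-weighted visual features to the text state.
        fused.append([x[k] + sum(s * v[k] for s, v in zip(scores, visual))
                      for k in range(d)])
    return fused

# Toy usage: a 2x2 patch grid with 2-dimensional features.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
visual = multiscale(patches, grid=2)  # 4 fine tokens + 1 pooled token
text = [[1.0, 0.0], [0.0, 1.0]]
fused = adaptive_fusion(text, visual, drop_ratio=0.4)
```

Because the fusion step holds no weights of its own, the only trainable part in such a design would be whatever projects the vision features into the language space, which is consistent with the reduction in trainable parameters that the abstract reports.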
