FastVLM: Efficient Vision Encoding for Vision Language Models

Abstract

Scaling the input image resolution is essential for improving the performance of Vision Language Models (VLMs), particularly on text-rich image understanding tasks. However, popular visual encoders such as ViTs become inefficient at high resolutions: the large number of tokens and the stacked self-attention layers lead to high encoding latency. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, both of which lower overall latency. Based on a comprehensive efficiency analysis of the interplay between image resolution, vision latency, token count, and LLM size, we introduce FastVLM, a model that achieves an optimized trade-off between latency, model size, and accuracy. FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Unlike previous methods, FastVLM achieves the optimal balance between visual token count and image resolution solely by scaling the input image, eliminating the need for additional token pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM achieves a 3.2× improvement in time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. Compared to LLaVA-OneVision at the highest resolution (1152×1152), FastVLM achieves comparable performance on key benchmarks such as SeedBench and MMMU, using the same 0.5B LLM, but with 85× faster TTFT and a vision encoder that is 3.4× smaller.
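
To make the resolution-versus-token-count trade-off concrete, the following is a minimal sketch of how the number of visual tokens grows with input resolution for an encoder that splits a square image into non-overlapping patches. The function names, the patch size of 14, and the downsampling factors of 16 and 64 are illustrative assumptions, not values taken from the abstract above; the quadratic pair count is only a rough proxy for self-attention cost, not a latency measurement.

```python
def visual_token_count(image_size: int, patch_size: int) -> int:
    """Number of tokens for a square image cut into non-overlapping square patches.

    `patch_size` stands in for the encoder's overall downsampling factor
    (an illustrative assumption, not a figure from the paper).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_pair_count(num_tokens: int) -> int:
    """Token pairs one self-attention layer compares: quadratic in token count."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    # Tripling the side length multiplies the token count by 9 and the
    # number of attention pairs by 81 (hypothetical patch size of 14).
    for size in (336, 672, 1008):
        tokens = visual_token_count(size, patch_size=14)
        print(size, tokens, attention_pair_count(tokens))

    # A larger downsampling factor yields far fewer tokens at the same
    # resolution (both factors here are hypothetical).
    print(visual_token_count(1024, patch_size=16))
    print(visual_token_count(1024, patch_size=64))
```

Under these assumptions, fewer tokens help twice: the encoder's own attention layers have fewer pairs to compare, and the LLM receives a shorter visual prefix to process before it can emit its first token.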
