FastVLM: Efficient Vision Encoding for Vision Language Models
December 17, 2024
Authors: Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
cs.AI
Abstract
Scaling the input image resolution is essential for enhancing the performance
of Vision Language Models (VLMs), particularly in text-rich image understanding
tasks. However, popular visual encoders such as ViTs become inefficient at high
resolutions due to the large number of tokens and high encoding latency caused
by stacked self-attention layers. At different operational resolutions, the
vision encoder of a VLM can be optimized along two axes: reducing encoding
latency and minimizing the number of visual tokens passed to the LLM, thereby
lowering overall latency. Based on a comprehensive efficiency analysis of the
interplay between image resolution, vision latency, token count, and LLM size,
we introduce FastVLM, a model that achieves an optimized trade-off between
latency, model size and accuracy. FastVLM incorporates FastViTHD, a novel
hybrid vision encoder designed to output fewer tokens and significantly reduce
encoding time for high-resolution images. Unlike previous methods, FastVLM
achieves the optimal balance between visual token count and image resolution
solely by scaling the input image, eliminating the need for additional token
pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM
achieves a 3.2× improvement in time-to-first-token (TTFT) while
maintaining similar performance on VLM benchmarks compared to prior works.
Compared to LLaVA-OneVision at the highest resolution (1152×1152),
FastVLM achieves comparable performance on key benchmarks like SeedBench and
MMMU, using the same 0.5B LLM, but with 85× faster TTFT and a vision
encoder that is 3.4× smaller.
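
To make the latency argument concrete, here is a minimal Python sketch of the two optimization axes the abstract describes: for a fixed output grid, the number of visual tokens grows quadratically with image resolution, and every visual token adds LLM prefill work before the first output token is produced. The patch size (14, typical of CLIP-style ViT-L/14), the hybrid encoder's effective downsampling factor (64), and the latency constants are illustrative assumptions, not values reported in the paper.

```python
# Illustrative sketch only: not the paper's code. The grid sizes and latency
# constants below are assumptions chosen to show the trend, not measurements.

def visual_token_count(resolution: int, downsample: int) -> int:
    """Tokens emitted for a square image: one token per output-grid cell."""
    return (resolution // downsample) ** 2

def ttft_estimate(vision_latency_s: float, n_visual_tokens: int,
                  prefill_tokens_per_s: float) -> float:
    """Toy TTFT model: vision encoding time plus LLM prefill over the
    visual tokens (prompt text tokens ignored for simplicity)."""
    return vision_latency_s + n_visual_tokens / prefill_tokens_per_s

for res in (336, 768, 1152):
    vit = visual_token_count(res, downsample=14)     # ViT-L/14-style grid
    hybrid = visual_token_count(res, downsample=64)  # coarser hybrid grid
    print(f"{res}x{res}: ViT/14 -> {vit} tokens, hybrid/64 -> {hybrid} tokens")

# e.g. at 1152x1152: 6724 vs. 324 tokens. With an assumed prefill throughput
# of 4000 tokens/s and equal 0.1 s encoder latency, the toy TTFT gap is
# ~1.78 s vs. ~0.18 s -- the quadratic token growth dominates at high
# resolution, which is why emitting fewer tokens per image matters.
print(ttft_estimate(0.1, 6724, 4000), ttft_estimate(0.1, 324, 4000))
```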