FastVLM: Efficient Vision Encoding for Vision Language Models
December 17, 2024
Authors: Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
cs.AI
Abstract
Scaling the input image resolution is essential for enhancing the performance
of Vision Language Models (VLMs), particularly in text-rich image understanding
tasks. However, popular visual encoders such as ViTs become inefficient at high
resolutions due to the large number of tokens and high encoding latency caused
by stacked self-attention layers. At different operational resolutions, the
vision encoder of a VLM can be optimized along two axes: reducing encoding
latency and minimizing the number of visual tokens passed to the LLM, thereby
lowering overall latency. Based on a comprehensive efficiency analysis of the
interplay between image resolution, vision latency, token count, and LLM size,
we introduce FastVLM, a model that achieves an optimized trade-off between
latency, model size and accuracy. FastVLM incorporates FastViTHD, a novel
hybrid vision encoder designed to output fewer tokens and significantly reduce
encoding time for high-resolution images. Unlike previous methods, FastVLM
achieves the optimal balance between visual token count and image resolution
solely by scaling the input image, eliminating the need for additional token
pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM
achieves a 3.2× improvement in time-to-first-token (TTFT) while
maintaining similar performance on VLM benchmarks compared to prior works.
Compared to LLaVA-OneVision at the highest resolution (1152×1152),
FastVLM achieves comparable performance on key benchmarks like SeedBench and
MMMU, using the same 0.5B LLM, but with 85× faster TTFT and a vision
encoder that is 3.4× smaller.
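
To make the latency argument concrete, here is a minimal Python sketch of the two optimization axes the abstract describes: for a fixed output grid, the number of visual tokens grows quadratically with image resolution, and every visual token adds LLM prefill work before the first output token is produced. The patch size (14, typical of CLIP-style ViT-L/14), the hybrid encoder's effective downsampling factor (64), and the latency constants are illustrative assumptions, not values reported in the paper.

```python
# Illustrative sketch only: not the paper's code. The grid sizes and latency
# constants below are assumptions chosen to show the trend, not measurements.

def visual_token_count(resolution: int, downsample: int) -> int:
    """Tokens emitted for a square image: one token per output-grid cell."""
    return (resolution // downsample) ** 2

def ttft_estimate(vision_latency_s: float, n_visual_tokens: int,
                  prefill_tokens_per_s: float) -> float:
    """Toy TTFT model: vision encoding time plus LLM prefill over the
    visual tokens (prompt text tokens ignored for simplicity)."""
    return vision_latency_s + n_visual_tokens / prefill_tokens_per_s

for res in (336, 768, 1152):
    vit = visual_token_count(res, downsample=14)     # ViT-L/14-style grid
    hybrid = visual_token_count(res, downsample=64)  # coarser hybrid grid
    print(f"{res}x{res}: ViT/14 -> {vit} tokens, hybrid/64 -> {hybrid} tokens")

# e.g. at 1152x1152: 6724 vs. 324 tokens. With an assumed prefill throughput
# of 4000 tokens/s and equal 0.1 s encoder latency, the toy TTFT gap is
# ~1.78 s vs. ~0.18 s -- the quadratic token growth dominates at high
# resolution, which is why emitting fewer tokens per image matters.
print(ttft_estimate(0.1, 6724, 4000), ttft_estimate(0.1, 324, 4000))
```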