FastVLM: Efficient Vision Encoding for Vision Language Models
December 17, 2024
Authors: Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
cs.AI
Abstract
Scaling the input image resolution is essential for enhancing the performance
of Vision Language Models (VLMs), particularly in text-rich image understanding
tasks. However, popular visual encoders such as ViTs become inefficient at high
resolutions due to the large number of tokens and high encoding latency caused
by stacked self-attention layers. At different operational resolutions, the
vision encoder of a VLM can be optimized along two axes: reducing encoding
latency and minimizing the number of visual tokens passed to the LLM, thereby
lowering overall latency. Based on a comprehensive efficiency analysis of the
interplay between image resolution, vision latency, token count, and LLM size,
we introduce FastVLM, a model that achieves an optimized trade-off between
latency, model size, and accuracy. FastVLM incorporates FastViTHD, a novel
hybrid vision encoder designed to output fewer tokens and significantly reduce
encoding time for high-resolution images. Unlike previous methods, FastVLM
achieves the optimal balance between visual token count and image resolution
solely by scaling the input image, eliminating the need for additional token
pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM
achieves a 3.2× improvement in time-to-first-token (TTFT) while
maintaining similar performance on VLM benchmarks compared to prior works.
Compared to LLaVA-OneVision at the highest resolution (1152×1152),
FastVLM achieves comparable performance on key benchmarks like SeedBench and
MMMU, using the same 0.5B LLM, but with an 85× faster TTFT and a vision
encoder that is 3.4× smaller.
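To make the token-count axis of the abstract concrete, here is a minimal back-of-the-envelope sketch of how the number of visual tokens grows with input resolution for a plain ViT versus a more aggressively downsampling hybrid encoder. The patch size (14, as in CLIP ViT-L/14) and the 64× downsampling factor below are illustrative assumptions, not values quoted from the abstract.

```python
# Rough, illustrative arithmetic for the resolution-vs-token-count trade-off.
# The patch size and downsampling factor are assumptions for illustration,
# not parameters reported in this abstract.

def vit_tokens(resolution: int, patch_size: int = 14) -> int:
    """A plain ViT emits one token per non-overlapping patch."""
    return (resolution // patch_size) ** 2

def hybrid_tokens(resolution: int, downsample: int = 64) -> int:
    """A convolutional hybrid encoder can downsample the feature map
    more aggressively, so far fewer tokens reach the LLM."""
    return (resolution // downsample) ** 2

for res in (336, 768, 1152):
    print(f"{res}px: ViT -> {vit_tokens(res)} tokens, "
          f"hybrid -> {hybrid_tokens(res)} tokens")
```

Under these assumed factors, a 1152×1152 input yields several thousand tokens from a ViT-style tokenizer but only a few hundred from a 64×-downsampling encoder, which is why both encoding latency and LLM prefill time drop when the token count is controlled by the encoder itself rather than by post-hoc pruning.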