Inference Optimal VLMs Need Only One Visual Token but Larger Models
November 5, 2024
Authors: Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter
cs.AI
Abstract
Vision Language Models (VLMs) have demonstrated strong capabilities across
various visual understanding and reasoning tasks. However, their real-world
deployment is often constrained by high latency during inference due to
substantial compute required to process the large number of input tokens
(predominantly from the image) by the LLM. To reduce inference costs, one can
either downsize the LLM or reduce the number of input image-tokens, the latter
of which has been the focus of many recent works around token compression.
However, it is unclear what the optimal trade-off is, as both factors
directly affect VLM performance. We first characterize this optimal
trade-off between the number of visual tokens and LLM parameters by
establishing scaling laws that capture variations in performance with these two
factors. Our results reveal a surprising trend: for visual reasoning tasks, the
inference-optimal behavior in VLMs, i.e., minimum downstream error at any given
fixed inference compute, is achieved when using the largest LLM that fits
within the inference budget while minimizing visual token count, often down to a
single token. While the token reduction literature has mainly focused on
maintaining base model performance by modestly reducing the token count (e.g.,
5-10x), our results indicate that the compute-optimal inference regime
requires operating under even higher token compression ratios. Based on these
insights, we take some initial steps towards building approaches tailored for
high token compression settings. Code is available at
https://github.com/locuslab/llava-token-compression.
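To make the trade-off described in the abstract concrete, the sketch below is a small illustrative example, not code from the paper or its repository. It assumes a power-law error model in LLM parameter count N and visual token count V, approximates inference compute as proportional to N times the number of input tokens, and sweeps candidate configurations under a fixed budget. The function names, exponents, and constants are all hypothetical; the only property they encode is the paper's qualitative finding that error is far more sensitive to LLM size than to the number of visual tokens, in which case the budget-constrained minimum lands on the largest LLM paired with very few visual tokens.

```python
# Toy numerical sketch of the visual-tokens-vs-LLM-size trade-off at fixed
# inference compute. The error model, exponents, and constants below are
# illustrative assumptions, NOT the scaling law fitted in the paper.

def downstream_error(n_params: float, n_visual_tokens: int) -> float:
    """Assumed power-law error model: error is far more sensitive to
    LLM size than to the number of visual tokens (hypothetical constants)."""
    return 1e4 / n_params**0.5 + 0.02 / n_visual_tokens**0.15


def inference_flops(n_params: float, n_visual_tokens: int, n_text_tokens: int = 50) -> float:
    """Rough proxy: forward-pass FLOPs ~ 2 * params * total input tokens."""
    return 2.0 * n_params * (n_visual_tokens + n_text_tokens)


# Fix the budget to that of, e.g., a 7B LLM fed 576 visual tokens (a LLaVA-style default).
budget = inference_flops(7e9, 576)

best = None
for n_params in [1e9, 3e9, 7e9, 13e9, 34e9, 70e9]:   # candidate LLM sizes
    for n_tokens in [1, 16, 64, 144, 576]:            # pooled visual-token grids
        if inference_flops(n_params, n_tokens) <= budget:
            err = downstream_error(n_params, n_tokens)
            if best is None or err < best[0]:
                best = (err, n_params, n_tokens)

print(best)  # under this toy model: the 70B LLM with a single visual token
```

Swapping in a different error model (e.g., one where the visual-token term dominates) shifts the optimum back toward smaller LLMs with more tokens, which is exactly the sensitivity the paper's fitted scaling laws are meant to resolve.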