Inference Optimal VLMs Need Only One Visual Token but Larger Models
November 5, 2024
Authors: Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter
cs.AI
Abstract
Vision Language Models (VLMs) have demonstrated strong capabilities across
various visual understanding and reasoning tasks. However, their real-world
deployment is often constrained by high latency during inference due to
substantial compute required to process the large number of input tokens
(predominantly from the image) by the LLM. To reduce inference costs, one can
either downsize the LLM or reduce the number of input image-tokens, the latter
of which has been the focus of many recent works around token compression.
However, it is unclear what the optimal trade-off is, as both the factors
directly affect the VLM performance. We first characterize this optimal
trade-off between the number of visual tokens and LLM parameters by
establishing scaling laws that capture variations in performance with these two
factors. Our results reveal a surprising trend: for visual reasoning tasks, the
inference-optimal behavior in VLMs, i.e., minimum downstream error at any given
fixed inference compute, is achieved when using the largest LLM that fits
within the inference budget while minimizing visual token count - often to a
single token. While the token reduction literature has mainly focused on
maintaining base model performance by modestly reducing the token count (e.g.,
5-10x), our results indicate that the compute-optimal inference regime
requires operating under even higher token compression ratios. Based on these
insights, we take some initial steps towards building approaches tailored for
high token compression settings. Code is available at
https://github.com/locuslab/llava-token-compression.
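As a rough illustration of the trade-off the abstract describes, the sketch below models downstream error with a toy power-law in LLM parameter count and visual token count, then grid-searches for the configuration that minimizes that error under a fixed inference-FLOP budget. The error model, constants, token/parameter grids, and the helper names (inference_flops, toy_error, best_config, TEXT_TOKENS) are all illustrative assumptions, not the scaling law fitted in the paper or code from the linked repository.

```python
# Minimal, illustrative sketch of trading LLM size against visual token count
# under a fixed inference-compute budget. All constants and functional forms
# are assumptions for illustration only.

import itertools

TEXT_TOKENS = 64  # assumed number of non-visual (text) input tokens


def inference_flops(n_params: float, n_visual_tokens: int) -> float:
    """Rough decoder cost: ~2 FLOPs per parameter per input token."""
    return 2.0 * n_params * (n_visual_tokens + TEXT_TOKENS)


def toy_error(n_params: float, n_visual_tokens: int) -> float:
    """Assumed power-law error model: error falls with both factors,
    far more steeply with LLM parameters than with visual tokens."""
    return 0.2 + 30.0 * n_params ** -0.3 + 0.005 * n_visual_tokens ** -0.1


def best_config(budget_flops: float):
    """Grid-search the (params, visual tokens) pair that minimizes the
    toy error while staying within the inference-compute budget."""
    param_grid = [0.5e9, 1e9, 3e9, 7e9, 13e9, 34e9]
    token_grid = [1, 4, 16, 64, 144, 576]
    feasible = [
        (toy_error(p, v), p, v)
        for p, v in itertools.product(param_grid, token_grid)
        if inference_flops(p, v) <= budget_flops
    ]
    return min(feasible) if feasible else None


if __name__ == "__main__":
    for budget in (1e11, 1e12, 1e13):
        result = best_config(budget)
        if result is None:
            print(f"budget={budget:.0e} FLOPs: no feasible configuration")
        else:
            err, p, v = result
            print(f"budget={budget:.0e} FLOPs -> params={p:.1e}, "
                  f"visual_tokens={v}, toy_error={err:.3f}")
```

With these assumed constants, the search tends to spend a larger budget on a bigger LLM rather than on more visual tokens, which is the qualitative behavior the abstract reports; the exact optimum (including whether a single token is best) depends entirely on the fitted scaling-law coefficients, which this toy does not reproduce.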