Inference Optimal VLMs Need Only One Visual Token but Larger Models

Abstract

Vision Language Models (VLMs) have demonstrated strong capabilities across a range of visual understanding and reasoning tasks. However, their real-world deployment is often constrained by high inference latency, because the LLM must spend substantial compute processing a large number of input tokens, most of which come from the image. To reduce inference cost, one can either downsize the LLM or reduce the number of input image tokens; the latter has been the focus of much recent work on token compression. However, it is unclear what the optimal trade-off is, as both factors directly affect VLM performance. We first characterize this optimal trade-off between the number of visual tokens and LLM parameters by establishing scaling laws that capture how performance varies with these two factors. Our results reveal a surprising trend: for visual reasoning tasks, the inference-optimal behavior in VLMs, i.e., minimum downstream error at any given fixed inference compute, is achieved by using the largest LLM that fits within the inference budget while minimizing the visual token count, often to a single token. While the token reduction literature has mainly focused on maintaining base model performance by modestly reducing the token count (e.g., 5-10×), our results indicate that the compute-optimal inference regime requires operating under even higher token compression ratios. Based on these insights, we take some initial steps towards building approaches tailored for high token compression settings. Code is available at https://github.com/locuslab/llava-token-compression.
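The trade-off between LLM size and visual token count can be made concrete with a rough cost model. The sketch below assumes that forward-pass compute is about 2 FLOPs per LLM parameter per processed token, a common approximation that the abstract itself does not state. The function names, model sizes, and token counts are hypothetical and chosen only for illustration.

```python
def inference_flops(llm_params: float, text_tokens: int, visual_tokens: int) -> float:
    """Approximate forward-pass FLOPs, assuming ~2 FLOPs per parameter per token."""
    return 2.0 * llm_params * (text_tokens + visual_tokens)


def max_visual_tokens(budget_flops: float, llm_params: float, text_tokens: int) -> int:
    """Largest visual-token count that keeps a given LLM within the FLOPs budget."""
    total_tokens = int(budget_flops // (2.0 * llm_params))
    return max(total_tokens - text_tokens, 0)


# Hypothetical budget: the cost of a 0.5B-parameter LLM reading
# 30 text tokens plus 576 visual tokens.
budget = inference_flops(0.5e9, text_tokens=30, visual_tokens=576)

# Illustrative model sizes; larger LLMs must compress the image much harder
# to stay within the same budget.
for params in (0.5e9, 1.8e9, 4e9, 7e9):
    v = max_visual_tokens(budget, params, text_tokens=30)
    print(f"{params / 1e9:.1f}B params -> up to {v} visual tokens")
```

Under this simplified model, moving to a larger LLM at a fixed budget forces a proportionally smaller token count. Whether that exchange lowers downstream error is the empirical question the scaling laws address; the sketch only shows the compute side of the trade-off.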

