

SmolVLM: Redefining small and efficient multimodal models

April 7, 2025
作者: Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, Thomas Wolf
cs.AI

Abstract

Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.
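The memory figures in the abstract can be sanity-checked with a back-of-envelope weight-memory estimate. The sketch below is our own arithmetic, not from the paper: it counts only parameter storage at 2 bytes per parameter (bf16/fp16), ignoring activations, KV cache, and image-token buffers, which is why the real inference footprints differ.

```python
# Back-of-envelope estimate of GPU memory for model weights alone.
# Assumes 2 bytes per parameter (bf16/fp16); activations and KV cache
# are not counted, so actual inference memory is higher.
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB for a given parameter count."""
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(256e6))  # SmolVLM-256M: ~0.5 GB of weights
print(weight_memory_gb(2.2e9))  # SmolVLM 2.2B:  ~4.4 GB of weights
print(weight_memory_gb(80e9))   # Idefics-80B:  ~160 GB of weights
```

This also illustrates the "300-times larger" comparison: 80e9 / 256e6 ≈ 312, so the 256M model competes with a model roughly 300× its size while its weights alone fit comfortably under 1 GB.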

