SmolVLM: Redefining small and efficient multimodal models
April 7, 2025
Authors: Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, Thomas Wolf
cs.AI
Abstract
Large Vision-Language Models (VLMs) deliver exceptional performance but
require significant computational resources, limiting their deployment on
mobile and edge devices. Smaller VLMs typically mirror design choices of larger
models, such as extensive image tokenization, leading to inefficient GPU memory
usage and constrained practicality for on-device applications.
We introduce SmolVLM, a series of compact multimodal models specifically
engineered for resource-efficient inference. We systematically explore
architectural configurations, tokenization strategies, and data curation
optimized for low computational overhead. Through this, we identify key design
choices that yield substantial performance gains on image and video tasks with
minimal memory footprints.
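One standard way compact VLMs cut the image-token budget mentioned above is pixel shuffle (space-to-depth), which merges each r x r block of vision-encoder patch embeddings into a single, wider token. The sketch below is illustrative only: the grid size (27 x 27), embedding width (1152), and shuffle ratio (r = 3) are assumed example values, not figures stated in this abstract.

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Space-to-depth token compression: fold each r x r block of
    patch embeddings into one token. The token count shrinks by a
    factor of r**2 while the channel dimension grows by r**2, so the
    reshape is information-preserving."""
    h, w, c = x.shape
    assert h % r == 0 and w % r == 0, "grid must divide evenly by r"
    x = x.reshape(h // r, r, w // r, r, c)
    x = x.transpose(0, 2, 1, 3, 4)          # gather each r x r block together
    return x.reshape(h // r, w // r, c * r * r)

# Illustrative: a 27 x 27 grid of 1152-dim patch embeddings
# (729 tokens) compressed 9x down to a 9 x 9 grid (81 tokens).
patches = np.zeros((27, 27, 1152))
compressed = pixel_shuffle(patches, r=3)
```

Fewer, wider tokens mean a much shorter sequence entering the language model, which is where the GPU-memory savings during inference come from.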
Our smallest model, SmolVLM-256M, uses less than 1 GB of GPU memory during
inference and outperforms the 300-times larger Idefics-80B model, despite an
18-month development gap. Our largest model, at 2.2B parameters, rivals
state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend
beyond static images, demonstrating robust video comprehension capabilities.
Our results emphasize that strategic architectural optimizations, aggressive
yet efficient tokenization, and carefully curated training data significantly
enhance multimodal performance, facilitating practical, energy-efficient
deployments at significantly smaller scales.