

SmolVLM: Redefining small and efficient multimodal models

April 7, 2025
Authors: Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, Thomas Wolf
cs.AI

Abstract

Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.
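
As a rough illustration of the reported memory footprint, the sketch below loads a small SmolVLM checkpoint with the Hugging Face transformers API and measures peak GPU memory during a single image-description generation. The Hub identifier HuggingFaceTB/SmolVLM-256M-Instruct and the local image path are assumptions made for illustration, not details taken from the abstract.

    # Minimal sketch: load a small SmolVLM checkpoint and measure peak GPU memory.
    # The model identifier and image path are assumptions for illustration.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForVision2Seq

    model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"  # assumed Hub identifier
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForVision2Seq.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    ).to("cuda")

    # Build a chat-style prompt containing one image placeholder.
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

    image = Image.open("example.jpg")  # assumed local test image
    inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")

    # Track peak GPU memory across a single generation pass.
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=128)

    print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
    print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

On a recent GPU in bfloat16, a run like this is how one would verify the sub-1GB inference claim for the 256M model; larger SmolVLM variants can be checked the same way by swapping the checkpoint identifier.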
