Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI
February 24, 2025
Authors: Syed Abdul Gaffar Shakhadri, Kruthika KR, Kartik Basavaraj Angadi
cs.AI
Abstract
We introduce Shakti VLM, a family of vision-language models at the 1B and
4B parameter scales, designed to address data efficiency challenges in
multimodal learning. While recent VLMs achieve strong performance through
extensive training data, Shakti models leverage architectural innovations to
attain competitive results with fewer tokens. Key advancements include
QK-Normalization for attention stability, hybrid normalization techniques, and
enhanced positional encoding. A three-stage training strategy further optimizes
learning efficiency. Evaluations show that Shakti-VLM-1B and
Shakti-VLM-4B excel in document understanding, visual reasoning, OCR
extraction, and general multimodal reasoning. Our results highlight that high
performance can be achieved through model design and training strategy rather
than sheer data volume, making Shakti an efficient solution for
enterprise-scale multimodal tasks.
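
Note on QK-Normalization: as generally used in the literature, the technique applies a normalization layer to the query and key projections before the attention score computation, bounding logit magnitudes and stabilizing training. The PyTorch sketch below is a minimal, hypothetical illustration of this general idea, not the Shakti-VLM implementation; the module structure, head dimensions, and the choice of per-head LayerNorm are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    # Multi-head self-attention with QK-Normalization: queries and keys
    # are normalized per head before the dot product, keeping attention
    # logits bounded. Hypothetical sketch; not the paper's exact design.
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # QK-Norm: per-head LayerNorm on queries and keys (assumed choice).
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each (b, heads, n, head_dim)
        q, k = self.q_norm(q), self.k_norm(k)  # normalize before attention
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(b, n, d))

x = torch.randn(2, 16, 256)       # (batch, tokens, dim)
y = QKNormAttention(dim=256)(x)   # -> (2, 16, 256)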