Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI

February 24, 2025
Authors: Syed Abdul Gaffar Shakhadri, Kruthika KR, Kartik Basavaraj Angadi
cs.AI

Abstract

We introduce Shakti VLM, a family of vision-language models at 1B and 4B parameter scales, designed to address data-efficiency challenges in multimodal learning. While recent VLMs achieve strong performance through extensive training data, Shakti models leverage architectural innovations to attain competitive results with fewer tokens. Key advancements include QK-Normalization for attention stability, hybrid normalization techniques, and enhanced positional encoding. A three-stage training strategy further optimizes learning efficiency. Evaluations show that Shakti-VLM-1B and Shakti-VLM-4B excel in document understanding, visual reasoning, OCR extraction, and general multimodal reasoning. Our results highlight that high performance can be achieved through model design and training strategy rather than sheer data volume, making Shakti an efficient solution for enterprise-scale multimodal tasks.
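
The abstract names QK-Normalization as a key stabilizer but does not spell out the formulation here. As a minimal illustrative sketch (not the authors' implementation), QK-Norm is commonly applied by L2-normalizing queries and keys along the head dimension before the dot product, which bounds attention logits; the fixed scale factor below stands in for what is often a learnable temperature and is an assumption:

import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, scale: float = 10.0):
    """Attention with QK-Normalization (generic sketch, not the paper's code).

    Queries and keys are L2-normalized along the head dimension before the
    dot product, so logits become bounded cosine similarities; this prevents
    the attention-logit growth that can destabilize training.
    """
    q = F.normalize(q, dim=-1)                   # unit-norm queries
    k = F.normalize(k, dim=-1)                   # unit-norm keys
    logits = (q @ k.transpose(-2, -1)) * scale   # rescaled cosine-similarity logits
    return logits.softmax(dim=-1) @ v

# Toy shapes: (batch, heads, tokens, head_dim)
q = torch.randn(2, 4, 16, 64)
k = torch.randn(2, 4, 16, 64)
v = torch.randn(2, 4, 16, 64)
print(qk_norm_attention(q, k, v).shape)  # torch.Size([2, 4, 16, 64])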
