Kimi-VL Technical Report

April 10, 2025
Authors: Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng, Jiezhong Qiu, Jin Xie, Jinhong Wang, Jingyuan Liu, Junjie Yan, Kun Ouyang, Liang Chen, Lin Sui, Longhui Yu, Mengfan Dong, Mengnan Dong, Nuo Xu, Pengyu Cheng, Qizheng Gu, Runjie Zhou, Shaowei Liu, Sihan Cao, Tao Yu, Tianhui Song, Tongtong Bai, Wei Song, Weiran He, Weixiao Huang, Weixin Xu, Xiaokun Yuan, Xingcheng Yao, Xingzhe Wu, Xinxing Zu, Xinyu Zhou, Xinyuan Wang, Y. Charles, Yan Zhong, Yang Li, Yangyang Hu, Yanru Chen, Yejie Wang, Yibo Liu, Yibo Miao, Yidao Qin, Yimin Chen, Yiping Bao, Yiqin Wang, Yongsheng Kang, Yuanxin Liu, Yulun Du, Yuxin Wu, Yuzhi Wang, Yuzi Yan, Zaida Zhou, Zhaowei Li, Zhejun Jiang, Zheng Zhang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Zijia Zhao, Ziwei Chen
cs.AI

Abstract

We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities, all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, it excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. It also exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it competes effectively with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, and surpasses GPT-4o in several key domains. Kimi-VL also advances long-context processing and fine-grained perception. With a 128K extended context window, it can process diverse long inputs, achieving scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost on common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while retaining its compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.
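For readers who want to try the released checkpoints, the following is a minimal inference sketch, not taken from the report itself. It assumes the weights are published on Hugging Face under an ID such as moonshotai/Kimi-VL-A3B-Instruct and that the model follows the standard transformers remote-code pattern for vision-language models; the image path and prompt are placeholders, and the authoritative usage instructions live in the linked GitHub repository.

# Hedged sketch: model ID, file path, and prompt are illustrative assumptions.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "moonshotai/Kimi-VL-A3B-Instruct"  # assumed checkpoint name; verify on the repo
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # load in the checkpoint's native precision
    device_map="auto",       # place layers on available GPU(s)/CPU
    trust_remote_code=True,  # the model ships custom modeling code
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.png")  # placeholder image
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "example.png"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
# Render the chat template to a prompt string, then tokenize text + image together.
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))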
