Kimi-VL Technical Report

April 10, 2025
Authors: Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng, Jiezhong Qiu, Jin Xie, Jinhong Wang, Jingyuan Liu, Junjie Yan, Kun Ouyang, Liang Chen, Lin Sui, Longhui Yu, Mengfan Dong, Mengnan Dong, Nuo Xu, Pengyu Cheng, Qizheng Gu, Runjie Zhou, Shaowei Liu, Sihan Cao, Tao Yu, Tianhui Song, Tongtong Bai, Wei Song, Weiran He, Weixiao Huang, Weixin Xu, Xiaokun Yuan, Xingcheng Yao, Xingzhe Wu, Xinxing Zu, Xinyu Zhou, Xinyuan Wang, Y. Charles, Yan Zhong, Yang Li, Yangyang Hu, Yanru Chen, Yejie Wang, Yibo Liu, Yibo Miao, Yidao Qin, Yimin Chen, Yiping Bao, Yiqin Wang, Yongsheng Kang, Yuanxin Liu, Yulun Du, Yuxin Wu, Yuzhi Wang, Yuzi Yan, Zaida Zhou, Zhaowei Li, Zhejun Jiang, Zheng Zhang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Zijia Zhao, Ziwei Chen
cs.AI

Abstract

We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities, all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, it excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. It also exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it competes effectively with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances long-context processing and high-resolution perception. With a 128K extended context window, it can process diverse long inputs, achieving scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost on common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while keeping the compact footprint of 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.
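For readers who want to try the released checkpoints, below is a minimal inference sketch using the Hugging Face transformers library. It is an assumption-laden illustration, not the paper's documented API: the model ID (moonshotai/Kimi-VL-A3B-Instruct), the need for trust_remote_code, the chat-message schema, and the image filename are all guesses based on common VLM release conventions; consult the GitHub repository linked above for authoritative usage.

```python
# Hypothetical inference sketch for Kimi-VL; verify details against
# https://github.com/MoonshotAI/Kimi-VL before relying on them.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed Hugging Face model ID; the actual checkpoint name may differ.
MODEL_ID = "moonshotai/Kimi-VL-A3B-Instruct"

# trust_remote_code is assumed because MoE VLMs often ship custom model code.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# One image plus one question, formatted as a single chat turn.
image = Image.open("example.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "example.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Generate a short answer and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```

The same pattern should apply to the Kimi-VL-A3B-Thinking variant by swapping the model ID, though its long chain-of-thought outputs typically warrant a larger max_new_tokens budget.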
