

Qwen2.5-VL Technical Report

February 19, 2025
Authors: Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin
cs.AI

Abstract

We introduce Qwen2.5-VL, the latest flagship model of the Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to accurately localize objects using bounding boxes or points. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.
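The abstract credits the efficiency of the native dynamic-resolution ViT to Window Attention: instead of letting every patch attend to every other patch, attention is restricted to local windows, so cost grows roughly linearly with the number of patches rather than quadratically. The sketch below is an illustrative toy example only, not the paper's implementation; the function name window_attention, the window size of 4, and the patch/grid dimensions are assumptions introduced for this example.

import torch
import torch.nn.functional as F

def window_attention(patches: torch.Tensor, grid_h: int, grid_w: int,
                     window: int = 4) -> torch.Tensor:
    """Self-attention applied independently inside non-overlapping windows.

    patches: (grid_h * grid_w, dim) patch embeddings for one image whose
             native resolution determines grid_h and grid_w (no resizing).
    """
    dim = patches.shape[-1]
    x = patches.view(grid_h, grid_w, dim)
    # Pad so the grid divides evenly into window x window blocks.
    pad_h = (-grid_h) % window
    pad_w = (-grid_w) % window
    x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h))
    H, W = x.shape[0], x.shape[1]
    # Partition into (num_windows, window*window, dim) blocks.
    x = (x.view(H // window, window, W // window, window, dim)
           .permute(0, 2, 1, 3, 4)
           .reshape(-1, window * window, dim))
    # Scaled dot-product attention within each window: total cost is
    # linear in the number of windows, not quadratic in all patches.
    out = F.scaled_dot_product_attention(x, x, x)
    # Undo the window partition and drop the padding.
    out = (out.view(H // window, W // window, window, window, dim)
              .permute(0, 2, 1, 3, 4)
              .reshape(H, W, dim)[:grid_h, :grid_w])
    return out.reshape(grid_h * grid_w, dim)

# Example: a 480x640 image with 14x14 patches gives roughly a 34x45 grid.
tokens = torch.randn(34 * 45, 64)
print(window_attention(tokens, 34, 45).shape)  # torch.Size([1530, 64])

A real implementation would add learned QKV and output projections, multi-head splitting, and typically mixes windowed layers with occasional full-attention layers; all of that is omitted here for brevity.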

