Qwen2.5-VL Technical Report

Abstract

We introduce Qwen2.5-VL, the latest flagship model of the Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to accurately localize objects using bounding boxes or points. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended duration (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models such as GPT-4o and Claude 3.5 Sonnet, and particularly excels in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.
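The claim that Window Attention reduces computational overhead at native resolution can be made concrete with a rough count of query-key pairs. The sketch below is only an illustration under stated assumptions: the patch size of 14 pixels and the window of 8x8 patches are hypothetical defaults chosen for the example and are not given in the abstract, and the function names are made up for this sketch rather than taken from any Qwen code.

```python
import math


def num_patches(height, width, patch=14):
    """Patch grid (rows, cols) for an image at its native resolution.

    `patch` is an assumed patch size in pixels, not a value from the abstract.
    """
    return math.ceil(height / patch), math.ceil(width / patch)


def full_attention_pairs(rows, cols):
    """Query-key pairs when every patch attends to every other patch."""
    n = rows * cols
    return n * n


def window_attention_pairs(rows, cols, window=8):
    """Query-key pairs when patches attend only within non-overlapping windows.

    `window` is an assumed window side length in patches. Windows at the
    right and bottom edges may be smaller than `window` x `window`.
    """
    total = 0
    for r0 in range(0, rows, window):
        for c0 in range(0, cols, window):
            h = min(window, rows - r0)
            w = min(window, cols - c0)
            total += (h * w) ** 2
    return total


if __name__ == "__main__":
    rows, cols = num_patches(224, 224)
    print("grid:", rows, cols)
    print("full attention pairs:", full_attention_pairs(rows, cols))
    print("window attention pairs:", window_attention_pairs(rows, cols))
```

Under these assumptions, full attention grows quadratically with the number of patches, while windowed attention grows roughly linearly once the image is larger than a single window, which is why the gap widens as input resolution increases.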
