

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

September 18, 2024
作者: Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
cs.AI

Abstract

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size, with versions at 2B, 8B, and 72B parameters, and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
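The core idea of Naive Dynamic Resolution is that an image kept at its native resolution is split into a grid of patches, so the number of visual tokens grows with the image size instead of being fixed; M-RoPE then assigns each visual token a multi-component position (temporal, height, width) rather than a single 1-D index. The sketch below illustrates both ideas; the specific `patch_size=14` and `merge_size=2` values and both function names are illustrative assumptions, not the paper's exact implementation.

```python
import math

def visual_token_count(height, width, patch_size=14, merge_size=2):
    """Estimate the number of visual tokens for an image at native resolution.

    Illustrative assumption: the image is cut into patch_size x patch_size
    patches, and merge_size x merge_size neighboring patches are merged
    into one visual token, so token count scales with image area.
    """
    # Round each dimension up to a whole number of merged patch groups.
    grid_h = math.ceil(height / (patch_size * merge_size))
    grid_w = math.ceil(width / (patch_size * merge_size))
    return grid_h * grid_w

def mrope_position_ids(t, grid_h, grid_w):
    """Illustrative M-RoPE-style position ids: each visual token in video
    frame t gets a (temporal, height, width) triple instead of one index."""
    return [(t, y, x) for y in range(grid_h) for x in range(grid_w)]

# A small image yields few tokens; a larger one yields proportionally more.
print(visual_token_count(224, 224))    # -> 64
print(visual_token_count(1344, 896))   # -> 1536
print(mrope_position_ids(0, 2, 2))     # -> [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
```

With these assumed parameters each merged token covers a 28x28 pixel region, which is why doubling both image dimensions quadruples the token count.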

