Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

April 21, 2025
Authors: Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, Tyler Poon, Max Ehrlich, Tong Lu, Limin Wang, Bryan Catanzaro, Jan Kautz, Andrew Tao, Zhiding Yu, Guilin Liu
cs.AI

Abstract

We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous pipeline efficiency optimizations for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model, Eagle 2.5-8B, achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial models such as GPT-4o and large-scale open-source models such as Qwen2.5-VL-72B and InternVL2.5-78B.
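
The abstract names Automatic Degrade Sampling and Image Area Preservation but does not describe their mechanics. As a rough illustration only, the minimal Python sketch below shows one plausible reading: a sampler that uniformly reduces video-frame density so the total visual-token count fits a fixed context budget while still spanning the whole video, and a tiler that picks a grid whose total area and aspect ratio stay close to the original image. Every name, parameter, and heuristic here (degrade_sample, preserve_area_tiling, the 448-pixel tile size, the error weighting) is an assumption for illustration, not the paper's actual implementation.

# Illustrative sketch only -- the Eagle 2.5 abstract names these techniques
# but does not specify how they work; all names and constants below are
# assumptions, not the paper's method.

def degrade_sample(num_frames, tokens_per_frame, context_budget):
    """Pick frame indices spanning the whole video so that the total
    visual-token count fits a fixed context budget (assumed reading of
    "Automatic Degrade Sampling": lower the sampling density instead of
    truncating the video)."""
    max_frames = max(1, context_budget // tokens_per_frame)
    if num_frames <= max_frames:
        return list(range(num_frames))
    if max_frames == 1:
        return [0]
    # Uniform subsampling that keeps the first and last frames, preserving
    # temporal coverage (contextual integrity) at reduced density.
    step = (num_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]

def preserve_area_tiling(width, height, tile=448, max_tiles=12):
    """Choose a (rows, cols) tiling grid whose total area and aspect ratio
    stay close to the original image (one plausible reading of "Image Area
    Preservation"; the tile size and error weighting are invented here)."""
    best, best_err = (1, 1), float("inf")
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles // rows + 1):
            # Penalize both relative area loss and aspect-ratio distortion.
            area_err = abs(rows * cols * tile * tile - width * height) / (width * height)
            ratio_err = abs(cols / rows - width / height)
            if area_err + ratio_err < best_err:
                best, best_err = (rows, cols), area_err + ratio_err
    return best

if __name__ == "__main__":
    # A 10,000-frame video at 256 tokens/frame under a 128K-token budget
    # keeps 500 evenly spaced frames rather than only the first 500.
    idx = degrade_sample(10_000, 256, 128_000)
    print(len(idx), idx[:3], idx[-1])        # 500 [0, 20, 40] 9999
    print(preserve_area_tiling(1920, 1080))  # (rows, cols) balancing area and ratio

The common design intent of both heuristics is the property the abstract emphasizes: preserve contextual integrity (the full temporal span of a video) and visual detail (the full area of an image) rather than naively truncating or downscaling inputs to fit the context window; the paper's real implementation may differ substantially.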
