VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
April 3, 2025
Authors: Xianwei Zhuang, Yuxin Xie, Yufan Deng, Dongchao Yang, Liming Liang, Jinghan Ru, Yuguo Yin, Yuexian Zou
cs.AI
Abstract
In this work, we present VARGPT-v1.1, an advanced unified visual
autoregressive model that builds upon our previous framework VARGPT. The model
preserves the dual paradigm of next-token prediction for visual understanding
and next-scale generation for image synthesis. Specifically, VARGPT-v1.1
integrates: (1) a novel training strategy combining iterative visual
instruction tuning with reinforcement learning through Direct Preference
Optimization (DPO), (2) an expanded training corpus containing 8.3M
visual-generative instruction pairs, (3) an upgraded language model backbone
using Qwen2, (4) enhanced image generation resolution, and (5) emergent image
editing capabilities without architectural modifications. These advancements
enable VARGPT-v1.1 to achieve state-of-the-art performance in multimodal
understanding and text-to-image instruction-following tasks, demonstrating
significant improvements in both comprehension and generation metrics. Notably,
through visual instruction tuning, the model acquires image editing
functionality while maintaining architectural consistency with its predecessor,
revealing the potential for unified visual understanding, generation, and
editing. Our findings suggest that well-designed unified visual autoregressive
models can effectively adopt flexible training strategies from large language
models (LLMs), exhibiting promising scalability. The codebase and model weights
are publicly available at https://github.com/VARGPT-family/VARGPT-v1.1.
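The abstract names Direct Preference Optimization (DPO) as the reinforcement-learning stage of the training strategy but does not spell out the objective. Below is a minimal sketch of the standard DPO loss as it is commonly implemented; how VARGPT-v1.1 constructs preference pairs over generated image token sequences and the `beta` temperature are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the standard DPO objective (Rafailov et al., 2023), the
# reinforcement-learning component named in the abstract. The preference-pair
# construction and beta value here are assumptions for illustration only.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each tensor holds the summed log-probability of a complete response
    (e.g. a generated image token sequence) under the trainable policy
    or the frozen reference model.
    """
    # Implicit rewards: scaled log-ratios of policy vs. reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss that pushes the chosen reward above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In this formulation the model being tuned only needs log-probabilities from itself and a frozen reference copy, which is why DPO slots naturally into an autoregressive generation pipeline without a separate reward model.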