VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
April 3, 2025
Authors: Xianwei Zhuang, Yuxin Xie, Yufan Deng, Dongchao Yang, Liming Liang, Jinghan Ru, Yuguo Yin, Yuexian Zou
cs.AI
Abstract
In this work, we present VARGPT-v1.1, an advanced unified visual
autoregressive model that builds upon our previous framework VARGPT. The model
preserves the dual paradigm of next-token prediction for visual understanding
and next-scale generation for image synthesis. Specifically, VARGPT-v1.1
integrates: (1) a novel training strategy combining iterative visual
instruction tuning with reinforcement learning through Direct Preference
Optimization (DPO), (2) an expanded training corpus containing 8.3M
visual-generative instruction pairs, (3) an upgraded language model backbone
using Qwen2, (4) enhanced image generation resolution, and (5) emergent image
editing capabilities without architectural modifications. These advancements
enable VARGPT-v1.1 to achieve state-of-the-art performance in multimodal
understanding and text-to-image instruction-following tasks, demonstrating
significant improvements in both comprehension and generation metrics. Notably,
through visual instruction tuning, the model acquires image editing
functionality while maintaining architectural consistency with its predecessor,
revealing the potential for unified visual understanding, generation, and
editing. Our findings suggest that well-designed unified visual autoregressive
models can effectively adopt flexible training strategies from large language
models (LLMs), exhibiting promising scalability. The codebase and model weights
are publicly available at https://github.com/VARGPT-family/VARGPT-v1.1.
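To make the "next-scale generation" half of the dual paradigm concrete, the toy sketch below generates an image as a coarse-to-fine sequence of token maps, each predicted in parallel and conditioned on all previously generated scales. It is a minimal illustration of the general idea under assumed toy components (ToyNextScaleGenerator, its per-scale query embeddings, and the tiny transformer backbone are all hypothetical), not the VARGPT-v1.1 architecture.

```python
# Conceptual sketch of "next-scale" autoregressive image generation: instead of
# emitting one token at a time, each step predicts an entire token map at the
# next (coarse-to-fine) resolution, conditioned on all previously generated
# scales. Toy illustration only; every module and name here is hypothetical.
import torch
import torch.nn as nn


class ToyNextScaleGenerator(nn.Module):
    def __init__(self, vocab_size: int = 4096, dim: int = 256,
                 scales: tuple = (1, 2, 3, 4)):
        super().__init__()
        self.scales = scales
        self.embed = nn.Embedding(vocab_size, dim)
        # One set of learned query embeddings per scale (s * s positions each).
        self.queries = nn.ParameterList(
            [nn.Parameter(torch.randn(s * s, dim)) for s in scales])
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)

    @torch.no_grad()
    def generate(self, cond: torch.Tensor) -> list:
        """cond: (batch, n_cond, dim) conditioning embeddings, e.g. from text."""
        batch = cond.size(0)
        seq, token_maps = cond, []
        for i, s in enumerate(self.scales):
            q = self.queries[i].unsqueeze(0).expand(batch, -1, -1)
            h = self.backbone(torch.cat([seq, q], dim=1))
            logits = self.head(h[:, -s * s:])          # (batch, s*s, vocab)
            tokens = torch.multinomial(
                logits.softmax(-1).reshape(-1, logits.size(-1)), 1
            ).view(batch, s * s)
            token_maps.append(tokens.view(batch, s, s))
            # Feed the newly generated scale back in for the next, finer scale.
            seq = torch.cat([seq, self.embed(tokens)], dim=1)
        return token_maps  # coarse-to-fine maps a VQ decoder would render


if __name__ == "__main__":
    model = ToyNextScaleGenerator()
    maps = model.generate(torch.randn(2, 1, 256))
    print([m.shape for m in maps])
```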
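The reinforcement-learning stage described in the abstract uses Direct Preference Optimization. As a reference point, the following sketch computes the standard DPO objective from per-sequence log-probabilities of preferred and dispreferred responses under the policy and a frozen reference model; it illustrates the general technique only, and the function and argument names are assumptions rather than the authors' training code.

```python
# Minimal sketch of the standard DPO objective, shown only to illustrate the
# preference-optimization step named in the abstract; it is not VARGPT-v1.1's
# actual training code, and all names here are hypothetical.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Compute the DPO loss from summed per-sequence log-probabilities.

    Each argument has shape (batch,): the log-probability the policy (or the
    frozen reference) model assigns to the preferred ("chosen") or
    dispreferred ("rejected") response for each prompt.
    """
    # Implicit rewards: log-ratio of policy to reference, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry style preference loss on the reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Toy usage with random log-probabilities for a batch of 4 preference pairs.
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b),
                    torch.randn(b), torch.randn(b))
    print(loss.item())
```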