SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL

April 15, 2025
Authors: Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang
cs.AI

Abstract

This work presents SimpleAR, a vanilla autoregressive visual generation framework without complex architecture modifications. Through careful exploration of training and inference optimization, we demonstrate that: 1) with only 0.5B parameters, our model can generate 1024x1024 resolution images with high fidelity and achieve competitive results on challenging text-to-image benchmarks, e.g., 0.59 on GenEval and 79.66 on DPG; 2) both supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) training can lead to significant improvements in generation aesthetics and prompt alignment; and 3) when optimized with inference acceleration techniques such as vLLM, the time for SimpleAR to generate a 1024x1024 image can be reduced to around 14 seconds. By sharing these findings and open-sourcing the code, we hope to reveal the potential of autoregressive visual generation and encourage more participation in this research field. Code is available at https://github.com/wdrink/SimpleAR.
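The abstract names GRPO but does not describe how it is applied. As a rough illustration only, the sketch below shows the group-relative advantage step as GRPO is commonly described: several images are sampled for the same prompt, each receives a scalar reward, and each reward is normalized against the group's mean and standard deviation. The function name, the reward values, and the use of the population standard deviation are assumptions made for this example; none of it is taken from the SimpleAR code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Hypothetical helper for illustration; not part of the SimpleAR repository.
    `rewards` holds one scalar reward per sampled image. Using the population
    standard deviation is an assumption; implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for three images sampled from one prompt
advantages = group_relative_advantages([0.2, 0.5, 0.8])
```

Samples that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to provide a baseline. How SimpleAR defines the reward is not stated in the abstract.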
