SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL

April 15, 2025
Authors: Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang
cs.AI

Abstract

This work presents SimpleAR, a vanilla autoregressive visual generation framework without complex architecture modifications. Through careful exploration of training and inference optimization, we demonstrate that: 1) with only 0.5B parameters, our model can generate 1024x1024 resolution images with high fidelity and achieve competitive results on challenging text-to-image benchmarks, e.g., 0.59 on GenEval and 79.66 on DPG; 2) both supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) training lead to significant improvements in generation aesthetics and prompt alignment; and 3) when optimized with inference acceleration techniques like vLLM, the time for SimpleAR to generate a 1024x1024 image can be reduced to around 14 seconds. By sharing these findings and open-sourcing the code, we hope to reveal the potential of autoregressive visual generation and encourage more participation in this research field. Code is available at https://github.com/wdrink/SimpleAR.
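
The abstract names GRPO without describing it. As background only, the core idea of the general GRPO technique is to score each sample relative to the other samples generated for the same prompt. The sketch below illustrates that group-relative normalization; it is not taken from the SimpleAR code. The function name, the `eps` constant, the use of the population standard deviation, and the example reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is the reward minus the group mean, divided by the group
    standard deviation. `eps` (an assumed constant) avoids division by zero
    when every sample in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, not from the paper
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical reward scores for four images generated from one prompt.
    example_rewards = [0.2, 0.5, 0.9, 0.4]
    for reward, adv in zip(example_rewards, group_relative_advantages(example_rewards)):
        print(f"reward={reward:.2f} advantage={adv:+.3f}")
```

Samples that score above the group mean get a positive advantage and are reinforced, while those below the mean are discouraged. Because the baseline comes from the group itself, no separate value network is needed.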
