Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

January 23, 2025
Authors: Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng
cs.AI

Abstract

Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it remains an open question whether such strategies can be applied to verify and reinforce image generation. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct unsatisfactory generated images. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT.
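
The abstract names two of its building blocks, test-time verification and DPO, without spelling out the mechanics. The sketch below illustrates a generic form of each, as an assumption rather than the authors' implementation: best-of-N selection, where a reward model scores several candidates and the top one is kept, and the standard DPO loss on a single preference pair. The function names `best_of_n` and `dpo_loss`, the `reward` callable, and the default `beta` are all hypothetical. PARM and PARM++ are not modeled here.

```python
import math
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], reward: Callable[[T], float]) -> T:
    """Keep the candidate that a (hypothetical) reward model scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the abstract does not give one
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs.

    loss = -log(sigmoid(beta * margin)), where margin is the policy's
    log-ratio advantage of chosen over rejected, relative to the reference.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

In such a setup, `candidates` would be several images sampled for the same prompt and `reward` a learned scorer. The loss shrinks as the policy prefers the chosen sample more strongly than the reference model does.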
