

ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation

March 28, 2025
作者: Yunhong Min, Daehyeon Choi, Kyeongmin Yeo, Jihyun Lee, Minhyuk Sung
cs.AI

Abstract

We introduce ORIGEN, the first zero-shot method for 3D orientation grounding in text-to-image generation across multiple objects and diverse categories. While previous work on spatial grounding in image generation has mainly focused on 2D positioning, it lacks control over 3D orientation. To address this, we propose a reward-guided sampling approach using a pretrained discriminative model for 3D orientation estimation and a one-step text-to-image generative flow model. While gradient-ascent-based optimization is a natural choice for reward-based guidance, it struggles to maintain image realism. Instead, we adopt a sampling-based approach using Langevin dynamics, which extends gradient ascent by simply injecting random noise--requiring just a single additional line of code. Additionally, we introduce adaptive time rescaling based on the reward function to accelerate convergence. Our experiments show that ORIGEN outperforms both training-based and test-time guidance methods across quantitative metrics and user studies.
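The abstract notes that Langevin dynamics extends reward-guided gradient ascent by "simply injecting random noise--requiring just a single additional line of code." A minimal sketch of that idea is below. This is not the ORIGEN implementation: the function name `langevin_step`, the use of an explicit gradient callback, and the step size are all illustrative assumptions; the actual method operates on the latents of a one-step text-to-image flow model with a pretrained orientation-estimation reward.

```python
import numpy as np

def langevin_step(z, grad_reward, step_size, rng):
    """One reward-guided update on a latent z (illustrative sketch).

    Plain gradient ascent would stop after the first update below;
    Langevin dynamics adds the Gaussian-noise line, which keeps the
    samples spread out instead of collapsing to a reward maximum.
    """
    # Gradient ascent on the reward R(z).
    z = z + step_size * grad_reward(z)
    # The "one additional line": inject noise scaled by sqrt(2 * step size).
    z = z + np.sqrt(2.0 * step_size) * rng.standard_normal(z.shape)
    return z
```

Iterating this update draws samples that concentrate on high-reward latents while retaining stochasticity, which is why the paper reports better image realism than pure gradient-ascent optimization.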

