AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation

Abstract

Recently, large-scale generative models have demonstrated outstanding text-to-image generation capabilities. However, generating high-fidelity personalized images of specific subjects remains challenging, especially when multiple subjects are involved. In this paper, we propose AnyStory, a unified approach for personalized subject generation. AnyStory achieves high-fidelity personalization for both single and multiple subjects without sacrificing subject fidelity. Specifically, AnyStory models the subject personalization problem in an "encode-then-route" manner. In the encoding step, AnyStory uses a universal and powerful image encoder, ReferenceNet, in conjunction with a CLIP vision encoder to achieve high-fidelity encoding of subject features. In the routing step, AnyStory uses a decoupled instance-aware subject router to accurately perceive and predict the potential location of each subject in the latent space and to guide the injection of subject conditions. Detailed experimental results demonstrate the excellent performance of our method in retaining subject details, aligning with text descriptions, and personalizing for multiple subjects. Project page: https://aigcdesigngroup.github.io/AnyStory/
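
The "encode-then-route" idea can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the router's architecture, so the dot-product softmax router, the `background_logit` parameter, the function names `route` and `inject`, and the hand-written vectors standing in for ReferenceNet and CLIP features are all assumptions made purely for illustration. The sketch only shows the general pattern of first assigning each latent position to a subject (or to the background) and then injecting subject features in proportion to that assignment.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def route(latents, subject_tokens, background_logit=0.0):
    """Hypothetical router: for every latent position, return a soft
    assignment over [background, subject_0, subject_1, ...]."""
    routing = []
    for z in latents:
        scores = [background_logit] + [dot(z, s) for s in subject_tokens]
        routing.append(softmax(scores))
    return routing


def inject(latents, subject_features, routing):
    """Hypothetical injection: add each subject's feature vector to a latent
    position, scaled by that position's routing weight for the subject."""
    out = []
    for z, weights in zip(latents, routing):
        updated = list(z)
        for k, feat in enumerate(subject_features):
            w = weights[k + 1]  # index 0 is the background slot
            updated = [u + w * f for u, f in zip(updated, feat)]
        out.append(updated)
    return out


# Toy data (assumed): three latent positions, two subjects, 2-D features.
latents = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
subject_tokens = [[1.0, 0.0], [0.0, 1.0]]       # stand-in for router queries
subject_features = [[10.0, 0.0], [0.0, 10.0]]   # stand-in for encoded subjects

routing = route(latents, subject_tokens, background_logit=1.0)
conditioned = inject(latents, subject_features, routing)

for position, weights in enumerate(routing):
    print(position, [round(w, 3) for w in weights])
```

In this toy setup the first latent position is routed mostly to the first subject, the second position to the second subject, and the third position to the background, so each subject's features mainly affect its own region. A real system would learn the router and operate on diffusion latents rather than fixed vectors.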
