TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
January 21, 2025
Authors: Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, Tali Dekel
cs.AI
Abstract
We present TokenVerse -- a method for multi-concept personalization,
leveraging a pre-trained text-to-image diffusion model. Our framework can
disentangle complex visual elements and attributes from as little as a single
image, while enabling seamless plug-and-play generation of combinations of
concepts extracted from multiple images. In contrast to existing works,
TokenVerse can handle multiple images with multiple concepts each, and supports
a wide range of concepts, including objects, accessories, materials, pose, and
lighting. Our work exploits a DiT-based text-to-image model, in which the input
text affects the generation through both attention and modulation (shift and
scale). We observe that the modulation space is semantic and enables localized
control over complex concepts. Building on this insight, we devise an
optimization-based framework that takes as input an image and a text
description, and finds for each word a distinct direction in the modulation
space. These directions can then be used to generate new images that combine
the learned concepts in a desired configuration. We demonstrate the
effectiveness of TokenVerse in challenging personalization settings, and
showcase its advantages over existing methods. The project's webpage is at
https://token-verse.github.io/
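The abstract notes that in DiT-based text-to-image models, the input text affects generation through both attention and modulation (shift and scale). Below is a minimal sketch of that adaLN-style modulation mechanism, assuming a pooled conditioning vector `cond`; the class and variable names are illustrative, not taken from the paper's code.

```python
# A minimal sketch (not the authors' code) of adaLN-style modulation in a
# DiT block: the conditioning vector is mapped to a per-channel shift and
# scale that modulate normalized activations.
import torch
import torch.nn as nn

class ModulatedBlock(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Projects the conditioning (e.g. a pooled text embedding) to shift/scale.
        self.to_mod = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); cond: (batch, cond_dim)
        shift, scale = self.to_mod(cond).chunk(2, dim=-1)
        # Each (shift, scale) pair is one point in the modulation space.
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

block = ModulatedBlock(dim=32, cond_dim=64)
out = block(torch.randn(2, 16, 32), torch.randn(2, 64))
```

The (shift, scale) pairs produced across blocks constitute the modulation space in which, per the abstract, TokenVerse's learned per-word directions live.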
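The abstract describes the optimization-based framework only at a high level: given an image and a caption, find one direction in modulation space per word. The toy sketch below illustrates that idea with a standard denoising objective; `ToyDenoiser`, `word_conds`, the mean-pooling step, and all hyperparameters are hypothetical stand-ins, not TokenVerse's actual implementation.

```python
# Schematic sketch of learning one modulation-space direction per caption
# word. Everything model-specific here is a hypothetical stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F

COND_DIM = 64

class ToyDenoiser(nn.Module):
    """Stand-in for a pre-trained DiT: predicts noise from a noisy latent and a condition."""
    def __init__(self, dim: int = 32, cond_dim: int = COND_DIM):
        super().__init__()
        self.net = nn.Linear(dim + cond_dim, dim)

    def forward(self, x, cond):
        return self.net(torch.cat([x, cond], dim=-1))

def learn_directions(model, latent, word_conds, steps=200, lr=1e-2):
    # One learnable direction in modulation space per word of the caption.
    dirs = {w: torch.zeros(COND_DIM, requires_grad=True) for w in word_conds}
    opt = torch.optim.Adam(list(dirs.values()), lr=lr)
    for _ in range(steps):
        noise = torch.randn_like(latent)
        noisy = latent + noise  # toy stand-in for forward diffusion
        # Offset each word's modulation vector by its learned direction,
        # then pool to a single conditioning vector (an illustrative choice).
        cond = torch.stack([word_conds[w] + d for w, d in dirs.items()]).mean(0)
        pred = model(noisy, cond.expand(latent.shape[0], -1))
        loss = F.mse_loss(pred, noise)  # standard denoising objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return dirs

# Hypothetical usage: the frozen toy model stands in for the pre-trained
# diffusion model; the learned directions could later be added to the
# modulation vectors of new prompts to compose concepts from different images.
model = ToyDenoiser().requires_grad_(False)
latent = torch.randn(4, 32)
word_conds = {w: torch.randn(COND_DIM) for w in ["dog", "lighting"]}
directions = learn_directions(model, latent, word_conds)
```

Because each direction is tied to a single word, the abstract's plug-and-play composition reduces to adding the selected directions to the corresponding words' modulation vectors when generating a new image.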