Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models
January 12, 2025
Authors: Michael Toker, Ido Galil, Hadas Orgad, Rinon Gal, Yoad Tewel, Gal Chechik, Yonatan Belinkov
cs.AI
Abstract
Text-to-image (T2I) diffusion models rely on encoded prompts to guide the
image generation process. Typically, these prompts are extended to a fixed
length by adding padding tokens before text encoding. Although this is the
default practice, the influence of padding tokens on the image generation
process has not been investigated. In this work, we conduct the first in-depth
analysis of
the role padding tokens play in T2I models. We develop two causal techniques to
analyze how information is encoded in the representation of tokens across
different components of the T2I pipeline. Using these techniques, we
investigate when and how padding tokens impact the image generation process.
Our findings reveal three distinct scenarios: padding tokens may affect the
model's output during text encoding, influence the diffusion process, or be
effectively ignored. Moreover, we identify key relationships between these
scenarios and the model's architecture (cross- or self-attention) and its
training process (frozen or trained text encoder). These insights contribute to
a deeper understanding of the mechanisms of padding tokens, potentially
informing future model design and training practices in T2I systems.
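The abstract does not spell out the causal techniques, but the kind of intervention it describes can be illustrated with a short sketch. The snippet below is a minimal illustration using the Hugging Face diffusers library, not the authors' exact procedure: it encodes a prompt with the text encoder, overwrites the padding-token representations with those from an empty prompt, and compares the resulting generations under a fixed seed. The model id, prompt, and the choice of empty-prompt patching are all illustrative assumptions.

```python
# A minimal sketch of a causal intervention on padding-token representations
# in a T2I pipeline. Assumptions (not from the paper): the model id, the
# prompt, and patching padding slots with empty-prompt embeddings.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse"
tok = pipe.tokenizer
inputs = tok(
    prompt,
    padding="max_length",
    max_length=tok.model_max_length,  # 77 for CLIP
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    # Last hidden states for all 77 positions, padding included.
    embeds = pipe.text_encoder(inputs.input_ids.to("cuda"))[0]
    # Reference embeddings from an empty prompt, used to overwrite padding.
    null_ids = tok(
        "", padding="max_length", max_length=tok.model_max_length,
        return_tensors="pt",
    ).input_ids.to("cuda")
    null_embeds = pipe.text_encoder(null_ids)[0]

# Positions the tokenizer marked as padding (attention_mask == 0).
pad_positions = (inputs.attention_mask[0] == 0).to("cuda")
patched = embeds.clone()
patched[0, pad_positions] = null_embeds[0, pad_positions]

# Generate with identical seeds so any difference is attributable
# to the intervention on the padding positions.
generator = torch.Generator("cuda").manual_seed(0)
baseline = pipe(prompt_embeds=embeds, generator=generator).images[0]
generator = torch.Generator("cuda").manual_seed(0)
intervened = pipe(prompt_embeds=patched, generator=generator).images[0]

baseline.save("baseline.png")
intervened.save("padding_intervened.png")
# If the two images differ noticeably, the diffusion model is reading
# information from the padding-token representations; if they match,
# padding is effectively ignored downstream of the text encoder.
```

Fixing the random seed across both calls is the key design choice here: it makes the padding-token patch the only varying factor, so any visible change in the output image can be attributed to information carried in the padding positions.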