Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models

January 12, 2025
Authors: Michael Toker, Ido Galil, Hadas Orgad, Rinon Gal, Yoad Tewel, Gal Chechik, Yonatan Belinkov
cs.AI

Abstract

Text-to-image (T2I) diffusion models rely on encoded prompts to guide the image generation process. Typically, these prompts are extended to a fixed length by adding padding tokens before text encoding. Despite being a default practice, the influence of padding tokens on the image generation process has not been investigated. In this work, we conduct the first in-depth analysis of the role padding tokens play in T2I models. We develop two causal techniques to analyze how information is encoded in the representation of tokens across different components of the T2I pipeline. Using these techniques, we investigate when and how padding tokens impact the image generation process. Our findings reveal three distinct scenarios: padding tokens may affect the model's output during text encoding, during the diffusion process, or be effectively ignored. Moreover, we identify key relationships between these scenarios and the model's architecture (cross or self-attention) and its training process (frozen or trained text encoder). These insights contribute to a deeper understanding of the mechanisms of padding tokens, potentially informing future model design and training practices in T2I systems.
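The default padding practice the abstract refers to, and the general shape of a representation-level causal intervention, can be sketched in a few lines. The snippet below is a minimal illustration assuming a Stable-Diffusion-style pipeline with a CLIP text encoder; the model name and the patching scheme are assumptions chosen for demonstration, not the authors' exact techniques:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Assumed encoder for illustration; Stable Diffusion v1.x uses this checkpoint.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def encode(prompt: str):
    # Prompts are padded to a fixed length (77 for CLIP) BEFORE text encoding,
    # which is the default practice the paper analyzes.
    ids = tokenizer(prompt, padding="max_length", max_length=77,
                    truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**ids).last_hidden_state  # shape: (1, 77, 768)
    return ids, out

ids, emb = encode("a photo of a red fox in the snow")
_, clean_emb = encode("")  # baseline encoding carrying no prompt content

# Padding positions are where the attention mask is zero.
pad_pos = ids["attention_mask"][0] == 0

# Hypothetical intervention: overwrite the prompt's padding-token
# representations with baseline ones, leaving content tokens untouched.
patched = emb.clone()
patched[0, pad_pos] = clean_emb[0, pad_pos]

# Feeding `emb` vs. `patched` to the same diffusion model and comparing the
# generated images probes whether padding representations causally affect
# the output, and at which stage of the pipeline.
```

If the two encodings yield indistinguishable images, the padding representations are effectively ignored; if they diverge, padding tokens carry prompt information, acquired either during text encoding or during the diffusion process, which is the distinction the paper's two causal techniques are designed to separate.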
