RepText: Rendering Visual Text via Replicating

Abstract

Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially in non-Latin alphabets, remains constrained. To address these limitations, we start from a naive assumption that text understanding is only a sufficient condition for text rendering, not a necessary one. Based on this, we present RepText, which aims to empower pre-trained monolingual text-to-image generation models with the ability to accurately render, or more precisely, replicate, multilingual visual text in user-specified fonts, without needing to actually understand it. Specifically, we adopt the ControlNet setting and additionally integrate the language-agnostic glyph and position of the rendered text to enable harmonized visual text generation, allowing users to customize text content, font, and position according to their needs. To improve accuracy, a text perceptual loss is employed along with the diffusion loss. Furthermore, to stabilize the rendering process, at inference time we initialize directly from a noisy glyph latent instead of random initialization, and adopt region masks that restrict feature injection to the text region to avoid distorting the background. We conducted extensive experiments to verify the effectiveness of RepText relative to existing work: our approach outperforms existing open-source methods and achieves results comparable to native multilingual closed-source models. For fairness, we also discuss its limitations in detail at the end.
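
The two inference-time ideas mentioned in the abstract, starting from a noisy glyph latent and restricting feature injection with a region mask, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the linear mixing of glyph latent and Gaussian noise, the `glyph_scale` and `noise_scale` weights, and the additive form of the injection are all assumptions made for illustration, and plain nested lists stand in for real latent tensors.

```python
import random


def noisy_glyph_init(glyph_latent, noise_scale, glyph_scale, seed=0):
    """Build a starting latent from the glyph latent plus Gaussian noise.

    Assumption: a simple linear mix. The weights are hypothetical and are
    not taken from the paper.
    """
    rng = random.Random(seed)
    return [
        [glyph_scale * g + noise_scale * rng.gauss(0.0, 1.0) for g in row]
        for row in glyph_latent
    ]


def masked_injection(base_features, control_residual, text_mask):
    """Add control features only where text_mask is 1.

    Positions where the mask is 0 (the background) keep their base value.
    """
    return [
        [b + m * r for b, r, m in zip(b_row, r_row, m_row)]
        for b_row, r_row, m_row in zip(base_features, control_residual, text_mask)
    ]


# Toy 2x3 feature map: the text region is the middle column only.
base = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
residual = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
mask = [[0, 1, 0], [0, 1, 0]]
injected = masked_injection(base, residual, mask)

# Starting latent for the same toy grid, seeded for reproducibility.
start = noisy_glyph_init(base, noise_scale=0.8, glyph_scale=0.2, seed=42)
```

In this sketch only the middle column of `injected` changes, which mirrors the stated goal of leaving the background undistorted; a real system would apply the same masking to ControlNet feature tensors at each denoising step.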
