WAFFLE: Multi-Modal Model for Automated Front-End Development

Abstract

Web development involves turning UI designs into functional webpages, which can be difficult for both beginners and experienced developers because of the complexity of HTML's hierarchical structure and styles. While Large Language Models (LLMs) have shown promise in generating source code, two major challenges persist in UI-to-HTML code generation: (1) effectively representing HTML's hierarchical structure for LLMs, and (2) bridging the gap between the visual nature of UI designs and the text-based format of HTML code. To tackle these challenges, we introduce Waffle, a new fine-tuning strategy that uses a structure-aware attention mechanism to improve LLMs' understanding of HTML's structure and a contrastive fine-tuning approach to align LLMs' understanding of UI images and HTML code. On our new benchmark WebSight-Test and the existing benchmark Design2Code, models fine-tuned with Waffle outperform current fine-tuning methods, with up to 9.00 percentage points (pp) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP, and 27.12 pp higher LLEM.
