WAFFLE: Multi-Modal Model for Automated Front-End Development

Abstract

Web development involves turning UI designs into functional webpages, which can be difficult for both beginners and experienced developers because of the complexity of HTML's hierarchical structure and styles. While Large Language Models (LLMs) have shown promise in generating source code, two major challenges persist in UI-to-HTML code generation: (1) effectively representing HTML's hierarchical structure for LLMs, and (2) bridging the gap between the visual nature of UI designs and the text-based format of HTML code. To tackle these challenges, we introduce Waffle, a new fine-tuning strategy that uses a structure-aware attention mechanism to improve LLMs' understanding of HTML's structure and a contrastive fine-tuning approach to align LLMs' understanding of UI images and HTML code. Models fine-tuned with Waffle show up to 9.00 pp (percentage points) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP, and 27.12 pp higher LLEM on our new benchmark, WebSight-Test, and the existing Design2Code benchmark, outperforming current fine-tuning methods.
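The abstract does not spell out the training objective. As a rough illustration of what contrastive alignment between UI images and HTML code can look like in general, a generic InfoNCE-style loss might be sketched as below. The function names, the toy 2-D embeddings, and the temperature value are assumptions made for illustration; this is not Waffle's actual formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(image_embs, html_embs, temperature=0.1):
    """Generic InfoNCE-style loss (illustrative, not the paper's objective).

    image_embs[i] and html_embs[i] are assumed to describe the same webpage;
    every html_embs[j] with j != i acts as a negative for image i.
    """
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        # Similarity of image i to every HTML embedding in the batch.
        logits = [
            cosine(image_embs[i], html_embs[j]) / temperature for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the matching HTML.
        total += log_denominator - logits[i]
    return total / n


# Hypothetical toy embeddings: two UI images and their HTML counterparts.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_html = [[1.0, 0.0], [0.0, 1.0]]
swapped_html = [[0.0, 1.0], [1.0, 0.0]]

aligned = contrastive_loss(images, aligned_html)
swapped = contrastive_loss(images, swapped_html)
```

In this toy setup the loss is lower when each image embedding lies close to its own HTML embedding than when the pairs are swapped, which is the behavior a contrastive objective rewards.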
