UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Abstract

This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance on 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, on the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively). On AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception, which leverages a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, and milestone recognition; and (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.
