

VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model

February 26, 2025
作者: Jiani Zheng, Lu Wang, Fangkai Yang, Chaoyun Zhang, Lingrui Mei, Wenjie Yin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
cs.AI

Abstract

Training Vision-Language Models (VLMs) for Graphical User Interfaces (GUI) agents via Reinforcement Learning (RL) faces critical challenges: environment-based RL requires costly interactions, while environment-free methods struggle with distribution shift and reward generalization. We propose an environment-free RL framework that decouples value estimation from policy optimization by leveraging a pretrained Value Environment Model (VEM). VEM predicts state-action values directly from offline data, distilling human-like priors about GUI interaction outcomes without requiring next-state prediction or environmental feedback. This avoids compounding errors and enhances resilience to UI changes by focusing on semantic reasoning (e.g., Does this action advance the user's goal?). The framework operates in two stages: (1) pretraining VEM to estimate long-term action utilities and (2) guiding policy exploration with frozen VEM signals, enabling layout-agnostic GUI automation. Evaluated on Android-in-the-Wild benchmarks, VEM achieves state-of-the-art performance in both offline and online settings, outperforming environment-free baselines significantly and matching environment-based approaches without interaction costs. Importantly, VEM demonstrates that semantic-aware value estimation can achieve comparable performance with online-trained methods.
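The two-stage recipe in the abstract can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation (which uses a pretrained VLM as the value model): here a toy tabular value model stands in for VEM, fit from offline (state, action, return) triples in stage 1, then frozen and used in stage 2 to re-rank candidate GUI actions without any environment interaction. All function and action names (`train_vem`, `improve_policy`, `tap_search`, etc.) are illustrative assumptions.

```python
from collections import defaultdict

def train_vem(offline_data):
    """Stage 1 (sketch): fit a tabular state-action value model from
    offline data by Monte-Carlo averaging of observed returns.
    A stand-in for the pretrained VLM-based Value Environment Model."""
    sums, counts = defaultdict(float), defaultdict(int)
    for state, action, ret in offline_data:
        sums[(state, action)] += ret
        counts[(state, action)] += 1
    return {key: sums[key] / counts[key] for key in sums}

def improve_policy(vem, policy, states, candidate_actions):
    """Stage 2 (sketch): use the frozen VEM to score candidate actions
    per state and pick the highest-value one -- value estimation is
    decoupled from policy optimization, and no environment is queried."""
    new_policy = dict(policy)
    for s in states:
        scored = [(vem.get((s, a), 0.0), a) for a in candidate_actions]
        new_policy[s] = max(scored)[1]  # action with the best VEM value
    return new_policy

# Toy offline dataset: two GUI states, hypothetical tap/type actions.
data = [
    ("home", "tap_search", 1.0),
    ("home", "tap_back", 0.0),
    ("home", "tap_search", 0.8),
    ("search", "type_query", 1.0),
]
vem = train_vem(data)  # frozen after this point
policy = improve_policy(
    vem,
    {"home": "tap_back", "search": "tap_back"},
    ["home", "search"],
    ["tap_search", "tap_back", "type_query"],
)
```

The key design point the abstract emphasizes survives even in this toy form: the value model is trained once from offline data and never updated during policy improvement, so policy exploration needs only the frozen value signal, not live environment feedback or next-state prediction.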
