Improving Transformer World Models for Data-Efficient RL
February 3, 2025
Authors: Antoine Dedieu, Joseph Ortiz, Xinghua Lou, Carter Wendelken, Wolfgang Lehrach, J Swaroop Guntupalli, Miguel Lazaro-Gredilla, Kevin Patrick Murphy
cs.AI
Abstract
We present an approach to model-based RL that achieves new state-of-the-art
performance on the challenging Craftax-classic benchmark, an open-world 2D
survival game that requires agents to exhibit a wide range of general abilities
-- such as strong generalization, deep exploration, and long-term reasoning.
With a series of careful design choices aimed at improving sample efficiency,
our MBRL algorithm achieves a reward of 67.4% after only 1M environment steps,
significantly outperforming DreamerV3, which achieves 53.2%, and, for the first
time, exceeds human performance of 65.0%. Our method starts by constructing a
SOTA model-free baseline, using a novel policy architecture that combines CNNs
and RNNs. We then add three improvements to the standard MBRL setup: (a) "Dyna
with warmup", which trains the policy on real and imaginary data, (b) "nearest
neighbor tokenizer" on image patches, which improves the scheme to create the
transformer world model (TWM) inputs, and (c) "block teacher forcing", which
allows the TWM to reason jointly about the future tokens of the next timestep.
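
The "nearest neighbor tokenizer" replaces a learned tokenizer with a simple lookup over image patches. Below is a minimal NumPy sketch of that idea; the distance threshold `tau` and the convention that an unmatched patch is added verbatim as a new, frozen code are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def nn_tokenize(patches, codebook, tau=0.75):
    """Map flattened image patches to discrete token ids by nearest neighbor.

    patches  : (N, D) array of flattened patches
    codebook : list of (D,) arrays; grows when no existing code is within tau
    tau      : distance threshold for adding a new code (assumed value)
    """
    tokens = []
    for p in patches:
        if codebook:
            dists = np.linalg.norm(np.stack(codebook) - p, axis=1)
            j = int(np.argmin(dists))
            if dists[j] <= tau:
                tokens.append(j)
                continue
        # No sufficiently close code: store the patch itself as a new code.
        codebook.append(p.copy())
        tokens.append(len(codebook) - 1)
    return tokens
```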
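"Block teacher forcing" lets the TWM predict all tokens of the next timestep jointly rather than one token at a time within a frame. One plausible ingredient is a block-causal attention mask, in which each token attends to every token of its own timestep block and of earlier blocks; the sketch below builds such a mask and is an assumption about one way to realize the idea, not the paper's exact masking or input-shifting scheme.

```python
import numpy as np

def block_causal_mask(T, L):
    """Boolean attention mask of shape (T*L, T*L) for T timesteps of L tokens.

    A token at timestep t may attend to all tokens of timesteps <= t,
    including the full block of its own timestep, so the L tokens of
    timestep t+1 can be predicted jointly from the block at timestep t.
    """
    steps = np.repeat(np.arange(T), L)       # timestep index of each token position
    return steps[:, None] >= steps[None, :]  # allow within-block and past-block attention
```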