FAST: Efficient Action Tokenization for Vision-Language-Action Models
January 16, 2025
Authors: Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, Sergey Levine
cs.AI
Abstract
Autoregressive sequence models, such as Transformer-based vision-language
action (VLA) policies, can be tremendously effective for capturing complex and
generalizable robotic behaviors. However, such models require us to choose a
tokenization of our continuous action signals, which determines how the
discrete symbols predicted by the model map to continuous robot actions. We
find that current approaches for robot action tokenization, based on simple
per-dimension, per-timestep binning schemes, typically perform poorly when
learning dexterous skills from high-frequency robot data. To address this
challenge, we propose a new compression-based tokenization scheme for robot
actions, based on the discrete cosine transform. Our tokenization approach,
Frequency-space Action Sequence Tokenization (FAST), enables us to train
autoregressive VLAs for highly dexterous and high-frequency tasks where
standard discretization methods fail completely. Based on FAST, we release
FAST+, a universal robot action tokenizer, trained on 1M real robot action
trajectories. It can be used as a black-box tokenizer for a wide range of robot
action sequences, with diverse action spaces and control frequencies. Finally,
we show that, when combined with the pi0 VLA, our method can scale to training
on 10k hours of robot data and match the performance of diffusion VLAs, while
reducing training time by up to 5x.
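The frequency-space tokenization idea can be sketched in a few lines: apply a discrete cosine transform to each action dimension of a chunk, then quantize the coefficients to integers. This is a minimal illustration under stated assumptions, not the authors' implementation — FAST additionally compresses the quantized coefficients with byte-pair encoding (omitted here), and the `scale` quantization parameter is hypothetical.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis matrix; rows are frequency components,
    # so its inverse (DCT-III) is simply its transpose.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    m[0] *= 1.0 / np.sqrt(n)
    m[1:] *= np.sqrt(2.0 / n)
    return m

def tokenize(chunk, scale=10.0):
    # chunk: (horizon, action_dim) array of continuous actions.
    # Transform each dimension to frequency space and round the
    # scaled coefficients to integers. Smooth, high-frequency
    # trajectories concentrate energy in a few low frequencies,
    # so most tokens are zero and the sequence compresses well.
    D = dct_matrix(chunk.shape[0])
    coeffs = D @ chunk
    return np.round(coeffs * scale).astype(int)

def detokenize(tokens, scale=10.0):
    # Invert quantization and the orthonormal DCT.
    D = dct_matrix(tokens.shape[0])
    return D.T @ (tokens / scale)
```

A round trip on a smooth 50-step trajectory reconstructs the actions up to quantization error, while most integer tokens come out zero — the sparsity that per-timestep binning schemes cannot exploit.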