FAST: Efficient Action Tokenization for Vision-Language-Action Models
January 16, 2025
Authors: Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, Sergey Levine
cs.AI
Abstract
Autoregressive sequence models, such as Transformer-based vision-language
action (VLA) policies, can be tremendously effective for capturing complex and
generalizable robotic behaviors. However, such models require us to choose a
tokenization of our continuous action signals, which determines how the
discrete symbols predicted by the model map to continuous robot actions. We
find that current approaches for robot action tokenization, based on simple
per-dimension, per-timestep binning schemes, typically perform poorly when
learning dexterous skills from high-frequency robot data. To address this
challenge, we propose a new compression-based tokenization scheme for robot
actions, based on the discrete cosine transform. Our tokenization approach,
Frequency-space Action Sequence Tokenization (FAST), enables us to train
autoregressive VLAs for highly dexterous and high-frequency tasks where
standard discretization methods fail completely. Based on FAST, we release
FAST+, a universal robot action tokenizer, trained on 1M real robot action
trajectories. It can be used as a black-box tokenizer for a wide range of robot
action sequences, with diverse action spaces and control frequencies. Finally,
we show that, when combined with the pi0 VLA, our method can scale to training
on 10k hours of robot data and match the performance of diffusion VLAs, while
reducing training time by up to 5x.
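The frequency-space tokenization idea can be sketched in a few lines: apply a discrete cosine transform to each action dimension of a chunk, then quantize the coefficients to integers. This is a minimal illustration under stated assumptions, not the authors' implementation — FAST additionally compresses the quantized coefficients with byte-pair encoding (omitted here), and the `scale` quantization parameter is hypothetical.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis matrix; rows are frequency components,
    # so its inverse (DCT-III) is simply its transpose.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    m[0] *= 1.0 / np.sqrt(n)
    m[1:] *= np.sqrt(2.0 / n)
    return m

def tokenize(chunk, scale=10.0):
    # chunk: (horizon, action_dim) array of continuous actions.
    # Transform each dimension to frequency space and round the
    # scaled coefficients to integers. Smooth, high-frequency
    # trajectories concentrate energy in a few low frequencies,
    # so most tokens are zero and the sequence compresses well.
    D = dct_matrix(chunk.shape[0])
    coeffs = D @ chunk
    return np.round(coeffs * scale).astype(int)

def detokenize(tokens, scale=10.0):
    # Invert quantization and the orthonormal DCT.
    D = dct_matrix(tokens.shape[0])
    return D.T @ (tokens / scale)
```

A round trip on a smooth 50-step trajectory reconstructs the actions up to quantization error, while most integer tokens come out zero — the sparsity that per-timestep binning schemes cannot exploit.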