FAST: Efficient Action Tokenization for Vision-Language-Action Models

January 16, 2025
Authors: Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, Sergey Levine
cs.AI

Abstract

Autoregressive sequence models, such as Transformer-based vision-language-action (VLA) policies, can be tremendously effective for capturing complex and generalizable robotic behaviors. However, such models require us to choose a tokenization of our continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions. We find that current approaches for robot action tokenization, based on simple per-dimension, per-timestep binning schemes, typically perform poorly when learning dexterous skills from high-frequency robot data. To address this challenge, we propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where standard discretization methods fail completely. Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories. It can be used as a black-box tokenizer for a wide range of robot action sequences, with diverse action spaces and control frequencies. Finally, we show that, when combined with the pi0 VLA, our method can scale to training on 10k hours of robot data and match the performance of diffusion VLAs, while reducing training time by up to 5x.
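To make the core idea concrete, below is a minimal Python sketch of the DCT-plus-quantization step the abstract describes: each dimension of an action chunk is transformed over time with a discrete cosine transform, and the coefficients are quantized by scaling and rounding. The `SCALE` value, function names, and flattening order are illustrative assumptions, and FAST's final byte-pair-encoding compression stage is omitted; this is not the authors' implementation.

```python
# Minimal sketch of DCT-based action-chunk tokenization in the spirit of FAST.
# SCALE and all helper names are assumptions; FAST's BPE compression of the
# resulting integer sequence is intentionally left out.
import numpy as np
from scipy.fft import dct, idct

SCALE = 10.0  # quantization scale (assumed): larger -> finer resolution, longer sequences

def encode(actions: np.ndarray) -> np.ndarray:
    """Map a (T, D) chunk of normalized continuous actions to integer coefficients."""
    coeffs = dct(actions, type=2, norm="ortho", axis=0)  # per-dimension DCT over time
    q = np.round(coeffs * SCALE).astype(np.int64)        # scale-and-round quantization
    # Row-major flatten puts the low-frequency coefficients of all dimensions first;
    # in FAST this integer sequence would then be compressed with a BPE tokenizer.
    return q.flatten(order="C")

def decode(tokens: np.ndarray, T: int, D: int) -> np.ndarray:
    """Invert encode(): dequantize and apply the inverse DCT per dimension."""
    q = tokens.reshape(T, D).astype(np.float64) / SCALE
    return idct(q, type=2, norm="ortho", axis=0)

# Round-trip example: a 50-step, 7-DoF chunk (e.g., one second of 50 Hz arm control).
chunk = np.random.randn(50, 7).cumsum(axis=0) * 0.01  # smooth synthetic trajectory
recon = decode(encode(chunk), 50, 7)
print("max reconstruction error:", np.abs(chunk - recon).max())
```

Because smooth, high-frequency trajectories concentrate energy in a few low-frequency DCT coefficients, most quantized entries are zero or near-zero, which is what makes the subsequent compression step effective compared with per-timestep binning.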
