

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

November 4, 2024
作者: Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang
cs.AI

Abstract

MLLMs have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. In contrast, the inference of MLLMs involves storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In our paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once a proper size of the model has been activated for a specific situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), as well as peak computational consumption (i.e., latency) and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance. Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA.
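To make the early-exit idea concrete, below is a minimal, illustrative sketch of a multi-exit inference loop in PyTorch. It is not the authors' implementation: the class name MultiExitBackbone, the exit placement (exit_every), the shared action_head, and the feature-consistency threshold tau are all hypothetical stand-ins. In particular, DeeR derives its early-termination criteria from predefined resource demands (average and peak computation, GPU memory); the simple "consecutive exits agree within tau" heuristic here is only a placeholder for that mechanism.

```python
# Illustrative sketch of dynamic early-exit inference (not the DeeR-VLA code).
# A stack of transformer layers exposes intermediate exits; forwarding stops
# once consecutive exit features change by less than a threshold `tau`, so
# deeper (more expensive) layers are skipped for "easy" situations.
import torch
import torch.nn as nn


class MultiExitBackbone(nn.Module):
    def __init__(self, dim=256, num_layers=12, exit_every=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        self.exit_every = exit_every
        # Lightweight action head shared across exits (hypothetical design);
        # 7 outputs stand in for a 7-DoF end-effector action.
        self.action_head = nn.Linear(dim, 7)

    @torch.no_grad()
    def forward(self, tokens, tau=0.05):
        """Run layers until consecutive exit features agree within `tau`."""
        prev_feat = None
        feat = tokens
        for i, layer in enumerate(self.layers):
            feat = layer(feat)
            if (i + 1) % self.exit_every == 0:  # an intermediate exit point
                pooled = feat.mean(dim=1)
                if prev_feat is not None:
                    # Placeholder early-termination criterion: relative change
                    # between consecutive exits falls below the threshold.
                    delta = (pooled - prev_feat).norm() / (prev_feat.norm() + 1e-6)
                    if delta < tau:
                        return self.action_head(pooled), i + 1  # layers used
                prev_feat = pooled
        return self.action_head(feat.mean(dim=1)), len(self.layers)


if __name__ == "__main__":
    model = MultiExitBackbone().eval()
    obs_tokens = torch.randn(1, 32, 256)  # placeholder multimodal tokens
    action, layers_used = model(obs_tokens, tau=0.05)
    print(f"action shape: {tuple(action.shape)}, layers activated: {layers_used}")
```

One design implication worth noting: if a hard memory or latency budget caps the deepest exit that may ever be activated, only the layers up to that exit need to be resident at inference time, which is presumably how DeeR obtains its reported reductions in peak GPU memory as well as in average computation.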
