

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

November 4, 2024
作者: Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang
cs.AI

Abstract

MLLMs have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. In contrast, the inference of MLLMs involves storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In our paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once a proper size of the model has been activated for a specific situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), as well as peak computational consumption (i.e., latency) and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance. Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA.
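To make the early-exit idea concrete, below is a minimal illustrative sketch of a DeeR-style dynamic inference loop: a stack of LLM blocks with intermediate exits, where processing stops as soon as consecutive exit features stabilize, and the exit threshold is assumed to be tuned offline so that average/peak computation and GPU memory stay within a predefined budget. The block names, the stabilization criterion, and the threshold are illustrative assumptions, not the paper's exact algorithm; see the official repository for the actual implementation.

```python
# Hypothetical sketch of dynamic early-exit inference in the spirit of DeeR-VLA.
# Assumptions: random linear "blocks" stand in for LLM layers, and the exit
# criterion is a relative-change test between consecutive exit features with a
# threshold (EXIT_THRESHOLD) presumed tuned offline to meet a compute/memory budget.

import numpy as np

rng = np.random.default_rng(0)

NUM_BLOCKS = 6          # LLM blocks between consecutive exits
FEATURE_DIM = 64
EXIT_THRESHOLD = 0.05   # assumed budget-tuned early-termination threshold

# Stand-in "blocks": random linear maps playing the role of LLM layers.
blocks = [rng.normal(scale=1.0 / np.sqrt(FEATURE_DIM), size=(FEATURE_DIM, FEATURE_DIM))
          for _ in range(NUM_BLOCKS)]

def action_head(feature: np.ndarray) -> np.ndarray:
    """Maps an intermediate feature to a dummy robot action (e.g., 7-DoF)."""
    return np.tanh(feature[:7])

def deer_style_inference(x: np.ndarray):
    """Run blocks sequentially and exit early once exit features stabilize."""
    prev_feature = x
    for i, w in enumerate(blocks):
        feature = np.tanh(prev_feature @ w)
        # Early-termination criterion (assumed): relative change between
        # consecutive exit features drops below the budget-tuned threshold.
        change = np.linalg.norm(feature - prev_feature) / (np.linalg.norm(prev_feature) + 1e-8)
        if change < EXIT_THRESHOLD:
            return action_head(feature), i + 1      # exit after i + 1 blocks
        prev_feature = feature
    return action_head(prev_feature), NUM_BLOCKS    # fall back to the full model

obs_feature = rng.normal(size=FEATURE_DIM)          # dummy fused vision-language feature
action, blocks_used = deer_style_inference(obs_feature)
print(f"Exited after {blocks_used}/{NUM_BLOCKS} blocks; action = {action.round(3)}")
```

Easy situations would trigger an exit after only a few blocks, while harder ones fall through to the full model, which is how the average computational cost can be reduced without capping the model's capacity on difficult cases.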

