DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable comprehension and reasoning capabilities on complex language and visual data. These advances have spurred the vision of a generalist robotic MLLM that understands complex human instructions and accomplishes a variety of embodied tasks. However, developing MLLMs for real-world robots is challenging because robotic platforms typically offer limited computation and memory, whereas MLLM inference requires storing billions of parameters and performing tremendous computation, which imposes significant hardware demands. In this paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM to the situation at hand. The approach leverages a multi-exit architecture in MLLMs, which lets the model terminate processing once a suitably sized portion of the model has been activated for a specific situation, thus avoiding redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), peak computational consumption (i.e., latency), and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR reduces the computational cost of the LLM by 5.2-6.5x and the GPU memory of the LLM by 2-6x without compromising performance. Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA.
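To make the multi-exit idea concrete, the following is a minimal sketch of early-exit inference. It is not the DeeR implementation. The abstract does not state the exit criterion, so this sketch assumes one: stop once the actions predicted at two consecutive exits differ by less than a threshold. The names `early_exit_inference`, `blocks`, `action_head`, `threshold`, and `max_exits` are hypothetical. The threshold is set by hand here, whereas the abstract describes algorithms that derive the termination criteria from compute and memory budgets. `max_exits` stands in for a hard cap on peak computation.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def early_exit_inference(
    features: Vector,
    blocks: Sequence[Callable[[Vector], Vector]],
    action_head: Callable[[Vector], Vector],
    threshold: float,
    max_exits: Optional[int] = None,
) -> Tuple[Vector, int]:
    """Run blocks in order and stop at the first exit that passes the test.

    Assumed criterion (not taken from the abstract): exit when the action
    predicted at this exit is within `threshold` of the action predicted
    at the previous exit. Returns the action and the number of blocks run.
    """
    limit = len(blocks) if max_exits is None else min(max_exits, len(blocks))
    if limit < 1:
        raise ValueError("at least one block must be allowed to run")

    hidden = features
    previous: Optional[Vector] = None
    action: Vector = []
    for used in range(1, limit + 1):
        hidden = blocks[used - 1](hidden)   # activate one more block
        action = action_head(hidden)        # prediction at this exit
        if previous is not None and l2_distance(action, previous) < threshold:
            return action, used             # later blocks are skipped
        previous = action
    return action, limit                    # budget exhausted


# Toy stand-ins: each block moves the hidden state halfway toward 1.0,
# and the action head is the identity.
toy_blocks = [lambda h: [0.5 * (x + 1.0) for x in h]] * 6
toy_action, toy_used = early_exit_inference(
    [0.0], toy_blocks, lambda h: list(h), threshold=0.1
)
print(toy_action, toy_used)
```

In this toy setup the predictions settle after a few blocks, so the remaining blocks are never evaluated. A looser threshold exits earlier and trades accuracy for compute, which is the knob a budget-driven criterion would tune.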

