AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in language processing, yet they often struggle with tasks that require genuine visual spatial reasoning. In this paper, we introduce a novel two-stage training framework designed to equip standard LLMs with visual reasoning abilities for maze navigation. First, we apply Supervised Fine-Tuning (SFT) on a curated dataset of tokenized maze representations to teach the model to predict step-by-step movement commands. Next, we apply Group Relative Policy Optimization (GRPO), a technique used in DeepSeekR1, with a carefully crafted reward function to refine the model's sequential decision-making and encourage emergent chain-of-thought behaviors. Experimental results on synthetically generated mazes show that a baseline model fails to navigate the maze, the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning raises accuracy to 93%. Qualitative analyses indicate that GRPO fosters more robust and self-corrective reasoning, highlighting the potential of our approach to bridge the gap between language models and visual spatial tasks. These findings offer promising implications for applications in robotics, autonomous navigation, and other domains that require integrated visual and sequential reasoning.
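
As a rough illustration of the "group relative" idea behind GRPO, the sketch below normalizes the rewards of several sampled answers to the same prompt against their own group mean and spread. It is not the paper's implementation: the function name, the example reward values, the use of the population standard deviation, and the `eps` stabilizer are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to the other answers in its group.

    Assumption: advantages are (reward - group mean) / (group std + eps);
    the exact normalization used in the paper is not stated in the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled solutions to one maze prompt:
# 1.0 if the predicted moves reach the target, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that beat their group average receive a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.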
