

AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO

February 20, 2025
作者: Alan Dao, Dinh Bach Vu
cs.AI

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in language processing, yet they often struggle with tasks requiring genuine visual spatial reasoning. In this paper, we introduce a novel two-stage training framework designed to equip standard LLMs with visual reasoning abilities for maze navigation. First, we leverage Supervised Fine-Tuning (SFT) on a curated dataset of tokenized maze representations to teach the model to predict step-by-step movement commands. Next, we apply Group Relative Policy Optimization (GRPO), a technique used in DeepSeek-R1, with a carefully crafted reward function to refine the model's sequential decision-making and encourage emergent chain-of-thought behaviors. Experimental results on synthetically generated mazes show that while a baseline model fails to navigate the maze, the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning boosts accuracy to 93%. Qualitative analyses reveal that GRPO fosters more robust and self-corrective reasoning, highlighting the potential of our approach to bridge the gap between language models and visual spatial tasks. These findings offer promising implications for applications in robotics, autonomous navigation, and other domains that require integrated visual and sequential reasoning.
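
The abstract mentions two components that lend themselves to a small illustration: a reward over generated move sequences and GRPO's group-relative scoring of sampled rollouts. The sketch below is only an approximation under assumed names; `MOVE_TOKENS`, `maze_reward`, and the partial-credit scheme are hypothetical and not taken from the paper, and only the group-relative normalization reflects the general GRPO formulation.

```python
# Illustrative sketch only: the reward and token vocabulary below are
# hypothetical placeholders, not the paper's actual design.
from typing import List

MOVE_TOKENS = ["<up>", "<down>", "<left>", "<right>"]  # assumed movement vocabulary


def maze_reward(predicted: List[str], solution: List[str]) -> float:
    """Toy reward: 1.0 for an exact solution, partial credit for a correct prefix."""
    if predicted == solution:
        return 1.0
    correct_prefix = 0
    for p, s in zip(predicted, solution):
        if p != s:
            break
        correct_prefix += 1
    return 0.5 * correct_prefix / max(len(solution), 1)


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO's core idea: normalize each sampled rollout's reward against the
    mean and standard deviation of its own group, rather than a learned value baseline."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    std = std if std > 0 else 1.0  # avoid division by zero when all rewards tie
    return [(r - mean) / std for r in rewards]


# Example: a group of sampled move sequences for one maze prompt.
solution = ["<up>", "<up>", "<right>"]
rollouts = [
    ["<up>", "<up>", "<right>"],    # solves the maze
    ["<up>", "<down>", "<right>"],  # wrong second move
    ["<left>"],                     # immediately wrong
]
rewards = [maze_reward(r, solution) for r in rollouts]
print(group_relative_advantages(rewards))  # positive only for the solved rollout
```

In an actual GRPO update, these per-rollout advantages would weight the policy-gradient term for every token in the corresponding sampled sequence; the sketch stops at computing them.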

