Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding
November 6, 2024
作者: Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, Huan Wang
cs.AI
Abstract
Large language models (LLMs) have shown impressive capabilities, but still
struggle with complex reasoning tasks requiring multiple steps. While
prompt-based methods like Chain-of-Thought (CoT) can improve LLM reasoning at
inference time, optimizing reasoning capabilities during training remains
challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled
framework that formulates reasoning as sampling from a latent distribution and
optimizes it via variational approaches. LaTRO enables LLMs to concurrently
improve both their reasoning process and ability to evaluate reasoning quality,
without requiring external feedback or reward models. We validate LaTRO through
experiments on GSM8K and ARC-Challenge datasets using multiple model
architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of
12.5% over base models and 9.6% over supervised fine-tuning across
Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that
pre-trained LLMs possess latent reasoning capabilities that can be unlocked and
enhanced through our proposed optimization approach in a self-improvement
manner. The code of LaTRO is available at
https://github.com/SalesforceAIResearch/LaTRO.
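
As a rough illustration of the latent-variable view described in the abstract, a standard variational lower bound of this kind can be written as follows (a minimal sketch in generic notation, where x is the question, y the answer, z the latent rationale, p_\theta the LLM, and q_\phi the rationale sampler; this is not the paper's exact derivation):

\log p_\theta(y \mid x) = \log \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}\!\left[ p_\theta(y \mid x, z) \right] \geq \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\!\left[ \log p_\theta(y \mid x, z) \right] - D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p_\theta(z \mid x) \right)

Under this reading, \log p_\theta(y \mid x, z) plays the role of a self-generated reward for a sampled rationale z, so a single model can both propose rationales and score them, consistent with the abstract's claim that no external feedback or reward model is required.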