Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding
November 6, 2024
作者: Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, Huan Wang
cs.AI
Abstract
Large language models (LLMs) have shown impressive capabilities, but still
struggle with complex reasoning tasks requiring multiple steps. While
prompt-based methods like Chain-of-Thought (CoT) can improve LLM reasoning at
inference time, optimizing reasoning capabilities during training remains
challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled
framework that formulates reasoning as sampling from a latent distribution and
optimizes it via variational approaches. LaTRO enables LLMs to concurrently
improve both their reasoning process and ability to evaluate reasoning quality,
without requiring external feedback or reward models. We validate LaTRO through
experiments on GSM8K and ARC-Challenge datasets using multiple model
architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of
12.5% over base models and 9.6% over supervised fine-tuning across
Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that
pre-trained LLMs possess latent reasoning capabilities that can be unlocked and
enhanced through our proposed optimization approach in a self-improvement
manner. The code of LaTRO is available at
https://github.com/SalesforceAIResearch/LaTRO.
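
As a rough illustration of the latent-variable view described in the abstract, a standard variational lower bound of this kind can be written as follows (a minimal sketch in generic notation, where x is the question, y the answer, z the latent rationale, p_\theta the LLM, and q_\phi the rationale sampler; this is not the paper's exact derivation):

\log p_\theta(y \mid x) = \log \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}\!\left[ p_\theta(y \mid x, z) \right] \geq \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\!\left[ \log p_\theta(y \mid x, z) \right] - D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p_\theta(z \mid x) \right)

Under this reading, \log p_\theta(y \mid x, z) plays the role of a self-generated reward for a sampled rationale z, so a single model can both propose rationales and score them, consistent with the abstract's claim that no external feedback or reward model is required.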