LIMO: Less is More for Reasoning
February 5, 2025
Authors: Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu
cs.AI
Abstract
We present a fundamental discovery that challenges our understanding of how
complex reasoning emerges in large language models. While conventional wisdom
suggests that sophisticated reasoning tasks demand extensive training data
(>100,000 examples), we demonstrate that complex mathematical reasoning
abilities can be effectively elicited with surprisingly few examples. Through
comprehensive experiments, our proposed model LIMO demonstrates unprecedented
performance in mathematical reasoning. With merely 817 curated training
samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, improving from
previous SFT-based models' 6.5% and 59.2% respectively, while only using 1% of
the training data required by previous approaches. LIMO demonstrates
exceptional out-of-distribution generalization, achieving 40.5% absolute
improvement across 10 diverse benchmarks, outperforming models trained on 100x
more data, challenging the notion that SFT leads to memorization rather than
generalization. Based on these results, we propose the Less-Is-More Reasoning
Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has
been comprehensively encoded during pre-training, sophisticated reasoning
capabilities can emerge through minimal but precisely orchestrated
demonstrations of cognitive processes. This hypothesis posits that the
elicitation threshold for complex reasoning is determined by two key factors:
(1) the completeness of the model's encoded knowledge foundation during
pre-training, and (2) the effectiveness of post-training examples as "cognitive
templates" that show the model how to utilize its knowledge base to solve
complex reasoning tasks. To facilitate reproducibility and future research in
data-efficient reasoning, we release LIMO as a comprehensive open-source suite
at https://github.com/GAIR-NLP/LIMO.