SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers
February 27, 2025
Authors: Kechen Li, Wenqi Zhu, Coralia Cartis, Tianbo Ji, Shiwei Liu
cs.AI
Abstract
Large Language Models (LLMs) have achieved human-level proficiency across
diverse tasks, but their ability to perform rigorous mathematical problem
solving remains an open challenge. In this work, we investigate a fundamental
yet computationally intractable problem: determining whether a given
multivariate polynomial is nonnegative. This problem, closely related to
Hilbert's Seventeenth Problem, plays a crucial role in global polynomial
optimization and has applications in various fields. First, we introduce
SoS-1K, a meticulously curated dataset of approximately 1,000 polynomials,
along with expert-designed reasoning instructions based on five progressively
challenging criteria. Evaluating multiple state-of-the-art LLMs, we find that
without structured guidance, all models perform only slightly above the 50%
random-guess baseline. However, high-quality reasoning instructions significantly
improve accuracy, boosting performance up to 81%. Furthermore, our 7B model,
SoS-7B, fine-tuned on SoS-1K for just 4 hours, outperforms the 671B DeepSeek-V3
and GPT-4o-mini in accuracy, while requiring only 1.8% and 5% of the computation
time of those models, respectively. Our findings highlight the potential of
LLMs to push the boundaries of mathematical reasoning and tackle NP-hard
problems.
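The decision problem studied here has a classical Gram-matrix formulation: a polynomial p is a sum of squares exactly when p = z^T Q z for some positive semidefinite matrix Q, where z is a vector of monomials, and this feasibility test is a semidefinite program. As a minimal illustration (not part of the paper's SoS-1K pipeline, and assuming cvxpy with an SDP-capable solver is installed), the sketch below checks the small quartic p(x, y) = 2x^4 + 2x^3*y - x^2*y^2 + 5y^4, a standard textbook example.

import cvxpy as cp

# Monomial basis z = [x^2, x*y, y^2]; we search for a symmetric PSD Gram
# matrix Q with p(x, y) = z^T Q z, which certifies that p is a sum of squares.
Q = cp.Variable((3, 3), symmetric=True)

# Coefficient matching for p(x, y) = 2*x^4 + 2*x^3*y - x^2*y^2 + 5*y^4
constraints = [
    Q >> 0,                       # Q must be positive semidefinite
    Q[0, 0] == 2,                 # coefficient of x^4
    2 * Q[0, 1] == 2,             # coefficient of x^3 * y
    2 * Q[0, 2] + Q[1, 1] == -1,  # coefficient of x^2 * y^2
    2 * Q[1, 2] == 0,             # coefficient of x * y^3
    Q[2, 2] == 5,                 # coefficient of y^4
]

problem = cp.Problem(cp.Minimize(0), constraints)
problem.solve()
print("SoS certificate found" if problem.status == cp.OPTIMAL
      else "no SoS certificate in this basis")

If the program is feasible, factorizing Q = L^T L recovers an explicit sum-of-squares decomposition and hence a proof of nonnegativity; infeasibility only rules out an SoS certificate in the chosen monomial basis, not nonnegativity itself.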