Competitive Programming with Large Reasoning Models

Abstract

We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. Additionally, we compare two general-purpose reasoning models (OpenAI o1 and an early checkpoint of o3) with a domain-specific system, o1-ioi, which uses hand-engineered inference strategies designed for competing in the 2024 International Olympiad in Informatics (IOI). We competed live at IOI 2024 with o1-ioi and, using hand-crafted test-time strategies, placed in the 49th percentile. Under relaxed competition constraints, o1-ioi achieved a gold medal. However, when evaluating later models such as o3, we find that o3 achieves gold without hand-crafted domain-specific strategies or relaxed constraints. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics. Notably, o3 achieves a gold medal at the 2024 IOI and obtains a Codeforces rating on par with elite human competitors. Overall, these results indicate that scaling general-purpose reinforcement learning, rather than relying on domain-specific techniques, offers a robust path toward state-of-the-art AI in reasoning domains such as competitive programming.
