m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models
April 1, 2025
Authors: Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou
cs.AI
Abstract
Test-time scaling has emerged as a powerful technique for enhancing the
reasoning capabilities of large language models. However, its effectiveness in
medical reasoning remains uncertain, as the medical domain fundamentally
differs from mathematical tasks in terms of knowledge representation and
decision-making processes. In this paper, we provide the first comprehensive
investigation of test-time scaling for medical reasoning and present m1, a
simple yet effective approach that increases a model's medical reasoning
capability at inference. Our evaluation across diverse medical tasks
demonstrates that test-time scaling consistently enhances medical reasoning,
enabling lightweight fine-tuned models under 10B parameters to establish new
state-of-the-art performance, while our 32B model rivals previous 70B-scale
medical LLMs. However, we identify an optimal reasoning token budget of
approximately 4K, beyond which performance may degrade due to overthinking.
Budget forcing, which extends test-time computation through iterative prompts,
helps models double-check answers but does not necessarily improve the overall
medical QA performance and, in some cases, even introduces errors into
previously correct responses. Our case-by-case analysis identifies insufficient
medical knowledge as a key bottleneck that prevents further performance gains
through test-time scaling. We find that increasing data scale, improving data
quality, and expanding model capacity consistently enhance medical knowledge
grounding, enabling continued performance improvements, particularly on
challenging medical benchmarks where smaller models reach saturation. These
findings underscore fundamental differences between medical and mathematical
reasoning in LLMs, highlighting that enriched medical knowledge, rather than
increased reasoning depth alone, is essential for realizing the benefits of
test-time scaling.
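The budget-forcing procedure described above (extending test-time computation by iteratively re-prompting the model to double-check its answer, up to a reasoning token budget) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `generate` callback, the forcing string, and the stopping logic are all assumptions made for demonstration.

```python
def budget_forced_generate(generate, prompt, token_budget=4096, max_forces=2):
    """Illustrative sketch of budget forcing.

    `generate(text)` stands in for an LLM call and returns
    (continuation, tokens_used, stopped_early). When the model stops
    before the budget is spent, a continuation prompt is appended to
    force it to re-examine its answer, at most `max_forces` times.
    """
    text = prompt
    used = 0
    forces = 0
    while used < token_budget:
        continuation, tokens, stopped = generate(text)
        text += continuation
        used += tokens
        if not stopped:
            # Budget was exhausted mid-generation; accept what we have.
            break
        if forces >= max_forces:
            # Enough double-checks; accept the final answer.
            break
        forces += 1
        # Hypothetical forcing string appended to extend reasoning.
        text += "\nWait, let me double-check."
    return text, used

# Toy stand-in model: always emits a short answer and stops early.
def toy_model(text):
    return (" ...therefore the answer is B.", 10, True)

out, used = budget_forced_generate(toy_model, "Q: ...", token_budget=4096)
```

With this toy model, the loop runs three times (one initial pass plus two forced double-checks), consuming 30 tokens; the abstract's finding is that such extra passes do not reliably improve medical QA accuracy and can flip correct answers to incorrect ones.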