O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning

January 11, 2025
Authors: Zhongzhen Huang, Gui Geng, Shengyi Hua, Zhen Huang, Haoyang Zou, Shaoting Zhang, Pengfei Liu, Xiaofan Zhang
cs.AI

Abstract

Building upon our previous investigations of O1 replication (Part 1: Journey Learning [Qin et al., 2024] and Part 2: Distillation [Huang et al., 2024]), this work explores the potential of inference-time scaling in large language models (LLMs) for medical reasoning tasks, ranging from diagnostic decision-making to treatment planning. Through extensive experiments on medical benchmarks of varying complexity (MedQA, Medbullets, and JAMA Clinical Challenges), our investigation reveals several key insights: (1) Increasing inference time does lead to improved performance. With a modest training set of 500 samples, our model yields substantial performance improvements of 6%-11%. (2) Task complexity directly correlates with the required length of reasoning chains, confirming the necessity of extended thought processes for challenging problems. (3) The differential diagnoses generated by our model adhere to the principles of the hypothetico-deductive method, producing a list of potential conditions that may explain a patient's symptoms and systematically narrowing these possibilities by evaluating the evidence. These findings demonstrate the promising synergy between inference-time scaling and journey learning in advancing LLMs' real-world clinical reasoning capabilities.
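As a concrete illustration of the first finding, one common way to spend extra inference-time compute is to sample several independent reasoning chains and aggregate their final answers by majority vote (self-consistency). The sketch below is hypothetical and not the paper's actual method: `sample_answer` is a stub standing in for a stochastic LLM call on a multiple-choice medical question, and the simulated accuracy is an assumption chosen for illustration.

```python
import random
from collections import Counter

def sample_answer(question: str, rng: random.Random) -> str:
    """Stand-in for one stochastic LLM inference pass.

    Hypothetical stub: a real implementation would call the model with
    temperature > 0 and parse the chosen option from its reasoning chain.
    """
    # Simulate a model that picks the correct option "B" 60% of the
    # time and otherwise guesses uniformly among the distractors.
    return "B" if rng.random() < 0.6 else rng.choice(["A", "C", "D"])

def majority_vote(question: str, n_samples: int, seed: int = 0) -> str:
    """Scale inference-time compute: draw n_samples reasoning chains
    and return the most frequent final answer."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(question, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    q = "A 54-year-old presents with ... (MedQA-style vignette)"
    for n in (1, 8, 64):
        print(f"n={n}: answer={majority_vote(q, n)}")
```

With more samples, the aggregated answer converges toward the model's modal answer, which is why accuracy tends to rise with inference-time compute when individual chains are better than chance.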

