O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning
January 11, 2025
Authors: Zhongzhen Huang, Gui Geng, Shengyi Hua, Zhen Huang, Haoyang Zou, Shaoting Zhang, Pengfei Liu, Xiaofan Zhang
cs.AI
Abstract
Building upon our previous investigations of O1 replication (Part 1: Journey
Learning [Qin et al., 2024] and Part 2: Distillation [Huang et al., 2024]),
this work explores the potential of inference-time scaling in large language
models (LLMs) for medical reasoning tasks, ranging from diagnostic
decision-making to treatment planning. Through extensive experiments on medical
benchmarks of varying complexity (MedQA, Medbullets, and JAMA Clinical
Challenges), our investigation reveals several key insights: (1) Increasing
inference time does lead to improved performance. With a modest training set of
500 samples, our model yields substantial performance improvements of 6%-11%.
(2) Task complexity directly correlates with the required length of reasoning
chains, confirming the necessity of extended thought processes for challenging
problems. (3) The differential diagnoses generated by our model adhere to the
principles of the hypothetico-deductive method, producing a list of potential
conditions that may explain a patient's symptoms and systematically narrowing
these possibilities by evaluating the evidence. These findings demonstrate the
promising synergy between inference-time scaling and journey learning in
advancing LLMs' real-world clinical reasoning capabilities.
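The abstract's first finding is that spending more compute at inference time improves accuracy. The sketch below illustrates one common, generic form of inference-time scaling, self-consistency (majority voting over independently sampled reasoning chains); it is not the paper's specific method, and `sample_answer` is a hypothetical stub standing in for a full LLM reasoning chain.

```python
import random
from collections import Counter

def sample_answer(question: str, rng: random.Random) -> str:
    # Hypothetical stub for one stochastic LLM reasoning chain.
    # A real system would decode an extended chain of thought and
    # parse out the final multiple-choice answer (e.g. for MedQA).
    return rng.choice(["A", "A", "A", "B", "C"])  # noisy, biased toward "A"

def majority_vote(question: str, n_samples: int, seed: int = 0) -> str:
    # Inference-time scaling knob: more samples = more compute.
    # Aggregate the independent chains by taking the modal answer.
    rng = random.Random(seed)
    answers = [sample_answer(question, rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    print(majority_vote("Which diagnosis best fits the findings?", n_samples=16))
```

Increasing `n_samples` trades latency and cost for accuracy, which mirrors the abstract's observation that harder problems benefit from longer (here, more numerous) reasoning chains.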