O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning
January 11, 2025
Authors: Zhongzhen Huang, Gui Geng, Shengyi Hua, Zhen Huang, Haoyang Zou, Shaoting Zhang, Pengfei Liu, Xiaofan Zhang
cs.AI
Abstract
Building upon our previous investigations of O1 replication (Part 1: Journey
Learning [Qin et al., 2024] and Part 2: Distillation [Huang et al., 2024]),
this work explores the potential of inference-time scaling in large language
models (LLMs) for medical reasoning tasks, ranging from diagnostic
decision-making to treatment planning. Through extensive experiments on medical
benchmarks of varying complexity (MedQA, Medbullets, and JAMA Clinical
Challenges), our investigation reveals several key insights: (1) Increasing
inference time does lead to improved performance. With a modest training set of
500 samples, our model yields substantial performance improvements of 6%-11%.
(2) Task complexity directly correlates with the required length of reasoning
chains, confirming the necessity of extended thought processes for challenging
problems. (3) The differential diagnoses generated by our model adhere to the
principles of the hypothetico-deductive method, producing a list of potential
conditions that may explain a patient's symptoms and systematically narrowing
these possibilities by evaluating the evidence. These findings demonstrate the
promising synergy between inference-time scaling and journey learning in
advancing LLMs' real-world clinical reasoning capabilities.
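Point (3) of the abstract describes a hypothetico-deductive loop: propose a list of candidate diagnoses, then systematically narrow it by checking each against the evidence. The following is a minimal toy sketch of that narrowing step, not code from the paper; the disease names, findings, and the simple set-membership consistency check are all illustrative assumptions.

```python
def narrow_differential(candidates, evidence):
    """Keep only hypotheses consistent with every observed finding.

    candidates: dict mapping diagnosis -> set of findings it can explain
    evidence:   iterable of observed findings
    Returns the surviving diagnoses (the narrowed differential), sorted.
    """
    surviving = dict(candidates)
    for finding in evidence:
        # Eliminate any hypothesis that cannot account for this finding.
        surviving = {dx: fs for dx, fs in surviving.items() if finding in fs}
    return sorted(surviving)

# Illustrative placeholder data, not clinical guidance.
differential = {
    "pneumonia": {"fever", "cough", "crackles"},
    "pulmonary_embolism": {"dyspnea", "tachycardia"},
    "bronchitis": {"cough", "fever"},
}
observed = ["fever", "cough"]
print(narrow_differential(differential, observed))
# → ['bronchitis', 'pneumonia']
```

In the paper's setting the elimination step is performed by the model's extended reasoning chain rather than a hard consistency check, but the shape of the procedure (hypothesize, test against evidence, prune) is the same.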