From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
November 6, 2024
Authors: Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, Eric Horvitz
cs.AI
Abstract
Run-time steering strategies like Medprompt are valuable for guiding large
language models (LLMs) to top performance on challenging tasks. Medprompt
demonstrates that a general LLM can be focused to deliver state-of-the-art
performance on specialized domains like medicine by using a prompt to elicit a
run-time strategy involving chain of thought reasoning and ensembling. OpenAI's
o1-preview model represents a new paradigm, where a model is designed to do
run-time reasoning before generating final responses. We seek to understand the
behavior of o1-preview on a diverse set of medical challenge problem
benchmarks. Following on the Medprompt study with GPT-4, we systematically
evaluate the o1-preview model across various medical benchmarks. Notably, even
without prompting techniques, o1-preview largely outperforms the GPT-4 series
with Medprompt. We further systematically study the efficacy of classic prompt
engineering strategies, as represented by Medprompt, within the new paradigm of
reasoning models. We found that few-shot prompting hinders o1-preview's performance,
suggesting that in-context learning may no longer be an effective steering
approach for reasoning-native models. While ensembling remains viable, it is
resource-intensive and requires careful cost-performance optimization. Our cost
and accuracy analysis across run-time strategies reveals a Pareto frontier,
with GPT-4o representing a more affordable option and o1-preview achieving
state-of-the-art performance at higher cost. Although o1-preview offers top
performance, GPT-4o with steering strategies like Medprompt retains value in
specific contexts. Moreover, we note that the o1-preview model has reached
near-saturation on many existing medical benchmarks, underscoring the need for
new, challenging benchmarks. We close with reflections on general directions
for inference-time computation with LLMs.
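To make the contrast between the two paradigms concrete, the sketch below illustrates a Medprompt-style run-time strategy in miniature: a chain-of-thought prompt combined with choice-shuffled ensembling and a majority vote. This is an illustrative sketch, not the paper's implementation; `ask_model` is a hypothetical stand-in for whatever LLM client is used, and the prompt template, label set, and vote rule are assumptions for the example.

```python
import random
import re
from collections import Counter


def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (not from the paper).

    Replace with your own API client; it should return the model's
    full text response for the given prompt.
    """
    raise NotImplementedError


# Assumed prompt template: elicit step-by-step reasoning, then a
# machine-parseable final line.
COT_TEMPLATE = (
    "Answer the following medical question. Reason step by step, then end "
    "with a single line of the form 'Answer: <letter>'.\n\n"
    "Question: {question}\nOptions:\n{options}"
)


def ensemble_answer(question: str, choices: list[str],
                    n_samples: int = 5, seed: int = 0) -> str:
    """Chain-of-thought prompting plus choice-shuffled ensembling.

    Each ensemble member sees the answer options in a different order
    (reducing position bias); the prediction is a majority vote over
    the option texts the members selected.
    """
    rng = random.Random(seed)
    votes: list[str] = []
    for _ in range(n_samples):
        shuffled = choices[:]
        rng.shuffle(shuffled)
        labels = "ABCDEFGH"[: len(shuffled)]
        options = "\n".join(f"{l}) {c}" for l, c in zip(labels, shuffled))
        reply = ask_model(COT_TEMPLATE.format(question=question, options=options))
        match = re.search(r"Answer:\s*([A-H])", reply)
        if match:
            # Map the shuffled label back to the underlying option text.
            votes.append(shuffled[labels.index(match.group(1))])
    if not votes:
        raise ValueError("no parseable answers returned")
    return Counter(votes).most_common(1)[0][0]
```

By contrast, the paper's finding on few-shot prompting suggests that a reasoning-native model like o1-preview is best queried with the bare question, since in-context exemplars can hurt rather than help. Note also that the ensembling loop multiplies per-question API cost by `n_samples`, which is the cost-accuracy trade-off underlying the reported Pareto frontier.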