From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
November 6, 2024
Authors: Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, Eric Horvitz
cs.AI
Abstract
Run-time steering strategies like Medprompt are valuable for guiding large
language models (LLMs) to top performance on challenging tasks. Medprompt
demonstrates that a general LLM can be focused to deliver state-of-the-art
performance on specialized domains like medicine by using a prompt to elicit a
run-time strategy involving chain of thought reasoning and ensembling. OpenAI's
o1-preview model represents a new paradigm, where a model is designed to do
run-time reasoning before generating final responses. We seek to understand the
behavior of o1-preview on a diverse set of medical challenge problem
benchmarks. Following on the Medprompt study with GPT-4, we systematically
evaluate the o1-preview model across various medical benchmarks. Notably, even
without prompting techniques, o1-preview largely outperforms the GPT-4 series
with Medprompt. We further systematically study the efficacy of classic prompt
engineering strategies, as represented by Medprompt, within the new paradigm of
reasoning models. We found that few-shot prompting hinders o1's performance,
suggesting that in-context learning may no longer be an effective steering
approach for reasoning-native models. While ensembling remains viable, it is
resource-intensive and requires careful cost-performance optimization. Our cost
and accuracy analysis across run-time strategies reveals a Pareto frontier,
with GPT-4o representing a more affordable option and o1-preview achieving
state-of-the-art performance at higher cost. Although o1-preview offers top
performance, GPT-4o with steering strategies like Medprompt retains value in
specific contexts. Moreover, we note that the o1-preview model has reached
near-saturation on many existing medical benchmarks, underscoring the need for
new, challenging benchmarks. We close with reflections on general directions
for inference-time computation with LLMs.
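
As an illustration of the kind of run-time steering strategy the abstract refers to, the sketch below combines chain-of-thought prompting with choice-shuffle ensembling and a majority vote, in the spirit of Medprompt. This is a minimal sketch, not the authors' implementation: query_llm is a hypothetical stand-in for any chat-completion client (e.g. a GPT-4o wrapper), and the prompt wording is assumed.

    # Minimal sketch of a Medprompt-style run-time strategy:
    # chain-of-thought prompting plus choice-shuffle ensembling
    # with a majority vote over k samples. `query_llm` is a
    # hypothetical stand-in for an LLM client and must be
    # supplied by the reader.
    import random
    from collections import Counter

    def query_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM client here")

    def answer_with_ensemble(question: str, choices: list[str], k: int = 5) -> str:
        """Ask the model k times with shuffled answer options and
        chain-of-thought instructions, then majority-vote."""
        votes = []
        for _ in range(k):
            shuffled = random.sample(choices, len(choices))
            options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(shuffled))
            prompt = (
                f"{question}\n{options}\n"
                "Think step by step, then give the letter of your final "
                "answer on the last line as 'Answer: <letter>'."
            )
            reply = query_llm(prompt)
            letter = reply.rsplit("Answer:", 1)[-1].strip()[:1].upper()
            idx = ord(letter) - 65
            if 0 <= idx < len(shuffled):
                # Map the letter back to the original choice text, so the
                # vote is invariant to how the options were shuffled.
                votes.append(shuffled[idx])
        if not votes:
            raise RuntimeError("no parseable answers returned")
        return Counter(votes).most_common(1)[0][0]

Shuffling the answer options across samples is what distinguishes this from plain self-consistency sampling: it also averages out position bias in multiple-choice questions.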
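The cost and accuracy analysis the abstract describes amounts to finding the Pareto frontier over (cost, accuracy) points for each run-time strategy. A minimal sketch follows; the strategy names and numbers are placeholders for illustration only, not results reported in the paper.

    # Sketch of a cost/accuracy Pareto-frontier computation across
    # run-time strategies. Entries are (name, cost_per_question_usd,
    # accuracy); the values below are illustrative placeholders.

    def pareto_frontier(points: list[tuple[str, float, float]]) -> list[tuple[str, float, float]]:
        """Keep only strategies not dominated by a cheaper-or-equal,
        more-accurate alternative (a standard skyline scan)."""
        frontier = []
        # Sort by cost ascending; break cost ties by accuracy descending,
        # so a dominated same-cost point is skipped by the check below.
        for name, cost, acc in sorted(points, key=lambda p: (p[1], -p[2])):
            if not frontier or acc > frontier[-1][2]:
                frontier.append((name, cost, acc))
        return frontier

    strategies = [  # placeholder numbers, for illustration only
        ("gpt-4o zero-shot", 0.01, 0.88),
        ("gpt-4o + Medprompt", 0.15, 0.91),
        ("o1-preview zero-shot", 0.30, 0.95),
        ("o1-preview ensemble", 1.50, 0.96),
    ]
    print(pareto_frontier(strategies))

Strategies surviving this scan are exactly those on the frontier the abstract mentions: each additional dollar of run-time compute must buy strictly higher accuracy, which is how a cheaper GPT-4o configuration and a costlier o1-preview configuration can both remain rational choices.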