From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
November 6, 2024
Authors: Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, Eric Horvitz
cs.AI
Abstract
Run-time steering strategies like Medprompt are valuable for guiding large
language models (LLMs) to top performance on challenging tasks. Medprompt
demonstrates that a general LLM can be focused to deliver state-of-the-art
performance on specialized domains like medicine by using a prompt to elicit a
run-time strategy involving chain of thought reasoning and ensembling. OpenAI's
o1-preview model represents a new paradigm, where a model is designed to do
run-time reasoning before generating final responses. We seek to understand the
behavior of o1-preview on a diverse set of medical challenge problem
benchmarks. Following on the Medprompt study with GPT-4, we systematically
evaluate the o1-preview model across various medical benchmarks. Notably, even
without prompting techniques, o1-preview largely outperforms the GPT-4 series
with Medprompt. We further systematically study the efficacy of classic prompt
engineering strategies, as represented by Medprompt, within the new paradigm of
reasoning models. We found that few-shot prompting hinders o1's performance,
suggesting that in-context learning may no longer be an effective steering
approach for reasoning-native models. While ensembling remains viable, it is
resource-intensive and requires careful cost-performance optimization. Our cost
and accuracy analysis across run-time strategies reveals a Pareto frontier,
with GPT-4o representing a more affordable option and o1-preview achieving
state-of-the-art performance at higher cost. Although o1-preview offers top
performance, GPT-4o with steering strategies like Medprompt retains value in
specific contexts. Moreover, we note that the o1-preview model has reached
near-saturation on many existing medical benchmarks, underscoring the need for
new, challenging benchmarks. We close with reflections on general directions
for inference-time computation with LLMs.
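
As an illustration of the kind of run-time steering strategy the abstract refers to, the sketch below combines chain-of-thought prompting with choice-shuffle ensembling and a majority vote, in the spirit of Medprompt. This is a minimal sketch, not the authors' implementation: query_llm is a hypothetical stand-in for any chat-completion client (e.g. a GPT-4o wrapper), and the prompt wording is assumed.

    # Minimal sketch of a Medprompt-style run-time strategy:
    # chain-of-thought prompting plus choice-shuffle ensembling
    # with a majority vote over k samples. `query_llm` is a
    # hypothetical stand-in for an LLM client and must be
    # supplied by the reader.
    import random
    from collections import Counter

    def query_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM client here")

    def answer_with_ensemble(question: str, choices: list[str], k: int = 5) -> str:
        """Ask the model k times with shuffled answer options and
        chain-of-thought instructions, then majority-vote."""
        votes = []
        for _ in range(k):
            shuffled = random.sample(choices, len(choices))
            options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(shuffled))
            prompt = (
                f"{question}\n{options}\n"
                "Think step by step, then give the letter of your final "
                "answer on the last line as 'Answer: <letter>'."
            )
            reply = query_llm(prompt)
            letter = reply.rsplit("Answer:", 1)[-1].strip()[:1].upper()
            idx = ord(letter) - 65
            if 0 <= idx < len(shuffled):
                # Map the letter back to the original choice text, so the
                # vote is invariant to how the options were shuffled.
                votes.append(shuffled[idx])
        if not votes:
            raise RuntimeError("no parseable answers returned")
        return Counter(votes).most_common(1)[0][0]

Shuffling the answer options across samples is what distinguishes this from plain self-consistency sampling: it also averages out position bias in multiple-choice questions.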
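The cost and accuracy analysis the abstract describes amounts to finding the Pareto frontier over (cost, accuracy) points for each run-time strategy. A minimal sketch follows; the strategy names and numbers are placeholders for illustration only, not results reported in the paper.

    # Sketch of a cost/accuracy Pareto-frontier computation across
    # run-time strategies. Entries are (name, cost_per_question_usd,
    # accuracy); the values below are illustrative placeholders.

    def pareto_frontier(points: list[tuple[str, float, float]]) -> list[tuple[str, float, float]]:
        """Keep only strategies not dominated by a cheaper-or-equal,
        more-accurate alternative (a standard skyline scan)."""
        frontier = []
        # Sort by cost ascending; break cost ties by accuracy descending,
        # so a dominated same-cost point is skipped by the check below.
        for name, cost, acc in sorted(points, key=lambda p: (p[1], -p[2])):
            if not frontier or acc > frontier[-1][2]:
                frontier.append((name, cost, acc))
        return frontier

    strategies = [  # placeholder numbers, for illustration only
        ("gpt-4o zero-shot", 0.01, 0.88),
        ("gpt-4o + Medprompt", 0.15, 0.91),
        ("o1-preview zero-shot", 0.30, 0.95),
        ("o1-preview ensemble", 1.50, 0.96),
    ]
    print(pareto_frontier(strategies))

Strategies surviving this scan are exactly those on the frontier the abstract mentions: each additional dollar of run-time compute must buy strictly higher accuracy, which is how a cheaper GPT-4o configuration and a costlier o1-preview configuration can both remain rational choices.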