From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

November 6, 2024
Authors: Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, Eric Horvitz
cs.AI

Abstract

Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain-of-thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Building on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt-engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We find that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, more challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.
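To make the ensembling strategy discussed in the abstract concrete, the sketch below implements its simplest form: sample several chain-of-thought completions at non-zero temperature and take a majority vote over the parsed final answers. This is a minimal illustration using the OpenAI Python client; the prompt wording, the "Answer: <letter>" convention, and the default of five samples are assumptions of this sketch, not details taken from the paper's Medprompt pipeline.

```python
import collections

from openai import OpenAI  # official OpenAI Python client (openai>=1.0)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ensemble_answer(question: str, n_samples: int = 5, model: str = "gpt-4o") -> str:
    """Sample several chain-of-thought completions and majority-vote the answers.

    Illustrative only: the prompt, answer format, and parsing here are
    assumptions for this sketch, not the paper's Medprompt pipeline.
    """
    votes: collections.Counter[str] = collections.Counter()
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model=model,
            temperature=1.0,  # sampling diversity is what makes the ensemble useful
            messages=[
                {
                    "role": "system",
                    "content": "Think step by step, then end with 'Answer: <letter>'.",
                },
                {"role": "user", "content": question},
            ],
        )
        text = resp.choices[0].message.content or ""
        # Naive parse of the final answer letter; a real pipeline is more robust.
        if "Answer:" in text:
            letter = text.rsplit("Answer:", 1)[-1].strip()[:1]
            if letter:
                votes[letter] += 1
    if not votes:
        raise ValueError("no parseable answers; inspect the raw completions")
    return votes.most_common(1)[0][0]
```

Note how the sample count directly sets the cost-accuracy trade-off: each extra sample adds roughly one model call's worth of cost, which is exactly the kind of knob the Pareto-frontier analysis in the abstract weighs against accuracy gains.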
