From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
November 6, 2024
Authors: Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, Eric Horvitz
cs.AI
Abstract
Run-time steering strategies like Medprompt are valuable for guiding large
language models (LLMs) to top performance on challenging tasks. Medprompt
demonstrates that a general LLM can be focused to deliver state-of-the-art
performance on specialized domains like medicine by using a prompt to elicit a
run-time strategy involving chain of thought reasoning and ensembling. OpenAI's
o1-preview model represents a new paradigm, where a model is designed to do
run-time reasoning before generating final responses. We seek to understand the
behavior of o1-preview on a diverse set of medical challenge problem
benchmarks. Following on the Medprompt study with GPT-4, we systematically
evaluate the o1-preview model across various medical benchmarks. Notably, even
without prompting techniques, o1-preview largely outperforms the GPT-4 series
with Medprompt. We further systematically study the efficacy of classic prompt
engineering strategies, as represented by Medprompt, within the new paradigm of
reasoning models. We found that few-shot prompting hinders o1-preview's performance,
suggesting that in-context learning may no longer be an effective steering
approach for reasoning-native models. While ensembling remains viable, it is
resource-intensive and requires careful cost-performance optimization. Our cost
and accuracy analysis across run-time strategies reveals a Pareto frontier,
with GPT-4o representing a more affordable option and o1-preview achieving
state-of-the-art performance at higher cost. Although o1-preview offers top
performance, GPT-4o with steering strategies like Medprompt retains value in
specific contexts. Moreover, we note that the o1-preview model has reached
near-saturation on many existing medical benchmarks, underscoring the need for
new, challenging benchmarks. We close with reflections on general directions
for inference-time computation with LLMs.
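To make the contrast between the two paradigms concrete, the sketch below illustrates a Medprompt-style run-time strategy in miniature: a chain-of-thought prompt combined with choice-shuffled ensembling and a majority vote. This is an illustrative sketch, not the paper's implementation; `ask_model` is a hypothetical stand-in for whatever LLM client is used, and the prompt template, label set, and vote rule are assumptions for the example.

```python
import random
import re
from collections import Counter


def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (not from the paper).

    Replace with your own API client; it should return the model's
    full text response for the given prompt.
    """
    raise NotImplementedError


# Assumed prompt template: elicit step-by-step reasoning, then a
# machine-parseable final line.
COT_TEMPLATE = (
    "Answer the following medical question. Reason step by step, then end "
    "with a single line of the form 'Answer: <letter>'.\n\n"
    "Question: {question}\nOptions:\n{options}"
)


def ensemble_answer(question: str, choices: list[str],
                    n_samples: int = 5, seed: int = 0) -> str:
    """Chain-of-thought prompting plus choice-shuffled ensembling.

    Each ensemble member sees the answer options in a different order
    (reducing position bias); the prediction is a majority vote over
    the option texts the members selected.
    """
    rng = random.Random(seed)
    votes: list[str] = []
    for _ in range(n_samples):
        shuffled = choices[:]
        rng.shuffle(shuffled)
        labels = "ABCDEFGH"[: len(shuffled)]
        options = "\n".join(f"{l}) {c}" for l, c in zip(labels, shuffled))
        reply = ask_model(COT_TEMPLATE.format(question=question, options=options))
        match = re.search(r"Answer:\s*([A-H])", reply)
        if match:
            # Map the shuffled label back to the underlying option text.
            votes.append(shuffled[labels.index(match.group(1))])
    if not votes:
        raise ValueError("no parseable answers returned")
    return Counter(votes).most_common(1)[0][0]
```

By contrast, the paper's finding on few-shot prompting suggests that a reasoning-native model like o1-preview is best queried with the bare question, since in-context exemplars can hurt rather than help. Note also that the ensembling loop multiplies per-question API cost by `n_samples`, which is the cost-accuracy trade-off underlying the reported Pareto frontier.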