From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

Abstract

Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain-of-thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, in which a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We find that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.
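Two ideas named in the abstract, ensembling and a cost-accuracy Pareto frontier, can be illustrated with a minimal sketch. This is not the authors' implementation: the `ask` callable, the `Config` type, the configuration names, and all cost and accuracy numbers are hypothetical placeholders, and plain majority voting stands in for whatever ensembling procedure the study actually uses.

```python
from collections import Counter
from typing import Callable, NamedTuple


def majority_vote(ask: Callable[[str], str], question: str, n_samples: int = 5) -> str:
    """Query a model n_samples times and return the most frequent answer.

    `ask` is a hypothetical callable wrapping a model call; each extra
    sample adds cost, which is why ensembling needs a cost-accuracy check.
    """
    answers = [ask(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


class Config(NamedTuple):
    name: str
    cost: float      # hypothetical cost per question (arbitrary units)
    accuracy: float  # hypothetical accuracy in [0, 1]


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Return the configurations that no cheaper-or-equal option beats on accuracy."""
    frontier: list[Config] = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, try higher accuracy first.
    for cfg in sorted(configs, key=lambda c: (c.cost, -c.accuracy)):
        if cfg.accuracy > best_accuracy:
            frontier.append(cfg)
            best_accuracy = cfg.accuracy
    return frontier


# Made-up numbers for illustration only; they are not results from the paper.
candidates = [
    Config("cheap-zero-shot", 1.0, 0.80),
    Config("cheap-ensemble", 5.0, 0.85),
    Config("wasteful", 6.0, 0.82),
    Config("expensive-zero-shot", 20.0, 0.95),
]
frontier_names = [cfg.name for cfg in pareto_frontier(candidates)]
```

In this toy data the "wasteful" configuration is dropped because a cheaper one is already more accurate. The remaining points trace the kind of trade-off the abstract describes: a cheaper option at lower accuracy and a more expensive one at the top.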
