

A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?

September 23, 2024
Authors: Yunfei Xie, Juncheng Wu, Haoqin Tu, Siwei Yang, Bingchen Zhao, Yongshuo Zong, Qiao Jin, Cihang Xie, Yuyin Zhou
cs.AI

Abstract

Large language models (LLMs) have exhibited remarkable capabilities across various domains and tasks, pushing the boundaries of our knowledge in learning and cognition. The latest model, OpenAI's o1, stands out as the first LLM with an internalized chain-of-thought technique using reinforcement learning strategies. While it has demonstrated surprisingly strong capabilities on various general language tasks, its performance in specialized fields such as medicine remains unknown. To this end, this report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality. Specifically, our evaluation encompasses 6 tasks using data from 37 medical datasets, including two newly constructed and more challenging question-answering (QA) tasks based on professional medical quizzes from the New England Journal of Medicine (NEJM) and The Lancet. These datasets offer greater clinical relevance compared to standard medical QA benchmarks such as MedQA, translating more effectively into real-world clinical utility. Our analysis of o1 suggests that the enhanced reasoning ability of LLMs may (significantly) benefit their capability to understand various medical instructions and reason through complex clinical scenarios. Notably, o1 surpasses the previous GPT-4 in accuracy by an average of 6.2% and 6.6% across 19 datasets and two newly created complex QA scenarios. But meanwhile, we identify several weaknesses in both the model capability and the existing evaluation protocols, including hallucination, inconsistent multilingual ability, and discrepant metrics for evaluation. We release our raw data and model outputs at https://ucsc-vlaa.github.io/o1_medicine/ for future research.
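To make the reported comparison concrete, below is a minimal sketch (not the authors' released code) of how an average-accuracy evaluation of this kind can be set up: each model answers every question in a collection of medical QA datasets, per-dataset accuracies are computed, and the per-model averages are compared. The dataset format, model-querying functions, and exact-match scoring rule are illustrative assumptions.

```python
# Hypothetical sketch of the dataset-averaged accuracy comparison described in the
# abstract. Dataset loading, model querying, and answer parsing are placeholders.
from typing import Callable, Dict, List


def accuracy(predictions: List[str], answers: List[str]) -> float:
    """Fraction of exact-match correct answers (assumed scoring rule)."""
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return correct / max(len(answers), 1)


def evaluate_models(
    datasets: Dict[str, List[dict]],          # name -> list of {"question", "answer"} items
    models: Dict[str, Callable[[str], str]],  # name -> function mapping a question to an answer
) -> Dict[str, float]:
    """Return each model's accuracy averaged over all datasets."""
    averages: Dict[str, float] = {}
    for model_name, ask in models.items():
        per_dataset = []
        for items in datasets.values():
            preds = [ask(item["question"]) for item in items]
            golds = [item["answer"] for item in items]
            per_dataset.append(accuracy(preds, golds))
        averages[model_name] = sum(per_dataset) / len(per_dataset)
    return averages

# Usage idea: averages["o1"] - averages["gpt-4"] would correspond to the average
# accuracy gap reported in the abstract (e.g., +6.2% across 19 datasets).
```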
