Language Models And A Second Opinion Use Case: The Pocket Professional

Abstract

This research tests the role of Large Language Models (LLMs) as formal second-opinion tools in professional decision-making, focusing on complex medical cases where even experienced physicians seek peer consultation. The work analyzed 183 challenging medical cases published on Medscape over a 20-month period and compared the performance of multiple LLMs against crowd-sourced physician responses. A key finding was the high overall score achievable with the latest foundation models (>80% accuracy relative to consensus opinion), which exceeds most human metrics reported on the same clinical cases (450 pages of patient profiles and test results). The study also measures the disparity in LLM performance between straightforward cases (>81% accuracy) and complex scenarios (43% accuracy), a gap that is most pronounced in cases that generate substantial debate among human physicians. The results suggest that LLMs may be more valuable as generators of comprehensive differential diagnoses than as primary diagnostic tools, potentially helping to counter cognitive biases in clinical decision-making, reduce cognitive load, and thus remove some sources of medical error. A second, comparative legal dataset (Supreme Court cases, N=21) provides added empirical context for the use of AI to foster second opinions, though these legal challenges proved considerably easier for LLMs to analyze. In addition to its empirical evidence on LLM accuracy, the research assembles a novel benchmark that others can use to score the reliability of answers to highly contested questions, for both LLMs and disagreeing human practitioners. These results suggest that the optimal deployment of LLMs in professional settings may differ substantially from current approaches that emphasize automation of routine tasks.
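The idea of scoring a model against consensus opinion, split by how contested a case is, can be illustrated with a minimal sketch. The abstract does not specify the scoring procedure, so everything here is an assumption made for illustration: the field names `physician_votes` and `model_answer`, the use of the plurality answer as "consensus", the 0.6 agreement threshold, and the toy data.

```python
from collections import Counter


def consensus(votes):
    """Return the plurality answer and the share of votes it received."""
    counts = Counter(votes)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(votes)


def accuracy_by_agreement(cases, threshold=0.6):
    """Score model answers against consensus, split by physician agreement.

    A case counts as "straightforward" when the plurality answer holds at
    least `threshold` of the votes (an assumed cutoff), else "contested".
    """
    buckets = {"straightforward": [], "contested": []}
    for case in cases:
        answer, share = consensus(case["physician_votes"])
        key = "straightforward" if share >= threshold else "contested"
        buckets[key].append(case["model_answer"] == answer)
    # Accuracy per bucket; None when a bucket has no cases.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}


# Hypothetical toy data, not taken from the study.
cases = [
    {"physician_votes": ["A", "A", "A", "B"], "model_answer": "A"},
    {"physician_votes": ["A", "B", "C", "B"], "model_answer": "C"},
    {"physician_votes": ["D", "D", "D", "D"], "model_answer": "D"},
]
print(accuracy_by_agreement(cases))
```

Note that in a contested case a model answer that differs from the plurality is counted as wrong here even though many physicians may agree with it, which is one reason a differential list may be a more useful output than a single diagnosis.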
