Question Answering on Patient Medical Records with Private Fine-Tuned LLMs
January 23, 2025
Authors: Sara Kothari, Ayush Gupta
cs.AI
Abstract
Healthcare systems continuously generate vast amounts of electronic health
records (EHRs), commonly stored in the Fast Healthcare Interoperability
Resources (FHIR) standard. Despite the wealth of information in these records,
their complexity and volume make it difficult for users to retrieve and
interpret crucial health insights. Recent advances in Large Language Models
(LLMs) offer a solution, enabling semantic question answering (QA) over medical
data, allowing users to interact with their health records more effectively.
However, ensuring privacy and compliance requires edge and private deployments
of LLMs.
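For context, a FHIR record is a typed JSON resource. A minimal sketch of reading one such resource follows; the field names match the FHIR Observation schema, but the concrete values are invented for illustration:

```python
import json

# A minimal, hand-written FHIR Observation resource (values are invented).
observation_json = """
{
  "resourceType": "Observation",
  "status": "final",
  "code": {"text": "Heart rate"},
  "valueQuantity": {"value": 72, "unit": "beats/minute"}
}
"""

def summarize(resource: dict) -> str:
    """Turn one FHIR Observation into a short human-readable line."""
    name = resource["code"]["text"]
    qty = resource["valueQuantity"]
    return f"{name}: {qty['value']} {qty['unit']}"

resource = json.loads(observation_json)
print(summarize(resource))  # Heart rate: 72 beats/minute
```

Real patient histories contain many such resources of many types, which is why retrieving the relevant ones before answering is a task in its own right.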
This paper proposes a novel approach to semantic QA over EHRs by first
identifying the most relevant FHIR resources for a user query (Task 1) and
subsequently answering the query based on these resources (Task 2). We explore
the performance of privately hosted, fine-tuned LLMs, evaluating them against
benchmark models such as GPT-4 and GPT-4o. Our results demonstrate that the
fine-tuned LLMs, while 250x smaller in size, outperform the GPT-4 family models
by 0.55% in F1 score on Task 1 and by 42% in METEOR score on Task 2.
Additionally, we examine advanced aspects of LLM usage, including sequential
fine-tuning, model self-evaluation (narcissistic evaluation), and the impact of
training data size on performance. The models and datasets are available here:
https://huggingface.co/genloop
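The two-task decomposition described above can be sketched as follows. The `select_resources` and `answer` callables are hypothetical stand-ins for calls to the fine-tuned models, which are not reproduced here; only the pipeline structure comes from the paper:

```python
from typing import Callable

def qa_pipeline(query: str,
                fhir_resources: list[dict],
                select_resources: Callable[[str, list[dict]], list[dict]],
                answer: Callable[[str, list[dict]], str]) -> str:
    """Two-stage semantic QA over FHIR records (structure only).

    Task 1: pick the resources relevant to the query.
    Task 2: answer the query grounded in those resources.
    In the paper, both stages are served by privately hosted fine-tuned LLMs.
    """
    relevant = select_resources(query, fhir_resources)   # Task 1
    return answer(query, relevant)                       # Task 2

# Toy stand-ins so the sketch runs end to end.
records = [
    {"resourceType": "Observation", "code": {"text": "Heart rate"},
     "valueQuantity": {"value": 72, "unit": "beats/minute"}},
    {"resourceType": "MedicationRequest", "medication": "aspirin"},
]
select = lambda q, rs: [r for r in rs if r["resourceType"] == "Observation"]
respond = lambda q, rs: f"Found {len(rs)} relevant resource(s) for: {q}"

print(qa_pipeline("What is my heart rate?", records, select, respond))
# Found 1 relevant resource(s) for: What is my heart rate?
```

Separating retrieval (scored with F1) from answering (scored with METEOR) lets each stage be fine-tuned and evaluated independently.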