RuCCoD: Towards Automated ICD Coding in Russian
February 28, 2025
Authors: Aleksandr Nesterov, Andrey Sakhovskiy, Ivan Sviridov, Airat Valiev, Vladimir Makharev, Petr Anokhin, Galina Zubkova, Elena Tutubalina
cs.AI
Abstract
This study investigates the feasibility of automating clinical coding in
Russian, a language with limited biomedical resources. We present a new dataset
for ICD coding, which includes diagnosis fields from electronic health records
(EHRs) annotated with over 10,000 entities and more than 1,500 unique ICD
codes. This dataset serves as a benchmark for several state-of-the-art models,
including BERT, LLaMA with LoRA, and RAG, with additional experiments examining
transfer learning across domains (from PubMed abstracts to medical diagnosis)
and terminologies (from UMLS concepts to ICD codes). We then apply the
best-performing model to label an in-house EHR dataset containing patient
histories from 2017 to 2021. Our experiments, conducted on a carefully curated
test set, demonstrate that training on automatically predicted codes leads to
a significant improvement in accuracy compared to training on codes manually
annotated by physicians. We believe our findings offer valuable insights into the potential
for automating clinical coding in resource-limited languages like Russian,
which could enhance clinical efficiency and data accuracy in these contexts.