PlainQAFact: Automatic Factuality Evaluation Metric for Biomedical Plain Language Summaries Generation
March 11, 2025
Authors: Zhiwen You, Yue Guo
cs.AI
Abstract
Hallucinated outputs from language models pose risks in the medical domain, especially for lay audiences making health-related decisions. Existing factuality evaluation methods, such as entailment- and question answering (QA)-based methods, struggle with plain language summary (PLS) generation due to the elaborative explanation phenomenon, which introduces external content (e.g., definitions, background, examples) absent from the source document to enhance comprehension. To address this, we introduce PlainQAFact, a framework trained on PlainFact, a fine-grained, human-annotated dataset, to evaluate the factuality of both source-simplified and elaboratively explained sentences. PlainQAFact first classifies the factuality type of each sentence and then assesses factuality using a retrieval-augmented, QA-based scoring method. Our approach is lightweight and computationally efficient. Empirical results show that existing factuality metrics fail to effectively evaluate factuality in PLS, especially for elaborative explanations, whereas PlainQAFact achieves state-of-the-art performance. We further analyze its effectiveness across external knowledge sources, answer extraction strategies, overlap measures, and document granularity levels, refining its overall factuality assessment.
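The abstract describes a two-stage pipeline: classify each summary sentence by factuality type, then score it with retrieval-augmented QA. The sketch below illustrates that flow only; every name, the token-overlap classifier, and the recall-style scorer are stand-in assumptions, not the paper's implementation, which trains a classifier on PlainFact and uses learned retrieval and QA components.

```python
# Hypothetical sketch of a PlainQAFact-style two-stage pipeline.
# All function bodies are illustrative stubs, not the paper's method.

from dataclasses import dataclass

@dataclass
class ScoredSentence:
    sentence: str
    factuality_type: str  # "simplification" or "elaboration" (assumed labels)
    score: float          # answer-overlap factuality score in [0, 1]

def classify_factuality_type(sentence: str, source_doc: str) -> str:
    """Stage 1 (stub): decide whether a summary sentence simplifies the
    source or elaborates with external content. The paper trains this
    classifier on PlainFact annotations; a naive token-overlap heuristic
    stands in here purely for illustration."""
    src_tokens = set(source_doc.lower().split())
    sent_tokens = set(sentence.lower().split())
    overlap = len(src_tokens & sent_tokens) / max(len(sent_tokens), 1)
    return "simplification" if overlap > 0.5 else "elaboration"

def retrieve_external_knowledge(sentence: str) -> str:
    """Stub retriever: elaborative sentences are augmented with passages
    from an external knowledge source before QA scoring."""
    return ""  # a real system would query a biomedical corpus

def qa_score(sentence: str, context: str) -> float:
    """Stage 2 (stub): the paper generates questions from the sentence,
    answers them against the context, and measures answer overlap. A
    crude token-recall proxy stands in for that QA step here."""
    ctx_tokens = set(context.lower().split())
    sent_tokens = sentence.lower().split()
    if not sent_tokens:
        return 0.0
    return sum(t in ctx_tokens for t in sent_tokens) / len(sent_tokens)

def plain_qa_fact(summary_sentences: list[str], source_doc: str) -> list[ScoredSentence]:
    """Classify each sentence, then score it against the appropriate context."""
    results = []
    for sent in summary_sentences:
        ftype = classify_factuality_type(sent, source_doc)
        # Elaborations are scored against source + retrieved knowledge,
        # since their content is, by definition, absent from the source.
        context = source_doc
        if ftype == "elaboration":
            context = source_doc + " " + retrieve_external_knowledge(sent)
        results.append(ScoredSentence(sent, ftype, qa_score(sent, context)))
    return results
```

The key design point reflected in the sketch is that the two sentence types get different scoring contexts: source-simplified sentences are checked against the source document alone, while elaborative explanations are checked against the source augmented with retrieved external knowledge, which is why standard source-only metrics underrate them.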