QE4PE: Word-level Quality Estimation for Human Post-Editing
March 4, 2025
Authors: Gabriele Sarti, Vilém Zouhar, Grzegorz Chrupała, Ana Guerberof-Arenas, Malvina Nissim, Arianna Bisazza
cs.AI
Abstract
Word-level quality estimation (QE) detects erroneous spans in machine
translations, which can direct and facilitate human post-editing. While the
accuracy of word-level QE systems has been assessed extensively, their
usability and downstream influence on the speed, quality and editing choices of
human post-editing remain understudied. Our QE4PE study investigates the impact
of word-level QE on machine translation (MT) post-editing in a realistic
setting involving 42 professional post-editors across two translation
directions. We compare four error-span highlight modalities, including
supervised and uncertainty-based word-level QE methods, for identifying
potential errors in the outputs of a state-of-the-art neural MT model.
Post-editing effort and productivity are estimated by behavioral logs, while
quality improvements are assessed by word- and segment-level human annotation.
We find that domain, language and editors' speed are critical factors in
determining highlights' effectiveness, with modest differences between
human-made and automated QE highlights underlining a gap between accuracy and
usability in professional workflows.