The Mirage of Model Editing: Revisiting Evaluation in the Wild
February 16, 2025
Authors: Wanli Yang, Fei Sun, Jiajun Tan, Xinyu Ma, Qi Cao, Dawei Yin, Huawei Shen, Xueqi Cheng
cs.AI
Abstract
Despite near-perfect results in artificial evaluations, the effectiveness of
model editing in real-world applications remains unexplored. To bridge this
gap, we propose to study model editing in question answering (QA) by
establishing a rigorous evaluation practice to assess the effectiveness of
editing methods in correcting LLMs' errors. It consists of QAEdit, a new
benchmark derived from popular QA datasets, and a standardized evaluation
framework. Our single editing experiments indicate that current editing methods
perform substantially worse than previously reported (38.5% vs. ~96%). Through
module analysis and controlled experiments, we demonstrate that this
performance decline stems from issues in evaluation practices of prior editing
research. One key issue is the inappropriate use of teacher forcing in testing
prevents error propagation by feeding ground truth tokens (inaccessible in
real-world scenarios) as input. Furthermore, we simulate real-world deployment
by sequential editing, revealing that current approaches fail drastically with
only 1000 edits. Our analysis provides a fundamental reexamination of both the
real-world applicability of existing model editing methods and their evaluation
practices, and establishes a rigorous evaluation framework with key insights to
advance reliable and practical model editing research.
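
To make the teacher-forcing issue concrete, here is a minimal sketch contrasting the two evaluation protocols, assuming a HuggingFace-style causal LM. The function names, the containment-style answer check, and the 32-token generation budget are illustrative assumptions, not details taken from the paper or QAEdit.

```python
# Minimal sketch: teacher-forced vs. autoregressive evaluation of an (edited) LM.
# Assumes a HuggingFace-style causal LM; identifiers here are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def accuracy_teacher_forced(model, tokenizer, prompt, target):
    """Teacher-forced scoring: every step is conditioned on the *gold* prefix,
    so an early wrong token cannot derail later ones (overly optimistic)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, add_special_tokens=False,
                           return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Read off predictions at the target positions, conditioned on gold tokens.
    preds = logits[0, prompt_ids.shape[-1] - 1:-1].argmax(dim=-1)
    return (preds == target_ids[0]).float().mean().item()


def accuracy_autoregressive(model, tokenizer, prompt, target):
    """Deployment-style scoring: the model generates freely, so its own
    mistakes propagate exactly as they would for a real user query."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(prompt_ids, max_new_tokens=32, do_sample=False)
    answer = tokenizer.decode(out[0, prompt_ids.shape[-1]:],
                              skip_special_tokens=True)
    # Simple containment check stands in for an exact-match style metric.
    return float(target.strip().lower() in answer.lower())


if __name__ == "__main__":
    # Model choice is illustrative; in practice this would be the edited LLM.
    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2")
    print(accuracy_teacher_forced(lm, tok, "The capital of France is", " Paris"))
    print(accuracy_autoregressive(lm, tok, "The capital of France is", "Paris"))
```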
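The sequential-editing stress test described in the abstract can be pictured as the loop below: edits accumulate in a single model, and all previously inserted facts are periodically re-checked with the realistic, free-running metric. Here `apply_edit` is a hypothetical stand-in for any editing method (e.g., a ROME/MEMIT-style update), and the re-evaluation interval of 100 is an arbitrary illustrative choice.

```python
def sequential_editing_eval(model, tokenizer, edits, apply_edit, evaluate):
    """Apply edits one by one to the same model and track how well all
    previously inserted facts survive as the edit count grows.

    edits:      list of (prompt, target_answer) pairs to inject
    apply_edit: hypothetical editing routine, returns the edited model
    evaluate:   per-example scorer, e.g. accuracy_autoregressive above
    """
    history = []
    for i, (prompt, target) in enumerate(edits, start=1):
        model = apply_edit(model, tokenizer, prompt, target)  # edits accumulate
        if i % 100 == 0 or i == len(edits):
            # Re-test every fact edited so far with free-running generation,
            # mirroring how an edited model would actually be queried.
            acc = sum(evaluate(model, tokenizer, p, t) for p, t in edits[:i]) / i
            history.append((i, acc))
    return history
```

This cumulative loop is the setting in which the abstract reports drastic failures within only 1000 edits.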