
Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs

March 4, 2025
Authors: Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, Kai Chen
cs.AI

Abstract

Large language models (LLMs) exhibit hallucinations (i.e., unfaithful or nonsensical information) when serving as AI assistants in various domains. Since hallucinations always come with truthful content in LLM responses, previous factuality alignment methods that conduct response-level preference learning inevitably introduce noise during training. Therefore, this paper proposes a fine-grained factuality alignment method based on Direct Preference Optimization (DPO), called Mask-DPO. Incorporating sentence-level factuality as mask signals, Mask-DPO learns only from factually correct sentences in the preferred samples and avoids penalizing factual content in the non-preferred samples, which resolves the ambiguity in preference learning. Extensive experimental results demonstrate that Mask-DPO can significantly improve the factuality of LLM responses to questions from both in-domain and out-of-domain datasets, even though these questions and their corresponding topics are unseen during training. Trained only on the ANAH training set, Llama3.1-8B-Instruct improves its score on the ANAH test set from 49.19% to 77.53%, even surpassing Llama3.1-70B-Instruct (53.44%), while its FactScore on the out-of-domain Biography dataset also improves from 30.29% to 39.39%. We further study the generalization property of Mask-DPO under different training-sample scaling strategies and find that scaling the number of topics in the dataset is more effective than scaling the number of questions. We propose a hypothesis about what factuality alignment does to LLMs, discuss the implications of this phenomenon, and conduct proof-of-concept experiments to verify it. We hope the method and findings pave the way for future research on scaling factuality alignment.
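
To make the masking idea concrete, the sketch below shows one way sentence-level factuality masks could be folded into a DPO-style loss, following the abstract's description: only tokens from factually correct sentences in the preferred response contribute on the chosen side, and tokens from factually correct sentences in the non-preferred response are excluded from the rejected side so they are not penalized. The tensor names, shapes, and masking convention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def masked_dpo_loss(
    policy_chosen_logps,    # (B, T) per-token log-probs of preferred responses under the policy
    policy_rejected_logps,  # (B, T) per-token log-probs of non-preferred responses under the policy
    ref_chosen_logps,       # (B, T) same, under the frozen reference model
    ref_rejected_logps,     # (B, T)
    chosen_fact_mask,       # (B, T) 1.0 for tokens in factually correct sentences of the preferred response
    rejected_fact_mask,     # (B, T) 1.0 for tokens in factually correct sentences of the non-preferred response
    beta: float = 0.1,
):
    """DPO loss where sentence-level factuality masks select which tokens contribute.

    Preferred side: keep only tokens from factually correct sentences.
    Non-preferred side: drop tokens from factually correct sentences, so factual
    content in the rejected response is not penalized.
    """
    chosen_logratio = (policy_chosen_logps - ref_chosen_logps) * chosen_fact_mask
    rejected_logratio = (policy_rejected_logps - ref_rejected_logps) * (1.0 - rejected_fact_mask)

    # Aggregate the masked per-token log-ratios over the sequence dimension.
    chosen_rewards = beta * chosen_logratio.sum(dim=-1)
    rejected_rewards = beta * rejected_logratio.sum(dim=-1)

    # Standard Bradley-Terry style DPO objective on the masked rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```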

