How to Steer LLM Latents for Hallucination Detection?
March 1, 2025
Authors: Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, Yixuan Li
cs.AI
Abstract
Hallucinations in LLMs pose a significant challenge to their safe deployment in
real-world applications. Recent approaches have leveraged the latent space of
LLMs for hallucination detection, but their embeddings, optimized for
linguistic coherence rather than factual accuracy, often fail to clearly
separate truthful and hallucinated content. To this end, we propose the
Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector
that reshapes the LLM's representation space during inference to enhance the
separation between truthful and hallucinated outputs, without altering model
parameters. Our two-stage framework first trains TSV on a small set of labeled
exemplars to form compact and well-separated clusters. It then augments the
exemplar set with unlabeled LLM generations, employing an optimal
transport-based algorithm for pseudo-labeling combined with a confidence-based
filtering process. Extensive experiments demonstrate that TSV achieves
state-of-the-art performance with minimal labeled data, exhibiting strong
generalization across datasets and providing a practical solution for
real-world LLM applications.
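
The second stage described in the abstract (optimal transport-based pseudo-labeling followed by confidence-based filtering) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not name the solver, so Sinkhorn iterations are assumed here, and the cosine-similarity scoring against class prototypes, the function names (`cosine_logits`, `sinkhorn_pseudo_labels`, `confident_subset`), and the values of `epsilon`, `n_iters`, and `threshold` are all hypothetical choices made for illustration.

```python
import numpy as np


def cosine_logits(embeddings, prototypes):
    """Cosine similarity between each embedding and each class prototype.

    Assumption: prototypes are, e.g., class means of the labeled exemplars.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return e @ p.T  # shape (n_samples, n_classes), values in [-1, 1]


def sinkhorn_pseudo_labels(logits, class_prior, epsilon=0.05, n_iters=50):
    """Soft-assign samples to classes with entropic optimal transport.

    Each sample carries mass 1/n and class k should receive a total mass of
    class_prior[k]. Sinkhorn iterations (an assumed solver) alternately
    rescale columns and rows toward these marginals.
    """
    n, _ = logits.shape
    # Gibbs kernel; subtracting the row max only rescales rows, which the
    # row normalization below undoes, and keeps the exponentials bounded.
    q = np.exp((logits - logits.max(axis=1, keepdims=True)) / epsilon)
    row_marginal = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        q *= (class_prior / q.sum(axis=0))[None, :]   # column marginals
        q *= (row_marginal / q.sum(axis=1))[:, None]  # row marginals
    return q * n  # rescale so that every row sums to 1


def confident_subset(soft_labels, threshold=0.9):
    """Keep only samples whose largest soft-label probability passes the threshold."""
    keep = soft_labels.max(axis=1) >= threshold
    return np.flatnonzero(keep), soft_labels.argmax(axis=1)[keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for LLM embeddings: two noisy groups in 2-D.
    unlabeled = np.vstack([
        rng.normal([3.0, 0.0], 0.3, size=(20, 2)),
        rng.normal([0.0, 3.0], 0.3, size=(20, 2)),
    ])
    prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])  # hypothetical class centers
    soft = sinkhorn_pseudo_labels(
        cosine_logits(unlabeled, prototypes), class_prior=np.array([0.5, 0.5])
    )
    indices, labels = confident_subset(soft, threshold=0.9)
    print(f"kept {len(indices)} of {len(unlabeled)} samples as pseudo-labeled exemplars")
```

In a pipeline like the one the abstract outlines, the retained samples and their pseudo-labels would be added to the exemplar set; the class prior and the confidence threshold are the two knobs in this sketch that control how aggressively unlabeled generations are admitted.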