How to Steer LLM Latents for Hallucination Detection?

Abstract

Hallucinations in LLMs pose a significant challenge to their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but LLM embeddings are optimized for linguistic coherence rather than factual accuracy, so they often fail to clearly separate truthful and hallucinated content. To address this, we propose the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM's representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters. Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters. It then augments the exemplar set with unlabeled LLM generations, using an optimal-transport-based algorithm for pseudo-labeling combined with confidence-based filtering. Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, generalizes well across datasets, and provides a practical solution for real-world LLM applications.
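The abstract gives no formulas, so the following is only an illustrative sketch of the three ideas it names: adding a steering vector to a hidden state, assigning pseudo-labels with entropic optimal transport (Sinkhorn iterations), and filtering by confidence. Everything concrete here is an assumption rather than the paper's method: the function names, the use of cosine similarity to class prototypes, the `class_prior` marginal, the temperature `eps`, and the `threshold` value are all hypothetical choices made for illustration.

```python
import math

def steer(hidden, tsv, scale=1.0):
    """Add a steering vector to a hidden state; model weights are untouched."""
    return [h + scale * v for h, v in zip(hidden, tsv)]

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def sinkhorn_pseudo_labels(embeddings, prototypes, class_prior, eps=0.1, iters=50):
    """Soft class assignments from an entropic optimal-transport plan.

    Assumed setup: each sample carries mass 1/n, and the total mass sent to
    class j is pushed toward class_prior[j].
    """
    n, k = len(embeddings), len(prototypes)
    # Gibbs kernel: higher similarity gives a larger kernel value.
    K = [[math.exp(cosine(e, p) / eps) for p in prototypes] for e in embeddings]
    v = [1.0] * k
    for _ in range(iters):
        # Alternate scaling of rows (samples) and columns (classes).
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [class_prior[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(k)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(k)] for i in range(n)]
    # Normalise each row into a probability distribution over classes.
    return [[p / sum(row) for p in row] for row in plan]

def confident(soft_labels, threshold=0.9):
    """Keep (sample index, label) pairs whose top probability clears the threshold."""
    kept = []
    for i, row in enumerate(soft_labels):
        label = max(range(len(row)), key=row.__getitem__)
        if row[label] >= threshold:
            kept.append((i, label))
    return kept

# Toy 2-D "embeddings": class 0 = truthful, class 1 = hallucinated (hypothetical data).
prototypes = [[1.0, 0.0], [0.0, 1.0]]
samples = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8], [1.0, 1.0]]
soft = sinkhorn_pseudo_labels(samples, prototypes, class_prior=[0.5, 0.5])
print(confident(soft))
```

In this toy setup the last sample sits equally close to both prototypes, so its soft label stays near 0.5 and the confidence filter drops it. Only confidently labeled samples would be added to the exemplar set. In a real pipeline the embeddings would come from the steered LLM's hidden states rather than hand-written vectors.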
