Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders

March 5, 2025
Authors: Kristian Kuznetsov, Laida Kushnareva, Polina Druzhinina, Anton Razzhigaev, Anastasia Voznyuk, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov
cs.AI

Abstract

Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAEs) to extract features from the residual stream of Gemma-2-2b. We identify features that are both interpretable and efficient, and analyze their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts.
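
To make the idea of extracting SAE features from a residual stream concrete, here is a minimal sketch in plain Python. It assumes a basic ReLU sparse autoencoder with tiny hand-written toy weights; the names `sae_encode`, `sae_decode`, `mean_feature_activations`, `W_ENC`, `W_DEC` and the numbers are hypothetical and are not taken from the paper, whose actual SAE architecture and weights may differ.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Map a residual-stream vector to sparse, non-negative feature activations."""
    pre = matvec(w_enc, x)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]  # ReLU


def sae_decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruct the residual-stream vector from feature activations."""
    return [r + b for r, b in zip(matvec(w_dec, f), b_dec)]


def mean_feature_activations(
    token_vectors: List[Vector], w_enc: Matrix, b_enc: Vector
) -> Vector:
    """Average each feature over all tokens of a text (one simple pooling choice)."""
    feats = [sae_encode(x, w_enc, b_enc) for x in token_vectors]
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]


# Toy (hypothetical) weights: model width 2, dictionary of 3 features.
W_ENC: Matrix = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B_ENC: Vector = [0.0, 0.0, 0.0]
W_DEC: Matrix = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]
B_DEC: Vector = [0.0, 0.0]

if __name__ == "__main__":
    text_tokens = [[1.0, 2.0], [-1.0, -1.0]]  # stand-in for per-token activations
    print(mean_feature_activations(text_tokens, W_ENC, B_ENC))
```

In a detection setting, a pooled feature vector like this could be fed to a simple classifier, and individual features could then be inspected one at a time; that pooling-and-classify step is an illustrative assumption rather than a description of the authors' pipeline.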
