Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
March 11, 2025
Authors: Parishad BehnamGhader, Nicholas Meade, Siva Reddy
cs.AI
Abstract
Instruction-following retrievers have been widely adopted alongside LLMs in
real-world applications, but little work has investigated the safety risks
surrounding their increasing search capabilities. We empirically study the
ability of retrievers to satisfy malicious queries, both when used directly and
when used in a retrieval-augmented generation (RAG) based setup. Concretely, we
investigate six leading retrievers, including NV-Embed and LLM2Vec, and find
that given malicious requests, most retrievers can (for >50% of queries) select
relevant harmful passages. For example, LLM2Vec correctly selects passages for
61.35% of our malicious queries. We further uncover an emerging risk with
instruction-following retrievers, where highly relevant harmful information can
be surfaced by exploiting their instruction-following capabilities. Finally, we
show that even safety-aligned LLMs, such as Llama3, can satisfy malicious
requests when provided with harmful retrieved passages in-context. In summary,
our findings underscore the malicious misuse risks associated with increasing
retriever capability.
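To make the setup described in the abstract concrete, below is a minimal, illustrative sketch of instruction-conditioned dense retrieval feeding a RAG prompt: an instruction string is prepended to the query before embedding, passages are ranked by cosine similarity, and the top-ranked passages are placed in-context for a downstream LLM. This is not the paper's actual pipeline or the API of NV-Embed, LLM2Vec, or Llama3; the `embed` stand-in, the example instruction, and all identifiers are hypothetical placeholders.

```python
import numpy as np

def embed(text: str, dim: int = 384) -> np.ndarray:
    """Stand-in for a dense retriever encoder (e.g., an instruction-following
    embedding model). Here we just derive a pseudo-random unit vector from the
    text so the sketch runs end-to-end without any model weights."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve(query: str, passages: list[str], instruction: str = "", k: int = 3):
    """Score passages by cosine similarity against the (optionally
    instruction-prefixed) query embedding and return the top-k."""
    q = embed(f"{instruction} {query}".strip())
    scored = [(float(q @ embed(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def build_rag_prompt(query: str, retrieved: list[tuple[float, str]]) -> str:
    """Place the retrieved passages in-context ahead of the user query,
    as in a standard RAG setup."""
    context = "\n\n".join(p for _, p in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

if __name__ == "__main__":
    corpus = [
        "passage about topic A",
        "passage about topic B",
        "passage about topic C",
    ]
    # The instruction steers which passages the retriever surfaces; the paper
    # studies how this capability can be exploited with harmful requests.
    top = retrieve("example query", corpus, instruction="Retrieve step-by-step instructions.")
    print(build_rag_prompt("example query", top))
```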