NoLiMa: Long-Context Evaluation Beyond Literal Matching
February 7, 2025
Authors: Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze
cs.AI
Abstract
Recent large language models (LLMs) support long contexts ranging from 128K
to 1M tokens. A popular method for evaluating these capabilities is the
needle-in-a-haystack (NIAH) test, which involves retrieving a "needle"
(relevant information) from a "haystack" (long irrelevant context). Extensions
of this approach include increasing distractors, fact chaining, and in-context
reasoning. However, in these benchmarks, models can exploit existing literal
matches between the needle and haystack to simplify the task. To address this,
we introduce NoLiMa, a benchmark extending NIAH with a carefully designed
needle set, where questions and needles have minimal lexical overlap, requiring
models to infer latent associations to locate the needle within the haystack.
We evaluate 12 popular LLMs that claim to support contexts of at least 128K
tokens. While they perform well in short contexts (<1K), performance degrades
significantly as context length increases. At 32K, for instance, 10 models drop
below 50% of their strong short-length baselines. Even GPT-4o, one of the
top-performing exceptions, experiences a reduction from an almost-perfect
baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the
increased difficulty the attention mechanism faces in longer contexts when
literal matches are absent, making it harder to retrieve relevant information.
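The distinction the abstract draws between classic NIAH needles and NoLiMa's needles can be sketched in a few lines. The snippet below is an illustrative sketch, not the authors' code: the `lexical_overlap` helper, the stopword list, and the example question/needle strings are assumptions chosen to mirror the paper's idea that a NoLiMa needle shares almost no content words with the question and instead requires a latent association (here, knowing that the Semper Opera House is in Dresden).

```python
def lexical_overlap(question: str, needle: str) -> float:
    """Fraction of content words in the question that also appear in the needle.

    A crude proxy for the 'literal match' signal a model could exploit;
    the stopword list here is a minimal, hypothetical choice.
    """
    stop = {"the", "a", "an", "in", "of", "which", "who", "to", "has", "is"}
    q = {w.strip("?.,").lower() for w in question.split()} - stop
    n = {w.strip("?.,").lower() for w in needle.split()} - stop
    return len(q & n) / len(q) if q else 0.0


question = "Which character has been to Dresden?"

# Classic NIAH-style needle: the answer word "Dresden" appears verbatim,
# so surface matching alone can locate it in a long haystack.
niah_needle = "Actually, Yuki has been to Dresden."

# NoLiMa-style needle: no content-word overlap with the question; the model
# must infer that the Semper Opera House is located in Dresden.
nolima_needle = "Actually, Yuki lives next to the Semper Opera House."

print(lexical_overlap(question, niah_needle))    # substantial overlap
print(lexical_overlap(question, nolima_needle))  # zero overlap
```

This is the core design choice the benchmark makes: by driving the question–needle overlap toward zero, retrieval can no longer ride on attention to repeated tokens, which is exactly where the abstract reports long-context performance collapsing.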