LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Abstract

This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems that require deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure breadth and practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty; as a result, human experts achieve only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when answering the questions directly, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4 percentage points. These results highlight the importance of enhanced reasoning ability and of scaling inference-time compute to tackle the long-context challenges in LongBench v2. The project is available at https://longbench2.github.io.
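To make the reported figures concrete, the sketch below shows one way accuracy on a multiple-choice benchmark could be scored and compared against a human baseline. It is an illustration only, not the authors' evaluation code: the function names `accuracy` and `gap_in_points` and the toy answers are hypothetical. The two percentages are the ones quoted in the abstract.

```python
def accuracy(predictions, answers):
    """Return the percentage of predictions that match the gold answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty question set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def gap_in_points(model_acc, baseline_acc):
    """Difference between two accuracies, in percentage points."""
    return round(model_acc - baseline_acc, 1)


# Toy data, invented for illustration (not taken from LongBench v2).
toy_predictions = ["A", "C", "B", "D"]
toy_answers = ["A", "B", "B", "D"]
toy_accuracy = accuracy(toy_predictions, toy_answers)  # 3 of 4 correct

# Figures quoted in the abstract.
O1_PREVIEW_ACC = 57.7
HUMAN_ACC = 53.7
gap = gap_in_points(O1_PREVIEW_ACC, HUMAN_ACC)
```

Under these assumptions the gap is 4.0 percentage points, which is roughly 7% in relative terms (4.0 / 53.7). Stating the unit avoids confusing an absolute difference with a relative one.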
