CLIPPER: Compression enables long-context synthetic data generation

February 20, 2025
作者: Chau Minh Pham, Yapei Chang, Mohit Iyyer
cs.AI

Abstract

LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification: a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw text of the book, which results in artifact-riddled claims, CLIPPER first compresses the book into chapter outlines and book summaries and then uses these intermediate representations to generate complex claims and corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces claims that are more valid, grounded, and complex. Using CLIPPER, we construct a dataset of 19K synthetic book claims paired with their source texts and chain-of-thought reasoning, and use it to fine-tune three open-weight models. Our best model achieves breakthrough results on narrative claim verification (from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for sub-10B models on the NoCha leaderboard. Further analysis shows that our models generate more detailed and grounded chain-of-thought reasoning while also improving performance on other narrative understanding tasks (e.g., NarrativeQA).
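The compression-then-generation pipeline described above can be sketched as follows. This is a minimal illustration only: the `llm` stub, the function names, and the prompt wording are assumptions for exposition, not the paper's actual implementation.

```python
# Sketch of a two-stage compress-then-generate pipeline (hypothetical, not
# the authors' code). Stage 1 compresses raw book text into chapter outlines
# and a book summary; stage 2 generates claims and chain-of-thought from
# those intermediate representations rather than the raw text.

def llm(prompt: str) -> str:
    """Placeholder for an instruction-following model call.

    Replace with a real LLM client in practice; stubbed here so the
    control flow is runnable end to end.
    """
    return f"[model output for: {prompt[:40]}...]"

def compress_book(chapters: list[str]) -> dict:
    """Stage 1: compress each chapter into an outline, then summarize."""
    outlines = [
        llm(f"Outline the key events in this chapter:\n{ch}")
        for ch in chapters
    ]
    summary = llm("Summarize the book from these outlines:\n" + "\n".join(outlines))
    return {"outlines": outlines, "summary": summary}

def generate_claim(compressed: dict, label: str) -> dict:
    """Stage 2: generate a claim with the given label (e.g. TRUE/FALSE)
    plus chain-of-thought, grounded in the compressed representations."""
    prompt = (
        f"Book summary:\n{compressed['summary']}\n"
        "Chapter outlines:\n" + "\n".join(compressed["outlines"]) + "\n"
        f"Write a complex {label} claim about the book and explain your reasoning."
    )
    return {"claim_and_cot": llm(prompt), "label": label}

# Example usage with placeholder chapter text.
book = ["Chapter 1 text...", "Chapter 2 text..."]
compressed = compress_book(book)
example = generate_claim(compressed, "FALSE")
```

The key design choice the paper argues for is that stage 2 never sees the raw book text, only the compressed outlines and summary, which is what reduces surface-level artifacts in the generated claims.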

