CLIPPER: Compression enables long-context synthetic data generation

February 20, 2025
作者: Chau Minh Pham, Yapei Chang, Mohit Iyyer
cs.AI

Abstract

LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification: a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw text of the book, which results in artifact-riddled claims, CLIPPER first compresses the book into chapter outlines and book summaries and then uses these intermediate representations to generate complex claims and corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces claims that are more valid, grounded, and complex. Using CLIPPER, we construct a dataset of 19K synthetic book claims paired with their source texts and chain-of-thought reasoning, and use it to fine-tune three open-weight models. Our best model achieves breakthrough results on narrative claim verification (from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for sub-10B models on the NoCha leaderboard. Further analysis shows that our models generate more detailed and grounded chain-of-thought reasoning while also improving performance on other narrative understanding tasks (e.g., NarrativeQA).
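The compression-then-generation pipeline described above can be sketched as follows. This is a minimal illustration only: the `llm` stub, the function names, and the prompt wording are assumptions for exposition, not the paper's actual implementation.

```python
# Sketch of a two-stage compress-then-generate pipeline (hypothetical, not
# the authors' code). Stage 1 compresses raw book text into chapter outlines
# and a book summary; stage 2 generates claims and chain-of-thought from
# those intermediate representations rather than the raw text.

def llm(prompt: str) -> str:
    """Placeholder for an instruction-following model call.

    Replace with a real LLM client in practice; stubbed here so the
    control flow is runnable end to end.
    """
    return f"[model output for: {prompt[:40]}...]"

def compress_book(chapters: list[str]) -> dict:
    """Stage 1: compress each chapter into an outline, then summarize."""
    outlines = [
        llm(f"Outline the key events in this chapter:\n{ch}")
        for ch in chapters
    ]
    summary = llm("Summarize the book from these outlines:\n" + "\n".join(outlines))
    return {"outlines": outlines, "summary": summary}

def generate_claim(compressed: dict, label: str) -> dict:
    """Stage 2: generate a claim with the given label (e.g. TRUE/FALSE)
    plus chain-of-thought, grounded in the compressed representations."""
    prompt = (
        f"Book summary:\n{compressed['summary']}\n"
        "Chapter outlines:\n" + "\n".join(compressed["outlines"]) + "\n"
        f"Write a complex {label} claim about the book and explain your reasoning."
    )
    return {"claim_and_cot": llm(prompt), "label": label}

# Example usage with placeholder chapter text.
book = ["Chapter 1 text...", "Chapter 2 text..."]
compressed = compress_book(book)
example = generate_claim(compressed, "FALSE")
```

The key design choice the paper argues for is that stage 2 never sees the raw book text, only the compressed outlines and summary, which is what reduces surface-level artifacts in the generated claims.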

