How to Get Your LLM to Generate Challenging Problems for Evaluation
February 20, 2025
Authors: Arkil Patel, Siva Reddy, Dzmitry Bahdanau
cs.AI
Abstract
The pace of evolution of Large Language Models (LLMs) necessitates new
approaches for rigorous and comprehensive evaluation. Traditional human
annotation is increasingly impracticable due to the complexities and costs
involved in generating high-quality, challenging problems. In this work, we
introduce CHASE, a unified framework to synthetically generate challenging
problems using LLMs without human involvement. For a given task, our approach
builds a hard problem in a bottom-up manner from simpler components. Moreover,
our framework decomposes the generation process into independently verifiable
sub-tasks, thereby ensuring a high level of quality and correctness. We
implement CHASE to create evaluation benchmarks across three diverse domains:
(1) document-based question answering, (2) repository-level code completion,
and (3) math reasoning. The performance of state-of-the-art LLMs on these
synthetic benchmarks lies in the range of 40-60% accuracy, thereby
demonstrating the effectiveness of our framework at generating challenging
problems. We publicly release our benchmarks and code.Summary
AI-Generated Summary
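To make the bottom-up, verifiable-generation idea concrete, below is a minimal illustrative sketch in Python of the kind of pipeline the abstract describes: simple components are generated, composed into a harder problem, and accepted only if every independent sub-task check passes. All names here (Component, build_hard_problem, the toy generators and verifiers) are hypothetical placeholders for illustration, not the released CHASE code or API.

# Illustrative sketch only: hypothetical names, not the CHASE implementation.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Component:
    """A simple building block (e.g., a fact, a code snippet, or a sub-equation)."""
    text: str


def build_hard_problem(
    generate_component: Callable[[int], Component],  # stand-in for an LLM call
    compose: Callable[[List[Component]], str],       # stand-in for an LLM call
    verifiers: List[Callable[[str], bool]],          # independently checkable sub-tasks
    n_components: int = 3,
) -> Optional[str]:
    """Bottom-up: generate simple components, compose them into a harder
    problem, and keep it only if every independent verification passes."""
    components = [generate_component(i) for i in range(n_components)]
    problem = compose(components)
    if all(verify(problem) for verify in verifiers):
        return problem
    return None  # a real pipeline would regenerate or repair here


if __name__ == "__main__":
    # Toy stand-ins: numeric facts composed into a single question.
    facts = lambda i: Component(f"x{i} = {i + 2}")
    compose = lambda cs: ("Given " + ", ".join(c.text for c in cs)
                          + ", compute the product of all x values.")
    checks = [lambda p: "x0" in p, lambda p: p.endswith("values.")]
    print(build_hard_problem(facts, compose, checks))

In this toy run the composed question is returned only because both checks pass; splitting generation from independent verification in this way is what the paper credits for the quality and correctness of the synthesized problems.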