How to Get Your LLM to Generate Challenging Problems for Evaluation

Abstract

The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: (1) document-based question answering, (2) repository-level code completion, and (3) math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating challenging problems. We publicly release our benchmarks and code.
