HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models

Abstract

In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks (e.g., long-context understanding), and many benchmarks have been proposed. However, we observe that long text generation capabilities have not been well investigated. Therefore, we introduce the Hierarchical Long Text Generation Benchmark (HelloBench), a comprehensive, in-the-wild, and open-ended benchmark for evaluating LLMs' performance in generating long text. Based on Bloom's Taxonomy, HelloBench categorizes long text generation tasks into five subtasks: open-ended QA, summarization, chat, text completion, and heuristic text generation. In addition, we propose Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation method that significantly reduces the time and effort required for human evaluation while maintaining a high correlation with it. We conducted extensive experiments on around 30 mainstream LLMs and observed that current LLMs lack long text generation capabilities. Specifically, first, regardless of whether the instructions include explicit or implicit length constraints, most LLMs cannot generate text longer than 4000 words. Second, while some LLMs can generate longer text, many issues remain (e.g., severe repetition and quality degradation). Third, to demonstrate the effectiveness of HelloEval, we compare it with traditional metrics (e.g., ROUGE and BLEU) and LLM-as-a-Judge methods, and show that HelloEval has the highest correlation with human evaluation. Our code is available at https://github.com/Quehry/HelloBench.
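The two failure modes named in the abstract, output shorter than the 4000-word mark and severe repetition, can be checked with very simple text statistics. The following is a minimal sketch only, not the HelloBench or HelloEval implementation: the function names, whitespace tokenization, and the choice of 4-grams are all assumptions made for illustration.

```python
from collections import Counter


def word_count(text: str) -> int:
    """Count whitespace-separated words (an assumed, simplistic tokenization)."""
    return len(text.split())


def exceeds_length(text: str, threshold: int = 4000) -> bool:
    """Return True if the text is longer than `threshold` words.

    The default of 4000 mirrors the figure quoted in the abstract.
    """
    return word_count(text) > threshold


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of n-gram occurrences whose n-gram appears more than once.

    This is a hypothetical repetition proxy, not a metric from the paper:
    0.0 means no n-gram repeats, 1.0 means every n-gram occurs at least twice.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```

A ratio near 1.0 on a long output would suggest that the model is looping over the same phrases, while a `False` from `exceeds_length` indicates that the output stayed at or below the 4000-word mark.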
