HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models
September 24, 2024
Authors: Haoran Que, Feiyu Duan, Liqun He, Yutao Mou, Wangchunshu Zhou, Jiaheng Liu, Wenge Rong, Zekun Moore Wang, Jian Yang, Ge Zhang, Junran Peng, Zhaoxiang Zhang, Songyang Zhang, Kai Chen
cs.AI
Abstract
In recent years, Large Language Models (LLMs) have demonstrated remarkable
capabilities in various tasks (e.g., long-context understanding), and many
benchmarks have been proposed. However, we observe that long text generation
capabilities are not well investigated. Therefore, we introduce the
Hierarchical Long Text Generation Benchmark (HelloBench), a comprehensive,
in-the-wild, and open-ended benchmark to evaluate LLMs' performance in
generating long text. Based on Bloom's Taxonomy, HelloBench categorizes long
text generation tasks into five subtasks: open-ended QA, summarization, chat,
text completion, and heuristic text generation. In addition, we propose
Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation
method that significantly reduces the time and effort required for human
evaluation while maintaining a high correlation with human evaluation. We
conducted extensive experiments on approximately 30 mainstream LLMs and
observed that current LLMs lack long text generation capabilities. Specifically,
first, regardless of whether the instructions include explicit or implicit
length constraints, we observe that most LLMs cannot generate text that is
longer than 4000 words. Second, we observe that while some LLMs can generate
longer text, many issues exist (e.g., severe repetition and quality
degradation). Third, to demonstrate the effectiveness of HelloEval, we compare
HelloEval with traditional metrics (e.g., ROUGE and BLEU) and LLM-as-a-Judge
methods; the results show that HelloEval has the highest correlation with human
evaluation. We release our code at https://github.com/Quehry/HelloBench.
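
To make the abstract's length and repetition findings concrete, below is a minimal sketch of how such checks could be implemented on raw model outputs. The function names, the 4000-word threshold as a default, and the n-gram repetition measure are illustrative assumptions, not taken from the HelloBench codebase.

```python
# A minimal sketch (not from the HelloBench codebase) of two checks implied by
# the findings above: whether an output meets an explicit word-length
# constraint, and how repetitive it is, via a simple n-gram repetition rate.
from collections import Counter

def meets_length_constraint(text: str, min_words: int = 4000) -> bool:
    """True if the text has at least `min_words` whitespace-separated words."""
    return len(text.split()) >= min_words

def ngram_repetition_rate(text: str, n: int = 4) -> float:
    """Fraction of n-gram occurrences that are duplicates; higher = more repetitive."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(ngrams)

sample = "the model repeats itself " * 200  # toy stand-in for a long output
print(meets_length_constraint(sample))          # False: 800 words < 4000
print(round(ngram_repetition_rate(sample), 3))  # near 1.0: highly repetitive
```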
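The abstract also reports that HelloEval has the highest correlation with human evaluation. A common way to quantify such agreement is a rank correlation such as Spearman's rho; the sketch below is a generic illustration with made-up scores, not the authors' exact procedure or data.

```python
# A generic illustration (hypothetical scores, not the authors' data) of how
# agreement between an automatic metric and human evaluation can be
# quantified with Spearman rank correlation -- the kind of comparison the
# abstract describes for HelloEval versus ROUGE, BLEU, and LLM-as-a-Judge.
from scipy.stats import spearmanr

human_ratings = [4.5, 2.0, 3.5, 5.0, 1.5]       # hypothetical human scores
metric_scores = [0.62, 0.31, 0.55, 0.71, 0.28]  # hypothetical metric scores

rho, p_value = spearmanr(human_ratings, metric_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```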