WritingBench: A Comprehensive Benchmark for Generative Writing
March 7, 2025
Authors: Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Zijia Wang, Ji Zhang, Mengyue Wu, Qin Jin, Fei Huang
cs.AI
Abstract
Recent advancements in large language models (LLMs) have significantly
enhanced text generation capabilities, yet evaluating their performance in
generative writing remains a challenge. Existing benchmarks primarily focus on
generic text generation or limited writing tasks, failing to capture the
diverse requirements of high-quality written content across various domains.
To bridge this gap, we present WritingBench, a comprehensive benchmark designed
to evaluate LLMs across 6 core writing domains and 100 subdomains, encompassing
creative, persuasive, informative, and technical writing. We further propose a
query-dependent evaluation framework that empowers LLMs to dynamically generate
instance-specific assessment criteria. This framework is complemented by a
fine-tuned critic model for criteria-aware scoring, enabling multi-dimensional
evaluation across style, format, and length. The framework's validity is further
demonstrated by
its data curation capability, which enables 7B-parameter models to approach
state-of-the-art (SOTA) performance. We open-source the benchmark, along with
evaluation tools and modular framework components, to advance the development
of LLMs in writing.
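The abstract describes two core pieces of the query-dependent evaluation framework: an LLM that derives instance-specific assessment criteria from each writing query, and a fine-tuned critic model that scores a response against those criteria. The sketch below illustrates that flow in Python; the `llm`/`critic` interfaces, prompts, criterion count, and 1-10 scale are assumptions for illustration, not the released WritingBench implementation.

```python
# Minimal sketch of a query-dependent evaluation loop in the spirit of
# WritingBench: derive criteria tailored to the writing query, then score
# the response against each criterion with a critic model. All prompts,
# interfaces, and the 1-10 scale are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Criterion:
    name: str         # e.g. "tone", "format", "length"
    description: str  # what the criterion checks for this specific query


def generate_criteria(llm, query: str, k: int = 5) -> list[Criterion]:
    """Ask a general-purpose LLM to propose k assessment criteria
    tailored to this particular writing query (hypothetical prompt)."""
    prompt = (
        f"Writing task:\n{query}\n\n"
        f"List {k} concrete criteria for judging a response to this task, "
        "covering aspects such as style, format, and length. "
        "Return one 'name: description' pair per line."
    )
    criteria = []
    for line in llm.complete(prompt).strip().splitlines()[:k]:
        name, _, description = line.partition(":")
        criteria.append(Criterion(name.strip(), description.strip()))
    return criteria


def score_response(critic, query: str, response: str,
                   criteria: list[Criterion]) -> float:
    """Score the response with a criteria-aware critic model and
    average over criteria (a 1-10 integer scale is assumed here)."""
    scores = []
    for c in criteria:
        prompt = (
            f"Task: {query}\nResponse: {response}\n"
            f"Criterion ({c.name}): {c.description}\n"
            "Rate the response on this criterion from 1 to 10. "
            "Answer with a single integer."
        )
        scores.append(int(critic.complete(prompt).strip()))
    return sum(scores) / len(scores)
```

In this sketch, `llm` and `critic` are placeholders for any model client exposing a `complete(prompt) -> str` method; the key design point mirrored from the paper is that the criteria are regenerated per query rather than fixed across the benchmark.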