WritingBench: A Comprehensive Benchmark for Generative Writing

Abstract

Recent advances in large language models (LLMs) have significantly enhanced text generation capabilities, yet evaluating their performance in generative writing remains a challenge. Existing benchmarks focus primarily on generic text generation or on a limited set of writing tasks, and so fail to capture the diverse requirements of high-quality written content across domains. To bridge this gap, we present WritingBench, a comprehensive benchmark designed to evaluate LLMs across 6 core writing domains and 100 subdomains, encompassing creative, persuasive, informative, and technical writing. We further propose a query-dependent evaluation framework that enables LLMs to dynamically generate instance-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluation of style, format, and length. The framework's validity is further demonstrated by its data curation capability, which enables 7B-parameter models to approach state-of-the-art (SOTA) performance. We open-source the benchmark, along with evaluation tools and modular framework components, to advance the development of LLMs in writing.
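To make the query-dependent evaluation idea concrete, the following is a minimal Python sketch, not the released WritingBench implementation. It assumes two hypothetical callables, `generate_criteria` (standing in for an LLM that writes criteria for one query) and `score_criterion` (standing in for the critic model). The toy stand-ins, the 0-10 scale, and the plain-average aggregation are illustrative assumptions that the abstract does not specify.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces: in practice these would wrap an LLM and a critic model.
CriteriaGenerator = Callable[[str], List[str]]
CriterionScorer = Callable[[str, str, str], float]


def evaluate_response(
    query: str,
    response: str,
    generate_criteria: CriteriaGenerator,
    score_criterion: CriterionScorer,
) -> Dict[str, object]:
    """Score one response against criteria generated for this specific query."""
    criteria = generate_criteria(query)  # instance-specific, not a fixed rubric
    if not criteria:
        raise ValueError("criteria generator returned no criteria")
    per_criterion = {c: score_criterion(query, response, c) for c in criteria}
    # Assumption: a plain average; the source does not state the aggregation rule.
    overall = sum(per_criterion.values()) / len(per_criterion)
    return {"per_criterion": per_criterion, "overall": overall}


# Toy stand-ins so the sketch runs without any model.
def toy_generator(query: str) -> List[str]:
    criteria = ["addresses the requested topic"]
    if "formal" in query.lower():
        criteria.append("uses a formal register")
    if "words" in query.lower():
        criteria.append("respects the length requirement")
    return criteria


def toy_scorer(query: str, response: str, criterion: str) -> float:
    # Placeholder scoring on an arbitrary 0-10 scale.
    return 8.0 if response.strip() else 0.0


result = evaluate_response(
    "Write a formal apology email in under 100 words.",
    "Dear Ms. Lee, I apologize for the delay...",
    toy_generator,
    toy_scorer,
)
print(result["overall"])
```

Passing the generator and scorer as callables keeps the sketch independent of any particular model or API; a real setup would replace the toy functions with model calls.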
