

PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines

April 20, 2025
Authors: Reya Vir, Shreya Shankar, Harrison Chase, Will Fu-Hinthorn, Aditya Parameswaran
cs.AI

Abstract

Large language models (LLMs) are increasingly deployed in specialized production data processing pipelines across diverse domains -- such as finance, marketing, and e-commerce. However, when running them in production across many inputs, they often fail to follow instructions or meet developer expectations. To improve reliability in these applications, creating assertions or guardrails for LLM outputs to run alongside the pipelines is essential. Yet, determining the right set of assertions that capture developer requirements for a task is challenging. In this paper, we introduce PROMPTEVALS, a dataset of 2087 LLM pipeline prompts with 12623 corresponding assertion criteria, sourced from developers using our open-source LLM pipeline tools. This dataset is 5x larger than previous collections. Using a hold-out test split of PROMPTEVALS as a benchmark, we evaluated closed- and open-source models in generating relevant assertions. Notably, our fine-tuned Mistral and Llama 3 models outperform GPT-4o by 20.93% on average, offering both reduced latency and improved performance. We believe our dataset can spur further research in LLM reliability, alignment, and prompt engineering.
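To make the idea of pipeline assertions concrete, here is a minimal, hypothetical sketch of guardrails running alongside an LLM pipeline. The criteria, function names, and example output below are illustrative assumptions, not taken from the paper or the PROMPTEVALS dataset:

```python
import json

# Hypothetical assertion criteria for an e-commerce product-description
# pipeline; real PROMPTEVALS criteria are sourced from developer prompts.
def run_guardrails(output: str) -> list[str]:
    """Check an LLM output against assertion criteria; return any failures."""
    failures = []
    # Criterion 1: the response must be valid JSON.
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    # Criterion 2: required fields must be present.
    for key in ("title", "description"):
        if key not in parsed:
            failures.append(f"missing required field: {key}")
    # Criterion 3: the description must stay within a length budget.
    if len(parsed.get("description", "")) > 500:
        failures.append("description exceeds 500 characters")
    return failures

# Example: validate a pipeline output before passing it downstream.
llm_output = '{"title": "Mug", "description": "A sturdy ceramic mug."}'
problems = run_guardrails(llm_output)
if problems:
    print("Guardrail failures:", problems)  # e.g., retry or flag for review
else:
    print("Output passed all assertions")
```

The paper's contribution is a dataset for learning to generate such criteria automatically from a pipeline prompt, rather than hand-writing them as above.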

