RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
December 12, 2024
Authors: Ruiwen Zhou, Wenyue Hua, Liangming Pan, Sitao Cheng, Xiaobao Wu, En Yu, William Yang Wang
cs.AI
Abstract
This paper introduces RuleArena, a novel and challenging benchmark designed
to evaluate the ability of large language models (LLMs) to follow complex,
real-world rules in reasoning. Covering three practical domains -- airline
baggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs'
proficiency in handling intricate natural language instructions that demand
long-context understanding, logical reasoning, and accurate mathematical
computation. Two key attributes distinguish RuleArena from traditional
rule-based reasoning benchmarks: (1) it extends beyond standard first-order
logic representations, and (2) it is grounded in authentic, practical
scenarios, providing insights into the suitability and reliability of LLMs for
real-world applications. Our findings reveal several notable limitations in
LLMs: (1) they struggle to identify and apply the appropriate rules, frequently
becoming confused by similar but distinct regulations, (2) they cannot
consistently perform accurate mathematical computations, even when they
correctly identify the relevant rules, and (3) in general, they perform poorly
on the benchmark. These results highlight significant challenges in advancing
LLMs' rule-guided reasoning capabilities in real-life applications.
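To make the task concrete, below is a minimal toy sketch of the kind of rule application the benchmark probes in its airline-baggage domain: selecting the applicable fee rules for a passenger's bags and carrying out the arithmetic exactly. This is an illustration, not code or rules from RuleArena; all names and fee values here are invented.

```python
from dataclasses import dataclass

@dataclass
class Bag:
    weight_kg: float
    is_checked: bool

# Hypothetical fee schedule (invented values, loosely modeled on
# typical airline baggage policies; not RuleArena's actual rules).
FREE_CHECKED_ALLOWANCE = 1        # first checked bag flies free
BASE_CHECKED_FEE = 35.0           # each additional checked bag
OVERWEIGHT_THRESHOLD_KG = 23.0    # bags heavier than this pay a surcharge
OVERWEIGHT_SURCHARGE = 100.0

def total_baggage_fee(bags: list[Bag]) -> float:
    """Apply the rules in order: free allowance, base fee, overweight surcharge."""
    fee = 0.0
    checked_count = 0
    for bag in bags:
        if not bag.is_checked:
            continue  # carry-ons are free under this toy schedule
        checked_count += 1
        if checked_count > FREE_CHECKED_ALLOWANCE:
            fee += BASE_CHECKED_FEE
        if bag.weight_kg > OVERWEIGHT_THRESHOLD_KG:
            fee += OVERWEIGHT_SURCHARGE
    return fee

if __name__ == "__main__":
    # Two checked bags (one overweight) plus a carry-on:
    # bag 1 is free, bag 2 costs 35.0 + 100.0 overweight -> 135.0 total.
    bags = [Bag(20.0, True), Bag(28.0, True), Bag(8.0, False)]
    print(total_baggage_fee(bags))
```

In actual RuleArena instances, such rules are stated in long natural-language policy documents rather than as explicit code, so a model must both locate the relevant clauses and execute the computation reliably, the two failure modes the findings above identify.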