RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
December 12, 2024
Authors: Ruiwen Zhou, Wenyue Hua, Liangming Pan, Sitao Cheng, Xiaobao Wu, En Yu, William Yang Wang
cs.AI
Abstract
This paper introduces RuleArena, a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains -- airline baggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs' proficiency in handling intricate natural language instructions that demand long-context understanding, logical reasoning, and accurate mathematical computation. Two key attributes distinguish RuleArena from traditional rule-based reasoning benchmarks: (1) it extends beyond standard first-order logic representations, and (2) it is grounded in authentic, practical scenarios, providing insights into the suitability and reliability of LLMs for real-world applications. Our findings reveal several notable limitations in LLMs: (1) they struggle to identify and apply the appropriate rules, frequently becoming confused by similar but distinct regulations, (2) they cannot consistently perform accurate mathematical computations, even when they correctly identify the relevant rules, and (3) overall, they perform poorly on the benchmark. These results highlight significant challenges in advancing LLMs' rule-guided reasoning capabilities in real-life applications.
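To make the task type concrete, below is a minimal, hypothetical sketch of the kind of rule-guided computation the airline-baggage domain involves: a model must first select the applicable rules and then carry out the arithmetic they prescribe. The function, rule set, fee amounts, and thresholds are invented for illustration and are not taken from RuleArena:

```python
# Hypothetical example (not from the paper): the kind of rule application and
# arithmetic RuleArena's airline-baggage domain requires. All rules, fees, and
# thresholds here are invented for illustration.

def baggage_fee(cabin: str, bag_weight_kg: float, checked_bags: int) -> float:
    """Total fee under an invented two-rule fee schedule."""
    # Invented allowance rule: economy includes 1 free bag, business includes 2.
    free_allowance = {"economy": 1, "business": 2}[cabin]
    fee = 0.0
    for i in range(checked_bags):
        if i >= free_allowance:
            fee += 75.0            # invented extra-bag fee
        if bag_weight_kg > 23:     # invented overweight threshold (kg)
            fee += 100.0           # invented overweight surcharge
    return fee

# Business class, two 20 kg bags: both within allowance and weight limit.
assert baggage_fee("business", 20, 2) == 0.0
# Economy, two 25 kg bags: one extra-bag fee plus two overweight surcharges.
assert baggage_fee("economy", 25, 2) == 75.0 + 2 * 100.0
```

In this toy setting, the failure mode the paper describes as confusion between similar but distinct regulations would correspond to, say, applying the business-class allowance to an economy passenger; the computation failure mode would correspond to picking the right rules but summing the surcharges incorrectly.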