RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios

Abstract

This paper introduces RuleArena, a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains (airline baggage fees, NBA transactions, and tax regulations), RuleArena assesses LLMs' proficiency in handling intricate natural language instructions that demand long-context understanding, logical reasoning, and accurate mathematical computation. Two key attributes distinguish RuleArena from traditional rule-based reasoning benchmarks: (1) it extends beyond standard first-order logic representations, and (2) it is grounded in authentic, practical scenarios, providing insights into the suitability and reliability of LLMs for real-world applications. Our findings reveal several notable limitations in LLMs: (1) they struggle to identify and apply the appropriate rules, frequently becoming confused by similar but distinct regulations, (2) they cannot consistently perform accurate mathematical computations, even when they correctly identify the relevant rules, and (3) in general, they perform poorly on the benchmark. These results highlight significant challenges in advancing LLMs' rule-guided reasoning capabilities in real-life applications.
