RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios

Abstract

This paper introduces RuleArena, a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains -- airline baggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs' proficiency in handling intricate natural language instructions that demand long-context understanding, logical reasoning, and accurate mathematical computation. Two key attributes distinguish RuleArena from traditional rule-based reasoning benchmarks: (1) it extends beyond standard first-order logic representations, and (2) it is grounded in authentic, practical scenarios, providing insights into the suitability and reliability of LLMs for real-world applications. Our findings reveal several notable limitations in LLMs: (1) they struggle to identify and apply the appropriate rules, frequently becoming confused by similar but distinct regulations, (2) they cannot consistently perform accurate mathematical computations, even when they correctly identify the relevant rules, and (3) in general, they perform poorly on the benchmark. These results highlight significant challenges in advancing LLMs' rule-guided reasoning capabilities in real-life applications.
