Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties
Abstract
Temporal reasoning is fundamental to human cognition and is crucial for various real-world applications. While recent advances in Large Language Models have demonstrated promising capabilities in temporal reasoning, existing benchmarks primarily rely on rule-based construction, lack contextual depth, and involve a limited range of temporal entities. To address these limitations, we introduce Chinese Time Reasoning (CTM), a benchmark designed to evaluate LLMs on temporal reasoning within the extensive scope of Chinese dynastic chronology. CTM emphasizes cross-entity relationships, pairwise temporal alignment, and contextualized and culturally-grounded reasoning, providing a comprehensive evaluation. Extensive experimental results reveal the challenges posed by CTM and highlight potential avenues for improvement.
Summary
Paper Overview
Core Contribution
- Introduces Chinese Time Reasoning (CTM), a benchmark for evaluating temporal reasoning in Large Language Models (LLMs) within the context of Chinese dynastic chronology.
- Emphasizes cross-entity relationships, pairwise temporal alignment, and culturally-grounded reasoning.
- Provides a comprehensive evaluation of LLMs' temporal reasoning capabilities.
Research Context
- Temporal reasoning is fundamental to human cognition and crucial for real-world applications.
- Existing benchmarks lack contextual depth and involve a limited range of temporal entities.
- CTM addresses these limitations by focusing on Chinese dynastic chronology, which spans a longer historical scope and includes culturally-grounded knowledge.
Keywords
- Temporal reasoning
- Chinese dynastic chronology
- Cross-entity relationships
- Pairwise temporal alignment
- Culturally-grounded reasoning
- Large Language Models (LLMs)
Background
Research Gap
- Existing benchmarks rely on rule-based construction and lack contextualization.
- Limited range of temporal entities in current evaluations.
- Need for a benchmark that evaluates temporal reasoning within a culturally rich and historically extensive context.
Technical Challenges
- Accurately modeling temporal relationships across a broad historical scope.
- Incorporating culturally-grounded knowledge into temporal reasoning tasks.
- Evaluating LLMs' ability to align entities across different temporal dimensions.
Prior Approaches
- Rule-based benchmarks such as TimeQA, TempLAMA, and TempReason.
- LLM-based benchmarks such as SituatedGen and TimeBench.
- These benchmarks primarily focus on English and lack the depth and cultural context provided by CTM.
Methodology
Technical Architecture
- CTM is built on a curated Chinese cultural entity repository with over 4,700 entities.
- Includes entities such as historical figures, places, allusions, ingredients, and intangible cultural heritage items (an illustrative record layout is sketched below).
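The following is a hypothetical sketch of how an entry in such a repository might be represented; the field names and example values are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical record layout for a cultural entity; field names and the
# example are illustrative assumptions, not CTM's actual data schema.
from dataclasses import dataclass

@dataclass
class CulturalEntity:
    name: str      # e.g. "Su Shi"
    category: str  # historical figure, place, allusion, ingredient, or intangible cultural heritage
    dynasty: str   # dynasty label used for temporal grounding, e.g. "Song"

example = CulturalEntity(name="Su Shi", category="historical figure", dynasty="Song")
```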
Implementation Details
- Tasks include Question-Answering (QA) and Timeline Ito Game.
- QA tasks cover Entity-based Dynasty Determination, Plausibility Judgment, Temporal Order Understanding, Relation Reasoning, Script Error Correction, Entity Evolution Understanding, Time Interval Calculation, Temporal Entity Selection, and Long Script Error Correction.
- The Timeline Ito Game evaluates LLMs' ability to align entities across temporal and other dimensions (a minimal sketch of the underlying pairwise checks follows this list).
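As a rough illustration of the pairwise temporal checks these tasks build on (dynasty determination, temporal order, and time interval calculation), here is a minimal sketch; the dynasty spans are approximate CE years chosen for illustration and are not drawn from the benchmark.

```python
# Minimal sketch of pairwise temporal checks over dynasty labels.
# Dynasty spans are approximate CE years and purely illustrative.
DYNASTY_SPANS = {"Tang": (618, 907), "Song": (960, 1279), "Ming": (1368, 1644)}

def same_dynasty(dynasty_a: str, dynasty_b: str) -> bool:
    """Pairwise temporal alignment: do two entities share a dynasty?"""
    return dynasty_a == dynasty_b

def precedes(dynasty_a: str, dynasty_b: str) -> bool:
    """Temporal order: did dynasty_a begin before dynasty_b?"""
    return DYNASTY_SPANS[dynasty_a][0] < DYNASTY_SPANS[dynasty_b][0]

def interval_years(dynasty_a: str, dynasty_b: str) -> int:
    """Time interval: gap in years between the two dynasties' start years."""
    return abs(DYNASTY_SPANS[dynasty_a][0] - DYNASTY_SPANS[dynasty_b][0])

# e.g. precedes("Tang", "Song") -> True; interval_years("Tang", "Ming") -> 750
```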
Innovation Points
- Focus on contextualization and cross-entity relationships.
- Use of culturally-grounded and historical knowledge.
- Introduction of the Timeline Ito Game for evaluating temporal alignment.
Results
Experimental Setup
- Evaluated twelve mainstream LLMs, including both closed-source and open-source models.
- Conducted experiments under zero-shot and chain-of-thought (CoT) settings (a minimal prompt sketch follows this list).
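A minimal sketch of how the two settings might differ in prompt construction; the wording is an assumption for illustration, not the paper's actual template.

```python
# Minimal sketch contrasting the two evaluation settings; the prompt wording
# is an illustrative assumption, not the paper's actual template.
def build_prompt(question: str, use_cot: bool) -> str:
    if use_cot:
        # Chain-of-thought: ask the model to reason step by step before answering.
        return f"{question}\nLet's think step by step, then give the final answer."
    # Zero-shot: ask for the answer directly, with no reasoning trace requested.
    return f"{question}\nAnswer directly."
```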
Key Findings
- Performance declines as the number of entities increases, with Time Interval Calculation being the most challenging task.
- CoT prompting generally improves performance, but can hurt smaller LLMs or tasks with excessively long contexts.
- InternLM2.5 performs well among small open-source models.
- Temporal alignment is highly challenging: even strong models such as GPT-4o struggle to exceed a Pass@8 score of 40 (a sketch of this metric follows this list).
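Read here as the share of game instances solved by at least one of 8 sampled attempts, Pass@8 could be computed as in the sketch below; this reading is an assumption, not the paper's stated protocol.

```python
# Assumed reading of Pass@k: an instance counts as solved if at least one of
# the first k sampled attempts succeeds; reported on a 0-100 scale.
def pass_at_k(attempt_results: list[list[bool]], k: int = 8) -> float:
    """attempt_results[i] holds per-attempt success flags for instance i."""
    solved = sum(any(attempts[:k]) for attempts in attempt_results)
    return 100.0 * solved / len(attempt_results)

# Example: 2 of 3 instances solved within 8 attempts -> Pass@8 of about 66.7
```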
Limitations
- Results may be sensitive to prompt design and evaluation settings, which vary across tasks and models.
- Dataset scale and coverage could be expanded to include more complex temporal scenarios and longer historical events.
- Future work could explore dynamic prompt designs and more diverse few-shot and zero-shot settings.