
Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties

February 24, 2025
Authors: Zhenglin Wang, Jialong Wu, Pengfei Li, Yong Jiang, Deyu Zhou
cs.AI

Abstract

Temporal reasoning is fundamental to human cognition and is crucial for various real-world applications. While recent advances in Large Language Models have demonstrated promising capabilities in temporal reasoning, existing benchmarks primarily rely on rule-based construction, lack contextual depth, and involve a limited range of temporal entities. To address these limitations, we introduce Chinese Time Reasoning (CTM), a benchmark designed to evaluate LLMs on temporal reasoning within the extensive scope of Chinese dynastic chronology. CTM emphasizes cross-entity relationships, pairwise temporal alignment, and contextualized and culturally-grounded reasoning, providing a comprehensive evaluation. Extensive experimental results reveal the challenges posed by CTM and highlight potential avenues for improvement.

Summary (AI-Generated)

Paper Overview

Core Contribution

  • Introduces Chinese Time Reasoning (CTM), a benchmark for evaluating temporal reasoning in Large Language Models (LLMs) within the context of Chinese dynastic chronology.
  • Emphasizes cross-entity relationships, pairwise temporal alignment, and culturally-grounded reasoning.
  • Provides a comprehensive evaluation of LLMs' temporal reasoning capabilities.

Research Context

  • Temporal reasoning is fundamental to human cognition and crucial for real-world applications.
  • Existing benchmarks lack contextual depth and involve a limited range of temporal entities.
  • CTM addresses these limitations by focusing on Chinese dynastic chronology, which spans a longer historical scope and includes culturally-grounded knowledge.

Keywords

  • Temporal reasoning
  • Chinese dynastic chronology
  • Cross-entity relationships
  • Pairwise temporal alignment
  • Culturally-grounded reasoning
  • Large Language Models (LLMs)

Background

Research Gap

  • Existing benchmarks rely on rule-based construction and lack contextualization.
  • Limited range of temporal entities in current evaluations.
  • Need for a benchmark that evaluates temporal reasoning within a culturally rich and historically extensive context.

Technical Challenges

  • Accurately modeling temporal relationships across a broad historical scope.
  • Incorporating culturally-grounded knowledge into temporal reasoning tasks.
  • Evaluating LLMs' ability to align entities across different temporal dimensions.

Prior Approaches

  • Rule-based benchmarks like TIMEQA, TEMPLAMA, and TEMPREASON.
  • LLM-based benchmarks like SITUATEDGEN and TIMEBENCH.
  • These benchmarks primarily focus on English and lack the depth and cultural context provided by CTM.

Methodology

Technical Architecture

  • CTM is built on a curated Chinese cultural entity repository with over 4,700 entities.
  • Includes entities such as historical figures, places, allusions, ingredients, and intangible cultural heritage.
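To make the repository design concrete, here is a minimal sketch of what an entity record might look like. The field names and example entries are assumptions for illustration, not the paper's actual schema:

```python
from dataclasses import dataclass

# Hypothetical entity record; field names are illustrative assumptions,
# not CTM's actual data format.
@dataclass
class CulturalEntity:
    name: str        # e.g. "Li Bai"
    category: str    # e.g. "historical figure", "allusion", "ingredient"
    dynasty: str     # e.g. "Tang"
    start_year: int  # approximate start year (CE; negative values = BCE)
    end_year: int

# A tiny illustrative repository keyed by entity name.
repo = {
    "Li Bai": CulturalEntity("Li Bai", "historical figure", "Tang", 701, 762),
    "Su Shi": CulturalEntity("Su Shi", "historical figure", "Song", 1037, 1101),
}

def same_dynasty(a: str, b: str) -> bool:
    """Pairwise temporal alignment check: did two entities share a dynasty?"""
    return repo[a].dynasty == repo[b].dynasty

print(same_dynasty("Li Bai", "Su Shi"))  # False: Tang vs. Song
```

Records like these would support the cross-entity questions described below, since any pair of entities can be compared by dynasty or by year range.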

Implementation Details

  • Tasks include Question-Answering (QA) and Timeline Ito Game.
  • QA tasks cover Entity-based Dynasty Determination, Plausibility Judgment, Temporal Order Understanding, Relation Reasoning, Script Error Correction, Entity Evolution Understanding, Time Interval Calculation, Temporal Entity Selection, and Long Script Error Correction.
  • Timeline Ito Game evaluates LLMs' ability to align entities across temporal and other dimensions.
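A plausible way to score a Timeline Ito Game response is to check whether the model's proposed ordering of entities agrees with ground-truth chronology. The sketch below assumes this pass/fail criterion and uses illustrative birth years; the paper's exact scoring procedure may differ:

```python
# Illustrative ground-truth years (CE; negative = BCE). These are
# well-known approximate dates, not values from the CTM repository.
GROUND_TRUTH_YEAR = {
    "Confucius": -551,   # Spring and Autumn period
    "Li Bai": 701,       # Tang
    "Su Shi": 1037,      # Song
    "Cao Xueqin": 1715,  # Qing
}

def is_chronological(ordering):
    """Return True if the model's ordering is non-decreasing in time."""
    years = [GROUND_TRUTH_YEAR[name] for name in ordering]
    return all(a <= b for a, b in zip(years, years[1:]))

print(is_chronological(["Confucius", "Li Bai", "Su Shi", "Cao Xueqin"]))  # True
print(is_chronological(["Li Bai", "Confucius", "Su Shi", "Cao Xueqin"]))  # False
```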

Innovation Points

  • Focus on contextualization and cross-entity relationships.
  • Use of culturally-grounded and historical knowledge.
  • Introduction of the Timeline Ito Game for evaluating temporal alignment.

Results

Experimental Setup

  • Evaluated twelve mainstream LLMs, including both closed-source and open-source models.
  • Conducted experiments under zero-shot and chain-of-thought (CoT) settings.

Key Findings

  • Performance declines as the number of entities increases, with Time Interval Calculation being the most challenging task.
  • CoT improves performance but can negatively impact small LLMs or tasks with excessively long contexts.
  • InternLM2.5 performs well among small open-source models.
  • Temporal alignment is highly challenging: even powerful models like GPT-4o struggle to exceed a score of 40 on the Pass@8 metric.
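For context on the Pass@8 figure: Pass@k is commonly computed with the unbiased estimator of Chen et al. (2021). Whether CTM uses exactly this estimator is an assumption here; the sketch shows the standard formulation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021).

    n: total samples drawn per problem
    c: number of correct samples
    k: evaluation budget
    """
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n == k, Pass@k is 1.0 as soon as any sample is correct;
# the benchmark score averages this quantity over all problems.
print(pass_at_k(8, 0, 8))                # 0.0
print(pass_at_k(8, 1, 8))                # 1.0
print(round(pass_at_k(16, 2, 8), 3))     # 0.767
```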

Limitations

  • Prompt design and evaluation settings may vary across tasks and models.
  • Dataset scale and coverage could be expanded to include more complex temporal scenarios and longer historical events.
  • Future work could explore dynamic prompt designs and more diverse few-shot and zero-shot settings.
