DateLogicQA: Benchmarking Temporal Biases in Large Language Models
December 17, 2024
Authors: Gagan Bhatia, MingZe Tang, Cristina Mahanta, Madiha Kazi
cs.AI
Abstract
This paper introduces DateLogicQA, a benchmark with 190 questions covering
diverse date formats, temporal contexts, and reasoning types. We propose the
Semantic Integrity Metric to assess tokenization quality and analyse two
biases: Representation-Level Bias, affecting embeddings, and Logical-Level
Bias, influencing reasoning outputs. Our findings provide a comprehensive
evaluation of LLMs' capabilities and limitations in temporal reasoning,
highlighting key challenges in handling temporal data accurately. The GitHub
repository for our work is available at
https://github.com/gagan3012/EAIS-Temporal-Bias