DateLogicQA: Benchmarking Temporal Biases in Large Language Models
December 17, 2024
Authors: Gagan Bhatia, MingZe Tang, Cristina Mahanta, Madiha Kazi
cs.AI
Abstract
This paper introduces DateLogicQA, a benchmark with 190 questions covering
diverse date formats, temporal contexts, and reasoning types. We propose the
Semantic Integrity Metric to assess tokenization quality and analyse two
biases: Representation-Level Bias, affecting embeddings, and Logical-Level
Bias, influencing reasoning outputs. Our findings provide a comprehensive
evaluation of LLMs' capabilities and limitations in temporal reasoning,
highlighting key challenges in handling temporal data accurately. The GitHub
repository for our work is available at
https://github.com/gagan3012/EAIS-Temporal-Bias