DateLogicQA: Benchmarking Temporal Biases in Large Language Models

December 17, 2024
Authors: Gagan Bhatia, MingZe Tang, Cristina Mahanta, Madiha Kazi
cs.AI

Abstract

This paper introduces DateLogicQA, a benchmark with 190 questions covering diverse date formats, temporal contexts, and reasoning types. We propose the Semantic Integrity Metric to assess tokenization quality and analyse two biases: Representation-Level Bias, affecting embeddings, and Logical-Level Bias, influencing reasoning outputs. Our findings provide a comprehensive evaluation of LLMs' capabilities and limitations in temporal reasoning, highlighting key challenges in handling temporal data accurately. The GitHub repository for our work is available at https://github.com/gagan3012/EAIS-Temporal-Bias
