DateLogicQA: Benchmarking Temporal Biases in Large Language Models

Abstract

This paper introduces DateLogicQA, a benchmark of 190 questions covering diverse date formats, temporal contexts, and reasoning types. We propose the Semantic Integrity Metric to assess tokenization quality and analyse two biases: Representation-Level Bias, which affects embeddings, and Logical-Level Bias, which influences reasoning outputs. Our findings provide a comprehensive evaluation of LLMs' capabilities and limitations in temporal reasoning, highlighting key challenges in handling temporal data accurately. The GitHub repository for our work is available at https://github.com/gagan3012/EAIS-Temporal-Bias
