DateLogicQA: Benchmarking Temporal Biases in Large Language Models

Abstract

This paper introduces DateLogicQA, a benchmark of 190 questions covering diverse date formats, temporal contexts, and reasoning types. We propose the Semantic Integrity Metric to assess tokenization quality and analyse two biases: Representation-Level Bias, which affects embeddings, and Logical-Level Bias, which influences reasoning outputs. Our findings provide a comprehensive evaluation of LLMs' capabilities and limitations in temporal reasoning, highlighting key challenges in handling temporal data accurately. The GitHub repository for our work is available at https://github.com/gagan3012/EAIS-Temporal-Bias.
