Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs
February 7, 2025
Authors: Rohit Saxena, Aryo Pradipta Gema, Pasquale Minervini
cs.AI
Abstract
Understanding time from visual representations is a fundamental cognitive
skill, yet it remains a challenge for multimodal large language models (MLLMs).
In this work, we investigate the capabilities of MLLMs in interpreting time and
date through analogue clocks and yearly calendars. To facilitate this, we
curated a structured dataset comprising two subsets: 1) ClockQA,
which covers a range of clock styles (standard, black-dial,
no-second-hand, Roman numeral, and arrow-hand clocks) paired with time-related
questions; and 2) CalendarQA, which consists of yearly calendar
images with questions ranging from commonly known dates (e.g., Christmas, New
Year's Day) to computationally derived ones (e.g., the 100th or 153rd day of
the year). We aim to analyse how MLLMs can perform visual recognition,
numerical reasoning, and temporal inference when presented with time-related
visual data. Our evaluations show that despite recent advancements, reliably
understanding time remains a significant challenge for MLLMs.
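To make the "computationally derived" calendar questions concrete, the following is a minimal sketch of how a reference answer for "the Nth day of the year" could be computed. It is an illustration only, not the authors' dataset pipeline: the function name `nth_day_of_year` and the example years are assumptions introduced here.

```python
from datetime import date, timedelta


def nth_day_of_year(year: int, n: int) -> date:
    """Return the calendar date of the n-th day (1-indexed) of `year`.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Number of days in the year (365, or 366 in a leap year).
    days_in_year = (date(year + 1, 1, 1) - date(year, 1, 1)).days
    if not 1 <= n <= days_in_year:
        raise ValueError(f"n must be between 1 and {days_in_year} for {year}")
    # Day 1 is January 1, so offset by n - 1 days.
    return date(year, 1, 1) + timedelta(days=n - 1)


if __name__ == "__main__":
    # The example years below are assumptions chosen for illustration.
    for year in (2024, 2025):
        for n in (100, 153):
            print(year, n, nth_day_of_year(year, n).isoformat())
```

The answer depends on whether the year is a leap year, which is one reason such questions require numerical reasoning on top of reading the calendar image: the 100th day falls on April 10 in 2025 but on April 9 in 2024.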