

ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

February 13, 2025
Authors: Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, Kai Han, Samuel Albanie
cs.AI

Abstract

Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.
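
To illustrate the headline metric described in the abstract (exact-match accuracy over the 100 main questions and 334 subquestions, which every evaluated model fails at 0.0%), here is a minimal evaluation sketch. The dataset identifier, split names, and column names used below are assumptions for illustration, not the official release format, and `answer_question` is a placeholder for whichever LMM inference call is being scored.

```python
from datasets import load_dataset  # HuggingFace `datasets` library


def exact_match(prediction: str, answer: str) -> bool:
    """Simple exact-match check after normalising whitespace and case."""
    return prediction.strip().lower() == answer.strip().lower()


def evaluate(answer_question, dataset_id="jonathan-roberts1/zerobench"):
    """Score a model on ZeroBench's main questions and subquestions.

    `answer_question(image, question) -> str` is a placeholder for the LMM
    inference call. The dataset id, split names, and column names below are
    assumptions made for this sketch rather than the published schema.
    """
    results = {}
    for split in ("zerobench", "zerobench_subquestions"):  # assumed split names
        data = load_dataset(dataset_id, split=split)
        correct = 0
        for example in data:
            prediction = answer_question(example["question_image"],
                                         example["question_text"])
            correct += exact_match(prediction, example["question_answer"])
        results[split] = correct / len(data)
    return results
```

Under this setup, a model matching the paper's result would return an accuracy of 0.0 on the main split, with only partial credit visible on the subquestion split.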

