On Large Multimodal Models as Open-World Image Classifiers
March 27, 2025
Authors: Alessandro Conti, Massimiliano Mancini, Enrico Fini, Yiming Wang, Paolo Rota, Elisa Ricci
cs.AI
Abstract
Traditional image classification requires a predefined list of semantic
categories. In contrast, Large Multimodal Models (LMMs) can sidestep this
requirement by classifying images directly using natural language (e.g.,
answering the prompt "What is the main object in the image?"). Despite this
remarkable capability, most existing studies on LMM classification performance
are surprisingly limited in scope, often assuming a closed-world setting with a
predefined set of categories. In this work, we address this gap by thoroughly
evaluating LMM classification performance in a truly open-world setting. We
first formalize the task and introduce an evaluation protocol, defining various
metrics to assess the alignment between predicted and ground truth classes. We
then evaluate 13 models across 10 benchmarks, encompassing prototypical,
non-prototypical, fine-grained, and very fine-grained classes, demonstrating
the challenges LMMs face in this task. Further analyses based on the proposed
metrics reveal the types of errors LMMs make, highlighting challenges related
to granularity and fine-grained capabilities, showing how tailored prompting
and reasoning can alleviate them.
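To make the open-world evaluation setup concrete, the following is a minimal sketch of how a free-form LMM answer might be compared against a ground-truth class name. The function names, the normalization and containment heuristics, and the similarity threshold are illustrative assumptions, not the paper's actual protocol or metrics, which define several alignment measures and also analyze granularity-related errors.

```python
# Illustrative sketch only: a simple text-matching judgment between a free-form
# LMM prediction and a ground-truth class label. The heuristics and threshold
# below are assumptions for demonstration, not the paper's evaluation metrics.
from difflib import SequenceMatcher


def normalize(label: str) -> str:
    """Lowercase, strip trailing punctuation, and drop leading articles."""
    label = label.lower().strip(" .,:;!?")
    for article in ("a ", "an ", "the "):
        if label.startswith(article):
            label = label[len(article):]
    return label


def text_alignment(predicted: str, ground_truth: str, threshold: float = 0.8) -> bool:
    """Judge whether a free-form prediction matches the ground-truth class.

    Counts a match if one normalized string contains the other, or if their
    character-level similarity exceeds a threshold.
    """
    p, g = normalize(predicted), normalize(ground_truth)
    if p in g or g in p:
        return True
    return SequenceMatcher(None, p, g).ratio() >= threshold


# Example: prompted LMM answers versus a dataset label.
print(text_alignment("A golden retriever.", "golden retriever"))  # True (exact after normalization)
print(text_alignment("retriever", "golden retriever"))            # True (containment; can hide a granularity mismatch)
print(text_alignment("cat", "golden retriever"))                  # False
```

As the second example suggests, naive containment can mask coarser- or finer-grained predictions, which is one reason the paper's analysis treats granularity errors separately from outright mismatches.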