CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
October 24, 2024
Authors: Sara Ghaboura, Ahmed Heakl, Omkar Thawakar, Ali Alharthi, Ines Riahi, Abduljalil Saif, Jorma Laaksonen, Fahad S. Khan, Salman Khan, Rao M. Anwer
cs.AI
Abstract
Recent years have witnessed a significant interest in developing large
multimodal models (LMMs) capable of performing various visual reasoning and
understanding tasks. This has led to the introduction of multiple LMM
benchmarks to evaluate LMMs on different tasks. However, most existing LMM
evaluation benchmarks are predominantly English-centric. In this work, we
develop a comprehensive LMM evaluation benchmark for the Arabic language to
represent a large population of over 400 million speakers. The proposed
benchmark, named CAMEL-Bench, comprises eight diverse domains and 38
sub-domains, including multi-image understanding, complex visual perception,
handwritten document understanding, video understanding, medical imaging, plant
diseases, and remote sensing-based land use understanding to evaluate broad
scenario generalizability. Our CAMEL-Bench comprises around 29,036 questions
that are filtered from a larger pool of samples, where the quality is manually
verified by native speakers to ensure reliable model assessment. We evaluate
both closed-source LMMs, including the GPT-4 series, and open-source LMMs.
Our analysis reveals the need for substantial improvement, especially
among the best open-source models, with even the closed-source GPT-4o achieving
an overall score of 62%. Our benchmark and evaluation scripts are open-sourced.