CAMEL-Bench: A Comprehensive Arabic LMM Benchmark

Abstract

Recent years have witnessed significant interest in developing large multimodal models (LMMs) capable of performing various visual reasoning and understanding tasks. This has led to the introduction of multiple benchmarks that evaluate LMMs on different tasks. However, most existing LMM evaluation benchmarks are predominantly English-centric. In this work, we develop a comprehensive LMM evaluation benchmark for the Arabic language, which is spoken by a large population of over 400 million people. The proposed benchmark, named CAMEL-Bench, comprises eight diverse domains and 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, to evaluate broad scenario generalizability. CAMEL-Bench comprises around 29,036 questions filtered from a larger pool of samples, with quality manually verified by native speakers to ensure reliable model assessment. We evaluate both closed-source LMMs, including the GPT-4 series, and open-source LMMs. Our analysis reveals the need for substantial improvement, especially among the best open-source models, with even the closed-source GPT-4o achieving an overall score of only 62%. Our benchmark and evaluation scripts are open-sourced.
