CAMEL-Bench: A Comprehensive Arabic LMM Benchmark

Abstract

Recent years have seen significant interest in developing large multimodal models (LMMs) capable of performing a variety of visual reasoning and understanding tasks. This has led to the introduction of multiple benchmarks that evaluate LMMs on different tasks. However, most existing LMM evaluation benchmarks are predominantly English-centric. In this work, we develop a comprehensive LMM evaluation benchmark for the Arabic language, which is spoken by a population of over 400 million people. The proposed benchmark, named CAMEL-Bench, spans eight diverse domains and 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, to evaluate generalizability across a broad range of scenarios. CAMEL-Bench comprises 29,036 questions filtered from a larger pool of samples, with quality manually verified by native speakers to ensure reliable model assessment. We evaluate both closed-source LMMs, including the GPT-4 series, and open-source LMMs. Our analysis reveals the need for substantial improvement, especially among the best open-source models, with even the closed-source GPT-4o reaching an overall score of only 62%. Our benchmark and evaluation scripts are open-sourced.
