HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks
October 16, 2024
Authors: Fengji Zhang, Linquan Wu, Huiyu Bai, Guancheng Lin, Xiao Li, Xiao Yu, Yue Wang, Bei Chen, Jacky Keung
cs.AI
Abstract
Coding tasks have been valuable for evaluating Large Language Models (LLMs),
as they demand the comprehension of high-level instructions, complex reasoning,
and the implementation of functional programs -- core capabilities for
advancing Artificial General Intelligence. Despite the progress in Large
Multimodal Models (LMMs), which extend LLMs with visual perception and
understanding capabilities, there remains a notable lack of coding benchmarks
that rigorously assess these models, particularly in tasks that emphasize
visual reasoning. To address this gap, we introduce HumanEval-V, a novel and
lightweight benchmark specifically designed to evaluate LMMs' visual
understanding and reasoning capabilities through code generation. HumanEval-V
includes 108 carefully crafted, entry-level Python coding tasks derived from
platforms like CodeForces and Stack Overflow. Each task is adapted by modifying
the context and algorithmic patterns of the original problems, with visual
elements redrawn to ensure distinction from the source, preventing potential
data leakage. LMMs are required to complete the code solution based on the
provided visual context and a predefined Python function signature outlining
the task requirements. Every task is equipped with meticulously handcrafted
test cases to ensure a thorough and reliable evaluation of model-generated
solutions. We evaluate 19 state-of-the-art LMMs using HumanEval-V, uncovering
significant challenges. Proprietary models like GPT-4o achieve only 13% pass@1
and 36.4% pass@10, while open-weight models with 70B parameters score below 4%
pass@1. Ablation studies further reveal the limitations of current LMMs in
visual reasoning and coding capabilities. These results underscore key areas
for future research to enhance LMMs' capabilities. We have open-sourced our
code and benchmark at https://github.com/HumanEval-V/HumanEval-V-Benchmark.
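To make the task format concrete, here is a minimal sketch of what a HumanEval-V-style item looks like: a predefined Python function signature whose docstring states the requirements conveyed by the task image, which the model must complete, plus handcrafted test cases that judge the generated solution. The function name, docstring, and tests below are hypothetical illustrations, not an actual benchmark task.

```python
# Hypothetical task in the style described in the abstract (not taken
# from the real benchmark): the model sees an image plus this signature
# and must fill in the body; handcrafted tests check the completion.
from typing import List

def count_shaded_cells(grid: List[List[int]]) -> int:
    """Given a grid encoding of the figure shown in the task image
    (1 = shaded cell, 0 = empty cell), return the number of shaded cells."""
    # A reference solution a model would be expected to produce.
    return sum(cell for row in grid for cell in row)

# Handcrafted test cases, mirroring the per-task unit tests.
assert count_shaded_cells([[1, 0], [0, 1]]) == 2
assert count_shaded_cells([[0, 0, 0]]) == 0
```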
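The pass@1 and pass@10 figures reported above follow the standard functional-correctness protocol: sample several candidate solutions per task, run each against the handcrafted tests, and estimate the probability that at least one of k samples passes. The sketch below uses the unbiased estimator introduced with the original HumanEval benchmark (Chen et al., 2021); the sample budget shown is an assumed example, as the abstract does not state how many completions are drawn per task.

```python
# Minimal sketch of the unbiased pass@k estimator (Chen et al., 2021):
# given n sampled solutions of which c pass all tests, estimate the
# probability that at least one of k randomly chosen samples is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Return 1 - C(n - c, k) / C(n, k), the chance that a draw of k
    samples from the n generations contains at least one passing sample."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with an assumed budget of 20 samples per task, 3 of them correct.
print(round(pass_at_k(20, 3, 1), 3))   # 0.15
print(round(pass_at_k(20, 3, 10), 3))  # 0.895
```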