HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks
October 16, 2024
Authors: Fengji Zhang, Linquan Wu, Huiyu Bai, Guancheng Lin, Xiao Li, Xiao Yu, Yue Wang, Bei Chen, Jacky Keung
cs.AI
Abstract
Coding tasks have been valuable for evaluating Large Language Models (LLMs),
as they demand the comprehension of high-level instructions, complex reasoning,
and the implementation of functional programs -- core capabilities for
advancing Artificial General Intelligence. Despite the progress in Large
Multimodal Models (LMMs), which extend LLMs with visual perception and
understanding capabilities, there remains a notable lack of coding benchmarks
that rigorously assess these models, particularly in tasks that emphasize
visual reasoning. To address this gap, we introduce HumanEval-V, a novel and
lightweight benchmark specifically designed to evaluate LMMs' visual
understanding and reasoning capabilities through code generation. HumanEval-V
includes 108 carefully crafted, entry-level Python coding tasks derived from
platforms like CodeForces and Stack Overflow. Each task is adapted by modifying
the context and algorithmic patterns of the original problems, with visual
elements redrawn to ensure distinction from the source, preventing potential
data leakage. LMMs are required to complete the code solution based on the
provided visual context and a predefined Python function signature outlining
the task requirements. Every task is equipped with meticulously handcrafted
test cases to ensure a thorough and reliable evaluation of model-generated
solutions. We evaluate 19 state-of-the-art LMMs using HumanEval-V, uncovering
significant challenges. Proprietary models like GPT-4o achieve only 13% pass@1
and 36.4% pass@10, while open-weight models with 70B parameters score below 4%
pass@1. Ablation studies further reveal the limitations of current LMMs in
visual reasoning and coding capabilities. These results underscore key areas
for future research to enhance LMMs' capabilities. We have open-sourced our
code and benchmark at https://github.com/HumanEval-V/HumanEval-V-Benchmark.
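To make the task format concrete, here is a minimal sketch of what a HumanEval-V-style item looks like: a predefined Python function signature whose docstring states the requirements conveyed by the task image, which the model must complete, plus handcrafted test cases that judge the generated solution. The function name, docstring, and tests below are hypothetical illustrations, not an actual benchmark task.

```python
# Hypothetical task in the style described in the abstract (not taken
# from the real benchmark): the model sees an image plus this signature
# and must fill in the body; handcrafted tests check the completion.
from typing import List

def count_shaded_cells(grid: List[List[int]]) -> int:
    """Given a grid encoding of the figure shown in the task image
    (1 = shaded cell, 0 = empty cell), return the number of shaded cells."""
    # A reference solution a model would be expected to produce.
    return sum(cell for row in grid for cell in row)

# Handcrafted test cases, mirroring the per-task unit tests.
assert count_shaded_cells([[1, 0], [0, 1]]) == 2
assert count_shaded_cells([[0, 0, 0]]) == 0
```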
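The pass@1 and pass@10 figures reported above follow the standard functional-correctness protocol: sample several candidate solutions per task, run each against the handcrafted tests, and estimate the probability that at least one of k samples passes. The sketch below uses the unbiased estimator introduced with the original HumanEval benchmark (Chen et al., 2021); the sample budget shown is an assumed example, as the abstract does not state how many completions are drawn per task.

```python
# Minimal sketch of the unbiased pass@k estimator (Chen et al., 2021):
# given n sampled solutions of which c pass all tests, estimate the
# probability that at least one of k randomly chosen samples is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Return 1 - C(n - c, k) / C(n, k), the chance that a draw of k
    samples from the n generations contains at least one passing sample."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with an assumed budget of 20 samples per task, 3 of them correct.
print(round(pass_at_k(20, 3, 1), 3))   # 0.15
print(round(pass_at_k(20, 3, 10), 3))  # 0.895
```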