HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

Abstract

Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they demand the comprehension of high-level instructions, complex reasoning, and the implementation of functional programs -- core capabilities for advancing Artificial General Intelligence. Despite the progress in Large Multimodal Models (LMMs), which extend LLMs with visual perception and understanding capabilities, there remains a notable lack of coding benchmarks that rigorously assess these models, particularly in tasks that emphasize visual reasoning. To address this gap, we introduce HumanEval-V, a novel and lightweight benchmark specifically designed to evaluate LMMs' visual understanding and reasoning capabilities through code generation. HumanEval-V includes 108 carefully crafted, entry-level Python coding tasks derived from platforms like CodeForces and Stack Overflow. Each task is adapted by modifying the context and algorithmic patterns of the original problems, with visual elements redrawn to ensure distinction from the source, preventing potential data leakage. LMMs are required to complete the code solution based on the provided visual context and a predefined Python function signature outlining the task requirements. Every task is equipped with meticulously handcrafted test cases to ensure a thorough and reliable evaluation of model-generated solutions. We evaluate 19 state-of-the-art LMMs using HumanEval-V, uncovering significant challenges. Proprietary models like GPT-4o achieve only 13% pass@1 and 36.4% pass@10, while open-weight models with 70B parameters score below 4% pass@1. Ablation studies further reveal the limitations of current LMMs in vision reasoning and coding capabilities. These results underscore key areas for future research to enhance LMMs' capabilities. We have open-sourced our code and benchmark at https://github.com/HumanEval-V/HumanEval-V-Benchmark.
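The abstract reports pass@1 and pass@10 but does not say how they are computed. As a minimal sketch, assuming the benchmark follows the unbiased pass@k estimator commonly used for HumanEval-style evaluation (draw n samples per task, count the c samples that pass every test case, and estimate the chance that at least one of k drawn samples is correct), the metric could be computed as follows. The function names `pass_at_k` and `benchmark_pass_at_k` are illustrative and are not taken from the HumanEval-V repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task.

    n: number of solutions sampled for the task
    c: number of those solutions that passed all test cases
    k: sample budget being evaluated (k <= n)

    Assumption: this is the standard unbiased estimator,
    1 - C(n - c, k) / C(n, k), not a formula quoted from the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every draw of k contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, given one (n, c) pair per task."""
    if not results:
        raise ValueError("results must contain at least one task")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a task with 4 sampled solutions of which 1 is correct gives `pass_at_k(4, 1, 2) == 0.5`, and the benchmark-level score is the mean of the per-task estimates across all tasks.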
