OmniBench: Towards The Future of Universal Omni-Language Models

Abstract

Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains inadequately explored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) open-source OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts; and ii) the baseline models perform poorly (below 50% accuracy) even when provided with alternative textual representations of images and audio. These results suggest that the ability to construct a consistent context from text, image, and audio is often overlooked in existing MLLM training paradigms. We advocate for future research to focus on developing more robust tri-modal integration techniques and training strategies to enhance OLM performance across diverse modalities. The code and live leaderboard can be found at https://m-a-p.ai/OmniBench.
