MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

October 17, 2024
作者: Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You, Michael Shieh
cs.AI

Abstract

Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any real-world benchmark designed to optimize and standardize evaluations across input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show that our approach effectively aligns benchmark samples with real-world task distributions, and that model rankings correlate strongly with those of crowd-sourced real-world evaluations (up to 0.98). We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.
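As a rough illustration of the meta-evaluation claim above (the "up to 0.98" figure), the sketch below shows one common way to quantify how strongly a benchmark's model ranking agrees with a crowd-sourced ranking: Spearman rank correlation. The model names and scores are entirely hypothetical, and this is not the paper's actual protocol, just a minimal sketch of the general technique.

```python
# Minimal sketch (not from the paper): measure rank agreement between
# hypothetical benchmark scores and hypothetical crowd-sourced ratings.
from scipy.stats import spearmanr

# Illustrative numbers only; real meta-evaluations would use many more
# models and the benchmark's own scoring pipeline.
benchmark_scores = {"model_a": 71.2, "model_b": 64.5, "model_c": 58.9, "model_d": 52.3}
crowd_ratings = {"model_a": 1250, "model_b": 1210, "model_c": 1100, "model_d": 1080}

# Align the two score lists on the same model ordering before correlating.
models = sorted(benchmark_scores)
rho, p_value = spearmanr(
    [benchmark_scores[m] for m in models],
    [crowd_ratings[m] for m in models],
)
print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3f})")
```

Spearman correlation is a natural fit here because it compares rankings rather than raw scores, so a benchmark on a 0–100 scale can still be checked against crowd ratings on an Elo-like scale.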
