InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback
February 20, 2025
Authors: Henry Hengyuan Zhao, Wenqi Pei, Yifei Tao, Haiyang Mei, Mike Zheng Shou
cs.AI
Abstract
Existing benchmarks do not test Large Multimodal Models (LMMs) on their
interactive intelligence with human users, which is vital for developing
general-purpose AI assistants. We design InterFeedback, an interactive
framework that can be applied to any LMM and dataset to assess this ability
autonomously. On top of this, we introduce InterFeedback-Bench, which evaluates
interactive intelligence using two representative datasets, MMMU-Pro and
MathVerse, to test 10 different open-source LMMs. Additionally, we present
InterFeedback-Human, a newly collected dataset of 120 cases designed for
manually testing interactive performance in leading models such as OpenAI-o1
and Claude-3.5-Sonnet. Our evaluation results show that even state-of-the-art
LMMs such as OpenAI-o1 correct their results through human feedback less than
50% of the time. Our findings point to the need for methods that can enhance
LMMs' capability to interpret and benefit from feedback.
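The feedback-driven evaluation the abstract describes can be sketched as a simple loop: query the model, and if the answer is wrong, return corrective feedback and let it retry for a bounded number of rounds. The sketch below is a hypothetical illustration under assumed interfaces, not the paper's actual InterFeedback implementation; `evaluate_with_feedback` and `ToyModel` are invented names.

```python
# Hypothetical sketch of a feedback-based evaluation loop.
# Assumption: the model is any callable that maps a prompt string to an answer.

def evaluate_with_feedback(model, question, ground_truth, max_rounds=3):
    """Query the model; on a wrong answer, send feedback and let it retry.

    Returns (solved, rounds_used), where rounds_used counts answer checks.
    """
    answer = model(question)
    for round_idx in range(1, max_rounds + 1):
        if answer == ground_truth:
            return True, round_idx
        # Simulated human feedback: flag the error without revealing the answer.
        feedback = f"Your answer '{answer}' is incorrect. Please try again."
        answer = model(question + "\n" + feedback)
    return answer == ground_truth, max_rounds


class ToyModel:
    """Stub model that answers correctly only after receiving feedback."""

    def __call__(self, prompt):
        return "B" if "incorrect" in prompt else "A"


solved, rounds = evaluate_with_feedback(ToyModel(), "Q: pick B", "B")
# solved is True; the toy model needed one round of feedback.
```

A real harness would swap `ToyModel` for an LMM API call and replace exact-match checking with the benchmark's answer-verification logic; the per-round success rates are what the paper aggregates across models and datasets.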