InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback

February 20, 2025
Authors: Henry Hengyuan Zhao, Wenqi Pei, Yifei Tao, Haiyang Mei, Mike Zheng Shou
cs.AI

Abstract

Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, which is vital for developing general-purpose AI assistants. We design InterFeedback, an interactive framework that can be applied to any LMM and dataset to assess this ability autonomously. On top of this, we introduce InterFeedback-Bench, which evaluates interactive intelligence using two representative datasets, MMMU-Pro and MathVerse, to test 10 different open-source LMMs. Additionally, we present InterFeedback-Human, a newly collected dataset of 120 cases designed for manually testing interactive performance in leading models such as OpenAI-o1 and Claude-3.5-Sonnet. Our evaluation results show that even state-of-the-art LMMs (such as OpenAI-o1) correct their results through human feedback in fewer than 50% of cases. Our findings point to the need for methods that can enhance LMMs' ability to interpret and benefit from feedback.
