Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure
April 14, 2025
Authors: Théo Gigant, Camille Guinaudeau, Frédéric Dufaux
cs.AI
Abstract
Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can be beneficially used as input in place of the raw video, and that a structured representation built from interleaved slides and transcript provides the best performance. Finally, we reflect and comment on the nature of cross-modal interactions in multimodal presentations and share suggestions to improve the capabilities of VLMs to understand documents of this nature.
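To make the best-performing input format concrete, the following is a minimal sketch of how slides and transcript might be interleaved into a single multimodal input by aligning each slide with the words spoken while it is on screen. The Slide and TranscriptSegment types, the timestamp-based alignment, and the {"type": "image"/"text"} content schema are illustrative assumptions, not the authors' implementation, which the abstract does not specify.

from dataclasses import dataclass

@dataclass
class TranscriptSegment:
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    text: str         # words spoken during this span

@dataclass
class Slide:
    timestamp: float  # time the slide first appears in the video
    image_path: str   # frame extracted from the video stream

def interleave(slides: list[Slide], segments: list[TranscriptSegment]) -> list[dict]:
    """Interleave slide images with the transcript text spoken while each
    slide is on screen, producing a multimodal content list in the common
    {"type": ...} chat format accepted by many VLM APIs (schema assumed)."""
    content: list[dict] = []
    for i, slide in enumerate(slides):
        # A slide stays on screen until the next slide appears (or video end).
        next_ts = slides[i + 1].timestamp if i + 1 < len(slides) else float("inf")
        content.append({"type": "image", "image": slide.image_path})
        spoken = " ".join(
            seg.text for seg in segments if slide.timestamp <= seg.start < next_ts
        )
        if spoken:
            content.append({"type": "text", "text": spoken})
    content.append({"type": "text", "text": "Summarize this presentation."})
    return content

The resulting content list can be wrapped in a single user message and passed to any VLM chat endpoint that accepts mixed image and text content; the per-slide grouping is what makes the representation structured rather than a flat concatenation of frames and transcript.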