Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure
April 14, 2025
Authors: Théo Gigant, Camille Guinaudeau, Frédéric Dufaux
cs.AI
Abstract
Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can be beneficially used as input in place of the raw video, and that a structured representation built from interleaved slides and transcript provides the best performance. Finally, we reflect and comment on the nature of cross-modal interactions in multimodal presentations and share suggestions to improve the capabilities of VLMs to understand documents of this nature.
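To make the best-performing input format concrete, the following is a minimal sketch of how slides and transcript might be interleaved into a single multimodal input by aligning each slide with the words spoken while it is on screen. The Slide and TranscriptSegment types, the timestamp-based alignment, and the {"type": "image"/"text"} content schema are illustrative assumptions, not the authors' implementation, which the abstract does not specify.

from dataclasses import dataclass

@dataclass
class TranscriptSegment:
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    text: str         # words spoken during this span

@dataclass
class Slide:
    timestamp: float  # time the slide first appears in the video
    image_path: str   # frame extracted from the video stream

def interleave(slides: list[Slide], segments: list[TranscriptSegment]) -> list[dict]:
    """Interleave slide images with the transcript text spoken while each
    slide is on screen, producing a multimodal content list in the common
    {"type": ...} chat format accepted by many VLM APIs (schema assumed)."""
    content: list[dict] = []
    for i, slide in enumerate(slides):
        # A slide stays on screen until the next slide appears (or video end).
        next_ts = slides[i + 1].timestamp if i + 1 < len(slides) else float("inf")
        content.append({"type": "image", "image": slide.image_path})
        spoken = " ".join(
            seg.text for seg in segments if slide.timestamp <= seg.start < next_ts
        )
        if spoken:
            content.append({"type": "text", "text": spoken})
    content.append({"type": "text", "text": "Summarize this presentation."})
    return content

The resulting content list can be wrapped in a single user message and passed to any VLM chat endpoint that accepts mixed image and text content; the per-slide grouping is what makes the representation structured rather than a flat concatenation of frames and transcript.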