

Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure

April 14, 2025
Authors: Théo Gigant, Camille Guinaudeau, Frédéric Dufaux
cs.AI

Abstract

Vision-Language Models (VLMs) can process visual and textual information in multiple formats: text, images, interleaved text and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various input representations. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can beneficially be used as input instead of the raw video, and that a structured representation built from interleaved slides and transcript provides the best performance. Finally, we reflect and comment on the nature of cross-modal interactions in multimodal presentations and share suggestions to improve the capabilities of VLMs to understand documents of this nature.
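
To make the "interleaved slides and transcript" representation concrete, below is a minimal sketch of how such an input could be assembled. It assumes slides have already been extracted from the video stream and the transcript has been segmented per slide; the variable names (`slides`, `segments`) are hypothetical placeholders, and the message layout follows the interleaved image/text content format accepted by many VLM chat interfaces, not the paper's exact pipeline.

```python
# Sketch: build an interleaved slides + transcript input for a VLM.
# Assumes slide images and per-slide transcript segments already exist;
# file names and segment texts below are illustrative placeholders.

slides = ["slide_01.png", "slide_02.png"]           # frames extracted from the video
segments = ["Intro: today we cover ...",            # transcript text aligned
            "Our method builds on ..."]             # to each slide

content = []
for slide, text in zip(slides, segments):
    content.append({"type": "image", "image": slide})  # visual modality
    content.append({"type": "text", "text": text})     # spoken transcript

content.append({"type": "text", "text": "Summarize this presentation."})

messages = [{"role": "user", "content": content}]
# `messages` can then be passed to a VLM chat endpoint or to a
# processor's chat template, depending on the model being used.
```

Compared with feeding the raw video, this structure keeps the slide images and their spoken context adjacent, which is the cross-modal pairing the abstract credits for the best summarization performance.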

