Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure
April 14, 2025
Authors: Théo Gigant, Camille Guinaudeau, Frédéric Dufaux
cs.AI
Abstract
Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can be beneficially used as input in place of the raw video, and that a structured representation built from interleaved slides and transcript provides the best performance. Finally, we reflect and comment on the nature of cross-modal interactions in multimodal presentations and share suggestions to improve the capabilities of VLMs to understand documents of this nature.
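The best-performing input reported here is a structured representation that interleaves slide images with the aligned transcript. As a rough illustration only, not the authors' implementation, the sketch below builds such an interleaved multimodal message under a simple character budget; the `SlideSegment` type, the message schema, and `build_interleaved_input` are hypothetical names modeled on common VLM chat-message formats.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class SlideSegment:
    """One extracted slide and the transcript aligned to its time span."""
    image_path: str   # slide frame extracted from the video stream
    transcript: str   # ASR transcript spoken while this slide was on screen

def build_interleaved_input(segments: List[SlideSegment],
                            budget_chars: int) -> List[Dict[str, Any]]:
    """Interleave slides and transcript into a single multimodal message,
    truncating transcript text once the input-length budget is spent."""
    content: List[Dict[str, Any]] = []
    used = 0
    for seg in segments:
        # Place each slide image immediately before its transcript segment,
        # preserving the presentation's structure for the VLM.
        content.append({"type": "image", "path": seg.image_path})
        remaining = budget_chars - used
        if remaining <= 0:
            continue  # keep the slide, drop the text once over budget
        text = seg.transcript[:remaining]
        used += len(text)
        content.append({"type": "text", "text": text})
    content.append({"type": "text", "text": "Summarize this presentation."})
    return [{"role": "user", "content": content}]

# Hypothetical usage: three slides with aligned transcript, 2000-char budget.
messages = build_interleaved_input(
    [SlideSegment("slides/slide_01.png", "Welcome, today we discuss..."),
     SlideSegment("slides/slide_02.png", "Our method interleaves..."),
     SlideSegment("slides/slide_03.png", "In conclusion...")],
    budget_chars=2000,
)
```

Keeping the slides even after the text budget is exhausted mirrors the paper's finding that extracted slides are a cost-effective substitute for raw video; in practice the resulting message list would be handed to a VLM's chat-template interface.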