Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure
April 14, 2025
Authors: Théo Gigant, Camille Guinaudeau, Frédéric Dufaux
cs.AI
Abstract
Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can be beneficially used as input in place of the raw video, and that a structured representation built from interleaved slides and transcript provides the best performance. Finally, we reflect and comment on the nature of cross-modal interactions in multimodal presentations and share suggestions to improve the capabilities of VLMs to understand documents of this nature.
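The best-performing input reported here is a structured representation that interleaves slide images with the aligned transcript. As a rough illustration only, not the authors' implementation, the sketch below builds such an interleaved multimodal message under a simple character budget; the `SlideSegment` type, the message schema, and `build_interleaved_input` are hypothetical names modeled on common VLM chat-message formats.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class SlideSegment:
    """One extracted slide and the transcript aligned to its time span."""
    image_path: str   # slide frame extracted from the video stream
    transcript: str   # ASR transcript spoken while this slide was on screen

def build_interleaved_input(segments: List[SlideSegment],
                            budget_chars: int) -> List[Dict[str, Any]]:
    """Interleave slides and transcript into a single multimodal message,
    truncating transcript text once the input-length budget is spent."""
    content: List[Dict[str, Any]] = []
    used = 0
    for seg in segments:
        # Place each slide image immediately before its transcript segment,
        # preserving the presentation's structure for the VLM.
        content.append({"type": "image", "path": seg.image_path})
        remaining = budget_chars - used
        if remaining <= 0:
            continue  # keep the slide, drop the text once over budget
        text = seg.transcript[:remaining]
        used += len(text)
        content.append({"type": "text", "text": text})
    content.append({"type": "text", "text": "Summarize this presentation."})
    return [{"role": "user", "content": content}]

# Hypothetical usage: three slides with aligned transcript, 2000-char budget.
messages = build_interleaved_input(
    [SlideSegment("slides/slide_01.png", "Welcome, today we discuss..."),
     SlideSegment("slides/slide_02.png", "Our method interleaves..."),
     SlideSegment("slides/slide_03.png", "In conclusion...")],
    budget_chars=2000,
)
```

Keeping the slides even after the text budget is exhausted mirrors the paper's finding that extracted slides are a cost-effective substitute for raw video; in practice the resulting message list would be handed to a VLM's chat-template interface.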