CoS: Chain-of-Shot Prompting for Long Video Understanding
February 10, 2025
Authors: Jian Hu, Zixu Cheng, Chenyang Si, Wei Li, Shaogang Gong
cs.AI
Abstract
Multi-modal Large Language Models (MLLMs) struggle with long videos due to
the need for excessive visual tokens. These tokens massively exceed the
context length of MLLMs, so the context becomes filled with redundant,
task-irrelevant shots. How to select shots is an unsolved critical problem:
sparse sampling risks missing key details, while exhaustive sampling
overwhelms the model with irrelevant content, leading to video
misunderstanding. To solve this problem, we propose Chain-of-Shot prompting
(CoS). The key idea is to frame shot selection as test-time visual prompt
optimisation, choosing shots adaptive to the semantic task of video
understanding by optimising shot-task alignment. CoS has two key parts:
(1) a binary video summary mechanism that performs pseudo temporal
grounding, discovering a binary coding to identify task-relevant shots, and
(2) a video co-reasoning module that deploys the binary coding to pair
(learning to align) task-relevant positive shots with irrelevant negative
shots. It embeds the optimised shot selections into the original video,
facilitating a focus on relevant context and optimising long video
understanding. Experiments across three baselines and five datasets
demonstrate the effectiveness and adaptability of CoS. Code is available at
https://lwpyh.github.io/CoS.