CoS: Chain-of-Shot Prompting for Long Video Understanding

Abstract

Multi-modal Large Language Models (MLLMs) struggle with long videos because they require an excessive number of visual tokens. These tokens massively exceed the context length of MLLMs, so the context fills with redundant, task-irrelevant shots. How to select shots remains an unsolved, critical problem: sparse sampling risks missing key details, while exhaustive sampling overwhelms the model with irrelevant content, leading to video misunderstanding. To solve this problem, we propose Chain-of-Shot prompting (CoS). The key idea is to frame shot selection as test-time visual prompt optimisation, choosing shots adaptively for the semantic task of video understanding by optimising shot-task alignment. CoS has two key parts: (1) a binary video summary mechanism that performs pseudo temporal grounding, discovering a binary coding to identify task-relevant shots, and (2) a video co-reasoning module that deploys the binary coding to pair (learning to align) task-relevant positive shots with irrelevant negative shots. It embeds the optimised shot selections into the original video, facilitating a focus on relevant context to optimise long video understanding. Experiments across three baselines and five datasets demonstrate the effectiveness and adaptability of CoS. Code is available at https://lwpyh.github.io/CoS.
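
As a rough illustration of the binary coding idea in the abstract, the sketch below marks each shot as task-relevant (1) or irrelevant (0) and then splits the shots into positive and negative sets. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how the binary coding is discovered, so the per-shot relevance scores, the fixed threshold, and the function names `binary_code` and `split_shots` are all hypothetical.

```python
from typing import List, Sequence, Tuple


def binary_code(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Mark each shot 1 (task-relevant) or 0 (irrelevant).

    Assumption: a per-shot relevance score is already available and a
    simple threshold is used. The abstract does not specify either.
    """
    return [1 if s >= threshold else 0 for s in scores]


def split_shots(
    shots: Sequence[str], code: Sequence[int]
) -> Tuple[List[str], List[str]]:
    """Split shots into positive and negative sets using the binary code."""
    if len(shots) != len(code):
        raise ValueError("shots and code must have the same length")
    positive = [s for s, c in zip(shots, code) if c == 1]
    negative = [s for s, c in zip(shots, code) if c == 0]
    return positive, negative


# Hypothetical shot identifiers and relevance scores for one question.
shots = ["shot_0", "shot_1", "shot_2", "shot_3"]
scores = [0.1, 0.8, 0.4, 0.9]

code = binary_code(scores)
positive, negative = split_shots(shots, code)
```

In this toy setup the positive and negative sets would then be passed, alongside the original video, to a model for co-reasoning; that step is omitted here because the abstract gives no detail on how it is carried out.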
