QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension
March 11, 2025
Authors: Yongdong Luo, Wang Chen, Xiawu Zheng, Weizhong Huang, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Jiebo Luo, Rongrong Ji
cs.AI
Abstract
Recent advances in long video understanding typically mitigate visual
redundancy through visual token pruning based on attention distribution.
However, while existing methods employ post-hoc low-response token pruning in
decoder layers, they overlook the input-level semantic correlation between
visual tokens and the instruction (query). In this paper, we propose QuoTA, a
training-free, ante-hoc module that extends existing large video-language
models (LVLMs) with visual token assignment based on query-oriented frame-level
importance assessment. Query-oriented token selection is crucial because it
aligns visual processing with task-specific requirements, optimizing token
budget utilization while preserving semantically relevant content.
Specifically, (i) QuoTA strategically assigns frame-level importance scores
based on query relevance, enabling one-time visual token assignment before
cross-modal interactions in decoder layers, (ii) we decouple the query through
Chain-of-Thought reasoning to facilitate more precise LVLM-based frame
importance scoring, and (iii) QuoTA offers plug-and-play functionality that
extends to existing LVLMs. Extensive experimental results demonstrate that
implementing QuoTA with LLaVA-Video-7B yields an average performance
improvement of 3.2% across six benchmarks (including Video-MME and MLVU) while
operating within the same visual token budget as the baseline. Code is
open-sourced at https://github.com/MAC-AutoML/QuoTA.
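To make the mechanism concrete, below is a minimal Python sketch of the pipeline the abstract describes: decouple the query into a relevance-scoring prompt, score each sampled frame for query relevance, then assign a fixed visual-token budget across frames in proportion to their scores before decoding. The function names, the 1-to-5 scoring scale, and the floor-plus-remainder allocation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def decouple_query(query: str) -> str:
    """Hypothetical Chain-of-Thought query decoupling: turn the task query
    into a focused per-frame relevance question for a scorer LVLM. In the
    paper this step is done by an LLM; here we only build the prompt."""
    return (
        "Think step by step about what visual evidence is needed to answer: "
        f"'{query}'. Then rate this frame's relevance with a single integer "
        "from 1 (irrelevant) to 5 (essential)."
    )

def score_frames(frames, prompt, score_fn):
    """Query-oriented frame-level importance assessment: ask the scorer
    (an LVLM in the paper; any callable here) to rate each frame."""
    return np.array([score_fn(frame, prompt) for frame in frames], dtype=float)

def assign_token_budget(scores, total_budget, min_tokens=1):
    """One-time, ante-hoc token assignment: split a fixed visual-token
    budget across frames in proportion to their relevance scores, before
    any cross-modal interaction in the decoder layers."""
    weights = scores / scores.sum()
    alloc = np.maximum(min_tokens, np.floor(weights * total_budget)).astype(int)
    # Give any leftover tokens to the highest-scoring frames.
    leftover = int(total_budget - alloc.sum())
    for idx in np.argsort(-scores)[: max(leftover, 0)]:
        alloc[idx] += 1
    return alloc

# Toy usage: four "frames", a dummy scorer, and a 64-token budget.
frames = ["f0", "f1", "f2", "f3"]
prompt = decouple_query("What color is the car that appears at the end?")
scores = score_frames(frames, prompt,
                      lambda f, p: {"f0": 1, "f1": 1, "f2": 3, "f3": 5}[f])
print(assign_token_budget(scores, total_budget=64))  # -> [ 6  6 19 33]
```

In the method itself the scorer is an LVLM and assignment operates on visual tokens rather than bare integer counts, but the abstract's key constraint, a fixed total token budget redistributed by query relevance, is what the last function encodes.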