VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding

December 3, 2024
Authors: Kangsan Kim, Geon Park, Youngwan Lee, Woongyeong Yeo, Sung Ju Hwang
cs.AI

Abstract

Recent advancements in video large multimodal models (LMMs) have significantly improved their video understanding and reasoning capabilities. However, their performance drops on out-of-distribution (OOD) tasks that are underrepresented in training data. Traditional methods like fine-tuning on OOD datasets are impractical due to high computational costs. While in-context learning (ICL) with demonstration examples has shown promising generalization performance in language tasks and image-language tasks without fine-tuning, applying ICL to video-language tasks faces challenges due to the limited context length in video LMMs, as videos require longer token lengths. To address these issues, we propose VideoICL, a novel video in-context learning framework for OOD tasks that introduces a similarity-based relevant example selection strategy and a confidence-based iterative inference approach. This allows the framework to select the most relevant examples and rank them by similarity for use in inference. If the generated response has low confidence, our framework selects new examples and performs inference again, iteratively refining the results until a high-confidence response is obtained. This approach improves OOD video understanding performance by extending the effective context length without incurring high costs. Experimental results on multiple benchmarks demonstrate significant performance gains, especially in domain-specific scenarios, laying the groundwork for broader video comprehension applications. Code will be released at https://github.com/KangsanKim07/VideoICL.
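
The abstract describes a two-part loop: rank the candidate example pool by similarity to the query, run inference with the top-ranked batch, and, if the response's confidence is too low, retry with the next batch of examples. The following is a minimal sketch of that loop in Python; all names (`embed`, `generate`, `confidence`, the batch size, and the confidence threshold) are hypothetical placeholders for illustration, not the authors' actual API or hyperparameters.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def video_icl(query, example_pool, embed, generate, confidence,
              batch_size=4, threshold=0.9, max_rounds=5):
    """Confidence-based iterative in-context learning (sketch).

    Assumed callables (not part of any released API):
      embed(x)        -> np.ndarray embedding of a query or example
      generate(q, ex) -> video-LMM response given the query and examples
      confidence(r)   -> scalar confidence score for a response
    """
    # Rank the example pool by similarity to the query, most similar first.
    q_emb = embed(query)
    ranked = sorted(example_pool,
                    key=lambda ex: cosine_sim(q_emb, embed(ex)),
                    reverse=True)

    best_response, best_conf = None, -1.0
    for round_idx in range(max_rounds):
        # Take the next batch of top-ranked examples; stop when exhausted.
        batch = ranked[round_idx * batch_size:(round_idx + 1) * batch_size]
        if not batch:
            break
        response = generate(query, batch)
        conf = confidence(response)
        if conf >= threshold:
            # High-confidence response: accept and stop iterating.
            return response
        if conf > best_conf:
            # Low confidence: remember the best answer so far and retry
            # with fresh examples in the next round.
            best_response, best_conf = response, conf
    return best_response
```

Because each round sees only a small batch of demonstrations, the loop extends the effective context length across rounds without ever exceeding the model's per-call context window, which is the cost argument the abstract makes.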
