VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding
December 3, 2024
Authors: Kangsan Kim, Geon Park, Youngwan Lee, Woongyeong Yeo, Sung Ju Hwang
cs.AI
Abstract
Recent advancements in video large multimodal models (LMMs) have
significantly improved their video understanding and reasoning capabilities.
However, their performance drops on out-of-distribution (OOD) tasks that are
underrepresented in training data. Traditional methods like fine-tuning on OOD
datasets are impractical due to high computational costs. While In-context
learning (ICL) with demonstration examples has shown promising generalization
performance in language tasks and image-language tasks without fine-tuning,
applying ICL to video-language tasks faces challenges due to the limited
context length in Video LMMs, as videos require many more tokens. To
address these issues, we propose VideoICL, a novel video in-context learning
framework for OOD tasks that introduces a similarity-based relevant example
selection strategy and a confidence-based iterative inference approach. This
enables selecting the most relevant examples and ranking them by similarity
for use in inference. If the generated response has low confidence, our
framework selects new examples and performs inference again, iteratively
refining the results until a high-confidence response is obtained. This
approach improves OOD video understanding performance by extending effective
context length without incurring high costs. The experimental results on
multiple benchmarks demonstrate significant performance gains, especially in
domain-specific scenarios, laying the groundwork for broader video
comprehension applications. Code will be released at
https://github.com/KangsanKim07/VideoICL
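The two-stage procedure the abstract describes — rank candidate examples by similarity to the query, then re-run inference with the next batch of examples whenever the response confidence is low — can be sketched as below. Every name here (`rank_by_similarity`, `iterative_inference`, the `infer` callback, the word-overlap similarity, and the 0.8 confidence threshold) is an illustrative assumption, not the authors' actual implementation.

```python
# Hedged sketch of VideoICL's confidence-based iterative inference loop.
# The similarity measure and confidence source are toy stand-ins; in the
# paper these would come from video/text embeddings and the Video LMM.

def rank_by_similarity(query, pool):
    """Rank candidate examples by similarity to the query (toy: shared words)."""
    def score(example):
        return len(set(query.split()) & set(example["text"].split()))
    return sorted(pool, key=score, reverse=True)

def iterative_inference(query, pool, infer, k=2, conf_threshold=0.8, max_rounds=3):
    """Re-run inference with the next batch of top-ranked examples until a
    high-confidence response is obtained (or the round budget runs out)."""
    ranked = rank_by_similarity(query, pool)
    best = None
    for round_idx in range(max_rounds):
        batch = ranked[round_idx * k:(round_idx + 1) * k]
        if not batch:
            break  # ranked pool exhausted
        answer, confidence = infer(query, batch)
        if best is None or confidence > best[1]:
            best = (answer, confidence)
        if confidence >= conf_threshold:
            break  # high-confidence response: stop iterating
    return best

# Toy demonstration with a two-example pool and a stub model.
pool = [
    {"text": "a person cooking pasta", "label": "cooking"},
    {"text": "a dog running in a park", "label": "pets"},
]

def toy_infer(query, batch):
    """Stub for the Video LMM: confident only when a relevant example is in context."""
    if any("cooking" in ex["text"] for ex in batch):
        return ("cooking", 0.9)
    return ("unknown", 0.5)

answer, confidence = iterative_inference("someone cooking dinner", pool, toy_infer, k=1)
# → ("cooking", 0.9): the relevant example ranks first, so one round suffices
```

The early-stop on high confidence is what lets the framework extend the effective context length cheaply: low-confidence queries see more demonstration examples across rounds, while easy queries pay for only one batch.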