VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding
December 3, 2024
Authors: Kangsan Kim, Geon Park, Youngwan Lee, Woongyeong Yeo, Sung Ju Hwang
cs.AI
Abstract
Recent advancements in video large multimodal models (LMMs) have
significantly improved their video understanding and reasoning capabilities.
However, their performance drops on out-of-distribution (OOD) tasks that are
underrepresented in training data. Traditional methods like fine-tuning on OOD
datasets are impractical due to high computational costs. While In-context
learning (ICL) with demonstration examples has shown promising generalization
performance in language tasks and image-language tasks without fine-tuning,
applying ICL to video-language tasks faces challenges due to the limited
context length in Video LMMs, as videos require many more tokens. To
address these issues, we propose VideoICL, a novel video in-context learning
framework for OOD tasks that introduces a similarity-based relevant example
selection strategy and a confidence-based iterative inference approach. This
enables selecting the most relevant examples and ranking them by similarity
for use in inference. If the generated response has low confidence, our
framework selects new examples and performs inference again, iteratively
refining the results until a high-confidence response is obtained. This
approach improves OOD video understanding performance by extending effective
context length without incurring high costs. The experimental results on
multiple benchmarks demonstrate significant performance gains, especially in
domain-specific scenarios, laying the groundwork for broader video
comprehension applications. Code will be released at
https://github.com/KangsanKim07/VideoICL
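The two-stage procedure the abstract describes — rank candidate examples by similarity to the query, then re-run inference with the next batch of examples whenever the response confidence is low — can be sketched as below. Every name here (`rank_by_similarity`, `iterative_inference`, the `infer` callback, the word-overlap similarity, and the 0.8 confidence threshold) is an illustrative assumption, not the authors' actual implementation.

```python
# Hedged sketch of VideoICL's confidence-based iterative inference loop.
# The similarity measure and confidence source are toy stand-ins; in the
# paper these would come from video/text embeddings and the Video LMM.

def rank_by_similarity(query, pool):
    """Rank candidate examples by similarity to the query (toy: shared words)."""
    def score(example):
        return len(set(query.split()) & set(example["text"].split()))
    return sorted(pool, key=score, reverse=True)

def iterative_inference(query, pool, infer, k=2, conf_threshold=0.8, max_rounds=3):
    """Re-run inference with the next batch of top-ranked examples until a
    high-confidence response is obtained (or the round budget runs out)."""
    ranked = rank_by_similarity(query, pool)
    best = None
    for round_idx in range(max_rounds):
        batch = ranked[round_idx * k:(round_idx + 1) * k]
        if not batch:
            break  # ranked pool exhausted
        answer, confidence = infer(query, batch)
        if best is None or confidence > best[1]:
            best = (answer, confidence)
        if confidence >= conf_threshold:
            break  # high-confidence response: stop iterating
    return best

# Toy demonstration with a two-example pool and a stub model.
pool = [
    {"text": "a person cooking pasta", "label": "cooking"},
    {"text": "a dog running in a park", "label": "pets"},
]

def toy_infer(query, batch):
    """Stub for the Video LMM: confident only when a relevant example is in context."""
    if any("cooking" in ex["text"] for ex in batch):
        return ("cooking", 0.9)
    return ("unknown", 0.5)

answer, confidence = iterative_inference("someone cooking dinner", pool, toy_infer, k=1)
# → ("cooking", 0.9): the relevant example ranks first, so one round suffices
```

The early-stop on high confidence is what lets the framework extend the effective context length cheaply: low-confidence queries see more demonstration examples across rounds, while easy queries pay for only one batch.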