AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
October 4, 2024
Authors: Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jeng-Neng Hwang, Saining Xie, Christopher D. Manning
cs.AI
Abstract
Video detailed captioning is a key task which aims to generate comprehensive
and coherent textual descriptions of video content, benefiting both video
understanding and generation. In this paper, we propose AuroraCap, a video
captioner based on a large multimodal model. We follow the simplest
architecture design without additional parameters for temporal modeling. To
address the overhead caused by lengthy video sequences, we implement the token
merging strategy, reducing the number of input visual tokens. Surprisingly, we
found that this strategy results in little performance loss. AuroraCap shows
superior performance on various video and image captioning benchmarks, for
example, obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and
Gemini-1.5 Pro (82.2). However, existing video caption benchmarks only include
simple descriptions, consisting of a few dozen words, which limits research in
this field. Therefore, we develop VDC, a video detailed captioning benchmark
with over one thousand carefully annotated structured captions. In addition, we
propose a new LLM-assisted metric, VDCscore, for better evaluation, which
adopts a divide-and-conquer strategy to transform long caption evaluation into
multiple short question-answer pairs. With the help of human Elo ranking, our
experiments show that this benchmark better correlates with human judgments of
video detailed captioning quality.
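
The abstract says token merging reduces the number of input visual tokens but does not describe the algorithm. As an illustration only, the sketch below assumes a single round of bipartite matching over token embeddings: tokens are split alternately into two sets, each token in the first set is paired with its most similar token in the second, and the r most similar pairs are averaged together. The names `cosine`, `merge_tokens`, and `r` are hypothetical and are not taken from the paper, and the plain mean is a simplification of weighted merging schemes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_tokens(tokens, r):
    """Reduce the token count by up to r with one round of bipartite matching.

    Assumption: tokens is a list of equal-length float vectors. This is an
    illustrative sketch, not the paper's implementation.
    """
    # Split tokens alternately into set A (even positions) and set B (odd).
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    if r <= 0 or not b_idx:
        return [list(t) for t in tokens]

    # For every A token, find its most similar B token.
    edges = []
    for i in a_idx:
        best_j = max(b_idx, key=lambda j: cosine(tokens[i], tokens[j]))
        edges.append((cosine(tokens[i], tokens[best_j]), i, best_j))

    # Keep only the r most similar pairs; those A tokens get absorbed.
    edges.sort(key=lambda e: e[0], reverse=True)
    groups = {j: [tokens[j]] for j in b_idx}
    absorbed = set()
    for _, i, j in edges[:r]:
        groups[j].append(tokens[i])
        absorbed.add(i)

    # Rebuild the sequence in original order, averaging each merged group.
    merged = []
    for k, tok in enumerate(tokens):
        if k in absorbed:
            continue
        members = groups.get(k, [tok])
        dim = len(tok)
        merged.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return merged
```

For example, calling `merge_tokens` on four tokens with `r=2` returns two tokens, each the mean of one matched pair. Unmatched tokens pass through unchanged, so the original order of the surviving tokens is preserved.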