OmniCaptioner: One Captioner to Rule Them All
April 9, 2025
Authors: Yiting Lu, Jiakang Yuan, Zhen Li, Shitian Zhao, Qi Qin, Xinyue Li, Le Zhuo, Licheng Wen, Dongyang Liu, Yuewen Cao, Xiangchao Yan, Xin Li, Botian Shi, Tao Chen, Zhibo Chen, Lei Bai, Bo Zhang, Peng Gao
cs.AI
Abstract
We propose OmniCaptioner, a versatile visual captioning framework for
generating fine-grained textual descriptions across a wide variety of visual
domains. Unlike prior methods limited to specific image types (e.g., natural
images or geometric visuals), our framework provides a unified solution for
captioning natural images, visual text (e.g., posters, UIs, textbooks), and
structured visuals (e.g., documents, tables, charts). By converting low-level
pixel information into semantically rich textual representations, our framework
bridges the gap between visual and textual modalities. Our results highlight
three key advantages: (i) Enhanced Visual Reasoning with LLMs, where
long-context captions of visual modalities empower LLMs, particularly the
DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii)
Improved Image Generation, where detailed captions improve tasks like
text-to-image generation and image transformation; and (iii) Efficient
Supervised Fine-Tuning (SFT), which enables faster convergence with less data.
We believe the versatility and adaptability of OmniCaptioner can offer a new
perspective for bridging the gap between language and visual modalities.
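Advantage (i) describes a decoupled pipeline: the captioner converts an image into a long-context textual description, and a text-only LLM (e.g. the DeepSeek-R1 series) then reasons over that description instead of raw pixels. A minimal sketch of the prompt-composition step, with an illustrative function name and example caption (not from the paper):

```python
# Hypothetical sketch of the caption-then-reason pipeline: a visual
# captioner produces a detailed description, and a text-only LLM answers
# questions about the image via that description alone.

def build_reasoning_prompt(caption: str, question: str) -> str:
    """Compose a text-only prompt that substitutes the caption for the image."""
    return (
        "You are given a detailed description of an image.\n"
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        "Reason step by step, then state the final answer."
    )

# Example caption such as OmniCaptioner might emit for a structured visual.
caption = ("A bar chart titled 'Quarterly Revenue' with four bars; "
           "the Q4 bar is the tallest, labeled 12M USD.")
prompt = build_reasoning_prompt(caption, "Which quarter had the highest revenue?")
```

The `prompt` string would then be sent to any text-only reasoning LLM; the design choice is that no vision encoder is needed at inference time on the reasoning side.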