OmniCaptioner: One Captioner to Rule Them All

Abstract

We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) enhanced visual reasoning with LLMs, where long-context captions of visual inputs enable LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) improved image generation, where detailed captions benefit tasks such as text-to-image generation and image transformation; and (iii) efficient supervised fine-tuning (SFT), which achieves faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.
