Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

October 3, 2024
Authors: Zhengfeng Lai, Vasileios Saveris, Chen Chen, Hong-You Chen, Haotian Zhang, Bowen Zhang, Juan Lao Tebar, Wenze Hu, Zhe Gan, Peter Grasch, Meng Cao, Yinfei Yang
cs.AI

Abstract

Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. By examining Short Synthetic Captions (SSC) towards Dense Synthetic Captions (DSC+) as case studies, we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.
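One way the "hybrid approach that keeps both synthetic captions and AltTexts" described above could be realized is by sampling a caption source per image when building pre-training batches. The sketch below is a minimal illustration of that idea, assuming a per-sample random choice between the original AltText and a synthetic caption; the field names (`alt_text`, `ssc`, `dsc_plus`) and the 50/50 mixing ratio are illustrative assumptions, not details reported in the paper.

```python
import random

def pick_caption(sample, synthetic_prob=0.5, synthetic_key="dsc_plus"):
    """Return either the web-crawled AltText or a synthetic caption for one sample.

    sample: dict with an "alt_text" field and one or more synthetic caption
            fields, e.g. "ssc" (short synthetic) or "dsc_plus" (dense synthetic).
    synthetic_prob: probability of using the synthetic caption (assumed ratio).
    """
    synthetic = sample.get(synthetic_key)
    if synthetic and random.random() < synthetic_prob:
        return synthetic
    return sample["alt_text"]

if __name__ == "__main__":
    # Hypothetical example record; captions are invented for illustration.
    example = {
        "alt_text": "dog photo 2021 img_334.jpg",
        "ssc": "A brown dog running on a beach.",
        "dsc_plus": "A brown short-haired dog runs along a sandy beach at sunset, "
                    "kicking up sand, with gentle waves in the background.",
    }
    random.seed(0)
    for _ in range(3):
        print(pick_caption(example))
```

Adjusting `synthetic_prob` (or choosing `ssc` versus `dsc_plus` as the synthetic source) is one simple way to probe the per-model caption-format preferences the abstract refers to.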
