Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Abstract

Recent advances in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. Using caption formats ranging from Short Synthetic Captions (SSC) to Dense Synthetic Captions (DSC+) as case studies, we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.
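The abstract does not say how the hybrid of synthetic captions and AltTexts is assembled during training. As a rough illustration only, one plausible reading is per-sample random mixing: each time an image is drawn, its AltText is used with some probability and a synthetic caption of a chosen format is used otherwise. The sketch below assumes that reading; the names `Sample`, `pick_caption`, and `alt_text_ratio`, as well as the fallback to AltText when a format is missing, are hypothetical and not taken from the paper.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One image's captions (hypothetical container, not from the paper)."""
    alt_text: str
    # Maps a caption format name (e.g. "SSC", "DSC+") to its synthetic caption.
    synthetic: dict = field(default_factory=dict)


def pick_caption(sample, synthetic_format, alt_text_ratio, rng):
    """Return the AltText with probability `alt_text_ratio`,
    otherwise the synthetic caption in `synthetic_format`.

    Assumption: if the requested format is missing, fall back to the AltText.
    """
    if not 0.0 <= alt_text_ratio <= 1.0:
        raise ValueError("alt_text_ratio must be between 0 and 1")
    synthetic = sample.synthetic.get(synthetic_format)
    if synthetic is None or rng.random() < alt_text_ratio:
        return sample.alt_text
    return synthetic


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so runs are repeatable
    sample = Sample(
        alt_text="dog photo",
        synthetic={
            "SSC": "A dog running on a beach.",
            "DSC+": "A brown dog runs along a sandy beach at sunset, "
                    "with waves breaking in the background.",
        },
    )
    # The ratio here is an arbitrary illustrative value.
    for _ in range(5):
        print(pick_caption(sample, "DSC+", alt_text_ratio=0.5, rng=rng))
```

Under this assumed setup, the mixing ratio and the synthetic format are the two knobs one would sweep per model family (CLIP, multimodal LLM, diffusion) to probe the format preferences the abstract describes.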
