VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models

Abstract

In this paper, we introduce VARCO-VISION, an open-source Korean-English vision-language model (VLM). We adopt a step-by-step training strategy that allows the model to learn both linguistic and visual information while preserving the backbone model's knowledge. Compared to models of similar size, our model demonstrates outstanding performance in diverse settings requiring bilingual image-text understanding and generation abilities. VARCO-VISION is also capable of grounding, referring, and OCR, which expands its usage and potential applications in real-world scenarios. In addition to the model, we release five Korean evaluation datasets: four closed-set benchmarks and one open-set benchmark. We anticipate that our milestone will broaden the opportunities for AI researchers aiming to train VLMs. VARCO-VISION is available at https://huggingface.co/NCSOFT/VARCO-VISION-14B.
