POINTS1.5: Building a Vision-Language Model towards Real World Applications

Abstract

Vision-language models have made significant strides recently, demonstrating strong performance across a range of tasks such as optical character recognition and complex diagram analysis. Building on this trend, we introduce a new vision-language model, POINTS1.5, designed to excel in various real-world applications. POINTS1.5 is an enhancement of POINTS1.0 and incorporates several key innovations: i) We replace the original CLIP vision encoder, which had a fixed image resolution, with a NaViT-style vision encoder that supports native dynamic high resolution. This allows POINTS1.5 to process images of any resolution without splitting them into tiles. ii) We add bilingual support to POINTS1.5, significantly enhancing its capability in Chinese. Because open-source Chinese datasets for vision-language models are scarce, we collect numerous images from the Internet and annotate them using a combination of manual and automatic methods. iii) We propose a set of rigorous filtering methods for visual instruction tuning datasets. We comprehensively evaluate all of these filtering methods and choose the most effective ones to obtain the final visual instruction tuning set. Thanks to these innovations, POINTS1.5 significantly outperforms POINTS1.0 and demonstrates strong performance across a range of real-world applications. Notably, POINTS1.5-7B is trained on fewer than 4 billion tokens and ranks first on the OpenCompass leaderboard among models with fewer than 10 billion parameters.
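To make the contrast in point i) concrete, the following is a minimal sketch of how sequence length behaves when an image is patchified at its own resolution versus cut into fixed-size tiles. It is not the POINTS1.5 implementation: the patch size of 14, the tile size of 336, and both function names are assumptions chosen for illustration, since the abstract gives no such details.

```python
import math

PATCH = 14   # assumed patch size in pixels (hypothetical; not stated in the abstract)
TILE = 336   # assumed fixed input size of a tiling baseline (hypothetical)


def native_patch_grid(width: int, height: int, patch: int = PATCH) -> tuple[int, int]:
    """Patch grid (cols, rows) when the image is encoded at its own resolution.

    Each side is rounded up to a whole number of patches, so the sequence
    length follows the image's size and aspect ratio.
    """
    return math.ceil(width / patch), math.ceil(height / patch)


def tiled_patch_count(width: int, height: int, tile: int = TILE, patch: int = PATCH) -> int:
    """Patch count when the image is cut into fixed-size square tiles.

    Every tile is brought to tile x tile pixels, so each one costs the same
    number of patches no matter how much of it is real image content.
    """
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return tiles * (tile // patch) ** 2


cols, rows = native_patch_grid(1000, 300)
print("native:", cols * rows, "patches in a", cols, "x", rows, "grid")
print("tiled: ", tiled_patch_count(1000, 300), "patches")
```

Under these assumed sizes, a wide 1000 x 300 image yields a single 72 x 22 grid when encoded natively, while the tiling baseline spends three full tiles on it and loses the spatial continuity across tile borders.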
