POINTS1.5: Building a Vision-Language Model towards Real World Applications

December 11, 2024
Authors: Yuan Liu, Le Tian, Xiao Zhou, Xinyu Gao, Kavio Yu, Yang Yu, Jie Zhou
cs.AI

Abstract

Vision-language models have made significant strides recently, demonstrating superior performance across a range of tasks such as optical character recognition and complex diagram analysis. Building on this trend, we introduce a new vision-language model, POINTS1.5, designed to excel in various real-world applications. POINTS1.5 is an enhancement of POINTS1.0 and incorporates several key innovations: i) We replace the original CLIP vision encoder, which had a fixed image resolution, with a NaViT-style vision encoder that supports native dynamic high resolution. This allows POINTS1.5 to process images of any resolution without splitting them into tiles. ii) We add bilingual support to POINTS1.5, significantly enhancing its capability in Chinese. Because open-source Chinese datasets for vision-language models are scarce, we collect numerous images from the Internet and annotate them using a combination of manual and automatic methods. iii) We propose a set of rigorous filtering methods for visual instruction tuning datasets. We comprehensively evaluate all of these filtering methods and choose the most effective ones to obtain the final visual instruction tuning set. Thanks to these innovations, POINTS1.5 significantly outperforms POINTS1.0 and demonstrates strong performance across a range of real-world applications. Notably, POINTS1.5-7B is trained on fewer than 4 billion tokens and ranks first on the OpenCompass leaderboard among models with fewer than 10 billion parameters.
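
To make the contrast in point i) concrete, the following is a minimal sketch of how the two approaches divide an image. It is not the authors' implementation: the function names, the patch size of 14 pixels, and the tile size of 336 pixels are illustrative assumptions that do not come from the abstract.

```python
import math


def native_patch_grid(width, height, patch_size=14):
    """Patch grid (cols, rows) when the image is encoded at its own resolution.

    patch_size=14 is an assumed value for illustration only.
    """
    return math.ceil(width / patch_size), math.ceil(height / patch_size)


def tile_grid(width, height, tile_size=336):
    """Tile grid (cols, rows) when a fixed-resolution encoder needs square tiles.

    tile_size=336 is an assumed value for illustration only.
    """
    return math.ceil(width / tile_size), math.ceil(height / tile_size)


if __name__ == "__main__":
    w, h = 1000, 700

    # Native dynamic resolution: one variable-length patch sequence per image.
    cols, rows = native_patch_grid(w, h)
    print("native patches:", cols * rows)

    # Tiling: the image is padded up to a whole number of fixed-size tiles,
    # and each tile is encoded separately at the fixed resolution.
    t_cols, t_rows = tile_grid(w, h)
    patches_per_tile = (336 // 14) ** 2
    print("tiles:", t_cols * t_rows)
    print("tiled patches:", t_cols * t_rows * patches_per_tile)
```

Under these assumed sizes, the native approach yields a single sequence whose length follows the image's actual dimensions, while tiling rounds the image up to a whole number of tiles and encodes each tile independently.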
