POINTS1.5: Building a Vision-Language Model towards Real World Applications
December 11, 2024
Authors: Yuan Liu, Le Tian, Xiao Zhou, Xinyu Gao, Kavio Yu, Yang Yu, Jie Zhou
cs.AI
Abstract
Vision-language models have made significant strides recently, demonstrating
superior performance across a range of tasks, e.g. optical character
recognition and complex diagram analysis. Building on this trend, we introduce
a new vision-language model, POINTS1.5, designed to excel in various real-world
applications. POINTS1.5 is an enhancement of POINTS1.0 and incorporates several
key innovations: i) We replace the original CLIP vision encoder, which had a
fixed image resolution, with a NaViT-style vision encoder that supports native
dynamic high resolution. This allows POINTS1.5 to process images of any
resolution without needing to split them into tiles. ii) We add bilingual
support to POINTS1.5, significantly enhancing its capability in Chinese. Due to
the scarcity of open-source Chinese datasets for vision-language models, we
collect numerous images from the Internet and annotate them using a combination
of manual and automatic methods. iii) We propose a set of rigorous filtering
methods for visual instruction tuning datasets. We comprehensively evaluate all
these filtering methods, and choose the most effective ones to obtain the final
visual instruction tuning set. Thanks to these innovations, POINTS1.5
significantly outperforms POINTS1.0 and demonstrates strong performance across
a range of real-world applications. Notably, POINTS1.5-7B is trained on fewer
than 4 billion tokens and ranks first on the OpenCompass leaderboard among
models with fewer than 10 billion parameters.
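The key idea behind the NaViT-style encoder — treating an image at its native resolution as a variable-length sequence of patch tokens, rather than resizing it to a fixed square or cutting it into tiles — can be illustrated with a minimal sketch. The patch size and the `patchify_native` helper below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def patchify_native(image: np.ndarray, patch: int = 14) -> np.ndarray:
    """Turn an image of (almost) any resolution into a variable-length
    sequence of flat patch tokens, NaViT-style: the image keeps its
    native aspect ratio and is only cropped down to a multiple of the
    patch size, instead of being split into fixed-resolution tiles."""
    h, w, c = image.shape
    h, w = (h // patch) * patch, (w // patch) * patch  # snap to patch grid
    grid = image[:h, :w].reshape(h // patch, patch, w // patch, patch, c)
    # Reorder to (rows, cols, patch, patch, c), then flatten each patch
    # into one token of length patch * patch * c.
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

# Images of different resolutions produce different sequence lengths;
# a packed-attention implementation can still batch them together.
tokens_a = patchify_native(np.zeros((336, 448, 3)))  # 24 x 32 = 768 patches
tokens_b = patchify_native(np.zeros((140, 980, 3)))  # 10 x 70 = 700 patches
```

By contrast, a fixed-resolution CLIP encoder would force both inputs to the same square size (distorting aspect ratio) or require a tiling scheme that breaks long structures such as text lines across tile boundaries.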