BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature
January 13, 2025
Authors: Alejandro Lozano, Min Woo Sun, James Burgess, Liangyu Chen, Jeffrey J Nirschl, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Austin Wolfgang Katzer, Collin Chiu, Anita Rau, Xiaohan Wang, Yuhui Zhang, Alfred Seunghoon Song, Robert Tibshirani, Serena Yeung-Levy
cs.AI
Abstract
The development of vision-language models (VLMs) is driven by large-scale and
diverse multimodal datasets. However, progress toward generalist biomedical
VLMs is limited by the lack of annotated, publicly accessible datasets across
biology and medicine. Existing efforts are restricted to narrow domains,
missing the full diversity of biomedical knowledge encoded in scientific
literature. To address this gap, we introduce BIOMEDICA, a scalable,
open-source framework to extract, annotate, and serialize the entirety of the
PubMed Central Open Access subset into an easy-to-use, publicly accessible
dataset. Our framework produces a comprehensive archive with over 24 million
unique image-text pairs from over 6 million articles. Metadata and
expert-guided annotations are also provided. We demonstrate the utility and
accessibility of our resource by releasing BMCA-CLIP, a suite of CLIP-style
models continuously pre-trained on the BIOMEDICA dataset via streaming,
eliminating the need to download 27 TB of data locally. On average, our models
achieve state-of-the-art performance across 40 tasks spanning pathology,
radiology, ophthalmology, dermatology, surgery, molecular biology,
parasitology, and cell biology, excelling in zero-shot classification with a
6.56% average improvement (as high as 29.8% and 17.5% in dermatology and
ophthalmology, respectively), and stronger image-text retrieval, all while
using 10x less compute. To foster reproducibility and collaboration, we release
our codebase and dataset for the broader research community.