BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature

January 13, 2025
Authors: Alejandro Lozano, Min Woo Sun, James Burgess, Liangyu Chen, Jeffrey J Nirschl, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Austin Wolfgang Katzer, Collin Chiu, Anita Rau, Xiaohan Wang, Yuhui Zhang, Alfred Seunghoon Song, Robert Tibshirani, Serena Yeung-Levy
cs.AI

Abstract

The development of vision-language models (VLMs) is driven by large-scale and diverse multimodal datasets. However, progress toward generalist biomedical VLMs is limited by the lack of annotated, publicly accessible datasets across biology and medicine. Existing efforts are restricted to narrow domains, missing the full diversity of biomedical knowledge encoded in scientific literature. To address this gap, we introduce BIOMEDICA, a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. Metadata and expert-guided annotations are also provided. We demonstrate the utility and accessibility of our resource by releasing BMCA-CLIP, a suite of CLIP-style models continuously pre-trained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally. On average, our models achieve state-of-the-art performance across 40 tasks - spanning pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology - excelling in zero-shot classification with a 6.56% average improvement (as high as 29.8% and 17.5% in dermatology and ophthalmology, respectively), and stronger image-text retrieval, all while using 10x less compute. To foster reproducibility and collaboration, we release our codebase and dataset for the broader research community.
