SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models

Abstract

Advancing AI in computational pathology requires large, high-quality, and diverse datasets, yet existing public datasets are often limited in organ diversity, class coverage, or annotation quality. To bridge this gap, we introduce SPIDER (Supervised Pathology Image-DEscription Repository), the largest publicly available patch-level dataset covering multiple organ types, including Skin, Colorectal, and Thorax, with comprehensive class coverage for each organ. SPIDER provides high-quality annotations verified by expert pathologists and includes surrounding context patches, which enhance classification performance by providing spatial context. Alongside the dataset, we present baseline models trained on SPIDER that use the Hibou-L foundation model as a feature extractor combined with an attention-based classification head. The models achieve state-of-the-art performance across multiple tissue categories and serve as strong benchmarks for future digital pathology research. Beyond patch classification, the models enable rapid identification of significant areas and quantitative tissue metrics, and they establish a foundation for multimodal approaches. Both the dataset and the trained models are publicly available to advance research, reproducibility, and AI-driven pathology development. Access them at: https://github.com/HistAI/SPIDER
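The abstract names the baseline architecture only at a high level: a feature extractor followed by an attention-based classification head that can draw on surrounding context patches. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes each patch (the central one plus its neighbours) has already been turned into an embedding vector, and it uses random toy vectors in place of real Hibou-L features. The function names, the single learned query vector, the embedding size, and the number of patches are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_pool(embeddings, query):
    """Score each patch embedding against a query vector, then return the
    attention-weighted sum of the embeddings and the weights themselves."""
    dim = len(query)
    scores = [dot(query, emb) / math.sqrt(dim) for emb in embeddings]
    weights = softmax(scores)
    pooled = [
        sum(w * emb[i] for w, emb in zip(weights, embeddings))
        for i in range(dim)
    ]
    return pooled, weights


def classify(embeddings, query, class_weights, class_bias):
    """Attention-pool the patch embeddings, then apply a linear classifier."""
    pooled, weights = attention_pool(embeddings, query)
    logits = [dot(w, pooled) + b for w, b in zip(class_weights, class_bias)]
    return softmax(logits), weights


# Toy stand-ins (assumptions): real inputs would be feature-extractor outputs,
# and the query and classifier parameters would be learned during training.
rng = random.Random(0)
dim, num_patches, num_classes = 8, 9, 3  # arbitrary sizes for the demo
embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
query = [rng.gauss(0, 1) for _ in range(dim)]
class_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_classes)]
class_bias = [0.0] * num_classes

probs, attn = classify(embeddings, query, class_weights, class_bias)
print("class probabilities:", [round(p, 3) for p in probs])
print("attention weights:", [round(a, 3) for a in attn])
```

In a sketch like this, the attention weights indicate how much each patch contributed to the prediction, which is one plausible way context patches could supply spatial information to the classifier.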
