MedINST: Meta Dataset of Biomedical Instructions
October 17, 2024
Authors: Wenhan Han, Meng Fang, Zihan Zhang, Yu Yin, Zirui Song, Ling Chen, Mykola Pechenizkiy, Qingyu Chen
cs.AI
Abstract
The integration of large language model (LLM) techniques in the field of
medical analysis has brought about significant advancements, yet the scarcity
of large, diverse, and well-annotated datasets remains a major challenge.
Medical data and tasks, which vary in format, size, and other parameters,
require extensive preprocessing and standardization for effective use in
training LLMs. To address these challenges, we introduce MedINST, the Meta
Dataset of Biomedical Instructions, a novel multi-domain, multi-task
instructional meta-dataset. MedINST comprises 133 biomedical NLP tasks and over
7 million training samples, making it the most comprehensive biomedical
instruction dataset to date. Using MedINST as the meta dataset, we curate
MedINST32, a challenging benchmark with different task difficulties aiming to
evaluate LLMs' generalization ability. We fine-tune several LLMs on MedINST and
evaluate on MedINST32, showcasing enhanced cross-task generalization.