Towards Data-Efficient Pretraining for Atomic Property Prediction
February 16, 2025
Authors: Yasir Ghunaim, Hasan Abed Al Kader Hammoud, Bernard Ghanem
cs.AI
Abstract
This paper challenges the recent paradigm in atomic property prediction that
links progress to growing dataset sizes and computational resources. We show
that pretraining on a carefully selected, task-relevant dataset can match or
even surpass large-scale pretraining, while using as little as 1/24th of the
computational cost. We introduce the Chemical Similarity Index (CSI), a novel
metric for molecular graphs, inspired by computer vision's Fréchet Inception
Distance, that quantifies the alignment between upstream pretraining datasets
and downstream tasks. By selecting the upstream dataset with the smallest CSI
distance, we show that models pretrained on a smaller, focused dataset
consistently outperform those pretrained on massive, mixed datasets such as
JMP, even when those larger datasets include the relevant dataset.
Counterintuitively, we also find that indiscriminately adding more data can
degrade model performance when the additional data poorly aligns with the task
at hand. Our findings highlight that quality often outperforms quantity in
pretraining for atomic property prediction.
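The abstract describes CSI as a Fréchet-Inception-Distance-style metric over molecular-graph embeddings. As a rough illustration of that idea only (not the authors' CSI implementation; the embedding source, pooling, and any graph-specific details below are assumptions), the following sketch computes the Fréchet distance between Gaussians fit to two sets of embeddings:

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two embedding sets.

    Each input is an (n_samples, dim) array of molecular-graph embeddings,
    e.g. pooled features from a pretrained backbone (a hypothetical choice;
    the abstract does not specify the feature extractor).
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)

    diff = mu_a - mu_b
    # Matrix square root of the covariance product; disp=False makes sqrtm
    # return (result, error_estimate) instead of printing a warning.
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop negligible imaginary parts from numerics

    # ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

Under this reading, one would embed samples from each candidate upstream dataset and from the downstream task, then select the upstream dataset with the smallest distance, matching the selection rule the abstract describes, e.g. `min(upstream, key=lambda name: frechet_distance(task_emb, upstream[name]))`.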