Understanding and Mitigating Distribution Shifts For Machine Learning Force Fields
March 11, 2025
Authors: Tobias Kreiman, Aditi S. Krishnapriyan
cs.AI
Abstract
Machine Learning Force Fields (MLFFs) are a promising alternative to
expensive ab initio quantum mechanical molecular simulations. Given the
diversity of chemical spaces that are of interest and the cost of generating
new data, it is important to understand how MLFFs generalize beyond their
training distributions. In order to characterize and better understand
distribution shifts in MLFFs, we conduct diagnostic experiments on chemical
datasets, revealing common shifts that pose significant challenges, even for
large foundation models trained on extensive data. Based on these observations,
we hypothesize that current supervised training methods inadequately regularize
MLFFs, resulting in overfitting and learning poor representations of
out-of-distribution systems. We then propose two new methods as initial steps
for mitigating distribution shifts for MLFFs. Our methods focus on test-time
refinement strategies that incur minimal computational cost and do not use
expensive ab initio reference labels. The first strategy, based on spectral
graph theory, modifies the edges of test graphs to align with graph structures
seen during training. Our second strategy improves representations for
out-of-distribution systems at test-time by taking gradient steps using an
auxiliary objective, such as a cheap physical prior. Our test-time refinement
strategies significantly reduce errors on out-of-distribution systems,
suggesting that MLFFs are capable of and can move towards modeling diverse
chemical spaces, but are not being effectively trained to do so. Our
experiments establish clear benchmarks for evaluating the generalization
capabilities of the next generation of MLFFs. Our code is available at
https://tkreiman.github.io/projects/mlff_distribution_shifts/.
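The second strategy described above takes gradient steps at test time on an auxiliary objective, such as a cheap physical prior, without ab initio labels. The following is a minimal numpy sketch of that idea only: the linear "force field", the harmonic prior, and all names here are illustrative assumptions, not the paper's actual models or training procedure.

```python
import numpy as np

def prior_forces(positions, k=1.0):
    """Cheap physical prior (illustrative): harmonic attraction of each
    atom toward the centroid, standing in for a classical force field."""
    return -k * (positions - positions.mean(axis=0))

class LinearForceModel:
    """Toy stand-in for an MLFF: forces are a learned linear map of
    displacements from the centroid."""
    def __init__(self, dim=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.5, size=(dim, dim))

    def forces(self, positions):
        disp = positions - positions.mean(axis=0)
        return disp @ self.W.T

    def refine(self, positions, lr=0.05, steps=50):
        """Test-time refinement: gradient steps on an auxiliary objective
        (matching the cheap prior's forces on the test system), using no
        expensive reference labels."""
        target = prior_forces(positions)
        disp = positions - positions.mean(axis=0)
        n = len(positions)
        for _ in range(steps):
            pred = disp @ self.W.T
            # Gradient of the squared-error objective with respect to W.
            grad = 2.0 * (pred - target).T @ disp / n
            self.W -= lr * grad

# A random configuration plays the role of an out-of-distribution system.
rng = np.random.default_rng(1)
test_positions = rng.normal(size=(8, 3))

model = LinearForceModel()
err_before = np.mean((model.forces(test_positions)
                      - prior_forces(test_positions)) ** 2)
model.refine(test_positions)
err_after = np.mean((model.forces(test_positions)
                     - prior_forces(test_positions)) ** 2)
# After refinement, the model's forces track the auxiliary prior more
# closely on this system (err_after < err_before).
```

The design point the sketch illustrates is that the refinement signal is self-supervised: the only target comes from a prior that is cheap to evaluate on any test system, so the procedure adds minimal cost at inference time.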