Understanding and Mitigating Distribution Shifts For Machine Learning Force Fields
March 11, 2025
Authors: Tobias Kreiman, Aditi S. Krishnapriyan
cs.AI
Abstract
Machine Learning Force Fields (MLFFs) are a promising alternative to
expensive ab initio quantum mechanical molecular simulations. Given the
diversity of chemical spaces that are of interest and the cost of generating
new data, it is important to understand how MLFFs generalize beyond their
training distributions. In order to characterize and better understand
distribution shifts in MLFFs, we conduct diagnostic experiments on chemical
datasets, revealing common shifts that pose significant challenges, even for
large foundation models trained on extensive data. Based on these observations,
we hypothesize that current supervised training methods inadequately regularize
MLFFs, resulting in overfitting and learning poor representations of
out-of-distribution systems. We then propose two new methods as initial steps
for mitigating distribution shifts for MLFFs. Our methods focus on test-time
refinement strategies that incur minimal computational cost and do not use
expensive ab initio reference labels. The first strategy, based on spectral
graph theory, modifies the edges of test graphs to align with graph structures
seen during training. Our second strategy improves representations for
out-of-distribution systems at test-time by taking gradient steps using an
auxiliary objective, such as a cheap physical prior. Our test-time refinement
strategies significantly reduce errors on out-of-distribution systems,
suggesting that MLFFs are capable of and can move towards modeling diverse
chemical spaces, but are not being effectively trained to do so. Our
experiments establish clear benchmarks for evaluating the generalization
capabilities of the next generation of MLFFs. Our code is available at
https://tkreiman.github.io/projects/mlff_distribution_shifts/.
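The second strategy described above takes gradient steps at test time on an auxiliary objective, such as a cheap physical prior, without ab initio labels. The following is a minimal numpy sketch of that idea only: the linear "force field", the harmonic prior, and all names here are illustrative assumptions, not the paper's actual models or training procedure.

```python
import numpy as np

def prior_forces(positions, k=1.0):
    """Cheap physical prior (illustrative): harmonic attraction of each
    atom toward the centroid, standing in for a classical force field."""
    return -k * (positions - positions.mean(axis=0))

class LinearForceModel:
    """Toy stand-in for an MLFF: forces are a learned linear map of
    displacements from the centroid."""
    def __init__(self, dim=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.5, size=(dim, dim))

    def forces(self, positions):
        disp = positions - positions.mean(axis=0)
        return disp @ self.W.T

    def refine(self, positions, lr=0.05, steps=50):
        """Test-time refinement: gradient steps on an auxiliary objective
        (matching the cheap prior's forces on the test system), using no
        expensive reference labels."""
        target = prior_forces(positions)
        disp = positions - positions.mean(axis=0)
        n = len(positions)
        for _ in range(steps):
            pred = disp @ self.W.T
            # Gradient of the squared-error objective with respect to W.
            grad = 2.0 * (pred - target).T @ disp / n
            self.W -= lr * grad

# A random configuration plays the role of an out-of-distribution system.
rng = np.random.default_rng(1)
test_positions = rng.normal(size=(8, 3))

model = LinearForceModel()
err_before = np.mean((model.forces(test_positions)
                      - prior_forces(test_positions)) ** 2)
model.refine(test_positions)
err_after = np.mean((model.forces(test_positions)
                     - prior_forces(test_positions)) ** 2)
# After refinement, the model's forces track the auxiliary prior more
# closely on this system (err_after < err_before).
```

The design point the sketch illustrates is that the refinement signal is self-supervised: the only target comes from a prior that is cheap to evaluate on any test system, so the procedure adds minimal cost at inference time.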