MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts

November 22, 2024
Authors: Jiatong Li, Yunqing Liu, Wei Liu, Jingdi Le, Di Zhang, Wenqi Fan, Dongzhan Zhou, Yuqiang Li, Qing Li
cs.AI

Abstract

Molecule discovery is a pivotal research field, impacting everything from the medicines we take to the materials we use. Recently, Large Language Models (LLMs) have been widely adopted for molecule understanding and generation, yet the alignment between molecules and their corresponding captions remains a significant challenge. Previous endeavours often treat the molecule as a general SMILES string or molecular graph, neglecting the fine-grained alignments between molecular sub-structures and descriptive textual phrases, which are crucial for accurate and explainable predictions. To address this, we introduce MolReFlect, a novel teacher-student framework designed to contextually perform molecule-caption alignment in a fine-grained way. Our approach initially leverages a larger teacher LLM to label the detailed alignments by directly extracting critical phrases from molecule captions or SMILES strings and mapping them to the corresponding sub-structures or characteristics. To refine these alignments, we propose In-Context Selective Reflection, which retrieves previous extraction results as in-context examples for the teacher LLM to reflect on, and lets a smaller student LLM select between the in-context reflection and the previous extraction results. Finally, we enhance the learning process of the student LLM through Chain-of-Thought In-Context Molecule Tuning, integrating the fine-grained alignments and the reasoning processes within the Chain-of-Thought format. Our experimental results demonstrate that MolReFlect enables LLMs like Mistral-7B to significantly outperform previous baselines, achieving state-of-the-art performance on the ChEBI-20 dataset. This advancement not only enhances the generative capabilities of LLMs in the molecule-caption translation task, but also contributes to a more explainable framework.
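
The abstract describes a three-stage pipeline: teacher extraction, in-context selective reflection, and Chain-of-Thought tuning data construction. The sketch below is a rough Python illustration of those stages only; `teacher_generate` and `student_perplexity` are hypothetical stand-ins for actual LLM calls, and the prompt wording and the perplexity-based selection rule are assumptions, not the paper's exact recipe.

```python
# Minimal sketch of the MolReFlect pipeline stages described in the abstract.
# All function hooks and prompts here are hypothetical placeholders.
from typing import Callable, List, Tuple

def extract_alignments(teacher_generate: Callable[[str], str],
                       smiles: str, caption: str) -> str:
    """Stage 1: the teacher LLM labels fine-grained alignments by extracting
    key phrases from the caption/SMILES and tying each to a sub-structure
    or characteristic."""
    prompt = (f"SMILES: {smiles}\nCaption: {caption}\n"
              "Extract the critical phrases and map each one to the "
              "molecular sub-structure or characteristic it describes.")
    return teacher_generate(prompt)

def selective_reflection(teacher_generate: Callable[[str], str],
                         student_perplexity: Callable[[str, str], float],
                         smiles: str, caption: str, first_pass: str,
                         neighbours: List[Tuple[str, str, str]]) -> str:
    """Stage 2: In-Context Selective Reflection. The teacher re-derives the
    alignments with similar previous extractions as in-context examples;
    the student then selects between the two candidates (approximated here
    by lower perplexity, an assumption)."""
    examples = "\n\n".join(
        f"SMILES: {s}\nCaption: {c}\nAlignments: {a}" for s, c, a in neighbours)
    reflected = teacher_generate(
        f"{examples}\n\nSMILES: {smiles}\nCaption: {caption}\n"
        "Reflect on the examples above and extract the alignments.")
    context = f"SMILES: {smiles}\nCaption: {caption}"
    return min((first_pass, reflected),
               key=lambda alignments: student_perplexity(context, alignments))

def build_cot_example(smiles: str, caption: str, alignments: str) -> dict:
    """Stage 3: Chain-of-Thought In-Context Molecule Tuning. Each training
    example folds the fine-grained alignments into a CoT-style rationale
    that precedes the target caption."""
    return {
        "prompt": f"Molecule SMILES: {smiles}\nDescribe this molecule.",
        "response": (f"Let's think step by step.\n"
                     f"Fine-grained alignments: {alignments}\n"
                     f"Therefore, the caption is: {caption}"),
    }
```

The same scaffolding works in the reverse direction (caption-to-molecule generation) by swapping the roles of the SMILES string and the caption in the prompts.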
