MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts
November 22, 2024
Authors: Jiatong Li, Yunqing Liu, Wei Liu, Jingdi Le, Di Zhang, Wenqi Fan, Dongzhan Zhou, Yuqiang Li, Qing Li
cs.AI
Abstract
Molecule discovery is a pivotal research field, impacting everything from the medicines we take to the materials we use. Recently, Large Language Models (LLMs) have been widely adopted for molecule understanding and generation, yet the alignment between molecules and their corresponding captions remains a significant challenge. Previous endeavours often treat a molecule as a general SMILES string or molecular graph, neglecting the fine-grained alignments between molecular sub-structures and descriptive textual phrases, which are crucial for accurate and explainable predictions. To address this, we introduce MolReFlect, a novel teacher-student framework designed to contextually perform molecule-caption alignment in a fine-grained way. Our approach first leverages a larger teacher LLM to label the detailed alignments by directly extracting critical phrases from molecule captions or SMILES strings and mapping them to the corresponding sub-structures or characteristics. To refine these alignments, we propose In-Context Selective Reflection, which retrieves previous extraction results as in-context examples for the teacher LLM to reflect on, and lets a smaller student LLM select between the in-context reflections and the previous extraction results. Finally, we enhance the student LLM's learning process through Chain-of-Thought In-Context Molecule Tuning, integrating the fine-grained alignments and the reasoning processes within the Chain-of-Thought format. Experimental results demonstrate that MolReFlect enables LLMs like Mistral-7B to significantly outperform previous baselines, achieving SOTA performance on the ChEBI-20 dataset. This advancement not only enhances the generative capabilities of LLMs in the molecule-caption translation task but also contributes to a more explainable framework.
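The abstract describes a three-stage pipeline: zero-shot alignment extraction by the teacher LLM, In-Context Selective Reflection, and Chain-of-Thought In-Context Molecule Tuning. The sketch below illustrates how these stages might compose; the helper names (`call_llm`, `perplexity`, `retrieve_similar`), the prompt wording, and the perplexity-based selection criterion are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the MolReFlect pipeline described in the abstract.
# call_llm(), perplexity(), and retrieve_similar() are stubs to be wired to
# a real LLM backend; prompts and selection criteria here are assumptions.

from typing import List, Tuple


def call_llm(model: str, prompt: str) -> str:
    """Stub for a chat-completion call to a teacher or student LLM."""
    raise NotImplementedError("Wire this to your LLM backend.")


def perplexity(model: str, text: str) -> float:
    """Stub: student-LLM perplexity, one plausible selection signal."""
    raise NotImplementedError


def retrieve_similar(smiles: str, pool: List[Tuple[str, str]],
                     k: int = 2) -> List[Tuple[str, str]]:
    """Stub: retrieve k previous (molecule, alignment) extraction results
    to serve as in-context examples (e.g. by SMILES similarity)."""
    return pool[:k]


# Stage 1: the larger teacher LLM labels fine-grained alignments zero-shot,
# extracting key phrases from the caption/SMILES and mapping them to
# molecular sub-structures or characteristics.
def extract_alignments(teacher: str, smiles: str, caption: str) -> str:
    prompt = (f"Extract key phrases from the caption and map each one to "
              f"the corresponding sub-structure of the molecule.\n"
              f"SMILES: {smiles}\nCaption: {caption}")
    return call_llm(teacher, prompt)


# Stage 2: In-Context Selective Reflection. The teacher reflects on its own
# extraction with retrieved prior results as context; the smaller student
# then selects between the original extraction and the reflection.
def selective_reflection(teacher: str, student: str, smiles: str,
                         extraction: str,
                         pool: List[Tuple[str, str]]) -> str:
    examples = retrieve_similar(smiles, pool)
    ctx = "\n".join(f"SMILES: {s}\nAlignments: {a}" for s, a in examples)
    reflection = call_llm(
        teacher,
        f"{ctx}\n\nRefine these alignments.\n"
        f"SMILES: {smiles}\nAlignments: {extraction}")
    candidates = [extraction, reflection]
    return min(candidates, key=lambda a: perplexity(student, f"{smiles}\n{a}"))


# Stage 3: Chain-of-Thought In-Context Molecule Tuning. The selected
# alignments become the intermediate reasoning of a CoT-formatted training
# sample for fine-tuning the student LLM (e.g. Mistral-7B).
def build_cot_sample(smiles: str, caption: str, alignments: str) -> dict:
    return {
        "input": f"Describe the molecule: {smiles}",
        "output": (f"Let's think step by step.\n{alignments}\n"
                   f"Therefore: {caption}"),
    }
```

Under these assumptions, the student never trains on raw molecule-caption pairs alone: every sample carries the teacher-derived, reflection-refined alignments as explicit reasoning, which is what the abstract credits for both the performance gain and the improved explainability.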