Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.

Abstract
AI is increasingly adopted in drug discovery.
Existing works mainly apply machine learning to chemical structures and ignore textual knowledge.
The paper presents MoleculeSTM, a multi-modal molecule structure-text model.
It constructs PubChemSTM, the largest multi-modal dataset to date.
It designs two challenging zero-shot tasks based on text instructions.
MoleculeSTM gains open vocabulary and compositionality via natural language.
It obtains state-of-the-art generalization ability to novel biochemical concepts.

Paper Content

Introduction
Recent progress in AI promises to be transformative for drug discovery.
AI methods have been used to augment and accelerate current computational pipelines.
ML methods mainly focus on modeling the chemical structure of molecules.
The supervised setting requires expensive annotations.
Unsupervised pretraining on large-scale databases has been proposed.
Existing molecule pretraining methods incorporate only chemical structures.
Textual data is being harnessed in large-scale multi-modal models.
Pretrained multi-modal models can generalize well to new categories and tasks.
Previous work attempted to leverage textual knowledge to learn molecule representations.
The proposed MoleculeSTM incorporates both molecular structural information and textual knowledge.
MoleculeSTM can be generalized to diverse downstream tasks in a zero-shot manner.
MoleculeSTM has two main attributes: open vocabulary and compositionality.

Results

Overview and preliminaries
MoleculeSTM consists of two branches: chemical structure and textual description.
Pretraining uses contrastive learning to reduce the representation distance between the structure and text of the same molecule and to increase it between pairs from different molecules.
Downstream tasks include zero-shot structure-text retrieval, zero-shot text-based molecule editing, and molecular property prediction.
Pretrained models are used for retrieval in the zero-shot setting.
Molecular property prediction uses the pretrained encoder and adds a prediction head.

Two principles for downstream task design
Open vocabulary: the language model can support exploration of novel biochemical concepts with an unbounded vocabulary.
Compositionality: the language model can transform the molecule property compositionality problem into a language compositionality problem.
The language model can be used for drug re-purposing and text-based lead optimization.

Downstream: zero-shot structure-text retrieval
The retrieval task can be seen as a multiple-choice problem.
Pretrained encoders and projectors from MoleculeSTM are used and remain frozen in the task.
An example of setting (1) is given.

Downstream: zero-shot text-based molecule editing
MoleculeSTM and a pretrained molecule generative model are frozen.
The editing pipeline is split into two phases: space alignment and latent optimization.
Space alignment phase: learn an adaptor module to align the two latent spaces.
Latent optimization phase: optimize a latent code to be close to the representations of the input molecule and the text prompt.
The evaluation metric is the satisfactory hit ratio, which is task-specific.

Downstream: molecular property prediction
MoleculeSTM is a pretrained chemical structure representation that shares information with external domain knowledge.
MoleculeNet is a benchmark used to evaluate the expressiveness of pretrained molecule representation methods.
The evaluation metric is ROC-AUC.
Baselines include randomly initialized models, MegaMolBART, KV-PLM, AttrMasking, ContextPred, InfoGraph, MolCLR, and GraphMVP.
MoleculeSTM performs best on average across all eight tasks.

Discussion
The paper presents a multi-modal model, MoleculeSTM, to illustrate the effectiveness of incorporating textual descriptions for molecule representation learning.
It confirms improved performance of MoleculeSTM compared to existing methods.
MoleculeSTM can retrieve novel drug-target relations and modify molecule substructures to gain desired properties.
The outcomes of the downstream tasks are consistent with feedback from chemistry experts.

Methods
The PubChem database is used as the data source.
The PubChemSTM dataset is constructed with 250K molecules and 281K structure-text pairs.
The chemical structure branch f_c uses the SMILES string and the 2D molecular graph.
The textual description branch f_t uses a BERT model with pretrained SciBERT.
Pretraining uses a contrastive learning strategy.
Pre-processing includes PubChemSTM-raw and PubChemSTM-extracted.
Vocabulary size is an important factor.
Evaluation is computationally feasible.
Fuzzy matching is used for the molecule editing task.
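
Illustrative sketch
The notes above describe contrastive pretraining (pull the structure and text of the same molecule together, push different molecules apart) and treat zero-shot retrieval as a multiple-choice problem. The following is a minimal sketch of both ideas in plain Python, not the authors' implementation. It assumes a symmetric InfoNCE-style objective, which is one common contrastive loss; the summary does not state the exact loss used. The temperature value, the toy embeddings, and all function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(struct_embs, text_embs):
    """Cosine similarity between every structure embedding and every text embedding."""
    s = [l2_normalize(v) for v in struct_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(si, tj)) for tj in t] for si in s]


def info_nce(sim, temperature=0.1):
    """Symmetric InfoNCE-style loss; entry (i, i) is assumed to be the true pair."""
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [x / temperature for x in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[i]  # negative log-probability of the true pair
        return total / n

    columns = [list(col) for col in zip(*sim)]  # text-to-structure direction
    return 0.5 * (one_direction(sim) + one_direction(columns))


def retrieve(sim_row):
    """Zero-shot retrieval as multiple choice: pick the highest-scoring candidate."""
    return max(range(len(sim_row)), key=lambda j: sim_row[j])


# Toy embeddings (hypothetical values): structure i and text i describe the same molecule.
struct_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
text_embs = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 1.0]]

sim = similarity_matrix(struct_embs, text_embs)
aligned_loss = info_nce(sim)
# Reversing the texts breaks most pairings, so the loss should go up.
shuffled_loss = info_nce(similarity_matrix(struct_embs, text_embs[::-1]))
choices = [retrieve(row) for row in sim]
```

In this toy setup each structure picks its own description among the three candidates, and the loss is lower for the correctly paired batch than for the mispaired one. In the retrieval task described above the encoders stay frozen, so only the similarity and argmax steps would apply at evaluation time.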