Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Increasing adoption of AI in drug discovery
- Existing works mainly use machine learning to utilize chemical structures, ignoring textual knowledge
- Presenting MoleculeSTM, a multi-modal molecule structure-text model
- Constructing PubChemSTM, the largest multi-modal dataset to date
- Designing two challenging zero-shot tasks based on text instructions
- MoleculeSTM has open vocabulary and compositionality via natural language
- Obtaining state-of-the-art generalization ability to novel biochemical concepts
Paper Content
Introduction
- Recent progress in AI promises to be transformative for drug discovery
- AI methods have been used to augment and accelerate current computational pipelines
- ML methods mainly focus on modeling chemical structure of molecules
- Supervised setting requires expensive annotations
- Unsupervised pretraining on large-scale databases proposed
- Existing molecule pretraining methods incorporate only chemical structures
- Textual data is being harnessed in large-scale multi-modal models
- Pretrained multi-modal models can generalize well to new categories and tasks
- Previous work attempted to leverage textual knowledge to learn molecule representation
- Proposed MoleculeSTM incorporates both molecular structural information and textual knowledge
- MoleculeSTM can be generalized to diverse downstream tasks in a zero-shot manner
- MoleculeSTM has two main attributes: open vocabulary and compositionality
Results
Overview and preliminaries
- MoleculeSTM consists of two branches: chemical structure and textual description
- Pretraining uses contrastive learning: it pulls together the structure and text representations of the same molecule and pushes apart the representations of different molecules
- Downstream tasks include zero-shot structure-text retrieval, zero-shot text-based molecule editing, and molecular property prediction
- Pretrained models are used for retrieval in the zero-shot setting
- Molecular property prediction uses pretrained encoder and adds a prediction head
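The contrastive pretraining idea above can be sketched in a few lines. This is a minimal illustration, assuming an InfoNCE-style objective with cosine similarity and a temperature; the function names, the temperature value, and the toy embeddings are hypothetical and are not taken from the paper's code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, candidates, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to candidates[i] among all candidates."""
    anchors = [l2_normalize(a) for a in anchors]
    candidates = [l2_normalize(c) for c in candidates]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, c)) / temperature for c in candidates]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

def symmetric_contrastive_loss(structure_emb, text_emb, temperature=0.1):
    """Average the structure-to-text and text-to-structure directions."""
    return 0.5 * (
        info_nce(structure_emb, text_emb, temperature)
        + info_nce(text_emb, structure_emb, temperature)
    )

# Toy 2-D embeddings (hypothetical): row i of each list describes molecule i.
structure_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]   # text embedding close to its own structure
swapped_text = [[0.1, 0.9], [0.9, 0.1]]   # text embedding close to the other molecule

loss_aligned = symmetric_contrastive_loss(structure_emb, aligned_text)
loss_swapped = symmetric_contrastive_loss(structure_emb, swapped_text)
print(loss_aligned, loss_swapped)
```

The loss is small when each structure embedding is closest to its own text embedding and large when the pairs are swapped, which is the behavior the pretraining objective rewards.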
Two principles for downstream task design
- Open vocabulary: language model can support exploration of novel biochemical concepts with unbound vocabulary
- Compositionality: language model can transform molecule property compositionality problem into language compositionality problem
- Language model can be used for drug re-purposing and text-based lead optimization
Downstream: zero-shot structure-text retrieval
- Retrieval task can be seen as a multiple-choice problem
- Pretrained encoders and projectors from MoleculeSTM are used and remain frozen in the task
- Example of setting (1) is given
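The multiple-choice view of retrieval can be illustrated as follows. This is a sketch under stated assumptions: the embeddings are taken to be already produced by frozen encoders and projectors, candidates are ranked by cosine similarity, and the helper names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_emb, candidate_embs):
    """Return the index of the candidate most similar to the query."""
    scores = [cosine(query_emb, c) for c in candidate_embs]
    return max(range(len(scores)), key=scores.__getitem__)

def retrieval_accuracy(queries, candidate_sets, answers):
    """Fraction of multiple-choice questions where the top-ranked candidate is correct."""
    correct = sum(
        retrieve(q, cands) == ans
        for q, cands, ans in zip(queries, candidate_sets, answers)
    )
    return correct / len(queries)

# Toy example: one structure embedding as the query, four text embeddings as choices.
query = [1.0, 0.2]
choices = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0], [0.5, 0.5]]
best = retrieve(query, choices)

accuracy = retrieval_accuracy(
    queries=[query, [0.0, 1.0]],
    candidate_sets=[choices, [[1.0, 0.0], [0.1, 1.0]]],
    answers=[1, 1],
)
print(best, accuracy)
```

No parameters are updated here, which matches the zero-shot setting: the task only compares representations that already exist.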
Downstream: zero-shot text-based molecule editing
- MoleculeSTM and a pretrained molecule generative model are frozen
- Editing pipeline is split into two phases: space alignment and latent optimization
- Space alignment phase: learn an adaptor module to align the two latent spaces
- Latent optimization phase: optimize a latent code to be close to the representations of input molecule and text prompt
- Evaluation metric is the satisfactory hit ratio: the fraction of edited molecules that satisfy the text prompt, with a task-specific satisfaction criterion
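The latent optimization phase can be sketched with a toy objective. The code below is not the paper's formulation: it assumes the text representation has already been mapped into the generator's latent space by the adaptor, replaces the similarity terms with squared Euclidean distances, and uses plain gradient descent. The names `optimize_latent` and `lam` and the learning rate are hypothetical.

```python
def optimize_latent(w_input, text_target, lam=1.0, lr=0.1, steps=200):
    """Minimize ||w - text_target||^2 + lam * ||w - w_input||^2 by gradient descent.

    w_input:     latent code of the input molecule (starting point)
    text_target: text representation, assumed to be already aligned to the latent space
    lam:         weight that keeps the edited code close to the input molecule
    """
    w = list(w_input)
    for _ in range(steps):
        # Gradient of the objective with respect to each coordinate of w.
        grad = [
            2.0 * (wi - ti) + 2.0 * lam * (wi - w0)
            for wi, ti, w0 in zip(w, text_target, w_input)
        ]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

w_input = [0.0, 0.0]
text_target = [1.0, -2.0]
w_edited = optimize_latent(w_input, text_target, lam=1.0)

# For this quadratic objective the minimizer has a closed form,
# which gives a check on the optimizer.
lam = 1.0
w_expected = [(t + lam * w0) / (1.0 + lam) for t, w0 in zip(text_target, w_input)]
print(w_edited, w_expected)
```

A larger `lam` keeps the result nearer the input molecule, while a smaller one moves it further toward the text prompt. In the actual pipeline the optimized code would then be decoded by the frozen generative model.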
Downstream: molecular property prediction
- MoleculeSTM's pretrained chemical structure representation shares information with external domain knowledge, which can benefit property prediction
- MoleculeNet is a benchmark used to evaluate the expressiveness of the pretrained molecule representation methods
- Evaluation metric is ROC-AUC
- Baselines include randomly initialized models, MegaMolBART, KV-PLM, AttrMasking, ContextPred, InfoGraph, MolCLR, and GraphMVP
- MoleculeSTM performs best on average across all eight tasks
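For readers unfamiliar with the metric, ROC-AUC for a binary task equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half. The sketch below computes it by direct pairwise comparison; the labels and scores are made-up toy values, not results from the paper.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Toy predictions from a hypothetical prediction head.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise form is quadratic in the number of examples, so it suits small illustrations; library implementations typically use a sort-based computation instead.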
Discussion
- Presented a multi-modal model, MoleculeSTM, to illustrate effectiveness of incorporating textual descriptions for molecule representation learning
- Confirmed improved performance of MoleculeSTM compared to existing methods
- MoleculeSTM can retrieve novel drug-target relations and modify molecule substructures to gain desired properties
- Outcomes of downstream tasks consistent with feedback from chemistry experts
Methods
- PubChem database is used as data source
- PubChemSTM dataset is constructed with 250K molecules and 281K structure-text pairs
- Chemical structure branch f_c takes a SMILES string or a 2D molecular graph as input
- Textual description branch f_t uses a BERT model initialized from pretrained SciBERT
- Pretraining uses contrastive learning strategy
- Pre-processing yields two versions of the dataset: PubChemSTM-raw and PubChemSTM-extracted
- Vocabulary size is an important factor
- Evaluation is computationally feasible
- Fuzzy matching is used for molecule editing task