Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Increasing adoption of AI in drug discovery
Existing works mainly use machine learning to utilize chemical structures, ignoring textual knowledge
Presenting MoleculeSTM, a multi-modal molecule structure-text model
Constructing PubChemSTM, the largest multi-modal dataset to date
Designing two challenging zero-shot tasks based on text instructions
MoleculeSTM has open vocabulary and compositionality via natural language
Obtaining state-of-the-art generalization ability to novel biochemical concepts

Recent progress in AI promises to be transformative for drug discovery
AI methods have been used to augment and accelerate current computational pipelines
ML methods mainly focus on modeling chemical structure of molecules
Supervised setting requires expensive annotations
Unsupervised pretraining on large-scale databases proposed
Existing molecule pretraining methods incorporate only chemical structures
Textual data is being harnessed in large-scale multi-modal models
Pretrained multi-modal models can generalize well to new categories and tasks
Previous work attempted to leverage textual knowledge to learn molecule representation
Proposed MoleculeSTM incorporates both molecular structural information and textual knowledge
MoleculeSTM can be generalized to diverse downstream tasks in a zero-shot manner
MoleculeSTM has two main attributes: open vocabulary and compositionality

MoleculeSTM consists of two branches: chemical structure and textual description
Pretraining uses contrastive learning to reduce the representation distance between the structure and text of the same molecule and increase the distance between structure-text pairs from different molecules
Downstream tasks include zero-shot structure-text retrieval, zero-shot text-based molecule editing, and molecular property prediction
Pretrained models are used for retrieval in the zero-shot setting
Molecular property prediction uses pretrained encoder and adds a prediction head
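
The contrastive pretraining idea above can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE-style loss over a batch of already-projected structure and text representations; the function names, the temperature value, and the use of NumPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(struct_repr, text_repr, temperature=0.1):
    """Row i of struct_repr and row i of text_repr describe the same molecule.

    Matching pairs (the diagonal) are pulled together; all other pairs in the
    batch act as negatives and are pushed apart. `temperature` is an assumed
    hyperparameter.
    """
    s = l2_normalize(struct_repr)
    t = l2_normalize(text_repr)
    logits = s @ t.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(len(s))
    loss_s2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # structure -> text
    loss_t2s = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> structure
    return 0.5 * (loss_s2t + loss_t2s)
```

With correctly aligned pairs the loss is small, and it grows when the text rows are permuted so that no structure faces its own description.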

Open vocabulary: language model can support exploration of novel biochemical concepts with unbounded vocabulary
Compositionality: language model can recast the problem of composing multiple molecular properties as a language compositionality problem
Language model can be used for drug re-purposing and text-based lead optimization

Retrieval task can be seen as a multiple-choice problem
Pretrained encoders and projectors from MoleculeSTM are used and remain frozen in the task
Example of setting (1) is given
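
The multiple-choice view of retrieval can be sketched as follows. This is a minimal illustration that assumes the frozen encoders and projectors have already produced one query vector and a small set of candidate vectors in the shared space; `retrieve` and `retrieval_accuracy` are hypothetical helper names.

```python
import numpy as np

def retrieve(query_repr, candidate_reprs):
    """Pick the candidate with the highest cosine similarity to the query.

    query_repr:      (d,) representation of the query (e.g. a structure)
    candidate_reprs: (T, d) representations of the T choices (e.g. texts)
    """
    q = query_repr / np.linalg.norm(query_repr)
    c = candidate_reprs / np.linalg.norm(candidate_reprs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per candidate
    return int(np.argmax(scores)), scores

def retrieval_accuracy(queries, candidate_sets, answers):
    """Fraction of queries whose top-scoring candidate is the correct one."""
    hits = [retrieve(q, c)[0] == a
            for q, c, a in zip(queries, candidate_sets, answers)]
    return float(np.mean(hits))
```

No parameters are updated here, which is what makes the setting zero-shot: only similarity scores in the pretrained joint space decide the answer.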

MoleculeSTM and a pretrained molecule generative model are frozen
Editing pipeline is split into two phases: space alignment and latent optimization
Space alignment phase: learn an adaptor module to align the two latent spaces
Latent optimization phase: optimize a latent code to be close to the representations of input molecule and text prompt
Evaluation metric is satisfactory hit ratio, which is task-specific
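
The latent optimization phase can be sketched under simplifying assumptions. Here the adaptor learned during space alignment is stood in for by a fixed linear map, and the objective is assumed to be a cosine-similarity term toward the text representation plus an L2 penalty that keeps the latent code near that of the input molecule. The weight `lam`, the learning rate, and the function names are illustrative, not values or names from the paper.

```python
import numpy as np

def cosine_and_grad(z, y):
    """Return cos(z, y) and its gradient with respect to z."""
    nz, ny = np.linalg.norm(z), np.linalg.norm(y)
    cos = z @ y / (nz * ny)
    grad = y / (nz * ny) - cos * z / nz**2
    return cos, grad

def optimize_latent(w_in, text_repr, adaptor, lam=0.1, lr=0.01, steps=500):
    """Gradient descent on  -cos(adaptor @ w, text_repr) + lam * ||w - w_in||^2.

    w_in:      latent code of the input molecule in the generative model's space
    text_repr: text prompt representation in the joint structure-text space
    adaptor:   (assumed linear) map from the generative space to the joint space
    """
    w = w_in.copy()
    losses = []
    for _ in range(steps):
        cos, dcos_dz = cosine_and_grad(adaptor @ w, text_repr)
        losses.append(-cos + lam * np.sum((w - w_in) ** 2))
        grad = -adaptor.T @ dcos_dz + 2.0 * lam * (w - w_in)
        w -= lr * grad
    return w, losses
```

In the full pipeline the optimized code would then be decoded by the frozen generative model to obtain the edited molecule; that decoding step is outside this sketch.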

MoleculeSTM's pretrained chemical structure representation shares information with external domain knowledge, which can benefit property prediction
MoleculeNet is a benchmark used to evaluate the expressiveness of the pretrained molecule representation methods
Evaluation metric is ROC-AUC
Baselines include randomly initialized models, MegaMolBART, KV-PLM, AttrMasking, ContextPred, InfoGraph, MolCLR, and GraphMVP
MoleculeSTM performs best on average across all eight tasks
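
For reference, ROC-AUC on a single binary task can be computed directly from its pairwise-ranking definition. The sketch below is a plain NumPy illustration with a hypothetical function name, not the benchmark's evaluation code; in practice a library routine would normally be used.

```python
import numpy as np

def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of positives from negatives.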

Presented a multi-modal model, MoleculeSTM, to illustrate effectiveness of incorporating textual descriptions for molecule representation learning
Confirmed improved performance of MoleculeSTM compared to existing methods
MoleculeSTM can retrieve novel drug-target relations and modify molecule substructures to gain desired properties
Outcomes of downstream tasks consistent with feedback from chemistry experts

PubChem database is used as data source
PubChemSTM dataset is constructed with 250K molecules and 281K structure-text pairs
Chemical structure branch f_c uses SMILES string and 2D molecular graph
Textual description branch f_t uses a BERT model initialized from pretrained SciBERT
Pretraining uses contrastive learning strategy
Pre-processing includes PubChemSTM-raw and PubChemSTM-extracted
Vocabulary size is an important factor
Evaluation is computationally feasible
Fuzzy matching is used for molecule editing task