Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Increasing adoption of AI in drug discovery
  • Existing works mainly use machine learning to utilize chemical structures, ignoring textual knowledge
  • Presenting MoleculeSTM, a multi-modal molecule structure-text model
  • Constructing PubChemSTM, the largest multi-modal dataset to date
  • Designing two challenging zero-shot tasks based on text instructions
  • MoleculeSTM has open vocabulary and compositionality via natural language
  • Obtaining state-of-the-art generalization ability to novel biochemical concepts

Paper Content

Introduction

  • Recent progress in AI promises to be transformative for drug discovery
  • AI methods have been used to augment and accelerate current computational pipelines
  • ML methods mainly focus on modeling chemical structure of molecules
  • Supervised setting requires expensive annotations
  • Unsupervised pretraining on large-scale databases proposed
  • Existing molecule pretraining methods incorporate only chemical structures
  • Textual data is being harnessed in large-scale multi-modal models
  • Pretrained multi-modal models can generalize well to new categories and tasks
  • Previous work attempted to leverage textual knowledge to learn molecule representation
  • Proposed MoleculeSTM incorporates both molecular structural information and textual knowledge
  • MoleculeSTM can be generalized to diverse downstream tasks in a zero-shot manner
  • MoleculeSTM has two main attributes: open vocabulary and compositionality

Results

Overview and preliminaries

  • MoleculeSTM consists of two branches: chemical structure and textual description
  • Pretraining uses contrastive learning to reduce representation distance between same molecule pairs and increase distance between different molecule pairs
  • Downstream tasks include zero-shot structure-text retrieval, zero-shot text-based molecule editing, and molecular property prediction
  • Pretrained models are used for retrieval in the zero-shot setting
  • Molecular property prediction uses pretrained encoder and adds a prediction head

Two principles for downstream task design

  • Open vocabulary: language model can support exploration of novel biochemical concepts with unbound vocabulary
  • Compositionality: language model can transform molecule property compositionality problem into language compositionality problem
  • Language model can be used for drug re-purposing and text-based lead optimization

Downstream: zero-shot structure-text retrieval

  • Retrieval task can be seen as a multiple-choice problem
  • Pretrained encoders and projectors from MoleculeSTM are used and remain frozen in the task
  • Example of setting (1) is given

Downstream: zero-shot text-based molecule editing

  • MoleculeSTM and a pretrained molecule generative model are frozen
  • Editing pipeline is split into two phases: space alignment and latent optimization
  • Space alignment phase: learn an adaptor module to align the two latent spaces
  • Latent optimization phase: optimize a latent code to be close to the representations of input molecule and text prompt
  • Evaluation metric is satisfactory hit ratio, which is task-specific

Downstream: molecular property prediction

  • MoleculeSTM is a pretrained chemical structure representation that shares information with external domain knowledge
  • MoleculeNet is a benchmark used to evaluate the expressiveness of the pretrained molecule representation methods
  • Evaluation metric is ROC-AUC
  • Baselines include randomly initialized models, MegaMolBART, KV-PLM, AttrMasking, ContextPred, InfoGraph, MolCLR, and GraphMVP
  • MoleculeSTM performs best on average across all eight tasks

Discussion

  • Presented a multi-modal model, MoleculeSTM, to illustrate effectiveness of incorporating textual descriptions for molecule representation learning
  • Confirmed improved performance of MoleculeSTM compared to existing methods
  • MoleculeSTM can retrieve novel drug-target relations and modify molecule substructures to gain desired properties
  • Outcomes of downstream tasks consistent with feedback from chemistry experts

Methods

  • PubChem database is used as data source
  • PubChemSTM dataset is constructed with 250K molecules and 281K structure-text pairs
  • Chemical structure branch f c uses SMILES string and 2D molecular graph
  • Textual description branch f t uses BERT model and pretrained SciBERT
  • Pretraining uses contrastive learning strategy
  • Pre-processing includes PubChemSTM-raw and PubChemSTM-extracted
  • Vocabulary size is important factor
  • Evaluation is computationally feasible
  • Fuzzy matching is used for molecule editing task