Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

AlphaFold is a protein folding neural network that predicts accurate protein structures
AlphaFold’s robustness has not been explored
We measure the robustness of the predicted structures using the RMSD distance and the GDT similarity measure
Minimally perturbing protein sequences to fool protein folding neural networks is NP-complete
Adversarial protein sequences can lead to a large RMSD between the structure predicted for the original sequence and the structure predicted for the adversarial sequence

Paper Content

Introduction

Proteins are essential for life and reproduction
Proteins are chains built from 20 standard amino acid types
Understanding the 3D structure of proteins is important
High-throughput sequencing techniques have helped to determine the primary sequences of proteins
AlphaFold has achieved success in predicting protein structures using neural networks
Adversarial sequences can result in very different 3D protein structures

Summary and related work

Protein folding neural networks (PFNNs) should respect the observation that small changes in a protein sequence usually don’t lead to drastic changes in structure
Studies have shown that two proteins with 50% sequence identity align within 1Å RMSD
Exceptions exist where small perturbations can alter the entire fold of a protein
Stein and Mchaourab 2021 used in silico mutagenesis to enhance AlphaFold prediction
Del Alamo et al. 2022 present a method to manipulate inputs to obtain diverse structures
Jha et al. 2021 aimed to generate adversarial sequences to damage RosettaFold output
This paper presents results for more than 100 sequences and derives a complexity proof

Robustness metric using adversarial attacks

PFNNs should make robust predictions
Notion of biologically similar sequences defined using Block Substitution Matrices (BLOSUM)
Adversarial attacks used on PFNNs within space of similar sequences
RMSD and GDT used as robustness measure

BLOSUM similarity measures

Two sequences of n residues are compared to calculate sequence similarity
Not all changes in residues have the same impact on protein structures
Early work in bioinformatics focused on properties of amino acids and genetic codes
Amino acid scoring matrices are derived from empirical observations of frequencies of amino acid replacements in homologous sequences
PAM250 matrix was based on 1572 mutations observed in 71 families of closely-related proteins
BLOSUM approach focuses on identifying conserved blocks or conserved sub-sequences in a variety of proteins
BLOSUM62, BLOSUM80 and BLOSUM90 denote block substitution matrices built from blocks whose sequences are clustered at 62%, 80%, and 90% identity, respectively
BLOSUM matrix is a matrix of integers that denotes the similarity between residue types
Sequence similarity measure counts replacement frequencies in conserved blocks across different proteins
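
One way to turn a BLOSUM matrix into a distance between two equal-length sequences is to sum, over aligned positions, how much score each substitution loses relative to leaving the residue unchanged. The sketch below assumes that formulation for illustration; it is not necessarily the exact definition used in the paper. `blosum_distance` is a hypothetical helper, and `BLOSUM62_SUBSET` holds only the BLOSUM62 entries for five residue types.

```python
# Subset of BLOSUM62 restricted to residues A, I, L, V, D (illustration only;
# a real application would load the full 20x20 matrix).
_RESIDUES = "AILVD"
_ROWS = [
    [4, -1, -1, 0, -2],   # A
    [-1, 4, 2, 3, -3],    # I
    [-1, 2, 4, 1, -4],    # L
    [0, 3, 1, 4, -3],     # V
    [-2, -3, -4, -3, 6],  # D
]
BLOSUM62_SUBSET = {
    (a, b): _ROWS[i][j]
    for i, a in enumerate(_RESIDUES)
    for j, b in enumerate(_RESIDUES)
}


def blosum_distance(original, perturbed, matrix=BLOSUM62_SUBSET):
    """Sum of per-position score losses caused by substitutions.

    Assumed formulation: an unchanged residue contributes 0; a substitution
    a -> b contributes matrix[a, a] - matrix[a, b].
    """
    if len(original) != len(perturbed):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, a)] - matrix[(a, b)]
               for a, b in zip(original, perturbed))
```

Under this assumed distance, a conservative substitution such as I to L costs 2, while a disruptive one such as L to D costs 8, which reflects the point that not all residue changes have the same impact.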

Approach

Existence of adversarial examples in PFNNs that produce different structures
Use of BLOSUM matrices to identify similar sequences with similar 3D structures
RMSD and GDT used to measure robustness of PFNNs on given input
Focus on the AlphaFold model, winner of CASP14 (held in 2020)

Output structural measure

Given a sequence of n residues, its 3D structure is an ordered n-tuple of 3D coordinates
Goal is to use a structural distance measure that captures variations in two structures and is invariant to rigid-body motion
Structural distances used are RMSD and GDT with two variants: Total Score and High Accuracy
Alignment algorithm is used before computing RMSD and GDT measures
RMSD is measured in Å
GDT score returns a value in [0, 1] where 1 indicates identical structures
GDT score is computed with respect to four distance thresholds (conventionally 1, 2, 4, and 8 Å for GDT-TS and 0.5, 1, 2, and 4 Å for GDT-HA)
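
The two measures above can be sketched in a few lines of Python. This is a simplified illustration that assumes the two structures have already been superimposed: it does not perform the alignment step, and a full GDT implementation searches over many superpositions rather than scoring a single one. The function names are hypothetical; the threshold tuples are the conventional GDT-TS and GDT-HA cutoffs.

```python
import math

GDT_TS_THRESHOLDS = (1.0, 2.0, 4.0, 8.0)  # Angstrom, Total Score variant
GDT_HA_THRESHOLDS = (0.5, 1.0, 2.0, 4.0)  # Angstrom, High Accuracy variant


def residue_distances(coords_a, coords_b):
    """Euclidean distance between corresponding residues of two structures."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of residues")
    return [math.dist(p, q) for p, q in zip(coords_a, coords_b)]


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation in Angstrom (no alignment performed)."""
    d = residue_distances(coords_a, coords_b)
    return math.sqrt(sum(x * x for x in d) / len(d))


def gdt(coords_a, coords_b, thresholds=GDT_TS_THRESHOLDS):
    """Mean fraction of residues within each threshold; value in [0, 1]."""
    d = residue_distances(coords_a, coords_b)
    fractions = [sum(x <= t for x in d) / len(d) for t in thresholds]
    return sum(fractions) / len(fractions)


# Four residues displaced by 0, 0.75, 3 and 10 Angstrom respectively.
reference = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
predicted = [(0, 0, 0), (1, 0, 0.75), (2, 0, 3), (3, 0, 10)]
ts_score = gdt(reference, predicted, GDT_TS_THRESHOLDS)  # 0.625
ha_score = gdt(reference, predicted, GDT_HA_THRESHOLDS)  # 0.5
```

RMSD grows with any large deviation, whereas GDT only counts how many residues fall within each cutoff, so a single badly placed residue affects the two measures differently.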

Adversarial attacks on PFNNs

Neural networks can be tricked into producing incorrect responses with small changes in input images.
A sequence of residues can be perturbed to maximize a structural distance measure while keeping the perturbed sequence biologically similar to the original, that is, keeping the sequence distance below a bound.
A brute-force exploration of the sequence space is used to generate adversarial sequences.
Complex protein folding systems have a high inference time and a discrete input space, making it difficult to develop black-box attacks.
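
A sampling-based search of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `random_search_attack` and its parameters are hypothetical, the predictor is passed in as a black-box callable, and the demo uses toy stand-ins (a fake predictor and a Hamming distance in place of a BLOSUM-based one) so that the code runs without a folding model.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard types


def random_search_attack(seq, predict, seq_distance, struct_distance,
                         budget, n_samples=20, max_mutations=4, seed=0):
    """Sample perturbed sequences and keep the one whose predicted structure
    is farthest from the original, subject to seq_distance <= budget."""
    rng = random.Random(seed)
    original_structure = predict(seq)
    best_seq, best_score = seq, 0.0
    for _ in range(n_samples):
        # Mutate between 1 and max_mutations randomly chosen positions.
        k = rng.randint(1, min(max_mutations, len(seq)))
        candidate = list(seq)
        for pos in rng.sample(range(len(seq)), k):
            candidate[pos] = rng.choice(AMINO_ACIDS)
        candidate = "".join(candidate)
        # Skip candidates that are not similar enough to the original.
        if seq_distance(seq, candidate) > budget:
            continue
        score = struct_distance(original_structure, predict(candidate))
        if score > best_score:
            best_seq, best_score = candidate, score
    return best_seq, best_score


# --- Toy stand-ins, for demonstration only ---
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def toy_predict(seq):
    # NOT a folding model: maps each residue to a fake 1D "coordinate".
    return [float(ord(c)) for c in seq]


def toy_struct_distance(s1, s2):
    return max(abs(x - y) for x, y in zip(s1, s2))


original = "MKTAYIAKQR"
adversarial, score = random_search_attack(
    original, toy_predict, hamming, toy_struct_distance, budget=2)
```

Each candidate costs one call to the predictor, so with a real folding network the number of samples is limited by inference time, which is the practical difficulty noted above.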

Complexity

The PAA problem is formalized and its complexity is established
The PAA problem is NP-complete
A reduction from the CLIQUE problem is used to prove the NP-completeness
The PAA problem is in NP
The PAA problem is NP-hard
The CLIQUE problem is reduced to an instance of the PAA problem
The input tensor is represented as a one-hot encoding
The connectivity structure of the model is derived from the edges of the CLIQUE instance
There is a clique of size k in G if and only if there is a feasible solution to the reduced PAA instance
The RMSD distance is computed without the alignment step
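
For orientation, the decision version of the PAA problem can be written along the following lines. The notation is an assumption made here for illustration ($f$ for the network, $D_{\mathrm{seq}}$ and $D_{\mathrm{str}}$ for the sequence and structural distances, $T$ and $\delta$ for their bounds) and may differ from the symbols used in the paper.

```latex
\textbf{PAA (decision version).}
Given a network $f$, a sequence $S = (s_1, \dots, s_n)$, a sequence distance
$D_{\mathrm{seq}}$ with bound $T$, and a structural distance $D_{\mathrm{str}}$
with bound $\delta$, decide whether there exists a sequence $S'$ such that
\[
  D_{\mathrm{seq}}(S, S') \le T
  \quad\text{and}\quad
  D_{\mathrm{str}}\bigl(f(S), f(S')\bigr) \ge \delta .
\]
```

Written this way, membership in NP is easy to see: a candidate $S'$ serves as a certificate, and both inequalities can be checked in polynomial time with one forward pass of $f$.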

Experimental results

Used AlphaFold 1 with default settings
Included results from high-accuracy and less accurate MSA step
Used PyMOL alignment without outlier rejections
Generated adversarial sequences by randomly sampling 20 sequences
Investigated how change in bound on biological similarity changes adversarial sequence
Configured BLOSUM threshold to be 20, 30, and 40
Observed that increase in BLOSUM threshold increases RMSD
Change in overall average confidence between original and perturbed sequence not significant
Investigated impact of using prediction confidence scores in determining location of residues to be altered
Selecting residues with low or high confidence scores not related to amount of induced RMSD
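
Choosing which residues to alter based on confidence can be sketched as below. `select_positions` is a hypothetical helper, and the confidence values are made-up numbers used only to show the selection logic; they are not results from the paper.

```python
def select_positions(confidences, k, lowest=True):
    """Return the indices of the k residues with the lowest (or highest)
    per-residue confidence, in sequence order."""
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i],
                   reverse=not lowest)
    return sorted(order[:k])


# Made-up per-residue confidence scores for a 5-residue sequence.
scores = [90.0, 42.5, 71.0, 55.0, 98.0]
low_conf_positions = select_positions(scores, k=2)                 # [1, 3]
high_conf_positions = select_positions(scores, k=2, lowest=False)  # [0, 4]
```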

COVID-19 case studies

Applied adversarial approach to 111 publicly available COVID-19 protein sequences
Figures 1 and 4 show original and adversarial sequences
Small changes in input sequence result in significant changes in output structures
Similarity between original and adversarial sequences is high
Small changes in input sequence cause AlphaFold to predict structures that are highly divergent from original
AlphaFold predicts adversarial structure with similar confidence values to original
GDT scores are generally low
Small changes in input sequences can damage predictions

Conclusion

Recent progress in predicting protein folding structures has the potential to advance understanding of diseases, the human proteome, and drug design.
Predictive protein folding is still a grand challenge.
This paper presents the first work on the robustness of Protein Folding Neural Networks (PFNNs), showing that they are vulnerable to minor changes in the input protein sequence.
These changes can cause large changes in the predicted protein structure, making PFNNs unsuitable for safety-critical applications.
Standard protein structural distance and similarity were used to measure the robustness of AlphaFold.
Adversarial sequences were generated against COVID-19 and UniProt protein sequences.
Results were reported for RMSD, GDT-TS, and GDT-HA.
