Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- AlphaFold is a protein folding neural network that predicts accurate protein structures
- AlphaFold’s robustness has not been explored
- We measure the robustness of the predicted structures using RMSD and GDT similarity measure
- Minimally perturbing protein sequences to fool protein folding neural networks is NP-complete
- Adversarial protein sequences can lead to large RMSD between predicted protein structure and original sequence
Paper Content
Introduction
- Proteins are essential for life and reproduction
- Proteins are composed of 20 amino acids
- It is important to understand 3D structure of proteins
- High-throughput sequencing techniques have helped to understand primary sequence of proteins
- AlphaFold has achieved success in predicting protein structures using neural networks
- Adversarial sequences can result in very different 3D protein structures
Summary and related work
- PFNNs should obey the observation that small changes in protein sequence usually don’t lead to drastic changes in structure
- Studies have shown that two proteins with 50% sequence identity align within 1Å RMSD
- Exceptions exist where small perturbations can alter the entire fold of a protein
- Stein and Mchaourab 2021 used in silico mutagenesis to enhance AlphaFold prediction
- Del Alamo et al. 2022 present a method to manipulate inputs to obtain diverse structures
- Jha et al. 2021 aimed to generate adversarial sequences to damage RosettaFold output
- This paper presents results for more than 100 sequences and derives a complexity proof
Robustness metric using adversarial attacks
- PFNNs should make robust predictions
- Notion of biologically similar sequences defined using Block Substitution Matrices
- Adversarial attacks used on PFNNs within space of similar sequences
- RMSD and GDT used as robustness measure
Blosum similarity measures
- Two sequences of n residues are compared to calculate sequence similarity
- Not all changes in residues have the same impact on protein structures
- Early work in bioinformatics focused on properties of amino acids and genetic codes
- Amino acid scoring matrices are derived from empirical observations of frequencies of amino acid replacements in homologous sequences
- PAM250 matrix was based on 1572 mutations observed in 71 families of closely-related proteins
- BLOSUM approach focuses on identifying conserved blocks or conserved sub-sequences in a variety of proteins
- BLOSUM62, BLOSUM80 and BLOSUM90 denote block substitution matrices with 62%, 80%, and 90% similarity
- BLOSUM matrix is a matrix of integers that denotes the similarity between residue types
- Sequence similarity measure counts replacement frequencies in conserved blocks across different proteins
Approach
- Existence of adversarial examples in PFNNs that produce different structures
- Use of BLOSUM matrices to identify similar sequences with similar 3D structures
- RMSD and GDT used to measure robustness of PFNNs on given input
- Focus on AlphaFold model, winner of CASP2020
Output structural measure
- Given a sequence of n residues, its 3D structure is an ordered n-tuple of 3D coordinates
- Goal is to use a structural distance measure that captures variations in two structures and is invariant to rigid-body motion
- Structural distances used are RMSD and GDT with two variants: Total Score and High Accuracy
- Alignment algorithm is used before computing RMSD and GDT measures
- RMSD is measured in Å
- GDT score returns a value in [0, 1] where 1 indicates identical structures
- GDT score is computed with respect to four thresholds
Adversarial attacks on pfnns
- Neural networks can be tricked into producing incorrect responses with small changes in input images.
- A sequence of residues can be changed to maximize a structural distance measure while minimizing sequence similarity.
- A brute-force exploration of the sequence space is used to generate adversarial sequences.
- Complex protein folding systems have a high inference time and a discrete input space, making it difficult to develop black-box attacks.
Complexity
- The PAA problem is formalized and its complexity is established
- The PAA problem is NP-complete
- A reduction from the CLIQUE problem is used to prove the NP-completeness
- The PAA problem is in NP
- The PAA problem is NP-hard
- The PAA problem is reduced to an instance of the PAA problem
- The input tensor is represented as a one-hot encoding
- The connectivity structure of the model is derived from the edges of the CLIQUE instance
- There is a clique of size k in G if and only if there is a feasible solution to the reduced PAA instance
- The RMSD distance is computed without the alignment step
Experimental results
- Used AlphaFold 1 with default settings
- Included results from high-accuracy and less accurate MSA step
- Used PyMOL alignment without outlier rejections
- Generated adversarial sequences by randomly sampling 20 sequences
- Investigated how change in bound on biological similarity changes adversarial sequence
- Configured BLOSUM threshold to be 20, 30, and 40
- Observed that increase in BLOSUM threshold increases RMSD
- Change in overall average confidence between original and perturbed sequence not significant
- Investigated impact of using prediction confidence scores in determining location of residues to be altered
- Selecting residues with low or high confidence scores not related to amount of induced RMSD
Covid-19 case studies
- Applied adversarial approach to 111 publicly available COVID-19 protein sequences
- Figures 1 and 4 show original and adversarial sequences
- Small changes in input sequence result in significant changes in output structures
- Similarity between original and adversarial sequences is high
- Small changes in input sequence cause AlphaFold to predict structures that are highly divergent from original
- AlphaFold predicts adversarial structure with similar confidence values to original
- GDT scores are generally low
- Small changes in input sequences can damage predictions
Conclusion
- Recent progress in predicting protein folding structures has the potential to advance understanding of diseases, the human proteome, and drug design.
- Predictive protein folding is still a grand challenge.
- This paper presents the first work in this direction by showing that Protein Folding Neural Networks (PFNNs) are vulnerable to minor changes in the input protein sequence.
- These changes can cause large changes in the predicted protein structure, making PFNNs unsuitable for safety-critical applications.
- Standard protein structural distance and similarity were used to measure the robustness of AlphaFold.
- Adversarial sequences were generated against COVID-19 and UniProt protein sequences.
- Results were reported for RMSD, GDT-TS, and GDT-HA.