I have several protein structures – what to do and where to start?

How structural similarity points the way in drug design

Computational structural biology can't exist without structural alignment. As we know, sequence defines structure, and structure defines function; hence, similarities in protein structures can suggest similarities in protein functions. Such molecular similarities (and differences) are crucial for understanding evolutionary relationships between proteins and their common mechanisms of action. Comparative analysis through structural alignment aims to find structural motifs, functional sites, and conserved patterns across molecular structures (Ma and Wang 2014). Structural clustering is closely related to structural alignment. It uncovers hidden patterns in complex structural data and organizes the data into meaningful groups based on 3D similarity (how spatially close two structures are to each other). By grouping similar structures, clustering simplifies the interpretation and analysis of large datasets, revealing trends and relationships that might otherwise be overlooked (Hamamsy et al. 2023). Together, structural alignment and clustering improve the understanding of structure-function relationships and support knowledge-based strategies in drug discovery rather than aimless wandering in the dark.

In the figure below, you can see different proteins that partially share a similar fold. The proteins were selected with the Foldseek structure search tool (van Kempen et al. 2024), using the 3N1H molecule as the query. 1L3A and 2GIA were at the top of the Foldseek ranking because their global structures are quite similar to 3N1H. In contrast, 4R5J, 8ESD, and 3NM7 were at the bottom, since only some local structural elements were similar to those in 3N1H. These observations are reflected in the TM-score evaluation, which is discussed below.

Why sequence similarity isn’t enough

While sequence similarity search remains the primary method for protein annotation and analysis, its reliance solely on sequence data can be limiting (Shatsky, Nussinov, and Wolfson 2008). Despite its effectiveness in identifying homologous sequences and inferring properties such as function and structure, sequence similarity alone may not always provide a comprehensive understanding of protein evolution (Liu et al. 2018). This limitation arises because highly diverse sequences can share very similar structures, which points to functional similarity preserved through evolution (Koehl 2001). For example, proteins with sequence identity below 25% can still have similar structures (Krissinel 2007). Protein design makes the same point from the other direction: it is now possible to create highly diverse sequences that fold into essentially the same structure. These approaches include AI-based methods, such as ProteinMPNN (Dauparas et al. 2022), which diversifies the amino acid sequence while preserving the structure, and non-AI-based techniques like the QTY code, which converts the hydrophobic sequence of a membrane protein into a hydrophilic one.

The interplay between sequence and structural analysis underscores the need for more advanced techniques in protein structure prediction and modeling. In response to this demand, initiatives like the Critical Assessment of Structure Prediction (CASP) emerged (Kryshtafovych et al. 2021). CASP is a biennial community-wide experiment that evaluates state-of-the-art structure prediction methods. Through CASP, researchers gain insights into the strengths and limitations of existing methods and identify areas for future improvement. A notable example is AlphaFold2, the top-performing method at CASP14, which used AI to dramatically improve the accuracy of protein structure prediction (Jumper et al. 2021).

Quantitative decoding of structural similarity

Clustering, structural similarity search, and alignment all use similar scores to evaluate their results. These scores usually rely on the distances between two molecular structures, which can be either experimental or modeled. Some of the most common scores are RMSD, TM-score, GDT, and LDDT (see details in the table) (Olechnovič et al. 2019). Depending on how much of the protein each score takes into account, they fall into two groups. Global scores (RMSD, TM-score, and GDT) consider the whole structure of the molecule and give a general overview of the resemblance or dissimilarity between structures. In contrast, local scores (LDDT) consider only parts of the structure (domains or residues) and offer more detailed evaluations at that scale.

| Score | Description | Type | Usage |
| --- | --- | --- | --- |
| RMSD (root mean square deviation) | Calculates the square root of the average squared distance between corresponding atoms | Global | Measures the average distance between corresponding atoms |
| TM-score (template modeling score) | Derives a length-normalized score from the distances between aligned residues | Global | Evaluates similarity between protein structures |
| GDT (global distance test) | Computes the percentage of residues that superimpose within a given distance cutoff | Global | Quantifies similarity between two protein structures |
| LDDT (local distance difference test) | Compares interatomic distances in the model with those in the reference structure on a per-residue basis | Local | Assesses local structural quality of protein models |

RMSD (root-mean-square deviation)

RMSD is a common metric used to quantify the structural similarity or dissimilarity between two or more protein structures. It measures the average atomic displacement between equivalent atoms in the superimposed structures. An RMSD of 0 indicates a perfect match, meaning the structures are identical. RMSD values below 2 Å generally reflect a good alignment between the 3D coordinates of the compared structures, making them highly similar, while an RMSD above 3 Å suggests notable structural differences.

| Value | Interpretation |
| --- | --- |
| Low RMSD (<2 Å) | High (atom-wise) accuracy of the compared structures; compared structures are very similar or even identical; successful protein structure prediction |
| Medium RMSD (2-4 Å) | Residue-wise accuracy can be acceptable, depending on the resolution the task requires and the scale of the comparison (local, global, how particular structural elements correspond to each other); local RMSD can be higher in disordered regions (loops) and lower in well-predicted or well-structured regions (helices or sheets), which influences the global RMSD and the overall quality of the structure; compared structures are probably similar but not identical; can count as success or failure depending on the task and the required accuracy (do you need to resolve individual atoms, or are residues enough?) |
| High RMSD (>4 Å) | Low (domain-wise) accuracy; only global structural elements can be separated, distinguished, and compared; compared structures are very different or very hard to compare; unacceptable for many applications |
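
As a minimal sketch of the RMSD definition above, the Python function below computes RMSD for two equal-length lists of 3D coordinates. It assumes the structures are already superimposed and that atoms are paired by index; the function name `rmsd` and the toy coordinates are illustrative and not taken from any particular tool.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD in the same units as the input (e.g. Å).

    Assumes both structures are already superimposed and that
    coords_a[i] corresponds to coords_b[i].
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))

# Toy example: three atoms, the last one displaced by 3 Å along y.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 3.0, 0.0)]
print(round(rmsd(reference, model), 3))
```

Note how a single displaced atom dominates the result: the squared term makes RMSD sensitive to outliers such as flexible loops, which is one reason to read it alongside the other scores.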

TM-score (template modeling score)

TM-score measures the similarity between two protein structures. It offers a more accurate assessment of the overall resemblance of full-length protein structures than RMSD, because it is normalized by protein length and gives small deviations more weight than large ones. TM-score takes values in the range (0, 1], with 1 indicating a perfect match. Scores below 0.2 suggest unrelated proteins, while those above 0.5 imply structures with roughly the same fold.

| Value | Interpretation |
| --- | --- |
| High TM-score (~1) | High structural similarity between the compared structures; compared structures are very similar or even identical; successful protein structure prediction |
| Medium TM-score (0.4-0.5) | Acceptable structural similarity, depending on the specific task; moderate level of structural similarity |
| Low TM-score (<0.4) | Low structural similarity, indicating poor models in protein structure prediction |
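
The sketch below illustrates how a TM-score can be evaluated from per-residue distances, using the commonly cited form TM = (1/L) Σ 1/(1 + (d_i/d0)²) with d0 = 1.24·(L − 15)^(1/3) − 1.8 Å, where L is the length of the target protein. It assumes the alignment and superposition are already fixed, whereas real tools search for the superposition that maximizes the score. The function names and the 0.5 Å floor on d0 for short chains are assumptions made for this illustration.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 in Å.

    The 0.5 Å floor for short chains is an assumption of this sketch.
    """
    if target_length <= 15:
        return 0.5
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)

def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: distance in Å for each aligned residue pair.
    target_length: number of residues in the target (normalizing) protein.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length

# Every residue aligned with zero deviation gives a perfect score.
print(tm_score([0.0] * 100, 100))
# Only half of the target aligned (perfectly) gives half the score.
print(tm_score([0.0] * 50, 100))
```

Because unaligned residues contribute nothing while the sum is still divided by the full target length, partial matches are penalized in proportion to how much of the protein they leave uncovered.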

GDT (global distance test)

GDT quantifies the similarity between two protein structures that share known amino acid correspondences, such as identical amino acid sequences, but differ in tertiary structure. The GDT score assesses structural similarity by finding the largest set of alpha carbon atoms in the model structure that fall within a defined distance cutoff of their positions in the experimental structure. The algorithm generates 20 GDT scores at consecutive distance cutoffs (0.5 Å to 10.0 Å). For similarity assessment, several of these scores are used together, and they typically increase with higher cutoffs. A plateau in the score increase may indicate extreme divergence between the experimental and predicted structures. GDT scores are often expressed as a percentage and summarized as GDT-TS, the average over the 1, 2, 4, and 8 Å cutoffs, or the stricter GDT-HA, the average over the 0.5, 1, 2, and 4 Å cutoffs.

| Value | Interpretation |
| --- | --- |
| High GDT (>90%) | High accuracy; compared structures closely match (they are very similar or even identical); successful protein structure prediction |
| Medium GDT (50-90%) | Can be considered acceptable, depending on the task; whether the score is adequate should be decided in light of the research focus and together with other structure evaluation techniques |
| Low GDT (<50%) | Low accuracy and poor results; large parts of the structure fall outside the distance cutoffs and are likely modeled inaccurately; the protein structure prediction is most likely unreliable and cannot be used |
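
To make the cutoff averaging concrete, here is a small sketch that computes GDT-style percentages from per-residue Cα distances. It assumes a single fixed superposition, whereas the actual GDT procedure searches for the superposition that maximizes the number of residues within each cutoff. The function name `gdt` and the example distances are hypothetical.

```python
GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)   # Å, "total score" variant
GDT_HA_CUTOFFS = (0.5, 1.0, 2.0, 4.0)   # Å, "high accuracy" variant

def gdt(distances, cutoffs):
    """Average percentage of residues within each cutoff.

    distances: Cα-Cα distance in Å per residue for one fixed superposition
    (a simplification of the real GDT search).
    """
    if not distances:
        raise ValueError("need at least one distance")
    fractions = [
        sum(1 for d in distances if d <= cutoff) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)

# Hypothetical per-residue deviations in Å.
deviations = [0.3, 0.9, 1.5, 3.0, 6.0, 12.0]
print(round(gdt(deviations, GDT_TS_CUTOFFS), 2))
print(round(gdt(deviations, GDT_HA_CUTOFFS), 2))
```

The same set of deviations scores lower under the GDT-HA cutoffs than under GDT-TS, which is why GDT-HA is the stricter choice for high-accuracy models.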

LDDT (local distance difference test)

LDDT assesses the local accuracy of protein structures even in the presence of domain movements, while maintaining a strong correlation with global measures. It does not require superposition: it checks how well the distances between nearby atoms in one protein structure are preserved in the structure it is compared with. A related quantity is pLDDT (predicted LDDT), the per-residue confidence reported by prediction methods such as AlphaFold2. Rather than comparing a model with an experimental structure, pLDDT is the predictor's own estimate of the LDDT each residue would achieve if the experimental structure were available, so it offers insights into the reliability of individual regions within a protein model. LDDT can be reported per residue or as a single score for the whole structure, whereas pLDDT is assigned residue by residue. LDDT is defined on a scale from 0 to 1 and is often reported from 0 to 100, the scale used for pLDDT; higher scores indicate closer correspondence of individual residues between structures at the local level.

| Value | Interpretation |
| --- | --- |
| High LDDT (>80) | High confidence in the local accuracy of the predicted protein structure, good results with reliable side chains; compared structures are very similar or even identical; successful protein structure prediction |
| Medium LDDT (50-80) | Can be considered acceptable, especially if the global structure of the protein is still accurately predicted; important to interpret LDDT scores alongside other metrics to assess the overall quality of the predicted structure |
| Low LDDT (<50) | Low confidence in the local accuracy of the predicted structure, poor results; residues with low LDDT scores are likely to have inaccuracies in their local environments; the protein probably has a disordered structure |
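
The following sketch shows the idea behind LDDT in a simplified, Cα-only form. For every pair of atoms closer than an inclusion radius in the reference, it checks whether the same distance in the model is preserved within each tolerance, then reports the preserved fraction on a 0 to 1 scale. The 15 Å radius and the 0.5, 1, 2, and 4 Å tolerances are the usual defaults; restricting the calculation to one atom per residue and skipping stereochemical checks are simplifications of this sketch, and the function name `lddt_ca` is hypothetical.

```python
import math

def lddt_ca(reference, model, radius=15.0, tolerances=(0.5, 1.0, 2.0, 4.0)):
    """Simplified global LDDT (0 to 1) using one atom per residue.

    reference, model: lists of (x, y, z) tuples, paired by index.
    No superposition is needed: only internal distances are compared.
    """
    if len(reference) != len(model):
        raise ValueError("coordinate lists must have equal length")
    preserved = 0
    considered = 0
    for i in range(len(reference)):
        for j in range(i + 1, len(reference)):
            d_ref = math.dist(reference[i], reference[j])
            if d_ref >= radius:
                continue  # only local contacts in the reference count
            difference = abs(d_ref - math.dist(model[i], model[j]))
            considered += len(tolerances)
            preserved += sum(1 for t in tolerances if difference < t)
    if considered == 0:
        raise ValueError("no atom pairs within the inclusion radius")
    return preserved / considered

# A straight four-atom chain, a rotated and shifted copy, and a bent copy.
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
moved = [(5.0, -2.0, 1.0), (5.0, 1.8, 1.0), (5.0, 5.6, 1.0), (5.0, 9.4, 1.0)]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (7.6, 3.8, 0.0)]
print(lddt_ca(chain, moved))
print(lddt_ca(chain, bent))
```

The rotated and shifted copy scores the same as the original because only internal distances are compared, which is what makes the measure robust to domain movements.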

Structural comparison is available both explicitly and implicitly in 310 copilot, so even today you can find out how close or distant the structures of your sequences are through chat-based assistance for structural comparison, analysis, and clustering.

References

Dauparas, J., I. Anishchenko, N. Bennett, H. Bai, R. J. Ragotte, L. F. Milles, B. I. M. Wicky, et al. 2022. “Robust Deep Learning-Based Protein Sequence Design Using ProteinMPNN.” Science 378 (6615): 49–56.

Hamamsy, Tymor, James T. Morton, Robert Blackwell, Daniel Berenberg, Nicholas Carriero, Vladimir Gligorijevic, Charlie E. M. Strauss, Julia Koehler Leman, Kyunghyun Cho, and Richard Bonneau. 2023. “Protein Remote Homology Detection and Structural Alignment Using Deep Learning.” Nature Biotechnology, September, 1–11.

Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021. “Applying and Improving AlphaFold at CASP14.” Proteins 89 (12): 1711–21.

Kempen, Michel van, Stephanie S. Kim, Charlotte Tumescheit, Milot Mirdita, Jeongjae Lee, Cameron L. M. Gilchrist, Johannes Söding, and Martin Steinegger. 2024. “Fast and Accurate Protein Structure Search with Foldseek.” Nature Biotechnology 42 (2): 243–46.

Koehl, P. 2001. “Protein Structure Similarities.” Current Opinion in Structural Biology 11 (3): 348–53.

Krissinel, Evgeny. 2007. “On the Relationship between Sequence and Structure Similarities in Proteomics.” Bioinformatics 23 (6): 717–23.

Kryshtafovych, Andriy, Torsten Schwede, Maya Topf, Krzysztof Fidelis, and John Moult. 2021. “Critical Assessment of Methods of Protein Structure Prediction (CASP)-Round XIV.” Proteins 89 (12): 1607–17.

Liu, Yang, Qing Ye, Liwei Wang, and Jian Peng. 2018. “Learning Structural Motif Representations for Efficient Protein Structure Search.” Bioinformatics 34 (17): i773–80.

Ma, Jianzhu, and Sheng Wang. 2014. “Algorithms, Applications, and Challenges of Protein Structure Alignment.” Advances in Protein Chemistry and Structural Biology 94: 121–75.

Olechnovič, Kliment, Bohdan Monastyrskyy, Andriy Kryshtafovych, and Česlovas Venclovas. 2019. “Comparative Analysis of Methods for Evaluation of Protein Models against Native Structures.” Bioinformatics 35 (6): 937–44.

Shatsky, Maxim, Ruth Nussinov, and Haim J. Wolfson. 2008. “Algorithms for Multiple Protein Structure Alignment and Structure-Derived Multiple Sequence Alignment.” Methods in Molecular Biology 413: 125–46.