I have several protein structures – what to do and where to start?
May 8, 2024
How structural similarity points the way in drug design
Computational structural biology can't exist without structural alignment. As we know, sequence defines structure, and structure defines function, so similarities in protein structures can suggest similarities in protein functions. Such molecular similarities (and differences) are crucial for understanding evolutionary relationships between proteins and their common mechanisms of action. Comparative analysis through structural alignment aims to find structural motifs, functional sites, and conserved patterns across molecular structures (Ma and Wang 2014). Structural clustering is closely related to structural alignment. It uncovers hidden patterns in complex structural data and organizes the data into meaningful groups based on 3D similarity (how spatially close two structures are to each other). By grouping similar structures, clustering simplifies the interpretation and analysis of large datasets, revealing trends and relationships that might otherwise be overlooked (Hamamsy et al. 2023). Together, structural alignment and clustering improve the understanding of structure-function relationships and support more "knowledge-based" strategies in drug discovery, rather than aimless wandering in the dark.
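To make the idea of grouping by 3D similarity concrete, here is a minimal sketch of threshold-based (single-linkage) clustering. It assumes you already have a pairwise similarity score between 0 and 1 for every pair of structures (for example, a TM-score produced by an alignment tool). The structure names, the toy scores, and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
def cluster_by_threshold(names, similarity, threshold=0.5):
    """Group structures connected by any chain of pairs scoring >= threshold."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the representative of x's group.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Hypothetical structures and made-up similarity scores (not real data).
names = ["A", "B", "C", "D"]
similarity = {
    ("A", "B"): 0.82, ("A", "C"): 0.31, ("A", "D"): 0.25,
    ("B", "C"): 0.28, ("B", "D"): 0.22, ("C", "D"): 0.61,
}
print(cluster_by_threshold(names, similarity))  # [['A', 'B'], ['C', 'D']]
```

Single linkage is only one possible choice. Real pipelines often use other strategies (for example, greedy representative-based clustering), and the right threshold depends on the score and the task.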
In the figure below, you can see different proteins that partially share a similar fold. The proteins were selected by searching with the FoldSeek comparison tool (van Kempen et al. 2024), using the 3N1H molecule as the query. 1L3A and 2GIA were at the top of the FoldSeek ranking because their global structures are quite similar to 3N1H. In contrast, 4R5J, 8ESD, and 3NM7 were at the bottom, since only some local structural elements were similar to those in 3N1H. These differences are reflected in the TM-score evaluation, which is described below.
Why sequence similarity isn’t enough
While sequence similarity search remains the primary method for protein annotation and analysis, its reliance solely on sequence data can be limiting (Shatsky, Nussinov, and Wolfson 2008). Despite its effectiveness in identifying homologous sequences and inferring properties such as function and structure, sequence similarity alone may not always provide a comprehensive understanding of protein evolution (Liu et al. 2018). This limitation arises because highly diverse sequences can share very similar structures, which points to functional similarity preserved through evolution (Koehl 2001). For example, proteins with sequence identity below 25% can still have similar structures (Krissinel 2007). Likewise, advances in protein design have enabled the creation of highly diverse sequences that fold into nearly identical structures. These approaches include both AI-based methods, such as ProteinMPNN (Dauparas et al. 2022), which diversifies the amino acid sequence while preserving the structure, and non-AI-based techniques like the QTY code, which converts the hydrophobic sequence of a membrane protein into a hydrophilic one.
The interplay between sequence and structural analysis underscores the need for more advanced techniques in protein structure prediction and modeling. In response to this demand, initiatives like the Critical Assessment of Structure Prediction (CASP) appeared (Kryshtafovych et al. 2021). CASP is a biennial, community-wide event that evaluates state-of-the-art structure prediction methods. Through CASP, researchers gain insights into the strengths and limitations of existing methods and identify areas for future improvement. For example, AlphaFold2, the top-performing method at CASP14, revolutionized the accuracy of protein structure prediction through AI (Jumper et al. 2021).
Quantitative decoding of structural similarity
Clustering, structural similarity search, and alignment all use similar scores to evaluate their results. These scores usually rely on the distances between two molecular structures, which can be either experimental or modeled. Some of the most common scores are RMSD, TM-score, GDT, and LDDT (see details in the table) (Olechnovič et al. 2019). They can be separated into two groups according to how much of the protein each score takes into account. Global scores (RMSD, TM-score, and GDT) consider the whole structure of the molecule, providing a general overview of the resemblance or dissimilarity between structures. In contrast, local scores (LDDT) consider the local environment of individual residues and offer a more detailed evaluation at that scale.
Score | Description | Type | Usage |
---|---|---|---|
RMSD (root mean square deviation) | Calculates the square root of the average of the squared differences between corresponding atom coordinates | Global | Measures average distance between corresponding atoms |
TM-score (template modeling score) | Sums a distance-weighted term over aligned residue pairs and normalizes it by protein length, so that small deviations count more than large ones | Global | Evaluates similarity between protein structures |
GDT (global distance test) | Computes the percentage of residues within a certain distance cutoff that are superimposed | Global | Quantifies similarity between two protein structures |
LDDT (local distance difference test) | Determines the difference in distance distributions between the model and the reference structure on a per-residue basis | Local | Assesses local structural quality of protein models |
RMSD (root-mean-square deviation)
RMSD is a common metric used to quantify the structural similarity or dissimilarity between two or more protein structures. It measures the average displacement between equivalent atoms after the structures have been superimposed. An RMSD of 0 Å indicates a perfect match, meaning the structures are identical. RMSD values below 2 Å generally reflect a good alignment of the 3D coordinates, meaning the structures are highly similar, while RMSD values above 3 Å suggest notable structural differences.
Value | Interpretation |
---|---|
Low RMSD (<2Å) | - High accuracy (atomwise) of compared structures - Compared structures are very similar or even identical - Successful protein structure prediction |
Medium RMSD (2-4Å) | - Accuracy (residue-wise) can be acceptable, depending on the resolution required for the task and the scale of the comparison (local, global, how particular structural elements correspond to each other) - Local RMSD can be higher in disordered regions (loops) and lower in well-predicted or well-structured regions (helices or sheets), which influences the global RMSD value and the overall quality of the structure - Compared structures are probably similar but not identical - Can be either a success or a failure, depending on the task and the required accuracy (do you need to resolve individual atoms, or are residues enough?) |
High RMSD (>4Å) | - Low accuracy (domain-wise) - Only global structural elements can be separated, distinguished, and compared - Compared structures are very different or very hard to compare - RMSD values in this range are unacceptable for many applications |
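As a minimal sketch of the calculation behind these values, RMSD reduces to a few lines once the two structures are superimposed and their atoms are paired. Both of those steps are assumed to have been done already (normally by an alignment tool). The function name and the toy coordinates are illustrative and do not come from any particular package.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD (in Å) between two paired, already superimposed coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    # Sum of squared distances between corresponding atoms.
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Toy example: three points, with the model shifted by 1 Å along y.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
print(rmsd(reference, model))  # 1.0
```

Because every squared distance enters the average with equal weight, a single badly placed loop can dominate the value, which is one reason RMSD is usually read alongside the other scores.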
TM-score (template modeling score)
TM-score measures the similarity between two protein structures. It offers a more accurate assessment of the overall resemblance of full-length protein structures than the commonly used root-mean-square deviation (RMSD), because it down-weights large local deviations and is normalized by protein length. TM-score takes values in the range (0, 1], with 1 indicating a perfect match. Scores below 0.2 suggest unrelated proteins, while those above 0.5 imply structures with roughly the same fold.
Value | Interpretation |
---|---|
High TM-score (~1) | - High structural similarity between compared structures - Compared structures are very similar or even identical - Successful protein structure prediction |
Medium TM-score (0.4-0.5) | - Acceptable structural similarity depending on the specific task - Moderate level of structural similarity |
Low TM-score (<0.4) | - Low structural similarity, indicating poor models in protein structure prediction cases |
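The sketch below illustrates how a TM-score is assembled from per-residue distances. It uses the published length-dependent scale d0 = 1.24 (L - 15)^(1/3) - 1.8 Å and assumes the aligned pairs and their distances for one fixed superposition are already known. Real TM-score programs additionally search for the superposition that maximizes the sum. The 0.5 Å floor on d0 for very short chains and the function names are assumptions made for this illustration.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in Å) used to weight deviations."""
    if target_length <= 15:
        return 0.5  # assumption: floor for very short chains
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score_fixed_superposition(pair_distances, target_length):
    """TM-score term for one given superposition (no optimization performed).

    pair_distances: distances (Å) between aligned residue pairs.
    target_length: length of the structure used for normalization.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in pair_distances) / target_length


# A 100-residue target with only half of its residues aligned perfectly:
print(tm_score_fixed_superposition([0.0] * 50, 100))  # 0.5
```

Each aligned pair contributes between 0 and 1, and unaligned residues contribute nothing, so both large deviations and missing coverage lower the score.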
GDT (global distance test)
GDT quantifies the similarity between two protein structures that share known amino acid correspondences, such as identical amino acid sequences, while exhibiting distinct tertiary structures. The GDT score assesses structure similarity by calculating the largest set of alpha carbon atoms in the model structure that fall within defined distance cutoffs of their positions in the experimental structure. The algorithm generates 20 GDT scores at consecutive distance cutoffs (0.5 Å to 10.0 Å). For structure similarity assessment, multiple GDT scores are used, and they typically increase with higher cutoffs. A plateau in the score increase may indicate extreme divergence between experimental and predicted structures. Summary scores are usually expressed as percentages: GDT-TS averages the results at the 1, 2, 4, and 8 Å cutoffs, while the stricter GDT-HA uses 0.5, 1, 2, and 4 Å.
Value | Interpretation |
---|---|
High GDT (>90%) | - High accuracy - Compared structures closely match (they are very similar or even identical) - Successful protein structure prediction |
Medium GDT (50-90%) | - Can be considered acceptable, depending on the task - Whether a GDT score in this range is adequate should be decided in light of the research focus and together with other structure evaluation techniques |
Low GDT (<50%) | - Low accuracy and poor results - Large parts of the model are likely to deviate from the reference structure - Protein structure prediction is most likely unreliable and cannot be used |
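As a rough sketch of how such percentages arise, the function below computes a GDT-style summary from Cα-Cα distances for one fixed superposition. Actual GDT implementations search over many superpositions to maximize the number of residues under each cutoff, so this simplified version can only underestimate the real score. The function name and the toy distances are illustrative.

```python
GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)  # Å
GDT_HA_CUTOFFS = (0.5, 1.0, 2.0, 4.0)  # Å, "high accuracy" variant


def gdt(pair_distances, cutoffs):
    """Average percentage of residues within each cutoff (fixed superposition)."""
    if not pair_distances:
        raise ValueError("need at least one residue pair")
    n = len(pair_distances)
    fractions = [sum(d <= c for d in pair_distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Made-up distances (Å) between model and reference Cα atoms.
distances = [0.4, 0.9, 1.5, 3.0, 6.0, 12.0, 0.2, 0.7]
print(gdt(distances, GDT_TS_CUTOFFS))  # 68.75
print(gdt(distances, GDT_HA_CUTOFFS))  # 53.125
```

The same set of distances gives a lower GDT-HA than GDT-TS because the tighter cutoffs admit fewer residues.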
LDDT (local distance difference test)
LDDT assesses the local accuracy of protein structures even in the presence of domain movements, while maintaining a strong correlation with global measures. It is superposition-free: it checks how well the distances between atoms in one structure are preserved in the structure being compared to it. LDDT is computed per residue, and the per-residue values are averaged to give a single score for the whole structure. A related quantity is pLDDT (predicted local distance difference test), the per-residue confidence reported by prediction methods such as AlphaFold2. Unlike LDDT, which needs a reference structure, pLDDT is the model's own estimate of how well each residue would score in an LDDT comparison against the experimental structure, so it offers insight into the reliability of individual regions of a model when no experimental structure is available. Both scores are reported here on a scale from 0 to 100 (LDDT is also often given as a fraction from 0 to 1), with higher scores indicating closer correspondence at the local level.
Value | Interpretation |
---|---|
High LDDT (>80) | - High confidence in the local accuracy of predicted protein structures, good results with reliable side chains - Compared structures are very similar or even identical - Successful protein structure prediction |
Medium LDDT (50-80) | - Can be considered acceptable, especially if the global structure of the protein is still accurately predicted - Important to interpret the LDDT scores alongside other metrics to assess the overall quality of the predicted structure |
Low LDDT (<50) | - Low confidence in the local accuracy of the predicted structure, poor results - Residues with low LDDT scores are likely to have inaccuracies in their local environments - The corresponding regions may be intrinsically disordered |
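The following is a simplified, Cα-only sketch of the LDDT idea. For every pair of residues closer than an inclusion radius in the reference, it checks whether the same distance in the model is preserved within four tolerance thresholds. The 15 Å radius and the 0.5, 1, 2, and 4 Å thresholds follow the commonly used defaults. Restricting the calculation to one atom per residue, as well as the function name, are simplifications assumed here; the full method works on all atoms and also applies stereochemical checks.

```python
import math

THRESHOLDS = (0.5, 1.0, 2.0, 4.0)  # Å tolerances for a "preserved" distance


def lddt_ca(reference, model, inclusion_radius=15.0):
    """Global Cα-only LDDT on a 0-100 scale; no superposition is needed."""
    if len(reference) != len(model):
        raise ValueError("structures must have the same number of residues")
    preserved = 0
    total = 0
    for i in range(len(reference)):
        for j in range(i + 1, len(reference)):
            d_ref = math.dist(reference[i], reference[j])
            if d_ref >= inclusion_radius:
                continue  # only local contacts in the reference are scored
            diff = abs(d_ref - math.dist(model[i], model[j]))
            preserved += sum(diff < t for t in THRESHOLDS)
            total += len(THRESHOLDS)
    return 100.0 * preserved / total if total else 0.0


reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
# A rigidly translated copy keeps every internal distance, so it scores 100.
shifted = [(x + 10.0, y + 5.0, z - 3.0) for x, y, z in reference]
# Moving only the last atom by 3 Å breaks some of the distance checks.
distorted = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 3.0, 0.0)]
print(lddt_ca(reference, shifted))    # 100.0
print(lddt_ca(reference, distorted))  # 75.0
```

Since only internal distances are compared, rigid-body movements of whole domains do not penalize residues whose local neighborhoods are intact.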
Structural comparison is available in 310 copilot, both explicitly and implicitly, so even today you can find out how close or distant the structures of your sequences are through chat-based assistance for structural comparison, analysis, and clustering.
References