Hide and seek: structural similarities with Foldseek

Researchers in computational biology, particularly those involved in protein design, use tools for structural search and clustering every day. These two approaches help to organize protein structures based on their spatial similarity. But why do we need them at all? See our structural similarity blog post.

A new algorithm called Foldseek (van Kempen et al. 2024) efficiently clustered over 214 million protein structures from the AlphaFold database into only 2.30 million clusters, reducing the data almost 100-fold! Clustering groups the data points together so that each group can be summarized by a single non-redundant, meaningful representative structure. This reduction is possible because many protein types share intrinsic structural patterns. When we stop looking at each protein separately, or only within a specific type (transmembrane proteins, for example) or species (human proteins, for example), we can find similarities between many of them. In other words, a dataset can contain several similar proteins that can be treated as one because they share the same 3D characteristics. Such similarities are called conserved, as they have persisted throughout evolution. Since conserved structures exist and structure often defines function, they are likely responsible for similar (and often vital) functions.
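To make the idea of representatives concrete, here is a minimal sketch of greedy clustering on toy data. It is not Foldseek's clustering procedure: the "structures" are made-up strings, and the similarity rule (at most one differing position) is a hypothetical stand-in for a real structural similarity score. The last lines simply redo the reduction arithmetic from the numbers quoted above.

```python
def similar(a, b, max_mismatches=1):
    """Hypothetical similarity rule: same length, at most one differing position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= max_mismatches


def greedy_cluster(items):
    """Assign each item to the first representative it resembles, else start a new cluster."""
    clusters = {}  # representative -> list of members (including itself)
    for item in items:
        for representative in clusters:
            if similar(representative, item):
                clusters[representative].append(item)
                break
        else:
            clusters[item] = [item]
    return clusters


# Toy "structures" written as strings; real inputs would be 3D coordinates.
toy_structures = ["HHHH", "HHHE", "EEEE", "EEEH", "HHHH"]
clusters = greedy_cluster(toy_structures)
print(clusters)

# Reduction factor for the numbers quoted in the text.
reduction = 214_000_000 / 2_300_000
print(f"about {reduction:.0f}-fold fewer entries")
```

Only the representatives (the dictionary keys) need to be inspected afterwards, which is where the data reduction comes from.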

About 31% of the clusters that Foldseek found may contain previously unknown structures (“dark proteins”) (Barrio-Hernandez et al. 2023), highlighting the limited structural knowledge we possess. While most clusters are ancient, some are species-specific, suggesting possible new gene emergence. The authors compared structures to find groups of related domains and discover distant connections between them, expanding our understanding of how different families are related across evolution. This further confirms the significance of evolutionary information in structural biology, as shown by AlphaFold2 (Jumper et al. 2021) and AlphaFold3 (Abramson et al. 2024), which improved structure prediction accuracy by incorporating sequence conservation information.

In the figure below, you can see the top representatives of the three largest clusters among both functionally known (“non-dark”) and unknown (“dark”) clusters. Non-dark clusters outnumber dark ones and are more populated. The structures differ greatly from each other in both size and shape. Not all dark-cluster structures have correct models predicted by AlphaFold2 (see the A0A7Z0QE20 representative, for example). Overall, this information sheds light on the structural organization of the whole AlphaFold database.

Foldseek converts query structures into 3Di alphabet sequences and uses a pre-trained 3Di substitution matrix to search target structures via MMseqs2. High-scoring hits are then aligned either locally, combining 3Di and amino acid substitution scores (default), or globally with TM-align (Foldseek-TM). Benchmarked against other methods for sensitivity and accuracy in homology detection on single-domain structures from the SCOPe40 database, Foldseek demonstrates high performance across all six secondary structure classes. Foldseek shows similar sensitivity to Dali and TM-align, higher sensitivity than CE, and much higher sensitivity than CLE-SW, Geometricus, 3D-BLAST, and MMseqs2. In terms of alignment quality, “Foldseek alignments are more accurate and sensitive than MMseqs2, CLE-SW, and TM-align, similarly accurate as Dali and 13% less precise but 15% more sensitive than CE” (van Kempen et al. 2024).
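Once structures are written as letter sequences, comparing them becomes an ordinary sequence alignment problem. The sketch below shows a plain Smith-Waterman local alignment score over such sequences. It is an illustration under loud assumptions: the scoring function (`toy_score`), the gap penalty, and the example strings are all made up, whereas Foldseek uses trained substitution matrices, combines 3Di with amino acid scores, and relies on MMseqs2 for fast prefiltering.

```python
def toy_score(a, b):
    """Hypothetical substitution score: reward identical letters, penalize the rest."""
    return 3 if a == b else -1


def smith_waterman_score(query, target, score_fn=toy_score, gap=-2):
    """Return the best local alignment score using a linear gap penalty."""
    previous = [0] * (len(target) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        current = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            diagonal = previous[j - 1] + score_fn(query[i - 1], target[j - 1])
            up = previous[j] + gap        # gap in the target
            left = current[j - 1] + gap   # gap in the query
            # Local alignment: scores never drop below zero.
            current[j] = max(0, diagonal, up, left)
            best = max(best, current[j])
        previous = current
    return best


# Made-up letter strings standing in for structural-alphabet sequences.
print(smith_waterman_score("DVLV", "DVLV"))      # identical sequences
print(smith_waterman_score("PPDVLVQQ", "DVLV"))  # shared segment inside a longer one
print(smith_waterman_score("AAAA", "CCCC"))      # nothing in common
```

Because the alignment is local, a short shared segment scores just as well when it is embedded in a longer, otherwise unrelated sequence.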

Structural analysis is far slower than sequence analysis. The authors of Foldseek highlight this disparity: “Searching with a single query structure through a database with 100 million protein structures would take the popular TM-align tool a month on one CPU core, and an all-versus-all comparison would take 10 millennia on a 1,000-core cluster. Sequence searching is four to five orders of magnitude faster: an all-versus-all comparison of 100 million sequences would take MMseqs2 only around a week on the same cluster” (van Kempen et al. 2024). In contrast, their algorithm, Foldseek, drastically accelerates structural comparisons while maintaining sensitivity. This speed-up, achieved through the development of the 3Di alphabet, makes large-scale structural analysis feasible. Unlike conventional structural alphabets such as CLE, 3D-BLAST, and Protein Blocks, which describe the protein backbone, 3Di characterizes interactions between residue pairs that lie close together in 3D space. This approach offers several advantages: less dependency between consecutive letters, evenly distributed state frequencies, and enhanced information representation in critical protein regions such as binding pockets or catalytic domains.
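The all-versus-all estimate in the quote can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes that cost grows linearly with the number of queries and that the work parallelizes perfectly across cores; both are simplifications made here for illustration, not statements from the paper.

```python
# Figures taken from the quote above.
database_size = 100_000_000   # structures in the database
months_per_query = 1          # one TM-align query on one CPU core
cores = 1_000                 # size of the cluster

# All-versus-all means every structure is used once as a query.
total_core_months = database_size * months_per_query
wall_clock_months = total_core_months / cores   # assumes perfect parallel scaling
wall_clock_years = wall_clock_months / 12
wall_clock_millennia = wall_clock_years / 1_000

print(f"{wall_clock_years:,.0f} years, or about {wall_clock_millennia:.1f} millennia")
```

Under these assumptions the result is roughly 8 millennia, the same order of magnitude as the 10 millennia quoted by the authors.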

The authors provide a web server where you can try this tool for free. It is also available on 310.ai copilot if you prefer a more chat-style approach to computational tools. Foldseek has three main modes: search, cluster, and complexsearch. In all cases, you need to provide a structure in PDB or mmCIF format as input. The default search output contains the sequence IDs and C-alpha coordinates of the query (your input) and the target (the hit found), along with the TM-score, rotation matrix, translation vector, LDDT, and the probability that the query and target are homologous. For cluster, the output is a list of clusters with information about the representative structure of each cluster and its relation to the other members. Structures belong to the same cluster if they share significant 3D similarity. The complexsearch output is similar to that of search and cluster, but it is designed specifically for multi-chain structures (protein complexes).
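If you export search hits as tab-separated text, a few lines of Python are enough to rank them. The sketch below is an assumption-heavy illustration: the column names and their order (`query`, `target`, `alntmscore`, `lddt`, `prob`) are modeled on the field names accepted by Foldseek's `--format-output` option and should be checked against your own export, the 0.9 probability cutoff is arbitrary, and the example rows are made up rather than real results.

```python
import csv
import io

# Assumed column order; adjust it to match how your results were exported.
COLUMNS = ["query", "target", "alntmscore", "lddt", "prob"]
NUMERIC = {"alntmscore", "lddt", "prob"}


def parse_hits(text, columns=COLUMNS):
    """Parse tab-separated hit lines into dictionaries with numeric scores."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:  # skip blank lines
            continue
        hit = dict(zip(columns, row))
        for key in NUMERIC & hit.keys():
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


def confident_hits(hits, min_prob=0.9):
    """Keep hits above a homology-probability cutoff, best TM-score first."""
    kept = [h for h in hits if h["prob"] >= min_prob]
    return sorted(kept, key=lambda h: h["alntmscore"], reverse=True)


# Made-up rows for illustration only.
EXAMPLE = (
    "my_query\ttarget_A\t0.62\t0.71\t0.98\n"
    "my_query\ttarget_B\t0.85\t0.90\t1.00\n"
    "my_query\ttarget_C\t0.31\t0.40\t0.12\n"
)

hits = parse_hits(EXAMPLE)
best = confident_hits(hits)
for hit in best:
    print(hit["target"], hit["alntmscore"], hit["prob"])
```

Filtering on the homology probability first and sorting by TM-score second is only one reasonable choice; depending on the question, LDDT or another score may be the better ranking key.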

Foldseek is a great tool to try. It should significantly speed up your structural analysis while still providing reliable results.