Proteome enlightenment: AI annotation for proteins with unknown function
June 12, 2024
The gulf between known and unknown proteins
Proteins perform numerous functions in living organisms. They take part in building, repairing, catalyzing, and signaling, and more broadly in keeping cells alive and working properly. Advances in genome sequencing have produced an abundance of protein sequences (Durairaj et al. 2023). For example, UniProtKB 2024_03 contains around 245.5 million sequence records. Functional characterization lags behind, however, with only a subset of proteins annotated manually or through automated pipelines (Vu and Jung 2021). In the UniProtKB example, only around 570 thousand sequences are manually annotated (Swiss-Prot), while the rest (almost 245 million sequences) sit in the unreviewed TrEMBL section. This means that we have curated knowledge of function for only about 0.2% of all protein sequences! The remainder includes proteins with unknown functions and proteins from families that have not yet been discovered.
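The 0.2% figure follows directly from the two counts quoted above. Here is a quick check in Python, using the rounded numbers from the text rather than exact database statistics (the variable names are ours):

```python
# Rounded counts quoted in the text for UniProtKB 2024_03.
total_records = 245_500_000    # all UniProtKB sequence records
reviewed_records = 570_000     # manually annotated (Swiss-Prot) records

# Everything that is not manually annotated sits in TrEMBL.
unreviewed_records = total_records - reviewed_records

# Share of records with manual annotation, as a percentage.
reviewed_share = reviewed_records / total_records * 100

print(f"Unreviewed records: {unreviewed_records:,}")
print(f"Manually annotated share: {reviewed_share:.2f}%")
```

With these rounded inputs the share comes out a little above 0.2%, which is where the rounded figure in the text comes from.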
Tools for protein function prediction and annotation are essential for several reasons:
- They help to interpret big data generated by high-throughput sequencing technologies by providing insights into the biological roles and functions of proteins.
- Accurate annotation of proteins allows researchers to prioritize targets for further experimental validation, saving time and resources.
Moreover, these tools aid in comparative genomics, evolutionary studies, and understanding the mechanisms underlying diseases, boosting molecular biology, genetics, drug discovery, and biotechnology (Idhaya, Suruliandi, and Raja 2024; de Crécy-Lagard et al. 2022). Anyone involved in biological research, from academic scientists to pharmaceutical companies, can benefit from using these tools.
Get a name for your protein in seconds
ProtNLM (Protein Natural Language Model) is integrated into UniProt’s Automatic Annotation pipeline to automatically classify and annotate unreviewed records in UniProtKB by predicting protein names from amino acid sequences. You can find ProtNLM’s predictions under “Names & Taxonomy” → “Protein names” → “Recommended name” on the UniProt website (see A0A2Z4IEP2, UniProt’s own example). Overall, ProtNLM has already annotated around 49 million previously unannotated protein sequences in the UniProt database.
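If you prefer to look up a name programmatically rather than through the website, the sketch below shows one way to do it with the public UniProt REST endpoint. It is an illustration, not an official client: the helper names are ours, and the JSON field names (`proteinDescription`, `recommendedName`, `submissionNames`, `fullName`) reflect our understanding of the entry format and should be checked against the current UniProt API documentation. The response does not necessarily say whether a name was produced by ProtNLM.

```python
import json
from urllib.request import urlopen

# Entry endpoint of the UniProt REST API (JSON representation of one record).
UNIPROT_ENTRY_URL = "https://rest.uniprot.org/uniprotkb/{accession}.json"


def entry_url(accession: str) -> str:
    """Build the URL for a single UniProtKB entry."""
    return UNIPROT_ENTRY_URL.format(accession=accession)


def fetch_entry(accession: str, timeout: float = 30.0) -> dict:
    """Download one UniProtKB entry as a dictionary (needs network access)."""
    with urlopen(entry_url(accession), timeout=timeout) as response:
        return json.load(response)


def protein_name(entry: dict):
    """Return the recommended name if present, else the first submitted name.

    The field names used here are an assumption about the entry layout.
    Returns None when neither kind of name is found.
    """
    description = entry.get("proteinDescription", {})
    recommended = description.get("recommendedName")
    if recommended:
        return recommended["fullName"]["value"]
    for submitted in description.get("submissionNames", []):
        return submitted["fullName"]["value"]
    return None
```

Calling `protein_name(fetch_entry("A0A2Z4IEP2"))` would query the example entry mentioned above. The result depends on the live database, so no output is shown here.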
ProtNLM was developed by Google Research in 2023. It uses a transformer sequence-to-sequence model to generate textual descriptions for proteins, much as captioning models assign titles to images or documents. Others have typically treated protein annotation as a classification problem (there is a fixed set of possible outputs) rather than a captioning problem (new annotations are possible), or have started from structure (e.g. DeepFRI) rather than from sequence alone. ProtNLM is still a work in progress, as stated at the very beginning of the preprint. The main challenges include the ambiguity of assigning multiple names to a single protein and the difficulty of verifying proposed descriptions without external evidence. The model is trained on UniProt data, filtered for quality, and validated through automated and manual evaluation. Recent improvements include leveraging additional information such as organism and secondary structure, and employing ensemble models for enhanced performance. Because errors remain possible, UniProt encourages user feedback to support continuous improvement and accuracy assessment.
We tried ProtNLM in practice with three naturally occurring sequences (insulin, hemoglobin, and collagen) and three completely random ones (the protein corresponding to each sequence is written in gold). The snapshots show the ProtNLM output through the 310.ai copilot. Each result consists of a predicted name (an annotation for the protein sequence) and a score (the confidence in the prediction). The higher the score, the more reliable the annotation is expected to be. The scores for random sequences are much lower than the scores for existing proteins. In addition, the predicted names are very close to each other for the natural sequences but vary quite a lot for the random ones. Together, these findings suggest that higher scores, combined with a larger number of similar predictions in the top 10, reflect more confident results. By the way, the cute robot icon comes from flaticon.com.
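The two signals described above, the score and the agreement among the top predictions, can be combined in a few lines of Python. This is a minimal sketch under our own assumptions: the function names are hypothetical, string similarity is only a crude proxy for "the names are close to each other", and the example predictions are invented placeholders rather than real ProtNLM output.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def normalize(name: str) -> str:
    """Lower-case a name and collapse whitespace so trivial differences are ignored."""
    return " ".join(name.lower().split())


def name_agreement(names: list[str]) -> float:
    """Mean pairwise string similarity (0 to 1) across the predicted names."""
    cleaned = [normalize(name) for name in names]
    if len(cleaned) < 2:
        return 1.0  # a single prediction trivially agrees with itself
    return mean(
        SequenceMatcher(None, a, b).ratio() for a, b in combinations(cleaned, 2)
    )


def summarize(predictions: list[tuple[str, float]]) -> dict:
    """Report the best score and the agreement among predicted names."""
    if not predictions:
        raise ValueError("at least one (name, score) prediction is required")
    names = [name for name, _ in predictions]
    scores = [score for _, score in predictions]
    return {"best_score": max(scores), "agreement": name_agreement(names)}


# Invented placeholder predictions (NOT real ProtNLM output), as (name, score) pairs.
natural_like = [("Insulin", 0.9), ("insulin", 0.8), ("Insulin precursor", 0.5)]
random_like = [("Uncharacterized protein", 0.05), ("Kinase", 0.03), ("Transposase", 0.02)]

print(summarize(natural_like))
print(summarize(random_like))
```

A set of predictions with both a high best score and high agreement would be the more trustworthy case. Where to draw the line is a judgment call that this sketch does not attempt to make.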
You can try the copilot to annotate your own sequences. It’s easy, fast, and convenient thanks to the web-based, chat-style interface. You can also find a tutorial on how to use it on YouTube.
References
- Abramson, Josh, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, et al. 2024. “Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3.” Nature, May.
- Crécy-Lagard, Valérie de, Rocio Amorin de Hegedus, Cecilia Arighi, Jill Babor, Alex Bateman, Ian Blaby, Crysten Blaby-Haas, et al. 2022. “A Roadmap for the Functional Annotation of Protein Families: A Community Perspective.” Database: The Journal of Biological Databases and Curation 2022 (August).
- Durairaj, Janani, Andrew M. Waterhouse, Toomas Mets, Tetiana Brodiazhenko, Minhal Abdullah, Gabriel Studer, Gerardo Tauriello, et al. 2023. “Uncovering New Families and Folds in the Natural Protein Universe.” Nature 622 (7983): 646–53.
- Idhaya, T., A. Suruliandi, and S. P. Raja. 2024. “A Comprehensive Review on Machine Learning Techniques for Protein Family Prediction.” The Protein Journal 43 (2): 171–86.
- Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature 596 (7873): 583–89.
- Varadi, Mihaly, Damian Bertoni, Paulyna Magana, Urmila Paramval, Ivanna Pidruchna, Malarvizhi Radhakrishnan, Maxim Tsenkov, et al. 2024. “AlphaFold Protein Structure Database in 2024: Providing Structure Coverage for over 214 Million Protein Sequences.” Nucleic Acids Research 52 (D1): D368–75.
- Vu, Thi Thuy Duong, and Jaehee Jung. 2021. “Protein Function Prediction with Gene Ontology: From Traditional to Deep Learning Models.” PeerJ 9 (August): e12019.