ProtNLM for Predicting Protein Functions
Protein function prediction involves taking the amino acid sequence of a protein and determining what job(s) it performs. ProtNLM is an ML natural language model built on T5 for predicting the function(s) of a protein based on its amino acid sequence. Since proteins can have many different functions at the same time, multiple descriptions can be considered correct for the same input sequence. Accurately predicting a protein’s function gives scientists valuable insights: they can identify what a newly discovered protein might do, search for proteins to perform specific tasks, or check if a computer-designed protein will work as expected. While experiments are necessary to confirm the function, comparing a protein sequence to known ones is a common first step.
Predict the structure of ‘DIHICGICKQQFNNLDAFVAHKQSGSQ’
Name | Score |
---|---|
C2H2-type domain-containing protein | 0.54 |
General transcription factor II-I repeat ... | 0.02 |
Zinc finger protein | 0.019 |
Inputs
Protein sequence: For single chain, common 20 amino acids
Outputs
Name(s): Short text description of a protein, rank ordered by their score (0.0, 1.0).
Score(s): Model confidence per name (softmax). Larger is better, and scores below around 0.2 can be considered low confidence. Scores across all names sums to 1.0.
Examples
Predict the function of M4332
Name | Score |
---|---|
Tyrosine-protein kinase | 0.592 |
2.7.10.2 | 0.249 |
Non-specific protein-tyrosine kinase | 0.019 |
Load Q9UPY3 and predict its function
Name | Score |
---|---|
Endoribonuclease Dicer | 0.466 |
3.1.26.3 | 0.409 |
Endoribonuclease Dcr-1 | 0.033 |
Analyzing function prediction results
Independent and orthogonal models for function prediction would be highly valuable for cross-validating predicted functions.
- Comparison to Experiment: Testing of each predicted function using the appropriate experimental assay would be the gold-standard metric. However, this data can be difficult to obtain.
- Visual Inspection: Some functions are structurally encoded, so using the protein structure, an expert can look for the presence of motifs. For example, kinases often have 3 key amino acids in a very specific distance and orientation from each other, known as a catalytic triad.