Protein Function Prediction
Function prediction takes the amino acid sequence of a protein and determines what job(s) it performs. Proteins act like tiny machines: important functions include speeding up (catalyzing) vital chemical reactions, moving (transporting) materials in and out of cells, and signaling when the body is under attack. Accurately predicting a protein’s function gives scientists valuable insights: they can identify what a newly discovered protein might do, search for proteins that perform specific tasks, or check whether a computer-designed protein will work as expected. While experiments are necessary to confirm a function, comparing a protein sequence to known ones is a common first step.
Predict the function of ‘DIHICGICKQQFNNLDAFVAHKQSGSQ’
Name | Score |
---|---|
C2H2-type domain-containing protein | 0.54 |
General transcription factor II-I repeat ... | 0.02 |
Zinc finger protein | 0.019 |
ProtNLM
ProtNLM is a natural language model built on T5 that predicts the name of a protein from its amino acid sequence. ProtNLM returns a number of suggested descriptive names, ranked by score. Scores range from 0.0 to 1.0, where larger is better, and scores below about 0.2 can be considered low confidence. Because a protein can have many different functions at the same time, multiple descriptions can be correct for the same input sequence.
Inputs
Protein sequence: A single chain made up of the 20 common amino acids.
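Before submitting a sequence, it can help to check that it meets this input requirement. The following is a minimal sketch, not part of ProtNLM itself: the function names `invalid_residues` and `is_valid_input` are hypothetical, and the choice to upper-case the input and strip whitespace before checking is an assumption about how a caller might want to normalize sequences.

```python
# One-letter codes for the 20 common amino acids.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(sequence: str) -> list[tuple[int, str]]:
    """Return (1-based position, character) for each non-standard character.

    Assumption: the sequence is normalized by stripping surrounding
    whitespace and upper-casing before the check.
    """
    cleaned = sequence.strip().upper()
    return [
        (position, residue)
        for position, residue in enumerate(cleaned, start=1)
        if residue not in STANDARD_AMINO_ACIDS
    ]


def is_valid_input(sequence: str) -> bool:
    """True if the sequence is non-empty and uses only the 20 common amino acids."""
    return bool(sequence.strip()) and not invalid_residues(sequence)


if __name__ == "__main__":
    example = "DIHICGICKQQFNNLDAFVAHKQSGSQ"
    print(is_valid_input(example))
    # Ambiguity codes such as X or B are reported with their positions.
    print(invalid_residues("MKXB"))
```

Reporting the positions of offending characters, rather than only a yes/no answer, makes it easier to spot ambiguity codes or stray characters left over from a FASTA header.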
Outputs
Name(s): Short text description of a protein.
Score(s): Model confidence per name (softmax). Scores across all names sum to 1.0.
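One way to work with these outputs is to rank the returned names by score and separate out those below the approximate 0.2 low-confidence cutoff. The sketch below assumes the results are available as (name, score) pairs; that representation, the helper `split_by_confidence`, and the constant name are illustrative and not part of the tool. The values are taken from the zinc finger example at the top of this page.

```python
# Approximate low-confidence cutoff described for ProtNLM scores.
LOW_CONFIDENCE_THRESHOLD = 0.2

Prediction = tuple[str, float]


def split_by_confidence(
    predictions: list[Prediction],
    threshold: float = LOW_CONFIDENCE_THRESHOLD,
) -> tuple[list[Prediction], list[Prediction]]:
    """Rank predictions by score (highest first) and split them at the threshold."""
    ranked = sorted(predictions, key=lambda item: item[1], reverse=True)
    confident = [item for item in ranked if item[1] >= threshold]
    low_confidence = [item for item in ranked if item[1] < threshold]
    return confident, low_confidence


if __name__ == "__main__":
    # Rows from the first results table (the truncated name is kept as shown).
    predictions = [
        ("Zinc finger protein", 0.019),
        ("C2H2-type domain-containing protein", 0.54),
        ("General transcription factor II-I repeat ...", 0.02),
    ]
    confident, low_confidence = split_by_confidence(predictions)
    for name, score in confident:
        print(f"keep   {score:.3f}  {name}")
    for name, score in low_confidence:
        print(f"review {score:.3f}  {name}")
```

Because the scores come from a softmax over all candidate names, a table that shows only the top few names will generally sum to less than 1.0; the remainder belongs to names that are not displayed.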
Example Scripts
Predict the function of M4332
Name | Score |
---|---|
Tyrosine-protein kinase | 0.592 |
2.7.10.2 | 0.249 |
Non-specific protein-tyrosine kinase | 0.019 |
Load Q9UPY3 and predict its function
Name | Score |
---|---|
Endoribonuclease Dicer | 0.466 |
3.1.26.3 | 0.409 |
Endoribonuclease Dcr-1 | 0.033 |
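Some of the returned names in the examples above, such as 2.7.10.2 and 3.1.26.3, have the four-field format of Enzyme Commission (EC) numbers rather than descriptive text. When post-processing results, it may be useful to separate the two kinds of name. The sketch below is one possible approach, not something the tool provides: the helper names are hypothetical, and the pattern only matches fully specified four-field numbers (partial entries such as `3.1.-.-` are deliberately not handled).

```python
import re

# Matches fully specified EC-style identifiers such as "3.1.26.3".
# Assumption: partial or preliminary EC numbers are out of scope here.
EC_NUMBER_PATTERN = re.compile(r"^\d+\.\d+\.\d+\.\d+$")


def looks_like_ec_number(name: str) -> bool:
    """True if the name has the four-field numeric format of an EC number."""
    return bool(EC_NUMBER_PATTERN.match(name.strip()))


def group_names(
    predictions: list[tuple[str, float]],
) -> dict[str, list[tuple[str, float]]]:
    """Split (name, score) pairs into EC-style identifiers and descriptive names."""
    groups: dict[str, list[tuple[str, float]]] = {"ec_numbers": [], "descriptions": []}
    for name, score in predictions:
        key = "ec_numbers" if looks_like_ec_number(name) else "descriptions"
        groups[key].append((name, score))
    return groups


if __name__ == "__main__":
    # Rows from the Q9UPY3 example table.
    dicer_results = [
        ("Endoribonuclease Dicer", 0.466),
        ("3.1.26.3", 0.409),
        ("Endoribonuclease Dcr-1", 0.033),
    ]
    for kind, items in group_names(dicer_results).items():
        print(kind, items)
```

Grouping the names this way makes it easier to see when an EC-style identifier and a descriptive name both score well for the same input sequence.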
Analyzing function prediction results
Independent and orthogonal models for function prediction would be highly valuable for cross-validating predicted functions.
- Comparison to Experiment: Testing of each predicted function using the appropriate experimental assay would be the gold-standard metric. However, this data can be difficult to obtain.
- Visual Inspection: Some functions are structurally encoded, so an expert can examine the protein structure for the presence of characteristic motifs. For example, serine proteases often have three key amino acids at a very specific distance and orientation from each other, known as a catalytic triad.
Integration with Other Tools
Docking
Function can mean many different things. One type of function is binding to or interacting with another molecule, for example the small molecule ATP. Docking predictions can therefore be used to cross-validate some functional predictions.
Folding
Function can sometimes be structurally encoded. When an experimentally determined structure is available, it should be used for analysis. However, since such data is often unavailable, it can be helpful to use a structure prediction model to obtain a predicted structure for analysis.