Skip to main content

ProtNLM for Predicting Protein Functions

Protein function prediction involves taking the amino acid sequence of a protein and determining what job(s) it performs. ProtNLM is an ML natural language model built on T5 for predicting the function(s) of a protein based on its amino acid sequence. Since proteins can have many different functions at the same time, multiple descriptions can be considered correct for the same input sequence. Accurately predicting a protein’s function gives scientists valuable insights: they can identify what a newly discovered protein might do, search for proteins to perform specific tasks, or check if a computer-designed protein will work as expected. While experiments are necessary to confirm the function, comparing a protein sequence to known ones is a common first step.

Predict the structure of ‘DIHICGICKQQFNNLDAFVAHKQSGSQ’

NameScore
C2H2-type domain-containing protein0.54
General transcription factor II-I repeat ...0.02
Zinc finger protein0.019

Inputs

Protein sequence: For single chain, common 20 amino acids

Outputs

Name(s): Short text description of a protein, rank ordered by their score (0.0, 1.0).
Score(s): Model confidence per name (softmax). Larger is better, and scores below around 0.2 can be considered low confidence. Scores across all names sums to 1.0.

Examples

Predict the function of M4332

NameScore
Tyrosine-protein kinase0.592
2.7.10.20.249
Non-specific protein-tyrosine kinase0.019

Load Q9UPY3 and predict its function

NameScore
Endoribonuclease Dicer0.466
3.1.26.30.409
Endoribonuclease Dcr-10.033

Analyzing function prediction results

Independent and orthogonal models for function prediction would be highly valuable for cross-validating predicted functions.

  • Comparison to Experiment: Testing of each predicted function using the appropriate experimental assay would be the gold-standard metric. However, this data can be difficult to obtain.
  • Visual Inspection: Some functions are structurally encoded, so using the protein structure, an expert can look for the presence of motifs. For example, kinases often have 3 key amino acids in a very specific distance and orientation from each other, known as a catalytic triad.