310 Protein Design ML Stack

Computational lab

With our computational lab, users can explore, create, and visualize on top of our curated datasets and models. The lab provides a coding environment (Jupyter Notebook) with direct access to the curated protein datasets, the feature store, and pre-trained models, along with the compute capacity to use them (e.g., T4 GPUs). In addition, our lib310 Python library packages the data science and visualization tools a protein design workflow typically requires.
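A rough sketch of what a lab session might look like is shown below: load a slice of a curated dataset into a DataFrame and visualize one entry from a notebook. The module and function names here are hypothetical placeholders, not lib310's documented API.

    # Hypothetical lab-notebook snippet; the lib310 calls below are illustrative
    # placeholders, not the library's documented interface.
    import lib310

    # Pull a small slice of a curated protein dataset into a DataFrame.
    df = lib310.db.fetch("uniref50", limit=1_000)      # hypothetical query helper
    print(df.head())

    # Render the structure of the first entry directly in the notebook.
    lib310.viz.show_structure(df.iloc[0]["sequence"])  # hypothetical visualizer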

No-code computational lab

Mol.E/Repo results

We have designed and trained a generative model, Mol.E, that outputs protein sequences conditioned on user-specified features via our proprietary protein language. Over many model iterations, we have improved our protein generation performance. In the structures below, red regions mark residues that deviate by more than 1 Å from their natural counterparts; as we move from simpler models to more complex ones with a greater number of parameters, these red regions vanish.
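For readers unfamiliar with feature-conditioned generation, the toy PyTorch sketch below shows one generic way a decoder can emit an amino-acid sequence while attending to a user-supplied feature vector. It is purely illustrative and is not Mol.E's architecture or our protein language; the comparison panels follow it.

    # Toy sketch, not Mol.E: a tiny feature-conditioned autoregressive decoder.
    import torch
    import torch.nn as nn

    AA_VOCAB = "ACDEFGHIKLMNPQRSTVWY"            # 20 canonical amino acids
    BOS, EOS = len(AA_VOCAB), len(AA_VOCAB) + 1  # start / stop tokens
    VOCAB_SIZE = len(AA_VOCAB) + 2

    class ConditionalDecoder(nn.Module):
        """User-specified features become a memory vector that the decoder
        attends to while emitting residues one at a time."""
        def __init__(self, n_features, d_model=128, n_layers=2, n_heads=4):
            super().__init__()
            self.cond = nn.Linear(n_features, d_model)   # feature conditioning
            self.embed = nn.Embedding(VOCAB_SIZE, d_model)
            layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, n_layers)
            self.head = nn.Linear(d_model, VOCAB_SIZE)

        def forward(self, tokens, features):
            memory = self.cond(features).unsqueeze(1)    # (B, 1, d_model)
            x = self.embed(tokens)                       # (B, T, d_model)
            T = tokens.size(1)
            causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
            h = self.decoder(x, memory, tgt_mask=causal)
            return self.head(h)                          # next-residue logits

    @torch.no_grad()
    def generate(model, features, max_len=200):
        """Greedy decoding of one sequence conditioned on `features`."""
        tokens = torch.tensor([[BOS]])
        for _ in range(max_len):
            nxt = model(tokens, features)[:, -1].argmax(-1, keepdim=True)
            if nxt.item() == EOS:
                break
            tokens = torch.cat([tokens, nxt], dim=1)
        return "".join(AA_VOCAB[t] for t in tokens[0, 1:].tolist() if t < len(AA_VOCAB))

    # Example: an untrained model only emits noise, but the interface is the point.
    model = ConditionalDecoder(n_features=8)
    print(generate(model, torch.randn(1, 8)))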

Each panel compares the natural structure with its Mol.E-generated counterpart (Natural on the left, Mol.E on the right).

    Gene name   UniProtKB ID   TM-score   Seq. similarity (%)
    map         D2PU61         0.649       7.1
    lgt         A0A6N3E983     0.685      12.3
    tpiA        A0AOF6MNC4     0.733      81.32
    gatD        A0A124EVQ6     0.812      10.5
    tsaB        A0A0J8DGN3     0.956       8.9
    FNW17       A0A553C6Gg     0.981      15.9
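As context for the 1 Å criterion used above, here is a minimal sketch of how such a comparison could be run with Biopython: superimpose the generated structure on its natural counterpart and flag residues whose Cα atoms deviate by more than 1 Å. The file names are placeholders, and the index-based residue pairing is a simplification of a proper structural alignment (TM-scores like those above are typically computed with tools such as TM-align).

    # Minimal sketch with Biopython: flag residues whose C-alpha atoms deviate
    # by more than 1 Å after superposition. File names are placeholders, and
    # pairing residues by index stands in for a real structural alignment.
    from Bio.PDB import PDBParser, Superimposer

    parser = PDBParser(QUIET=True)
    natural = parser.get_structure("natural", "natural.pdb")          # placeholder path
    generated = parser.get_structure("mole", "mole_generated.pdb")    # placeholder path

    # C-alpha atoms from the first chain of each structure.
    nat_ca = [res["CA"] for res in next(natural[0].get_chains()) if "CA" in res]
    gen_ca = [res["CA"] for res in next(generated[0].get_chains()) if "CA" in res]
    n = min(len(nat_ca), len(gen_ca))

    # Least-squares superposition of the generated structure onto the natural one.
    sup = Superimposer()
    sup.set_atoms(nat_ca[:n], gen_ca[:n])
    sup.apply(list(generated.get_atoms()))
    print(f"overall C-alpha RMSD: {sup.rms:.2f} Å")

    # Residues that would be rendered in red (deviation > 1 Å after superposition).
    red = [i for i, (a, b) in enumerate(zip(nat_ca[:n], gen_ca[:n])) if (a - b) > 1.0]
    print(f"{len(red)} of {n} residues deviate by more than 1 Å")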

Large Language Model

Data

As of now, we host more than 40 tables from 12 datasets, amounting to over 36 billion rows and 17 terabytes of data. These include notable databases such as UniParc, UniRef, links, IntAct, and Gene Ontology annotations, from organizations like UniProt, IntAct, and the Gene Ontology Consortium. These datasets now share a unified structure for easier access.