Testing 96 Novel AI Proteins
November 15, 2024
In September, we built MP4, a general-purpose molecule programming foundation model. Like protein language models, it is trained on the large datasets of sequences available. Unlike protein language models, it is trained on broad and diverse data and can take a text description (a 'program', if you will) as input. It can be used in a number of ways to support design with and without templates, and can create anything from enzymes to repeat proteins and things in between.
A generalist AI needs a generalist laboratory
Our initial tests involved generating thousands of unique protein sequences, diverging from the 96 million sequences in the 3.8 billion-year evolutionary database. We then selected 96 challenging sequences to test in the lab, chosen specifically to lack robust natural references. Despite expert predictions that only 20–30% would even express successfully in a cell-free system, our AI surpassed expectations, achieving an impressive 84% success rate. Details are released in our discovery repo, here, and are updated as data comes in. This is a first, but crucial, step toward our ultimate goal of developing an intelligent molecule programming system capable of drugging the undruggable.
One challenge in protein design is that experimental testing is often tailored to specific cases. Scientific protocols and measurements are complex to develop; while we must work within these constraints, it’s important not to limit our designs based on temporary experimental limitations.
To characterize the designed proteins, we added C-terminal tags (GFP11 for split-GFP solubility and Twin-Strep for purification), and the genes were sourced from Twist Bioscience. Of the 96 sequences, two contained repeated motifs that complicated assembly; these were not pursued further, as they would have required additional optimization and assembly.
Protein expression was carried out in a prokaryotic cell-free system at Adaptyv Bio. Using a split-GFP solubility assay (ref), 79 of 94 sequences demonstrated good solubility and expression, counting both high and medium expressors as defined by Adaptyv Bio protocols. Proteins do not express equally in all expression systems, and a lack of expression in one system is not an indication of a lack of expression in others. Purification was performed in a single step using MagStrep Strep-Tactin XT magnetic beads (ref).
Finally, we quantified protein thermostability using nano differential scanning fluorimetry (nanoDSF). Although thermostability is generally a universal property of proteins, nanoDSF measurements require protein-specific optimization, as some proteins may not exhibit intrinsic fluorescence changes upon thermal unfolding. Despite these limitations, we are already able to verify several highly stable de novo proteins across different folds.
For AI, novel proteins are not harder than natural
While we had expected expression to correlate strongly with sequence homology to natural sequences (the most natural sequence was indeed among the very highest expressors), overall this was not clearly the case, with a Pearson correlation of only 0.21. However, because the set is heavily biased toward low matches to natural sequences (~50% identity by blastp against nr/nt, or lower), this is not conclusive. And while the most novel structures in this set were certainly not the highest expressors, the Pearson correlation between structural match to anything in the PDB and expression level was 0.015. That is to say, using MP4, novel proteins are not harder than natural ones.
Historically, all-beta proteins have been more difficult to design de novo than all-helical proteins. Here, we find that our most beta-rich proteins are too repetitive to clone (though individual domains may work well and will be tested in the future). We also find that the most helical proteins do not express well, perhaps because they consist of very extended helices.
We expected expression to depend strongly on average pLDDT (a measure of foldability from structure prediction methods like AlphaFold), but instead found a low-pLDDT structure with very high expression, as well as two very high-pLDDT structures with relatively low expression. More data is needed, especially at lower pLDDT.
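As a rough illustration of how average pLDDT can be computed for each design, the sketch below parses a predicted-structure PDB file, assuming the AlphaFold convention of storing per-residue pLDDT in the B-factor column; the function name and parsing choices are our own, not part of any pipeline described here:

```python
# Minimal sketch: mean pLDDT from a predicted-structure PDB file,
# assuming pLDDT is stored in the B-factor column (cols 61-66)
# as AlphaFold-style predictions do. One score per residue, read
# from the C-alpha atom.
def mean_plddt(pdb_path: str) -> float:
    scores = []
    seen = set()
    with open(pdb_path) as fh:
        for line in fh:
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                chain_res = (line[21], line[22:26])  # chain ID, residue number
                if chain_res not in seen:
                    seen.add(chain_res)
                    scores.append(float(line[60:66]))  # B-factor field
    return sum(scores) / len(scores)
```

Averaging this score per design gives the x-axis of the pLDDT-vs-expression comparison discussed above.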
This set is small and limited, but as far as we can tell, there is no strong length dependence, only a minor dependence on amino acid composition below a particular threshold, and a minor dependence on the number of cysteines (though this was capped at 3).
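The composition and cysteine checks mentioned above amount to simple per-sequence bookkeeping; a hypothetical sketch (helper names and the example sequence are ours, not from the study):

```python
# Illustrative helpers for the sequence-level features discussed:
# amino acid composition and cysteine count (designs here capped
# cysteines at 3).
from collections import Counter

def composition(seq: str) -> dict[str, float]:
    """Fraction of each amino acid in the sequence."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}

def n_cysteines(seq: str) -> int:
    """Number of cysteine residues."""
    return seq.count("C")

seq = "MKTAYIAKQRCISFVKSHC"  # made-up example sequence
print(n_cysteines(seq))
print(max(composition(seq), key=composition(seq).get))
```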
First we make it, then we make it useful
Currently, drug development relies heavily on human specialists, each trained for years to grasp the complexities of a single protein or disease pathway. Our vision redefines this paradigm. Imagine a single AI model capable of encoding the full spectrum of protein sequences, their domains, functions, properties, interactions, and much more. While no individual scientist can achieve this, our AI can — and will.
While much has been achieved by mining nature, it’s time to stop mimicking and start innovating. We don’t know what the AI drugs will look like, but we know they will not look like the drugs of today. First, we’ve shown our AI can make real artificial proteins. Next, we’ll make them useful.
Looking forward to continued collaboration with Adaptyv Bio for lab results and AMD for GPUs!