Functional Protein Sequence Design using Large Language Models

        Architectural developments in machine learning have opened promising directions in de novo protein design, with progress in the field now mirroring the pace of the broader machine learning domain. A central line of work adapts large language models originally designed for natural language to unsupervised training on millions of raw protein sequences, from which the models learn local and global structural motifs. In this setting, a protein language model is iteratively optimized to predict the probability of the next amino acid given the preceding amino acids in a raw sequence.
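        To make that training objective concrete, the following is a minimal sketch of an autoregressive protein language model trained with next-token prediction. The architecture, dimensions, and tokenization here are illustrative assumptions, not the configuration of any published model.

```python
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                     # 20 canonical residues
VOCAB = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 reserved for padding

class ProteinLM(nn.Module):
    """Toy causal transformer over amino-acid tokens (assumed sizes)."""
    def __init__(self, vocab_size=len(VOCAB) + 1, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask: position i may only attend to positions <= i.
        seq_len = tokens.size(1)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.encoder(self.embed(tokens), mask=mask)
        return self.head(hidden)  # logits over the next residue at each position

def next_token_loss(model, tokens):
    # Predict residue t+1 from residues 1..t: the standard causal LM objective.
    logits = model(tokens[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
    )
```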

Figure: The relationship between the computational cost of model training, quantified in floating-point operations (FLOPs), and model performance across four distinct task categories, each encompassing 3-4 subtasks

        The scaling behavior of a protein language model with 100 billion parameters trained on a trillion tokens accords with that of its natural-language counterparts: an exponential increase in pre-training compute corresponds to a roughly linear growth in performance. Unconditional generation of artificial protein sequences, in the absence of auxiliary expert knowledge, is particularly effective at exploiting the exponentially growing pool of diverse, unannotated protein data. However, the absence of any guiding conditions also makes generation uncontrollable, limiting practicality because the model cannot respond to explicit instructions or intents. To date, conditional protein language models have demonstrated control through auxiliary information (e.g. protein family, biological process, and molecular function properties) provided as input to the language model, as sketched below.
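        As an illustration of how such conditioning can be wired in, the sketch below prepends control-tag tokens (e.g. a protein family or molecular function keyword) to the residue tokens so that the model learns p(sequence | tags). The tag names and token ids are hypothetical, not those of any published model.

```python
# Minimal sketch of control-tag conditioning in the spirit of conditional
# protein language models; tag names and ids below are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
RESIDUE_VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
CONTROL_TAGS = {"<family:kinase>": 20, "<function:hydrolase>": 21}  # hypothetical

def encode_conditional(tags, sequence):
    """Prepend control-tag tokens to residue tokens: the model is trained on
    tag-prefixed sequences and therefore learns p(sequence | tags)."""
    return [CONTROL_TAGS[t] for t in tags] + [RESIDUE_VOCAB[aa] for aa in sequence]

# At inference time the tags alone form the prompt, and residues are then
# sampled autoregressively until an end-of-sequence token is produced.
prompt = encode_conditional(["<family:kinase>"], "")
```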

Figure: Training data distribution for xTrimoPGLM

        Generative modeling methodologies beyond autoregression (e.g. denoising diffusion probabilistic models) have also been examined for de novo protein design, in no small part to address the current limitations of transformer-based protein language models as they are approached today. The diversity of artificial protein sequences these models can generate is an expansion of the sequence space sampled by evolution, a property that stems from both the training data they are sourced from and their architectural design. As a result, at inference time the models will not generate proteins belonging to a completely different domain or distribution. This limitation manifests, for example, as an inability to create a novel protein fold that catalyzes an unnatural reaction.

Figure: Structure examples of generated protein sequences with different parameter configurations. The first row depicts sequences generated with parameters (T=1.0, P=1.0, n-gram penalty=3), while the second row removes the n-gram constraint to reduce long disordered loop regions.

        Prohibitive computational costs during training and inference are alleviated by expressing model weights and activations in low-precision data types rather than 32-bit floating point, which reduces memory use and speeds up floating-point operations. Repetition of protein sub-sequences is alleviated through nucleus sampling, sketched below. At each step of sequence generation, rather than selecting only the top k most likely tokens and redistributing probability mass among them, the smallest set of top candidates whose cumulative probability exceeds a threshold is selected, and the distribution is rescaled over those candidates. This partially alleviates the issue of repeated subsequences of amino acid residues (e.g. ATTATTATT) and provides a more diverse space of artificial protein sequence generation.
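        A minimal sketch of nucleus (top-p) sampling over a single next-residue distribution follows; the threshold and temperature values are illustrative, and the function assumes a 1-D tensor of logits such as one row of output from the toy model above.

```python
import torch

def nucleus_sample(logits, top_p=0.9, temperature=1.0):
    """Sample one token id from the smallest set of most likely candidates
    whose cumulative probability exceeds top_p, after rescaling."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Number of candidates needed to push the cumulative mass past top_p.
    cutoff = int((cumulative < top_p).sum().item()) + 1
    kept = sorted_probs[:cutoff]
    kept = kept / kept.sum()                       # rescale over the nucleus
    choice = torch.multinomial(kept, num_samples=1)
    return sorted_ids[choice].item()
```

        Compared with top-k sampling, the size of the candidate set adapts to the shape of the distribution at each step, which is why this strategy reduces, though does not eliminate, low-entropy repetition loops.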

Chen, Bo, et al. "xTrimoPGLM: Unified 100B-scale pre-trained transformer for deciphering the language of protein." bioRxiv (2023).

 

Madani, Ali, et al. "Large language models generate functional protein sequences across diverse families." Nature Biotechnology (2023): 1-8.

 

Holtzman, Ari, et al. "The curious case of neural text degeneration." arXiv preprint arXiv:1904.09751 (2019).