From noise to molecules

How modern scientists use AI to create new proteins

AI revolution in computational biology

    Many modern scientists are equipped not with a microscope and a pipette but with a computer and algorithms. They work in the realm of bioinformatics, where computer science meets biology.

    Recently, the world's attention has been captured by achievements and discoveries in a specific area of computer algorithms – artificial intelligence (AI). AI encompasses a wide range of technologies and techniques that enable machines to perform tasks that typically require human intelligence, such as reasoning, problem-solving, natural language processing, and learning. With the advent of AlphaFold2 (Jumper et al., 2021), a tool that solved a long-standing scientific challenge by predicting the three-dimensional structure of proteins from their one-dimensional sequences alone, applications of AI to biological problems have grown exponentially. In particular, many scientists now rely on a subset of AI called machine learning (ML), which learns from large amounts of data to make and improve predictions based on patterns in that data, without being explicitly programmed.

    One of the most exciting applications of ML in science is generative modeling for structural biology, especially with diffusion neural networks (NNs). Briefly, NNs are the building blocks of deep learning (DL), which, in turn, is the subset of ML built on architectures with multiple hidden layers (Choi et al., 2020). These architectures can tackle problems such as finding the right conformation of a potential drug molecule on the surface of its target (protein–ligand docking), predicting the site of interaction between two molecules that influences their biological functions (ligand binding site prediction), and even creating new molecules for various purposes (protein design).
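    To make the diffusion idea concrete, here is a minimal toy sketch in PyTorch (an illustration of the general technique, not Protpardelle's code): corrupt the data with noise of a random magnitude, then train a network that sees the noise level to undo the corruption.

```python
import torch
import torch.nn as nn

# Toy denoiser: an MLP that receives noisy 2D points plus the noise
# level and tries to recover the original clean points.
class Denoiser(nn.Module):
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, noisy_x, sigma):
        # Condition on the noise level by appending it to the input.
        return self.net(torch.cat([noisy_x, sigma], dim=-1))

model = Denoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.randn(128, 2) * 0.5 + 2.0        # stand-in for "real" data
    sigma = torch.rand(128, 1) * 3.0           # random noise level per sample
    noisy_x = x + sigma * torch.randn_like(x)  # forward (noising) process
    loss = ((model(noisy_x, sigma) - x) ** 2).mean()  # denoising objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```

    At generation time the process runs in reverse: start from pure noise and repeatedly apply the trained denoiser at gradually decreasing noise levels until a clean sample emerges – quite literally "from noise to molecules."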

Protein design with Protpardelle: what, why, and how?

    Today, we'll take a look at Protpardelle (Chu et al., 2023), a generative NN model for all-atom protein design. In general, molecular design allows scientists to create custom molecules (in this case, proteins) tailored for specific applications in medicine, biotechnology, and drug design. Proteins execute their functions through chemical interactions between the side chains of their amino acids. These side chains serve as the main functional components, determining both the inherent characteristics of the protein and the kinds of interactions it can engage in. For instance, side chains are crucial in enzyme design, since enzymes catalyze reactions through a specific 3D geometric arrangement of side chains in the binding site. To gain maximal control over the arrangement of all structural elements (including side chains), scientists specify the structure and sequence of a protein entirely from scratch – a process called de novo protein design. Designing such interactions computationally is challenging because it involves many interdependent variables, from both the sequence and structural perspectives, that influence one another. Consequently, many currently available generative models simplify the task by designing only the backbone, which reduces the number of variables (in this context, interacting atoms). Protpardelle, however, performs all-atom de novo design: it creates new proteins, side chains included, based on fundamental principles of chemistry, physics, and biology that the model learned from the existing protein structures in its training dataset.

    It’s a very promising tool that lets you obtain samples (sets of protein designs) either unconditionally (proteins of specified lengths with arbitrary shapes and folds) or conditionally (the rest of the structure is designed around a provided 3D fragment of a protein). Excitingly, conditional design does not require an experimentally determined structure from cryo-EM or X-ray crystallography: you can build a model with currently available structure prediction tools, provide the modeled coordinates to Protpardelle, and it will still generate a sample! The generated samples are of high quality, structurally diverse, and novel relative to the training dataset. This is extremely useful and convenient, especially considering that far fewer structures are available than sequences, and conducting your own experiments can be both time-consuming and expensive.
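    As a sketch of that workflow, the snippet below uses ESMFold through the fair-esm package (the same predictor the authors use for evaluation) to turn an arbitrary sequence into a PDB file that could then be passed to Protpardelle for conditional design. The sequence and output path are placeholders, and loading the ESMFold weights requires considerable memory and ideally a GPU.

```python
import torch
import esm  # pip install "fair-esm[esmfold]"

# Load ESMFold and predict a structure for a (placeholder) sequence.
model = esm.pretrained.esmfold_v1().eval()
if torch.cuda.is_available():
    model = model.cuda()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)  # returns a PDB-format string

# This file, plus a choice of residue indices to keep fixed, is all
# that conditional design needs as structural input.
with open("predicted_model.pdb", "w") as f:
    f.write(pdb_string)
```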

    The authors have made the code open source and placed it in a GitHub repository, accessible through a user-friendly interface. Additionally, compared to other tools such as Sculptor (Eguchi et al., 2022) and RFdiffusion (Watson et al., 2022), Protpardelle requires fewer computational resources to make similarly accurate predictions. This accessibility makes Protpardelle suitable for a broad audience of scientists who may not feel comfortable programming on their own, or who lack extensive computational resources, but still wish to use the model for protein design. Moreover, it is remarkably user-friendly: with just one line of code, you can generate visually appealing protein samples.
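    For instance, an unconditional all-atom run can be launched with a single command. The script and flag names below are assumptions based on the repository's README at the time of writing (wrapped in Python's subprocess for illustration), so consult the repository for the exact current interface.

```python
import subprocess

# Illustrative one-command launch of unconditional all-atom sampling,
# run from the repository root. Flag names are assumptions; see the README.
subprocess.run(
    ["python", "draw_samples.py", "--type", "allatom",
     "--minlen", "50", "--maxlen", "60", "--steplen", "5", "--perlen", "4"],
    check=True,
)
```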

    It’s clear from the paper which aspects the authors have already investigated and what will be part of their future work. Unfortunately, there is no experimental verification of the generated proteins yet. However, Chu et al. conducted a comprehensive computational evaluation of the network’s quality, comparing the generated samples with the structures that ESMFold (Lin et al., 2023) predicts for their designed sequences, which lends credibility to the results.
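    The core of such a self-consistency check is structural comparison: superimpose each generated structure onto the structure ESMFold predicts for its designed sequence and measure how far apart they are. Below is a minimal version of that comparison using the standard Kabsch superposition; it is a simplification (the paper reports metrics such as self-consistency RMSD and TM-score) and assumes both coordinate sets are already parsed into equal-length NumPy arrays.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = P - P.mean(axis=0)  # center both sets at the origin
    Q = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix (Kabsch).
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return float(np.sqrt(((P @ R.T - Q) ** 2).sum(axis=1).mean()))

# e.g. kabsch_rmsd(sample_ca_coords, esmfold_ca_coords) over Cα atoms:
# a low value means the designed sequence refolds into the designed shape.
```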

Craft your own proteins: theory and practice

    Turning to the neural network model itself, let’s look at both its practical (application) and technical (development) components.

    In practical terms, Protpardelle offers two modes: conditional and unconditional design. Both can be executed in a Hugging Face Space, via the command line with Python, or from the PyMOL molecular visualizer. For conditional design, the network takes one or more residue indices to condition on and a protein structure in PDB format from which to select the specified residues. Unconditional design doesn’t require a structure; it only asks for the minimal and maximal lengths of the proteins to generate, the step size, and the number of samples per protein length. Both approaches perform “sampling”: generating a set of alternative designs of a given length. The program also offers an option to design only the backbone, excluding all-atom detail. There are minor differences in how to set up a run and pass the selected indices to the model depending on where the code is executed; these are described with examples in the GitHub repository for this work.
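    To make the inputs of the two modes concrete, here is a small illustrative sketch. The field names are descriptive stand-ins that mirror the parameters described above, not Protpardelle's actual CLI flags or API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class UnconditionalRun:
    min_length: int = 50          # shortest protein to generate
    max_length: int = 100         # longest protein to generate
    step: int = 10                # step size between sampled lengths
    samples_per_length: int = 8   # designs per length

@dataclass
class ConditionalRun:
    input_pdb: str = "predicted_model.pdb"  # structure to condition on
    fixed_residues: List[int] = field(default_factory=lambda: [10, 11, 12, 45])
    samples: int = 8

# Lengths 50, 60, ..., 100 with 8 samples each -> 48 designs in total.
cfg = UnconditionalRun()
lengths = list(range(cfg.min_length, cfg.max_length + 1, cfg.step))
print(f"{len(lengths) * cfg.samples_per_length} designs requested")
```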

    From a technical perspective, Protpardelle is not a fully SE(3)-equivariant diffusion model. It employs a U-ViT architecture with a hidden dimension of 256 and a modified training scheme. The model is composed of 6 residual noise-conditional transformer layers, taking as inputs the 37 × 3 noisy atom coordinates for each residue, the 37 × 3 self-conditioning atom coordinates, the noise level, and the sequence mask. It was trained on non-redundant protein domains from the CATH S40 dataset. The development process consisted of three major steps. First, the authors created a baseline protein backbone model (involving only the N, Cα, C, and O atoms) using the U-ViT denoiser network. They adjusted the number of layers by removing the convolutional up- and down-sampling ones, and trained the model with self-conditioning until it achieved similar quality and preference for both main types of secondary structure elements (α-helices and β-sheets). Second, Chu et al. incorporated the structure-conditioned sequence prediction models miniMPNN (during sampling) and ProteinMPNN (at the final step) to co-design the sequence during structure diffusion (Dauparas et al., 2022). The main difference between the two is that miniMPNN lacks the autoregressive mask and instead has noise conditioning at its multilayer perceptron (MLP) blocks. Third, the researchers enabled side-chain diffusion, making the network capable of all-atom protein generation: because the training scheme is identical for every atom, they simply increased the number of atoms per residue in the already established forward diffusion process. Finally, they tuned the sequence resampling rate and added a second sampling stage, conditioned on both the backbone and the sequence, to remove geometrically faulty side chains and improve the overall quality of the generated structures.
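    For intuition about that input interface, here is a shape-level sketch in PyTorch. It is a loose simplification built only from the description above (a plain transformer encoder stands in for the residual noise-conditional U-ViT blocks), not the authors' implementation.

```python
import torch
import torch.nn as nn

L, A, H = 128, 37, 256  # residues, atoms per residue (atom37), hidden dim

# Per-residue inputs described in the paper: noisy coordinates,
# self-conditioning coordinates, a noise level, and a sequence mask.
noisy_xyz = torch.randn(L, A, 3)
self_cond_xyz = torch.zeros(L, A, 3)  # zeros before self-conditioning kicks in
noise_level = torch.tensor([[1.5]])   # one global noise level
seq_mask = torch.ones(L)              # 1 = real residue, 0 = padding

# Pack each residue into one token by flattening both 37 x 3 coordinate sets.
tokens = torch.cat([noisy_xyz.flatten(1), self_cond_xyz.flatten(1)], dim=-1)
embed = nn.Linear(tokens.shape[-1], H)
noise_embed = nn.Linear(1, H)

x = embed(tokens) + noise_embed(noise_level)  # (L, H) noise-conditioned tokens
x = x * seq_mask.unsqueeze(-1)                # zero out padded positions

# Six transformer layers stand in for the residual noise-conditional blocks;
# a linear head maps each token back to denoised atom coordinates.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=H, nhead=8, batch_first=True),
    num_layers=6,
)
denoised = nn.Linear(H, A * 3)(encoder(x.unsqueeze(0))).view(L, A, 3)
print(denoised.shape)  # torch.Size([128, 37, 3])
```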

    Overall, Protpardelle is a powerful tool to experiment with and apply in your scientific work. Many thanks to the authors for developing it!