ProteinMPNN: Message Passing on Protein Structures

August 23, 2023

Inverse folding aims to recover protein sequences from structural information given as input to a machine learning model. How well a network recovers sequences depends on several factors. For ProteinMPNN, three notable contributions are the way structural information is encoded, the update scheme for message passing, and order-agnostic decoding.

Table 1. Single chain sequence design performance on CATH held out test split

Edge features for the message passing network are built from three inputs: the pairwise distances between amino acid backbone atoms (N, Cα, C, O), an offset tensor describing the relative position of each pair of residues, and a chain adjacency tensor that separates same-chain from cross-chain interactions. Given the 3D atomic coordinates of a protein, radial basis function embeddings are computed on the pairwise distances, transforming continuous values into fixed-size features that the model can process. A learned positional embedding combines and then transforms the offset and chain adjacency tensors; without it, information about the sequence order of residues and the relationship between residues from different chains would not be captured. The resulting positional embedding is then combined with the distance features to fully describe the structural information as the edge features of ProteinMPNN.
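
As a rough illustration of the encoding described above, the following NumPy sketch expands pairwise distances into radial basis function features and builds the offset and same-chain tensors for a toy two-chain input. The function name `rbf_embed`, the made-up coordinates, and the choice of 16 Gaussian centers spaced between 2 and 22 Å are assumptions made for this example rather than values stated in this post, and the learned positional embedding is left out.

```python
import numpy as np

def rbf_embed(dist, d_min=2.0, d_max=22.0, num_rbf=16):
    """Expand distances of any shape into num_rbf Gaussian features.

    d_min, d_max and num_rbf are illustrative assumptions.
    """
    centers = np.linspace(d_min, d_max, num_rbf)   # (num_rbf,)
    sigma = (d_max - d_min) / num_rbf              # shared width of each Gaussian
    z = (dist[..., None] - centers) / sigma        # (..., num_rbf)
    return np.exp(-z ** 2)

# Toy coordinates for one backbone atom per residue (made-up values, in Å).
coords = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [7.6, 0.0, 0.0],
    [11.4, 0.0, 0.0],
])
residue_index = np.array([0, 1, 0, 1])   # position within each chain
chain_id = np.array([0, 0, 1, 1])        # two chains of two residues each

# Pairwise distances, shape (4, 4), then RBF features, shape (4, 4, 16).
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
features = rbf_embed(dist)

# Relative sequence offset and same-chain indicator for every residue pair.
offset = residue_index[:, None] - residue_index[None, :]
same_chain = (chain_id[:, None] == chain_id[None, :]).astype(int)

print(features.shape)   # (4, 4, 16)
```

In a full model the offset and same-chain tensors would pass through a learned embedding before being combined with the distance features; here they are only constructed.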

Message passing updates in the backbone encoder network are performed by applying a multilayer perceptron to both node and edge features. In addition to updating node features based on their neighboring edges, the edges themselves are updated within each encoder layer. Layer normalization, dropout, and residual connections are applied to the combined features in every layer before the result is passed to the decoder network.
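
The update described above can be sketched in a few lines of NumPy. This is a minimal sketch under simplifying assumptions, not the published architecture: every residue is treated as a neighbor of every other residue, weights are random, layer normalization has no learned parameters, and dropout is omitted as it would be at inference time. All names (`encoder_layer`, `pair_inputs`, `make_mlp`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def make_mlp(d_in, d_hidden, d_out):
    """Return a two-layer perceptron with random (untrained) weights."""
    w1 = rng.normal(0.0, 0.1, (d_in, d_hidden))
    w2 = rng.normal(0.0, 0.1, (d_hidden, d_out))
    def mlp(x):
        return np.maximum(x @ w1, 0.0) @ w2   # ReLU between the two layers
    return mlp

def pair_inputs(h_v, h_e):
    """Concatenate [node i, edge ij, node j] for every residue pair."""
    n, d = h_v.shape
    h_i = np.broadcast_to(h_v[:, None, :], (n, n, d))
    h_j = np.broadcast_to(h_v[None, :, :], (n, n, d))
    return np.concatenate([h_i, h_e, h_j], axis=-1)

def encoder_layer(h_v, h_e, node_mlp, edge_mlp):
    # Node update: average the messages from all neighbors, then
    # apply a residual connection followed by layer normalization.
    messages = node_mlp(pair_inputs(h_v, h_e))            # (n, n, d)
    h_v = layer_norm(h_v + messages.mean(axis=1))
    # Edge update: uses the freshly updated node features.
    h_e = layer_norm(h_e + edge_mlp(pair_inputs(h_v, h_e)))
    return h_v, h_e

n, hidden = 5, 8
h_v = rng.normal(size=(n, hidden))        # node features
h_e = rng.normal(size=(n, n, hidden))     # edge features
node_mlp = make_mlp(3 * hidden, hidden, hidden)
edge_mlp = make_mlp(3 * hidden, hidden, hidden)

new_v, new_e = encoder_layer(h_v, h_e, node_mlp, edge_mlp)
print(new_v.shape, new_e.shape)   # (5, 8) (5, 5, 8)
```

The point of the sketch is the ordering: nodes are updated from their edges first, and the edges are then refreshed from the updated nodes within the same layer.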

Beyond fixed left-to-right decoding, ProteinMPNN uses a randomized decoding order. To create it, Gaussian noise is added to a binary mask representing the chain connectivity of protein residues, where 1 corresponds to a connected residue pair and 0 corresponds to disconnected or missing regions. The noisy mask is then sorted to produce a tensor describing the order in which sequence positions are decoded. This departs from the canonical left-to-right autoregressive scheme and allows arbitrary decoding orders during inference.

Example of adding noise and sorting to construct decoding order:

Before noise: [1, 1, 0, 0]
After noise: [0.97, 1.08, 0, 0]
Randomized decoding order (indices sorted by descending value): [1, 0, 2, 3]
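
The example above can be reproduced with a short NumPy sketch. Two assumptions are made to match the numbers shown: noise perturbs only the entries where the mask is 1 (so zeros stay zero), and indices are sorted by descending value with ties broken by position. The function name `randomized_decoding_order` and the noise scale of 0.1 are hypothetical choices for illustration.

```python
import numpy as np

def randomized_decoding_order(mask, rng, scale=0.1):
    """Return position indices ordered by a noise-perturbed mask.

    Assumptions: noise is applied only where mask == 1, and
    positions are visited from the largest noisy value down.
    """
    noisy = mask + mask * rng.normal(scale=scale, size=mask.shape)
    # Stable sort on the negated values gives a descending order
    # in which tied entries keep their original left-to-right order.
    return np.argsort(-noisy, kind="stable")

# Reproduce the worked example using its "after noise" values directly.
after_noise = np.array([0.97, 1.08, 0.0, 0.0])
example_order = np.argsort(-after_noise, kind="stable")
print(example_order)   # [1 0 2 3]

# Draw a fresh random order for the same mask.
mask = np.array([1.0, 1.0, 0.0, 0.0])
order = randomized_decoding_order(mask, np.random.default_rng(0))
print(order)
```

With these choices the positions marked 1 are visited first in random order, followed by the positions marked 0. An ascending sort would instead place the zero-valued positions first, so the sort direction should be checked against whichever implementation is being used.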

Dauparas, Justas, et al. "Robust deep learning–based protein sequence design using ProteinMPNN." Science 378.6615 (2022): 49-56.