To bind, or not to bind, that was the question

Artificial intelligence in biological research

Biomolecules have complex structures that define their functions. However, many proteins lack experimental structures because determining them is challenging. To overcome this obstacle, scientists turn to molecular modeling tools. Many of these tools now incorporate artificial intelligence (AI) to achieve results faster and more efficiently. In biology, researchers mostly use machine learning (ML), a subset of AI that learns from large amounts of data. An example is AlphaFold2 (Jumper et al., 2021), a program that predicts protein structure from a provided sequence. Another popular application of ML in biology is ligand-binding site prediction. This is vital for drug discovery, since ligands are small molecules that can modulate biomolecules to achieve a particular response. Such ligands include aspirin, ibuprofen, and paracetamol, which are used by millions every day. Hence, predicting ligand-binding sites matters for human well-being (more details can be found in a blog post about drug design).

Leveraging AlphaFold to find ligand-binding pockets

And here a new deep learning (DL) tool based on AlphaFold2 comes to our aid: AlphaFold2 Bait-Informed Neural Descriptor (AF2BIND) (Gazizov et al., 2023), which is capable of “accurately predicting small-molecule-binding residues given only a target protein.” Unlike other methods, AF2BIND doesn’t require homology models, the exact identity of the ligand, or other such “hints” commonly given, which makes it a much more general and flexible method for discovering ligand pockets.

The model takes the target protein sequence and backbone, plus 20 bait amino acids (“surrogates for a small-molecule ligand”), as input and returns, for each residue of the protein, the probability of being a small-molecule-binding residue, called P(bind). The main idea behind this work is that, given AlphaFold2’s ability to predict protein structures and its training on proteins in complex with ligands, it should be able to predict small-molecule binding sites. The authors added 20 “bait” amino acids to mimic the binding signal in the absence of a ligand. By introducing this “bait” to AlphaFold2 along with a protein target template, the researchers developed AF2BIND, “a logistic regression classifier trained from AlphaFold2 pair features to predict each residue’s probability of contacting a small-molecule ligand, given a target protein structure.” This makes AlphaFold2’s “internal representation of protein structure and sequence” sufficient for binding-site predictions.
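
To make the “logistic regression on pair features” idea more concrete, here is a minimal sketch in Python. It is not the authors’ implementation: the feature shape (residues × baits × channels), the weights, and the function names are assumptions made for illustration, and the numbers are random. It only shows how a linear model plus a logistic function turns per-residue features into one P(bind) value per residue.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_p_bind(pair_features, weights, bias):
    """Return one binding probability per target residue.

    pair_features: array (n_residues, n_baits, n_channels); hypothetical
                   target-bait pair features (shape is an assumption).
    weights:       array (n_baits * n_channels,); hypothetical trained weights.
    bias:          float intercept of the logistic regression.
    """
    n_residues = pair_features.shape[0]
    flat = pair_features.reshape(n_residues, -1)  # one feature row per residue
    return sigmoid(flat @ weights + bias)

# Toy data only: random numbers standing in for real features and weights.
rng = np.random.default_rng(0)
n_residues, n_baits, n_channels = 50, 20, 8
pair_features = rng.normal(size=(n_residues, n_baits, n_channels))
weights = rng.normal(scale=0.1, size=n_baits * n_channels)
p_bind = predict_p_bind(pair_features, weights, bias=-1.0)
print(p_bind.shape)  # one probability per residue
```

In the real tool the features come from AlphaFold2 and the weights from training on known protein-ligand complexes; only the last step, the classifier, looks like the sketch above.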

AF2BIND and other tools

AF2BIND is not the first method that can predict ligand-binding sites: “ESM-IF (Hsu et al., 2022) or proteinMPNN (Dauparas et al., 2022) might detect solvent-exposed hydrophobic clusters of residues or completely buried collections of polar residues for binding-site prediction.” However, Gazizov’s model “outperforms other neural-network representations for binding-site prediction.” It can accurately predict ligand-exposed residues “without multiple sequence alignments (MSA), homology models, or knowledge of the true ligand.” The AlphaFold2 representation offers a more comprehensive description for predicting ligand-binding sites than the representations of previous models such as ESM2 (Lin et al., 2023) and ESM1-IF (Carbery et al., 2023). AF2BIND also provides “hints” about the polarity (a chemical characteristic related to hydrophobicity) that a potential ligand should have to bind effectively to the residues selected by the model. For example, if an amino acid in the protein binding site is negatively charged, the ligand atoms at this position should be positively charged to maintain attractive interactions between the ligand and the binding site. This is possible because AF2BIND is a logistic regression classifier, a model type that is easy to interpret: its weights reflect the contribution of each bait amino acid to the activation of P(bind). These contributions translate directly into polarity compatibility between protein and ligand, forming a “bait-residue activation map”.
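
The interpretability argument can be illustrated with a small Python sketch. Because a logistic regression is linear before the final logistic function, its logit can be split into one term per bait amino acid. The post does not describe how the authors compute their activation map, so the decomposition below is an assumption for illustration; the feature shape, weights, and function names are hypothetical, and the data are random.

```python
import numpy as np

BAITS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids, one-letter codes

def bait_contributions(pair_features, weights):
    """Split each residue's logit into one additive term per bait.

    pair_features: array (n_residues, n_baits, n_channels), hypothetical.
    weights:       array (n_baits * n_channels,), hypothetical.
    Returns an array (n_residues, n_baits); summing a row (and adding the
    bias) gives that residue's logit.
    """
    n_residues, n_baits, n_channels = pair_features.shape
    w = weights.reshape(n_baits, n_channels)
    return np.einsum("rbc,bc->rb", pair_features, w)

def top_baits(contributions, residue_index, k=5):
    """Return the k baits with the largest contribution for one residue."""
    order = np.argsort(contributions[residue_index])[::-1][:k]
    return [BAITS[i] for i in order]

# Toy data only: random numbers standing in for real features and weights.
rng = np.random.default_rng(1)
pair_features = rng.normal(size=(30, 20, 4))
weights = rng.normal(size=20 * 4)
contrib = bait_contributions(pair_features, weights)
print(top_baits(contrib, residue_index=0))
```

Read row by row, such a matrix says which bait amino acids push a given residue towards “binding”, which is the kind of information a polarity hint can be derived from.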

Why AF2BIND?

In addition to the advantages mentioned in the previous section (no need for MSA, homology modeling, or the true ligand; accurate prediction of ligand-binding residues; extra “hints” about ligand chemical properties), AF2BIND also has its own Google Colab notebook, which makes it easy to run a prediction with nothing more than a Google account. You don’t need AI-related skills to use the tool: you just insert the PDB code and chain name to start the prediction of ligand-binding residues. Even better, it automatically provides a structural visualization in which the top-ranked residues are colored from blue to red in descending order of P(bind) (binding probability). So, if you’re interested in the part of the protein that is likely to have a binding pocket, look at the bluest positions. It also gives you a table where all target residues are ranked by their P(bind), taking into account some pocket volume information. Moreover, the prediction is fast, taking only a few minutes per protein. You don’t need a powerful laptop or any extra resources, since all the calculations run on Google’s servers. That is more than enough for ligand-binding site predictions for a few proteins. If you are looking for a more high-throughput option, you might need to consider a Colab Pro account or install AF2BIND on your own server. That will require some computer science skills, but AF2BIND has a great GitHub repository that provides a source notebook with information about the required libraries.
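
If you export the per-residue probabilities and want to rebuild a ranked table yourself, a few lines of Python are enough. This is a simplified sketch, not the notebook’s code: it sorts by P(bind) only and ignores the pocket volume information the notebook also uses. The function name, residue labels, and values are made up for illustration.

```python
def rank_residues(p_bind, top_k=15):
    """Return the top_k (residue, probability) pairs, highest P(bind) first.

    p_bind: dict mapping a residue label to its predicted P(bind).
    """
    ranked = sorted(p_bind.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Toy values only; they are not AF2BIND output.
toy_p_bind = {"res_1": 0.12, "res_2": 0.91, "res_3": 0.67, "res_4": 0.40}

for rank, (residue, p) in enumerate(rank_residues(toy_p_bind, top_k=3), start=1):
    print(f"{rank}\t{residue}\t{p:.2f}")
```

The residues at the top of such a table are the ones shown as the bluest positions in the notebook’s visualization.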

AF2BIND in action

Let’s take a closer look at the method. I used lysozyme from the rainbow trout (PDB ID: 1LMP) to test this model. Lysozyme is an enzyme that plays a crucial role in the immune system, specifically in the defense against bacterial infections. Ligands can inhibit lysozyme by interfering with its active site, preventing it from effectively breaking down bacterial cell walls. Understanding such ligand binding can be useful, for example, in fighting resistant bacteria and improving the enzyme’s antimicrobial properties. I ran the AF2BIND Google Colab by entering “1lmp” in target_pdb and “A” in target_chain. To the left, you can find the initial image that the notebook displayed, and to the right, I added labels to the residues that the model ranked highest (just a reminder that the bluer the residue, the more suitable it is for ligand binding).

The polarity map for the selected residues might be a bit confusing at first. In the lysozyme case, it shows M, S, P, Y, and N, amino acids from quite diverse biochemical classes (polar uncharged, hydrophobic, and special side chains), as preferred by the activation function in a ligand that interacts with Q at position 57 in chain A of the target protein. How can we explain, for example, the “high” preference for hydrophobic M in the ligand at the polar uncharged Q position in the protein? We can assume that Q interacts via its amide group with other parts of the protein through hydrogen bonds and exposes the aliphatic part of its side chain to the ligand, forming mostly hydrophobic interactions there. We can see such a case in the 1LMP crystal structure discussed in the next paragraph. Overall, it’s better to look at the general tendency across the majority of the positions (“horizontal blue trend”) and additionally check the plausibility of the prediction against biochemical knowledge. Following this approach, we see that the ligand tends to have big hydrophobic groups (blue at the W level). The top 15 predicted ligand-binding residues consist of 8 polar and 7 hydrophobic residues, so the predicted ligand properties correspond only partially to the predicted residues at the protein binding site.

Now let’s take a look at the existing experimentally determined lysozyme structure with a ligand. The real structure (1LMP) is shown in gray, the model is in orange, and all residues that the model placed in the top 15 by P(bind) (table at the right) are represented as sticks. I colored in lavender the AF2BIND residues that match the experimental residues binding the ligand through any type of polar contact. We can see that 10 out of 15 residues, or around 66.7% of the most probable binding residues, correspond to real residues that bind the ligand. That sounds like a good result! Moreover, since this is only one ligand and one experimental structure, residues that didn’t match in this example may still be potential targets for a ligand. In any case, AF2BIND is a useful tool, but we still need scientists to evaluate its predictions.
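
The comparison above boils down to simple arithmetic: count how many of the top-ranked residues also contact the ligand in the experimental structure and divide by the number of predictions. A minimal Python sketch, with a hypothetical function name and placeholder residue numbers rather than the actual 1LMP residues, could look like this:

```python
def top_k_hit_rate(predicted, experimental):
    """Return (number of hits, fraction of predicted residues that are hits).

    predicted:    iterable of residue identifiers ranked by the model.
    experimental: set of residue identifiers that contact the ligand
                  in the experimental structure.
    """
    predicted = list(predicted)
    hits = [residue for residue in predicted if residue in experimental]
    return len(hits), len(hits) / len(predicted)

# Placeholder residue numbers, chosen so that 10 of 15 predictions match.
predicted_top15 = list(range(1, 16))
experimental_contacts = set(range(1, 11)) | {40, 41}

n_hits, rate = top_k_hit_rate(predicted_top15, experimental_contacts)
print(f"{n_hits} of {len(predicted_top15)} residues match ({rate:.1%})")
# prints: 10 of 15 residues match (66.7%)
```

Note that this measures agreement with a single ligand-bound structure, so a miss here does not prove that a predicted residue cannot bind some other ligand.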

To summarize, AF2BIND is a user-friendly tool that can provide quick, practical insight into the position of a promising binding pocket on the target protein. It can also suggest ligand properties worth checking, making the model great for early-stage research. Take note of this tool, as it can become a valuable addition to your first overview of protein-ligand binding in your work!

References

Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021). https://doi.org/10.1038/s41586-021-03819-2 

Gazizov, A., Lian, A., Goverde, C. et al. AF2BIND: Predicting ligand-binding sites using the pair representation of AlphaFold2. bioRxiv (2023). https://doi.org/10.1101/2023.10.15.562410 

Hsu, C., Verkuil, R., Liu, J. et al. Learning inverse folding from millions of predicted structures. Proceedings of the 39th International Conference on Machine Learning (2022). https://proceedings.mlr.press/v162/hsu22a.html 

Dauparas, J., Anishchenko, I., Bennett, N. et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science (2022). https://doi.org/10.1126/science.add2187 

Lin, Z., Akin, H., Rao, R. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science (2023). https://doi.org/10.1126/science.ade2574 

Carbery, A., Buttenschoen, M., Skyner, R. et al. Learnt representations of proteins can be used for accurate prediction of small molecule binding sites on experimentally determined and predicted protein structures. bioRxiv (2023). https://doi.org/10.1101/2023.09.07.556685