Conf42 Machine Learning 2024 - Online

Deep Learning for Protein Structure Prediction

Abstract

Unlock the potential of deep learning in decoding protein structures. Reshaping drug discovery with practical insights.

Summary

  • Yaroslav: Today I want to talk about proteins, why we need them, and how they can help solve different problems. We will discuss different kinds of structure prediction methods: physics-based, statistical, and finally deep learning.
  • Protein structure prediction methods can be classified by the amount of information they use and by their accuracy. On average, the more information a method uses, the more accurate its predictions are.
  • The next evolutionary step in protein structure prediction is the AlphaFold 2 model, which uses the multiple sequence alignment directly. It produces the whole structure end to end with machine learning, without using any physics-based iterative methods. Technology from image and text processing has trickled down to biology.
  • Physics-based methods require a lot of compute. The next logical step is to replace statistics with deep learning. With time I hope we will see more methods appear in text processing and image processing that can be applied to biology and structure prediction.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, and welcome to my session on deep learning for protein structure prediction. My name is Yaroslav. Let's get started with who I am. I spent four years doing machine learning research and machine-learning-related software development. I have a master's degree in computational biology, and I also worked for a year on antibody structure prediction and machine learning for drug discovery at a biopharmaceutical company.
Today I want to talk about proteins: why we need them and how they can help solve different problems. I want to talk about evolutionary information, which we can use to get more data about proteins and to predict their behavior and structure. I want to talk about structure prediction methods and a little bit of the history behind them. And we will discuss different kinds of structure prediction methods: physics-based, statistical, and finally deep learning.
So first, let's talk about what proteins are. Proteins are composed of amino acids, which are their building blocks. Amino acids are often shown as letters or as those green boxes, which are the same things as the letters on the image below. Proteins are essentially chains of amino acids, and they have different levels of structure. The primary structure is the amino acid chain, the sequence of amino acids that defines the protein. The secondary structure is the shape that local parts of the protein take: they can form spirals (helices), or they can line up against each other in different ways (sheets). The third way we can describe the structure of a protein is its tertiary structure, the whole 3D structure of one protein chain. And finally, we can describe the structure of a complex of proteins, multiple chains interacting together, which is the quaternary structure. Why do we need to describe this structure, and why do we need to predict it?
There are several reasons. We may want to estimate a protein's function based on its amino acid sequence, because we don't know the structure. The structure is very hard to obtain through experimental methods, and it is much easier to obtain the amino acid sequence. But to run experiments on computers with the actual 3D structure, we first need to obtain it. We can then use this information about the structure to try to understand the protein's function, how we can modify the protein, and what use cases there are for that particular protein.
Here are a couple of examples of why we need proteins. The first one is plastic degradation. We can have a genetically modified bacterium that produces a protein that acts as an enzyme and speeds up the breakdown of plastic, which can help us get rid of waste. The other thing we can use proteins for, antibodies in particular, is vaccines and drugs. For example, on the right there is a coronavirus displayed, and it has proteins sticking out of its shell. Those are the spike proteins that the virus uses to enter the cell, and they are the proteins that our immune system reacts to. The immune system produces antibodies, which are also proteins, that can bind to the spike proteins of the coronavirus and alert your immune system to destroy the virus.
How can we get more information about a specific protein without running additional experiments on that particular protein? We can have a look at similar proteins. The idea is that if we have a similar protein, it has a similar structure or maybe a similar function. If the protein sequence changes at some position, there may be a compensating change at another position, which can be far away in the sequence but is actually close in 3D space. Say one position changes its charge: then the other position has to change its charge as well to preserve the structure.
So we can look in a database, find similar proteins, and align them together so that the same structural positions sit on top of each other. That alignment can tell us how variable a particular position is, or which positions it interacts with.
Okay, so structure prediction methods can be different. First of all, we have protein folding in the upper left corner. This is the natural protein folding. It is really, really accurate, and it doesn't need a lot of sequence information or information from a multiple sequence alignment. It only has to do its natural job. But protein structure prediction methods can be classified by the amount of information they use and by their accuracy. On the bottom left, we have physics-based methods. They are not really accurate, and they need a lot of compute to actually produce a result. The next group is methods using a PSSM (position-specific scoring matrix). A PSSM is derived from a multiple sequence alignment, and it is a kind of per-position statistic over the alignment. Second-order methods use coevolution information: they encode information about pairwise interactions between positions in the multiple sequence alignment and use different kinds of methods to produce the result. And finally, full multiple sequence alignment methods use the whole alignment and use deep learning to process all the data available in it.
For some classes of proteins, the information from a multiple sequence alignment is not really useful. For example, for highly variable proteins such as antibodies, a multiple sequence alignment may not add much information about the protein. The other thing is that end-to-end deep learning methods are usually faster than physics-based methods, and we will talk about why in a moment.
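To make the first-order and second-order statistics described above a bit more concrete, here is a minimal sketch in Python. It is not how any particular tool computes a PSSM: a real PSSM usually adds pseudocounts and log-odds scoring against background frequencies, while this sketch only computes raw per-column frequencies. The toy alignment and the function names (`column_frequencies`, `mutual_information`) are made up for illustration, and mutual information is just one simple way to measure coevolution between two columns.

```python
import math
from collections import Counter

# Toy alignment, invented for illustration: rows are aligned sequences,
# columns are structural positions. Columns 1 and 2 always change together
# (K with D, R with E), while columns 0 and 1 vary independently.
msa = [
    "AKDLE",
    "AKDLE",
    "ARELD",
    "ARELD",
    "GKDIE",
    "GRELD",
]

def column(msa, i):
    """Return the residues found at alignment column i."""
    return [seq[i] for seq in msa]

def column_frequencies(msa, i):
    """First-order statistic: residue frequencies at one column."""
    counts = Counter(column(msa, i))
    return {aa: c / len(msa) for aa, c in counts.items()}

def mutual_information(msa, i, j):
    """Second-order statistic: mutual information (in bits) between two columns."""
    n = len(msa)
    fi = Counter(column(msa, i))
    fj = Counter(column(msa, j))
    fij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), c in fij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((fi[a] / n) * (fj[b] / n)))
    return mi

profile = [column_frequencies(msa, i) for i in range(len(msa[0]))]
mi_coupled = mutual_information(msa, 1, 2)      # columns that change together
mi_independent = mutual_information(msa, 0, 1)  # columns that do not
```

In this toy data, the coupled pair of columns gets a high mutual information and the independent pair gets zero, which is the kind of signal second-order methods look for when guessing which positions are close in 3D space.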
So on average, the more information you have, the more accurate the prediction is. So why is it difficult to get a result with physics-based methods, and why do they have to use a lot of compute? That's because this is a problem with a lot of particles interacting. Even if you have just three particles interacting with each other and you know the forces acting on them, that system has no general closed-form solution. And small changes to the initial state can change your end state very drastically, because it is a chaotic system. The only methods we have for solving that problem are iterative methods, which require a lot of compute.
So molecular dynamics methods use step-by-step simulation on high-performance computing systems to see how a protein folds and how parts of the protein move under forces acting on it from inside and from outside. Those methods usually use really expensive hardware, such as supercomputers. But they also have benefits: for example, trajectory analysis can be performed on the whole simulation, so in some cases you can learn the dynamic behavior of a protein. Those methods work with forces. There are many different forces acting on particles in the protein, and some of them are described here on the right. Those forces are potential forces, which means they don't depend on the particles' velocities; they depend only on the particles' coordinates and properties.
With physics-based simulation, we struggle to obtain a good first structure, because reaching a low-energy state takes a lot of iterations. So maybe we can do something to get a good initial structure and then take it from there to speed up the whole process. For that, we can use homology modeling.
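The step-by-step nature of such a simulation can be illustrated with a very small sketch. This is not a molecular dynamics engine: it is a single particle on a harmonic spring, integrated with the velocity Verlet scheme, which is one common integrator for this kind of simulation. The force depends only on the coordinate, like the potential forces mentioned above. All names and parameter values here are assumptions chosen for illustration.

```python
def harmonic_force(x, k=1.0, x0=0.0):
    """Potential force of a spring: depends only on the coordinate, not on velocity."""
    return -k * (x - x0)

def potential_energy(x, k=1.0, x0=0.0):
    """Potential energy of the same spring."""
    return 0.5 * k * (x - x0) ** 2

def total_energy(x, v, mass=1.0):
    """Kinetic plus potential energy of the particle."""
    return 0.5 * mass * v * v + potential_energy(x)

def velocity_verlet(x, v, force, mass=1.0, dt=0.01, steps=1000):
    """Advance the particle step by step and record the whole trajectory."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # recompute the force at the new position
        v = v_half + 0.5 * dt * f / mass   # finish the velocity update
        trajectory.append((x, v))
    return trajectory

traj = velocity_verlet(x=1.0, v=0.0, force=harmonic_force)
energies = [total_energy(x, v) for x, v in traj]
drift = max(energies) - min(energies)  # should stay small for a stable time step
```

Every step needs the forces to be recomputed, and a real protein has many thousands of interacting atoms and needs a very large number of small steps, which is where the compute cost comes from. The recorded trajectory is what makes trajectory analysis possible afterwards.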
Homology modeling is based on the same idea as multiple sequence alignment: similar sequences have similar structures. If you have a database of structures and their sequences, you can look for sequences similar to the one you want to fold, find fragments of those similar sequences, and combine them to create a first model. Then you can evaluate multiple such models, or you can fine-tune them using physics-based methods.
The other problem with physics-based methods is that we don't know how likely it is for the molecule to be in its current position (its current conformation). If we have a lot of statistics about which positions we observe in real proteins, we can use this information to forbid some states in a molecular dynamics process. If we know that a position is unlikely, we apply forces to bring the molecule out of it, because we assume that it is an optimization dead end. But for that we need to know the likelihood of different positions in the molecular structure. If we use statistics, we just collect a lot of data and estimate the likelihood of every position. But that only works within a specific protein family, because the statistics of one family can differ from the statistics of another.
And that's where deep learning comes in. What we can do is estimate that position likelihood using machine learning, and that is what a model called AlphaFold 1 tried to do. It tried to predict the likelihood of different distances for pairs of atoms. The matrix in the middle can be treated as a distribution over distances between the atoms. You can see that the diagonal is colored green, which means that those atoms are close together. But some of the other atoms are close together as well, even though they are not adjacent in the sequence. To produce this distribution, we can use sequence and MSA features, which we can encode like a picture in a 2D space.
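As a rough illustration of the pairwise distance map described above, the sketch below builds one from known coordinates and discretizes it into distance bins. This is only the kind of 2D target a model could be trained to predict, not the AlphaFold 1 implementation. The coordinates, the bin edges, the 8 angstrom contact threshold, and the function names are all assumptions made for this example.

```python
import math

# Invented coordinates (in angstroms), one point per residue, shaped like a
# small hairpin so that the two ends of the chain come close together.
coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0),
    (3.8, 3.8, 0.0),
    (0.0, 3.8, 0.0),
]

def distance_matrix(coords):
    """Pairwise Euclidean distances between all residues."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def to_bins(dmat, edges):
    """Replace each distance with the index of the first bin edge it falls below."""
    def bin_index(d):
        for k, edge in enumerate(edges):
            if d < edge:
                return k
        return len(edges)  # beyond the last edge
    return [[bin_index(d) for d in row] for row in dmat]

def contacts(dmat, threshold=8.0, min_separation=3):
    """Pairs that are close in space but at least min_separation apart in sequence."""
    n = len(dmat)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + min_separation, n)
        if dmat[i][j] < threshold
    ]

dmat = distance_matrix(coords)
binned = to_bins(dmat, edges=[4.0, 8.0, 12.0])
close_pairs = contacts(dmat)
```

The diagonal of `dmat` is zero and neighbors along the chain are always close, which matches the green diagonal on the slide. The interesting entries are the off-diagonal pairs in `close_pairs`: residues far apart in the sequence that end up close in 3D space.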
Each position (i, j) in that 2D encoding tells us how the atoms of the amino acids at two different sequence positions, i and j, interact. Finally, once we have produced this kind of likelihood map, we can use physics-based methods to fold the protein really quickly, because we know which positions it is likely to take, and that really speeds up the whole physics-based process.
The next evolutionary step in protein structure prediction is the AlphaFold 2 model, which uses the multiple sequence alignment directly. It produces the whole structure end to end with machine learning, without using any physics-based iterative methods, which is a lot faster. It can be divided into three steps. The first step is obtaining the input. Using an input sequence, you can find a lot of similar sequences to build an MSA, and you can also find their structures. As in homology modeling, you can find templates for your protein, pieces of other known structures that are likely similar to yours. After that, there is the deep learning magic in the middle, where the information we collected is encoded by the model. And the final step is structure prediction. For this, a new kind of structure prediction module was created, which predicts and updates angles and distances between amino acids to produce the final result. It works end to end with geometric features. The result can also be fine-tuned with physics-based methods, because sometimes it will not be locally accurate, since the model doesn't know physics. A few iterations of physics-based methods can relax the model, pushing some atoms apart or bringing them together so that the whole structure looks more natural.
The other method for encoding protein information using a lot of data is language models. Proteins consist of different amino acids, just like text consists of words.
We can use similar techniques from text and language processing to encode a lot of sequences into a large language model. Then we can use this large language model to encode our input sequence into some representation. From that representation, using the same idea as AlphaFold 2, we can predict geometric features for the structure, and we can predict the structure end to end. This lets us use a lot more data for pretraining the protein language model, and then we can use a smaller model to predict the geometric features. In the same way as before, we can use refinement steps to fine-tune the predicted structure using physics-based methods.
The final model, which was only released this month, is AlphaFold 3. It expands on the ideas of AlphaFold 2: using templates, using multiple sequence alignment, and also using other things that bind to proteins to get a better result in protein structure prediction. So this model does not work only on proteins. It was changed a little bit so that it can take in information about non-protein things that proteins bind to or interact with, for example in protein-DNA interactions. It can be split into three stages as well. The first is input building. The second is deep learning processing. And the third, structure prediction, was updated too, so it can predict not only proteins but other molecular structures as well, such as DNA. In this model, they used a diffusion module to produce protein structures and other molecular structures from noise, similar to generative AI for images and videos.
We can see that many of the technologies used in image and text processing, such as diffusion models, large language models, transformers, and convolutional models, have all trickled down into biology.
People found ways to use this technology for biological applications, which are rather far from image processing and also a little far from language processing. But anyway, people find new ways to use technologies, not only in the fields where they were created, but also in biology and many other applications.
So today you learned about physics-based methods, statistical methods, and deep learning methods for protein structure prediction. You learned that physics-based methods require a lot of compute, and that there is a lot of research on how to speed them up. There are heuristics for that, such as statistical potentials and other statistical tricks to speed up protein folding. The next logical step is to replace statistics with deep learning and to automate the recovery of statistical features from data. The problem is that getting more data into a machine learning model, whether a single model or multiple models, is challenging. As time goes on, more and more methods can unify information from multiple sources, encode it together, and get a better result for protein structure prediction.
You learned that end-to-end methods allow us to use deep learning for every step of structure prediction except obtaining the input. End-to-end methods are really important because they can save a lot of time: they parallelize really well, they can run on efficient hardware, and they don't require as much compute, or as demanding a kind of compute, as the iterations in physics-based methods. Still, physics-based methods are not dead. There are use cases where only physics-based methods give you good performance and accuracy, for example if you want to analyze trajectories, or if you want to refine structures that were produced by deep learning models without really knowing the physics.
So they are still useful for post-processing and other applications where accuracy is really important, but they use a lot of compute. New deep learning methods, such as transformers, diffusion models, and convolutional networks, trickle down into biology, and with time I hope we will see more methods appear in text processing and image processing that can be applied to biology and structure prediction.
Thank you for joining me. If you have any questions, you can leave me a message on LinkedIn, and I'll be happy to answer them.

Iaroslav Geraskin

Machine learning engineer @ TikTok

Iaroslav Geraskin's LinkedIn account


