Computational Medicinal Chemistry Course; homology modelling

Introduction

Today, it is still far from possible to calculate the ideal ligand for medicinal purposes given the three dimensional structure of the target protein. There are thousands of reasons that make this a very, very big challenge. Many problems will be discussed during this course, so here we will list just a few of them:

In practice, only two or three medicines have ever been developed based on a protein structure. Nevertheless, protein structures are very useful because they help the medicinal chemist think, they can give direction to the screening procedures, and they can convince management to give more money for the project.

Homology modelling

Unfortunately, structure data is available for only very few targets, so we will almost always have to resort to modelling. In the absence of experimental data, model building on the basis of the known three-dimensional structure of a homologous protein is at present the only reliable method to obtain structural information. Comparisons of the tertiary structures of homologous proteins have shown that three-dimensional structures have been better conserved during evolution than protein primary structures. Massive analysis of databases holding the results of these three-dimensional comparisons, as well as a large number of well-studied examples, indicates the feasibility of model building by homology.

All structure prediction techniques depend one way or another on experimental data. This is most easily seen for model building by homology, but secondary structure prediction programs are also trained on proteins with a known three-dimensional structure, and even molecular dynamics force fields are mainly derived from protein and peptide data. Unfortunately, all protein structures contain errors. Hence, verification of the data used in modelling procedures is a prerequisite for good results. Many of the same verification techniques can of course also be used to get an impression of the quality of the model.

Model building by homology is a multi-step process. At almost all steps choices have to be made. The modeller can virtually never be sure that she makes the best choices, and thus a large part of the modelling process consists of serious thought about how to gamble between multiple seemingly similar choices.

Differences between three-dimensional structures increase with decreasing sequence identity and accordingly the accuracy of models built by homology decreases.
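
Since everything that follows is expressed in terms of percentage sequence identity, it may help to see how such a number can be obtained from an alignment. The sketch below is only one possible definition: it assumes two already aligned sequences of equal length with "-" as gap character, and it counts identical residues over the columns in which both sequences have a residue. The function name percent_identity is ours, and other programs may normalise differently (for example by the length of the shorter sequence).

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over the gap-free alignment columns.

    Assumption: both strings come from the same pairwise alignment, so they
    have equal length and use `gap` for insertions and deletions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    aligned = 0    # columns where both sequences have a residue
    identical = 0  # columns where those residues are the same
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == gap or res_b == gap:
            continue  # insertions and deletions are not counted
        aligned += 1
        if res_a == res_b:
            identical += 1

    if aligned == 0:
        return 0.0
    return 100.0 * identical / aligned
```

For example, percent_identity("ACD-FG", "ACDEYG") compares five gap-free columns, four of which are identical.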

Figure 1. Modelling as a function of ID%.

The errors in a model built on the basis of a structure with >90% sequence identity may be as low as the errors in crystallographically determined structures, except for a few individual side chains. If, as a test case, a known structure is built from another known structure, then at 50% sequence identity the RMS error in the modelled coordinates can be as large as 1.5 Angstrom, with considerably larger local errors. If the sequence identity is only around 25%, the alignment is the main bottleneck for model building by homology, and large errors are often observed. With less than 25% sequence identity the homology often remains undetected. Figure 1 shows the key limiting factors in modelling as a function of sequence identity.
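
The RMS error mentioned above can be made concrete with a few lines of code. The sketch below assumes that the model and the known structure have already been superposed and that the two coordinate lists contain the same atoms in the same order; it does not do the superposition itself. The function name rmsd is ours.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z).

    Assumption: the structures are already superposed and atom i in coords_a
    corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b) or len(coords_a) == 0:
        raise ValueError("need two non-empty coordinate lists of equal length")

    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

Note that an overall value hides local errors: four atoms of which three are perfect and one is displaced by 3 Angstrom also give an RMS error of 1.5 Angstrom.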

At present most model building by homology protocols start from the assumption that, except for the insertions and deletions, the backbone of the model is identical to the backbone of the structure. In practice, however, domain motions and 'bending' of parts of molecules with respect to each other are often seen. Even in the case of significant bending, short-range interactions will not differ very much and the model will be perfectly adequate for rational protein engineering, etc. However, the prediction of local differences in the backbone between structures that are homologous in sequence still requires much research, some aspects of which will be described below.

In recent years automatic model-building by homology has become a routine technique that is implemented in most molecular graphics software packages. Currently the emphasis in literature is on a few topics:

The modelling process

The modelling process can be subdivided into 9 stages:

What can be modelled?

As will be described below, the transfer of structural information to a potentially homologous protein is straightforward if the sequence similarity is high, but the assessment of the structural significance of sequence similarity can be difficult when sequence similarity is low or restricted to a short region.

Figure 2. Homology threshold for structurally reliable alignments as a function of alignment length [freely adapted from Reinhard Schneider's thesis].

The homology threshold (curved line) divides the graph into two regions. In the region of safe structural homology essentially all fragment pairs are observed to have good structural similarity. In the region of unknown or unlikely homology fragment pairs can be structurally similar but often are not, and there is no way of predicting which of the two it will be. At present 15% of the known protein sequences fall in the safe area, which implies that 15% of all sequences can be modelled and thus are open to structure-function relation studies.

This indicates a key problem: the shorter the length of the alignment, the higher the level of similarity required for structural significance. Chothia and Lesk have studied the relation between the similarity in sequence and three-dimensional structure for the cores of globular proteins. To quantify this problem, Schneider and Sander calibrated the length dependence of structural significance of sequence similarity. This was done by deriving from the database of known structures a quantitative description of the relationship between sequence similarity, structural similarity and alignment length. The resulting definition of a length-dependent homology threshold (see figure 2) provides the basis for reliably deducing the likely structure of globular proteins down to the size of domains and fragments.
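
To give a feeling for the shape of such a length-dependent threshold, here is a small sketch. The numbers are an assumption on our side: they are the constants commonly quoted for the Sander and Schneider curve (a power law for alignment lengths between 10 and 80 residues, and a flat 24.8% above that), they are not given in this text, and they should be checked against the original publication before being used for anything serious. The function names are ours.

```python
def homology_threshold(alignment_length: int) -> float:
    """Identity percentage above which an alignment is taken as structurally safe.

    Assumed constants (not from this text): 290.15 * L**-0.562 for
    10 <= L <= 80, and 24.8% for longer alignments.
    """
    if alignment_length < 10:
        raise ValueError("the curve is not calibrated for fewer than 10 residues")
    if alignment_length > 80:
        return 24.8
    return 290.15 * alignment_length ** -0.562


def is_structurally_safe(identity_percent: float, alignment_length: int) -> bool:
    """True if the alignment lies on or above the threshold curve."""
    return identity_percent >= homology_threshold(alignment_length)
```

With these assumed constants, 30% identity over 200 residues lies in the safe region, whereas 30% identity over 20 residues does not: the shorter the alignment, the higher the identity that is needed.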

Template recognition

If the percentage sequence identity between the sequence of interest and a protein with known structure is high enough (more than 25 to 30%), simple database search programs like FASTA or BLAST are clearly adequate to detect the homology. If, however, the percentage identity falls below 25%, detection by straightforward sequence alignment becomes problematic, and more advanced techniques are required. These techniques are beyond the scope of this section of the bioinformatics IV course.

Seminars

The two seminars deal with:

Homology modelling exercise

In the real world you need to make a good alignment between the sequence of the template and the sequence to be modelled. In this practical, we don't have enough time for that, so we made the alignment for you. The following information is available:

If it is just coffee time then this is the right moment to build the model. Use the files:

  1. template.pdb
  2. template.pir
  3. model.pir

Run the homology modelling server, and save your model as model.pdb.
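
If you want to peek inside the two .pir files before you run the server, a small reader like the one below can help. It is a sketch that assumes the usual PIR layout (a header line starting with ">" and containing the entry code after a ";", one description line, and then sequence lines ending in "*"); the actual files in this practical may deviate from that. The function name parse_pir and the example entry in the usage note are hypothetical.

```python
def parse_pir(text: str) -> dict:
    """Return {entry code: sequence} for PIR-formatted text.

    Assumed layout per entry:
        >P1;code
        free-text description line
        SEQUENCE LINES ... terminated by '*'
    """
    entries = {}
    code = None
    expect_description = False
    chunks = []

    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(">"):
            # Header: keep whatever follows the ';' as the entry code.
            code = line.split(";", 1)[1] if ";" in line else line[1:]
            chunks = []
            expect_description = True
            continue
        if code is None:
            continue  # text outside any entry
        if expect_description:
            expect_description = False  # the line after the header is a comment
            continue
        chunks.append(line)
        if line.endswith("*"):
            # '*' closes the sequence; gaps ('-') are kept as they are.
            entries[code] = "".join(chunks).rstrip("*").replace(" ", "")
            code = None

    return entries
```

Reading a file then comes down to parse_pir(open("model.pir").read()), which returns a dictionary with one sequence per entry code.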

It is a waste of time to wait for the model to be built. It is not the actual process of modelling that is the problem, but the alignment, and the validation of the model. The alignment is the topic of the bioinformatics I course and won't be discussed here. The validation of the model is today's main topic.

You can look at your model with WHAT_TEACH....

Displaying a protein with WHAT_TEACH

First, let's figure out how WHAT_TEACH works. Go to the description page, read it, and do as you are told!

In your local directory you find the file GETPDB.SCR. You don't have to worry about this file because it should in principle be OK. However, it roughly looks like:

GETMOL 1crn.pdb crambin

The "GETMOL" is obligatory. Never change that, not even by accident! (If you are not at CMBI, you can use the same script: WHAT_TEACH knows where to find the PDB file. There is no need to specify paths, not that it is forbidden, of course.) You will have to change the second word because I was too lazy to make a hundred scripts. This word holds the name of the coordinate file you want to look at. You will have to change that word to work with another molecule. In the first example the file name is "model.pdb". Edit the file and put in the name of the model. If people from a previous course left some other stuff in this file, just delete that.
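
If you prefer not to edit the file by hand, the same change can be made with a few lines of code. This is only a sketch: it assumes that the script consists of a single line of the form shown above, and the function name set_coordinate_file is ours.

```python
def set_coordinate_file(script_line: str, pdb_file: str) -> str:
    """Replace the second word of a GETMOL line by another coordinate file.

    Assumption: the line looks like 'GETMOL <file> <name>'; the first word
    is never touched.
    """
    words = script_line.split()
    if len(words) < 2 or words[0] != "GETMOL":
        raise ValueError("expected a line of the form 'GETMOL <file> <name>'")
    words[1] = pdb_file
    return " ".join(words)
```

Applied to the line shown above with "model.pdb" as file name, this gives "GETMOL model.pdb crambin", which can then be written back to GETPDB.SCR.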

Activate WHAT_TEACH.

Let's first look at the model and play a bit with WHAT_TEACH. We have given you five pictures to work with:
  • MOL1 : the molecule with all atoms displayed
  • MOL2 : a C-alpha trace of your molecule
  • MOL3 : a ribbon plot of your molecule
  • MOL4 : a surface representation
  • MOL5 : the hydrogen bonds
Now, answer the following questions:
  • Where does the ligand bind?
  • Where is helix 12?
  • Where does the co-activator bind?
  • Why is there no ligand bound (trivial)?
  • Why is there no co-activator bound (not so trivial)?

Model validation

The model contains errors. Looking at the file, even in stereo, you are not going to detect those. That is why we have the model validation server. Upload the model ("model.pdb"), and run the server. That can take a few minutes, so get your coffee cup clean... Print the validation report. In case of catastrophe, the validation report of the model is available online. Look for five "big errors", and check in the structure how bad it really is. Make some good notes about these errors.

  1. Leu 86 (355) has some bad chirality violations. Why was this modeling error made? How can this problem be solved?
  2. Look at Leu 17 (281). What is wrong with it? How was that "error" made? How can we fix it afterwards?
  3. The template we used is called NR_template.pdb. Use the SCRIPT as described above (so put NR_template.pdb in the file GETPDB.SCR first) and figure out which of the two alignments is better.
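
To get a feeling for what a chirality check looks at, the sketch below computes a signed "chiral volume" around a C-alpha atom from the positions of N, C-alpha, C and C-beta. This is not necessarily how the validation server does it; it is just one common way to express handedness as a number. With the atom order used here an L-amino acid gives a positive value and its mirror image gives the same value with a negative sign. The function name and the coordinates in the usage note are illustrative.

```python
def chiral_volume(n, ca, c, cb):
    """Signed volume (N-CA) . ((C-CA) x (CB-CA)) in cubic Angstrom.

    Assumption: each argument is an (x, y, z) tuple. With this atom order
    the value is positive for an L-amino acid and negative for a D one.
    """
    def sub(p, q):
        return (p[0] - q[0], p[1] - q[1], p[2] - q[2])

    v_n, v_c, v_cb = sub(n, ca), sub(c, ca), sub(cb, ca)

    # Cross product (C-CA) x (CB-CA)
    cross = (
        v_c[1] * v_cb[2] - v_c[2] * v_cb[1],
        v_c[2] * v_cb[0] - v_c[0] * v_cb[2],
        v_c[0] * v_cb[1] - v_c[1] * v_cb[0],
    )
    # Dot product with (N-CA) gives the signed volume
    return v_n[0] * cross[0] + v_n[1] * cross[1] + v_n[2] * cross[2]
```

For example, with N at (17.047, 14.099, 3.625), C-alpha at (16.967, 12.784, 4.338), C at (15.685, 12.755, 5.133) and C-beta at (18.170, 12.703, 5.337) the result is positive; flipping the sign of all x coordinates mirrors the residue and makes the result negative.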


© SF, GV.