|Comparative Molecular Field Analysis (CoMFA)|
By Martin Ott
CoMFA (Comparative Molecular Field Analysis) is a 3D QSAR technique based on data from known active molecules. CoMFA can be applied, as it often is, when the 3D structure of the receptor is unknown. To apply CoMFA, all that is needed are the activities and the 3D structures of the molecules. Of course, activities have to be measured, but 3D structures can be determined either by measurement (crystal X-ray analysis) or by calculation from the 2D diagram and (optionally) subsequent optimization.
The aim of CoMFA is to derive a correlation between the biological activity of a set of molecules and their 3D shape, electrostatic and hydrogen bonding characteristics. This correlation is derived from a series of superimposed conformations, one for each molecule in the set. These conformations are presumed to be the biologically active structures, overlaid in their common binding mode. Each conformation is taken in turn, and the molecular fields around it are calculated. The fields, usually electrostatic and steric (van der Waals interactions), are measured at the lattice points of a regular Cartesian 3D grid; the lattice spacing is typically 2 Å. The "measured" interaction is between the molecule and a probe atom (an sp3-hybridized carbon with +1 charge).
How does CoMFA work?
What is needed before doing CoMFA?
In this tutorial, you will create a CoMFA model and study its application.
If necessary, check the X-windows start-up page for detailed instructions on how to set up the X-windows environment and to access CMBI's Unix machine you need for running Sybyl, cheminf.cmbi.ru.nl. Then, from the Unix shell (command prompt):
A series of 31 compounds with moderate to high activity for the estrogen receptor (ER-α) (Endocrinology 1997, 138(9), 4022-4025) have been constructed and provided with charges according to the Gasteiger-Huckel model. This set, along with the measured activity data, will be used for training in a CoMFA analysis. The molecules have already been aligned.
First, create a molecular database (MDB) and fill it with molecules:
Take a look at the aligned collection of molecules; rotate them by using the mouse (keep right mouse button pressed down). The majority of the molecules are steroids; do you recognize the steroid skeleton? Note the flat shape of the steroid A-ring region. The non-steroidal molecules have been aligned so as to fit their "A-ring" into the same flat region. If you have finished viewing the molecules, you can clear them and create the molecular spreadsheet (MSS). The spreadsheet is filled from the database you just created:
Now the activity data are imported from a text file into the spreadsheet. The text file is in a simple, space-delimited format:
A column of numbers should now appear behind the compound names. These data are relative binding affinities, expressed as (nanomolar!) concentration values. In QSAR the logarithms of concentrations are used, similar to using pH for the acidity of a solution. So the values we'll be using are -log(concentration*10-9). First, column 1 is given a new name ("RBA" for Relative Binding Affinity), and then Autofill/Functional data is used to create a column with converted values:
Verify that the functional specification you typed corresponds to the conversion "-log(RBA*10-9)".
Adding a CoMFA column to the molecular spreadsheet is quite straightforward:
Your spreadsheet window should now look like this:
In a partial least-squares (PLS) analysis, two factors are important: the number of components used in the regression equation (which will be dealt with later) and the (usually squared) correlation coefficient. A non-cross-validated PLS analysis gives a squared correlation coefficient usually indicated by r2. This number, which is also used in (multiple) linear regression, is between zero and one and expresses the quality of the PLS analysis. It indicates the proportion of the variation in the dependent variable (here the activity) that is explained by the regression equation and its value should be as close to one as possible. However, r2 expresses the quality of the data fit rather than the quality of prediction (which is what we are actually interested in).
To express the predictive power of the analysis, the cross-validated r2, usually indicated by q2, is used. In cross-validation, one value is left out, a model is derived using the remaining data, and the model is used to predict the value originally left out. This procedure is repeated for all values, yielding q2. q2 is normally (much) lower than r2 and values greater than 0.5 already indicate significant predictive power.
After the cross-validated PLS analysis, you will determine the optimal number of components. Recall that the components are linear combinations of the variables (of which you have very many!), ordered in such a way that the first component will describe most of the variation in the activity, the second most of the remaining variation, etc. The cross-validated PLS analysis will be carried out for different numbers of components. As a rule of thumb, the number of components should not exceed one-third of the number of molecules. More components would lead to a model that is overtrained - it has a better fit to the training data but the predictive power is diminished.
The two important factors here are the standard error of prediction (SEP) and the cross-validated r2 (usually called q2). SEP should be as low as possible and q2 should be larger than 0.5 (lower values indicate poor predictive power).
The resulting number of components will be used in the non-cross-validated analysis.
The cross-validated analysis was used just to determine the optimal number of components for a PLS analysis. The definitive CoMFA model, which you will use for the prediction of activities, is built by a non-cross-validated PLS analysis using the number of components found earlier.
To check the CoMFA model, the activities of the compounds in the training set are predicted and these values are compared to the measured data.
You will get a plot of predicted vs. measured values in the right upper corner, the steric contours (in green and yellow), and the electrostatic contours (in blue and red) of the CoMFA map. Each point in the plot is a molecule; select a molecule by clicking on a point to view it inside the contours, and repeat that for a few molecules. Where in the plot are the most active molecules? Note that the name and values of the molecule are shown in the text window as well, and that the molecule row is also highlighted in the spreadsheet. Use the reset tool to reset the display if necessary.
The meaning of the colors is as follows (check also the text window):
|Green:||Steric bulk favored|
|Yellow:||Steric bulk disfavored|
||Positive charge and H-bond donors favored|
Negative charge and H-bond acceptors disfavored
||Negative charge and H-bond acceptors favored|
Positive charge and H-bond donors disfavored
A main advantage of CoMFA is that it allows you to predict activities of close analogs (close because you cannot extrapolate outside the scope of the model) and also that it suggests where in the molecules you can make favorable modifications. Proceed as follows:
See if you can get a better ligand than diethylstilbestrol, or even a predicted activity above 9.5! (In which case you may well have strolled too far outside the scope of the model!)