How does it work?



Making PDB_REDO

Creating entries for PDB_REDO (version 6.16) takes eight steps, described below. The PDB_REDO pipeline ties together software from CCP4 and other programs, such as WHAT_CHECK, using its own tools and decision-making algorithms. Whenever a serious problem is encountered that prohibits making a PDB_REDO entry, a WHY_NOT entry is written instead.

Preparing the data

The reflection data is first filtered by the program cif2cif. It performs a few important checks to ensure that the next steps of the optimisation process can be run. These checks include (but are not limited to) 1) standardising the _refln.status column and checking it for information content, 2) checking the estimated standard deviations (σF) for each reflection and setting 0.0 values to safe defaults, 3) checking the σF set to see if it has any information content, 4) removing reflections with negative amplitudes, and 5) checking for general format problems. At the moment, only h, k, l, F, σF, and the status flag are kept. Anomalous diffraction pairs (F+ and F-) are merged. If the reflection data file contains more than one dataset, only the first set is kept.
If intensities rather than amplitudes were deposited, steps 3) and 4) are skipped and intensities are written out instead of amplitudes; they are later converted to amplitudes by the program ctruncate.
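The filtering checks above can be sketched as follows. This is a minimal illustration, not the real cif2cif code: the function name, the tuple layout, the status-flag normalisation rule and the default sigma are all assumptions.

```python
def filter_reflections(reflections, have_amplitudes=True, default_sigma=1.0):
    """Keep h, k, l, F (or I), sigma(F) and the status flag, applying the
    sanity checks described in the text (illustrative sketch only)."""
    kept = []
    for h, k, l, value, sigma, status in reflections:
        # 1) standardise the _refln.status column: 'f' marks the R-free set,
        #    everything else becomes a work reflection 'o' (assumed rule).
        status = 'f' if str(status).lower() in ('f', '1') else 'o'
        # 4) reflections with negative amplitudes are removed
        #    (skipped when the data are intensities).
        if have_amplitudes and value < 0.0:
            continue
        # 2) zero estimated standard deviations are reset to a safe default
        #    (the default value here is a placeholder, not the real one).
        if sigma == 0.0:
            sigma = default_sigma
        kept.append((h, k, l, value, sigma, status))
    # 3) check whether the sigma column carries any information at all:
    #    a column where every value is identical is uninformative.
    sigma_informative = len({s for *_, s, _ in kept}) > 1
    return kept, sigma_informative
```

A multi-dataset file would be trimmed to its first dataset before this step, and anomalous pairs merged, as described above.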

The PDB file is also edited before it is used. The program stripper removes all unknown atoms (UNX) and unknown ligands (UNL). Atoms beyond Cβ in unknown residues (UNK) are removed, as are all hydrogens, deuteriums and dubious LINKs. Superfluous oxygen atoms and LINKs in carbohydrates that are flagged by pdb-care are also removed. The treatment of atoms with occupancy set to 0.00 is context dependent: waters and side-chain atoms are always removed (side chains are rebuilt later); other atoms have their occupancy reset to 0.01 to make sure that proper geometric restraints can be generated later.

The reformatted reflection data file and the stripped PDB file are then loaded into another program (extractor) to obtain information about crystal parameters, resolution ranges, R-factors, B-factors, LINKs, solvent mask parameters, and the TLS model. In addition to the TLS model from the PDB header, a new set of TLS group definitions is created, based on a single TLS group per macromolecular chain.

The output from extractor is checked. If no R value could be extracted from the PDB header, the job is stopped unless PDB_REDO runs in 'legacy' mode (only for PDB files from the seventies and eighties).
The initial solvent model for REFMAC (SIMPle or BULK) is chosen by looking for certain keywords (babinet, swat, bulk, moews, kretsinger, and tnt) in the solvent section of the REMARK 3 records in the PDB file. If any of these keywords is found, REFMAC's BULK option is used for the recalculation of R(-free).
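The keyword scan above amounts to a simple case-insensitive substring search; a minimal sketch (the function name and return values are illustrative, only the keyword list comes from the text):

```python
# Keywords from the REMARK 3 solvent section that indicate a bulk-solvent
# (Babinet-type) correction was used in the original refinement.
SOLVENT_KEYWORDS = ('babinet', 'swat', 'bulk', 'moews', 'kretsinger', 'tnt')

def choose_solvent_model(remark3_solvent_text):
    """Return 'BULK' if any trigger keyword appears, else 'SIMP'."""
    text = remark3_solvent_text.lower()
    return 'BULK' if any(k in text for k in SOLVENT_KEYWORDS) else 'SIMP'
```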

A model validation report from WHAT_CHECK is created on-the-fly.

We now have enough information to convert the structure factors to the MTZ format for CCP4. The original R-free set of reflections is kept. If this is not available or otherwise unusable, a new random set is generated (taking potentially twin-related reflections into account). The size of this set depends on the number of available reflections: at least 5% of all reflections are used, but that percentage is increased (to a maximum of 10%) to try to get at least 500 reflections in the R-free set. If a new R-free set was selected, the refinement procedure is adapted accordingly.
In the process of creating the MTZ file, sfcheck tests for twinning, calculates the Wilson B-factor and reports the completeness of the dataset.
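The test-set sizing rule above can be written down compactly. A minimal sketch, assuming the fraction is simply clamped between the 5% floor and the 10% ceiling (the function name and the exact rounding behaviour are not from PDB_REDO):

```python
def rfree_fraction(n_reflections, minimum=0.05, maximum=0.10, target=500):
    """Fraction of reflections to set aside for R-free: at least 5%,
    raised (up to 10%) so the test set aims for at least 500 reflections."""
    if n_reflections <= 0:
        raise ValueError("need a positive reflection count")
    needed = target / n_reflections   # fraction that yields exactly `target`
    return min(maximum, max(minimum, needed))
```

For a large dataset the floor applies; for a small one the fraction grows until it hits the 10% cap, so very small datasets may still end up with fewer than 500 test reflections (which is what later triggers k-fold cross-validation).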

Recalculating R(-free)

R and R-free are recalculated in a 0-cycle refinement run in REFMAC. The solvent model type and mask parameters extracted from the PDB header, as well as anisotropic temperature factors (if available), are used for the calculation of R(-free). Please refer to the PDB_REDO script for more details. If REFMAC stops after finding a new ligand or new links, the 0-cycle REFMAC run is repeated with a newly generated restraint file. Note that these restraints are not checked manually. If TLS tensors were extracted from the PDB header, the recalculation is performed twice (once with and once without static TLS tensors) to test whether the B-factors in the model are total or residual values.
The recalculated R and R-free are extracted from the REFMAC output and compared to the values from the PDB header. If both values deviate by more than 0.05 (that is, 5 percentage points), the recalculation is attempted with REFMAC's automated de-twinning switched on (but only if sfcheck detected possible twinning).
If this does not help to reproduce R and R-free, the structure is subjected to 10 cycles of rigid-body refinement to compensate for any rotation or translation of the structure before deposition. The rigid-body refined structure model will replace the original PDB file in the upcoming refinement. R and R-free are tested again; if the deviation from the header values is still too large, there is one final attempt to get better values: five cycles of pure TLS refinement (that is, the atoms do not move). This is only done when the structure was originally refined with TLS.
R and R-free are tested once more, this time with a cut-off of 0.10 (i.e. 10 percentage points). If the deviation is still too large, the re-refinement is aborted.
There are a few exceptions: 1) When PDB_REDO runs in legacy mode, the R-value is not validated (a number of old PDB files do not have an R-value) and rigid-body refinement is always performed. 2) When no R-free value could be extracted from the PDB header, only the R value is validated.
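The escalation ladder above (de-twinning, then rigid-body refinement, then TLS-only refinement, then a looser cut-off) can be sketched as a small decision function. The step names and the function itself are illustrative; only the thresholds and their order come from the text:

```python
def next_step(deviation, attempts_done, twinning_suspected=True, had_tls=True):
    """Decide the next remedy when recalculated R(-free) deviates from the
    PDB-header value. `deviation` is the absolute difference; `attempts_done`
    lists the remedies already tried, in order."""
    if deviation <= 0.05:           # within 5 percentage points: accept
        return 'accept'
    # De-twinning only applies if sfcheck suspected twinning; TLS-only
    # refinement only if the model was originally refined with TLS.
    ladder = [step for step, applies in (('detwin', twinning_suspected),
                                         ('rigid_body', True),
                                         ('tls_only', had_tls)) if applies]
    for step in ladder:
        if step not in attempts_done:
            return step
    # All remedies exhausted: accept with the looser 0.10 cut-off, else abort.
    return 'accept' if deviation <= 0.10 else 'abort'
```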

The recalculated values for R and R-free will be used to monitor the refinement success. An R-free Z-score is calculated based on the expected 'unbiased' R-free/R ratio. This ratio will also be used to monitor the re-refinement success. If a new R-free set was created, the calculated R-free value is unreliable. Therefore the 'unbiased' R-free value will be used to monitor refinement success.
The bond length and bond angle RMSZ values given by REFMAC are also extracted. If one of these values is greater than 1.000 (i.e. the PDB file had poor geometry), the geometric cut-offs used to check the refinement are relaxed.
If any chirality errors are detected by REFMAC, the program chiron is used to fix the errors caused by atom naming problems.

Setting refinement parameters

A number of parameters have to be set to ensure the refinement runs properly. If it was not tested in the previous step, REFMAC is used to check whether or not the data should be detwinned during refinement. The parameters for the solvent mask (probe sizes and the shrinkage factor) are optimised using a grid search in REFMAC.
If the experimental data extends beyond the resolution cut-off used to refine the original model, PDB_REDO attempts to find a higher resolution cut-off using paired refinement.

An appropriate B-factor model (overall B-factor, isotropic B-factors or anisotropic B-factors) is chosen. When there are more than 30 (work set) reflections per atom, we always choose anisotropic B-factors. Between 13 and 30 reflections per atom, both isotropic and anisotropic B-factors are tested by refinement with automated weighting; the program Bselect then selects the best B-factor model based on the Hamilton R ratio test and some additional validation. Between 4 and 13 reflections per atom, isotropic B-factors are used. When there are fewer than 4 reflections per atom, both one overall B-factor and individual isotropic B-factors are tested by refinement, and the model selection is again performed by Bselect. In cases with strict NCS and a high number of copies (mostly found for viral capsids), isotropic B-factors are used by default.
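The decision table above can be encoded directly. A sketch only: the function name and the behaviour at the exact boundaries (13 and 4 reflections per atom) are assumptions, and the 'test' outcomes stand for the Bselect comparisons described in the text:

```python
def bfactor_model(reflections_per_atom, strict_ncs_many_copies=False):
    """Pick the B-factor parameterisation from the work-set
    reflections-per-atom ratio (boundary handling is illustrative)."""
    if strict_ncs_many_copies:            # e.g. viral capsids
        return 'isotropic'
    r = reflections_per_atom
    if r > 30:
        return 'anisotropic'
    if r >= 13:
        return 'test: isotropic vs anisotropic (Bselect)'
    if r >= 4:
        return 'isotropic'
    return 'test: overall vs isotropic (Bselect)'
```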
When isotropic or overall B-factors are used, a TLS model must be selected. The B-factors are reset to half the Wilson B-factor and pure TLS refinement is performed with the TLS group selections made by extractor. The optimal TLS group selection is chosen by first letting Bselect filter out any cases with clear overfitting and then using the program picker to make the final choice. The program TLSanl is used to reduce the B-factors in the output PDB file to residual values. The new PDB file and the output TLS tensors are used as input for the subsequent refinements. In rare cases TLS refinement is not able to bring down R-free after resetting the B-factors. If this happens, we stop using TLS in the rest of the PDB_REDO procedure.

The number of refinement cycles depends on many factors. By default 15 cycles of TLS and 20 cycles of restrained refinement are used. To ensure convergence of the refinement, the number of cycles is increased in these cases: 1) When jelly-body restraints are used (for lower resolution models), the number of cycles is increased by 10 cycles with tight restraints and 5 cycles with looser restraints. 2) For every step the resolution cut-off is set higher, 5 refinement cycles are added. 3) When a new R-free set is used, or when the R-free value is considered biased, the number of cycles is increased by 10. 4) When the model is treated as a legacy model (e.g. for PDB entries from the seventies and eighties), 10 cycles are added. 5) Anisotropic B-factors take longer to converge in refinement, so the number of cycles is increased by 20 if they are used. 6) During k-fold cross-validation of R and R-free, an additional 30 refinement cycles are used.
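Because the additions above are cumulative, the restrained-refinement cycle count is just a sum over the applicable cases. A minimal sketch (the function and its parameter names are assumptions; the increments come from the text):

```python
def restrained_cycles(jelly_body=False, resolution_steps_raised=0,
                      new_or_biased_rfree=False, legacy=False,
                      anisotropic=False, k_fold=False):
    """Add the case-by-case cycle increments onto the default of
    20 restrained-refinement cycles."""
    cycles = 20                      # default restrained refinement
    if jelly_body:
        cycles += 10 + 5             # 10 tight + 5 looser restraint cycles
    cycles += 5 * resolution_steps_raised   # per resolution cut-off step
    if new_or_biased_rfree:
        cycles += 10
    if legacy:
        cycles += 10
    if anisotropic:
        cycles += 20                 # anisotropic Bs converge more slowly
    if k_fold:
        cycles += 30                 # k-fold cross-validation
    return cycles
```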

The re-refinement

If individual (an)isotropic B-factors are used, the right B-factor restraint weight (WBSKAL in REFMAC) must be established. The optimal weight cannot be predicted based on what we know about the structure model. We therefore try up to seven weights (ranging from 2.50 to 0.10) based on the resolution and the number of reflections per atom. Each weight is tested by performing 10 cycles of restrained refinement (or 15 cycles of restrained refinement when using anisotropic Bs) with fixed TLS tensors (if applicable). For detailed settings refer to the PDB_REDO script.
The optimal weight is selected based on the free likelihood and a few cut-offs (see the selection section for details). If no weight gives acceptable results, the default value for WBSKAL (1.00) is used.
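The weight selection above reduces to picking, among the trial refinements that pass the cut-offs, the one with the best free likelihood, and falling back to the default otherwise. A sketch under the assumptions that a higher free-likelihood value is better here and that the cut-off checks can be abstracted into a per-trial boolean:

```python
def pick_wbskal(trials, default=1.00):
    """trials: list of (weight, free_likelihood, passes_cutoffs) tuples from
    the short test refinements. Return the acceptable weight with the best
    free likelihood, or REFMAC's default WBSKAL when none is acceptable."""
    acceptable = [(fl, w) for w, fl, ok in trials if ok]
    if not acceptable:
        return default
    return max(acceptable)[1]   # best free likelihood wins
```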

The next step is finding the right weights for the X-ray terms relative to the geometric restraints, the 'MATRIX' setting in REFMAC. The optimal weight is resolution dependent, but the correlation is very poor. We therefore try a number of weights ranging from 5.00 to 0.001 depending on the X-ray resolution and the number of reflections per atom.

The actual re-refinement consists of a number of refinement runs with selected matrix weights. The previously established B-factor model, B-factor restraint weight and solvent model are used. Automatic de-twinning and TLS are used when they were shown to be effective. Local NCS restraints or strict NCS constraints are used at any resolution; jelly-body restraints are used at low resolution. Occupancies are refined for selected residues: 1) hetero compounds (mostly ligands) with more than two different occupancies among the atoms, and 2) hetero compounds for which at least one atom has occupancy 0.01 or 0.00.
For detailed settings, please refer to the PDB_REDO script.
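The two occupancy-refinement rules above can be expressed as a small predicate. Illustrative only: the function name is an assumption, and the exact-value test for 0.00/0.01 occupancies stands in for whatever tolerance the pipeline actually applies:

```python
def refine_occupancy(residue_is_hetero, atom_occupancies):
    """Decide whether a residue's occupancies should be refined:
    hetero compounds with more than two distinct occupancy values, or
    with any atom at occupancy 0.01 or 0.00 (as set during stripping)."""
    if not residue_is_hetero:
        return False
    if len(set(atom_occupancies)) > 2:   # rule 1: > 2 different occupancies
        return True
    return any(o in (0.00, 0.01) for o in atom_occupancies)   # rule 2
```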

Selection

We now have a set of re-refined structure models. The optimal model is selected using a set of rules based on the free likelihood and several quality cut-offs.

The optimal re-refined model is validated with WHAT_CHECK.

Rebuilding

We now have a conservatively optimised structure model (or the original PDB file) and new maps. Using these maps we can rebuild the structure model. The program centrifuge is used to remove all waters that do not fit the maps. We then apply peptide flips with pepflip, but only if the fit with the maps and some geometry scores (e.g. Ramachandran) improve. Peptides in the middle of secondary structure elements (as assigned by DSSP) are unlikely to need flipping and are skipped to speed up the process. The next step is rebuilding existing side chains and adding the missing ones with SideAide. A new side chain is built when it improves the fit with the maps; otherwise the original coordinates are kept. Missing side chains are always built. Side chains involved in LINKs are never replaced. Waters that are in the way of rebuilt side chains are removed unless they are involved in LINKs.

The output structure model is validated in WHAT_CHECK to find side chains that need flipping because of the PDB standard (TYR, PHE, ASP and GLU), standard geometry (ARG) or hydrogen bonding (HIS, ASN and GLN) and side chains with chirality problems (LEU, VAL, ILE and THR). The structure model is then fed into SideAide once more to do the flips and replace all side chains with chirality problems.

More refinement

The rebuilt structure model has to be refined one final time. Because the model may have been substantially altered, the X-ray weight is optimised a bit further: the previously established weight is tried, as well as a slightly higher and a slightly lower value. The optimal model is selected using the same criteria as before. If none of the models makes the cut, the one refined with the previously established X-ray weight is chosen. This way, there will always be a new (fully optimised) model even if this means a (slight) increase of R-free.
If re-refinement did not yield an optimal X-ray weight and a new structure model, the refinement in this phase is performed with automatic weighting.

Validation and finalisation

If the test set is very small (i.e. fewer than 500 reflections), k-fold cross-validation is performed and the average R(-free) plus standard deviations are calculated. The program FoldX is used to compare the stability of the original, re-refined, and fully optimised structure models by estimating their (folding) Gibbs energy. The program YASARA is used to record the changes PDB_REDO made to the structure model and to create ready-made 3D scenes of the model coloured by atomic shift and by TLS group. EDSTATS is used to test the fit to the electron density maps. The change in real-space correlation coefficient is plotted and tested for significance. A script for COOT is created to show all significant model changes. The fully optimised model is also validated with WHAT_CHECK. The validation scores and all important values from the previous steps are combined into a single webpage with links to relevant databases. The conservatively optimised and the fully optimised structure models are of course available for download, together with MTZ files to calculate electron density maps, the COOT script, YASARA scenes, and a few other relevant files.
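The k-fold summary statistics mentioned above are just the mean and standard deviation over the per-fold R-free values; a minimal sketch (the function name is an assumption, and the sample standard deviation is used here as one reasonable choice):

```python
import statistics

def k_fold_rfree(per_fold_rfree):
    """Average the R-free values from the k cross-validation folds and
    report their sample standard deviation."""
    mean = statistics.mean(per_fold_rfree)
    sd = statistics.stdev(per_fold_rfree) if len(per_fold_rfree) > 1 else 0.0
    return mean, sd
```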