How does it work?



Making PDB_REDO

Creating entries for PDB_REDO (version 4.36) takes nine steps. A custom-made script and several programs are used, as well as two major software packages: CCP4 and WHAT IF/YASARA. Whenever a serious problem is encountered in the procedure, a WHY_NOT entry is written.

Preparing the data

The reflection data from the PDB are not always uniform in format. Therefore, they have to be reformatted by a program called cif2cif. It performs a few important checks to ensure that the next steps of the re-evaluation process can be run. These checks include (but are not limited to) 1) standardising the _refln.status column and checking it for information content, 2) checking the estimated standard deviations (σF) for each reflection and setting 0.0 values to safe defaults, 3) checking the σF set to see if it has any information content, 4) removing reflections with negative amplitudes, and 5) checking for general format problems. At the moment, only h, k, l, F, σF, and the status flag are kept. Anomalous diffraction pairs (F+ and F-) are merged. If the reflection data file contains more than one dataset, only the first set is kept.
If intensities instead of amplitudes are deposited, steps 3) and 4) are skipped and intensities are written out instead of amplitudes. The intensities are later converted to amplitudes by the program ctruncate.

The PDB file is also edited before it is used. The program stripper removes all unknown atoms (UNX) and unknown ligands (UNL). Atoms in unknown residues (UNK) beyond Cβ are removed as well as all hydrogens, deuteriums, atoms with occupancy 0.00 and dubious LINKs. Superfluous oxygen atoms and LINKs in carbohydrates that are flagged by pdb-care are also removed.
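
The atom-level removal rules above can be sketched as a simple filter. This is only an illustration, not the code of stripper: the Atom record, the function names, and the set of atom names kept for UNK residues are assumptions made for this example, and the handling of LINKs and of carbohydrates flagged by pdb-care is left out.

```python
from dataclasses import dataclass

# Atom names kept for UNK residues: backbone plus C-beta.
# The exact name set is an assumption made for this sketch.
UNK_KEPT_ATOMS = {"N", "CA", "C", "O", "CB"}


@dataclass
class Atom:
    """Hypothetical minimal atom record, not the format stripper uses."""
    name: str         # atom name, e.g. "CA"
    residue: str      # residue name, e.g. "UNK"
    element: str      # element symbol, e.g. "C", "H" or "D"
    occupancy: float


def keep_atom(atom):
    """Return False for atoms that the rules in the text remove."""
    if atom.residue in ("UNX", "UNL"):
        return False  # unknown atoms and unknown ligands
    if atom.residue == "UNK" and atom.name not in UNK_KEPT_ATOMS:
        return False  # atoms beyond C-beta in unknown residues
    if atom.element in ("H", "D"):
        return False  # hydrogens and deuteriums
    if atom.occupancy == 0.0:
        return False  # atoms with occupancy 0.00
    return True


def strip_atoms(atoms):
    """Return the atoms that survive the filter, in their original order."""
    return [atom for atom in atoms if keep_atom(atom)]
```

For example, strip_atoms applied to a UNK residue with atoms N, CA, CB and CG returns only N, CA and CB.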

The reformatted reflection data file and the stripped PDB file are then loaded into another program (extractor) to obtain information about crystal parameters, resolution ranges, R-factors, B-factors, LINKs, solvent mask parameters, and TLS models. In addition to the TLS model from the PDB header, a new set of TLS group definitions is created based on a single TLS group per chain.

The output from extractor is checked. If no R value could be extracted from the PDB header, the job is stopped unless PDB_REDO runs in 'legacy' mode (only for PDB files from the seventies and eighties).
The solvent model for Refmac (SIMPle or BULK) is chosen by looking for certain keywords (babinet, swat, bulk, moews, kretsinger, and tnt) in the solvent section of the REMARK 3 records in the PDB file. If (and only if) any of these keywords is found, Refmac's BULK option is used for the recalculation of R(-free).
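
The keyword test for the solvent model can be sketched as follows. This is a minimal illustration and not the PDB_REDO script itself: the function name, the case-insensitive substring matching, and the assumption that the caller passes only the solvent method description (not the section title) are choices made for this example.

```python
# Keywords listed in the text; finding any of them selects the BULK option.
BULK_KEYWORDS = ("babinet", "swat", "bulk", "moews", "kretsinger", "tnt")


def choose_solvent_model(solvent_method_text):
    """Return "BULK" if any keyword occurs in the text, otherwise "SIMPLE".

    Matching is a case-insensitive substring search, which is an
    assumption of this sketch.
    """
    text = solvent_method_text.lower()
    for keyword in BULK_KEYWORDS:
        if keyword in text:
            return "BULK"
    return "SIMPLE"


if __name__ == "__main__":
    print(choose_solvent_model("METHOD USED : BABINET MODEL WITH MASK"))
    print(choose_solvent_model("METHOD USED : MASK"))
```

The returned strings are only labels for the two options named in the text; how they are passed to Refmac is not covered here.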

A model validation report from WHAT_CHECK is copied from the PDBREPORT databank. If the report is unavailable or outdated, PDB_REDO tries to make a report on-the-fly.

We now have enough information to convert the structure factors to the MTZ format for CCP4. The original R-free set of reflections is kept. If this is not available or otherwise unusable, a new random set is generated. The size of this set depends on the number of available reflections. At least 5% of all reflections are used, but that percentage is increased to (try to) get at least 500 reflections in the R-free set. If a new R-free set was selected, the refinement procedure is (of course) adapted.
In the process of creating the MTZ file, sfcheck tests for twinning, calculates the Wilson B-factor and reports the completeness of the dataset.
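
The sizing rule for a new R-free set can be sketched as below. The 5% minimum and the target of 500 reflections come from the text. The upper limit (max_fraction) is an assumption added here to reflect the "(try to)" wording; the real limit, if any, is not stated. Function names are also hypothetical.

```python
import random


def free_set_fraction(n_reflections, min_fraction=0.05,
                      target_count=500, max_fraction=0.10):
    """Fraction of reflections to put in a new R-free set.

    min_fraction and target_count follow the text; max_fraction is an
    assumed cap that is not given in the text.
    """
    if n_reflections <= 0:
        raise ValueError("need at least one reflection")
    wanted = max(min_fraction, target_count / n_reflections)
    return min(wanted, max_fraction)


def pick_free_set(n_reflections, seed=None):
    """Randomly pick the indices of the reflections for the R-free set."""
    count = round(free_set_fraction(n_reflections) * n_reflections)
    rng = random.Random(seed)
    return set(rng.sample(range(n_reflections), count))
```

With these settings a data set of 20000 reflections gets the 5% minimum, a data set of 8000 reflections gets 6.25% (500 reflections), and a data set of 2000 reflections hits the assumed 10% cap and ends up with fewer than 500 test reflections.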

Recalculating R(-free)

R and R-free are recalculated in a 0-cycle refinement run in Refmac. The solvent model type and mask parameters extracted from the PDB header as well as anisotropic temperature factors (if available) are used for the calculation of R(-free). Please refer to the PDB_REDO script for more details. If Refmac stops after finding a new ligand or new links, the 0-cycle Refmac run is repeated with the new restraint file. Beware that the restraints are not checked manually! If TLS tensors were extracted from the PDB header, the recalculation is performed twice (once with and once without static TLS tensors) to test whether the B-factors in the model are total or residual values.
The recalculated R and R-free are extracted from the Refmac output and compared to the values from the PDB header. If both values deviate more than 0.05 (that is, 5%) from the values from the PDB header, the recalculation is attempted with Refmac's automated de-twinning switched on (but only if possible twinning was detected by sfcheck).
If this does not help to reproduce R and R-free, the structure is subjected to 10 cycles of rigid-body refinement to compensate for any rotation or translation of the structure before deposition. The rigid-body refined structure model will replace the original PDB file in the upcoming refinement. R and R-free are tested again. If the deviation from the header values is still too large, there is one final attempt to get better values: five cycles of pure TLS refinement (that is, the atoms do not move). This is only done when the structure was originally refined with TLS.
R and R-free are tested once more, this time with a cut-off of 0.10 (i.e. 10%). If the deviation is still too large, the re-refinement is aborted.
There are a few exceptions: 1) When PDB_REDO runs in legacy mode, the R-value is not validated (a number of old PDB files do not have an R-value). Rigid body refinement is always performed. 2) When no R-free value is extracted from the PDB header, only the R value is validated.
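
The comparison with the header values and the order of the fallback steps can be sketched as follows. This is an illustration under stated assumptions, not the PDB_REDO script: the function names are hypothetical, and the check follows the wording "both values" literally, so it only fails when every validated value deviates by more than the cut-off.

```python
def validated_pairs(r_calc, rfree_calc, r_header, rfree_header, legacy=False):
    """Collect the (recalculated, header) pairs that take part in the check.

    In legacy mode R is not validated; without an R-free value in the
    header only R is validated.
    """
    pairs = []
    if not legacy and r_header is not None:
        pairs.append((r_calc, r_header))
    if rfree_header is not None:
        pairs.append((rfree_calc, rfree_header))
    return pairs


def deviates(pairs, cutoff):
    """True if every validated value is more than cutoff from its header value."""
    return bool(pairs) and all(abs(calc - header) > cutoff
                               for calc, header in pairs)


def fallback_steps(twinning_detected, refined_with_tls):
    """Attempts made, in order, when the first check (cut-off 0.05) fails."""
    steps = []
    if twinning_detected:
        steps.append("de-twinning")
    steps.append("rigid-body refinement (10 cycles)")
    if refined_with_tls:
        steps.append("TLS refinement (5 cycles)")
    return steps


if __name__ == "__main__":
    pairs = validated_pairs(0.26, 0.31, 0.19, 0.24)
    print(deviates(pairs, 0.05))   # first check, cut-off 0.05
    print(deviates(pairs, 0.10))   # final check, cut-off 0.10
    print(fallback_steps(twinning_detected=True, refined_with_tls=False))
```

In the procedure each fallback step is followed by a new test, and later steps are only tried if the deviation is still too large; the sketch only lists the order.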

The recalculated values for R and R-free will be used to monitor the refinement success. An R-free Z-score is calculated based on the expected 'unbiased' R-free/R ratio. This ratio will also be used to monitor the re-refinement success. If a new R-free set was created, the calculated R-free value is unreliable. Therefore the 'unbiased' R-free value will be used to monitor refinement success.
The bond length and bond angle RMSZ values given by Refmac are also extracted. If one of these values is greater than 1.000 (i.e. the PDB file had poor geometry), the geometric cut-offs used to check the refinement are relaxed.
If any chirality errors are detected by Refmac, the program chiron is used to fix the errors caused by atom naming problems.

Setting refinement parameters

A number of parameters have to be set to ensure the refinement runs properly. If it was not tested in the previous step, Refmac is used to check whether or not the data should be detwinned during refinement. The parameters for the solvent mask (probe sizes and the shrinkage factor) are optimised using a grid search in Refmac.
An appropriate B-factor model, i.e. overall B-factor, isotropic B-factors or anisotropic B-factors, is chosen. When there are more than 18 (work set) reflections per atom, we always choose to use anisotropic B-factors. Between 18 and 13.5 reflections per atom, both isotropic and anisotropic B-factors are tested by doing refinement with automated weighting. The program Bselect is used to select the best B-factor model based on a Hamilton test and some additional validation. Between 13.5 and 3 reflections per atom, isotropic B-factors are used. When there are fewer than 3 reflections per atom, both one overall B-factor and individual isotropic B-factors are tested by refinement. The model selection is again performed by Bselect.
When isotropic or overall B-factors are used, a TLS model must be selected. The B-factor is reset to the Wilson B-factor and pure TLS refinement is performed with the TLS group selections made by extractor. The optimal TLS group selection is chosen by the program picker and the program TLSanl is used to reduce the B-factors in the output PDB file to residual values. The new PDB file and the output TLS tensors are used as input for the subsequent refinements. In rare cases TLS refinement is not able to bring down R-free after resetting the B-factors. If this happens, we stop using TLS in the rest of the PDB_REDO procedure.
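
The reflections-per-atom thresholds for the B-factor model can be summarised in a small function. This is a sketch only: the function name is hypothetical, and the treatment of the exact boundary values (18, 13.5 and 3) is an assumption, because the text does not say to which side they belong. When two candidates are returned, the choice between them is made by Bselect in the procedure and is not modelled here.

```python
def bfactor_candidates(reflections_per_atom):
    """Return the B-factor models to test for a given data-to-atom ratio.

    A single entry means the model is used directly; two entries mean
    that both are refined and compared.
    """
    if reflections_per_atom > 18:
        return ["anisotropic"]
    if reflections_per_atom >= 13.5:   # boundary handling is an assumption
        return ["isotropic", "anisotropic"]
    if reflections_per_atom >= 3:
        return ["isotropic"]
    return ["overall", "isotropic"]


if __name__ == "__main__":
    for ratio in (20.0, 15.0, 5.0, 2.0):
        print(ratio, bfactor_candidates(ratio))
```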

The number of refinement cycles depends on many factors. If all is well, 10 cycles of TLS and 20 cycles of restrained refinement are used or just 25 cycles of restrained refinement in case of anisotropic Bs (convergence is a bit slower).
If the calculated R-free is lower than R, the difference between the recalculated R and R-free is less than a third of the difference reported in the PDB header, or the R-free Z-score is greater than 10.0, there may be something wrong with the deposited R-free set. In that case the number of refinement cycles is increased to ensure convergence: 15 cycles TLS + 30 cycles restrained (isotropic Bs) or 40 cycles (anisotropic Bs).
If no R-free set was deposited, the same number of cycles is used and also the B-factors are reset to the Wilson B (or the average B for structure models with resolution worse than 4.00Å).
If PDB_REDO runs in legacy mode, we expect the refinement to need more cycles to reach convergence: 20 cycles TLS + 50 cycles restrained (isotropic Bs) or 60 cycles (anisotropic Bs).
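
The cycle counts above can be collected in one lookup. This is a sketch under stated assumptions: the function names are hypothetical, legacy mode is assumed to take precedence over the other conditions, and the anisotropic case is assumed to use no TLS cycles because the text only lists restrained cycles for it.

```python
def free_set_suspect(r_calc, rfree_calc, r_header, rfree_header, z_score):
    """True if the deposited R-free set looks unreliable by the three tests."""
    gap_calc = rfree_calc - r_calc
    gap_header = rfree_header - r_header
    return (rfree_calc < r_calc
            or gap_calc < gap_header / 3
            or z_score > 10.0)


def refinement_cycles(anisotropic, suspect_free_set=False,
                      no_free_set=False, legacy=False):
    """Return (TLS cycles, restrained refinement cycles)."""
    if legacy:                               # assumed to override the rest
        tls, iso, aniso = 20, 50, 60
    elif suspect_free_set or no_free_set:
        tls, iso, aniso = 15, 30, 40
    else:
        tls, iso, aniso = 10, 20, 25
    if anisotropic:
        return (0, aniso)
    return (tls, iso)


if __name__ == "__main__":
    suspect = free_set_suspect(0.20, 0.21, 0.19, 0.25, z_score=2.0)
    print(suspect, refinement_cycles(anisotropic=False, suspect_free_set=suspect))
```

Resetting the B-factors when no R-free set was deposited is a separate action and is not part of this lookup.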

B-factor weight optimisation

If individual (an)isotropic B-factors are used, the next step of the PDB_REDO procedure is establishing the right weight for the B-factor restraints (WBSKAL in Refmac). The optimal weight cannot be predicted based on what we know about the PDB entry. We therefore try up to seven weights (ranging from 2.50 to 0.10) based on the resolution and the number of reflections per atom. Each weight is tested by performing 10 cycles of restrained refinement (or 15 cycles of restrained refinement when using anisotropic Bs) with fixed TLS tensors (if applicable). For detailed settings refer to the PDB_REDO script.
The optimal weight is selected based on the free likelihood and a few cut-offs (see the selection section for details). If no weight gives acceptable results, the default value for WBSKAL (1.00) is used.

The re-refinement

The next step is establishing the right weights for the X-ray terms relative to the geometric restraints, the 'MATRIX' setting in Refmac. The optimal weight is resolution dependent, but the correlation is very poor. We therefore try a number of weights ranging from 5.00 to 0.001 depending on the X-ray resolution. For atomic resolution (i.e. 1.20Å or higher) structure models, the maximum weight is 5.00 and the lowest is 0.70. For high resolution (i.e. 1.21Å to 1.70Å) models, the highest weight is 2.00 and the lowest 0.10. For medium resolution (i.e. 1.71Å to 2.79Å) models, the highest weight is 0.70 and the lowest 0.01. Weights ranging from 0.10 to 0.001 are used for the remaining structure models.
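
The resolution bins for the X-ray weight can be written as a lookup. The function name is hypothetical, and the sketch only returns the highest and lowest weight per bin; which intermediate weights are tried is not stated in the text and is therefore not modelled. Resolutions are assumed to be given in Å with two decimals, as in the text.

```python
def matrix_weight_range(resolution):
    """Return (highest, lowest) MATRIX weight for a resolution in Angstrom."""
    if resolution <= 1.20:      # atomic resolution
        return (5.00, 0.70)
    if resolution <= 1.70:      # high resolution
        return (2.00, 0.10)
    if resolution <= 2.79:      # medium resolution
        return (0.70, 0.01)
    return (0.10, 0.001)        # remaining structure models


if __name__ == "__main__":
    for res in (1.00, 1.50, 2.20, 3.10):
        print(res, matrix_weight_range(res))
```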

The actual re-refinement consists of a number of refinement runs with selected matrix weights. The previously established B-factor model, B-factor restraint weight and solvent model are used. Automatic de-twinning and TLS are used when they were shown to be effective. Local NCS restraints are used at any resolution, jelly body restraints at low resolution. For detailed settings refer to the PDB_REDO script.

Selection

We now have a set of re-refined structure models. The optimal model is selected using these rules:

The optimal re-refined model is validated with WHAT_CHECK.

Rebuilding

We now have a conservatively optimised structure model (or the original PDB) and new maps. Using these maps we can rebuild the structure model. The program centrifuge is used to remove all waters that do not fit the maps. We then apply peptide flips using pepflip if the fit with the maps and some geometry (e.g. Ramachandran) scores improve. Peptides in the middle of secondary structure elements (as assigned by DSSP) [...] The next step is rebuilding existing side chains and adding the missing ones with SideAide. When the fit with the maps can be improved, a new side chain is built; otherwise the original coordinates are kept. Missing side chains are always built. Side chains involved in LINKs are never replaced. Waters that are in the way of rebuilt side chains are removed unless they are involved in LINKs.

The output structure model is validated in WHAT_CHECK to find side chains that need flipping because of the PDB standard (TYR, PHE, ASP and GLU), standard geometry (ARG) or hydrogen bonding (HIS, ASN and GLN) and side chains with chirality problems (LEU, VAL, ILE and THR). The structure model is then fed into SideAide once more to do the flips and replace all side chains with chirality problems.

More refinement

The rebuilt structure model has to be refined one final time. Because the model may have been substantially altered, the X-ray weight is optimised a bit more: the previously established weight is tried as well as a slightly higher and a slightly lower value. The optimal model is selected using the same criteria as before. If none of the models makes the cut, the one refined with the previously established X-ray weight is chosen. This way, there will always be a new (fully optimised) model even if this means a (slight) increase of R-free.
If the re-refinement did not return an optimal X-ray weight and a new structure model, the refinement in this phase is performed with auto-weighting.

Validation and finalisation

If the test set is very small (i.e. with fewer than 500 reflections), full cross validation is performed and the average R(-free) plus standard deviations are calculated. The program FoldX is used to compare the stability of the original, re-refined, and fully optimised structure models by estimating their (folding) Gibbs energy. The program YASARA is used to record the changes PDB_REDO made to the structure model and to create ready-made 3D scenes of the model coloured by atomic shift and by TLS group. The fully optimised model is also validated with WHAT_CHECK. The validation scores and all important values from the previous steps are combined into a single webpage with links to relevant databases. The conservatively optimised and the fully optimised structure models are of course available for download together with MTZ files to calculate electron density maps, YASARA scenes, and a few other relevant files.