Structure Validation
By the early 1990s it had become clear that it was possible to build and refine atomic structures that had reasonably good crystallographic R-values but that nonetheless contained serious structural errors. This motivated the development of programs for evaluating the correctness of structures that were either undergoing atomic refinement or had already been deposited in the PDB. Among the first was an approach by Luthy and Bowie in David Eisenberg’s group (Luthy, Bowie, and Eisenberg (1992) Nature 356, 83-85), named Verify3D, which assessed the degree to which the environment (polarity, for example) around each amino acid in a structure was statistically consistent with the amino acid type at that position. Shortly thereafter, other programs were developed that took the analysis from the level of amino acid residues to the atomic level. One of those was ERRAT, written by Chris Colovos as an undergraduate in the Yeates group (Colovos and Yeates (1993) Verification of protein structures: patterns of non-bonded atomic interactions. Protein Sci. 2, 1511-1519).
The ERRAT program classifies atoms into three types (C, N, and O) and asks whether the distribution of non-bonded interactions between atoms in a candidate structure matches the distribution established from a database of reliable, high-resolution structures. A structure is evaluated in a 9-residue sliding window. For each such window, the number of interactions of each of the six possible types (CC, CN, CO, NN, NO, OO) is totaled. These six counts are then converted to fractional values by dividing by the total number of interactions. Because the six normalized values sum to unity, only five of them are independent, so the values span a five-dimensional space. The distribution of interaction frequencies for correct structures can therefore be characterized as a generalized Gaussian function in five-dimensional space.
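The counting and normalization step described above can be sketched in a few lines of Python. This is a minimal illustration, not the ERRAT source code: the function name `interaction_fractions`, the atom tuple layout, and the `bonded` argument (a caller-supplied set of atom-index pairs to exclude as covalently bonded) are all assumptions made for the example. An interaction is counted when at least one atom lies in the window and the pair is closer than the distance cutoff (3.5 Å in the original program).

```python
from itertools import combinations
from math import dist

PAIR_TYPES = ("CC", "CN", "CO", "NN", "NO", "OO")

def interaction_fractions(atoms, window, bonded=frozenset(), cutoff=3.5):
    """Return the six interaction fractions for one residue window.

    atoms  : list of (element, residue_index, (x, y, z)) tuples
    window : container of residue indices, e.g. range(1, 10) for 9 residues
    bonded : set of frozenset({i, j}) atom-index pairs to skip (assumed input)
    cutoff : distance cutoff in angstroms
    """
    counts = dict.fromkeys(PAIR_TYPES, 0)
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        if a[0] not in "CNO" or b[0] not in "CNO":
            continue  # only C, N, and O atoms are classified
        if a[1] not in window and b[1] not in window:
            continue  # at least one atom must belong to the window
        if frozenset((i, j)) in bonded:
            continue  # skip covalently bonded pairs
        if dist(a[2], b[2]) < cutoff:
            counts["".join(sorted(a[0] + b[0]))] += 1
    total = sum(counts.values())
    if total == 0:
        return None  # no interactions: fractions are undefined
    return {pair: n / total for pair, n in counts.items()}

# Toy coordinates, invented purely for illustration.
atoms = [
    ("C", 1, (0.0, 0.0, 0.0)),
    ("N", 1, (1.3, 0.0, 0.0)),
    ("O", 2, (4.5, 0.0, 0.0)),
    ("C", 5, (0.0, 3.0, 0.0)),
    ("N", 20, (50.0, 50.0, 50.0)),
]
fractions = interaction_fractions(
    atoms, window=range(1, 10), bonded={frozenset((0, 1))}
)
print(fractions)
```

Because the six fractions always sum to one, any five of them determine the sixth, which is why the reference distribution lives in a five-dimensional space.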
Based on this distribution, the interaction frequencies observed for a candidate structure can be evaluated for the likelihood that they could have been drawn at random from the correct distribution. For each 9-residue window, the atomic interactions tabulated are all those that involve at least one atom from that window and whose interatomic distance is less than a cutoff of 3.5 Å. The ERRAT program was found to be very effective in identifying erroneous regions of model structures, and to be particularly useful during the process of building and refining crystal structures.
One weakness of the program was a high sensitivity to small deviations in atomic positions, owing mainly to the discontinuous nature of the error function arising from the distance cutoff. Around 2002, another undergraduate student, Dennis Obukov, rewrote ERRAT with a continuous distance weighting scheme, which led to more stable and robust behavior. That version of ERRAT replaced the original one and remains accessible through a web server at UCLA. Recent incarnations of statistical or knowledge-based atomic-level energy functions for evaluating protein structures include Zhou’s program DFIRE.
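The sensitivity caused by a hard distance cutoff, and the way a continuous weighting removes it, can be illustrated with a small Python sketch. The weighting function actually used in the rewritten ERRAT is not specified here, so the linear ramp below, together with its `inner` and `outer` parameters, is a purely hypothetical stand-in chosen to show the principle.

```python
def hard_weight(d, cutoff=3.5):
    """Original-style counting: an interaction is either in (1) or out (0)."""
    return 1.0 if d < cutoff else 0.0

def ramp_weight(d, inner=3.0, outer=4.0):
    """Hypothetical continuous weight: 1 below `inner`, 0 above `outer`,
    falling linearly in between. Not the actual ERRAT weighting function."""
    if d <= inner:
        return 1.0
    if d >= outer:
        return 0.0
    return (outer - d) / (outer - inner)

# Two nearly identical distances that straddle the 3.5 angstrom cutoff.
d_before, d_after = 3.49, 3.51

hard_jump = hard_weight(d_before) - hard_weight(d_after)
ramp_jump = ramp_weight(d_before) - ramp_weight(d_after)

print(f"hard cutoff change: {hard_jump:.2f}")
print(f"ramp weight change: {ramp_jump:.2f}")
```

With the hard cutoff, a shift of 0.02 Å switches the interaction from fully counted to not counted at all, whereas under a continuous weight the same shift changes the contribution only slightly.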