Results - 2013

Absolute Quality Assessment of Protein Structures

Protein structures are frequently used to explain biological processes on the nanoscale. Although only a limited set of experimental protein structures of medical and biological relevance is available, the acceptance of theoretical protein models in the life sciences is low. With methods such as homology modeling, a prediction within experimental resolution can very often be made for highly homologous sequences. In the absence of homology, there are isolated cases in which protein models close to the native conformation have been constructed. However, in the grey area of intermediate degrees of homology, little is known about the quality of protein models, in particular those generated by fully automated servers. Methods for absolute quality assessment of protein structures would therefore go a long way towards increasing the acceptance of theoretical models in life-science research.

We therefore developed a protocol for absolute quality assessment based on the concept of marginal stability of proteins. We hypothesized that, in the biologically active state, every amino acid must make an optimal energy contribution to the global protein structure. We collected statistics of these energy contributions for a set of high-resolution protein structures and derived an N-dimensional statistical test that assesses the quality of a protein model by comparing it against these statistics. We found that the energy statistics of amino acids in their folded state differ from those of low quality protein models. By introducing energy statistics of triplets of amino acids, we increased the specificity of our method and rejected 93% of the low quality protein models for 160 proteins tested. The remaining 7% of the protein models were found to be either oligomeric, not globular, or bound to cofactors, all classes of proteins for which the initial hypothesis was bound to fail.

Given the present state of the art, it is important to develop methods that reject false positives with high certainty, even if they work only for a specific subclass of proteins. There are bioinformatics-based approaches that can predict with high reliability whether, for a given protein sequence, a model belongs to one of the classes that our present algorithm cannot discriminate well. In combination with these techniques, we hope that our approach will serve as a prototype for the further development of quality assessment protocols and will help the acceptance of those models that are evaluated positively.

The following graphs compare the Template Modeling (TM) Score of our relaxed structures to the newly introduced Quality Threshold Criterion (T²).

A True Positive (TP) denotes a correctly identified native structure. A True Negative (TN) denotes a correctly identified low quality model. A False Negative (FN) denotes a misclassified native structure. A False Positive (FP) denotes a misclassified low quality model. Good statistical tests feature data points in the green regions and no data points in the red regions. The True Negative Ratio (TNR) is the number of structures observed in the upper left region (TN) divided by the number of all low quality decoys, i.e. the structures in the lower left region (FP) plus those in the upper left region (TN). It measures the quality of the discrimination for a single decoy set.
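As a minimal sketch of these definitions, the four outcomes and the TNR can be counted as follows. The names `Confusion` and `count_outcomes` are hypothetical and not part of the original protocol; the sketch assumes that each model is already labeled as native or low quality, and that a model is accepted as native when its T² value lies below a chosen threshold (a value exactly at the threshold is treated as rejected, which is an assumption).

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts of the four classification outcomes (hypothetical helper)."""
    tp: int = 0  # native structure, accepted (T² below threshold)
    fn: int = 0  # native structure, rejected
    tn: int = 0  # low quality model, rejected (T² at or above threshold)
    fp: int = 0  # low quality model, accepted

    @property
    def tnr(self) -> float:
        """True Negative Ratio: TN / (TN + FP)."""
        decoys = self.tn + self.fp
        if decoys == 0:
            raise ValueError("decoy set contains no low quality models")
        return self.tn / decoys


def count_outcomes(models, t2_threshold):
    """Count outcomes for an iterable of (is_native, t2) pairs."""
    result = Confusion()
    for is_native, t2 in models:
        accepted = t2 < t2_threshold  # assumption: strict comparison
        if is_native and accepted:
            result.tp += 1
        elif is_native:
            result.fn += 1
        elif accepted:
            result.fp += 1
        else:
            result.tn += 1
    return result


# Made-up illustrative values, not data from the study.
example = [(True, 0.9), (True, 1.2), (False, 1.5),
           (False, 2.0), (False, 1.0), (False, 3.1)]
outcome = count_outcomes(example, t2_threshold=1.05)
print(outcome, outcome.tnr)
```

Because the TNR only involves the low quality models, it can be evaluated for a decoy set even when very few native structures are available.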

The critical T² threshold was set to 1.05 (blue line). Relaxed native structures are encircled in red. On the right side, single amino acid statistics were used. With these, it is not possible to differentiate native and decoy structures by their T² value, although the decoy structures have a slightly lower TM-Score. Using amino acid triplets, we can perfectly distinguish native and decoy structures by the T² value, as shown on the left side.

a) Combined quality assessment results for the low and high quality models. The bulk of bad decoy structures (TM-Score below 0.7) are situated above the critical T² value and are therefore recognized. As far more bad decoys were sampled than native structures, the native structures are not visible in this figure. Especially for native-like decoys (TM-Score between 0.7 and 0.94), structures are observed crossing the critical T² score. These structures can be considered neither native protein structures nor bad ones, as many of them lie within experimental resolution or differ due to native protein flexibility.

b) Quality assessment results for all high quality protein structures considered native (TM-Score above 0.94). For TM-Scores above 0.95, most of the protein structures are observed below the critical T² score and are therefore correctly classified as native. It is expected that many native-like structures lie above the critical T² threshold: steric overlap can lead to very large non-physical energies while changing the topology only slightly.

a and b: The red bar denotes the critical T² below which a model is recognized as native.
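The TM-Score ranges used in these captions can be sketched as a simple banding step followed by the T² check. This is an illustrative sketch only: the function names `tm_band` and `summarize` are hypothetical, the handling of scores exactly at 0.7 and 0.94 is an assumption (both are placed in the native-like band), and the default critical T² of 1.05 is the value quoted for the figures.

```python
BAD_BELOW = 0.7      # TM-Score below this: bad decoy
NATIVE_ABOVE = 0.94  # TM-Score above this: considered native
CRITICAL_T2 = 1.05   # models below this T² are recognized as native


def tm_band(tm_score):
    """Assign a TM-Score to one of the three ranges named in the captions."""
    if tm_score < BAD_BELOW:
        return "bad"
    if tm_score <= NATIVE_ABOVE:  # assumption: boundaries are inclusive here
        return "native-like"
    return "native"


def summarize(models, critical_t2=CRITICAL_T2):
    """For (tm_score, t2) pairs, return per band a tuple
    (number recognized as native, total number in the band)."""
    counts = {"bad": [0, 0], "native-like": [0, 0], "native": [0, 0]}
    for tm_score, t2 in models:
        entry = counts[tm_band(tm_score)]
        entry[1] += 1
        if t2 < critical_t2:
            entry[0] += 1
    return {band: tuple(values) for band, values in counts.items()}


# Made-up illustrative values, not data from the study.
print(summarize([(0.5, 2.0), (0.8, 1.0), (0.8, 1.3), (0.97, 0.9)]))
```

Reporting the native-like band separately reflects the point made in caption a): structures in that range are neither clearly native nor clearly bad, so counting them as plain errors in either direction would be misleading.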

Further information is available in the thesis of Timo Strunk, "High-Throughput Atomistic Modeling of Biomolecular Structure and Association".