How to detect a forged PDB
Most of you have probably heard about the recent retraction of 12 structures (1BEF, 1CMW, 1DF9/2QID, 1G40, 1G44, 1L6L, 2OU1, 1RID, 1Y8E, 2A01, and 2HR0) from the PDB. The affair, also known as “Structuregate”, was first reported by the University of Alabama at Birmingham and then covered by structural biology blogs such as p212121 and ByteSizeBio. One question that arises: was this preventable?
Soon after the story broke, Gerard Kleywegt posted an official statement on behalf of the wwPDB on the Retraction of “UAB PDB entries”, stating the PDB's official policy and noting that:
wwPDB has convened expert, community-driven Validation Task Forces for X-ray (in 2008) and NMR (in 2009) to advise on the most suitable criteria to use for validating structure entries (model, data and fit of model to data) when they are deposited. The recommendations of these task forces will be implemented as part of the deposition and annotation procedures of the wwPDB partners.
To see whether these procedures would have caught the culprit, we used the validation step of the PDB AutoDep Input Tool (ADIT), which appears to run the PROCHECK and MolProbity software on the deposited structure. The results were quite convincing! Below is a table summarizing MolProbity’s results for structure 1BEF, which was the first to be retracted:
| Category | Metric | Value | Goal / percentile |
|---|---|---|---|
| All-Atom Contacts | Clashscore, all atoms | 110.32 | 0th percentile* (N=576, 2.10Å ± 0.25Å) |
| | Clashscore is the number of serious steric overlaps (> 0.4 Å) per 1000 atoms. | | |
| Protein Geometry | Poor rotamers | 18.57% | Goal: <1% |
| | Ramachandran outliers | 2.86% | Goal: <0.2% |
| | Ramachandran favored | 89.14% | Goal: >98% |
| | Cβ deviations >0.25Å | 1 | Goal: 0 |
| | MolProbity score^ | 4.04 | 1st percentile* (N=11758, 2.10Å ± 0.25Å) |
| | Residues with bad bonds | 0.00% | Goal: 0% |
| | Residues with bad angles | 0.00% | Goal: <0.1% |
* 100th percentile is the best among structures of comparable resolution; 0th percentile is the worst.
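To make the comparison against the goals explicit, here is a minimal Python sketch that checks the 1BEF values from the table above against the stated goals and lists the ones that fail. The dictionary keys, the `failed_checks` and `clashscore` function names, and the data layout are assumptions made for illustration; none of this is part of MolProbity or ADIT. The numbers are copied from the table.

```python
import operator

# Comparison operators used to express each goal.
OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le}

# Goals as listed in the MolProbity table (key names are hypothetical).
GOALS = {
    "poor_rotamers_pct": ("<", 1.0),
    "ramachandran_outliers_pct": ("<", 0.2),
    "ramachandran_favored_pct": (">", 98.0),
    "cbeta_deviations": ("<=", 0),
    "bad_bonds_pct": ("<=", 0.0),
    "bad_angles_pct": ("<", 0.1),
}

# Values reported for 1BEF in the table.
ONE_BEF = {
    "poor_rotamers_pct": 18.57,
    "ramachandran_outliers_pct": 2.86,
    "ramachandran_favored_pct": 89.14,
    "cbeta_deviations": 1,
    "bad_bonds_pct": 0.0,
    "bad_angles_pct": 0.0,
}


def failed_checks(values):
    """Return the names of all metrics that do not meet their goal."""
    return [
        name
        for name, (op, goal) in GOALS.items()
        if not OPS[op](values[name], goal)
    ]


def clashscore(n_overlaps, n_atoms):
    """Serious steric overlaps (> 0.4 Å) per 1000 atoms, per the table's definition."""
    return 1000.0 * n_overlaps / n_atoms


if __name__ == "__main__":
    for name in failed_checks(ONE_BEF):
        print("FAILED:", name, "=", ONE_BEF[name])
```

With the table's numbers, four of the six geometry goals fail (rotamers, both Ramachandran criteria, and Cβ deviations), while bond lengths and angles pass. That is a reminder that ideal bonds and angles alone say little about whether a model is plausible.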
RosettaHoles, by Will Sheffler, is a Rosetta protocol designed to assess protein core packing. It was originally designed to select successful protein designs but has been shown to be useful for structural validation as well. In the paper published a year ago, RosettaHoles was tested against the entire PDB and detected 7 of the 12 retracted structures as outliers.
What other validation tools would have done the job? Another interesting question: starting from the (apparently) falsified coordinates, could molecular modeling reconstruct the correct structure? Could modeling at some point serve as the validation itself?