TR-09-03 Calculating All Pairwise Similarities from the RCSB Protein Data Bank: Client/Server Work Distribution on the Open Science Grid

Chris Bizon, Andreas Prlic. Calculating All Pairwise Similarities from the RCSB Protein Data Bank: Client/Server Work Distribution on the Open Science Grid, Technical Report TR-09-03, RENCI, North Carolina, December 2009.

Proteins can be similar to varying degrees. If two proteins show high similarity in their amino acid sequences, it is generally assumed that they are closely related evolutionarily. With increasing evolutionary distance the degree of sequence similarity usually drops, but proteins can still show similar activity in the cell and have a similar overall 3D structure even when sequence similarity is low. Detecting such remote similarities is important for inferring functional and evolutionary relationships between protein families and is a core technique in protein structure bioinformatics. The goal is to establish regions of equivalence between two or more molecules.

The RCSB Protein Data Bank (PDB) is a leading primary database that provides access to experimentally determined structures of proteins, nucleic acids, and complex assemblies. The PDB is a vital part of the infrastructure supporting biomedical science worldwide and is used by around 200,000 unique scientists per month.

While protein sequence comparisons can be computed quickly, calculating protein structure alignments is much more time-consuming. The RCSB PDB has recently started adding tools to its site that allow users to quickly identify protein sequence neighbors and run pairwise protein structure comparisons. The goal of this project is to provide a pre-calculated set of all vs. all 3D protein structure alignments, so that users can also quickly identify more distant 3D relationships.
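
To give a feel for the size of an all vs. all calculation, the following is a minimal Python sketch of how the pairs might be enumerated and grouped into work units for distribution to clients. It is an illustration only, not the implementation described in this report. It assumes that a comparison of A with B does not need to be repeated as B with A, so n structures give n(n-1)/2 pairs. The function names (`all_pairs`, `work_units`, `pair_count`), the batch size, and the identifiers in the usage example are hypothetical.

```python
from itertools import combinations, islice
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


def pair_count(n: int) -> int:
    """Number of unordered pairs among n structures: n(n-1)/2."""
    return n * (n - 1) // 2


def all_pairs(ids: Iterable[str]) -> Iterator[Pair]:
    """Yield each unordered pair of distinct identifiers exactly once.

    Assumption: the comparison is treated as symmetric, so (a, b) and
    (b, a) are the same job. Identifiers are de-duplicated and sorted
    so the enumeration order is reproducible.
    """
    return combinations(sorted(set(ids)), 2)


def work_units(pairs: Iterable[Pair], size: int) -> Iterator[List[Pair]]:
    """Group a stream of pairs into lists of at most `size` pairs.

    Each list is one hypothetical unit of work that a server could
    hand to a client. The stream is consumed lazily, so the full set
    of pairs is never held in memory at once.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    stream = iter(pairs)
    while True:
        batch = list(islice(stream, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Placeholder identifiers, not real PDB entries.
    structures = ["s1", "s2", "s3", "s4"]
    print("pairs to compute:", pair_count(len(structures)))
    for number, unit in enumerate(work_units(all_pairs(structures), size=4)):
        print("work unit", number, unit)
```

Because the number of pairs grows quadratically with the number of structures, the batch size is the main tuning knob in a sketch like this: larger units mean fewer requests to the server, while smaller units lose less work when a client job fails.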