TR-10-02 Using high performance computing and domain-based functional annotation of proteins to enhance discovery of novel proteins, identify functional homology, and characterize phylogenetic relatedness

Jeffrey L Tilson, Gloria Rendon, Eric Jakobsson. Using high performance computing and domain-based functional annotation of proteins to enhance discovery of novel proteins, identify functional homology, and characterize phylogenetic relatedness, Technical Report TR-10-02, RENCI, North Carolina, June 2010.

Background
Next-generation sequencing technology is putting significant pressure on computational researchers to implement software tools for analysis (identification, annotation, homology/orthology assignment, phylogeny, etc.) of genes and gene products “on-the-fly”, in parallel with the sequencing machines. This requires both leveraging supercomputing systems and adopting alternative kinds of analyses. We seek to contribute to the solution of these problems by deploying high-speed, explicitly functional, domain-based solutions through a system called MotifNetwork. We present select case studies of domain-based approaches to gene analysis, ranging from homology assessment to phylogeny reconstruction to pangenomic analysis, as a demonstration of the potential benefits of such approaches. For these analyses, we used grid computing to enable the computations necessary to apply these techniques to genome-size systems.

Results
We used MotifNetwork to apply functional domain-based methods to three biological test cases that represent broad biological areas of research.
First, we assess functional homology of over 3000 eukaryotic proteins with respect to the ligand-gated ion channel family by calculating domain-based similarity of genes with four different metrics: distinct-partners, inverse document coefficients, cumulative association coefficients, and the Jaccard function.
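
As an illustration of the last of these metrics, the following is a minimal sketch of a Jaccard comparison between domain sets. It assumes each protein is represented simply as a set of domain identifiers; the function names, protein names, and domain labels are hypothetical and do not reflect the MotifNetwork implementation.

```python
# Minimal sketch (not the MotifNetwork implementation): Jaccard similarity
# between the domain sets of two proteins. All names below are hypothetical.

def jaccard(domains_a, domains_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of domain identifiers."""
    a, b = set(domains_a), set(domains_b)
    union = a | b
    if not union:
        # Two proteins with no annotated domains: define similarity as 0.
        return 0.0
    return len(a & b) / len(union)


def rank_by_similarity(reference_domains, candidates):
    """Sort candidate proteins by descending Jaccard similarity to a reference.

    `candidates` maps a protein name to its collection of domain identifiers.
    """
    scored = [(name, jaccard(reference_domains, doms))
              for name, doms in candidates.items()]
    # Sort by score (highest first), then by name for a stable ordering.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Placeholder domain labels standing in for a reference family profile.
    reference = {"dom_ligand_binding", "dom_transmembrane"}
    candidates = {
        "protein_1": {"dom_ligand_binding", "dom_transmembrane"},
        "protein_2": {"dom_ligand_binding", "dom_kinase"},
        "protein_3": {"dom_kinase"},
    }
    for name, score in rank_by_similarity(reference, candidates):
        print(name, round(score, 3))
```

A score of 1 means the two proteins carry identical domain sets and a score of 0 means they share no domains; any threshold for calling two proteins functional homologs would be a separate choice not specified here.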

Second, we illustrate a methodology for predicting phylogenetic relatedness based on evolutionary domain analysis. It is applied to over 40 prokaryotic proteins that were identified as likely functional homologs with respect to the same family of ion channels.
Lastly, comparative genomics studies are conducted between H. sapiens and 23 different strains of E. coli. The domain-based pangenome of E. coli is analyzed and compared against that of H. sapiens in the context of drug target identification and potential side effects.
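
The set operations behind a domain-based pangenome comparison can be sketched as follows. This is an assumption-laden illustration rather than the method used in the report: it takes the pangenome to be the union of the domain sets of all strains and the core to be their intersection, and it then separates domains that also occur in a host domain set from those that do not. All function names, strain names, and domain labels are hypothetical.

```python
# Minimal sketch under stated assumptions (not the report's pipeline):
# pan = union of per-strain domain sets, core = their intersection.

def pan_and_core(strain_domains):
    """Return (pan, core) domain sets for a mapping of strain -> domains."""
    sets = [set(doms) for doms in strain_domains.values()]
    if not sets:
        return set(), set()
    pan = set().union(*sets)          # domains found in at least one strain
    core = set.intersection(*sets)    # domains found in every strain
    return pan, core


def split_by_host(domains, host_domains):
    """Split domains into (absent from host, shared with host)."""
    host = set(host_domains)
    domains = set(domains)
    return domains - host, domains & host


if __name__ == "__main__":
    # Placeholder data: two strains and a host, with made-up domain labels.
    strains = {
        "strain_1": {"d1", "d2", "d3"},
        "strain_2": {"d1", "d2", "d4"},
    }
    host = {"d2", "d9"}

    pan, core = pan_and_core(strains)
    bacteria_only, shared = split_by_host(core, host)
    print("pan:", sorted(pan))
    print("core:", sorted(core))
    print("core domains absent from host:", sorted(bacteria_only))
    print("core domains shared with host:", sorted(shared))
```

In a drug-targeting reading, domains present in every strain but absent from the host would be natural candidates, while shared domains flag possible side effects; this interpretation is an illustration and not a result of the report.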
Benchmarks of MotifNetwork indicate that execution times scale reasonably well on up to 256 processors, the number available to this work, and that our use of a data grid for storage of the results, as implemented with iRODS, is well suited to large-scale biological pipelines.

Conclusions
The combination of domain-based analyses and fast processing enabled by MotifNetwork should permit researchers to work more accurately and efficiently on a wide range of biological problems, and thus alleviate the bottlenecks that now exist between the sequencing of genes and their subsequent characterization. Our approach is especially suitable for biological problems that can be formulated as the identification of functional correspondences among a large set of proteins, such as the three illustrative examples discussed in the paper: functional homology and phylogenetic relatedness of the ligand-gated ion channel (LIC) family, and E. coli pangenomics.