TR-14-03 MaPSeq, A Computational and Analytical Workflow Manager for Downstream Genomic Sequencing

Genomic medicine is advancing at a remarkably fast pace, with major technological achievements such as next-generation genomic sequencing producing large-scale genomic data sets within a reasonable timeframe and cost (Mardis, 2008; Horvitz and Mitchell, 2010; Koboldt et al., 2010; Kahn, 2011). Yet large-scale computation on the gigabyte- to petabyte-scale data sets generated by massively parallel genomic sequencing projects remains enormously challenging. Indeed, the National Consortium for Data Science (Ahalt et al., 2014), the Global Alliance to Enable Responsible Sharing of Genomic and Clinical Data (2013), and the BD2K Data and Informatics Working Group, National Institutes of Health BD2K Initiative (2012) have recognized computational and analytical challenges as significant barriers to the advancement of genomic medicine.

Herein, we describe the Massively Parallel Sequencing (MaPSeq) system, an open source, secure, centralized, grid-based service-oriented architecture (SOA) that facilitates, manages, and executes the complex, project-specific computational and analytical steps that follow high-throughput genomic sequencing.