Factors Influencing Data Archival of Large-scale Genomic Data Sets: A mathematical formalism to comprehensively evaluate the costs-benefits of archiving large data sets
Kirk C. Wilhelmsen (RENCI), Charles P. Schmitt (RENCI), Karamarie Fecho (RENCI)
Technical Report TR-13-03, Renaissance Computing Institute, 2013
Next-generation genomic sequencing technologies and other high-throughput “-omics” technologies have enabled the rapid generation of large-scale data sets (Mardis, 2008; Koboldt et al., 2010). The costs of generating and storing these massive data sets have dropped precipitously, while computing power and storage capacity have risen in parallel (Horvitz and Mitchell, 2010; Kahn, 2011). These capabilities, coupled with new algorithms and analytical approaches for understanding and interpreting large-scale data (Horvitz and Mitchell, 2010; Koboldt et al., 2010), hold great promise to transform the field of genomics and realize the potential for personalized medicine.
However, these same capabilities raise questions about the downstream reuse of genomic data and the costs and benefits of data archiving. While investigators fully recognize the drop in sequencing and storage costs, they rarely consider the additional costs of archiving large genomic data sets or the secondary factors that may influence archival decisions. These hidden costs and factors include: re-generation of the biological sources of genomic data (e.g., blood samples collected through lengthy, often expensive, clinical research studies); degradation of stored biological samples; introduction of errors during re-generation of genomic data sources and/or re-sequencing; long-term curation; data compression; data degradation; introduction of new technologies and analytical approaches that may render stored data obsolete; data reuse needs; and time-related factors such as changes in reuse needs and in the costs of sequencing and storage.
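To make this trade-off concrete, the sketch below compares the cumulative cost of archiving a data set over a reuse horizon against the expected cost of re-generating it on demand. This is a minimal illustrative model, not the formalism developed in this report; every parameter (storage and sequencing prices, their annual decline rates, the curation overhead, the sample re-collection cost, and the probability of sample degradation) is a hypothetical placeholder chosen only to show the shape of the calculation.

```python
# Illustrative sketch (not this report's formalism): archive-vs-regenerate
# cost comparison. All parameter values are hypothetical placeholders.

def archival_cost(size_tb, years, cost_per_tb_year=50.0,
                  storage_decline=0.20, curation_per_year=500.0):
    """Cumulative cost of storing size_tb terabytes for the given number
    of years, assuming storage prices fall by storage_decline per year
    and a flat annual curation overhead."""
    total = 0.0
    rate = cost_per_tb_year
    for _ in range(years):
        total += size_tb * rate + curation_per_year
        rate *= 1.0 - storage_decline
    return total

def regeneration_cost(years, seq_cost_now=10_000.0, seq_decline=0.30,
                      sample_recollection=25_000.0, p_sample_degraded=0.10):
    """Expected cost of re-sequencing after the given number of years,
    assuming sequencing prices fall by seq_decline per year and a
    probability p_sample_degraded that the stored biological sample has
    degraded and must be re-collected (e.g., via a new clinical study)."""
    seq_cost_then = seq_cost_now * (1.0 - seq_decline) ** years
    return seq_cost_then + p_sample_degraded * sample_recollection

if __name__ == "__main__":
    horizon = 5  # years until the data might be reused
    archive = archival_cost(size_tb=2.0, years=horizon)
    regenerate = regeneration_cost(years=horizon)
    print(f"Archive for {horizon} years: ${archive:,.2f}")
    print(f"Expected re-generation cost: ${regenerate:,.2f}")
    print("Archiving" if archive < regenerate else "Re-generating",
          "appears cheaper under these assumptions.")
```

Under these placeholder values, archiving is cheaper over a five-year horizon, but the conclusion can flip if the reuse horizon lengthens, sequencing prices fall faster, or sample re-collection proves unnecessary; the point is simply that the decision turns on exactly the factors enumerated above.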