Big data is only getting bigger, and that can cause big problems for researchers who need to store and share their data. Twenty doctoral students and post-doctoral associates from across the country learned the tools and techniques to solve these problems at the inaugural Cyber Carpentry Workshop at the University of North Carolina at Chapel Hill. Sponsored by the National Science Foundation (NSF) and hosted by the UNC School of Information and Library Science (SILS), the two-week workshop in late July introduced students to a variety of applications, platforms, and processes for data life-cycle management and data-intensive computation. The Renaissance Computing Institute (RENCI) provided support for the workshop in the form of instructors and project management staff.
“Previously, you had maybe a thousand files, maybe ten thousand,” said Arcot Rajasekar, SILS professor and RENCI chief domain scientist in data grid technology. “Now, you’re talking about 100 million files and doing simulations and emulations that can create petabytes of data. Managing that just by human interaction is not going to be effective; you need some automation there. In addition to the volume of data, you have to consider the velocity of data coming in and the multiple varieties of data you’re collecting. This is not easily done without a good level of management.”
Though not affiliated with Software Carpentry or Data Carpentry, Cyber Carpentry organizers drew inspiration from those projects. The workshop at Carolina brought together data professionals, educators, and researchers from RENCI, the iRODS Consortium, SILS, the Odum Institute, the University of Arizona (CyVerse), Indiana University (Jetstream), the University of Virginia (Hydroshare), Drexel University, and Amazon (AWS) to teach the intensive two-week courses.
The workshop familiarized participants with the concepts of virtualization, automation, and federation as defined through the Datanet Federation Consortium (DFC), an NSF-funded project that promotes sharing within and across science and engineering disciplines. Instructors introduced specific DFC web portals, including CyVerse, Dataverse, DataONE, and Hydroshare, as well as relevant software, metadata management strategies, and large-scale workflows.
Participants learned the basics of the integrated Rule-Oriented Data System (iRODS), free, open-source software for data discovery, workflow automation, secure collaboration, and data virtualization that is used by research and business organizations around the globe. Housed at RENCI, the iRODS Consortium guides development and support of iRODS. Terrell Russell, iRODS chief technologist, and Hao Xu, a RENCI research scientist, both taught courses about iRODS during the two-week workshop.
“The students in this workshop are not yet in charge of securing federal funding and writing data management plans, but they’ll be there very soon,” said Russell. “We want them to know about the tools they’ll need when the time is right.”
The workshop drew students from across the country, with NSF funding providing travel and accommodation support. Anuja Majmundar, a doctoral student at the University of Southern California, said the Cyber Carpentry workshop offered a great opportunity for her to learn tools and procedures that could make data science more reproducible and scalable, especially for the diverse data streams she encounters in her research on health behaviors.
Jocelyn Colella, a PhD candidate in evolutionary genomics at the University of New Mexico, said gaining experience with containers – programs that can virtualize entire scientific workflows, including software, libraries, and data – was one of the highlights of her experience, and that the introduction to the Jetstream and CyVerse virtual environments had significant implications for her research.
“Coming from a smaller lab, it has been incredibly expensive to build the computing resources and data archival infrastructure necessary to deal with terabytes of genomic data,” she said. “Learning about the free computational and storage resources available through NSF-funded projects has revolutionized how I conceptualize my own workflows and will alter how I apply for grants going into the future.”