A Radical New Tack: True Collaboration

Data Commons Pilot Phase teams plan how a rising tide of data and tools can float all research boats

Last November the National Institutes of Health announced $9 million in pilot funding to explore feasibility and best practices for a new approach to advancing biomedical research. The initiative, known as Data Commons, is focused on making digital objects—that is, the data, models, and analytical tools that constitute the engine behind the modern research enterprise—available through collaborative platforms.

Plenty of data and tools exist, of course. But many are locked inside the computer systems of the institution that “owns” them, are too large to move around, are impossible to combine or compare with other data or tools, or are difficult to share due to privacy concerns, among many other challenges.  

Data Commons aims to change all that by making biomedical research data “FAIR”—findable, accessible, interoperable, and reusable. It’s not about a specific study or even a bunch of them, but about creating linkages that can advance the way all biomedical research is done. If every researcher has access to more—and more useful—data and tools, the thinking goes, we can dramatically accelerate discovery and innovation.

More than 360 people have been working hard over the past nine months to rapidly develop, test, and iterate on specific solutions and overarching strategies for advancing the Data Commons vision. A late summer meeting in Chapel Hill, NC, brought together dozens of team representatives to take stock of progress, lessons, and directions. Here are a few take-aways.

This isn’t Thunderdome

We all love a little healthy competition, but Data Commons isn’t the place for it. NIH’s Data Commons Pilot Phase program manager Vivien Bonazzi emphasized that, rather than multiple teams developing their own solutions and then seeing whose idea wins, Data Commons is built on the premise that meaningful collaboration is the key to solving the truly hairy problems posed by data FAIRness. There’s really just one fighter in the ring—one Commons at the end—and it’s going to be collaboratively produced by many.

One illustration of what this means in practice is the initiative known as DataSTAGE, a program of NIH’s National Heart, Lung, and Blood Institute (NHLBI). As NHLBI Chief Information Officer Alastair Thomson explained, DataSTAGE is designed to advance all of the same goals as Data Commons, albeit for the specific areas of biomedical research that are aligned with NHLBI’s mission. Instead of being separate from or a competitor to Data Commons, Thomson sees DataSTAGE as essentially an early instantiation of it—a test bed of sorts where Commons work can find ready testers and whose products will ultimately be absorbed into, and replaced by, Data Commons.

Get feedback, early and often

The meeting kicked off with a series of lightning talks on each team’s milestones and reflections. Participants noted that some of the most illuminating moments of the meeting, and indeed throughout the project, have happened when teams have had the chance to offer feedback and riff on each other’s work.

The trick, attendees reflected, is finding the sweet spot when feedback can be both useful and actionable—not so early that the basic ideas aren’t congealed but not so late that the product can’t be meaningfully changed. As RENCI director Stan Ahalt noted, deeply reviewing someone else’s work takes time, though the benefits for the program as a whole are well worth it. He suggested building in dedicated time for this review and feedback process as the effort moves forward.  

It’s about the product, not the PI

Notwithstanding the considerable brainpower in the room, Data Commons is decidedly not about promoting celebrity geniuses. Bonazzi stressed that for collaboration on this scale to succeed, the actual work products must take precedence. The Data Commons Pilot Phase is operated under a unique organizational structure in which experts from across the country lump and divide the work on multiple dimensions. Various aspects of the challenge are tackled from different angles by teams who are encouraged to regularly compare notes and harmonize their efforts.

This “teams of teams” structure keeps the focus where it should be—on generating results that will ultimately speed biomedical research and innovation for the benefit of the country and the world.

By Anne Johnson, Lead Science Writer at Creative Science Writing

RENCI Provides Insight on Data Science in Courtrooms

Stan Ahalt, Director, and Sarah Davis, Research Project Manager, attended the Science in the Courtroom Seminar for Resource Judges, held August 29-31, 2018, at the U.S. Court of Appeals for the Federal Circuit in Washington, DC. The seminar – organized by Franklin Zweig, Esq., of the National Courts and Sciences Institute and Dr. James Evans of the UNC Department of Genetics and Bryson Center for Judicial Science Education – is part of an ongoing science training program for state and federal judges from around the country, educating the judges to become resources on scientific issues for judges in their jurisdictions.

As The Honorable Pauline Newman, U.S. Court of Appeals for the Federal Circuit, noted, in the past judges had a passive role regarding science in the courtroom, allowing any expert to testify about whatever might be relevant to the trial. Increasingly, however, judges are being asked to rule on the qualifications of scientists to testify as well as the relevance and reliability of the testimony they intend to offer. This gatekeeping function is especially difficult when the judges do not have expertise in, or even a basic understanding of, complex scientific issues. Accordingly, judges at the seminar heard presentations on DNA and frontiers of genetic engineering, health care outcome research, and the neurobiology of violence and addiction. The judges also participated in mock hearings to put that knowledge to use. In one such hearing, a mock lawyer questioned two scientific experts in CRISPR genetic modification technology to determine whether a hospital could stop a couple’s medical team from using CRISPR to attempt to remove genes associated with Huntington’s Disease from a viable IVF embryo. The mock judge ruled that the hospital could not stop the procedure, a ruling with which most of the judges in the room seemed to agree.

Dr. Ahalt gave the final presentation of the seminar on Big Data: Promise and Peril for Courts (and Society). The presentation had two purposes: to introduce the judges to basic concepts in data science and to highlight some areas where data science issues might arise in the courtroom. For example, Dr. Ahalt proposed the following scenario: he is the plaintiff’s expert witness in a case where the plaintiff sues both his doctor and the company that created the medical AI platform (like Watson Health) on which the doctor relied in his diagnosis. Dr. Ahalt plans to testify regarding the inadequacy of the platform’s queries. What questions might the judges ask to determine whether he is qualified to testify and whether his testimony is reliable? One judge asked, “Inadequate compared to what?” That is, what was the platform being compared to? Another AI platform? The doctor?

RENCI plans to continue this important discussion regarding data science in the courtroom and in the legal field. We hope to develop a seminar, similar to the program held in Washington, DC, that will train judges to be knowledgeable about – and skeptical of – the data science, algorithms, and analytics that they will increasingly encounter in their courtrooms.

By Sarah Davis, Research Project Manager

RENCI participates in NSF Cyber Carpentry workshop to prepare early-career researchers

Big data is only getting bigger, and that can cause big problems for researchers who need to store and share their data. Twenty doctoral students and post-doctoral associates from across the country learned the tools and techniques to solve these problems at the inaugural Cyber Carpentry Workshop at the University of North Carolina at Chapel Hill. Sponsored by the National Science Foundation (NSF) and hosted by the UNC School of Information and Library Science (SILS), the two-week workshop in late July introduced students to a variety of applications, platforms, and processes for data life-cycle management and data-intensive computation. The Renaissance Computing Institute (RENCI) provided support for the workshop in the form of instructors and project management staff.

From left: Andres Espindola-Camacho from Oklahoma State University, Jeremy Thorpe from Johns Hopkins University School of Medicine, Gaurav Kandoi from Iowa State University, and Yingru Xu from Duke University discuss an issue with their team project.

“Previously, you had maybe a thousand files, maybe ten thousand,” said Arcot Rajasekar, SILS professor and RENCI chief domain scientist in data grid technology. “Now, you’re talking about 100 million files and doing simulations and emulations that can create petabytes of data. Managing that just by human interaction is not going to be effective; you need some automation there. In addition to the volume of data, you have to consider the velocity of data coming in and the multiple varieties of data you’re collecting. This is not easily done without a good level of management.”

Though not affiliated with Software Carpentry or Data Carpentry, Cyber Carpentry organizers drew inspiration from those projects. The workshop at Carolina brought together data professionals, educators, and researchers from RENCI, the iRODS Consortium, SILS, the Odum Institute, the University of Arizona (CyVerse), Indiana University (Jetstream), the University of Virginia (Hydroshare), Drexel University, and Amazon (AWS) to teach these intensive two-week courses.

The workshop familiarized participants with the concepts of virtualization, automation, and federation as defined through the Datanet Federation Consortium (DFC), an NSF-funded project that promotes sharing within and across science and engineering disciplines. Instructors introduced specific DFC web portals, including CyVerse, Dataverse, DataONE, and Hydroshare, as well as relevant software, metadata management strategies, and large-scale workflows.  

Participants learned the basics of the integrated Rule-Oriented Data System (iRODS), which is free open source software for data discovery, workflow automation, secure collaboration, and data virtualization used by research and business organizations around the globe. Housed at RENCI, the iRODS Consortium guides development and support of iRODS. Terrell Russell, iRODS chief technologist, and Hao Xu, a RENCI research scientist, both taught courses about iRODS during the two-week workshop.

“The students in this workshop are not yet in charge of securing federal funding and writing data management plans, but they’ll be there very soon,” said Russell. “We want them to know about the tools they’ll need when the time is right.”

iRODS Chief Technologist Terrell Russell discusses the capabilities of the open source data management software with Cyber Carpentry participants.

The workshop drew students from across the country, with NSF funding providing travel and accommodation support. Anuja Majmundar, a doctoral student at the University of Southern California, said the Cyber Carpentry workshop offered a great opportunity for her to learn tools and procedures that could make data science more reproducible and scalable, especially for the diverse data streams she encounters in her research on health behaviors.

Jocelyn Colella, a PhD candidate in evolutionary genomics at the University of New Mexico, said gaining experience with containers – programs that can virtualize entire scientific workflows, including software, libraries, and data – was one of the highlights of her experience, and that the introduction to the Jetstream and CyVerse virtual environments had significant implications for her research.

“Coming from a smaller lab, it has been incredibly expensive to build the computing resources and data archival infrastructure necessary to deal with terabytes of genomic data,” she said. “Learning about the free computational and storage resources available through NSF-funded projects has revolutionized how I conceptualize my own workflows and will alter how I apply for grants going into the future.”

This workshop was funded by the NSF Cyber Training program. Look for information about the 2019 summer workshop at cybercarpentry.web.unc.edu

Tracking the story of the ENIAC programmers

Jean Jennings Bartik (left) and Frances Bilas Smith in 1946 with ENIAC, the world’s first all-electronic computer. Photo credit: Computer History Museum

Six women who changed computing finally get their day in the spotlight.

More than 70 years ago, six brilliant mathematicians came to Philadelphia to take part in a secret U.S. Army project designed to help the Allies win World War II. These young pioneers of the computing age learned to program using only logical diagrams and their considerable talents—no programming languages or tools existed to help them.

