Why Data Commons? Because scientists want to focus on science, not infrastructure


ESIP meeting participants discuss the challenges of a Data Commons at their recent summer meeting in Durham, NC.

After more than 25 years as a science communicator, I’ve come to recognize the one thing that all scientists, regardless of discipline, yearn for. It’s not an endless stream of funding or public appreciation for their work (although both would be nice).

Most scientists simply want to concentrate on their science rather than on the tools, technologies, and resources that make modern collaborative science possible. Like a Formula 1 driver who wants to drive really fast without worrying about what’s going on under the hood, scientists want to do their work while computing, data management, security, and the like happen out of sight. That lets them focus on solving problems rather than on the infrastructure that supports problem solving.

It’s a beautifully simple concept that is, unfortunately, difficult to implement.

The term Data Commons (sometimes called Science Commons or Science as a Service) gets tossed around often among scientists and technologists working to make science more productive. So, to learn about the latest efforts to let scientists do science, I attended a session on Data Commons for the geosciences at the recent Earth Science Information Partners (ESIP) Federation meeting in Durham, NC.

A shout-out to the following RENCI colleagues for organizing this session: CTO Charles Schmitt, Director of Environmental Initiatives Brian Blanton, Domain Scientist for Environmental Initiatives Chris Lenhardt, and Senior Research Software Developer Howard Lander. These guys and the teams they work with dedicate much brainpower to Data Commons and other strategies that facilitate more productive science and new scientific insights. Here is a recap of what I gleaned from the session:

  • Data Commons has different definitions depending on who you ask. For the ESIP session, Lenhardt defined it as integrated cyberinfrastructure for science research that brings together all the applications, services, and resources needed to conduct research. The researcher likely signs in to the system and sees a dashboard that pulls together tools, applications, integrated data management, curation, publishing, security, and more. The physical location of these tools and services is unimportant as long as the scientist can access them.
  • The realities of 21st-century research drive the need for a Data Commons. Interdisciplinary, collaborative work generates more (often heterogeneous) data and requires more models, simulations, and analytics, which makes federated resources a necessity. It also requires integrated data management so that data can be easily accessed, kept safe, and stored for future use. Yes, Hollywood still loves the image of a slightly crazy, way too intelligent scientist working alone in some scary-looking secret lab, but that’s not how it works.
  • No perfect system exists that keeps all scientific support infrastructure under one roof. However, some scientific groups have created infrastructure that addresses at least part of the problem. EUDAT, the European Data Infrastructure, offers a suite of tools for finding, syncing and exchanging, storing, sharing, and safely replicating research data, as well as a tool for sending data to compute resources. Other examples of Data Commons include the INCF (International Neuroinformatics Coordinating Facility) Dataspace for sharing neuroscience data, text, images, sounds, movies, models, and simulations, and the National Cancer Institute’s Genomic Data Commons, which provides cancer researchers with a unified data repository that enables data sharing across studies.
  • If the need is great, why all the discussion? Just get to work, right? If only it were so simple. Different science communities use different terminology even when referring to the same phenomenon, specimen, or disease symptom. That makes sharing data across disciplines difficult. Disciplines also differ in how they work and collaborate, in their computing and analysis needs, and in how collaborative and interconnected they are. With so much variation, a universal Data Commons is unlikely. However, domain-specific Data Commons that can link with one another and share features and tools are an attainable goal.

Should the geosciences develop their own Data Commons? ESIP participants say yes. That means geoscientists need to collaborate with data scientists, networking and security specialists, and computer scientists to develop a commons that is tightly linked to real science problems and grows from the ground up based on the needs of the community. ESIP has started to address this challenge through its sustainable data management cluster, a group that promotes collaboration and coordination in managing environmental science data. Others suggest surveying what’s already been done to avoid duplication and reinvention. A national-level study on what kind of infrastructure is needed for science could also contribute to solving the data commons challenge.

Whatever happens, the need for scientific cyberinfrastructure continues to grow, and scientists continue to wish for that under-the-hood solution that will finally free them from the multiple roles of domain scientist, data scientist, computer scientist, and network scientist. Let geologists be geologists, meteorologists be meteorologists, physicists be physicists…you get the picture.

-Karen Green