DataBridge tackles the problem of ‘dark data’

DataBridge, a National Science Foundation-funded project to make research data more discoverable and usable by a wide community of scientists, has the green light to expand its work into the neuroscience community, thanks to a new NSF EAGER award.

The award itself is relatively small (less than $100,000) and will allow the researchers to consult with neuroscientists, develop a prototype DataBridge for Neuroscience (DBfN), and hold a community workshop. However, the impact could be significant for a hot scientific field that is making breakthrough discoveries about the human brain.

The South Big Data Hub will play a key role in DBfN:

  • The Hub will provide computing and storage facilities for the implementation of the DBfN system. Those systems are located at RENCI at UNC-Chapel Hill, one of the lead institutions for the South Big Data Hub.
  • The South Hub will assist the research team in conducting a community workshop. The workshop is tentatively planned to take place at Georgia Tech, the other lead institution for the South Hub.
  • The South Hub’s network of domain scientists and industry experts will be leveraged to disseminate information about DBfN, including the workshop report, to wider audiences in the South and across all four Hub regions.
  • The researchers will work with the South Hub to develop DBfN into a full BD Hub spoke proposal that will help the national neuroscience community.

Still not sure what DataBridge is? The idea is simple and addresses the challenges that result from this key fact: even in the age of big data, most research data is created by small teams or individual investigators. That means most research data sets are small and usually stored locally, where future researchers cannot access them.

When considered as a whole, these small data sets add up to big data: an untapped treasure trove of research results often referred to as “dark data.” DataBridge, led by Arcot Rajasekar at UNC-Chapel Hill and RENCI, aims to make dark data discoverable and available for investigation and collaboration.

DataBridge gathers metadata about data sets, including the scientific field of the data, when and where it was created or collected, and methods used. It then uses relevance detection algorithms to find similarities between a newly ingested data set and other data sets in the system. The system uses socio-metric network algorithms to cluster data sets into “communities” based on their similarities. When researchers use the DataBridge web interface, they can find similar and related data sets—much like Amazon.com recommends books based on past purchases or Facebook recommends new friends based on existing connections.
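To make the idea concrete, here is a minimal sketch of that pipeline in Python. It is not DataBridge's actual code: the data set names, the metadata terms, the 0.3 threshold, and the function names are all invented for illustration. Jaccard similarity stands in for the relevance detection algorithms, and simple connected-component grouping stands in for the socio-metric network clustering.

```python
from itertools import combinations

# Hypothetical metadata: each data set is described by a set of terms
# (field, method, place, year). None of these are real DataBridge records.
EXAMPLE_METADATA = {
    "voter_survey_2010": {"social science", "survey", "north carolina", "2010"},
    "voter_survey_2012": {"social science", "survey", "north carolina", "2012"},
    "census_interviews": {"social science", "interview", "georgia", "2011"},
    "fmri_memory": {"neuroscience", "fmri", "north carolina", "2013"},
    "fmri_attention": {"neuroscience", "fmri", "georgia", "2014"},
}


def jaccard(a, b):
    """Share of metadata terms two data sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def communities(metadata, threshold=0.3):
    """Group data sets linked by a pairwise similarity at or above threshold.

    Assumption: connected components of the similarity graph are used as a
    simple stand-in for a real community detection algorithm.
    """
    parent = {name: name for name in metadata}

    def find(x):
        # Follow parent links to the group's representative, compressing paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(metadata, 2):
        if jaccard(metadata[a], metadata[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in metadata:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


def related(metadata, name, top=3):
    """Return the names of the data sets most similar to the given one."""
    others = [n for n in metadata if n != name]
    others.sort(key=lambda n: (-jaccard(metadata[name], metadata[n]), n))
    return others[:top]


if __name__ == "__main__":
    print(communities(EXAMPLE_METADATA))
    print(related(EXAMPLE_METADATA, "voter_survey_2010", top=1))
```

In this toy example, the two voter surveys land in one community and the two fMRI data sets in another, while the interview data set stays on its own. A production system would work with far richer metadata and more sophisticated similarity and clustering methods.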

In its first iteration, also funded by the NSF, DataBridge focused on data sets in the social sciences. As the project expands into new communities, we wish them continued success in making data from the “long tail of science” more accessible and usable.

Learn more:

DataBridge white paper

DataBridge website

-Karen Green