TR-23-01: Category Theory and the DataBridge Experience


As part of the recent “Big Data” revolution, there has been an explosion in the production of scientific datasets. Because these datasets can support further scientific research and are costly to produce, maximizing their use is both an economic and a scientific imperative. In many disciplines, however, the sheer number of datasets has made locating data of interest a tedious, time-consuming task with no guarantee of success. This difficulty has motivated several efforts to ease the task of data discovery. One of the earliest of these efforts was the DataBridge, which uses various techniques to impute relevance among datasets. While developing the DataBridge system, we integrated several seminal concepts that we believe are necessary for discovering data in a diverse digital corpus. In this paper, we discuss our approach, a category-theoretic justification for that approach, and a set of abstract concepts that we believe are central to the data discovery process.