CHAPEL HILL, NC – RENCI researchers will work with scientists from Clemson University and Washington State University on a project funded by the National Science Foundation to develop cyberinfrastructure aimed at giving researchers around the nation and the world a more fluid and flexible system for analyzing large-scale data.
The NSF awarded $2.95 million for a collaborative project that will unite biologists, hydrologists, computer engineers and computer scientists to design a system called Scientific Data Analysis at Scale (SciDAS).
Claris Castillo, a senior systems researcher, will lead the SciDAS effort as RENCI principal investigator and will be assisted by co-PI Ray Idaszak, RENCI’s director of DevOps. Clemson scientist Alex Feltus is the lead PI on the project. Other co-PIs are Clemson’s Melissa Smith and Stephen Ficklin of Washington State.
SciDAS seeks to help current researchers and future innovators discover data, move it smoothly across advanced networks, and gain more flexible access to national and global resources. It will enable a broad range of scientists not only to get information faster but also to use much larger data sets and tease out information that they might not even know exists.
“A key aspect of the SciDAS team is that we’ll be processing scientific data at the same time that we’re gluing together all the parts needed for a national cyberinfrastructure (CI) ecosystem,” said Feltus, associate professor of genetics and biochemistry in Clemson University’s College of Science. “We’re trying to avoid the problem of ‘if you build it they will come’ and instead enlist the input of a variety of scientists to join us on the ground floor and help us build it. Thus, our software will be refined by using real data by real users with real habits.”
RENCI will lead the effort to integrate existing cyber tools and technologies into the new SciDAS infrastructure that will be designed to support all aspects of distributed, data-driven research. Development of the SciDAS framework will involve integrating a number of NSF-funded CI systems into one package, including:
- NSF CC-IIE RADII (Resource Aware Data-centric collaborative Infrastructure), an effort to couple data management (iRODS) and resource management (the ORCA control framework) from the ground up. Its tools and approaches allow scientists to easily map collaborative data-driven activities onto dynamically configurable cloud infrastructures.
- iRODS: the Integrated Rule-Oriented Data System, which federates distributed and heterogeneous data into a single virtual file system for easier, safer data sharing and data management.
- NSF SSI Hydroshare, an open-source collaborative system for sharing hydrologic data and models. Hydroshare enables scientists to easily discover and access data and models in the cloud or retrieve them to their desktops.
- NSF CC-NIE ADAMANT (Adaptive Data-Aware Multi-Domain Application Network Topologies), which integrates the Pegasus workflow management system and the ORCA resource control framework. It leverages ExoGENI as well as national research and education networks to create elastic, isolated environments to execute complex distributed tasks.
- NSF CICI SAFE, a project working to securely automate and monitor the creation of virtual super-facilities that link scientists to multiple resources. CICI-SAFE automates the authorization and security monitoring needed to keep these very fast and dynamic network links safe.
“We will build on successful cyberinfrastructure projects developed here at RENCI, most of them with funding from the National Science Foundation,” said Castillo. “Through NSF support, RENCI has developed a number of cyberinfrastructure tools and environments that make science more productive. SciDAS will integrate those tools and work environments into a unified cyberinfrastructure tailored to support science applications at scale. It is a win for scientists and a way to extend the value of our funded projects.”
On a technical level, SciDAS will combine access to multiple national cyberinfrastructure resources, including NSF Clouds, the Open Science Grid, the Extreme Science and Engineering Discovery Environment, petascale supercomputers such as COMET, and a variety of nationwide university resources such as Clemson University’s Palmetto Cluster. The distributed and scalable nature of both the data-sharing and the compute infrastructure will be exploited to boost the performance of workflows and scientific productivity.
“The 21st century presents huge problems for scientists to solve and it also offers great opportunities to create a better quality of life,” Castillo added. “Our mission is to streamline the process of discovery and data analysis by bringing together domain scientists and cyberinfrastructure experts. We are not building one solution to fit all needs. Instead, we see SciDAS as a nationwide, and someday worldwide, CI ecosystem that is flexible and scalable to meet the evolving computing and data analysis needs of many scientific communities.”
SciDAS video (produced/created by Clemson University)
See also: Clemson scientists receive $2.95M to improve and simplify large-scale data analysis
Jim Melvin, Clemson University College of Science, contributed to this article.