Leading data science expert joins RENCI as deputy director

Rebecca Boyles, MSPH, currently the founding director of the Center for Data Modernization Solutions at RTI International, will join the Renaissance Computing Institute (RENCI) at the University of North Carolina at Chapel Hill as deputy director on June 24, RENCI Director Ashok Krishnamurthy, PhD, announced today.

Boyles’ leadership of the Center for Data Modernization Solutions at RTI International focuses on bridging the gap between research and information technology by applying a data ecosystem perspective that enables researchers to maximize the value of their data assets. Boyles has also worked closely with RENCI as a partner, in particular as a leader on both NHLBI BioData Catalyst and the NIH HEAL Data Stewardship Group, two important projects that help researchers harness the power of data.

As RENCI’s deputy director, Boyles will take responsibility for RENCI’s research division by managing and enhancing research partnerships with faculty at UNC-Chapel Hill, Duke University, and North Carolina State University; building relationships between RENCI and Triangle area businesses; and leading efforts to bring new federal research funding to RENCI and its partner institutions. She will also apply her trademark skills in developing fit-for-purpose solutions that enable researchers to use data for the public good. 

“Rebecca is an exceptional leader with deep expertise in building data science teams and executing on innovative and impactful projects,” said Krishnamurthy. “We have worked with her on a number of joint projects, and this history shows us that she will be able to make significant strategic contributions at RENCI and in partnership with UNC and our broader research community.” 

In addition to her passion for data science, research, and information technology, Boyles has also enabled strong strategic growth at organizations throughout her career. While a data scientist at the National Institute of Environmental Health Sciences, Boyles clarified the strategic vision for the environmental health science data ecosystem, leveraging existing data assets to respond to timely public health issues. She identified opportunities to catalyze scientific advancements in chemical safety and public health through interactions with broad stakeholder groups. She also liaised with NIH leadership and served as science officer on the Big Data to Knowledge (BD2K) program, including the Data Discovery Index, Frameworks for Community-Based Standards, and the Center for Predictive Computational Phenotyping.

“I am thrilled to join RENCI’s efforts to tackle intractable, long-standing problems by driving the future of scientific computing in collaboration with their partner institutions,” said Boyles. “I look forward to bringing my background in environmental health and biomedical research, along with my experience partnering with diverse groups, to contribute to the pursuit of novel and effective solutions.” 

Boyles holds an MSPH in Environmental Science and Engineering from the Gillings School of Public Health at UNC-Chapel Hill, along with a BA in Biology from UNC-Chapel Hill. Her areas of expertise include data modernization, FAIR data principles, data and modeling applications, data analysis and data management, data integration, and data strategy and implementation. 


What to expect at the iRODS 2024 User Group Meeting

The worldwide iRODS community will gather in Amsterdam, NL from May 28-31

Members of the iRODS user community will meet at the Amsterdam Science Park in Amsterdam, NL for the 16th Annual iRODS User Group Meeting to participate in four days of learning, sharing use cases, and discussing new capabilities that have been added to iRODS in the last year.

The event, sponsored by SURF, RENCI, Globus, and Hays, will provide in-person and virtual options for attendance. An audience of over 100 participants representing dozens of academic, government, and commercial institutions is expected to join.

“We are excited to connect with our user community to learn more about the impact and utility of iRODS on a global scale in fields such as public health, materials science, biotechnology, and more,” said Terrell Russell, executive director of the iRODS Consortium. “In addition to learning from one another’s deployments and use cases, the 2024 iRODS User Group Meeting will provide opportunities to network with users around the world and sow the seeds for future collaboration.”

In May, the iRODS Consortium and RENCI announced the release of iRODS 4.3.2. Along with preparation for work on 5.0.0 and important bug fixes for the 4.3 series, notable updates include the new GenQuery2 parser allowing for richer metadata queries into the catalog, fixes for keyword combinations and bad inputs, a number of documentation additions, and a few new deprecation declarations. 

Another new feature is the S3 API v0.2.0. Many software libraries, tools, and applications now read and write the S3 protocol directly. Last year, the iRODS Consortium announced that the then-new iRODS S3 API could present iRODS via the S3 protocol, and shared details about the requirements, design, and initial implementation. This year, users will hear about the first two releases, the implementation of various endpoints, and the state of Multipart transfers.

During last year’s UGM, users were presented an overview and demonstration of exploratory work with further authentication services such as OAuth 2.0, OpenID Connect, and the iRODS HTTP API. At this year’s event, the iRODS Consortium will share updates through the first three releases of the HTTP API, including optimizations and setting the iRODS server up as an OpenID Connect Protected Resource.
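For readers unfamiliar with how a client talks to an HTTP service acting as an OpenID Connect protected resource, the sketch below may help. It only builds (and does not send) a request that carries an OAuth 2.0 bearer token in the Authorization header, which is the standard way a protected resource receives an access token. The base URL, the "/collections" path, the "op" and "lpath" parameter names, and the token value are illustrative assumptions, not details confirmed by this article; consult the iRODS HTTP API documentation for the actual endpoints.

```python
import urllib.parse
import urllib.request


def build_list_request(base_url, access_token, logical_path):
    """Build (but do not send) a GET request that carries a bearer token.

    The endpoint path and query parameter names are hypothetical
    placeholders used only to illustrate the request shape.
    """
    query = urllib.parse.urlencode({"op": "list", "lpath": logical_path})
    request = urllib.request.Request(f"{base_url}/collections?{query}", method="GET")
    # A protected resource expects the access token issued by the
    # OpenID Connect provider in the Authorization header.
    request.add_header("Authorization", f"Bearer {access_token}")
    return request


# Hypothetical server, token, and logical path for illustration only.
req = build_list_request(
    "https://data.example.org/irods-http-api",
    "example-token",
    "/tempZone/home/alice",
)
print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing it to urllib.request.urlopen against a real deployment with a valid token.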

As always with the annual UGM, in addition to general software updates, users will offer presentations about their organizations’ deployments of iRODS. This year’s meeting will feature over 15 talks from users around the world. Among the use cases and deployments to be featured are:

  • iRODS Security Challenges Within an Enterprise Environment. Dow. Dow’s focus on data security necessitates a tailored approach for their internal users, leading to the development of the Scientific Data Management System (SDMS) Query Tool (SQT), a user-friendly tool designed to facilitate secure access to specific datasets. The current gap with Metalnx for general users is that there is too much control over modifying data and collections. Additionally, it is difficult to synchronize the iRODS users to our existing Azure Security groups for permission management. This talk outlines the development of a Querying Tool utilizing the iRODS C++ API as a backend to communicate with iRODS. The talk will highlight the need for robust security architecture for Enterprise scale applications and where we hope to take the project in the future.
  • Sharing data in a multi-system multi-role environment centered on iRODS. SURF and Erasmus University Rotterdam. SURF, the cooperative association of Dutch educational and research institutions, offers data infrastructure and services to the research communities. Some of its services are based on iRODS and are often used as building blocks for data platforms. One increasingly common architectural component in those platforms is a web portal where researchers can discover data using project-specific queries. Once the data are found, they are made available to the researcher either directly, for example with a download link, or indirectly, by triggering a copy to a computing environment where they are analyzed. Implementing such a workflow is time consuming. Its long-term maintenance is often jeopardized by the limited support available within the project, and design choices too tailored to that use case make its adoption by other organizations difficult. We think that it is possible to model that workflow in a generic way as a reusable modular component that is flexible enough to support even the more stringent requirements associated with sensitive data. The component relies on iRODS and links together multiple web portals and repositories through an API layer based on FastAPI. We present here a proof of concept developed within the GUTS project, in collaboration with the project’s data management team and the research support.
  • Integration of iRODS in a Federated IT Service through HTTP and Python API. CC-IN2P3. The Federated IT Service (FITS) project, a collaborative endeavor between the IN2P3 computing center and the French national HPC center IDRIS, addresses the challenge of managing the escalating data volumes generated by research infrastructures. The project aims to consolidate computing and storage resources while maintaining control over hosting expenses and minimizing the ecological footprint of digital technologies. Within the FITS project, iRODS was selected as the storage pooling solution, leveraging its established use within the IN2P3 Computing Centre. This implementation enables project users to seamlessly access their data without being aware of its physical location.
  • iRODS-based system turbocharged next-gen sequencing analysis during pandemic and beyond. National Institute for Public Health and the Environment (RIVM). The Dutch National Institute for Public Health and the Environment (RIVM) has numerous projects in various scientific domains that generate next generation sequencing data. Bioinformatics plays an important role in analyzing and interpreting this sequencing data. To support these analyses, we developed a platform that consists of a High Performance Compute (HPC) cluster, a Linux Scientific Workspace for software development and a Data Management System (DMS) based on iRODS. On top of this DMS, we also created a Job Engine: a tightly integrated process automation tool that manages the automated analyses of sequencing data on the HPC.

Bookending this year’s UGM are two in-person events for those who hope to learn more about iRODS. On May 28, the Consortium is offering beginner and advanced training sessions. After the conference, on May 31, users have the chance to register for a troubleshooting session, devoted to providing one-on-one help with an existing or planned iRODS installation or integration.

Registration for both physical and virtual attendance will remain open until the beginning of the event. Learn more about this year’s UGM at irods.org/ugm2024.

About the iRODS Consortium

The iRODS Consortium is a membership organization that supports the development of the integrated Rule-Oriented Data System (iRODS), free open source software for data virtualization, data discovery, workflow automation, and secure collaboration. The iRODS Consortium provides a production-ready iRODS distribution and iRODS training, professional integration services, and support. The world’s top researchers in life sciences, geosciences, and information management use iRODS to control their data. Learn more at irods.org.

The iRODS Consortium is administered by founding member RENCI, a research institute for applications of cyberinfrastructure at the University of North Carolina at Chapel Hill. For more information about RENCI, visit renci.org.

UNC Advances Hurricane-driven Flood Prediction Capabilities for Coastal Communities

On September 14, 2018, Hurricane Florence made landfall in the Wrightsville Beach area of coastal North Carolina. Although the storm was only a Category 1 hurricane, it caused catastrophic flooding throughout much of the state. The record amount of rain from the system fell on already saturated soil. Rivers overflowed their banks, storm surge inundated coastal areas, and the water had nowhere to go. It was a rare compound flooding scenario that will be studied and remembered for a long time.

It is difficult to model the impacts of compound flooding, in which fluvial (river) flooding, pluvial flooding (surface flooding unrelated to rivers), and oceanic storm surge interact, yet communities in the path of tropical and extratropical storm systems face this scenario every year. Unfortunately, the difficulty of modeling and understanding these events impedes already difficult hurricane decision-making, leaving countless communities at increased risk, and there is evidence that these compound flooding events may occur more frequently in the future (e.g., Wahl, T., et al. “Increasing risk of compound flooding from storm surge and rainfall for major US cities.” Nature Climate Change 5.12 (2015): 1093-1097.). But a new modeling approach for river representation in the widely used coastal model ADCIRC may help change that, providing predictions and insights to the decision-makers working to keep their communities safe during storm-related flood events.

The Renaissance Computing Institute (RENCI), the University of North Carolina (UNC) Center for Natural Hazards Resilience, and the Institute of Marine Sciences (IMS) at UNC-Chapel Hill combined efforts under a grant from the National Oceanic and Atmospheric Administration (NOAA) to develop a better modeling approach for the compound flooding caused by these interconnected water systems. The resulting model advancement will help scientists represent river channel size variations and provide better insights into interactions between river channels and floodplains.

Current Models

There are several models used to understand and predict coastal inundation scenarios, but two models are primarily used to understand flooding:

  1. ADCIRC is developed by a consortium of researchers in academia, government, and industry, with activities centered and coordinated at both UNC-Chapel Hill and Notre Dame. It is the most widely used storm surge modeling and analysis platform. In fact, FEMA uses the model for coastal flood insurance studies, defining storm surge levels for coastal insurance rates. However, the standard trapezoidal river channel representation used in ADCIRC only accounts for structures down to 30 m, with smaller structures (small rivers, man-made waterways, inlets, estuaries, etc.) creating a more burdensome computation. This creates inaccuracies when modeling compound flood events.
  2. HEC-RAS, a fluvial modeling system developed by the Army Corps of Engineers, accurately models river systems and has been the primary system used for real-time prediction of river flow and stage by the NOAA River Forecast Centers. It was originally developed as a model for inland river systems, where coastal waters do not reach. 

As a result, we currently have two unique and independently accurate models, one for storm surge flooding, and one for fluvial systems, but neither adequately accounts for impacts captured by the other. This means communities that fall into both flood risk zones are left outside our current ability to model and understand their unique circumstances.

Modeling Compound Flooding

The team’s new riverine feature in ADCIRC, led by Dr. Shintaro Bunya (a research scientist with UNC-Chapel Hill’s IMS and DHS-funded Coastal Resilience Center) and Prof. Rick Luettich (Earth, Marine, and Environmental Sciences (EMES) faculty member, Director of UNC-Chapel Hill’s IMS, and principal investigator of the Coastal Resilience Center), represents fluvial channels and man-made waterways using elongated, one-dimensional elements in the channel direction. The depth of the river and the height of the river bank are then specified at the same location. Previously not possible in ADCIRC, this “discontinuous” elevation permits a more accurate simulation of water flow and more easily accounts for smaller structures. The new river feature fits seamlessly into existing two-dimensional ADCIRC models and is as accurate at modeling fluvial flooding as HEC-RAS. The technical details and applications were recently published.

Already, the model has proven its worth. The new river feature was demonstrated in a real-world application (see the figure below) using a large, ocean-scale ADCIRC grid for detailed simulations along the North Carolina coastal region. The coastal river network, with about 200 m along-channel resolution in the Neuse River, is represented by the narrow elements, detailed in insets A and B. The entire ADCIRC grid is shown in inset C. The orange-red colors show the predicted maximum water level contour in a Hurricane Florence (2018) simulation, and the plot in the upper right shows a comparison of observed versus predicted high water marks along the Neuse River. The agreement between observations and predictions is very high, indicating that this new approach to river channel representation in ADCIRC will be highly beneficial in predicting future river flow conditions and their impacts on coastal flooding.

Figure. Real-world example of the new channel network feature in ADCIRC. 
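As a rough illustration of how agreement between observed and predicted high water marks is commonly quantified, the sketch below computes the mean bias and root-mean-square error (RMSE) for paired values. The numbers are made-up placeholders, not data from the Hurricane Florence simulation, and the article does not state which metrics the team actually used.

```python
import math


def skill_metrics(observed, predicted):
    """Return (bias, rmse) for paired observed and predicted water levels."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    errors = [p - o for o, p in zip(observed, predicted)]
    bias = sum(errors) / len(errors)  # mean signed error
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return bias, rmse


# Placeholder high water marks in meters (hypothetical, for illustration only).
observed_m = [2.0, 3.0, 4.0, 5.0]
predicted_m = [2.1, 2.9, 4.2, 4.8]

bias, rmse = skill_metrics(observed_m, predicted_m)
print(f"bias = {bias:.3f} m, RMSE = {rmse:.3f} m")
```

A bias near zero with a small RMSE relative to the water levels would correspond to the kind of close agreement described above.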

This new model has the potential to provide better predictions for communities where evacuation decisions can be the hardest to make, in the hope that North Carolina and other coastal states are less likely to be caught off guard by the flood risks in these compound flooding events.

IT4Innovations National Supercomputing Center joins the iRODS Consortium

IT4Innovations National Supercomputing Center at VSB – Technical University of Ostrava, which is based in the Czech Republic, has become the newest member of the iRODS Consortium. The consortium brings together businesses, research organizations, universities, and government agencies from around the world to ensure the sustainability of the iRODS software as a solution for distributed storage, transfer, and management of data. Members work with the consortium to guide further development and innovation, expand its user and developer communities, and provide adequate support and educational resources.

IT4Innovations is the leading research, development, and innovation center active in the fields of High-Performance Computing (HPC), High-Performance Data Analysis (HPDA), Quantum Computing (QC), and Artificial Intelligence (AI) and their application to other scientific fields, industry, and society. Since 2013, IT4Innovations has been operating the most powerful supercomputing systems in the Czech Republic, which are provided to Czech and foreign research teams from academia and industry.

The Integrated Rule-Oriented Data System (iRODS) is open-source software used by research, commercial, and government organizations around the world. iRODS allows users to store, manage, and share large amounts of data, including metadata, between different organizations and platforms, and it provides a mechanism for defining rules for data storage, processing, and distribution. iRODS is designed to support collaboration, interoperability, and scalability of data infrastructures.

Martin Golasowski, senior researcher at IT4Innovations, summarizes the benefits of membership in the iRODS Consortium: “The demand for a comprehensive solution for fast and efficient data transfer between locations is increasing across the European scientific community. Membership in the iRODS Consortium will enable us to communicate directly with the development team of this solution and provide us with access to the latest features and support in providing these tools not only to the scientific community.”

“iRODS provides a virtual file system for various types of data storage, metadata management, and, last but not least, a mechanism for federating geographically distant locations for data transfer. These features are used in the LEXIS Platform, which simplifies the use of powerful supercomputers to run complex computational tasks through a unified graphical interface or using a specialized application interface. The transfer of large volumes of data between supercomputers and data storage is then performed automatically and transparently for those using iRODS and other data management technologies,” adds Martin Golasowski.

“We are very excited to have our friends in the Czech Republic join the Consortium,” said Terrell Russell, Executive Director of the iRODS Consortium. “Their expertise and collaborative insights have already made iRODS better for everyone. We look forward to continued progress working alongside IT4Innovations.”

The iRODS software has been deployed at thousands of locations worldwide for long-term management of petabytes of data in industries such as oil and gas, biosciences, physical sciences, archives, and media and entertainment. The development team of the iRODS Consortium is based at the Renaissance Computing Institute (RENCI), which is affiliated with the University of North Carolina at Chapel Hill, USA. To learn more about iRODS and the iRODS Consortium, please visit irods.org.

To learn more about IT4Innovations National Supercomputing Center, please visit www.it4i.cz/en.

Exploring the power of distributed intelligence for resilient scientific workflows

New project led by USC Information Sciences Institute seeks to ensure resilience in workflow management systems


Future computational workflows will span distributed research infrastructures that include multiple instruments, resources, and facilities to support and accelerate scientific discovery. However, the diversity and distributed nature of these resources make harnessing their full potential difficult. To address this challenge, a team of researchers from the University of Southern California (USC), the Renaissance Computing Institute (RENCI) at the University of North Carolina, and Oak Ridge, Lawrence Berkeley, and Argonne National Laboratories has received a grant from the U.S. Department of Energy (DOE) to develop the fundamentals of a computational platform that is fault tolerant, robust to various environmental conditions, and adaptive to workloads and resource availability. The grant is planned for five years and includes $8.75 million of funding.

“Researchers are faced with challenges at all levels of current distributed systems, including application code failures, authentication errors, network problems, workflow system failures, filesystem and storage failures and hardware malfunctions,” said Ewa Deelman, research professor, research director at the USC Information Sciences Institute and the project PI. “Making the computational platform performant and resilient is essential for empowering DOE researchers to achieve their scientific pursuits in an efficient and productive manner.”

A variety of real-world DOE scientific workflows will drive the research, from instrument workflows involving telescope and light source data to domain simulation workflows that perform molecular dynamics simulations. “Of particular interest are edge and instrument-in-the-loop computing workflows,” said co-PI Anirban Mandal, assistant director for network research and infrastructure at RENCI. “We expect a growing role for automation of these workflows executing on the DOE Integrated Research Infrastructure (IRI). With these essential tools, DOE scientists will be more productive and the time to discovery will be decreased.”

Fig. 1: SWARM research program elements.

Swarm intelligence

Key to the project is swarm intelligence, a term derived from the behavior of social animals (e.g., ants) that collectively achieve success by working in groups. Swarm Intelligence, or SI, in computing refers to a class of artificial intelligence (AI) methods used to design and develop distributed systems that emulate the desirable features of these social animals – flexibility, robustness and scalability.

“In Swarm Intelligence, agents currently have limited computing and communication capabilities and can suffer from slow convergence and suboptimal decisions,” said Prasanna Balaprakash, director of AI programs and distinguished R&D staff scientist at Oak Ridge, and co-PI of the newly funded project.  “Our aim is to enhance traditional SI-based control and autonomy methods by exploiting advancements in AI techniques and in high-performance computing.”

The enhanced metasystem, called SWARM (Scientific Workflow Applications on Resilient Metasystem), will enable robust execution of DOE-relevant scientific workflows such as astronomy, genomics, molecular dynamics and weather modeling across a continuum of resources – from edge devices near sensors and instruments through wide-area networks to leadership-class systems.

Distributed workflows and challenges

The project develops a distributed approach to workflow development and profiling. The research team will develop an experimental platform where DOE scientists will submit jobs and workflows to a distributed workload pool. Once a set of workflows becomes available in the workflow pool, the agents will estimate each task’s characteristics and resource requirements with continual learning capability. “Such methods enhance the capabilities of the agents. The research will include mathematically rigorous performance modeling and online continual learning methods,” remarked Krishnan Raghavan, an assistant computer scientist in Argonne’s Mathematics and Computer Science division and a co-PI of SWARM.

In SWARM there is no central controller: the agents must reach a consensus on the best resource allocation. “In imitation of biological swarms, we will investigate how coalitions can adapt to various fault tolerance strategies and can reassign tasks, if necessary,” said Argonne senior computer scientist Franck Cappello, who is leading the development efforts on fault recovery and adaptation algorithms. Here the agents will coordinate decision-making for optimal resource allocation while minimizing communication between agents, for example by forming hierarchies and adopting adaptive communication strategies.
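To make the idea of agreement without a central controller more concrete, here is a minimal sketch of a textbook average-consensus iteration, in which each agent repeatedly averages its own estimate with those of its immediate neighbors. This is a generic illustration and not the SWARM project’s algorithm; the ring topology, the agent count, and the load values are all hypothetical.

```python
def consensus_round(estimates, neighbors):
    """One synchronous round: each agent averages its value with its neighbors' values."""
    updated = []
    for agent, value in enumerate(estimates):
        local = [value] + [estimates[n] for n in neighbors[agent]]
        updated.append(sum(local) / len(local))
    return updated


def run_consensus(estimates, neighbors, tolerance=1e-6, max_rounds=1000):
    """Iterate until all agents agree to within the tolerance."""
    rounds = 0
    while max(estimates) - min(estimates) > tolerance and rounds < max_rounds:
        estimates = consensus_round(estimates, neighbors)
        rounds += 1
    return estimates, rounds


# Hypothetical per-agent load estimates on a ring of five agents;
# each agent only ever talks to its two adjacent neighbors.
loads = [10.0, 2.0, 6.0, 4.0, 8.0]
ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}

final, rounds = run_consensus(loads, ring)
print(f"agreed value ~ {final[0]:.4f} after {rounds} rounds")
```

Because every agent in this ring has the same number of neighbors, the update preserves the overall average, so the agents settle on the mean of the initial loads using only local messages.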

Evaluation

To demonstrate the efficacy of the swarm intelligence-inspired approach, the team will evaluate the method through swarm simulations, emulation, and prototyping on testbeds. “We will re-imagine how workflows can be managed to improve both compute and networking at micro and macro levels,” said Mariam Kiran, Group Leader for Quantum Communications and Networking at ORNL.

This article was written in collaboration with USC ISI, RENCI, Oak Ridge National Laboratory, Lawrence Berkeley National Laboratory, and Argonne National Laboratory.

RENCI to showcase latest technological innovations at SC23

Every sector of society is undergoing a historic transformation driven by big data. RENCI is committed to transforming data into discoveries by partnering with leading universities, government, and the private sector to create tools and technologies that facilitate data access, sharing, analysis, management, and archiving.

Each year, the Supercomputing conference provides the leading technical program for professionals and students in the HPC community, as measured by impact, at the highest academic and professional standards. RENCI will host a booth (#1663) at SC23 where team members will share collaborative research projects and cyberinfrastructure efforts aimed at helping people use data to drive discoveries.

A full schedule of sessions at the RENCI booth can be found on our website.


18th Workshop on Workflows in Support of Large-Scale Science

Anirban Mandal, the Assistant Director of the Network Research & Infrastructure Group at RENCI and co-PI of the DOE-funded Poseidon project, will co-chair the 18th Workshop on Workflows in Support of Large-Scale Science (WORKS), taking place November 12-13. WORKS 2023 focuses on the many facets of scientific workflow management systems, ranging from actual execution to service management and the coordination and optimization of data, service, and job dependencies.

iRODS 4.3.1, HTTP, OIDC, and S3

The open source iRODS (Integrated Rule-Oriented Data System) data management platform presents a virtual filesystem, metadata catalog, and policy engine designed to give organizations maximum control and flexibility over their data management practices and enforcement. As iRODS has always defined its own protocol and RPC API, interoperability with other protocols has been left to application developers and administrators. This year’s releases of iRODS 4.3.1 as well as standalone APIs exposing iRODS via HTTP and S3 help new users use their existing, familiar tools to integrate with an iRODS Zone.

iRODS will host a free mini-workshop on Monday, November 13 at 9 AM ET to cover the above efforts and give a glimpse of where the team is headed next. Additionally, iRODS team members will present talks on these topics and be available for further discussion at the RENCI booth on the exhibit floor from November 14-16.

iRODS in the Cloud: Organizational Data Management

iRODS Executive Director Terrell Russell will give a talk at the Google booth (#443) on November 16 at 12:30 PM MT. This talk will give an overview of the philosophy of iRODS as well as some examples of how running iRODS in the Google Cloud can help get a handle on the metadata and bookkeeping associated with an enterprise deployment.

FABRIC Status and FPGA Drop-In

The NSF-funded FABRIC project recently completed installation of a unique network infrastructure connection called the TeraCore, a ring spanning the continental U.S. that boasts data transmission speeds of 1.2 terabits per second (Tbps), or 1.2 trillion bits per second. FABRIC previously established preeminence with its cross-continental infrastructure, but the project has now hit another milestone as the only testbed capable of transmitting data at these speeds, the highest being twelve times faster than what was available before.

FABRIC leadership team members Ilya Baldin and Paul Ruth will present talks at the RENCI booth on the current status of the testbed and future plans for development at the times below. Each talk is followed by a 30-minute office hours session at the RENCI booth for anyone wanting a one-on-one discussion or help with account setup.

  • Tuesday, November 14 at 11:00 AM MT
  • Wednesday, November 15 at 2:00 PM MT
  • Thursday, November 16 at 10:30 AM MT

In conjunction with ESnet and IIT, the FABRIC team will host an FPGA drop-in at the RENCI booth on Wednesday, November 15 at 11:00 AM MT. Those interested in running FPGA-based experiments on FABRIC are encouraged to stop by for a discussion during the block. ESnet smartNIC, a fully open source P4 + FPGA development environment for FABRIC developers, is fully deployed in the NSF FABRIC testbed. Attendees will get a chance to meet the developers, ask questions, and get a 1:1 explanation of how to do P4 development on FABRIC without any prior FPGA design experience. The team will cover everything from “hello world” tutorials to deep dives on the Verilog architecture, DPDK, and other driver software.

FABRIC at INDIS 2023

FABRIC will be represented at the 2023 INDIS Workshop Technical Session on Tuesday, November 14 at 2 PM MT at the SCinet Theater on the exhibit floor. PI Ilya Baldin will talk about FABRIC as part of a panel and a number of FABRIC users will show demos of their FABRIC experiments.

Unleashing the Power within Data Democratization: Needs, Challenges, and Opportunities

On Thursday, November 16 at 1:30 PM MT, FABRIC PI Ilya Baldin will sit on a panel discussing the needs, challenges, and opportunities of the data science community leveraging the existing cyberinfrastructures and software tools while strategizing on what is missing to connect an open network of institutions, including resource-disadvantaged institutions.

A full list of FABRIC activities at SC23 is available on the FABRIC website.


About RENCI

RENCI (Renaissance Computing Institute) develops and deploys advanced technologies to enable research discoveries and practical innovations. RENCI partners with researchers, government, and industry to engage and solve the problems that affect North Carolina, our nation, and the world. RENCI is an institute of the University of North Carolina at Chapel Hill.

RENCI awarded NSF grant to develop cyberinfrastructure training program for X-ray scientists

Enhancing the ability of scientists to use the latest computing and data tools will help quicken the pace of scientific discoveries

RENCI scientists and collaborators from Cornell University and University of Southern California (USC) have been awarded a $1 million, three-year grant from the National Science Foundation (NSF) to develop an innovative training program for scientists who use the Cornell High Energy Synchrotron Source (CHESS) X-ray facility. The program will be designed to help the scientists increase their computing skills, awareness and literacy with an ultimate goal of accelerating scientific innovations in synchrotron X-ray science.

A RENCI team headed by Anirban Mandal, assistant director of the Network Research & Infrastructure Group (NRIG), will lead the CyberInfrastructure Training and Education for Synchrotron X-Ray Science (X-CITE) project. It will bring together experts in cyberinfrastructure, X-ray science and other related areas from RENCI, Cornell University and USC to develop an innovative training program for researchers using CHESS, an NSF-supported high-intensity X-ray source at Cornell. CHESS is used to conduct research in materials science, physics, chemistry, biology, environmental science and other areas.

“Scientists don’t always have the computing and data expertise necessary to fully harness the instruments, data and computing tools available to transform data into insights and knowledge,” said Mandal. “We want to help reduce barriers so that scientists can effectively utilize computing capabilities and data resources at CHESS as well as cyberinfrastructure resources available through national computing and data services.”

Teaching scientists about computing tools

To get scientists up to speed on computing and data tools, the training program will cover programming essentials, systems fundamentals, distributed computing with the cyberinfrastructure ecosystem, X-ray science software, data curation and the application of the FAIR data principles of findability, accessibility, interoperability and reusability.

“As scientific instruments have become more sophisticated, there has been an explosion in the volume and rate of data produced by scientific facilities like CHESS,” said Mandal. “The data generated no longer fits on a laptop, and there are now computational models and AI methods that scientists can use to steer experiments based on the results they are getting. It is very difficult for scientists to keep pace with all these new capabilities.”

Mandal points out that it is important for scientists to get up to date on FAIR principles because federal research funding agencies are planning to roll out new mandates requiring scientists to share the data they generate. This will require designing metadata and figuring out how to push data into repositories in a way that makes it findable and usable by other researchers — tasks that scientists might not be accustomed to doing.
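
As a rough illustration of what designing metadata can involve, the sketch below builds a small machine-readable record and serializes it to JSON. The field names, the placeholder identifier and URL, and the loose mapping of fields to the four FAIR principles are all assumptions made for this example; they are not a schema used by CHESS, X-CITE or any particular repository.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata record; every field name is illustrative."""
    identifier: str    # findable: a persistent identifier for the dataset
    title: str         # findable: a human-readable description
    access_url: str    # accessible: where the data can be retrieved
    file_format: str   # interoperable: an open, documented file format
    license: str       # reusable: the terms under which others may use the data
    keywords: list[str] = field(default_factory=list)


def to_json(record: DatasetRecord) -> str:
    """Serialize the record so that a repository or search index could ingest it."""
    return json.dumps(asdict(record), indent=2)


# Placeholder values only; none of these refer to a real dataset.
record = DatasetRecord(
    identifier="doi:10.0000/example",
    title="Example diffraction scan",
    access_url="https://repository.example.org/datasets/example",
    file_format="HDF5",
    license="CC-BY-4.0",
    keywords=["x-ray", "diffraction"],
)

print(to_json(record))
```

A real deposit would follow whatever metadata schema the target repository requires; the point here is only that each field answers a question another researcher would otherwise have to ask.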

Drawing on RENCI’s expertise

The RENCI team will focus on developing common computer science modules for Python and other programming languages. This work will leverage RENCI’s expertise in this area, including Senior Research Software Developer Erik Scott’s experience as an instructor for the student program within the CI Compass project. The USC team, led by Research Professor of Computer Science Ewa Deelman, will contribute distributed computing training materials. Training materials for the specialized X-ray science software used at CHESS will be the focus of the Cornell team, which is led by Matthew Miller, associate director of CHESS.

The X-CITE training materials and activities will be available in several formats, including self-paced modules, videos, cyberinfrastructure catalogs, in-person instruction sessions, CHESS user workshops and tutorials offered at scientific conferences. The project team will also develop a coordination network to help disseminate the training materials, communicate the cyberinfrastructure needs of the X-ray science community and discuss best practices for training.

NSF FABRIC project announces groundbreaking high-speed network infrastructure expansion

FABRIC completes work on the TeraCore ring, creating a unique continental-scale experimental network capable of transmitting data at 1.2Tbps

The NSF-funded FABRIC project has completed installation of a unique network infrastructure connection called the TeraCore, a ring spanning the continental U.S. that boasts data transmission speeds of 1.2 terabits per second (Tbps), or 1.2 trillion bits per second. FABRIC previously established preeminence with its cross-continental infrastructure, but the project has now hit another milestone as the only testbed capable of transmitting data at these speeds, with the highest being twelve times faster than what was available before. An additional benefit of this infrastructure is that it allows FABRIC to federate with other experimental and science facilities at 400Gbps.

“I’m very pleased to learn that the 1.2Tbps TeraCore in FABRIC has been installed and is now operational,” said Deep Medhi, NSF Program Director for FABRIC. “This will provide researchers with unprecedented capability in the FABRIC platform to push data-intensive research that avails the benefit of this capability.” 

FABRIC is building a novel network infrastructure geared toward prototyping ideas for the future internet at scale. FABRIC currently has over 800 users on the system performing cutting-edge experiments and at-scale research in the areas of networking, cybersecurity, distributed computing, storage, virtual reality, 5G, machine learning, and science applications. Users now have the capability to test how their experiments run at much higher speeds, including developing endpoints that can source and sink data, and protocols that can transfer it, at up to 1.2Tbps over continental distances. Federated facilities were previously connected to FABRIC at 100Gbps; with TeraCore operational, the team is now working to connect several federated facilities at 400Gbps.
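
To put the three link rates in perspective, the short calculation below estimates ideal transfer times for a single dataset. It is a back-of-the-envelope sketch: the 1 PB dataset size is a hypothetical figure chosen for illustration, and the formula ignores protocol overhead, latency, and endpoint limits, so real transfers would take longer.

```python
# Link rates named in the announcement, in bits per second.
RATES_BPS = {
    "100Gbps": 100e9,
    "400Gbps": 400e9,
    "1.2Tbps": 1.2e12,
}

# Hypothetical dataset size (one decimal petabyte), chosen only for illustration.
DATASET_BYTES = 1e15


def transfer_seconds(size_bytes: float, rate_bps: float) -> float:
    """Ideal transfer time: total bits divided by link rate, with no overhead."""
    return size_bytes * 8 / rate_bps


for name, rate in RATES_BPS.items():
    hours = transfer_seconds(DATASET_BYTES, rate) / 3600
    print(f"{name}: {hours:.2f} hours")

# Ratio between the new core rate and the earlier 100Gbps connections.
speedup = RATES_BPS["1.2Tbps"] / RATES_BPS["100Gbps"]
print(f"1.2Tbps is {speedup:.0f}x the 100Gbps rate")
```

Under these idealized assumptions, the same dataset that would need roughly 22 hours at 100Gbps would need under 2 hours at 1.2Tbps, which is the twelvefold difference cited above.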

“I’m excited for the opportunities that the new 1.2Tbps FABRIC TeraCore ring brings,” said Frank Würthwein, Director of the San Diego Supercomputer Center (SDSC) at UC San Diego. “In the near future, we expect to be able to peer SDSC’s compute and storage capabilities with the TeraCore at 400Gbps by connecting to FABRIC in LA. This will allow FABRIC and Prototype National Research Platform (PNRP) research communities access to unique sets of resources possessed by these platforms, including programmable NICs and FPGAs in both platforms, hundreds of TB of NVMe drive capacity at PNRP, and many others.”

The TeraCore ring is also instrumental because much of this research is publicly funded and urgently needed but has depended on for-profit companies’ technology. “The advancement to 1.2Tbps brings FABRIC a step closer to making academic research infrastructures more competitive with internet-scale companies,” said Ilya Baldin, FABRIC Project Director. The TeraCore ring opens the door for expanded academic network infrastructure experimentation, thereby accelerating vitally important innovation and discovery. Additionally, this development sets up FABRIC’s infrastructure for future expansion, making it possible to upgrade portions of the infrastructure further as opportunities become available.

The TeraCore ring was built using spectrum from the fiber footprint of ESnet6, the cutting-edge, high-speed network operated by the Energy Sciences Network (ESnet) that connects the tens of thousands of scientific researchers at Department of Energy laboratories, user facilities, and scientific instruments, as well as research and education facilities worldwide. 

“The scientific research community needs to be able to share, analyze, and store data as fast and efficiently as possible to solve today’s scientific challenges. Advancements such as FABRIC’s TeraCore ring are a major step in this direction that we’re proud to have helped facilitate,” said ESnet Executive Director Inder Monga. 

The FABRIC infrastructure includes the development sites at the Renaissance Computing Institute/UNC-Chapel Hill, University of Kentucky, and Lawrence Berkeley National Laboratory, and the production sites at Clemson University, University of California San Diego, Florida International University, University of Maryland/Mid-Atlantic Crossroads, University of Utah, University of Michigan, University of Massachusetts Amherst/Massachusetts Green High Performance Computing Center, Great Plains Network, National Center for Supercomputing Applications at the University of Illinois Urbana-Champaign, and Texas Advanced Computing Center. FABRIC TeraCore uses optical equipment from Ciena and Infinera and networking equipment from Cisco.

If interested, contact the team at info@fabric-testbed.net to start a conversation around getting your facility connected to the FABRIC infrastructure. 

FABRIC is supported in part by a Mid-Scale RI-1 NSF award under Grant No. 1935966, and the core team consists of researchers from the Renaissance Computing Institute (RENCI) at UNC-Chapel Hill, University of Illinois Urbana-Champaign (UIUC), University of Kentucky (UK), Clemson University, Energy Sciences Network (ESnet) at Lawrence Berkeley National Laboratory (Berkeley Lab), and Virnao, LLC.

Data Matters short-course series returns in August 2023

Annual short-course series aims to bridge the data literacy gap

Now in its tenth year, Data Matters, a week-long series of one- and two-day courses aimed at students and professionals in business, research, and government, will take place virtually via Zoom August 7 – 11, 2023. This short course series is sponsored by the Odum Institute for Research in Social Science at UNC-Chapel Hill, the National Consortium for Data Science, and RENCI.

In recent years, employers’ expectations for a data-literate workforce have grown significantly. According to a 2022 Forrester Research report, 70% of workers are expected to use data heavily in their jobs by 2025 – up from only 40% in 2018. Data Matters recognizes this rapidly changing data landscape and provides attendees the chance to learn from expert instructors about a wide range of topics in data science, analytics, visualization, and more.

“Upskilling is critical to maintaining a competitive workforce in today’s economy. With the rapid increase of data science tools being used in sectors such as business, research and government, it is essential that workers seek out educational opportunities that empower them to address new challenges in their field,” said Amanda Miller, associate director of the National Consortium for Data Science. “Our short-course series has fifteen courses that can be tailored to achieve individual data science goals, whether registrants are looking to refresh their knowledge or trying to learn something new in a welcoming, understanding environment.”

Data Matters instructors are experts in their fields from UNC-Chapel Hill, NC State University, Duke University, NC Central University, UT San Antonio, Oklahoma State, and more. This year’s topics include information visualization, deep learning, exploratory data analysis, statistical machine learning, artificial intelligence, and more, with classes such as:

  • Basic Statistics in R, Vanessa Miller. This course focuses on analyzing a dataset to answer a research question. Students will get hands-on practice with selecting the statistical procedure to answer a research question, performing the appropriate statistical test, and interpreting the output. May be of particular interest to those who are switching to R from another program such as SAS or Stata.
  • Advanced Visualization in R: R Shiny, Angela Zoss. This course will cover the basics of creating R-based web applications with Shiny, an R package that blends data science and statistical operations with interactive interface components. Participants will learn to connect interactive inputs with R operations, develop skills in web application design, and explore different options for hosting Shiny applications on the web. Basic familiarity with R is required.
  • Overview of AI and Deep Learning, Siobhan Day Grady. Many key advances in AI are due to the advances in machine learning, especially deep learning. Natural language processing, computer vision, speech translation, biomedical imaging, and robotics are some areas benefiting from deep learning methods. We will look at the history of neural networks, how advances in data collection and computing caused the revival in neural networks, the different types of deep learning networks and their applications, and tools and software available to design and deploy deep networks. 
  • Geospatial Analytics, Laura Tateosian. This course will focus on how to explore, analyze, and visualize geospatial data. Using Python and ArcGIS Pro, students will inspect and manipulate geospatial data, use powerful GIS tools to analyze spatial relationships, link tabular data with spatial data, and map data. In these activities, participants will use Python and the arcpy library to invoke key GIS tools for spatial analysis and mapping.

Data Matters offers reduced pricing for faculty, students, and staff from academic institutions and for professionals with nonprofit organizations. Head to the Data Matters website to register and to see detailed course descriptions, course schedules, instructor bios, and logistical information. 

Registration is now open at datamatters.org. The deadline for registration is August 3 for Monday/Tuesday courses, August 5 for Wednesday courses, and August 6 for Thursday/Friday courses.


About the National Consortium for Data Science (NCDS)

The National Consortium for Data Science (NCDS) is a collaboration of leaders in academia, industry, and government formed to address the data challenges and opportunities of the 21st century. The NCDS helps members take advantage of data in ways that result in new jobs and transformative discoveries. The organization connects diverse communities of data science experts to support a 21st century data-driven economy by building data science career pathways and creating a data-literate workforce, bridging the gap between data scientists in the public and private sectors, and supporting open and democratized data. Learn more at datascienceconsortium.org/.

The NCDS is administered by founding member RENCI, a research institute for data science and applications of cyberinfrastructure at the University of North Carolina at Chapel Hill. For more information about RENCI, visit renci.org.

What to expect at the iRODS 2023 User Group Meeting

The worldwide iRODS community will gather in Chapel Hill, NC, June 13 – 16

Members of the iRODS user community will meet at UNC-Chapel Hill in North Carolina for the 15th Annual iRODS User Group Meeting to participate in four days of learning, sharing use cases, and discussing new capabilities that have been added to iRODS in the last year.

The event, sponsored by RENCI, Omnibond, Globus, and Hays, will provide in-person and virtual options for attendance. An audience of over 100 participants representing dozens of academic, government, and commercial institutions is expected to join.

“The robust list of presentations at the 2023 iRODS User Group Meeting illustrates the impact and utility of iRODS on a global scale, from talks on unique applications of the software to demos of innovative clients and integrations,” said Terrell Russell, executive director of the iRODS Consortium. “We are excited to host the user community in our hometown of Chapel Hill, NC and provide opportunities for learning, networking, and collaboration throughout the week.”

In May, the iRODS Consortium published the 2023 Technology Roadmap, which documents the state of the technical direction chosen for the iRODS data management software. A notable focus named in this plan is to make implementing the iRODS Protocol less complicated by designing a new HTTP API.

Other plans include updates to the GenQuery interface and the iCommands client. The iRODS GenQuery interface has long defined the way users and administrators can search the iRODS namespace, its storage systems, users, and metadata, while honoring the iRODS permission model. The next generation of GenQuery, GenQuery2, is now available for experimentation. The current iRODS iCommands are a culmination of many years of effort, but they are beginning to show their age, especially in terms of design and extensibility. The iRODS Consortium aims to create a brand new CLI that uses modern libraries (iRODS or otherwise) and modern C++, is extensible and modular, and ships as a single binary.

During last year’s UGM, users learned about a new PAM (pluggable authentication module) plugin, a universal implementation for all authentication methods. At this year’s event, the iRODS Consortium will provide an overview and demonstration of exploratory work with further authentication services such as OAuth 2.0, OpenID Connect, and the new iRODS HTTP API, and how integrations with these services may work best in the future.

As always with the annual UGM, in addition to general software updates, users will offer presentations about their organizations’ deployments of iRODS. This year’s meeting will feature over 15 talks from users around the world. Among the use cases and deployments to be featured are:

  • GoCommands: A cross-platform command-line client for iRODS. CyVerse / University of Arizona. The diversity of scientific computing platforms has increased significantly, ranging from small devices like Raspberry Pi to large computing clusters. However, accessing iRODS data on these varied platforms remains a common but challenging requirement. The official command-line tool for iRODS, iCommands, is limited to a few platforms like CentOS7 and Ubuntu 18/20. As a result, users on other platforms like MacOS, Windows, and Raspberry Pi OS have no straightforward, performant means of accessing iRODS. GoCommands is another command-line tool for iRODS designed to address the portability issue of iCommands. Because it is written in the Go programming language, building its executable for diverse platforms is straightforward. The tool is a single executable that does not require any dependency installation. Pre-built binaries for MacOS, Linux (any distribution), and Windows, regardless of CPU architecture, are already available. In addition, the tool does not require elevated privileges to install or run. This makes it possible for users on nearly any platform to access iRODS.
  • iBridges: A comprehensive way of interfacing with iRODS. Utrecht University, Wageningen University, and University of Groningen. iRODS is a rich middleware providing means to facilitate data management for research. It implements all necessary concepts like resources, metadata, permissions, and rules. However, in research most of the concepts are still new. Hence, researchers and their support staff are challenged using the current interfaces and tools to 1) learn about those concepts and 2) familiarize themselves with the different APIs and command line interfaces. This creates a steep learning curve for researchers and research supporters, slowing down the adoption of iRODS. To ease the usage of iRODS we present iBridges. iBridges is a standalone desktop application, written in Python, to provide users of Windows, Linux, and MacOS with a graphical user interface (GUI) to interact with iRODS servers. The tool is agnostic to any rules/policies in the server. Out of the box, iBridges supports three main functions: browsing and manipulating data objects, uploading and downloading data, and searching data collections.
  • ManGO: A web portal and framework built on top of iRODS for active research data management. KU Leuven. At the University of Leuven, Belgium, we are building the infrastructure and software layers to leverage iRODS as a major building block in active research data management. This involves various workflows and processing of data and metadata during the lifetime of a research project. One of the important components consists of a modular and adaptable web portal built using the iRODS Python client. Given the wide range of use cases, the web framework employs some classical architectural patterns to decouple specialized domain-specific needs from the core system. It also has features that make it behave like a content management system, including a (view) template override system that makes the representation of collections and data objects dependent on, for example, specific metadata or collection structure. Metadata is a prime focus to steer many aspects of this framework along its core use for research data, and a considerable effort was also put into a user-friendly metadata schema management system. In this talk, we will present the current status as well as near future plans.
  • iRODS Object Store on Galaxy Server: Application of iRODS to a Real Time, Multi-user System. Penn State University. Galaxy is an open-source platform for data analysis that enables users to 1) Use tools from various domains through its graphical web interface, 2) Run code in interactive environments such as Jupyter or RStudio, 3) Manage data by sharing and publishing results, workflows, and visualizations, and 4) Ensure reproducibility by capturing the necessary information to repeat data analyses. To store data, Galaxy utilizes ObjectStore as its data virtualization layer, which abstracts Galaxy’s domain logic from the data persistence technology. Currently, Galaxy mainly uses a Disk ObjectStore for data persistence. To extend Galaxy’s data persistence capabilities, we previously extended Galaxy’s ObjectStore to support iRODS. In this work, we discuss the steps in deploying iRODS Object Store on the USA-based Galaxy server (usegalaxy.org) and the challenges we faced. To the best of our knowledge, after CyVerse, this is one of the few applications of iRODS to a real time, multi-user system.

Bookending this year’s UGM are two in-person events for those who hope to learn more about iRODS. On June 13, the Consortium is offering beginner and advanced training sessions. After the conference, on June 16, users have the chance to register for a troubleshooting session, devoted to providing one-on-one help with an existing or planned iRODS installation or integration.

Registration for both physical and virtual attendance will remain open until the beginning of the event. Learn more about this year’s UGM at irods.org/ugm2023.

About the iRODS Consortium

The iRODS Consortium is a membership organization that supports the development of the integrated Rule-Oriented Data System (iRODS), free open source software for data virtualization, data discovery, workflow automation, and secure collaboration. The iRODS Consortium provides a production-ready iRODS distribution and iRODS training, professional integration services, and support. The world’s top researchers in life sciences, geosciences, and information management use iRODS to control their data. Learn more at irods.org.

The iRODS Consortium is administered by founding member RENCI, a research institute for applications of cyberinfrastructure at the University of North Carolina at Chapel Hill. For more information about RENCI, visit renci.org.