Exploring the power of distributed intelligence for resilient scientific workflows

New project led by USC Information Sciences Institute seeks to ensure resilience in workflow management systems

Image generated by the author using DALL-E.

Future computational workflows will span distributed research infrastructures that include multiple instruments, resources, and facilities to support and accelerate scientific discovery. However, the diversity and distributed nature of these resources make harnessing their full potential difficult. To address this challenge, a team of researchers from the University of Southern California (USC), the Renaissance Computing Institute (RENCI) at the University of North Carolina, and Oak Ridge, Lawrence Berkeley, and Argonne National Laboratories has received a grant from the U.S. Department of Energy (DOE) to develop the fundamentals of a computational platform that is fault tolerant, robust to varying environmental conditions, and adaptive to workloads and resource availability. The five-year grant includes $8.75 million in funding.

“Researchers are faced with challenges at all levels of current distributed systems, including application code failures, authentication errors, network problems, workflow system failures, filesystem and storage failures, and hardware malfunctions,” said Ewa Deelman, research professor and research director at the USC Information Sciences Institute and the project’s principal investigator (PI). “Making the computational platform performant and resilient is essential for empowering DOE researchers to achieve their scientific pursuits in an efficient and productive manner.”

A variety of real-world DOE scientific workflows will drive the research – from instrument workflows involving telescope and light source data to domain simulation workflows that perform molecular dynamics simulations. “Of particular interest are edge and instrument-in-the-loop computing workflows,” said co-PI Anirban Mandal, assistant director for network research and infrastructure at RENCI. “We expect a growing role for automation of these workflows executing on the DOE Integrated Research Infrastructure (IRI). With these essential tools, DOE scientists will be more productive and the time to discovery will be reduced.”

Fig. 1: SWARM research program elements.

Swarm intelligence

Key to the project is swarm intelligence, a term derived from the behavior of social animals, such as ants, that collectively succeed by working in groups. In computing, swarm intelligence (SI) refers to a class of artificial intelligence (AI) methods for designing and developing distributed systems that emulate the desirable traits of these social animals: flexibility, robustness, and scalability.
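
To make the idea concrete, here is a minimal, illustrative sketch (not project code) of particle swarm optimization, a classic SI method. Each agent follows only simple local rules, yet the swarm as a whole converges toward a good solution; the function names and parameter values are illustrative assumptions.

```python
import random

def pso(objective, dim=2, n_particles=20, iters=100):
    """Toy particle swarm optimization: each particle follows simple local
    rules (inertia, pull toward its own best and the swarm's best), and a
    good global solution emerges without any central controller."""
    # Initialize positions and velocities randomly in [-5, 5]^dim.
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # each particle's best-so-far
    pbest_val = [objective(p) for p in pos]
    gbest = pbest[min(range(n_particles), key=lambda i: pbest_val[i])][:]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:                   # update personal best
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < objective(gbest):           # update swarm best
                    gbest = pos[i][:]
    return gbest

# Example: minimize the sphere function; the swarm converges near the origin.
best = pso(lambda x: sum(v * v for v in x))
print("best position found:", best)
```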

“In Swarm Intelligence, agents currently have limited computing and communication capabilities and can suffer from slow convergence and suboptimal decisions,” said Prasanna Balaprakash, director of AI programs and distinguished R&D staff scientist at Oak Ridge National Laboratory and co-PI of the newly funded project. “Our aim is to enhance traditional SI-based control and autonomy methods by exploiting advancements in AI techniques and in high-performance computing.”

The enhanced metasystem, called SWARM (Scientific Workflow Applications on Resilient Metasystem), will enable robust execution of DOE-relevant scientific workflows in fields such as astronomy, genomics, molecular dynamics, and weather modeling across a continuum of resources – from edge devices near sensors and instruments, through wide-area networks, to leadership-class systems.

Distributed workflows and challenges

The project will develop a distributed approach to workflow development and profiling. The research team will build an experimental platform where DOE scientists submit jobs and workflows to a distributed workflow pool. Once a set of workflows becomes available in the pool, the agents will estimate each task’s characteristics and resource requirements using continual learning. “Such methods enhance the capabilities of the agents. The research will include mathematically rigorous performance modeling and online continual learning methods,” remarked Krishnan Raghavan, an assistant computer scientist in Argonne’s Mathematics and Computer Science division and a co-PI of SWARM.
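
As a rough illustration of what such an agent might do (a hypothetical sketch, not the SWARM implementation), consider an estimator that keeps online predictions of a task type’s resource needs and refines them as new executions are observed; the class, task names, and default values below are assumptions for the example only.

```python
from collections import defaultdict

class ResourceEstimator:
    """Illustrative sketch: an agent keeps online estimates of each task
    type's resource needs and refines them as new executions complete --
    a simple stand-in for online continual learning."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha                      # learning rate for the moving average
        self.estimates = defaultdict(lambda: {"runtime_s": 60.0, "memory_gb": 1.0})

    def update(self, task_type, runtime_s, memory_gb):
        """Blend a new observation into the running estimate."""
        est = self.estimates[task_type]
        est["runtime_s"] += self.alpha * (runtime_s - est["runtime_s"])
        est["memory_gb"] += self.alpha * (memory_gb - est["memory_gb"])

    def predict(self, task_type):
        """Return the current estimate, e.g. for a scheduling decision."""
        return dict(self.estimates[task_type])

# Example: the estimate drifts toward the observed behavior over time.
est = ResourceEstimator()
for runtime, mem in [(120, 2.0), (110, 2.2), (130, 2.1)]:
    est.update("molecular_dynamics", runtime, mem)
print(est.predict("molecular_dynamics"))
```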

In SWARM there is no central controller: the agents must reach a consensus on the best resource allocation. “In imitation of biological swarms, we will investigate how coalitions can adapt to various fault tolerance strategies and can reassign tasks, if necessary,” said Argonne senior computer scientist Franck Cappello, who is leading the development of fault recovery and adaptation algorithms. The agents will coordinate decision-making for optimal resource allocation while minimizing inter-agent communication, for example by forming hierarchies and adopting adaptive communication strategies.
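
A toy sketch of controller-free assignment might look like the following (hypothetical, not SWARM’s algorithm): each agent bids on tasks locally, bids are exchanged among peers, and every agent applies the same deterministic rule, so all of them arrive at the same allocation without a coordinator. Agent and task names are made up for illustration.

```python
import random

def decentralized_assignment(agents, tasks):
    """Illustrative consensus-style task assignment with no central
    controller: agents compute bids locally, exchange them, and apply a
    shared deterministic rule so they all agree on the same allocation."""
    # Each agent computes its own bids locally (lower cost = better).
    bids = {a: {t: random.uniform(1, 10) for t in tasks} for a in agents}

    # In this toy, collecting bids into one table stands in for a round of
    # peer-to-peer message exchange (gossip) among the agents.
    shared = {t: {a: bids[a][t] for a in agents} for t in tasks}

    # Same rule applied everywhere: the lowest-cost bidder wins each task,
    # so every agent independently reaches the same answer.
    return {t: min(shared[t], key=shared[t].get) for t in tasks}

assignment = decentralized_assignment(
    agents=["edge-node", "hpc-cluster", "lab-workstation"],
    tasks=["preprocess", "simulate", "analyze"],
)
print(assignment)
```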

Evaluation

To demonstrate the efficacy of the swarm intelligence-inspired approach, the team will evaluate the method through swarm simulations, emulation, and prototyping on testbeds. “We will re-imagine how workflows can be managed to improve both compute and networking at micro and macro levels,” said Mariam Kiran, Group Leader for Quantum Communications and Networking at ORNL.

This article was written in collaboration with USC ISI, RENCI, Oak Ridge National Laboratory, Lawrence Berkeley National Laboratory, and Argonne National Laboratory.