LACSI

Overview

The Los Alamos Computer Science Institute (LACSI) was created to foster computer science and computational science research efforts at the Department of Energy’s Los Alamos National Laboratory (LANL) that are both internationally recognized and relevant to the laboratory’s goals. Led by a team of collaborators at LANL and the Rice University Center for Research on High Performance Software (HiPerSoft), the LACSI project is building a presence in computer science research at LANL that is commensurate with the strength of the physics community at Los Alamos.

The computer science problems addressed through LACSI include those common to large parallel computing systems built from heterogeneous collections of distributed and parallel components. LACSI research also addresses the problems associated with increasingly multidisciplinary applications that incorporate libraries and other components written with diverse programming languages, models, and parallelization strategies. Because of these challenges, it is extraordinarily difficult to achieve high fractions of peak hardware performance on large-scale parallel systems.

LACSI researchers work to optimize the behavior of these complex applications by improving performance analysis software, replacing simple measurement tools with a deep integration of compile-time transformation, measurement, and analysis. In addition, researchers are integrating real-time adaptive performance optimization and just-in-time compilation into distributed parallel systems to meet the realities of systems whose usage demands vary over time. This integration, based on user-specified, compiler-synthesized, and measurement-validated performance contracts, will help create a new generation of nimble, high-performance applications.

The RENCI Contribution

RENCI’s work with the LACSI project focuses on the following areas:

Fault Indicator Monitoring

As the number of nodes scales to tens of thousands, hardware component failures will occur more frequently. One important aspect of fault-tolerant software is the ability to steer applications away from nodes likely to experience hardware failure within a given period of time. The prediction models employed by adaptive control systems require real-time failure indicator data from individual nodes and/or data reduction processes. Building on the Pablo infrastructure, we are integrating three sets of indicators:

  • Disk warnings based on SMART protocols
  • Switch and network interface card (NIC) status and errors
  • Node motherboard health, including temperature and fan status

Our design is based on three Autopilot sensors and an Autopilot client application for real-time sensor data plotting and analysis. SmartSensor reads SMART (Self-Monitoring, Analysis and Reporting Technology) data from hard drives, e.g., temperature and seek error rate. ACPISensor reads available ACPI (Advanced Configuration and Power Interface) data, e.g., temperature and CPU throttle rate. LmsSensor reads data directly from low-level hardware sensors via the lm_sensors Linux package, including CPU and motherboard temperatures and power supply voltages. To date, we have tested these sensors on four nodes of our rhapsody Linux x86 cluster. To collect, display, and analyze data, we developed an Autopilot client named Gracie, based on the open source Grace data plotting and analysis tool. The client queries the AutopilotManager for any tagged sensors, connects to all sensors, and plots data in real time. The data may be saved and analyzed offline using standard Grace features.
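
The Pablo and Autopilot sensor interfaces are not reproduced here. As a minimal sketch of the kind of data LmsSensor collects, the following Python example polls CPU and motherboard temperatures through the Linux hwmon sysfs interface, the same hardware counters exposed by the lm_sensors package; the paths and the five-second sampling interval are illustrative assumptions, not the actual sensor implementation.

```python
import glob
import time

def read_hwmon_temps():
    """Return {sysfs path: temperature in C} from the Linux hwmon tree."""
    temps = {}
    for path in glob.glob("/sys/class/hwmon/hwmon*/temp*_input"):
        try:
            with open(path) as f:
                millideg = int(f.read().strip())  # values are millidegrees C
            temps[path] = millideg / 1000.0
        except (OSError, ValueError):
            continue  # a sensor may be absent or transiently unreadable
    return temps

if __name__ == "__main__":
    while True:  # periodic polling loop, as a health sensor would run
        for name, celsius in read_hwmon_temps().items():
            print(f"{name}: {celsius:.1f} C")
        time.sleep(5)  # illustrative sampling interval
```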

Fault Injection and Assessment

Measuring the frequency of failure modes and indicators is necessary but not sufficient if we are to build resilient software for multi-teraflop and petaflop systems. In addition, we must test the resilience of application and system software to hardware and software failures; only with such testing can we develop more robust implementations. Because the relative frequency of hardware and software failures on individual systems is extremely low, it is only on larger systems that their frequency becomes large enough for testing. However, even in such cases, the errors are not systematic or reproducible. This makes obtaining statistically valid testing data arduous and expensive. Instead, we need an infrastructure that can be used to systematically produce faults as test cases.

We are investigating how single-bit soft errors in memory and network affect MPI applications running on PC clusters. The methodology we use is software fault injection. Our experiments showed that most applications are very sensitive to even single errors. The errors were often undetected, yielding erroneous output with no user indicators. We also found that even minimal internal application error checking and program assertions can detect some of the faults we injected.
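
The fault injector itself is not described in detail in this summary. A minimal sketch of the underlying idea, flipping a single randomly chosen bit in an application buffer to emulate a memory soft error, might look as follows; the buffer and the injection point are hypothetical stand-ins for state inside a running MPI application.

```python
import random
import numpy as np

def inject_single_bit_error(buf: np.ndarray) -> tuple[int, int]:
    """Flip one randomly chosen bit in a numpy buffer, emulating a soft error."""
    raw = buf.view(np.uint8).reshape(-1)  # byte-level view of the data
    byte = random.randrange(raw.size)
    bit = random.randrange(8)
    raw[byte] ^= np.uint8(1 << bit)       # single-bit flip
    return byte, bit

# Hypothetical usage: corrupt the working array between two solver iterations.
data = np.linspace(0.0, 1.0, 1024)
byte, bit = inject_single_bit_error(data)
print(f"flipped bit {bit} of byte {byte}; data now contains "
      f"{np.count_nonzero(~np.isfinite(data))} non-finite values")
```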

We have also implemented diskless checkpointing and evaluated it on the TeraGrid and the NCSA Tungsten cluster. The largest configuration we tested used 400 MPI processes and 20 spares, with each MPI process dumping 200 MB of data; the performance achieved was 5.3 GB/s. We are currently experimenting with real scientific codes such as sPPM and Sweep3D. Because diskless checkpointing consumes spare memory, the resulting memory pressure motivates approaches that reduce checkpoint dump sizes. We adopted data compression and found that it works very well for sPPM, reducing checkpoint dumps by 80-90%.
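
Our checkpointing implementation is more elaborate than can be shown here. The sketch below illustrates the basic parity idea behind diskless checkpointing using mpi4py, with a single spare rank (rather than the 20 spares used in our tests) and zlib standing in for the checkpoint compression we evaluated; the data sizes and the recovery path are simplified assumptions.

```python
import zlib
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
SPARE = size - 1  # last rank serves as the spare

# Each compute rank owns local state to protect; the spare contributes zeros
# so that it does not perturb the XOR parity.
n = 1 << 20  # illustrative 8 MB of float64 state per rank
local = np.full(n, float(rank)) if rank != SPARE else np.zeros(n)

# Diskless checkpoint: bitwise-XOR every rank's bytes onto the spare's memory.
contrib = local.view(np.uint8)
parity = np.empty_like(contrib) if rank == SPARE else None
comm.Reduce(contrib, parity, op=MPI.BXOR, root=SPARE)

if rank == SPARE:
    dump = zlib.compress(parity.tobytes())  # compression can shrink the dump
    print(f"parity checkpoint: {contrib.nbytes} B raw, {len(dump)} B compressed")
# Recovery (not shown): a lost rank's bytes are the XOR of the parity with the
# surviving compute ranks' bytes.
```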

Dynamic Adaptation and Steering

Today, most applications run to completion on the resources they acquire at program launch. However, large-scale systems are prone to failure, and long-running applications on such systems must sense and respond to component failure.

Performance steering offers an opportunity to adjust a running program for more efficient execution and to adapt to changing resource availability (e.g., due to component failures or resource sharing). The challenge is to develop strategies that enable applications to monitor their own behavior and reactively adjust it to optimize performance according to one or more metrics. More generally, our goal is to develop tools and approaches that manage the challenges of large scale and of integration with multiple subsystems.

Strategies for automatic performance steering based on performance and fault models offer the potential to enable long-running programs to repeatedly adjust themselves to changes in the execution environment: opportunistically acquiring more resources as they become available, rebalancing load, adapting to failures, or controlling power consumption. Validated performance “contracts” among applications, systems, and users, combining temporal and behavioral reasoning from performance predictions, previous executions, and compile-time analyses, are one promising approach. Our work explores the use of performance contracts to guide the monitoring of application and resource behavior; contracts will include dynamic performance signatures and techniques for evaluating observed behavior relative to expected behavior both locally (per process) and globally (per application and per system).
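
Performance contracts are a research concept rather than a fixed API. The following sketch conveys the core idea of a local, per-process contract check, comparing an observed metric against a tolerance band around its predicted value; the metric names, expected values, and thresholds are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Contract:
    """Expected value and tolerance band for one performance metric."""
    metric: str
    expected: float   # from predictions, prior runs, or compile-time analysis
    tolerance: float  # fractional deviation considered acceptable

    def violated(self, observed: float) -> bool:
        return abs(observed - self.expected) > self.tolerance * self.expected

# Local (per-process) evaluation against hypothetical contract terms.
contracts = [Contract("mflops", expected=850.0, tolerance=0.2),
             Contract("mpi_wait_fraction", expected=0.10, tolerance=0.5)]
observed = {"mflops": 412.0, "mpi_wait_fraction": 0.31}

for c in contracts:
    if c.violated(observed[c.metric]):
        print(f"contract violated: {c.metric} = {observed[c.metric]} "
              f"(expected {c.expected} within {c.tolerance:.0%})")
# A global evaluator could aggregate such verdicts per application or system.
```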

Performance Measurement and Analysis

Understanding the behavior of scientific applications on extreme-scale parallel systems remains a major challenge. On terascale and petascale systems, large-scale applications often exhibit irregular behavior with time-varying resource demands, and performance problems range from communication latency and load imbalance to hardware failure. At RENCI we are developing a set of tools that provide run-time performance and resource monitoring as well as post-mortem performance analysis.

Autopilot

Autopilot is an infrastructure for dynamic performance tuning and resource management for scientific applications. It provides a flexible set of performance sensors, decision procedures, and policy actuators to realize adaptive control of applications and resource management policies on large-scale systems. As part of the LACSI effort, Autopilot is being extended to automatically sense, communicate, and respond to changing conditions in an HPC environment, enabling adaptive, runtime control over execution performance.
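
Autopilot’s actual sensor and actuator interfaces are defined by the toolkit itself; the sketch below only illustrates the closed-loop pattern it realizes, in which a decision procedure maps sensor readings to actuator actions. All names here are hypothetical stand-ins, not the Autopilot API.

```python
import random
import time

def cpu_load_sensor() -> float:
    """Stand-in sensor: report a (simulated) CPU utilization in [0, 1]."""
    return random.random()

def throttle_actuator(enable: bool) -> None:
    """Stand-in actuator: apply or release an adaptive control action."""
    print("throttling enabled" if enable else "throttling disabled")

def decision_procedure(load: float, threshold: float = 0.9) -> bool:
    """Simple policy: act when the sensed load crosses a threshold."""
    return load > threshold

for _ in range(5):  # the closed sense-decide-actuate loop
    throttle_actuator(decision_procedure(cpu_load_sensor()))
    time.sleep(1)
```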

SvPablo

SvPablo is a graphical source code browser and performance capture/correlation tool. It provides a graphical environment that lets users insert performance measurement commands directly into their code, capture performance data during execution, and visualize the performance associated with each source code construct. Through its integration with Autopilot, performance data can also be captured at run time for performance monitoring and steering. One of the key problems with running applications on large-scale systems is scalability. SvPablo’s graphical Scalability Analyzer displays both summary scalability data and detailed efficiency data correlated with source code, helping users understand how bottlenecks may move across application code as the number of assigned processors changes.

As the processor count in large systems grows to hundreds of thousands, the assumption of fully reliable hardware and software becomes much less credible. We must explore performance and fault-tolerance mechanisms that permit applications and systems to recognize and recover from transient failures and to adapt to permanent failures by continuing operation in a degraded mode. As part of LACSI efforts, we will integrate SvPablo with fault monitoring sensors and then use measured performance and fault indicator data to investigate possible fault recovery strategies and to maximize the probability of successful job completion.

Data Volume Reduction

As large-scale parallel and distributed systems become increasingly available, improving the performance of scientific codes running on these systems poses major challenges in performance data collection and system resource management. Capturing performance data on such large systems with traditional event tracing methods incurs prohibitive computational overhead and long delays. Such large-volume measurements from thousands of processors distributed across LANs and WANs are likely to cause significant perturbations in program execution and network latencies.

To enable effective monitoring of system resources via performance data capture, our research focused on statistical and mathematical methods for performance data volume reduction. In particular, we investigated the potential of statistical sampling and application signatures for various scientific codes executed on large systems.

Statistical sampling provides techniques for selecting a subset of observations that statistically represents a chosen performance metric for the entire observation set. Sampling cost increases with sample size, which in turn depends on two specifications: the desired accuracy of the metric estimate and the confidence level in that accuracy. As a result, the success of statistical sampling relies on finding a minimal subset whose estimates lie within the desired accuracy at the desired confidence level.
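
As a concrete illustration of how accuracy and confidence drive sampling cost, the sketch below computes the minimum sample size for estimating a proportion (for example, node availability) over a population of nodes, using the standard normal-approximation formula with a finite population correction. This is a textbook calculation, not the specific estimator used in our experiments.

```python
from math import ceil
from statistics import NormalDist

def sample_size(population: int, accuracy: float, confidence: float,
                p: float = 0.5) -> int:
    """Minimum sample size to estimate a proportion p within +/- accuracy
    at the given confidence level, with finite population correction."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    n0 = (z * z * p * (1 - p)) / (accuracy * accuracy)
    n = n0 / (1 + (n0 - 1) / population)            # finite population correction
    return ceil(n)

# E.g., estimating node availability on a 1024-node cluster:
print(sample_size(1024, accuracy=0.05, confidence=0.95))  # ~280 nodes
print(sample_size(1024, accuracy=0.10, confidence=0.90))  # ~64 nodes
```

Relaxing either the accuracy or the confidence requirement shrinks the required sample sharply, which is what makes sampling attractive at scale.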

We validated our approach by conducting experiments estimating processor utilization, node availability, and network reachability on a variety of platforms: a Linux cluster with hundreds of nodes, a large-scale shared memory system, and a WAN-distributed memory system. We achieved over a 90% success rate in our estimates while reducing the sample size by at least an order of magnitude.

Application signatures provide a complementary approach to data volume reduction based on curve fitting. Signatures can succinctly represent performance metric trajectories in parallel and distributed scientific codes, compressing event trace data that captures performance metric dynamics while retaining many of the advantages of event tracing at lower overhead.
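
The curve-fitting machinery in our signature work is richer than shown here. As a minimal sketch of the idea, the following example retains a trace point only when the metric trajectory drifts from a linear extrapolation of the last retained points by more than a tolerance; the trace data and tolerance are illustrative.

```python
import numpy as np

def compress_trajectory(times, values, tol):
    """Keep a trace point only when the metric drifts more than `tol`
    from the straight line through the last two retained points."""
    kept = [(times[0], values[0]), (times[1], values[1])]
    for t, v in zip(times[2:], values[2:]):
        (t0, v0), (t1, v1) = kept[-2], kept[-1]
        predicted = v1 + (v1 - v0) * (t - t1) / (t1 - t0)  # linear extrapolation
        if abs(v - predicted) > tol:
            kept.append((t, v))                            # signature marker
    return kept

# Illustrative trace: a slowly ramping I/O request size with a sudden shift.
t = np.arange(1000.0)
v = np.where(t < 600, 0.01 * t, 0.01 * t + 4.0)
signature = compress_trajectory(t, v, tol=0.5)
print(f"{len(t)} trace points -> {len(signature)} signature points")
```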

We evaluated signature creation and comparison for several scientific codes running on diverse configurations, including Linux clusters, Sun workstations, and an IBM SP2. In experiments with signatures of write request sizes and with compression efficacy for traces with frequent events, our computed signatures were under 2% of the size of the raw event traces, and the associated markers proved effective for comparing signatures of various scales created under different execution contexts.

By coupling statistical sampling with periodic snapshots of performance metrics, one can reliably capture the state of a large system while dramatically reducing the total data volume. Complementarily, application signatures provide highly compressed representations of the performance metrics in event traces. They can be useful for real-time validation of performance contracts, identifying when execution behavior lies outside acceptable ranges, and the ability to compare signatures at various scales enables assessment of the interplay among computation, communication, and I/O performance in execution dynamics.

Funding

This work is supported by Subcontract No. 12783-001-05 49 issued to Rice University from the Regents of the University of California (Los Alamos National Laboratory).

Project Leaders

  • Ken Kennedy, Rice University
  • Lennart Johnsson, University of Houston
  • Deepak Kapur, University of New Mexico
  • Jack Dongarra, University of Tennessee at Knoxville

RENCI Team

  • Kevin Gamiel
  • Ying Zhang

Publications

Daniel A. Reed, Charng-da Lu, and Celso L. Mendes. “Reliability Challenges in Large Systems,” Future Generation Computer Systems, Spring 2005.

Daniel A. Reed and Celso L. Mendes. “Intelligent Monitoring for Adaptation in Grid Applications,” Proceedings of the IEEE, Vol. 93, No.2, 2005.

Charng-da Lu and Daniel A. Reed. “Assessing Fault Sensitivity in MPI Applications,” SC2004 Technical Paper (Best Technical Paper Award), Proceedings of Supercomputing 2004, Pittsburgh, PA, November 2004.

Karthik Pattabiraman. “Design and Evaluation of a Power-Aware Parallel I/O System,” Master’s Thesis, Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 2004.

Daniel A. Reed, C. L. Mendes and Charng-da Lu. “Intelligent Application Tuning and Adaptation,” In I. Foster & C. Kesselman (Eds.) The Grid: Blueprint for a New Computing Infrastructure, chapter 1, 2nd edition, Morgan Kaufmann, November 2003.

Daniel A. Reed, Charng-da Lu and Celso Mendes. “Big Systems and Big Reliability Challenges,” Proceedings of Parallel Computing 2003, pages 729-736, Dresden, Germany, September 2003.

Charng-da Lu and Daniel A. Reed. “Compact Application Signatures for Parallel and Distributed Scientific Codes,” SC2002 Technical Paper, Proceedings of Supercomputing 2002, Baltimore, MD, November 2002.

Celso L. Mendes and Daniel A. Reed. “Monitoring Large Systems via Statistical Sampling,” Proceedings of the LACSI Symposium, Santa Fe, NM, October 2002.

Fredrik Vraalsen, Ruth A. Aydt, C.L. Mendes, and Daniel A. Reed. “Performance Contracts: Predicting and Monitoring Grid Application Behavior,” Grid Computing – GRID 2001, Proceedings of the 2nd International Workshop on Grid Computing, Springer-Verlag Lecture Notes in Computer Science, Denver, CO, November 12, 2001.

Jeff S. Vetter and Daniel A. Reed. “Real-time Performance Monitoring, Adaptive Control, and Interactive Steering of Computational Grids,” The International Journal of High Performance Computing Applications, Volume 14, No. 4, pp. 357-366, Winter 2000.

Presentations

Daniel A. Reed, “High-End Computing: The Challenge of Scale,” Director’s Colloquium, Los Alamos National Laboratory, Los Alamos, NM, May 2004.

Charng-da Lu, “Scalable Diskless Checkpointing for Large Parallel Systems,” University of Illinois at Urbana-Champaign, Urbana, IL, March 2004.

Ying Zhang, “SvPablo: A Toolkit for Scalability Analysis,” Supercomputing 2003, Phoenix, AZ, November 2003.

Charng-da Lu, Karthik Pattabiraman, and Daniel A. Reed, “Fault Injection into MPI Programs,” Poster at LACSI Symposium, Santa Fe, NM, October 2003.

Celso L. Mendes, “SvPablo at LACSI,” Second LACSI Applications and Tools Workshop, Los Alamos, NM, July 2003.

Celso L. Mendes, “SvPablo: Scalable Performance Analysis,” First LACSI Applications and Tools Workshop, Los Alamos, NM, February 2003.

Charng-da Lu, “Compact Application Signatures for Parallel and Distributed Scientific Codes,” Supercomputing 2002, Baltimore, MD, November 2002.

Celso L. Mendes, “Monitoring Large Systems via Statistical Sampling,” LACSI Symposium, Santa Fe, NM, October 2002.

Celso L. Mendes and Ying Zhang, “Scalable Tools for High Performance Computing,” Workshop on Performance Tools, LACSI Symposium, Santa Fe, NM, October 2001.

Links

LACSI Home Page at Rice University
LACSI Home Page at University of Houston
LACSI Home Page at University of New Mexico
LACSI Home Page at University of North Carolina at Chapel Hill
LACSI Home Page at University of Tennessee at Knoxville