Measuring the Resiliency of Extreme-Scale Computing Environments

Catello Di Martino, Zbigniew Kalbarczyk and Ravishankar Iyer
Additional contact information
Catello Di Martino: University of Illinois at Urbana Champaign
Zbigniew Kalbarczyk: Bell Labs - Nokia
Ravishankar Iyer: Bell Labs - Nokia

A chapter in Principles of Performance and Reliability Modeling and Evaluation, 2016, pp 609-655 from Springer

Abstract: This chapter presents a case study on how to characterize the resiliency of large-scale computers. The analysis focuses on the failures and errors of Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The characterization is performed by a joint analysis of several data sources, which include workload and error/failure logs as well as manual failure reports. We describe LogDiver, a tool that automates data preprocessing and the computation of metrics that measure the impact of system errors and failures on user applications, i.e., the compiled programs launched by user jobs that can execute across one or more XE (CPU) or XK (CPU + GPU) nodes. Results include (i) a characterization of the root causes of single-node failures; (ii) a direct assessment of the effectiveness of system-level failover and of memory, processor, network, GPU accelerator, and file system error resiliency; (iii) an analysis of system-wide outages; (iv) an analysis of application resiliency to system-related errors; and (v) insight into the relationship between application scale and resiliency across different error categories.

Keywords: File System; Node Failure; Service Node; Memory Error; Blue Water
Date: 2016

Persistent link: https://EconPapers.repec.org/RePEc:spr:ssrchp:978-3-319-30599-8_24

Ordering information: This item can be ordered from
http://www.springer.com/9783319305998

DOI: 10.1007/978-3-319-30599-8_24

More chapters in Springer Series in Reliability Engineering from Springer
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Handle: RePEc:spr:ssrchp:978-3-319-30599-8_24