Measuring the Resiliency of Extreme-Scale Computing Environments
Catello Di Martino,
Zbigniew Kalbarczyk and
Ravishankar Iyer
Additional contact information
Catello Di Martino: University of Illinois at Urbana-Champaign
Zbigniew Kalbarczyk: Bell Labs - Nokia
Ravishankar Iyer: Bell Labs - Nokia
A chapter in Principles of Performance and Reliability Modeling and Evaluation, 2016, pp 609-655 from Springer
Abstract:
This chapter presents a case study on how to characterize the resiliency of large-scale computers. The analysis focuses on the failures and errors of Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The characterization is performed by a joint analysis of several data sources, which include workload and error/failure logs as well as manual failure reports. We describe LogDiver, a tool that automates data preprocessing and the computation of metrics that measure the impact of system errors and failures on user applications, i.e., the compiled programs launched by user jobs that can execute across one or more XE (CPU) or XK (CPU + GPU) nodes. Results include (i) a characterization of the root causes of single node failures; (ii) a direct assessment of the effectiveness of system-level failover and of memory, processor, network, GPU accelerator, and file system error resiliency; (iii) an analysis of system-wide outages; (iv) an analysis of application resiliency to system-related errors; and (v) insight into the relationship between application scale and resiliency across different error categories.
Keywords: File System; Node Failure; Service Node; Memory Error; Blue Water
Date: 2016
Persistent link: https://EconPapers.repec.org/RePEc:spr:ssrchp:978-3-319-30599-8_24
Ordering information: This item can be ordered from
http://www.springer.com/9783319305998
DOI: 10.1007/978-3-319-30599-8_24
More chapters in Springer Series in Reliability Engineering from Springer