Bad Nodes Considered Harmful: How to Find and Fix the Problem
Marco Seiz,
Johannes Hötzer,
Henrik Hierl,
Stefan Andersson and
Britta Nestler
Additional contact information
Marco Seiz: Institute of Applied Materials (IAM), Karlsruhe Institute of Technology (KIT)
Johannes Hötzer: Institute of Applied Materials (IAM), Karlsruhe Institute of Technology (KIT)
Henrik Hierl: Institute for Digital Materials (IDM), Hochschule Karlsruhe — Technik und Wirtschaft (HSKA)
Stefan Andersson: Amazon Web Services (AWS)
Britta Nestler: Institute of Applied Materials (IAM), Karlsruhe Institute of Technology (KIT)
A chapter in Sustained Simulation Performance 2018 and 2019, 2020, pp 123-130 from Springer
Abstract:
Large, distributed systems of computing units are the current state of the art in high-performance computing. The larger the system, the greater the chance that some component in it fails, which calls for research into how to cope with failure. Failures may manifest as compute nodes shutting down, but also as differing performance among compute nodes. This chapter investigates a recent occurrence of the latter and how to avoid it in large-scale runs.
Date: 2020
Persistent link: https://EconPapers.repec.org/RePEc:spr:sprchp:978-3-030-39181-2_11
Ordering information: This item can be ordered from
http://www.springer.com/9783030391812
DOI: 10.1007/978-3-030-39181-2_11