Fault-Tolerance Mechanisms for a Parallel Programming System — A Responsiveness Perspective
Holger Karl
Additional contact information
Holger Karl: Humboldt-University of Berlin, Institut für Informatik
A chapter in Communication-Based Systems, 2000, pp 43-54 from Springer
Abstract:
Clusters of workstations are an attractive environment for high-performance computing. For some applications, however, clusters still lack certain properties. One such property is responsive (dependable and timely) execution of programs. This paper studies two mechanisms, checkpointing and replication, for improving the responsiveness (the probability of meeting a deadline in the presence of faults) of a parallel programming system, Calypso, by mitigating a single point of failure in Calypso. Experiments show that checkpointing is a suitable tool for achieving high responsiveness and that even a very modest degree of replication is sufficient to improve responsiveness.
Keywords: Message Passing Interface; Multicast Group; Fault Injection; Fault Rate; Defense Advance Research Project Agency
Date: 2000
Persistent link: https://EconPapers.repec.org/RePEc:spr:sprchp:978-94-015-9608-4_4
Ordering information: This item can be ordered from
http://www.springer.com/9789401596084
DOI: 10.1007/978-94-015-9608-4_4