Optimal equidistant checkpointing of fault tolerant systems subject to correlated failure
Bentolhoda Jafary,
Lance Fiondella and
Ping-Chen Chang
Journal of Risk and Reliability, 2020, vol. 234, issue 4, 636-648
Abstract:
Checkpointing is a technique to back up work at periodic intervals so that if computation fails, it will not be necessary to restart from the beginning but will instead be able to restart from the latest checkpoint. Performing checkpointing operations requires time. Therefore, it is necessary to consider the tradeoff between the time to perform checkpointing operations and the time saved when computation restarts at a checkpoint. This article presents a method to model the impact of correlated failures on an application that performs a specified amount of computation and implements checkpointing operations at equidistant periods during this computation. We develop a Markov model and superimpose a correlated life distribution. Two cases are considered. The first assumes that reaching a checkpoint resets the failure distribution. The second allows the probability of failure to progress. We illustrate the approach through a series of examples. The results indicate that correlation can negatively impact checkpointing, necessitating more frequent checkpointing and increasing the total time required, but that the approach can still identify the optimal number of equidistant checkpoints, despite this correlation.
Keywords: Checkpointing; fault tolerance; correlated component failure; Markov model; deadline (search for similar items in EconPapers)
Date: 2020
References: View references in EconPapers View complete reference list from CitEc
Citations:
Downloads: (external link)
https://journals.sagepub.com/doi/10.1177/1748006X19893569 (text/html)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:sae:risrel:v:234:y:2020:i:4:p:636-648
DOI: 10.1177/1748006X19893569
Access Statistics for this article
More articles in Journal of Risk and Reliability
Bibliographic data for series maintained by SAGE Publications ().