
Optimal equidistant checkpointing of fault tolerant systems subject to correlated failure

Bentolhoda Jafary, Lance Fiondella and Ping-Chen Chang

Journal of Risk and Reliability, 2020, vol. 234, issue 4, 636-648

Abstract: Checkpointing backs up work at periodic intervals so that, if a computation fails, it can restart from the latest checkpoint rather than from the beginning. Because checkpointing operations take time, there is a tradeoff between the time spent performing them and the time saved when computation restarts at a checkpoint. This article presents a method to model the impact of correlated failures on an application that performs a specified amount of computation and implements checkpointing operations at equidistant periods during this computation. We develop a Markov model and superimpose a correlated life distribution. Two cases are considered: the first assumes that reaching a checkpoint resets the failure distribution, and the second allows the probability of failure to progress. We illustrate the approach through a series of examples. The results indicate that correlation can negatively impact checkpointing, necessitating more frequent checkpointing and increasing the total time required, but that the approach can still identify the optimal number of equidistant checkpoints despite this correlation.
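
The tradeoff described in the abstract can be illustrated with a minimal sketch. It does not reproduce the article's Markov model or its correlated life distribution. It assumes the simpler textbook case of independent, exponentially distributed failures, no failures during a restart, and a checkpoint at the end of every segment. The function names and all parameter values are hypothetical and chosen only for illustration.

```python
import math


def expected_total_time(work, n_segments, checkpoint_cost, failure_rate,
                        restart_cost=0.0):
    """Expected completion time with equidistant checkpoints.

    Assumptions (not taken from the article): failures follow an
    independent exponential distribution with rate `failure_rate`, no
    failure occurs during a restart, and every segment, including the
    last, ends with a checkpoint of duration `checkpoint_cost`.
    """
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    # Uninterrupted time needed to finish one segment and save its checkpoint.
    segment = work / n_segments + checkpoint_cost
    # Expected time until `segment` units pass without a failure:
    # (1/rate + restart_cost) * (exp(rate * segment) - 1).
    per_segment = (1.0 / failure_rate + restart_cost) * math.expm1(
        failure_rate * segment
    )
    return n_segments * per_segment


def best_segment_count(work, checkpoint_cost, failure_rate,
                       restart_cost=0.0, max_segments=200):
    """Brute-force search for the segment count with the lowest expected time."""
    candidates = range(1, max_segments + 1)
    return min(
        candidates,
        key=lambda n: expected_total_time(
            work, n, checkpoint_cost, failure_rate, restart_cost
        ),
    )


if __name__ == "__main__":
    # Hypothetical values: 100 time units of work, checkpoint cost 1,
    # failure rate 0.01 per time unit.
    n = best_segment_count(work=100.0, checkpoint_cost=1.0, failure_rate=0.01)
    print(n, expected_total_time(100.0, n, 1.0, 0.01))
```

With too few segments, each failure discards a large amount of work. With too many, checkpoint overhead dominates. The minimum lies between these extremes. Modeling correlated failures, as the article does, would require replacing the exponential assumption, which this sketch does not attempt.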

Keywords: Checkpointing; fault tolerance; correlated component failure; Markov model; deadline
Date: 2020

Downloads: (external link)
https://journals.sagepub.com/doi/10.1177/1748006X19893569 (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:sae:risrel:v:234:y:2020:i:4:p:636-648

DOI: 10.1177/1748006X19893569

Bibliographic data for series maintained by SAGE Publications.

 
Page updated 2025-03-19
Handle: RePEc:sae:risrel:v:234:y:2020:i:4:p:636-648