Heterogeneous 1-out-of-N warm standby systems with online checkpointing
Gregory Levitin, Liudong Xing and Yuanshun Dai
Reliability Engineering and System Safety, 2018, vol. 169, issue C, 127-136
Abstract:
As a common practice in computing-related applications, checkpointing is used to facilitate effective system recovery when failures occur. Checkpoints are performed to save data associated with the completed portion of a mission task. In the case of a failure, the system can, through rollback and data retrieval, resume the mission task from the last successful checkpoint instead of from the very beginning of the mission, saving time and cost. This paper models and optimizes 1-out-of-N: G warm standby systems subject to uneven online checkpointing, where checkpoints can be performed in parallel with the execution of the primary mission task to improve the efficiency of computing elements. Both data checkpointing and data retrieval take dynamic time that depends on the amount of work completed. System elements can be heterogeneous in their time-to-failure distribution, performance, and level of readiness to take over the mission task during the warm standby mode. A numerical method is first suggested to evaluate mission performance indices, including mission success probability, expected mission completion time, and expected mission operation cost. Examples are provided to demonstrate the influence of the mission deadline and the element resource sharing parameter (i.e., the CPU time distribution between the checkpointing procedure and the primary mission task) on the mission performance metrics. The optimal checkpoint distribution and optimal element activation sequencing problems are considered for different combinations of optimization objectives and constraints. A co-optimization problem is further addressed, which aims to find the optimal combination of checkpoint distribution and element activation sequence. Example optimization solutions illustrate the tradeoff among the three mission requirements (reliability, completion time, operation cost) for warm standby systems with online checkpoints.
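The rollback idea described in the abstract can be illustrated with a minimal Python sketch. It is not the numerical method of the paper: it assumes a single deterministic failure scenario, zero checkpointing and retrieval overhead, no failures during standby, and failure times measured from each element's activation. The function names `last_checkpoint` and `completion_time` and all parameter values are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def last_checkpoint(checkpoints, work_done):
    """Largest checkpoint position not exceeding work_done (0.0 if none)."""
    i = bisect_right(checkpoints, work_done)
    return checkpoints[i - 1] if i > 0 else 0.0


def completion_time(total_work, checkpoints, failures, speeds):
    """Mission completion time for one fixed failure scenario.

    checkpoints: sorted work positions at which data is saved (may be uneven).
    failures[k]: operating time after which element k fails (None = never).
    speeds[k]:   work units per time unit for element k (heterogeneous).
    Elements are activated in list order. Assumption: checkpoint and
    retrieval overheads are zero. Returns None if every element fails
    before the mission is complete.
    """
    saved = 0.0   # work preserved by the last successful checkpoint
    clock = 0.0   # elapsed mission time
    for speed, fail_after in zip(speeds, failures):
        needed = (total_work - saved) / speed
        if fail_after is None or fail_after >= needed:
            return clock + needed
        clock += fail_after
        # Roll back: the next element resumes from the last checkpoint.
        saved = last_checkpoint(checkpoints, saved + fail_after * speed)
    return None


# Element 1 (speed 10) fails after 6 time units with 60 units of work done;
# element 2 (speed 5) resumes from the checkpoint at 50.
with_cp = completion_time(100.0, [20.0, 50.0, 90.0], [6.0, None], [10.0, 5.0])
no_cp = completion_time(100.0, [], [6.0, None], [10.0, 5.0])
print(with_cp, no_cp)  # 16.0 26.0
```

In this scenario the checkpoint at 50 work units saves 10 time units compared with restarting from the beginning. Evaluating the success probability or expected cost discussed in the abstract would require averaging over the time-to-failure distributions, which this sketch does not attempt.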
Keywords: Online checkpoint; Warm standby; Mission cost; Mission reliability; Mission time; Optimization; Real-time; Sequencing
Date: 2018
Downloads:
http://www.sciencedirect.com/science/article/pii/S0951832017301904
Full text for ScienceDirect subscribers only
Persistent link: https://EconPapers.repec.org/RePEc:eee:reensy:v:169:y:2018:i:c:p:127-136
DOI: 10.1016/j.ress.2017.08.011
Reliability Engineering and System Safety is currently edited by Carlos Guedes Soares