A Failure Detection System for Large Scale Distributed Systems

Andrei Lavinia, Ciprian Dobre, Florin Pop and Valentin Cristea
Additional contact information
Andrei Lavinia: University Politehnica of Bucharest, Romania
Ciprian Dobre: University Politehnica of Bucharest, Romania
Florin Pop: University Politehnica of Bucharest, Romania
Valentin Cristea: University Politehnica of Bucharest, Romania

International Journal of Distributed Systems and Technologies (IJDST), 2011, vol. 2, issue 3, 64-87

Abstract: Failure detection is a fundamental building block for ensuring fault tolerance in large-scale distributed systems. It is also a difficult problem. Resources under heavy load can be mistaken for failed ones. The failure of a network link can be detected by the lack of a response, but the same symptom occurs when a computational resource fails. Although progress has been made, no existing approach provides a system that covers all essential aspects of a distributed environment. This paper presents a failure detection system based on adaptive, decentralized failure detectors. The system is developed as an independent substrate that works asynchronously and independently of the application flow. It uses a hierarchical protocol, creating a clustering mechanism that ensures dynamic configuration and traffic optimization. It also uses a gossip strategy for failure detection at local levels to minimize detection time and remove wrong suspicions. Results show that the system scales with the number of monitored resources, while still considering the QoS requirements of both applications and resources.
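
Illustrative sketch (not taken from the paper): the abstract notes that a heavily loaded resource can be mistaken for a failed one, which is the problem an adaptive failure detector addresses. The Python below shows one common way such a detector might work, assuming heartbeat messages and a timeout derived from the mean of recent inter-arrival gaps plus a fixed safety margin. The class name, parameters, and default values are all hypothetical; the paper's actual detector, hierarchy, and gossip protocol are not described in this record.

```python
from collections import deque
from statistics import mean


class AdaptiveHeartbeatDetector:
    """Suspects a peer when its next heartbeat is overdue.

    Hypothetical sketch: the expected arrival time is the last arrival plus
    the mean of recent inter-arrival gaps, padded by a fixed safety margin.
    """

    def __init__(self, window=5, safety_margin=0.5, initial_interval=1.0):
        self.gaps = deque(maxlen=window)  # recent inter-arrival gaps (seconds)
        self.safety_margin = safety_margin
        self.initial_interval = initial_interval  # used until a gap is observed
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.gaps.append(now - self.last_arrival)
        self.last_arrival = now

    def deadline(self):
        """Time after which the peer is suspected; None before any heartbeat."""
        if self.last_arrival is None:
            return None
        expected_gap = mean(self.gaps) if self.gaps else self.initial_interval
        return self.last_arrival + expected_gap + self.safety_margin

    def is_suspected(self, now):
        """True if `now` is past the current adaptive deadline."""
        d = self.deadline()
        return d is not None and now > d
```

Because the deadline follows the observed gaps, a peer whose heartbeats slow down under load gets a longer timeout instead of being suspected immediately, at the cost of slower detection when it really has failed.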

Date: 2011

Downloads: (external link)
http://services.igi-global.com/resolvedoi/resolve. ... 4018/jdst.2011070105 (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:igg:jdst00:v:2:y:2011:i:3:p:64-87

International Journal of Distributed Systems and Technologies (IJDST) is currently edited by Nik Bessis

More articles in International Journal of Distributed Systems and Technologies (IJDST) from IGI Global

 
Page updated 2025-03-19
Handle: RePEc:igg:jdst00:v:2:y:2011:i:3:p:64-87