Fault Tolerance in a Distributed System

The scope of this study is restricted to fault tolerance in a distributed operating system, where the failure of one component should not affect the functioning of the other parts.


The purpose of the study is to point out the importance of incorporating fault tolerance measures into systems, especially life-critical systems, to provide proper solutions to system faults when they occur, and to make such systems more dependable by increasing their reliability.


This seminar presentation is significant in several respects. Apart from being a requirement at Madonna University, it is also an opportunity for the researcher to conduct an independent study based on empirical data. Furthermore, it will help to advance knowledge and hence be of great value to other researchers studying computer science and researching related issues, and to members of the academic community in general.

Finally, the recommendations made in this study will serve as a guide to the appropriate agencies, especially manufacturers of life-critical systems such as aircraft, on the need to incorporate fault tolerance mechanisms in order to achieve safety.


This research is limited to examining fault tolerance in distributed systems because of research constraints: time, which is of the essence and which no researcher has in abundance, and finance, which is usually inadequate and limited at the researcher's disposal.


Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naïvely designed system in which even a small failure can cause total breakdown. Fault tolerance is particularly sought after in high-availability or life-critical systems.

Fault: A fault can be described as a "defect" at the lowest level of abstraction. It can lead to an erroneous system state. Faults may be classified as transient, intermittent, or permanent. They can be of the following types:

Processor Faults (Node Faults): Processor faults occur when a processor behaves in an unexpected manner. They may be classified into three kinds:

Fail-Stop: Here a processor is either active and participating in distributed protocols, or it has failed completely and will never respond. In this case the neighboring processors can detect the failed processor.

Slowdown: Here a processor might run in a degraded fashion or might fail completely.

Byzantine: Here a processor can fail, run in a degraded fashion for some time, or execute at normal speed while actively trying to make the computation fail.

Network Faults (Link Faults): Network faults occur when (live and working) processors are prevented from communicating with each other.
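As a rough illustration of how neighboring processors might detect a fail-stop fault, the following Python sketch uses heartbeat timeouts. The class name `HeartbeatDetector`, the `timeout` parameter, and the explicit timestamps are assumptions made for this example and do not come from any particular system. A timeout on its own cannot distinguish a failed processor from a slow processor or a broken link, so the detector only reports suspicion.

```python
class HeartbeatDetector:
    """Timeout-based failure detector (hypothetical, for illustration only)."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before a node is suspected
        self.last_seen = {}      # node id -> time of its most recent heartbeat

    def heartbeat(self, node, now):
        """Record that a heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("A", now=0.0)
detector.heartbeat("B", now=0.0)
detector.heartbeat("A", now=2.0)     # A keeps sending; B has gone silent
print(detector.suspected(now=4.0))   # only B has been silent for more than 3 s
```

Passing `now` explicitly instead of reading the system clock keeps the sketch deterministic and easy to test.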

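Error: An undesirable system state that may lead to failure of the system.

Failure: The deviation of the system's delivered service from its specified behavior. A failure is the externally visible consequence of an error, whether the underlying fault was accidental or caused by intrusion.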

Fault tolerance: Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components.

Recovery: Recovery is a passive approach in which the state of the system is maintained and is used to roll back the execution to a predefined checkpoint.

Redundancy: With respect to fault tolerance, redundancy is the replication of hardware, software components, or computation.

Security: The robustness of the system, characterized by secrecy, integrity, availability, reliability, and safety during its operation.
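The recovery definition above (maintain the state, then roll back to a predefined checkpoint) can be sketched in a few lines of Python. This is a minimal single-process illustration; `CheckpointedState` and `apply_transfer` are hypothetical names, and a real distributed system would also need to coordinate checkpoints across nodes.

```python
import copy


class CheckpointedState:
    """Keeps a saved copy of the state so execution can be rolled back."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        """Save the current state as the new recovery point."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Discard the current state and restore the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


def apply_transfer(store, amount):
    """Hypothetical two-step update that can fail halfway through."""
    store.state["a"] -= amount
    if store.state["a"] < 0:
        raise ValueError("insufficient funds")
    store.state["b"] += amount


store = CheckpointedState({"a": 100, "b": 0})
apply_transfer(store, 30)
store.checkpoint()                 # state is consistent here: save it
try:
    apply_transfer(store, 500)     # fails after "a" was already modified
except ValueError:
    store.rollback()               # undo the partial update
print(store.state)
```

The deep copy matters: a shallow reference to the same dictionary would be modified along with the live state and could not serve as a recovery point.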


1. THE MASKING TYPE OF FAULT TOLERANCE: This is the most desirable type but also the most expensive to implement. Applications with this kind of fault tolerance tolerate faults transparently, preserving both safety and liveness. At the other extreme, the case in which neither safety nor liveness is guaranteed is the least desirable. Of the two intermediate types, fail-safe (safety is preserved but liveness may not be) is favored over non-masking, and is an active area of research, because of the importance of leaving the system in a safe state.

2. THE NON-MASKING TYPE OF FAULT TOLERANCE: The output of the system may not be desirable or correct, but a result is still delivered. Recently, a specialization of non-masking fault tolerance called self-stabilization has been actively researched. Programs of this kind are able to withstand any kind of transient fault. However, such programs are difficult to construct and test.
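To make self-stabilization concrete, the sketch below simulates Dijkstra's K-state token ring, a classic textbook example that the text above does not itself discuss. Each process holds a counter; a process is "privileged" (holds the token) when its counter relates to its left neighbor's in a particular way. Starting from an arbitrary, possibly corrupted state, the ring converges to exactly one privilege, provided K is at least the number of processes. The function names and the random scheduler are assumptions made for this illustration.

```python
import random


def privileged(states, i):
    """True if process i currently holds a privilege (token)."""
    if i == 0:
        return states[0] == states[-1]
    return states[i] != states[i - 1]


def count_privileges(states):
    return sum(privileged(states, i) for i in range(len(states)))


def step(states, k, rng):
    """Let one arbitrarily chosen privileged process make its move."""
    candidates = [i for i in range(len(states)) if privileged(states, i)]
    i = rng.choice(candidates)      # at least one process is always privileged
    if i == 0:
        states[0] = (states[0] + 1) % k
    else:
        states[i] = states[i - 1]


def stabilize(states, k, max_steps=1000, seed=0):
    """Run until exactly one privilege remains; return the steps taken."""
    rng = random.Random(seed)
    for steps in range(max_steps):
        if count_privileges(states) == 1:
            return steps
        step(states, k, rng)
    return None                     # did not converge within max_steps


ring = [3, 1, 4, 1, 5]              # arbitrary state, e.g. after a transient fault
steps_taken = stabilize(ring, k=6)
print(steps_taken, ring, count_privileges(ring))
```

Once a single privilege remains, every further move passes it to the next process, so the ring stays in a legitimate state until the next fault.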


The basic characteristics of fault tolerance require:

1. No single point of failure – If a system experiences a failure, it must continue to operate without interruption during the repair process.

2. Fault isolation to the failing component – When a failure occurs, the system must be able to isolate the failure to the offending component. This requires the addition of dedicated failure detection mechanisms that exist only for the purpose of fault isolation. Recovery from a fault condition requires classifying the fault or failing component. The National Institute of Standards and Technology (NIST) categorizes faults based on locality, cause, duration, and effect.

3. Fault containment to prevent propagation of the failure – Some failure mechanisms can cause a system to fail by propagating the failure to the rest of the system. An example of this kind of failure is the "rogue transmitter" which can swamp legitimate communication in a system and cause overall system failure. Firewalls or other mechanisms that isolate a rogue transmitter or failing component to protect the system are required.

4. Availability of reversion modes.

In addition, fault-tolerant systems are characterized in terms of both planned service outages and unplanned service outages. These are usually measured at the application level and not just at a hardware level. The figure of merit is called availability and is expressed as a percentage. Fault-tolerant systems are typically based on the concept of redundancy.
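As a small worked example of the availability figure, the Python sketch below computes availability as the percentage of an observation period during which the service was up, counting both planned and unplanned outages. The function name, its parameters, and the sample outage figures are assumptions chosen for illustration.

```python
def availability_percent(period_hours, planned_outage_hours, unplanned_outage_hours):
    """Percentage of the period during which the service was available."""
    downtime = planned_outage_hours + unplanned_outage_hours
    if not 0 <= downtime <= period_hours:
        raise ValueError("downtime must lie between 0 and the period length")
    return 100.0 * (period_hours - downtime) / period_hours


HOURS_PER_YEAR = 365 * 24   # 8760 hours in a non-leap year

# Hypothetical year with 4 h of planned maintenance and 4.76 h of unplanned outages.
result = availability_percent(HOURS_PER_YEAR, 4.0, 4.76)
print(f"{result:.3f}%")     # about 99.9 percent ("three nines")
```

Under this definition, roughly 8.76 hours of total downtime per year corresponds to 99.9 percent availability.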

Redundancy: Redundancy is the provision of functional capabilities that would be unnecessary in a fault-free environment. This can consist of backup components which automatically "kick in" should one component fail. For example, large cargo trucks can lose a tire without any major consequences. They have many tires, and no one tire is critical (with the exception of the front tires, which are used to steer). The idea of incorporating redundancy in order to improve the reliability of a system was pioneered by John von Neumann in the 1950s.

Two kinds of redundancy are possible: space redundancy and time redundancy.

Space redundancy provides additional components, functions, or data items that are unnecessary for fault-free operation. Space redundancy is further classified into hardware, software and information redundancy, depending on the type of redundant resources added to the system. In time redundancy the computation or data transmission is repeated and the result is compared to a stored copy of the previous result.
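The two kinds of redundancy can be sketched as follows. For space redundancy, the example uses majority voting over three replicas, one common scheme chosen here for illustration; for time redundancy, it repeats the computation and compares the result with the stored first result, as described above. All function names are hypothetical, and results are assumed to be hashable values.

```python
from collections import Counter


def majority_vote(results):
    """Space redundancy: return the value a strict majority of replicas agree on."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replicas")
    return value


def run_twice_and_compare(compute, x):
    """Time redundancy: repeat the computation and compare with the stored result."""
    first = compute(x)     # stored copy of the previous result
    second = compute(x)    # repeated computation
    if first != second:
        raise RuntimeError("results differ: transient fault suspected")
    return first


def square(x):
    return x * x


def faulty_square(x):
    return x * x + 1       # simulates a replica with a permanent fault


replicas = [square, square, faulty_square]
print(majority_vote([replica(4) for replica in replicas]))  # faulty replica is outvoted
print(run_twice_and_compare(square, 4))
```

Voting masks a faulty replica at the cost of extra hardware or processes, while repetition costs extra time and can only expose faults that do not recur identically on the second run.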
