Fault tree analysis (FTA) is a type of failure analysis in which an undesired state of a system is examined. This analysis method is mainly used in safety engineering and reliability engineering to understand how systems can fail, to identify the best ways to reduce risk, and to determine (or get a feeling for) event rates of a safety accident or a particular system level (functional) failure. FTA is used in the aerospace,[1] nuclear power, chemical and process,[2][3][4] pharmaceutical,[5] petrochemical and other high-hazard industries. It is also used in fields as diverse as risk factor identification relating to social service system failure.[6] FTA is also used in software engineering for debugging purposes and is closely related to the cause-elimination technique used to detect bugs.
In aerospace, the more general term "system failure condition" is used for the "undesired state" / top event of the fault tree. These conditions are classified by the severity of their effects. The most severe conditions require the most extensive fault tree analysis. These system failure conditions and their classification are often previously determined in the functional hazard analysis.
Fault tree analysis can be used to:[7][8]
Fault tree analysis (FTA) was originally developed in 1962 at Bell Laboratories by H.A. Watson, under a U.S. Air Force Ballistics Systems Division contract to evaluate the Minuteman I Intercontinental Ballistic Missile (ICBM) Launch Control System.[9][10][11][12] The use of fault trees has since gained widespread support and is often used as a failure analysis tool by reliability experts.[13] Following the first published use of FTA in the 1962 Minuteman I Launch Control Safety Study, Boeing and AVCO expanded use of FTA to the entire Minuteman II system in 1963–1964. FTA received extensive coverage at a 1965 System Safety Symposium in Seattle sponsored by Boeing and the University of Washington.[14] Boeing began using FTA for civil aircraft design around 1966.[15][16]
Subsequently, within the U.S. military, application of FTA for use with fuses was explored by Picatinny Arsenal in the 1960s and 1970s.[17] In 1976 the U.S. Army Materiel Command incorporated FTA into an Engineering Design Handbook on Design for Reliability.[18] The Reliability Analysis Center at Rome Laboratory and its successor organizations, now with the Defense Technical Information Center (Reliability Information Analysis Center, and now Defense Systems Information Analysis Center[19]), have published documents on FTA and reliability block diagrams since the 1960s.[20][21][22] MIL-HDBK-338B provides a more recent reference.[23]
In 1970, the U.S. Federal Aviation Administration (FAA) published a change to 14 CFR 25.1309 airworthiness regulations for transport category aircraft in the Federal Register at 35 FR 5665 (1970-04-08). This change adopted failure probability criteria for aircraft systems and equipment and led to widespread use of FTA in civil aviation. In 1998, the FAA published Order 8040.4,[24] establishing risk management policy including hazard analysis in a range of critical activities beyond aircraft certification, including air traffic control and modernization of the U.S. National Airspace System. This led to the publication of the FAA System Safety Handbook, which describes the use of FTA in various types of formal hazard analysis.[25]
Early in the Apollo program the question was asked about the probability of successfully sending astronauts to the moon and returning them safely to Earth. A risk, or reliability, calculation of some sort was performed and the result was a mission success probability that was unacceptably low. This result discouraged NASA from further quantitative risk or reliability analysis until after the Challenger accident in 1986. Instead, NASA decided to rely on the use of failure modes and effects analysis (FMEA) and other qualitative methods for system safety assessments. After the Challenger accident, the importance of probabilistic risk assessment (PRA) and FTA in systems risk and reliability analysis was recognized, and their use at NASA began to grow. FTA is now considered one of the most important system reliability and safety analysis techniques.[26]
Within the nuclear power industry, the U.S. Nuclear Regulatory Commission began using PRA methods including FTA in 1975, and significantly expanded PRA research following the 1979 incident at Three Mile Island.[27] This eventually led to the 1981 publication of the NRC Fault Tree Handbook NUREG–0492,[28] and mandatory use of PRA under the NRC's regulatory authority.
Following process industry disasters such as the 1984 Bhopal disaster and 1988 Piper Alpha explosion, in 1992 the United States Department of Labor Occupational Safety and Health Administration (OSHA) published in the Federal Register at 57 FR 6356 (1992-02-24) its Process Safety Management (PSM) standard in 29 CFR 1910.119.[29] OSHA PSM recognizes FTA as an acceptable method for process hazard analysis (PHA).
Today FTA is widely used in system safety and reliability engineering, and in all major fields of engineering.
FTA methodology is described in several industry and government standards, including NRC NUREG–0492 for the nuclear power industry, an aerospace-oriented revision to NUREG–0492 for use by NASA,[26] SAE ARP4761 for civil aerospace, and MIL–HDBK–338 for military systems. IEC standard IEC 61025[30] is intended for cross-industry use and has been adopted as European Norm EN 61025.
Any sufficiently complex system is subject to failure as a result of one or more subsystems failing. The likelihood of failure, however, can often be reduced through improved system design. Fault tree analysis maps the relationship between faults, subsystems, and redundant safety design elements by creating a logic diagram of the overall system.
The undesired outcome is taken as the root ('top event') of a tree of logic. For instance, the undesired outcome of a metal stamping press operation being considered might be a human appendage being stamped. Working backward from this top event it might be determined that there are two ways this could happen: during normal operation or during maintenance operation. This condition is a logical OR. Considering the branch of the hazard occurring during normal operation, perhaps it is determined that there are two ways this could happen: the press cycles and harms the operator, or the press cycles and harms another person. This is another logical OR. A design improvement can be made by requiring the operator to press two separate buttons to cycle the machine—this is a safety feature in the form of a logical AND. The button may have an intrinsic failure rate—this becomes a fault stimulus that can be analyzed.
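The stamping-press example can be sketched as a small Boolean structure. This is only an illustration under stated assumptions: the `Gate` class, the `occurs` function, and all event names are hypothetical, and the tree is one possible reading of the example in which harm to the operator requires a hand in the press and both cycle buttons closed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    kind: str      # "AND" or "OR"
    inputs: tuple  # basic-event names (str) or nested Gate objects


def occurs(node, happened):
    """Return True if `node` occurs, given the set of basic events that happened."""
    if isinstance(node, str):  # a basic event is just a name
        return node in happened
    results = [occurs(child, happened) for child in node.inputs]
    return all(results) if node.kind == "AND" else any(results)


# Hypothetical tree for the press example.
# The two-button interlock appears as an AND: both buttons must be closed.
operator_harmed = Gate("AND", ("hand_in_press", "button_1_closed", "button_2_closed"))
normal_operation = Gate("OR", (operator_harmed, "bystander_in_press_during_cycle"))
top_event = Gate("OR", (normal_operation, "injury_during_maintenance"))

# With only one button closed the interlock blocks the top event.
print(occurs(top_event, {"hand_in_press", "button_1_closed"}))
# With both buttons closed (for example, one has failed closed) it occurs.
print(occurs(top_event, {"hand_in_press", "button_1_closed", "button_2_closed"}))
```

The first call prints `False` and the second prints `True`, which shows how the AND gate raises the number of simultaneous conditions needed to reach the top event.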
When fault trees are labeled with actual numbers for failure probabilities, computer programs can calculate failure probabilities from fault trees. When a specific event is found to have more than one effect event, i.e. it affects several subsystems, it is called a common cause or common mode. Graphically speaking, it means this event will appear at several locations in the tree. Common causes introduce dependency relations between events. The probability computations for a tree that contains common causes are much more complicated than those for regular trees in which all events are considered independent. Not all software tools available on the market provide such capability.
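The effect of a common cause can be illustrated numerically. The sketch below assumes a hypothetical tree TOP = (A OR C) AND (B OR C), where C is a shared event that appears under both branches, and compares a naive gate-by-gate calculation (which wrongly treats the two branches as independent) with an exact enumeration over all basic-event states. The probabilities are made-up values.

```python
from itertools import product


def top_event(a, b, c):
    """Hypothetical structure: event c appears under both OR gates."""
    return (a or c) and (b or c)


def exact_probability(p):
    """Sum the probability of every basic-event state in which the top event occurs."""
    names = list(p)
    total = 0.0
    for states in product([False, True], repeat=len(names)):
        weight = 1.0
        for name, state in zip(names, states):
            weight *= p[name] if state else 1.0 - p[name]
        if top_event(**dict(zip(names, states))):
            total += weight
    return total


def naive_probability(p):
    """Gate-by-gate product that ignores the shared event (incorrect here)."""
    p_left = 1 - (1 - p["a"]) * (1 - p["c"])
    p_right = 1 - (1 - p["b"]) * (1 - p["c"])
    return p_left * p_right


p = {"a": 0.01, "b": 0.01, "c": 0.001}  # illustrative values only
print(exact_probability(p))
print(naive_probability(p))
```

For these values the exact result is dominated by the shared event C alone, while the naive product is roughly an order of magnitude smaller, so ignoring the dependency would understate the risk.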
The tree is usually written out using conventional logic gate symbols. A cut set is a combination of events, typically component failures, causing the top event. If no event can be removed from a cut set without failing to cause the top event, then it is called a minimal cut set.
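Minimal cut sets for a small tree of AND and OR gates can be found by expanding the gates from the top down and then discarding any cut set that strictly contains another. The following is a minimal sketch of that idea, not the algorithm of any particular tool; the dictionary layout, function names, and event names are assumptions made for illustration, and the brute-force expansion would not scale to large trees.

```python
from itertools import product


def cut_sets(node, gates):
    """Return all cut sets of `node` as a set of frozensets of basic events."""
    if node not in gates:  # anything that is not a gate is a basic event
        return {frozenset([node])}
    kind, inputs = gates[node]
    child_sets = [cut_sets(child, gates) for child in inputs]
    if kind == "OR":
        # Any cut set of any input causes the output.
        return set().union(*child_sets)
    # AND: one cut set from every input must occur together.
    return {frozenset().union(*combo) for combo in product(*child_sets)}


def minimal_cut_sets(node, gates):
    """Keep only cut sets that have no proper subset among the cut sets."""
    sets = cut_sets(node, gates)
    return {s for s in sets if not any(other < s for other in sets)}


# Hypothetical system: two redundant pumps that share one power supply.
gates = {
    "TOP": ("AND", ["G1", "G2"]),
    "G1": ("OR", ["pump_a_fails", "power_lost"]),
    "G2": ("OR", ["pump_b_fails", "power_lost"]),
}

for cs in sorted(sorted(s) for s in minimal_cut_sets("TOP", gates)):
    print(cs)
```

Here the minimal cut sets are the single event `power_lost` and the pair `pump_a_fails` with `pump_b_fails`; the single-event cut set identifies the shared power supply as a single point of failure.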
Some industries use both fault trees and event trees (see Probabilistic Risk Assessment). An event tree starts from an undesired initiator (loss of critical supply, component failure etc.) and follows possible further system events through to a series of final consequences. As each new event is considered, a new node on the tree is added with a split of probabilities of taking either branch. The probabilities of a range of 'top events' arising from the initial event can then be seen.
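The branching described above can be sketched as follows. The example assumes a simple binary event tree in which each branch probability is the same regardless of the path taken so far; the function name, the branch names, and the probabilities are hypothetical.

```python
def event_tree_outcomes(branches):
    """Expand a binary event tree.

    `branches` is a list of (name, p_success) pairs considered in order after
    the initiating event. Returns a dict mapping each path, a tuple of
    (name, succeeded) pairs, to its conditional probability given the initiator.
    """
    outcomes = {(): 1.0}
    for name, p_success in branches:
        next_outcomes = {}
        for path, prob in outcomes.items():
            # Each existing path splits into a success and a failure branch.
            next_outcomes[path + ((name, True),)] = prob * p_success
            next_outcomes[path + ((name, False),)] = prob * (1.0 - p_success)
        outcomes = next_outcomes
    return outcomes


# Hypothetical initiator: loss of cooling, followed by two safety functions.
outcomes = event_tree_outcomes([("alarm_works", 0.99), ("backup_pump_starts", 0.95)])
for path, prob in outcomes.items():
    print(path, prob)
```

The path probabilities sum to one, and the path on which both safety functions fail carries the probability of the worst consequence given the initiator.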
Classic programs include the Electric Power Research Institute's (EPRI) CAFTA software, which is used by many of the US nuclear power plants and by a majority of US and international aerospace manufacturers, and the Idaho National Laboratory's SAPHIRE, which is used by the U.S. Government to evaluate the safety and reliability of nuclear reactors, the Space Shuttle, and the International Space Station. Outside the US, the software RiskSpectrum is a popular tool for fault tree and event tree analysis, and is licensed for use at more than 60% of the world's nuclear power plants for probabilistic safety assessment. Professional-grade free software is also widely available; SCRAM[31] is an open-source tool that implements the Open-PSA Model Exchange Format[32] open standard for probabilistic safety assessment applications.
The basic symbols used in FTA are grouped as events, gates, and transfer symbols. Minor variations may be used in FTA software.
Event symbols are used for primary events and intermediate events. Primary events are not further developed on the fault tree. Intermediate events are found at the output of a gate. The event symbols are shown below:
The primary event symbols are typically used as follows:
An intermediate event gate can be used immediately above a primary event to provide more room to type the event description.
FTA is a top-to-bottom approach.
Gate symbols describe the relationship between input and output events. The symbols are derived from Boolean logic symbols:
The gates work as follows:
Transfer symbols are used to connect the inputs and outputs of related fault trees, such as the fault tree of a subsystem to its system. NASA has prepared a comprehensive document that presents FTA through practical incidents.[26]
Events in a fault tree are associated with statistical probabilities or Poisson-Exponentially distributed constant rates. For example, component failures may typically occur at some constant failure rate λ (a constant hazard function). In this simplest case, failure probability depends on the rate λ and the exposure time t:
$$P = 1 - e^{-\lambda t}$$
where:
$$P \approx \lambda t \quad \text{if } \lambda t \ll 1$$
A fault tree is often normalized to a given time interval, such as a flight hour or an average mission time. Event probabilities depend on the relationship of the event hazard function to this interval.
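For a constant failure rate λ and exposure time t, the failure probability is P = 1 − e^(−λt), which is close to λt when λt is small. The short sketch below compares the two; the function name and the numerical values are illustrative assumptions.

```python
import math


def failure_probability(rate, exposure_time):
    """P = 1 - exp(-rate * exposure_time) for a constant failure rate.

    math.expm1 is used to keep precision when rate * exposure_time is tiny.
    """
    return -math.expm1(-rate * exposure_time)


rate = 1e-5           # failures per hour (illustrative)
exposure_time = 10.0  # hours, e.g. one mission (illustrative)

exact = failure_probability(rate, exposure_time)
approx = rate * exposure_time  # small-argument approximation
print(exact, approx)
```

The linear approximation slightly overestimates the exact value, so it errs on the conservative side.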
Unlike conventional logic gate diagrams in which inputs and outputs hold the binary values of TRUE (1) or FALSE (0), the gates in a fault tree output probabilities related to the set operations of Boolean logic. The probability of a gate's output event depends on the input event probabilities.
An AND gate represents a combination of independent events. That is, the probability of any input event to an AND gate is unaffected by any other input event to the same gate. In set theoretic terms, this is equivalent to the intersection of the input event sets, and the probability of the AND gate output is given by:
$$P(A \text{ and } B) = P(A \cap B) = P(A)\,P(B)$$
An OR gate, on the other hand, corresponds to set union:
$$P(A \text{ or } B) = P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
Since failure probabilities on fault trees tend to be small (less than 0.01), P (A ∩ B) usually becomes a very small error term, and the output of an OR gate may be conservatively approximated by using an assumption that the inputs are mutually exclusive events:
$$P(A \text{ or } B) \approx P(A) + P(B), \qquad P(A \cap B) \approx 0$$
An exclusive OR gate with two inputs represents the probability that one or the other input, but not both, occurs:
$$P(A \text{ xor } B) = P(A) + P(B) - 2\,P(A \cap B)$$
Again, since P (A ∩ B) usually becomes a very small error term, the exclusive OR gate has limited value in a fault tree.
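The gate probabilities can be checked numerically. The sketch below assumes independent inputs, so that P(A ∩ B) = P(A)·P(B); the function names and input probabilities are illustrative.

```python
from math import prod


def and_gate(*p):
    """Probability that all independent inputs occur."""
    return prod(p)


def or_gate(*p):
    """Probability that at least one independent input occurs."""
    return 1.0 - prod(1.0 - x for x in p)


def or_gate_rare(*p):
    """Conservative approximation: treat the inputs as mutually exclusive."""
    return sum(p)


def xor_gate(pa, pb):
    """Probability that exactly one of two independent inputs occurs."""
    return pa + pb - 2.0 * pa * pb


pa, pb = 0.001, 0.002  # illustrative input probabilities
print(and_gate(pa, pb))
print(or_gate(pa, pb), or_gate_rare(pa, pb), xor_gate(pa, pb))
```

For inputs this small, the exact OR, the approximate OR, and the exclusive OR differ only in the sixth decimal place, which is why the approximation is widely used and the exclusive OR gate adds little.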
Quite often, Poisson-Exponentially distributed rates[33] are used to quantify a fault tree instead of probabilities. Rates are often modeled as constant in time while probability is a function of time. Poisson-Exponential events are modelled as infinitely short so no two events can overlap. An OR gate is the superposition (addition of rates) of the two input failure frequencies or failure rates which are modeled as Poisson point processes. The output of an AND gate is calculated using the unavailability (Q1) of one event thinning the Poisson point process of the other event (λ2). The unavailability (Q2) of the other event then thins the Poisson point process of the first event (λ1). The two resulting Poisson point processes are superimposed according to the following equations.
The output of an AND gate is the combination of independent input events 1 and 2 to the AND gate:
$$Q_{\text{AND}} = Q_1 Q_2$$
$$\lambda_{\text{AND}} = Q_1 \lambda_2 + Q_2 \lambda_1$$
In a fault tree, unavailability (Q) may be defined as the unavailability of safe operation and may not refer to the unavailability of the system operation depending on how the fault tree was structured. The input terms to the fault tree must be carefully defined.
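The rate-based gate rules described above can be sketched as follows: an OR gate adds the input rates, and an AND gate thins each input's rate by the other input's unavailability and adds the two results. The function names and the numerical values are assumptions chosen for illustration, and the inputs are taken to be independent.

```python
def or_gate_rate(rate_1, rate_2):
    """Superposition of two Poisson point processes: the rates add."""
    return rate_1 + rate_2


def and_gate_rate(rate_1, q_1, rate_2, q_2):
    """Each process is thinned by the other event's unavailability, then summed."""
    return q_1 * rate_2 + q_2 * rate_1


def and_gate_unavailability(q_1, q_2):
    """Both independent inputs are unavailable at the same time."""
    return q_1 * q_2


# Illustrative inputs: rates per hour and dimensionless unavailabilities.
rate_1, q_1 = 1e-4, 1e-3
rate_2, q_2 = 2e-4, 5e-3

print(or_gate_rate(rate_1, rate_2))
print(and_gate_rate(rate_1, q_1, rate_2, q_2))
print(and_gate_unavailability(q_1, q_2))
```

With these values the AND gate output rate is several hundred times lower than either input rate, which reflects the benefit of requiring two independent failures to coincide.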
Many different approaches can be used to model an FTA, but the most common and popular way can be summarized in a few steps. A single fault tree is used to analyze one and only one undesired event, which may be subsequently fed into another fault tree as a basic event. Though the nature of the undesired event may vary dramatically, an FTA follows the same procedure for any undesired event, be it a delay of 0.25 ms for the generation of electrical power, an undetected cargo bay fire, or the random, unintended launch of an ICBM.
FTA involves five steps:
FTA is a deductive, top-down method aimed at analyzing the effects of initiating faults and events on a complex system. This contrasts with failure mode and effects analysis (FMEA), which is an inductive, bottom-up analysis method aimed at analyzing the effects of single component or function failures on equipment or subsystems. FTA is very good at showing how resistant a system is to single or multiple initiating faults. It is not good at finding all possible initiating faults. FMEA is good at exhaustively cataloging initiating faults, and identifying their local effects. It is not good at examining multiple failures or their effects at a system level. FTA considers external events, FMEA does not.[35] In civil aerospace the usual practice is to perform both FTA and FMEA, with a failure mode effects summary (FMES) as the interface between FMEA and FTA.
Alternatives to FTA include dependence diagram (DD), also known as reliability block diagram (RBD), and Markov analysis. A dependence diagram is equivalent to a success tree analysis (STA), the logical inverse of an FTA, and depicts the system using paths instead of gates. DD and STA produce probability of success (i.e., avoiding a top event) rather than probability of a top event.