The FMEA method
A powerful reliability tool for data analysis that lasts for decades
By William Goble
A Failure Mode and Effects Analysis (FMEA) is a system reliability and safety review technique created in the 1960s as part of the U.S. Minuteman rocket program to find and mitigate unanticipated design problems. The technique is rather simple: the failure modes of each component in a given system are listed in a table, and the effect of each failure is postulated and documented. The method is systematic, effective, and detailed, although it is sometimes called time-consuming and repetitive. It is so effective because every failure mode of every single component is examined. Here is a table example following the format from MIL-STD-1629, one of the original references (see “FMEA tabular format” table).
Column one gives the name of the component under review, while column two lists the component’s identification number (part number or code number). Together, columns one and two must uniquely identify the reviewed component. Column three describes the function of the component, while column four describes the predicted failure modes. One row will likely be used for each component failure mode. Column five records the known cause of the failure mode, if applicable. The effect of that failure on the system is recorded in column six. The remaining columns vary, depending on which of the many versions of FMEA is being followed.
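The six common columns can be sketched as a small record type. This is a minimal illustration, not the layout of any particular standard: the class name, field names, helper function, and sample rows are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FmeaRow:
    """One table row: a single failure mode of a single component."""
    name: str          # column 1: component name
    code: str          # column 2: part number or code number
    function: str      # column 3: what the component does
    failure_mode: str  # column 4: predicted failure mode
    cause: str         # column 5: known cause, empty if not applicable
    effect: str        # column 6: effect on the system


def failure_modes_by_component(rows):
    """Group failure modes under the (name, code) pair that identifies a component."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row.name, row.code)].append(row.failure_mode)
    return dict(grouped)


# Hypothetical rows, for illustration only (not taken from the article's tables).
rows = [
    FmeaRow("Pump", "P-101", "Move coolant", "Fails to start",
            "Motor winding open", "No coolant flow"),
    FmeaRow("Pump", "P-101", "Move coolant", "Leaks",
            "Seal wear", "Reduced coolant flow"),
    FmeaRow("Relay", "K-7", "Switch pump power", "Contacts weld",
            "", "Pump cannot be stopped"),
]

for component, modes in failure_modes_by_component(rows).items():
    print(component, modes)
```

Keeping one row per failure mode, keyed by columns one and two, makes it easy to confirm that every component has had all of its failure modes listed.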
FMEA finds problems
The FMEA method has grown in popularity over the years and has become an essential part of many design processes, especially in the automotive industry. This is primarily because it has proven effective and useful over time, despite its drawbacks. During an FMEA, one may hear “Oh, no,” as it becomes clear that a particular component failure effect is a serious problem that had previously gone unrecognized. When these problems are significant enough, a corrective action item is recorded. The design is then improved to detect, avoid, or control the problem.
Process industry applications
Variations of the FMEA technique are used in the process industries. One place where FMEA is used is hazard identification in petrochemical plant design. The technique fits in nicely with the familiar Hazard and Operability Study (HAZOP) technique because the two methods are nearly the same. Both are variations on a common theme: they list the components of a system in a tabular format. The fundamental difference is that HAZOP uses guide words to stimulate the participants to identify system abnormalities, whereas FMEA uses known equipment failure modes.
A variation of the FMEA technique as applied to control systems is called Control Hazards and Operability Analysis (CHAZOP). Known failure modes of control equipment, such as a basic process control system (BPCS), an actuator-valve assembly, or a sensing transmitter, are listed, and the effect of each failure is documented. An action item is recorded when this effect is a significant problem, prompting an improvement in the control system design.
An example FMEA
The “Simple reactor” figure shows a simplified reactor with an emergency cooling system from Control Systems Safety Evaluation and Reliability, Third Edition, Chapter 5 (www.isa.org/link/BK_CSSER). The system consists of a gravity-feed water tank, a control valve (VALVE1), a cooling jacket around the reactor, a cooling jacket drain pipe, a temperature-sensing switch (TSW1), and a power supply. In normal operation, the temperature-sensing switch is closed (conducting) because the reactor temperature is below a dangerous limit. Electrical current flows from the power supply through the valve and the temperature-sensing switch. This electrical current (energy) keeps the valve closed. If the temperature inside the reactor gets too high, the temperature-sensing switch opens. This stops the flow of electrical current, and the control valve opens. Cooling water flows from the tank, through the valve, then the cooling jacket, and finally the jacket drain pipe. This water flow cools the reactor, lowering its temperature.
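The behavior described above can be expressed as simple Boolean logic, which makes it easier to reason about each component failure in turn. This is a minimal sketch based only on the description in the text; the function and parameter names are hypothetical.

```python
def valve_is_open(switch_conducting: bool, power_on: bool) -> bool:
    """The valve is held closed only while current flows through it."""
    current_flows = switch_conducting and power_on
    return not current_flows


def reactor_is_cooled(switch_conducting: bool, power_on: bool,
                      tank_has_water: bool = True,
                      drain_clear: bool = True) -> bool:
    """Cooling needs an open valve, water in the tank, and a clear drain pipe."""
    return (valve_is_open(switch_conducting, power_on)
            and tank_has_water and drain_clear)


# Normal operation: switch closed, current flows, valve held closed.
print(reactor_is_cooled(switch_conducting=True, power_on=True))    # False
# High temperature: the switch opens and cooling water flows.
print(reactor_is_cooled(switch_conducting=False, power_on=True))   # True
# Loss of power also opens the valve, even at normal temperature.
print(reactor_is_cooled(switch_conducting=True, power_on=False))   # True
# High temperature with a clogged drain pipe: no cooling flow.
print(reactor_is_cooled(switch_conducting=False, power_on=True,
                        drain_clear=False))                        # False
```

A switch that stays conducting when the temperature is too high corresponds to the first call: the valve never opens and the reactor is not cooled.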
The FMEA procedure requires the creation of a table with all failure modes listed for each of the system components. The “Simple reactor FMEA” table shows the results of this example system-level FMEA. The FMEA has identified six critical items that should be reviewed to determine the need for correction.
In the case of the simple reactor, the system designer may consider installing two temperature switches and wiring them in series. Alternatively, the system designer may choose a smart IEC 61508 safety-certified temperature transmitter with automatic diagnostics and a relay output. The certified transmitter would reduce the proof-testing effort needed to detect a temperature-sensing switch that has failed shorted. A second drain pipe could be installed in parallel with the first, preventing a single clogged drain from causing a critical failure. A level sensor on the water tank could warn of insufficient water level. Many other design changes could be made to mitigate the critical failures or to reduce the number of false trips.
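The effect of wiring two temperature switches in series can be checked with a few lines of logic. This is a hedged sketch that assumes ideal switches and considers only the shorted and open failure modes; the function names are hypothetical.

```python
def loop_conducts(switch_states):
    """Switches wired in series conduct only if every switch conducts."""
    return all(switch_states)


def trips_on_high_temperature(failed_shorted):
    """Return True if the loop opens when the temperature is too high.

    failed_shorted holds one flag per switch. A healthy switch opens on
    high temperature; a shorted switch keeps conducting regardless.
    """
    states = [bool(shorted) for shorted in failed_shorted]
    return not loop_conducts(states)


def false_trip_at_normal_temperature(failed_open):
    """Return True if the loop opens although the temperature is normal.

    failed_open holds one flag per switch. A healthy switch conducts at
    normal temperature; a switch that has failed open does not.
    """
    states = [not is_open for is_open in failed_open]
    return not loop_conducts(states)


# One switch, failed shorted: the cooling system never trips.
print(trips_on_high_temperature([True]))                 # False
# Two switches in series, one failed shorted: the healthy one still trips.
print(trips_on_high_temperature([True, False]))          # True
# The price of the series wiring: either switch failing open causes a false trip.
print(false_trip_at_normal_temperature([False, True]))   # True
```

The series arrangement removes the single shorted switch as a critical failure but adds a second component whose open failure causes a false trip, which is the kind of trade-off the analysis is meant to expose.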
FMEA method evolution
The FMEA method was expanded in the 1970s to include semi-quantitative ratings (a number between one and 10) for severity, likelihood, and detection. Four columns were then added to the table: three for the ratings and a fourth for the risk priority number (RPN), which is obtained by multiplying the three ratings. This expanded method is called a Failure Modes, Effects and Criticality Analysis (FMECA). The “FMECA reactor example” table shows the reactor example with RPNs added (columns 7, 8, 9, and 10).
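The RPN calculation is a straightforward product of the three ratings. The sketch below uses hypothetical failure mode labels and ratings, not the values from the “FMECA reactor example” table, and the function name is an assumption.

```python
def risk_priority_number(severity: int, likelihood: int, detection: int) -> int:
    """Multiply the three 1-to-10 ratings to obtain the RPN."""
    for name, rating in (("severity", severity),
                         ("likelihood", likelihood),
                         ("detection", detection)):
        if not 1 <= rating <= 10:
            raise ValueError(f"{name} rating must be between 1 and 10, got {rating}")
    return severity * likelihood * detection


# Hypothetical ratings: (severity, likelihood, detection).
ratings = {
    "Failure mode A": (9, 3, 7),
    "Failure mode B": (9, 2, 2),
    "Failure mode C": (4, 5, 3),
}

# Rank the failure modes from highest to lowest RPN.
ranked = sorted(
    ((risk_priority_number(*r), label) for label, r in ratings.items()),
    reverse=True,
)
for rpn, label in ranked:
    print(label, rpn)
```

With these numbers the ranking is A (189), C (60), B (36). Note that B has the same severity as A but ranks last, so the individual ratings are worth reading alongside the product.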
FMEA techniques have continued to evolve over the years. Some of the more recent variations apply the method to processes as well as designs. Just as components are listed in a design review, each step in a process is listed. For each step, all anticipated ways in which the step can go wrong are listed, which is equivalent to listing the known failure modes of each component. Once the list has been completed, the method is the same as a design FMEA. After these two fundamentally different types of FMEA were created, the “design FMEA” was called DFMEA and the “process FMEA” was called PFMEA in some literature. Like the design FMEA, the process FMEA has been shown to be effective in finding unanticipated problems.
Failure Modes, Effects and Diagnostic Analysis
The ever-evolving FMEA method led to the development of the Failure Modes, Effects and Diagnostic Analysis (FMEDA) technique. In the late 1980s, a need arose to model the automatic diagnostic capability of smart devices. There was a new “architecture” in the safety PLC market called one out of two with diagnostic switch (1oo2D), which competed with the existing triple modular redundant architecture (called two out of three, 2oo3). Because the impact of this new architecture on safety and availability was highly dependent on diagnostic coverage, a measurement of the diagnostic coverage was important. An FMEDA accomplishes this by adding columns for the failure rate of each system failure mode and for the probability of diagnostic detection on each line in the analysis.
Like the FMEA, the FMEDA technique lists all components and their failure modes, as well as the effect of each component failure mode. The table adds columns that record the resulting system failure mode, the probability that any diagnostic will detect that particular failure, and the quantitative failure rate for that failure mode. When the FMEDA is completed, the diagnostic coverage factor is calculated as a failure-rate-weighted average of the diagnostic coverage of all parts.
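The failure-rate-weighted average can be written out in a few lines. This is a minimal sketch with hypothetical failure rates and detection probabilities; in practice the calculation is often restricted to a particular class of failures (for example, dangerous failures), which this simplified version does not attempt.

```python
def diagnostic_coverage(modes):
    """Failure-rate-weighted average of the detection probabilities.

    modes is a list of (failure_rate, detection_probability) pairs,
    one per line of the analysis. Any consistent rate unit works.
    """
    total_rate = sum(rate for rate, _ in modes)
    if total_rate == 0:
        raise ValueError("total failure rate must be greater than zero")
    detected_rate = sum(rate * p_detect for rate, p_detect in modes)
    return detected_rate / total_rate


# Hypothetical lines: (failure rate, probability of diagnostic detection).
modes = [
    (10.0, 1.0),   # always detected
    (30.0, 0.5),   # detected half of the time
    (60.0, 0.0),   # never detected
]
print(diagnostic_coverage(modes))   # 0.25
```

The example shows why the weighting matters: a simple average of the three detection probabilities would be 0.5, but the undetected mode has the highest failure rate and pulls the coverage down to 0.25.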
Failure rate numbers and a failure mode distribution are required for each component in order to perform an FMEDA. Therefore, a component database of this information is required, as shown in the “FMEDA process” figure.
The component database must consider the key variables that impact component failure rates. These include environmental stress factors. Fortunately, standards exist to characterize the environments in the process industries, and profiles can be created. The “Environmental profiles for the process industries” table shows the profile set for the process industries from Electrical and Mechanical Component Reliability Handbook, Second Edition (www.exida.com).
The “safety factor” built into each particular product design is another important variable in the failure rate. This can be determined through a detailed study of each design, including the ratings of each component and expected stress conditions.
Field Failure Data Analysis for FMEDA
Design analysis can be used to create theoretical failure databases; however, accurate information is obtained only when the component failure rates and modes are based on a collection of field failure studies as shown in the “Field failure studies” figure. Any unexplained difference in a product failure rate calculated from field failure data and FMEDA must be resolved. Sometimes, the field failure data collection process needs improvement. Sometimes, the component database is upgraded, mostly by recognizing new failure modes and component types.
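A comparison between field data and an FMEDA prediction might look like the sketch below. It is only an illustration: it uses a simple point estimate (failures divided by unit operating hours), expresses rates in FIT (failures per billion hours, a unit the text does not name), and applies a hypothetical tolerance. A real study would also account for statistical confidence and the quality of the data collection.

```python
def field_failure_rate_fit(failures: int, unit_operating_hours: float) -> float:
    """Point estimate of the failure rate in FIT (failures per 1e9 hours)."""
    if unit_operating_hours <= 0:
        raise ValueError("unit operating hours must be greater than zero")
    return failures / unit_operating_hours * 1e9


def needs_review(field_fit: float, fmeda_fit: float, tolerance: float = 0.5) -> bool:
    """Flag a relative difference larger than the (hypothetical) tolerance."""
    return abs(field_fit - fmeda_fit) / fmeda_fit > tolerance


# Hypothetical study: 12 failures in 40 million unit operating hours.
field_fit = field_failure_rate_fit(failures=12, unit_operating_hours=40e6)
print(round(field_fit))                          # 300

# Compared against two hypothetical FMEDA predictions.
print(needs_review(field_fit, fmeda_fit=250.0))  # False
print(needs_review(field_fit, fmeda_fit=100.0))  # True
```

A flagged difference does not say which side is wrong; as the text notes, the resolution may lie in the data collection process or in the component database.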
Fortunately for the process industries, some functional safety certification bodies study field failure return data as part of most product assessments, providing a strong source of field failure data. Some projects also gather field failure data from end users. With more than 10 billion unit operating hours of field failure data from dozens of studies, the FMEDA component database continues to improve, especially for functional safety. The resulting FMEDA product data is commonly used in safety integrity verification calculations.
The FMEDA technique can also be used to evaluate manual proof test coverage of safety instrumented functions. This number is important when safety instrumented function verification calculations are done to determine whether a given design meets a particular safety integrity level. Any particular proof test procedure can detect some of the potentially dangerous failures, but not all. The FMEDA can identify which failures are, or are not, detected by the proof test. This is done by adding another column in which the probability of detection during the proof test is estimated for each component failure mode. While following this detailed, systematic method, it becomes clear that some potentially dangerous failures are not detected by a particular proof test.
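The extra proof test column can be processed in the same failure-rate-weighted way. The sketch below assumes the list has already been limited to the dangerous failure modes of interest; the mode names, rates, and probabilities are hypothetical.

```python
def proof_test_coverage(dangerous_modes):
    """Failure-rate-weighted fraction of dangerous failures found by the test.

    dangerous_modes is a list of (name, failure_rate, p_detect) tuples,
    where p_detect is the estimated probability that the proof test
    detects that failure mode.
    """
    total_rate = sum(rate for _, rate, _ in dangerous_modes)
    if total_rate == 0:
        raise ValueError("total failure rate must be greater than zero")
    detected_rate = sum(rate * p_detect for _, rate, p_detect in dangerous_modes)
    return detected_rate / total_rate


def missed_by_proof_test(dangerous_modes):
    """Names of the failure modes that the proof test never detects."""
    return [name for name, _, p_detect in dangerous_modes if p_detect == 0.0]


# Hypothetical dangerous failure modes for one device.
dangerous_modes = [
    ("Mode A", 40.0, 1.0),   # found by the proof test
    ("Mode B", 10.0, 0.0),   # not exercised by the proof test
]
print(proof_test_coverage(dangerous_modes))    # 0.8
print(missed_by_proof_test(dangerous_modes))   # ['Mode B']
```

Listing the missed modes by name is what lets a proof test procedure be improved, rather than only scored.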
Dealing with the negatives
The biggest challenge when performing an FMEA (or any of its variations) is the time it consumes. Many analysts have complained about the boring, time-consuming process. A strict and focused facilitator is needed to keep the process moving. It should always be remembered that solving the problems is not part of the analysis; they are solved once the analysis has been completed. If these rules are followed, the result is time-effective improvements in safety and reliability.
ABOUT THE AUTHOR
Dr. William Goble is a principal engineer and director of the functional safety certification group at exida, an accredited certification body. He has over 40 years of experience in electronic design, software, and safety system design. His Ph.D. is in quantitative reliability/safety analysis of automation systems.