# Equipment, don't fail me now

## Calculating failure probabilities works better with systematic approach

### FAST FORWARD

• Instrumentation specialists see benefits in designing safety functions to requirements.
• Specifying targets for failure probabilities, other variables, crucial for safe operation.
• Markov modeling offers different approach to calculating failure probabilities.

##### By Peter Morgan

Specialists in control and instrumentation once relied confidently on their own experience and good design practice to design protection systems. Now they must adhere to a quantitative approach when designing systems deemed safety systems. Even the ubiquitous burner management system (BMS) is, by virtue of its function, a safety instrumented system (SIS), and you should design it according to ISA 84.01 as well as the applicable National Fire Protection Association standard.

One step in this approach is calculating the target probability of failure on demand (PFD) for the system. Because calculating PFDs for repairable systems commonly seems complicated, the approach does not find favor with the average control and instrumentation specialist; some manufacturers defer the design analysis to others, or they do not do it at all. But the approach also benefits the design of general protective systems, in addition to meeting the mandatory requirements for a SIS.

The control and instrumentation specialist might not be involved in the design of the SIS logic solver. But the specialist will most likely be involved in specifying the target PFD for the SIS logic solver as well as specifying field instrumentation redundancy. Take a boiler drum level trip as an example: when the boiler drum level falls to an unsafe low level, the BMS is required to shut off the fuel to avoid steam generator damage. If the level transmitter is not working when the low level occurs, there will be no trip and, without operator action, equipment damage will result.

### Mean time between failures

When instruments see wide use in an industry, experience quickly establishes how long an instrument is likely to operate before it fails. If 1,000 level transmitters are in use and only 50 fail in a given year, we can say there is a 1-in-20 chance of our particular transmitter failing in a given year, or that the mean time between failures (MTBF) is 20 years. We do not have to wait 20 years or longer to find out what the MTBF is, since the widespread use of the transmitter provides us with the required statistics. Our transmitter could fail in its first operation, but on average it will fail once in 20 years.
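The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the fleet-statistics reasoning; the function name `mtbf_from_fleet` is hypothetical, and the calculation assumes a constant failure rate across all units.

```python
def mtbf_from_fleet(units_in_service, failures_per_year):
    """Estimate MTBF (years) from one year of fleet statistics.

    Assumes every unit has the same constant failure rate.
    """
    return units_in_service / failures_per_year


# Figures from the example in the text.
units = 1000       # level transmitters in use
failures = 50      # failures observed in one year

failure_rate = failures / units                # failures per unit per year
mtbf_years = mtbf_from_fleet(units, failures)  # reciprocal of the failure rate

print(f"failure rate = {failure_rate:.2f} per year")
print(f"MTBF         = {mtbf_years:.0f} years")
```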

### Mean time to repair

When the transmitter fails, we can replace it. This of course requires notification of the failure before the technician can be called in. That is why, in practical applications, we would continuously compare the level measurement with an alternate redundant measurement. Until we replace the transmitter, the boiler is not protected against a low-level excursion, which could occur at any time while the transmitter is in the failed state.

The repair time will have a significant effect on any calculation of the availability of the level trip function. The repair time itself depends on the maintenance resources at a particular plant site. For our purposes, we will assume a maintenance technician on site can replace the transmitter in four hours.

### Probability of failure on demand

For our example level transmitter, in a 20-year interval (the MTBF), the transmitter will not be available to detect a low level for a four-hour period (the mean time to repair). With 8,760 hours in a year, 20 years is 175,200 hours, so the probability of not tripping on low level is 4/175,200, or about 2.3E-5. The system's ability to function on demand depends not just on the input device PFD but also on the PFDs of the logic solver and final elements. For components in series, the overall PFD is approximately the sum of the individual PFDs. You might wonder why, with such a low PFD, redundant transmitters would ever be required. The calculation assumes failure is alarmed as soon as it occurs; this would not be the case without a second or third measurement to provide comparison and fault detection.
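As a rough sketch of this calculation, the snippet below computes the transmitter PFD as MTTR/MTBF and then combines it with the other elements of the loop. The logic solver and final element values are hypothetical placeholders, not figures from this article, and the series formula assumes the component failures are independent.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def pfd_single(mtbf_years, mttr_hours):
    """Fraction of time a repairable, immediately alarmed device is failed."""
    return mttr_hours / (mtbf_years * HOURS_PER_YEAR)


def pfd_series(pfds):
    """PFD of independent components in series.

    The function fails if any one component fails; for small values
    the result is very close to the simple sum of the PFDs.
    """
    survive = 1.0
    for p in pfds:
        survive *= 1.0 - p
    return 1.0 - survive


transmitter = pfd_single(mtbf_years=20, mttr_hours=4)

# Hypothetical values, for illustration only.
logic_solver = 1.0e-4
final_element = 5.0e-4

overall = pfd_series([transmitter, logic_solver, final_element])

print(f"transmitter PFD = {transmitter:.2e}")
print(f"overall PFD     = {overall:.2e}")
print(f"simple sum      = {transmitter + logic_solver + final_element:.2e}")
```

With values this small, the exact series result and the simple sum agree to several significant figures, which is why the sum is commonly used as a shortcut.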

### Markov modeling

While the intuitive process for calculating PFD serves well for single devices, as complexity increases by adding redundancy, you might benefit by applying a more systematic approach, Markov modeling. For illustrative purposes and to allow direct comparison to the previous calculation, consider a single repairable device. (For a more comprehensive introduction to Markov modeling, visit www.isa.org/link/MarkovPM.)

If the name of the analytical approach is unfamiliar to the practicing control and instrumentation specialist, the method will strike a chord for its similarity to the analysis of feedback control systems.

### Repairable device

The values P(n) and P(n+1) are the fractional times spent in each state, so 1 means the system is continuously in that state and 0 means the system is never in that state. The average rate at which the system transitions from one state P(n) to the other is given by the product P(n)λn , where λn is the average failure rate for the subject component. Similarly, if the system is returned to state P(n) from state P(n+1) at an average rate of µn (by repair or replacement), the rate at which the system returns to the unfailed state is given by the product P(n+1) µn.

The system depicted below will reach a steady state when the rate of return by repair, P(n+1)µn, is equal to the rate of failure, P(n)λn. In this condition, P(n) and P(n+1) are the final fractional times spent in each state. The fractional time spent in each state and the probability of being in that state at any time are one and the same.

The state equations in this case are:

λP0 - µ P1 = 0

P0 + P1 = 1 (since the sum of the fractional time in the normal state and the failed state is 1).

These simultaneous equations are easily solved by substitution to give:

P1 = λ/(µ + λ)

When repair rates are much greater than failure rates (µ much greater than λ), this simplifies to:

P1 ≈ λ/µ

This is the same as MTTR/MTBF, where MTTR is the mean time to repair and MTBF is the mean time between failures for the system component (transmitter).

For our example level transmitter, P1 is the fractional time the transmitter is in the failed state, and the PFD is, as before, 4/175,200, or about 2.3E-5.
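A short numerical check of the two-state model, using the example transmitter (MTBF of 20 years, four-hour repair), might look like the following. The function name is hypothetical; the formula is the one derived above.

```python
def failed_fraction(failure_rate, repair_rate):
    """Steady-state fraction of time in the failed state: P1 = lambda / (mu + lambda)."""
    return failure_rate / (repair_rate + failure_rate)


failure_rate = 1 / 20     # lambda, failures per year (MTBF = 20 years)
repair_rate = 8760 / 4    # mu, repairs per year (MTTR = 4 hours)

p1_exact = failed_fraction(failure_rate, repair_rate)
p1_approx = failure_rate / repair_rate   # valid when mu is much greater than lambda
p0 = 1.0 - p1_exact

# At steady state the flow out of state 0 balances the flow back from state 1.
imbalance = failure_rate * p0 - repair_rate * p1_exact

print(f"P1 exact       = {p1_exact:.4e}")
print(f"P1 approximate = {p1_approx:.4e}")
print(f"flow imbalance = {imbalance:.1e}")
```

The exact and approximate values differ only in the fifth significant figure, which supports using MTTR/MTBF when repair is fast compared with failure.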

### Transmitter logic

Calculating the fractional probability of the system being in a degraded or failed state, while always possible by hand, becomes arduous and prone to error as the system becomes more complex. This is where the Markov model and matrix inversion come into their own. For two-out-of-three transmitter logic, we can assume a trip will initiate when at least two transmitters indicate the trip state. One transmitter can fail without inhibiting a trip; however, two failures (to the dangerous state) will cause the system to fail dangerously.

The logic solver monitors individual signals to verify the transmitters are not frozen and compares each transmitter with the median value to alert the operator of a possible transmitter failure when there is a deviation. The system is periodically tested (annually) to expose undetected failures.

The Markov model at left identifies five states:

0 - System is OK (all transmitters normal).
1 - One transmitter is failed, and failure is detected.
2 - One transmitter is failed, but the failure is not detected.
3 - System is in the fail state, and the condition is detected.
4 - System is in the fail state, and the condition is not detected.

Defining the failure and repair rates:

λDD - Individual device (transmitter) failure rate for detected failures (per year)

λDU - Individual device (transmitter) failure rate for undetected failures (per year)

µ - Repair rate for detected failures (per year)

µT - Repair rate for undetected failures (per year)

The steady-state equations for the five states are:

-(3λDD + 3λDU) P0 + µ P1 + µT P2 + µ P3 + µT P4 = 0  (1)

3λDD P0 - (2λDD + 2λDU + µ) P1 = 0  (2)

3λDU P0 - (2λDD + 2λDU + µT) P2 = 0  (3)

2λDD P1 + 2λDD P2 - µ P3 = 0  (4)

2λDU P1 + 2λDU P2 - µT P4 = 0  (5)

P0 + P1 + P2 + P3 + P4 = 1  (6)

Adding the left side of equation (6) to the left side of equation (1), term by term, and adding 1 to the right side of equation (1) incorporates equation (6) and reduces the number of equations to five. The resulting square matrix of coefficients can then be inverted to solve for the fractional state probabilities.

[1 - (3λDD + 3λDU)] P0 + [1 + µ] P1 + [1 + µT] P2 + [1 + µ] P3 + [1 + µT] P4 = 1

3λDD P0 - (2λDD + 2λDU + µ) P1 = 0

3λDU P0 - (2λDD + 2λDU + µT) P2 = 0

2λDD P1 + 2λDD P2 - µ P3 = 0

2λDU P1 + 2λDU P2 - µT P4 = 0

For this example, the rates are:

λDD = 0.1 failures per year (10 years between detected failures)

λDU = 0.01 failures per year (100 years between undetected failures)

µ = 2,190 per year (number of hours in a year / repair time of 4 hours)

µT = 2 per year (2 / manual test interval in years)

Although an undetected failure could occur at any time between manual system tests, on average, we can assume it occurs halfway through the test interval. On this basis, we can assume the failed components remain failed for half the test interval, in which case the rate at which devices are returned to their functional state is 2/manual test interval (in years).

The final state equations can now be written in the form of a matrix so we can obtain the state probabilities P0 to P4, in particular P3 and P4, by matrix inversion. The inverse of a matrix is another matrix which, when multiplied by the original matrix, gives a matrix with values of 1 for all diagonal elements and 0 for all off-diagonal elements.

Multiplying both sides of the state matrix equation by the inverse isolates the column of state probabilities on the left side. On the right side, the product of the inverse and the column matrix (1, 0, 0, 0, 0) is equal to the first column of the inverse of the coefficient (P) matrix, so the values in that first column are the required fractional state probabilities. This means we need to calculate only the first column of the inverse matrix. However, since we can easily invert the P matrix using the MINVERSE function in a spreadsheet, we obtain the redundant values in the matrix without effort.

Although we can obtain the inverse matrix by hand calculation, this process can be arduous for all but the simplest of systems. The use of the matrix inversion function available in spreadsheets is not only quick and easy but avoids the errors so easily introduced in a lengthy hand calculation.
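For readers without a spreadsheet to hand, the same numbers can be obtained with a short script. The sketch below is not the spreadsheet method described above: instead of forming the full inverse, it solves the five modified state equations directly by Gauss-Jordan elimination, which gives the same values as the first column of the inverse because the right-hand side is (1, 0, 0, 0, 0). The helper `solve_linear` is a hypothetical name, and treating P3 + P4 as the fraction of time in a dangerous failed state is an interpretation of the state definitions.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Bring the largest remaining entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Clear this column in every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Rates per year, from the example values in the text.
lam_dd = 0.1      # detected failure rate per transmitter
lam_du = 0.01     # undetected failure rate per transmitter
mu = 8760 / 4     # repair rate for detected failures (4-hour repair)
mu_t = 2.0        # return rate for undetected failures (annual test)

leave = 2 * lam_dd + 2 * lam_du  # rate of a second failure from state 1 or 2

# Rows follow the five modified state equations, in the same order.
coefficients = [
    [1 - (3 * lam_dd + 3 * lam_du), 1 + mu, 1 + mu_t, 1 + mu, 1 + mu_t],
    [3 * lam_dd, -(leave + mu), 0.0, 0.0, 0.0],
    [3 * lam_du, 0.0, -(leave + mu_t), 0.0, 0.0],
    [0.0, 2 * lam_dd, 2 * lam_dd, -mu, 0.0],
    [0.0, 2 * lam_du, 2 * lam_du, 0.0, -mu_t],
]
rhs = [1.0, 0.0, 0.0, 0.0, 0.0]

p = solve_linear(coefficients, rhs)
pfd = p[3] + p[4]  # fraction of time in either system-failed state

for i, value in enumerate(p):
    print(f"P{i} = {value:.4e}")
print(f"P3 + P4 = {pfd:.2e}")
```

Changing `mu_t` in this sketch is a quick way to see how strongly the manual test interval influences the result compared with the repair time.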

Peter Morgan, P.Eng., is principal consultant of Control System Design Services Inc. His e-mail is morgan@controlinsight.com.

## Markov models

##### By William Goble and Harry Cheddie

You can effectively use a set of modeling tools based around Markov models to solve a wide variety of reliability and safety problems. These models work well because they describe stochastic processes: processes in which you cannot accurately predict outcomes, but you can obtain outcome probabilities. A Markov model of a safety instrumented system (SIS) is a memory-less system in which the probability of moving from one state to another depends only on the current state and not on the past history of how the system reached that state. This is the primary characteristic of a Markov model, which is well-suited to problems where a state naturally indicates the situation of interest. In some models (characteristic of reliability and safety models), a variable follows a sequence of states. These problems are called Markov chains.

Markov models can deal with complex issues found in the probabilistic modeling of reliability and safety. The models can show system success versus system failure.

Markov models can show redundancy with different levels of redundant components. Consider a system with two subsystems that requires only one of them for successful system operation. All failures are immediately recognized, and the repair probability is modeled as a constant.
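One way to picture such a model is as a discrete-time Markov chain with four states. The sketch below is an illustration only: the per-hour failure and repair probabilities are hypothetical, each failed unit is assumed to be repaired independently, and two events in the same hour are neglected.

```python
# Hypothetical per-hour probabilities, chosen only for illustration.
FAIL = 1e-4      # probability that a working unit fails in one hour
REPAIR = 0.125   # probability that a failed unit is repaired in one hour

# States: 0 = both units OK, 1 = unit A failed, 2 = unit B failed,
# 3 = both failed (system failure).
# Row = current state, column = next state; each row sums to 1.
T = [
    [1 - 2 * FAIL, FAIL, FAIL, 0.0],
    [REPAIR, 1 - REPAIR - FAIL, 0.0, FAIL],
    [REPAIR, 0.0, 1 - REPAIR - FAIL, FAIL],
    [0.0, REPAIR, REPAIR, 1 - 2 * REPAIR],
]


def step(p, t):
    """Advance the state probabilities by one time step: p_next = p * T."""
    n = len(p)
    return [sum(p[i] * t[i][j] for i in range(n)) for j in range(n)]


def steady_state(t, steps=2000):
    """Start with both units OK and iterate until the probabilities settle."""
    p = [1.0] + [0.0] * (len(t) - 1)
    for _ in range(steps):
        p = step(p, t)
    return p


p = steady_state(T)
for state, value in enumerate(p):
    print(f"state {state}: {value:.3e}")
print(f"system failed fraction = {p[3]:.3e}")
```

Because the next state depends only on the current row of `T`, the sketch has the memory-less property described earlier. States 1 and 2 end up with equal probability, which is what makes it possible to merge them when the two units are identical.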

If both units in a dual redundant system are identical (or close enough so that we do not care which one fails), a model like the one above can be simplified to show only the number of failed units in each state.

SOURCE: Safety Instrumented Systems Verification: Practical Probabilistic Calculations, 2005, ISA (www.isa.org/sisverify).