June 2008

PLCs and safety PLCs: Lessons from pucker events

By J. Troy Martel

While we were replacing relay logic control with a programmable logic controller (PLC) based system, the question arose as to whether it is safe to use a regular PLC on a boiler. The answer is no.

All digital components within a PLC eventually fail. How the system responds and performs upon component failure is a function of the scope, degree, and quality of the operating system diagnostics.

It has been my observation that most PLCs are quite reliable, detect component failures, and are programmed to assume a fail-safe state resulting in shutdown of the process and perhaps loss of production and revenue.

Unfortunately, not all PLCs detect component failures correctly. The result can be a non-fail-safe state, which may lead to loss of assets, revenue, reputation, employment, and even life.

I have observed both fail-safe and fail-dangerous component failures in PLCs in safety and critical control applications during my career. All resulted in some sort of financial loss, but fortunately none resulted in loss of life.
I can assure you, however, each non-fail-safe event caused considerable concern, inducing a puckered stance.

My first pucker event occurred in 1976 during the development and validation of a general-purpose PLC system in a burner management application. A single chip failed, resulting in a loss of one fourth of the program memory, a kind of digital lobotomy.

The PLC did not assume a fail-safe state as described in the vendor-supplied documentation; instead, it continued to operate erratically. I fear what would have happened had this failure occurred several weeks later, when the furnace was operational.

In an effort to understand the failure and conduct a root-cause analysis, I visited the PLC vendor's engineering and manufacturing facilities. The chief engineer thoroughly described the diagnostic routines employed in the design. It turned out memory was tested only on startup, not during system operation.

He was visibly shaken when he understood we were proposing to use his product in a potentially life-critical application. He strongly discouraged its use in that application.
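
To make the distinction concrete, here is a minimal sketch, in C, of how a startup-only memory test can be extended into a continuous background diagnostic that re-checks a small slice of program memory on every scan cycle. The names and sizes (program_memory, go_failsafe, and the constants) are invented for illustration; they are not taken from any vendor's firmware.

    /* Sketch only: extending a startup memory test into a background test. */
    #include <stdint.h>
    #include <stddef.h>

    #define PROGRAM_WORDS   4096u   /* size of program memory (assumed)        */
    #define WORDS_PER_SCAN  64u     /* small slice re-checked each scan cycle  */

    extern const uint16_t program_memory[PROGRAM_WORDS]; /* user program image */
    extern void go_failsafe(void);  /* de-energize outputs, latch a fault flag */

    static uint32_t reference_sum;  /* checksum recorded at startup; a real    */
                                    /* design would store it with the program  */
    static uint32_t running_sum;    /* partial checksum built up across scans  */
    static size_t   cursor;         /* next word to include                    */

    /* Called once at power-up, exactly as a startup-only test would be. */
    void memory_test_init(void)
    {
        reference_sum = 0;
        for (size_t i = 0; i < PROGRAM_WORDS; i++)
            reference_sum += program_memory[i];
        running_sum = 0;
        cursor = 0;
    }

    /* Called every scan cycle, so a chip failure during operation is caught
       within a bounded number of scans instead of at the next power cycle. */
    void memory_test_background(void)
    {
        for (uint32_t n = 0; n < WORDS_PER_SCAN; n++) {
            running_sum += program_memory[cursor++];
            if (cursor == PROGRAM_WORDS) {        /* one full pass completed  */
                if (running_sum != reference_sum)
                    go_failsafe();                /* memory no longer matches */
                running_sum = 0;
                cursor = 0;
            }
        }
    }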

My second pucker event occurred while I was employed at a major petrochemical company.

Believing I had learned my lesson from the first fail-dangerous event, I selected a major PLC vendor, thoroughly reviewed and analyzed its system diagnostics at the engineering and manufacturing facilities, and designed a dual PLC architecture for use in multiple safety and critical control applications within a world-class ethylene plant.

During system development, verification, and validation of one of the applications, we noticed one of the two PLCs was operating differently. Although both were programmed identically and both monitored the same input devices, their outputs were not the same.

We transported the misbehaving PLC to the vendor's repair facility. The technician found, replaced, and discarded the failed digital component but made no effort to understand why the system diagnostic routines did not detect the failure and assume a fail-safe state.

Senior engineering management and I returned to the vendor's engineering office seeking an answer.

After review, the vendor engineers discovered system diagnostics did not adequately cover that specific chip. Diagnostic routines were modified to monitor this chip in future product releases.

However, the vendor did not offer to upgrade the other 20 PLCs in our safety applications, nor did it notify any other users of the flaw. It is only a matter of time before other users experience the same dangerous failure.

As a follow-up, the ethylene plant did experience another similar event several years later. Fortunately, the redundant PLC shielded the process from the failure.

I experienced one more pucker event when we photographed another general-purpose PLC in a safety-critical application for the company news rag.

We proudly opened all cabinet doors and the PLC enclosure. Although we had fully verified and validated operation of this system prior to commissioning, it failed the very first challenge approximately three weeks later.

Painstaking review and testing revealed that the photoflash had penetrated the window of the UV-PROM, garbling the instruction being accessed by the central processing unit (CPU).

The CPU promptly erased the shutdown sequence, which was contained in RAM. When challenged later, the system failed to activate the sequence.

During our discussions with this vendor, it was revealed that the PLC did not employ any instruction validation techniques, so the CPU accepted the garbled instruction.

Please note that the photoflash event simply illuminated the lack of memory validation. The corruption could just as easily have been initiated by an emergency shutdown, radio frequency interference, or some other external phenomenon.
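
For readers wondering what such instruction validation might look like, the sketch below (again in C) checks the opcode field of each fetched instruction word against the set of opcodes the interpreter actually defines before acting on it. The instruction format and opcode list are hypothetical, chosen only to illustrate the idea.

    /* Sketch only: reject fetched instruction words with undefined opcodes. */
    #include <stdint.h>
    #include <stdbool.h>

    extern void go_failsafe(void);  /* halt the program, de-energize outputs */

    enum {                          /* hypothetical ladder-instruction opcodes */
        OP_LOAD = 0x1, OP_AND   = 0x2, OP_OR  = 0x3,
        OP_OUT  = 0x4, OP_TIMER = 0x5, OP_END = 0x6
    };

    static bool opcode_is_valid(uint16_t word)
    {
        uint16_t opcode = word >> 12;      /* top 4 bits = opcode (assumed) */
        return opcode >= OP_LOAD && opcode <= OP_END;
    }

    /* Called for every word fetched from program memory. A word garbled by a
       photoflash, RFI, or a failing chip is rejected rather than executed. */
    void execute_instruction(uint16_t word)
    {
        if (!opcode_is_valid(word)) {
            go_failsafe();                 /* refuse to run a corrupted program */
            return;
        }
        /* ... decode and execute the validated instruction ... */
    }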

With these background experiences, we developed a methodology for testing all PLCs proposed for safety applications. We basically used fault insertion techniques, which simulate chip or component failures: address, control, and I/O bus faults; clock faults; and CPU and memory faults.

Prospective vendors cooperated in testing by supplying PC board/component layout drawings and details of PLC operation, including diagnostics.

We also inserted software faults, e.g., purposely making "accidental" mistakes during configuration and programming. The passing criterion was that, after insertion of a single hardware or software fault, the PLC would correctly diagnose the problem in a timely manner and take appropriate action.

If it was a non-redundant PLC, we expected the appropriate action to be a fail-safe state, with all discrete outputs off, analog outputs at minimum, and some sort of failure indicator activated.

If it was a redundant PLC design, we expected it to continue correct operation, activate some sort of failure indicator, and support on-line repair or replacement of the faulted component.
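
The sketch below, in C, shows the kind of pass/fail check this criterion implies for a non-redundant PLC: insert a single fault, then verify the controller flags it within a bounded number of scans and drives its outputs to the fail-safe state. The interface functions (plc_inject_fault, plc_scan, plc_read_outputs) are hypothetical stand-ins; the actual testing was performed against hardware using the vendors' board layout drawings.

    /* Sketch only: pass/fail check for a non-redundant PLC after one fault. */
    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_SCANS_TO_DETECT 10   /* "timely manner": detection deadline    */
    #define NUM_DISCRETE        8
    #define NUM_ANALOG          4

    typedef struct {
        bool     discrete[NUM_DISCRETE]; /* expected all off in fail-safe state */
        uint16_t analog[NUM_ANALOG];     /* expected at minimum in fail-safe    */
        bool     fault_indicator;        /* expected asserted                   */
    } plc_outputs_t;

    extern void plc_inject_fault(void);             /* e.g., stuck address line */
    extern void plc_scan(void);                     /* run one scan cycle       */
    extern void plc_read_outputs(plc_outputs_t *o); /* sample the output image  */

    /* Returns true if the PLC diagnosed the inserted fault in time and
       assumed the fail-safe state. */
    bool fault_insertion_passes(void)
    {
        plc_outputs_t out;

        plc_inject_fault();

        for (int scan = 0; scan < MAX_SCANS_TO_DETECT; scan++) {
            plc_scan();
            plc_read_outputs(&out);

            if (!out.fault_indicator)
                continue;                             /* not yet diagnosed      */

            for (int i = 0; i < NUM_DISCRETE; i++)
                if (out.discrete[i]) return false;    /* output left energized  */
            for (int i = 0; i < NUM_ANALOG; i++)
                if (out.analog[i] != 0) return false; /* analog not at minimum  */
            return true;                              /* diagnosed, fail-safe   */
        }
        return false;                                 /* fault never diagnosed  */
    }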

After we conducted fault insertion testing on a number of PLCs, it became amusing to see the vendors' reactions. All believed their PLC would pass the test and were quite surprised at the results. One major PLC lit up like a pinball machine, its outputs turning on and off randomly. Another set an alarm but misdiagnosed the fault and continued operating erratically. Another locked up when the scan time was set to a negative value.

In another, all programmed timers failed after the first network was erased from the program. In yet another, the numerical representation of an analog signal changed significantly when a fault was inserted into the analog input module.

We altered a single bit in the compiled program on the PC hard disk and loaded it into another PLC, which then operated erratically. It seems many vendors employ diagnostics but never actually validate their performance.
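
Catching that kind of single-bit alteration before download is not difficult. A simple integrity check such as the CRC-32 sketched below, recorded when the program is compiled and verified before it is loaded, would flag it; this is offered purely as an illustration, not as any vendor's actual mechanism.

    /* Sketch only: verify a compiled program image before downloading it. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    /* Standard reflected CRC-32 (polynomial 0xEDB88320), computed bit by bit. */
    static uint32_t crc32(const uint8_t *data, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];
            for (int b = 0; b < 8; b++)
                crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
        }
        return ~crc;
    }

    /* Refuse to load an image whose CRC no longer matches the value recorded
       at compile time; any single flipped bit changes the CRC. */
    bool image_ok_to_load(const uint8_t *image, size_t len, uint32_t expected_crc)
    {
        return crc32(image, len) == expected_crc;
    }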

The point of this message is to reiterate that PLCs do not always perform as expected upon component failure. If you plan to employ a general-purpose PLC in a safety application, I recommend the following:

  • Develop a relationship with vendor management, not just the sales force. Inform them of your plan, and obtain their concurrence to use the PLC in a safety application. You do not want the vendor giving testimony against you should an accident occur later. Inquire whether the PLC has experienced non-fail-safe events. If so, what action did the vendor take to preclude the event from occurring again?
  • Work with vendor engineers to thoroughly understand PLC design, configuration, programming, operation, and diagnostics. Configure and program your application in conformance with anticipated action upon component failure.
  • Perform hardware and software fault insertion testing to validate the vendor's documented performance upon component failure.
  • Establish a project quality/management plan for the development and implementation of the safety application. The plan should include documentation of what you plan to do; why you elected the plan; a list of components selected for the application; and why you selected them. The plan should describe how you will employ trained personnel to develop the design, configuration, programming, implementation, validation testing, operation, and periodic testing of the system. The plan should also require documentation describing the work actually performed, and verification against initial planning.

Finally, do not just wrap yourself and your application in some standard, believing that will be sufficient. All standards are the result of consensus, and as such, are the minimum requirements.

Ultimately, your decisions, guidance, and management will determine the safety of the application. Recognize you cannot make the right decision every time. However, you do have the ability to make a decision and then make it right.

ABOUT THE AUTHOR

J. Troy Martel (troy.martel@safeoperatingsystems.com) is a registered professional engineer (PE). He is an ISA senior member and president of Safe Operating Systems, Inc. in North Carolina.