By Paul Gruhn, PE, CFSE
Would you rather learn from the mistakes of others, or make them all yourself? While we may learn better from our own mistakes, the risks to everyone involved can be very high, so learning from others is important. Granted, not everyone likes to reveal or admit their mistakes. However, when human error or other problems lead to serious disasters at places like Flixborough, Chernobyl, Bhopal, and Texas City, we need to know why so we don’t repeat the same mistakes (and potentially kill thousands of people in the process). Of course, not all mistakes have such high visibility or impact, but they are still extremely important.
The following stories are meant to be both educational and entertaining. They involve human errors, hardware failures, inadequate training, poor planning, management of change, and more. All stories took place at various plant sites using process control and safety systems. The stories are true but do not include company names; the names of individuals have been changed. There are many lessons to be learned.
Working on Systems Live Comes with a Risk
A Falling Meter Shuts Down a Unit
One end user wanted a safety system retrofit done live. Albert had been successfully upgrading the system all day. During the upgrade, the system presented false alarms to the operator, but he had been forewarned and had grown used to them. Albert noticed that one of the ground fault detectors indicated a problem and that the system wasn’t responding properly. He started checking field wiring with a meter perched on a wire duct. As he bent down to check another circuit, one of the leads from the meter got caught under his knee. The meter was pulled off the wire duct, and it bounced off his back onto a group of circuit breakers. As the circuit breakers tripped, Albert heard output relays in the panel tripping out. He went to inform the operator (who was on the telephone at the time) and said, “I think I’ve shut you down.” The operator thought Albert was joking (as he’d been getting alarms from the system all day) and wouldn’t believe him. When Albert pointed to a video monitor of the flare (which was now showing a very large flame) and said, “Look!”, it was obvious that the unit had indeed shut down. At least it was possible to start the unit relatively quickly. “Oops, sorry!”
Redundant Systems Can Fail
A safety system was believed to have caused several unexpected plant trips, and Chet was called to the site to investigate. Another vendor had done a power survey and reported to the end user that there were issues with the power on the safety system. Chet looked at the system and didn’t believe there were any power problems. He had found and corrected most of the causes for the various shutdowns. Some were software and configuration-related. (The system had been incorrectly configured by nonvendor personnel who had not been through an official training class.)
Plant management requested, however, that Chet make sure the power was okay. Examining signals with a scope did not reveal anything. He connected a power controller (i.e., a device to monitor and set voltage for individual power supplies) to the back of the power system to investigate further. He was trying to use the selector switch on the controller to look at the supply in question, but for some unknown reason, he couldn’t access it.
While he was doing all this in the rack room, he heard an alarm go off in the control room. Chet stopped what he was doing. The phone rang in the rack room and someone in the group answered it. After speaking for a moment, the person in the rack room put the phone down and said, “He hung up.” The operator, who had been on the other end of the phone, realized how bad things were at that point and didn’t have time to talk or ask whether the people in the rack room had caused the problem he was dealing with.
Just then Chet noticed that the cooling fans on the system had stopped. At that point, he realized that he had completely killed all the 24-volt power to the system. (This should not have been possible in a system with completely redundant power, but it happened anyway.) Chet’s heart sank. He quickly unplugged the power controller. There was an arc between the connectors and a large black mark appeared on the power controller. The fans started turning. He went around to the front of the panel to find that there was indeed power, but the system had not rebooted completely. Chet rebooted the system in the correct sequence and got it up and running again. “Oops, sorry!”
Chet went into the control room and saw a veritable sea of paper on the floor. The distributed control system (DCS) alarm printer was spewing out paper at an unbelievable rate. Chet said he had been in many shutdown situations before, but he had never heard as much noise as the flare made when this plant shut down.
Big Red Pushbuttons Don’t Always Have the Same Function
Edward worked on the supervisory control and data acquisition (SCADA) system and high-power electrical substations for a rapid transit system in a major city. Manual shutdown buttons were installed both inside and outside the substation so personnel could shut everything down during an emergency.
The SCADA system was undergoing commissioning and the trains were running during the process. A young apprentice was working for the electricians and this was the first time he had worked in the substation, although he had received training. The apprentice was told to go outside to one of the trucks to get some tools. When he walked up to the door to leave, the apprentice assumed the big red button was simply there to open the door (as is quite common in many other areas). Unfortunately, it was one of the manual shutdown buttons and it shut off the power in the substation. The electricians were startled when all the breakers tripped (they make quite a noise). They walked outside and saw a train full of passengers coming to an unplanned stop due to the loss of power. “Oops, sorry!”
The “fix” was to move the shutdown button further away from the exit door and install a cover over it.
Poor Separation of Control and Safety
Improper Interlock Design and Poor Management of Change
A process vessel ruptured because a hardwired interlock was wired into the control system by mistake. People realized that critical interlocks should be handled by the safety system, not the control system; however, there was a time crunch to meet the start-up deadline. Rather than correct the error and wire the interlock directly to the safety system, the contractor sent the signal over a serial communications link from the control system to the safety system. A communication problem on that link then resulted in data not being transferred, and the vessel ruptured.
A review determined that the same design error existed on the second process train. The wiring was corrected on the second train by one group of people, but others (on the next shift) noticed that the corrected wiring didn’t match the original design drawings, so they changed it back. Luckily, the first group noticed the change during their pre-start-up checks the following morning.
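The failure mode in the first train can be illustrated with a minimal sketch. The source does not describe how the safety logic handled the missing data, so everything here is an assumption made for illustration: the setpoint, the age limit, and the function names are hypothetical. The sketch contrasts logic that keeps trusting the last value received over a link with logic that treats stale data as a demand to trip.

```python
from dataclasses import dataclass

# Hypothetical numbers and names, for illustration only.
HIGH_PRESSURE_TRIP = 10.0   # assumed trip setpoint (arbitrary units)
MAX_DATA_AGE_S = 2.0        # assumed limit on how old a linked value may be


@dataclass
class LinkedValue:
    """Last value received over the serial link, with its receive time."""
    pressure: float
    received_at: float  # seconds


def trip_hold_last_value(sample: LinkedValue, now: float) -> bool:
    """Naive logic: trusts the last received value, however old it is."""
    return sample.pressure >= HIGH_PRESSURE_TRIP


def trip_fail_safe(sample: LinkedValue, now: float) -> bool:
    """Fail-safe logic: stale data is treated as a trip demand."""
    stale = (now - sample.received_at) > MAX_DATA_AGE_S
    return stale or sample.pressure >= HIGH_PRESSURE_TRIP


# The link delivered a normal reading at t = 0 s and then stopped updating.
last_good = LinkedValue(pressure=6.0, received_at=0.0)
now = 30.0  # the real pressure may have changed; the safety system cannot see it

print(trip_hold_last_value(last_good, now))  # False
print(trip_fail_safe(last_good, now))        # True
```

Even with a staleness check, the signal still depends on the control system and the link, which is the dependency the story warns about. Wiring the interlock directly to the safety system removes it.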
Lessons Learned
- Unplanned events, or combinations of events, can always occur (Murphy is alive and well). Multiple independent layers of defense will hopefully prevent any single failure from having a catastrophic impact.
- Factory people make mistakes, but those mistakes are usually less serious than the ones unqualified and/or untrained people make.
- Redundant systems still fail.
- Follow management of change procedures.
- Keep your drawings up to date.
- Pay attention to details.
Common contributing factors in these stories:
- Working on systems live.
- Poor implementation of the requirements.
- Changes in the field without realizing how the original implementation was achieved.
- Trying to get off the job as fast as possible.