By Bill Lydon
Troubleshooting control and automation systems is a fundamental skill that is valuable throughout your career, and the techniques and experience are transferable to all types of problems in other areas. The most obvious benefit is finding the problem and fixing it to get production running again when something goes wrong. The experience gained in troubleshooting and fixing problems also provides a wealth of knowledge for designing control and automation applications.
I know this firsthand, because at the beginning of my career I did a great deal of controls and automation system troubleshooting and gained invaluable knowledge and experience. Working with systems that have problems means you observe unusual patterns of data and operation that need to be considered when designing control and automation applications. For example, an engineer who understands the impact of a critical sensor failure can use that information to design an application that sets the process into a safe operating mode when the sensor value reflects a failure.
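As a minimal sketch of that design idea, the Python below treats a 4-20 mA transmitter reading that falls outside the band a healthy device can produce as a sensor failure and selects a safe mode. The function names and the 3.6 mA and 21.0 mA limits are illustrative assumptions, not values from any particular system; real limits come from the transmitter documentation and the plant's safety requirements.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    SAFE = "safe"


# Assumed, illustrative limits for a 4-20 mA loop. A reading at or beyond
# these values is treated as a failed sensor rather than a process value.
FAULT_LOW_MA = 3.6
FAULT_HIGH_MA = 21.0


def sensor_failed(current_ma: float) -> bool:
    """Return True when the loop current is outside the healthy band.

    The chained comparison is False for NaN, so a missing or garbled
    reading is also reported as a failure.
    """
    return not (FAULT_LOW_MA < current_ma < FAULT_HIGH_MA)


def select_mode(current_ma: float) -> Mode:
    """Choose the operating mode from a single reading."""
    return Mode.SAFE if sensor_failed(current_ma) else Mode.NORMAL


for reading in (12.0, 0.0, 22.5):
    print(reading, select_mode(reading).value)
```

A broken wire (0 mA) and an over-range signal both lead to the safe mode, while a mid-range reading leaves the process in normal operation.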
Ideally, troubleshooting is an orderly process of understanding the problem, identifying the cause or causes of the problem, and implementing solutions to return operations to normal. The problem to be addressed is defined by the difference between proper operation and the abnormal behavior observed. Once the cause is identified, the appropriate actions can be taken to either correct the issue or mitigate the effects. The latter is sometimes referred to as a "workaround."
The alternative to an orderly and systematic troubleshooting approach is often referred to as "shot gunning," that is, making a big mental leap without verifying the source of the problem and then taking an action like replacing a part. For example, a car will not start, and the troubleshooter assumes the cause is a worn-out battery and immediately replaces it, which does not fix the problem. The problem might be a faulty starter motor, starter switch, broken power cable, or something else.
There is a difference between "shot gunning" and an "educated guess." Without experience, replacing parts without diagnosis is like playing roulette in a casino; the odds are against you. Troubleshooters with a great deal of experience learn to recognize patterns and symptoms they have seen before and may at times replace a part without further diagnosis. Experienced troubleshooters typically use this strategy only when they believe there is a high probability it will fix the problem.
Controls and automation systems with problems are acting abnormally, not functioning as they were originally designed. Examples of things that can create problems include sensors with erroneous readings, loose electrical connections, network communication errors, electrical interference from newly installed equipment, overheated control cabinets, power supply problems, and power fluctuations.
Software-based systems have created some new dimensions that also have to be included in troubleshooting. When you consider all the lines of software code, there is a much higher level of complexity compared to hardware systems. In my experience, there are two major categories of software-related issues:
- Problem after a software update. For example, after an update the software does not recognize an unusual control program configuration that was not considered in the software update.
- Condition never considered in the software design. These kinds of problems typically occur when you make a change or addition to the system, creating a condition that was never accounted for in the original software design.
It is important to review the history of software updates when troubleshooting and search the update notes online for reported issues.
There are different troubleshooting philosophies, but these are common elements and a sequence to start the process:
Observe and gather information
As much as possible, observe and gather information, avoiding preconceived notions. The goal is to understand the current reality about what is happening. These are some typical questions:
- What are the symptoms?
- When did the problem first start?
- What else that is related to this control or automation has unusual data or behavior?
Ask people working in the area what they observed. What is different now compared to when things were working properly?
Software and firmware issues in systems have increased the complexity and potential problems created by system updates that can change the behavior of applications and controllers. In many systems, the information-gathering process should include the history of software and firmware updates.
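One way to put the update history to work is to list the updates applied shortly before the first symptom. The sketch below is only an illustration: the update records, the onset time, and the 14-day window are hypothetical, and a real history would come from the system's own change records.

```python
from datetime import datetime, timedelta


def updates_before_onset(updates, onset, window=timedelta(days=14)):
    """Return updates applied within `window` before the onset, newest first.

    `updates` is a list of (timestamp, description) tuples.
    """
    recent = [u for u in updates if onset - window <= u[0] <= onset]
    return sorted(recent, key=lambda u: u[0], reverse=True)


# Hypothetical update history for illustration only.
updates = [
    (datetime(2024, 1, 5, 9, 0), "HMI software 4.1 -> 4.2"),
    (datetime(2024, 3, 2, 14, 30), "Controller firmware 2.0 -> 2.1"),
    (datetime(2024, 3, 9, 8, 15), "Network switch configuration change"),
]
onset = datetime(2024, 3, 10, 23, 50)  # when the problem was first seen

suspects = updates_before_onset(updates, onset)
for when, what in suspects:
    print(when.isoformat(timespec="minutes"), what)
```

The result is a list of items to investigate, not a verdict. An update that landed just before the onset is worth checking, but it still has to be verified as the cause.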
Record the chronology of the problem and changes made to the system, including operating parameters, alarms, and alerts. Chronology is the science of arranging events in their order of occurrence. It helps provide an understanding of the problem in the context of the environment. As you proceed with troubleshooting a problem, record the chronology of steps and information gathered in a notebook or electronically with a tablet computer or smartphone. Review the chronology for clues about the problem.
Based on my experience, I cannot stress enough the value of making a chronological and data record as you proceed through troubleshooting. Early in my career I was taught by an experienced troubleshooter to carry 3x5 cards and record each step and data as I was troubleshooting automation systems. That was one of the most valuable lessons for troubleshooting.
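For those who keep the record electronically, the 3x5 card habit can be sketched as a small timestamped log. The class and field names here are assumptions made for illustration, and the entries are invented examples.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Entry:
    when: datetime
    kind: str  # for example "observation", "action", or "measurement"
    note: str


@dataclass
class Chronology:
    entries: list = field(default_factory=list)

    def record(self, kind, note, when=None):
        """Add an entry, stamped with the current time unless `when` is given."""
        entry = Entry(when or datetime.now(), kind, note)
        self.entries.append(entry)
        return entry

    def timeline(self):
        """Return all entries in order of occurrence."""
        return sorted(self.entries, key=lambda e: e.when)

    def search(self, text):
        """Return entries, in order, whose note mentions `text`."""
        text = text.lower()
        return [e for e in self.timeline() if text in e.note.lower()]


# Invented entries for illustration only.
log = Chronology()
log.record("observation", "Line stopped with network fault alarm",
           datetime(2024, 3, 10, 23, 52))
log.record("action", "Reseated network cable at station 4",
           datetime(2024, 3, 11, 0, 20))
log.record("observation", "Operator reports fault also seen the previous night",
           datetime(2024, 3, 9, 23, 45))

for e in log.timeline():
    print(e.when.isoformat(timespec="minutes"), e.kind, e.note)
```

Sorting by time matters because information often arrives out of order; here the operator's report about the previous night was recorded last but belongs first in the timeline.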
Identify root causes
Following a logical path of reasoning based on symptoms can solve many problems, but there are other problems that are more difficult. Sometimes after observing symptoms, the root causes are obvious. With more complex control and automation, however, it takes more effort to identify root causes. There may be more than one component (e.g., sensor, actuator, relay, power supply, or network communications) that is contributing to the problem. Today this is complicated by software, firmware, network configuration, and potential cybersecurity problems.
Identifying root causes may follow a logical path of reasoning based on the observable symptoms. In my experience many problems fall in this category, but with greater system complexity there are increasingly bigger troubleshooting challenges.
Finding the source of the problem is the detective work of troubleshooting. In complex systems where there can be multiple things contributing to a problem, it is advisable to change one thing at a time and check if that solves the problem. Steps taken should be added to your chronology notes.
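The discipline of changing one thing at a time can be sketched as a loop that applies a single change, checks whether the problem is gone, and restores the original state before trying the next change. Everything in this example is hypothetical: the simulated `state`, the candidate changes, and the `problem_present` check stand in for real actions and observations on a real system.

```python
def isolate_fix(candidates, problem_present):
    """Try candidate changes one at a time.

    `candidates` is a list of (name, apply, revert) tuples. Returns the name
    of the first change that clears the problem (or None) and a list of
    (name, fixed) trial results for the chronology notes.
    """
    trials = []
    for name, apply, revert in candidates:
        apply()
        fixed = not problem_present()
        trials.append((name, fixed))
        if fixed:
            return name, trials
        revert()  # restore the system so the next trial changes only one thing
    return None, trials


# Simulated system for illustration: the real fault is a loose cable.
state = {"firmware": "2.1", "gain": 1.0, "cable_tight": False}


def problem_present():
    return not state["cable_tight"]


candidates = [
    ("roll back firmware",
     lambda: state.update(firmware="2.0"), lambda: state.update(firmware="2.1")),
    ("reduce loop gain",
     lambda: state.update(gain=0.5), lambda: state.update(gain=1.0)),
    ("reseat network cable",
     lambda: state.update(cable_tight=True), lambda: state.update(cable_tight=False)),
]

fix, trials = isolate_fix(candidates, problem_present)
print(fix)
print(trials)
```

Reverting each unsuccessful change keeps the trials independent, so when the problem finally clears, only one change can be responsible.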
It pays to keep an open mind and think outside the box. There may be ways to find the problem that are not obvious.
Early in my career, while working with machine tool-based controls, I encountered a perplexing problem with integrated circuits that overheated on control boards. When I opened the cabinet, the problem would go away. I wrestled with this for quite a while. Then I talked to a very experienced troubleshooter I worked with, and he reached into his bag of tools and pulled out a hairdryer! I created cardboard baffles we used to partition off parts of this large circuit board and first heated one half of the board, then the other, to find the area failing under heat. We simply divided the problem area in half and did the procedure again, continually narrowing the focus. Neat method!
I had a similar problem a couple of years later with a minicomputer, and I used my troubleshooting hairdryer to isolate the problem and fix it. Always keep in mind that troubleshooting is finding an abnormal operating condition that can be caused by a wide range of things.
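The hairdryer procedure is a halving search. As a rough sketch, assume exactly one region of the board is heat sensitive and that heating a group of regions reproduces the failure only if that region is in the group. The region labels and the `fails_when_heated` check below are hypothetical stand-ins for the physical test.

```python
def find_faulty_region(regions, fails_when_heated):
    """Narrow down a single heat-sensitive region by repeated halving.

    `fails_when_heated(group)` must return True when heating `group`
    reproduces the failure. Returns (region, number_of_heating_tests).
    """
    candidates = list(regions)
    if not candidates:
        raise ValueError("at least one region is required")
    tests = 0
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        tests += 1
        if fails_when_heated(first_half):
            candidates = first_half        # fault is in the heated half
        else:
            candidates = candidates[mid:]  # fault is in the other half
    return candidates[0], tests


# Simulated board with 16 regions; "U11" is the heat-sensitive one.
regions = [f"U{i}" for i in range(1, 17)]
bad = "U11"

region, tests = find_faulty_region(regions, lambda group: bad in group)
print(region, tests)
```

With 16 regions the search needs four heating tests instead of up to 15 one-by-one checks. If more than one region can fail, or the failure is intermittent, the assumption behind the halving no longer holds and each result needs to be repeated or cross-checked.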
The most valuable discipline when troubleshooting systems is not making assumptions.
In another example years later, I was involved in one of the early industrial network installations of DeviceNet on a liquid crystal display production line. Our company provided the control software that was running the production line. After the line was commissioned and running, there was an intermittent problem with the automation that occurred periodically around midnight.
Because this was an important new installation, every vendor involved was on site trying to troubleshoot the problem, bringing in sophisticated network analyzers and other equipment. No definitive root cause was established, but suppliers made changes based on hunches to fix the problem. The problem persisted.
One night I went in with the plant operations person to watch the operation of the system. When the cleaning person was working in the area, the fault occurred. We quickly found the root cause. The network was implemented with cables interconnected using round screw-type quick connectors. We noticed the cleaning person bumped cable connectors when using a broom to sweep underneath the assembly line. We found a cable connector that was not tightly screwed together, and tightening it solved the problem!
We all made the faulty assumption in the beginning that since this was a new high-tech production line, the problem had to be complex. Everyone had a complex theory about the problem and tried to confirm it with sophisticated test equipment.
Based on the urgency of the problem, it might be necessary to implement a temporary workaround solution to restore operations to some level. A workaround is typically used when there is a special circumstance, such as a lack of parts to fix the problem immediately and properly. Generally, the controller or automation will not perform up to normal specifications using the workaround. Workarounds on complex systems have to be done considering the implications so you do not create unstable or unsafe operations. For example, bypassing a faulty safety switch to keep the process running would not be an appropriate workaround.
Joseph Alford, consultant, Automation Consulting Services, has more than 35 years of experience and is a highly active ISA member. He shared his thoughts on process troubleshooting:
"One of the most important traits that a process operator can have is the ability to quickly and accurately diagnose process upsets and respond accordingly. Chances are that, for new processes/plants, various process abnormal situation analysis techniques were used by engineers and scientists in developing the process. These may have included FMEA [failure modes and effects analysis] or perhaps "rationalization" exercises as part of specifying alarm parameters, and may have resulted in the creation of graphics, such as fault trees or fishbone "cause-effect" diagrams (also known as Ishikawa diagrams) and content in an alarm management database. Regardless, some thought and documentation regarding process failure modes was undoubtedly pursued in developing a manufacturing process.
"There are several challenges in the pursuit of effective process troubleshooting. One challenge is that there are usually several possible causes to a given process upset (e.g., an abnormal tank pressure reading may be due to a pressure sensor failure, seal or gasket failure, relief or control valve failure, or out-of-control exothermic reaction). Many of the possible causes will not be immediately detectable with relevant sensors, so cannot be automatically reduced to a single probable cause. So, to help operators with manual troubleshooting, what is useful is an online callable list of the possible causes and some indication of the probability of their occurrence and/or priority in checking them out, that is, what root cause should the operator check first?
"A second challenge is the need to make process troubleshooting information as quickly and easily available to operators as possible (i.e., time is money, and time delays in troubleshooting and responding to process upsets will often result in an escalation in the severity of the process upset). So, things like fault trees, fishbone diagrams, or alarm rationalization databases should not remain as items in hard documents but should be distilled into useful information for operators and made available online as part of the human-machine interface in systems they routinely use (e.g., process control computers)."
Withhold judgment until you have gathered information; do not jump to conclusions.
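Alford's suggestion of an online callable list of possible causes, with some indication of which to check first, might look something like the sketch below. The causes are taken from his tank pressure example, but the probability and priority numbers are hypothetical placeholders, and ordering by priority first and probability second is just one reasonable choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cause:
    description: str
    probability: float  # hypothetical estimate of how often this is the cause
    priority: int       # 1 = check first, for example because of severity


# Hypothetical numbers for illustration only.
CAUSES = {
    "abnormal tank pressure": [
        Cause("pressure sensor failure", 0.40, 2),
        Cause("seal or gasket failure", 0.25, 3),
        Cause("relief or control valve failure", 0.25, 2),
        Cause("out-of-control exothermic reaction", 0.10, 1),
    ],
}


def checklist(upset):
    """Return possible causes for an upset in the order to check them.

    Lower priority numbers come first; ties go to the more probable cause.
    Unknown upsets return an empty list.
    """
    causes = CAUSES.get(upset, [])
    return sorted(causes, key=lambda c: (c.priority, -c.probability))


for step, cause in enumerate(checklist("abnormal tank pressure"), start=1):
    print(step, cause.description)
```

In this example the runaway reaction is checked first even though it is the least likely cause, because its assumed priority reflects the consequences of missing it.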
Successfully troubleshooting and solving a problem can be immensely rewarding. One phrase that I found helpful when troubleshooting was from my AC/DC fundamentals professor in college. He started every class by stating, "Where does the reasoning begin?"