A postcard from the promised land of alarm management
We haven't seen a nuisance alarm or major alarm flood in months. The plant has never run so well. The weather is fine. Wish you were here.
- The monitoring system generates key performance and diagnostic metrics.
- The benchmark documents the current state of alarm management, including performance metrics and practices.
- The philosophy documents the expectations for managing the alarm system, including definitions and performance targets.
By Nicholas Sands and Donald Dunn
Once you have achieved the benefits of a well-managed alarm system, you wonder how you ever ran the plant without it. It is a paradigm shift within the culture of an operating plant, similar to a major lifestyle change, like losing 50 pounds or quitting smoking. This fundamental shift in an organization's culture is not easy and is rarely embraced by everyone, but this shift is required as the work processes continue. The benefits are real, quantifiable, and achievable. The journey from chaos to the "promised land" has been made by others, and you can make it there, too.
The benefit is improved plant performance through improved operational discipline, which means doing the right thing at the right time, every time. A well-managed alarm system will notify the operator at the right time, and only the right time, for a specific action. A well-trained operator, or an operator assisted with some guidance from a well-designed alarm system, will know the right action to take in response to the alarm. An operator, not overloaded with alarms, will take the right action at the right time to correct the process condition. This was clearly and concisely stated by Campbell Brown in his "Horses for Courses - A Vision for Alarm Management" paper: "The fundamental goal is that Alarm Systems will be designed, procured, and managed so as to deliver the right information, in the right way, and at the right time for action by the Control Room Operator (where possible) to avoid, and if not, to minimize, plant upset, asset, or environmental damage, and to improve safety."
Alarm management pioneers M. L. Bransby and J. Jenkinson reached a fundamental conclusion: "Poor performance costs money in lost production and plant damage and weakens a very important line of defense against hazards to people." Thus, improved operational discipline results in fewer incidents, increased plant reliability, reduced quality problems, reduced environmental excursions, and less equipment damage. To estimate the bottom-line benefits of improved alarm management, start with a list of these events in the last year and associate a cost with each event. Then, with input from different members of the plant, estimate how many of the events were preventable, or could at least have been mitigated, if you made a significant step change in the ability to indicate and respond to abnormal conditions. The result is the quantifiable benefit at stake.
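The estimate described above is simple arithmetic, and a short sketch may make it concrete. The event names, costs, and preventable fractions below are invented for illustration only; a real list would come from the plant's own incident records and the team's judgment.

```python
from dataclasses import dataclass


@dataclass
class Event:
    description: str
    cost: float                  # estimated cost of the event, in dollars
    preventable_fraction: float  # team estimate: 0.0 (not preventable) to 1.0


def benefit_at_stake(events):
    """Sum each event's cost, weighted by the share judged preventable."""
    return sum(e.cost * e.preventable_fraction for e in events)


# Hypothetical events from the last year (assumed values, not real data).
events = [
    Event("Unplanned unit trip", 400_000, 0.5),
    Event("Off-spec product batch", 100_000, 0.25),
    Event("Environmental excursion", 50_000, 1.0),
]

total = benefit_at_stake(events)
print(f"Estimated annual benefit at stake: ${total:,.0f}")
```

Using a fraction rather than a yes/no flag lets the team record events that could have been mitigated but not avoided outright.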
Often it takes only a few significant avoidable events to justify an alarm management improvement effort. It may be similar to a significant personal event, like a heart attack, that finally provides the motivation for a lifestyle change.
Before setting out on the journey, you should know it is not an easy path. Typically, it requires you to do things differently than you have in the past. Not everyone in the organization will embrace the changes. Some will want the results to change, but not want to change the practices that created the current chaos. This is insanity, according to a saying often attributed to Albert Einstein.
Here are a few things you might want for the journey:
- A leader - someone with the time and authority to organize the effort
- A GPS - a measurement system to tell you if you are on the right path
- A guidebook - a field guide to point out tricks and traps along the way
- A map - a guidance document, which you have to make yourself
- A guide - an experienced person who can help you get there more quickly
The leader is sometimes called an alarm management champion. He or she coordinates the effort and keeps the team moving forward. The leader should be appointed by management as an indication of its support. This is an important step. Management needs to provide the resources (time and money) to do the work. While an effort led from the plant floor can make an impact, successful efforts in changing the organization require management support. Management should review the progress and even take an active role at times.
The GPS is an alarm monitoring system that captures alarms as they are activated and generates the key performance metrics of the alarm system. The monitoring system will help you know your starting point and help show progress along the path of your alarm management journey. Some of the key performance metrics are:
- Average alarm rate per operator
- Percent time in alarm flood
- Number of out-of-service alarms
Average alarm rate is an important metric because it is related to the human factors considerations of how many alarms an operator can handle in a defined period of time. The rate is an indication of alarm load or overload. The average alarm rate, normalized per operator, is a good metric to track the progress in your journey. Often the starting point is very bad, perhaps thousands or tens of thousands of alarms per day. The recommended goal per ANSI/ISA-18.2-2009 is around 150 to 300 alarms per day, per operator.
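As a rough illustration of this metric, the sketch below counts alarms per operator and divides by the length of the monitoring period. The log format, a list of (timestamp, operator, tag) tuples, and the console and tag names are assumptions; a real monitoring system will have its own export format.

```python
from collections import Counter
from datetime import datetime, timedelta


def average_alarms_per_day(alarm_log, days):
    """Return {operator: average annunciated alarms per day}.

    alarm_log: iterable of (timestamp, operator, tag) tuples (assumed format).
    days: length of the monitoring period, in days.
    """
    counts = Counter(operator for _ts, operator, _tag in alarm_log)
    return {operator: n / days for operator, n in counts.items()}


# Synthetic two-day log for two hypothetical operator consoles.
start = datetime(2010, 1, 1)
alarm_log = (
    [(start + timedelta(minutes=3 * i), "console_A", "FI-101") for i in range(900)]
    + [(start + timedelta(minutes=7 * i), "console_B", "TI-202") for i in range(400)]
)

rates = average_alarms_per_day(alarm_log, days=2)
for operator, rate in sorted(rates.items()):
    status = "within" if rate <= 300 else "above"
    print(f"{operator}: {rate:.0f} alarms/day ({status} the 300/day upper guideline)")
```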
Key alarm performance metrics with recommended targets
| Alarm performance metric | Metric target value |
|---|---|
| Average alarm rate | Around 150 to 300 alarms per day, per operator |
| Percent time in alarm flood | Less than 1% |
Percent time in alarm flood is an important metric since average alarm rate does not tell the whole story. You may have long periods with few alarms and then overwhelm the operator with dozens of alarms at once, perhaps during a plant shutdown. In alarm floods, typically defined as periods with more than 10 alarms per operator in a 10-minute period, the operator is very likely to miss alarms. This was the case in the well-known incident at the Texaco refinery, Milford Haven, in 1994 where several hundred alarms were recorded in the last few minutes before the explosion. The recommended goal is less than 1% of the time in the alarm flood range.
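One simple way to approximate this metric is to divide the monitoring period into fixed 10-minute windows and count the windows with more than 10 alarms. The sketch below does that for a single operator. Fixed, non-overlapping windows are a simplifying assumption here; monitoring tools may instead track each flood from its start to its end.

```python
from collections import Counter
from datetime import datetime, timedelta


def percent_time_in_flood(timestamps, start, end,
                          window=timedelta(minutes=10), threshold=10):
    """Percent of fixed windows holding more than `threshold` alarms."""
    total_windows = int((end - start) / window)
    if total_windows == 0:
        return 0.0
    counts = Counter()
    for t in timestamps:
        index = (t - start) // window  # which window this alarm falls in
        if 0 <= index < total_windows:
            counts[index] += 1
    flood_windows = sum(1 for n in counts.values() if n > threshold)
    return 100.0 * flood_windows / total_windows


# Synthetic day for one operator: a burst of 12 alarms, then a few isolated ones.
start = datetime(2010, 1, 1)
end = start + timedelta(days=1)
burst = [start + timedelta(seconds=30 * i) for i in range(12)]
quiet = [start + timedelta(hours=2 + i) for i in range(5)]

pct = percent_time_in_flood(burst + quiet, start, end)
print(f"Time in flood: {pct:.2f}% (target: less than 1%)")
```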
The number of out-of-service alarms is an important metric that indicates the number of alarms that have been suppressed or hidden from the operator. The average alarm rate and percent time in flood would indicate perfect performance if all the alarms were placed out-of-service. A high number of out-of-service alarms indicates a potential problem, and possibly a safety culture problem.
Different control systems have different terms for placing an alarm out-of-service, which means manually turning off an alarm. Common terms include inhibit, disable, and hide. These terms describe a function in the control system and not the process used to control the function. The terms are not used in ISA-18.2, which uses the term suppression for the general function of preventing an alarm from indicating an abnormal situation to the operator. The term out-of-service is used to describe alarms that have been manually suppressed.
The guidebook is a book on alarm management that explains how to get through the various steps of the journey. ISA-18.2 is not a guidebook. It provides a common language for alarm management and describes important activities and requirements. It describes what should be done, but not how to go about doing it. A guidebook is much more useful for figuring out how to do things. There are several books available that are worth adding to your library, for example, Alarm Management: A Comprehensive Guide, 2nd Edition. In addition, it is always a good idea to get formal training.
With a leader, a monitoring tool, and a guidebook, you are ready to begin. An important step is to understand where you are starting from and where you want to go. By itself, the monitoring system cannot pinpoint your starting point. There is much more to alarm management than performance metrics. A benchmark, or initial audit, is a good way to identify the practices in place and working well, and the practices that might be missing or are not being followed. Some key areas to audit are how changes to the alarm system are managed, how operators are trained, and how temporarily placing alarms out-of-service is controlled.
The map is the alarm philosophy that documents where you want to go and some steps on how to get there. It defines how a site will address alarm management. While there are many common components, it is specific to a site and should address the elements of the current state that need to be changed. It includes key definitions, provides practices and procedures, and documents roles and responsibilities. It contains guidelines on how to classify and prioritize alarms, and how changes to alarms will be managed. It also establishes targets for key performance metrics, like the acceptable alarm load for the operator as measured by average alarm rate.
In developing the alarm philosophy, and in other steps of the journey, it may be useful to have a guide, a person who has taken the journey before. A consultant or an experienced internal resource might help you complete the journey more quickly and with fewer missteps. (Disclaimer: The authors are not alarm management consultants and are employed in the process industries. Note: We are not selling anything.) Choose your guide with care.
Now it is time to set off on the journey, and it can be a long haul. There are several tasks to complete, but they can be broken down many different ways according to site needs and resources. All alarms will need to be rationalized and documented. Problem alarms will need to be modified. Advanced alarming will likely be needed in certain cases. And all the changes will need to be managed and include operator training.
Rationalization involves reviewing and justifying potential alarms to ensure they meet the criteria for being an alarm as defined in the philosophy. It also involves defining the attributes of each alarm (such as limit, priority, classification, and type), as well as documenting the consequence, response time, and operator action. Although safety alarms generally tend to be some of the most critical in a plant, they still must go through the rationalization process. The product of rationalization is a list of configuration requirements recorded in the Master Alarm Database. Alarm classification and prioritization are extremely important parts of rationalization. They are not mutually exclusive or redundant. Classification is a tool for managing requirements. Prioritization is exclusively for the benefit of the operator.
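To make the output of rationalization more concrete, here is a minimal sketch of what one entry in a Master Alarm Database might hold, with a check for documentation left blank. The field names, the tag, and the values are hypothetical; the actual schema should follow the site's alarm philosophy.

```python
from dataclasses import dataclass


@dataclass
class AlarmRecord:
    tag: str
    alarm_type: str           # e.g., "high level"
    limit: float
    priority: str
    classification: str
    consequence: str          # what happens if the operator does not act
    response_time_min: float  # time available to respond, in minutes
    operator_action: str


def missing_fields(record):
    """Return the names of required text fields that were left blank."""
    required = ("consequence", "operator_action", "priority", "classification")
    return [name for name in required if not getattr(record, name).strip()]


# Hypothetical record with the operator action not yet documented.
record = AlarmRecord(
    tag="LI-100",
    alarm_type="high level",
    limit=85.0,
    priority="medium",
    classification="environmental",
    consequence="Tank overflow to containment",
    response_time_min=15,
    operator_action="",
)

print(missing_fields(record))
```

A check like this can flag alarms that were configured but never fully rationalized.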
Classification identifies groups of alarms with similar characteristics (e.g., environmental or safety) and common requirements for training, testing, documentation, or data retention. Aggregating alarms by requirements helps the organization more adequately allocate resources to ensure they are addressing the hazards within the facilities. It should be emphasized that the use of classification and/or classes is to be defined in the philosophy and is not for the benefit of the operator. Classification is done during rationalization using the consequences and how the need for the alarm was identified, for example, during a Layer of Protection Analysis (LOPA).
Alarm priority is typically determined based on the severity of the potential consequences and the time to respond. Most companies have a risk matrix that may be used for risk assessments, typically established by a corporate risk management group. If possible, the information in this risk matrix, consequence descriptions, and categories should be used as a basis for an alarm severity matrix. Three or four alarm priorities are recommended. Grouping alarms based on priority helps an operator adequately allocate their time to ensure they are addressing the hazards within the facilities.
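A priority lookup of this kind can be sketched as a small table keyed by consequence severity and time to respond. The severity names, time bands, and resulting priorities below are assumptions for illustration; a site would take them from its own risk matrix and alarm philosophy.

```python
def response_band(minutes_to_respond):
    """Bin the available response time (band edges are assumed values)."""
    if minutes_to_respond < 10:
        return "short"
    if minutes_to_respond <= 30:
        return "medium"
    return "long"


# Hypothetical matrix with three priorities: severity -> time band -> priority.
PRIORITY_MATRIX = {
    "minor":  {"long": "low",    "medium": "low",    "short": "medium"},
    "major":  {"long": "low",    "medium": "medium", "short": "high"},
    "severe": {"long": "medium", "medium": "high",   "short": "high"},
}


def alarm_priority(severity, minutes_to_respond):
    """Look up priority from consequence severity and time to respond."""
    return PRIORITY_MATRIX[severity][response_band(minutes_to_respond)]


print(alarm_priority("severe", 5))
print(alarm_priority("major", 20))
print(alarm_priority("minor", 60))
```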
Monitoring can be enhanced by using alarm class and alarm priority as filters. Reporting frequent alarms by alarm class, such as safety or LOPA-listed alarms, can identify potential hazards or poorly designed alarms. Frequent high-priority alarms indicate either a problem in alarm design or a frequent potential hazard, neither of which is desirable. Using class and priority in monitoring reports can help drive significant improvements very quickly.
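A report filtered by class or priority might look like the sketch below. It assumes each logged alarm is a dictionary with "tag", "class", and "priority" keys, which is an invented format, and the tags are hypothetical.

```python
from collections import Counter


def frequent_alarms(events, alarm_class=None, priority=None, top=3):
    """Most frequent tags, optionally filtered by class and/or priority."""
    selected = (
        e["tag"] for e in events
        if (alarm_class is None or e["class"] == alarm_class)
        and (priority is None or e["priority"] == priority)
    )
    return Counter(selected).most_common(top)


# Synthetic alarm events (assumed tags, classes, and priorities).
events = (
    [{"tag": "PAH-201", "class": "safety", "priority": "high"}] * 5
    + [{"tag": "TAH-305", "class": "safety", "priority": "medium"}] * 2
    + [{"tag": "LAL-110", "class": "environmental", "priority": "high"}] * 3
)

print(frequent_alarms(events, alarm_class="safety"))
print(frequent_alarms(events, priority="high"))
```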
The monitoring system can identify problem alarms with diagnostic metrics. Unlike performance metrics, which indicate the overall performance of the alarm system, diagnostic metrics point to specific alarms with specific issues. The most frequent alarms, stale alarms, and chattering alarms are just some of the diagnostics available. The diagnostic metrics are very powerful for making dramatic improvements in alarm system performance. If you have a monitoring system, it can be very tempting to start with making quick improvements before you have completed your benchmark. Resist the temptation. Document the current state before you begin making improvements.
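Two of these diagnostics can be sketched in a few lines. The thresholds used here, three or more activations within one minute for a chattering alarm and more than 24 hours continuously active for a stale alarm, are commonly used rules of thumb and are treated as assumptions; the site's philosophy should define the actual criteria. The input formats and tags are also hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def chattering_tags(activations, min_count=3, window=timedelta(minutes=1)):
    """Tags with at least `min_count` activations inside any `window` span.

    activations: iterable of (timestamp, tag) tuples (assumed format).
    """
    by_tag = defaultdict(list)
    for ts, tag in activations:
        by_tag[tag].append(ts)
    chattering = set()
    for tag, times in by_tag.items():
        times.sort()
        # Compare each activation with the one (min_count - 1) places later.
        for i in range(len(times) - min_count + 1):
            if times[i + min_count - 1] - times[i] <= window:
                chattering.add(tag)
                break
    return chattering


def stale_tags(active_since, now, limit=timedelta(hours=24)):
    """Tags that have stayed active longer than `limit`.

    active_since: {tag: time the alarm became, and has remained, active}.
    """
    return {tag for tag, since in active_since.items() if now - since > limit}


t0 = datetime(2010, 1, 1, 8, 0)
activations = [
    (t0, "FAL-12"),
    (t0 + timedelta(seconds=20), "FAL-12"),
    (t0 + timedelta(seconds=45), "FAL-12"),
    (t0, "TAH-7"),
    (t0 + timedelta(minutes=10), "TAH-7"),
]
active_since = {"LAH-3": t0 - timedelta(days=2), "PAL-9": t0 - timedelta(hours=1)}

print(chattering_tags(activations))
print(stale_tags(active_since, now=t0))
```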
Sometimes it takes advanced alarming (additional logic that modifies alarm attributes) to resolve alarm issues. In fact, it is almost always needed to reach the goal. A typical example is state-based alarming that automatically suppresses alarms on equipment when the equipment is not running. Advanced alarming logic should be designed with care so as not to introduce hazards.
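The state-based example can be sketched as follows: a low discharge pressure alarm on a pump is annunciated only while the pump is running. The pump, the limit, and the function names are hypothetical, and in practice this logic would live in the control system rather than in a script.

```python
from dataclasses import dataclass

LOW_PRESSURE_LIMIT = 2.0  # hypothetical alarm limit, in bar


@dataclass
class PumpState:
    running: bool
    discharge_pressure: float


def low_pressure_alarm(state):
    """Return (condition_present, annunciated) for the low pressure alarm."""
    condition = state.discharge_pressure < LOW_PRESSURE_LIMIT
    suppressed = not state.running  # state-based suppression rule
    return condition, condition and not suppressed


print(low_pressure_alarm(PumpState(running=False, discharge_pressure=0.0)))
print(low_pressure_alarm(PumpState(running=True, discharge_pressure=0.5)))
```

Returning the underlying condition alongside the annunciated state keeps the suppressed alarm visible for monitoring and review, which is one way to check that the logic is not hiding a real hazard.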
Over time, each alarm can be rationalized, each problem alarm can be addressed, and advanced alarming can be added. The steps take time. Each change should be managed to keep documentation and training up to date. These steps lead to improved alarm system performance and improved operational discipline, which results in improved plant performance.
The journey from alarm system chaos to the promised land of alarm system performance is not only possible, but it has been accomplished by many sites. The journey is not easy and takes preparation and work. There are some key steps in preparing for the journey: getting an alarm management leader, a measurement system, a reference book, an alarm philosophy, and perhaps a consultant. There is no quick fix to the problem of alarm management. D.V. Reising and T. Montgomery in their paper, "Achieving Effective Alarm System Performance," note: "There is no 'silver bullet' or 'one shot wonder' for good alarm management. The most successful sites will likely approach alarm management as an ongoing, continuous improvement activity, not unlike preventive maintenance or total quality management programs."
As part of the continuing development of ANSI/ISA-18.2-2009, a series of ISA18 technical reports (TRs) is being developed to help alarm management practitioners put the requirements and recommendations of ISA-18.2 into practice. If you are interested in contributing your knowledge and experience to the TR development effort, please contact ISA18 co-chairs Nicholas Sands or Donald Dunn.
ABOUT THE AUTHORS
Nicholas Sands (Nicholas.P.Sands@USA.dupont.com) is a process control engineer at DuPont. He has co-chaired the ISA18 committee since 2003 and currently serves as the vice president of the ISA Professional Development Department. Donald G. Dunn (Donald.Dunn@aramcoservices.com) leads a Consulting Engineering group at Aramco Services Company, which is a subsidiary of Saudi Aramco. He has co-chaired ISA18 since 2003 and will be vice president of the ISA Standards & Practices Department in 2011-2012.