Security process hazard analysis review
Determining security level requirements
By Edward M. Marszal, PE
Managing the risk of hazardous process plants is a difficult and resource-intensive activity. In order to reduce costs and improve productivity as technology evolves, process plants employ new equipment and techniques that introduce new hazards. Over the past few decades, the process industries have almost entirely shifted from control systems that were either analog electronic or pneumatic to distributed control systems (DCSs) and programmable logic controllers. These computer-based systems made quantum leaps in functionality over their analog counterparts with respect to calculation complexity, data storage, and communication, but introduced a new threat—deliberate and malicious cyberattacks.
At this point, most process industry plants have not implemented much cybersecurity on their industrial control systems (ICSs), leaving perimeter guarding to the discretion of their information technology departments. Even so, cyberattacks rarely cause physical damage to process plants. Process engineers have safeguarded their plants against failures that can cause significant safety consequences, and this is true whether the failure occurs organically through random hardware failure or deliberately through a cyberattack. The safeguards employed by these process engineers are common, inexpensive, and very often inherently safe against cyberattack, because most of these devices were invented dozens or hundreds of years before the advent of the computer.
Even though cyberthreats are not adequately addressed with existing process hazard analysis (PHA) methods, there is no reason to abandon everything that we know about process risk assessment and start from scratch. Instead, industry is extending tried-and-true methodologies for PHA to address the problem of deliberate cyberattacks. By doing so, none of the existing PHA effort is wasted or needlessly duplicated. Instead, a small amount of additional effort is utilized by starting with traditional PHA and focusing only on scenarios where cyberattacks are the cause or scenarios where cyberattacks can prevent all the safeguards from operating properly. It is these key scenarios that will generate recommendations to implement safeguards that are inherently safe against cyberattack, or to define the appropriate level of safeguarding from cyberattack, as defined by a security level (SL).
Security levels are categories that define a set of policies, procedures, and practices that must be implemented to secure an industrial control system zone. Unlike the quantitative safety integrity level (SIL) defined in the IEC 61511/ISA-84 standard for safety instrumented functions, which is a band of average probability of failure on demand, an SL is a set of qualitative requirements that explain how a system should be designed and operated. IEC/ISA-62443 defines four security levels, one through four, with SL 1 being the least secure and SL 4 being the most secure. The levels are defined (in the abstract) as:
- SL 1: Prevents the unauthorized disclosure of information via eavesdropping or casual exposure
- SL 2: Prevents the unauthorized disclosure of information to an entity actively searching for it using simple means with low resources, generic skills, and low motivation
- SL 3: Prevents the unauthorized disclosure of information to an entity actively searching for it using sophisticated means with moderate resources, skills specific to industrial automation and control systems (IACSs), and moderate motivation
- SL 4: Prevents the unauthorized disclosure of information to an entity actively searching for it using sophisticated means with extended resources, IACS-specific skills, and high motivation
The above definitions of SL are quite philosophical, providing few concrete design specifications. Much more information is required to fully understand the differences in design practices between the various SLs. So much so, in fact, that an entire document in the IEC 62443 standard set is dedicated to explaining the differences between the various security levels: IEC 62443-3-3. Selecting an SL for each ICS zone provides a set of requirements to implement in subsequent cybersecurity life-cycle steps.
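The contrast between a quantitative SIL and a qualitative SL can be sketched in a few lines. The SIL bands below are the low-demand-mode ranges of average probability of failure on demand from IEC 61508/61511. The names `SecurityLevel` and `sil_from_pfd_avg` are illustrative only and do not come from either standard.

```python
from enum import IntEnum
from typing import Optional


class SecurityLevel(IntEnum):
    """IEC/ISA-62443 security levels: ordered categories, not probabilities."""
    SL1 = 1
    SL2 = 2
    SL3 = 3
    SL4 = 4


def sil_from_pfd_avg(pfd_avg: float) -> Optional[int]:
    """Return the SIL whose low-demand PFDavg band contains pfd_avg.

    Returns None when the value falls outside all four bands.
    """
    # (SIL, lower bound inclusive, upper bound exclusive)
    bands = [
        (1, 1e-2, 1e-1),
        (2, 1e-3, 1e-2),
        (3, 1e-4, 1e-3),
        (4, 1e-5, 1e-4),
    ]
    for sil, low, high in bands:
        if low <= pfd_avg < high:
            return sil
    return None


print(sil_from_pfd_avg(0.005))                       # 2
print(SecurityLevel.SL2 < SecurityLevel.SL4)         # True
```

A SIL can be looked up from a calculated number, whereas an SL can only be compared and ranked. What each SL requires in practice has to be read from the requirements in IEC 62443-3-3.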
The SPR study
The security PHA review (SPR, pronounced “spur”) study is an evolution of PHA. It assigns performance targets to ICS cybersecurity and makes recommendations to implement safeguards that are inherently safe against cyberattack in lieu of setting high SL targets. The SPR approach was specifically developed to fit more naturally with the normal project life cycle of the design, implementation, and operation of process industry plants while also leveraging existing engineering tasks and reports generated for general process safety. In this way, the limitations of existing cyberrisk analysis approaches can be eliminated while maximizing the use of information and documentation generated in other stages of the engineering life cycle.
The SPR study (figure 1) is specifically designed to generate the required SL using the existing process hazard analysis as the foundation and starting point. The SPR process allows companies to select the SL of an ICS zone in a manner that is analogous to the way that layer of protection analysis allows them to select SIL targets for safety instrumented functions (SIFs).
Figure 1. Simplified security PHA review process
The process begins with the collection of the results of a process hazard analysis. This can be done either with the report of an existing PHA or as an additional step while a PHA study is in progress. Each scenario of the PHA is then reviewed to determine if it is “hackable,” which means that the scenario could be forced to occur by a malevolent actor who has taken control of the ICS. First, the cause or initiating event is reviewed to determine if it can be hacked. Generally, this would be true for any computer control loop failure or equipment item starting or stopping. It would not be true for human interactions with mechanical process equipment that is not connected to a computer. If the cause cannot be hacked, the analyst moves to the next scenario.
Next, the safeguards are reviewed to determine if they can be hacked. In general, all control loops, safety instrumented system functions, and operator responses to alarms are hackable, but mechanical devices such as relief valves are not. If any one of the safeguards cannot be hacked, the analyst moves on to the next scenario.
If the cause of a scenario and all of the safeguards can be hacked, then the overall scenario is determined to be hackable. This means that if a malevolent actor could take control of the ICS, that person would be able to generate the scenario under consideration and realize its consequence. For each hackable scenario, the consequence category from the PHA needs to be determined. Based on the risk tolerance criteria of the process owner, an IEC/ISA SL would then be assigned to that scenario. Of course, if the consequence is severe and leads to an undesirably high SL, the analysis team has the option of recommending a safeguard that is inherently safe against cyberattack. This would remove the scenario from consideration as a driver of the selected SL. After all the scenarios have been reviewed in this way, the SL assigned to a zone is the highest of the SLs assigned to the scenarios associated with the ICS equipment of that zone.
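The screening logic described above can be sketched as follows. This is a minimal illustration, not a tool prescribed by the SPR method: the class and function names are hypothetical, and the consequence-to-SL table is supplied by the caller because it depends on each owner's risk tolerance criteria.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Safeguard:
    name: str
    hackable: bool


@dataclass
class Scenario:
    cause: str
    cause_hackable: bool
    consequence: str  # consequence category taken from the PHA
    safeguards: List[Safeguard] = field(default_factory=list)


def is_hackable(s: Scenario) -> bool:
    """A scenario is hackable only if the cause AND every safeguard can be hacked."""
    return s.cause_hackable and all(g.hackable for g in s.safeguards)


def scenario_sl(s: Scenario, sl_by_consequence: Dict[str, int]) -> Optional[int]:
    """SL driven by one scenario, or None if it places no requirement on the zone."""
    if not is_hackable(s):
        return None
    return sl_by_consequence[s.consequence]


def zone_sl(scenarios: List[Scenario], sl_by_consequence: Dict[str, int]) -> Optional[int]:
    """The zone SL is the highest SL assigned to any of its scenarios."""
    levels = [scenario_sl(s, sl_by_consequence) for s in scenarios]
    assigned = [sl for sl in levels if sl is not None]
    return max(assigned) if assigned else None
```

One safeguard with `hackable=False` is enough to make `scenario_sl` return `None`. This is why recommending a single inherently safe safeguard removes a scenario as a driver of the zone SL.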
Process facilities are systematically assessed to determine what hazard scenarios could occur that could cause a significant consequence. For each of these scenarios, analysts assess the available safeguards to determine if they are adequate. This exercise is called a process hazard analysis. PHAs are performed using a variety of techniques. The most common and comprehensive technique is the hazard and operability (HAZOP) study. In a HAZOP study, analysts divide a facility into “nodes” of similar operating conditions and walk each node through a set of deviations, such as high pressure, low temperature, or reverse flow. For each of these deviations, a multidisciplinary team (e.g., operations, safety, and engineering) determines if there is a cause that could take the process beyond safe operating limits. If so, the team determines the consequence if the deviation were to occur, and then lists all the safeguards that are available to prevent that deviation from occurring, or at least from escalating to the point where damage can occur. An example HAZOP worksheet is shown in figure 2.
Figure 2. Sample HAZOP worksheet
When a HAZOP is performed, a team of engineers looks at virtually every failure that can possibly occur and ensures that there are appropriate safeguards to protect against each one. If the team determines the degree of safeguarding is inadequate, it will recommend adding new protection layers or making modifications to improve existing safeguards. Using this process, virtually any process deviation that can be conceived is analyzed.
Although this process systematically and thoroughly assesses potential hazard scenarios, it currently does not make absolutely certain that a plant is inherently safe against cyberattack. The hazard scenarios are assessed to determine if safeguards are appropriate, but there is typically no additional consideration that the safeguards could all have been disabled by malicious attacks. Closing that gap is the purpose of the SPR study.
The process industries commonly employ a number of safeguards that are inherently safe against cyberattack. One of these safeguards can be employed to protect a process plant against virtually any conceivable cyberattack. The real work of protecting process industry plants against cyberattack vectors that can cause large amounts of physical damage is to make the process for selecting and installing these safeguards thorough and systematic. Where they are not installed and the plant is vulnerable to a cyberattack, engineers should define an appropriate SL.
The common process industry safeguards that are inherently safe against cyberattack include:
- pressure relief devices
- mechanical overspeed trips
- check valves
- motor monitoring devices
- instrument loop current monitor relays
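As a rough aid for the safeguard review, the list above can be combined with the general rule given earlier (control loops, safety instrumented system functions, and operator responses to alarms are hackable). The lookup below is a hypothetical sketch. The category strings and the "needs review" fallback are assumptions made for illustration, not an exhaustive taxonomy.

```python
# Safeguard kinds the article lists as inherently safe against cyberattack.
INHERENTLY_SAFE = frozenset({
    "pressure relief device",
    "mechanical overspeed trip",
    "check valve",
    "motor monitoring device",
    "instrument loop current monitor relay",
})

# Safeguard kinds the article treats as hackable in general.
HACKABLE = frozenset({
    "control loop",
    "safety instrumented function",
    "operator response to alarm",
})


def classify(kind: str) -> str:
    """Return 'inherently safe', 'hackable', or 'needs review' for a safeguard kind."""
    key = kind.strip().lower()
    if key in INHERENTLY_SAFE:
        return "inherently safe"
    if key in HACKABLE:
        return "hackable"
    # Anything not covered by either list is left to the analysis team.
    return "needs review"


print(classify("Check valve"))    # inherently safe
print(classify("control loop"))   # hackable
```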
Security PHA review example: thermal runaway reaction
A chemical process employs a reactor that contains a series of packed beds of catalyst to remove chemical impurities from a feed stream by reaction with hydrogen. The chemical feed is vaporized and mixed with hydrogen before it enters the reactor. Once the reactants enter the reactor vessel and contact the catalyst bed, an exothermic reaction occurs, significantly increasing the temperature of the reactant materials and the vessel. To reduce the temperature of the reaction products leaving the first bed, an additional cool hydrogen quench is supplied under flow control in between each catalyst bed. A simplified process flow diagram of the process is shown in figure 3.
Figure 3. Hydrogen reactor simplified process flow diagram
If the hydrogen quench were to fail, for instance, because the flow control loop supplying the quench hydrogen failed with its control valve in the closed position, the temperature in the next bed of the reactor would significantly increase. Additionally, as the temperature increases, the reaction rate also increases, causing a faster reaction and more heat release, and thus a still higher temperature. This vicious cycle continues and quickly reaches the point where subsequent quenches are no longer effective and the temperature in the reactor and its outlet piping exceeds the maximum allowable working temperature (MAWT), causing a loss of containment of the process contents as the piping and vessel melt and open to the atmosphere. This scenario was considered during a HAZOP-style PHA. The worksheet for the low-flow deviation is shown in figure 4.
Figure 4. Runaway reaction PHA worksheet
The SPR begins with an analysis of the initiating event. In this case, the initiating event is the failure of a flow control loop. Because the control loop is contained in a distributed control system, it is computer based. If a malevolent actor remotely took over the DCS, the position of the valve could be manipulated to the closed position. As such, the initiating event is determined to be hackable.
Next, all of the safeguards are reviewed to determine if they can be hacked. In this case, there are two safeguards that are related to operator intervention based on alarms and one that is an SIF. The two operator intervention safeguards are determined to be hackable, because the alarm annunciation occurs in the DCS. If a malevolent actor were to take control of the DCS, the operator could be blinded to the loss of the flow condition if the hacker disabled the alarm and froze the human-machine interface value in its last good state. The one SIF is also determined to be hackable, because it resides in an SIS that is based on a programmable logic controller. If the control system were taken over by a malevolent actor, the output of the SIF could be frozen in an energized state, making the SIF unable to respond to the hazardous condition.
In this case, the team determined that all the safeguards could be hacked. As a result, the next step is to identify the consequence category of the scenario, and use that consequence category to determine the SL required to make the risk of this scenario tolerable from a cybersecurity perspective. The consequence and SL are related by the operating company’s tolerable risk criteria (figure 7).
Figure 7. Tolerable risk criteria
The consequence category is high in this case, based on the potential for a single fatality from the fire that could accompany the loss of containment event. In accordance with the risk tolerance criteria in figure 8, this results in an SL assignment of SL 2.
Figure 8. Consequences
In this example, the assigned SL can be reasonably achieved by typical cybersecurity mechanisms that the plant is familiar with, so the project team accepts the SL assignment without further deliberation, and the SPR study continues. But consider a case where the SPR process results in the assignment of a very high SL, one that requires a redesign of the ICS cybersecurity mechanisms beyond what the plant equipment and staff are capable of implementing.
To explore this situation, consider the same process scenario again, but in this case, assume that the consequences are much higher. For instance, in another similar case, a release of the reactor material after loss of containment could cause a large toxic gas cloud instead of a localized fire. If the result of the release of the toxic gas cloud is multiple off-site fatalities, now the risk of the situation is entirely changed. Figure 9 presents a revised PHA study report excerpt for this situation.
Figure 9. Runaway reaction PHA worksheet (revised consequence)
In this new case, the SPR would proceed in exactly the same way. The initiating event analysis would show that it is hackable, and the safeguard analysis would show that all the safeguards are hackable. But in this case, instead of a consequence category of “high” that results in an SL of 2, the consequence category is “very-very high,” resulting in an SL of 4. An SL of 4 is a very difficult target to achieve, and most ICS design, operation, and maintenance practices would not achieve SL 4 without very difficult and expensive modifications to equipment and practices. In a case like this, it may be prudent for the team to recommend implementation of a safeguard that cannot be hacked, so that the consequence of this scenario does not factor into the selection of the required SL.
Upon review of the common safeguards that cannot be hacked, it is determined that no self-contained mechanical device, like a pressure relief valve, is capable of preventing the scenario under consideration. Furthermore, because the hazardous event is a runaway reaction with no limit on the potential temperature that could be achieved, changing the vessel design to increase the MAWT will also not be effective. In this case, the only effective safeguard that is inherently safe against cyberattack is an analog “mimic” of the safety instrumented function.
The analog “mimic” of the SIF UZC-207 will employ the second thermocouple of a dual element thermocouple set in the existing thermowell. The second thermocouple element will be wired to an analog temperature transmitter that will convert the temperature measurement to a 4–20 mA signal. The 4–20 mA signal will be analyzed by an analog current monitor relay that will open a contact in the 24 VDC signal to the solenoid valve for UZV-207, de-energizing the solenoid, venting the valve’s actuator, and causing the valve to go to a closed position. As designed, this entire analog mimic is inherently safe against cyberattack, and any cyberattack that is waged on the digital complement (UZC-207) will not interfere in the safety functionality of the analog mimic function. The design of the mimic is shown in more detail in figure 10.
Figure 10. Hydrogen reactor SIF with analog “mimic”
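The signal chain of the mimic can be modeled in a few lines to check a trip setting. This only models the logic, because the real function is analog hardware with no software in it. The 0 to 600 °C transmitter range and the 450 °C trip point are hypothetical values chosen for illustration, and clamping the output to 4-20 mA is a simplification.

```python
def transmitter_ma(temp_c: float, range_lo_c: float, range_hi_c: float) -> float:
    """Linear 4-20 mA scaling of a temperature over the transmitter range."""
    fraction = (temp_c - range_lo_c) / (range_hi_c - range_lo_c)
    fraction = min(max(fraction, 0.0), 1.0)  # simplification: clamp to the live range
    return 4.0 + 16.0 * fraction


def relay_contact_closed(signal_ma: float, trip_ma: float) -> bool:
    """The current monitor relay opens its contact at or above the trip current."""
    return signal_ma < trip_ma


def valve_closed(contact_closed: bool) -> bool:
    """Open contact -> solenoid de-energized -> actuator vented -> valve closes."""
    return not contact_closed


# Hypothetical settings: 0-600 degC range, trip at 450 degC.
RANGE_LO_C, RANGE_HI_C, TRIP_C = 0.0, 600.0, 450.0
trip_ma = transmitter_ma(TRIP_C, RANGE_LO_C, RANGE_HI_C)
print(trip_ma)  # 16.0

for temp_c in (400.0, 500.0):
    signal = transmitter_ma(temp_c, RANGE_LO_C, RANGE_HI_C)
    print(temp_c, valve_closed(relay_contact_closed(signal, trip_ma)))
# 400.0 False
# 500.0 True
```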
Because the scenario can no longer be hacked, the SPR analysis yields a result of “no requirements” for the SL for this scenario.
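The three outcomes in this example can be reproduced with a small sketch. The only consequence-to-SL pairs used are the two stated in the text ("high" gives SL 2, "very-very high" gives SL 4). The function name and the safeguard descriptions are illustrative assumptions.

```python
from typing import List, Optional, Tuple

# The two consequence-to-SL pairs stated in the example; other categories
# would come from the operating company's own risk tolerance criteria.
SL_BY_CONSEQUENCE = {"high": 2, "very-very high": 4}

Safeguard = Tuple[str, bool]  # (description, hackable?)


def required_sl(cause_hackable: bool,
                safeguards: List[Safeguard],
                consequence: str) -> Optional[int]:
    """Return the SL a scenario drives, or None for 'no requirements'."""
    if cause_hackable and all(hackable for _, hackable in safeguards):
        return SL_BY_CONSEQUENCE[consequence]
    return None


# Quench flow control loop in the DCS: the initiating event is hackable.
digital_safeguards: List[Safeguard] = [
    ("operator response to first DCS alarm", True),
    ("operator response to second DCS alarm", True),
    ("SIF UZC-207 in PLC-based SIS", True),
]

fire_case = required_sl(True, digital_safeguards, "high")
toxic_case = required_sl(True, digital_safeguards, "very-very high")
with_mimic = required_sl(
    True,
    digital_safeguards + [("analog mimic of UZC-207", False)],
    "very-very high",
)

print(fire_case, toxic_case, with_mimic)  # 2 4 None
```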
Protecting the process industry
Process industry plants contain hazards that can have very severe consequences if a loss of containment occurs. Process industry design engineers have dozens or even hundreds of years of experience in protecting these facilities. Many of the safeguards that have been designed to protect process plants were developed years before computers even existed, and thus are inherently safe against cyberattack.
When properly employed at the required locations, these safeguards can make a process plant inherently safe against cyberattack. Application of these safeguards in the required locations can be performed in a thorough and systematic fashion through an SPR study. This process involves going through the process hazard analysis reports that have already been completed for a process plant and reviewing each scenario. The review involves considering the cause and safeguards to determine if they can be hacked. If so, and if the consequence is significant, then the plant should employ a safeguard that is inherently safe against cyberattack.
The SPR process for determining the required SL of an ICS is in its infancy but is being adopted rapidly, because the process is simple and obvious to process safety practitioners once it is explained and the rationale for undertaking the additional study steps is defined.