March 2009

Fault tolerance and disaster recovery

By Michael Whitt

In a perfect world, our control systems would install and operate indefinitely with no faults. Experience teaches us otherwise.

Control system specifications should deal with fault tolerance and failure recovery issues.

A Hazard and Operability (HazOp) study is a good vehicle for determining the level of risk acceptable for a particular production process. If a formal HazOp is not practical, then it falls to the specification writer to perform the analysis.

According to the National Institute of Standards and Technology (NIST), there are four approaches to achieving dependability:

  1. Fault avoidance - Design phase: Use care to eliminate potential causes of faults in the design phase. This is where a HazOp or some other detailed method of analysis is most useful. The HazOp process, when coupled with the proper application of industry standards and accepted practices, followed by a thorough design check, should help filter out most, but never all, of the conditions in which an avoidable fault could exist. For software applications, one should execute a thorough Functional Acceptance Test (FAT) in which process simulation or loop-back logic validates the software prior to shipping it to the site. The control specification should describe the level of detail that will be required during the FAT.
  2. Fault removal - Pre-commissioning phase: Regardless of the thoroughness of the FAT, mating the control system to the actual field devices always uncovers problems. Vendor-provided equipment rarely operates per its preliminary literature, so control schemes may need modification. This is where a well-defined Site Acceptance Test (SAT) is most useful. The SAT builds on the FAT by removing the simulation and having the software operate the actual field devices. Removing faults during or after commissioning is extremely expensive.
  3. Fault tolerance - Post-commissioning phase: Allow the system to operate without an intolerable interruption in production even in the presence of ongoing hardware or software faults. A hot-standby set of processors that switches primaries upon fault detection would be an example of a fault-tolerant system, as would an Ethernet ring topology that allows a device to fail without interrupting the operation of the remaining devices on the network.
  4. Fault evasion - Post-commissioning phase: Sense the trend toward a problem situation, and initiate corrective action before the problem occurs. Setting an alarm when network data throughput deviates beyond a set percentage from the norm, giving the operator time to take corrective action, would be an example of fault evasion.
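
The fault evasion example in item 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name ThroughputMonitor, the five-sample rolling window used as "the norm," and the 20 percent limit are all hypothetical choices, not values taken from any standard or product.

```python
from collections import deque
from statistics import mean


class ThroughputMonitor:
    """Flags throughput samples that stray too far from a rolling baseline."""

    def __init__(self, window=10, max_deviation_pct=20.0):
        # Both defaults are illustrative assumptions, not recommended settings.
        self.samples = deque(maxlen=window)
        self.max_deviation_pct = max_deviation_pct

    def check(self, throughput):
        """Return True if this sample should raise an operator alarm."""
        alarm = False
        # Wait until the window is full before judging any sample.
        if len(self.samples) == self.samples.maxlen:
            baseline = mean(self.samples)
            if baseline > 0:
                deviation_pct = abs(throughput - baseline) / baseline * 100.0
                alarm = deviation_pct > self.max_deviation_pct
        # Only normal samples update the baseline, so a developing fault
        # does not gradually become the new "norm".
        if not alarm:
            self.samples.append(throughput)
        return alarm


# Hypothetical readings in arbitrary units; the seventh one sags by 30 percent.
monitor = ThroughputMonitor(window=5, max_deviation_pct=20.0)
readings = [100, 102, 98, 101, 99, 100, 70, 101]
alarms = [monitor.check(r) for r in readings]
```

A real system would more likely configure this kind of check in the HMI or network management tool than in custom code. The point for the specification writer is that the baseline, the allowed deviation, and the operator response all need to be defined.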

Weighing failure recovery

Regardless of how bulletproof the design is, how fault-tolerant the system, or how well trained the operators and technicians, system failures are still possible. Though we can mitigate the severity and duration of these undesired events by design and technique, the fact remains that a disastrous event could occur at any time. A properly written control system specification should deal with this from the beginning.

How quickly could your facility recover from a major event such as a hurricane or fire? There are two major categories to consider: the physical plant and software.

Physical plant configuration control: When a processing plant is constructed, a mass of drawings and other documents emerges as a matter of course. The control system specification should describe the documents that are critical for failure recovery. A complete record of these drawings should be maintained offsite. Most of those documents are static and change very little after project installation. Some, however, provide information that could change after installation. Instrumentation and electrical drawings fall into this category, as do Piping and Instrumentation Diagrams, instrument specifications, electrical single-line diagrams, and other documents. It is important to keep these documents current in order to quickly reconstruct sections of the facility if needed.

Data security: This depends on a mix of automatic and manual processes. There are two primary considerations for data security in this context: backup and restore. Several of the more common options for managing data follow.

  1. No offsite data, daily backups to hard drive: This is the least secure. If the computer fails, the backups are likely to be lost. The likelihood of being able to restore is remote.
  2. Daily backup to hard drive with a weekly backup to tape or other media. Onsite and offsite storage of tapes: This is better. In this case, the most data that will be lost is a week's worth. Restoration is manual and can begin as soon as the tapes are retrieved. Until recently, this was probably the most common method of archival.
  3. Daily backup to hard drive with a weekly backup to tape or other media. Onsite and offsite storage of tapes. Redundant, hot-swappable, mirrored hard drives: This is becoming a more common configuration as its cost falls. In this case, data constantly backs up to the mirrored drive. If one drive fails, the server switches immediately to the backup, with little or no effect on the system. The bad drive can be pulled and replaced without shutting down the server, and the new drive will automatically re-mirror. The risk in this configuration is less related to the failure of a drive than to a fire or some other disaster in the equipment room taking out the entire server and both drives, in which case up to a week's worth of data will be lost.
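
One way to compare the three options is to tabulate the worst-case data loss for each kind of event. The Python sketch below is a simplified model, not a sizing tool. The names BackupOption and worst_case_loss_days are hypothetical, and the model assumes that the daily hard-drive backups live on the same machine and are lost along with it, which the text above describes only as likely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackupOption:
    name: str
    offsite_interval_days: Optional[int]  # None means nothing is stored offsite
    mirrored_drives: bool


def worst_case_loss_days(option: BackupOption, event: str) -> Optional[int]:
    """Worst-case days of data lost, or None if nothing can be restored.

    event is "drive_failure" or "room_disaster" (hypothetical labels).
    """
    if event not in ("drive_failure", "room_disaster"):
        raise ValueError(f"unknown event: {event}")
    if event == "drive_failure" and option.mirrored_drives:
        return 0  # the mirror takes over with little or no effect
    # Assumption: the server's data and its local daily backups are gone,
    # so recovery falls back to the most recent offsite copy, if any.
    return option.offsite_interval_days


options = [
    BackupOption("1. Daily to hard drive only", None, False),
    BackupOption("2. Daily plus weekly tape", 7, False),
    BackupOption("3. Weekly tape plus mirrored drives", 7, True),
]

# Build a small comparison table keyed by option name and event.
table = {
    (opt.name, event): worst_case_loss_days(opt, event)
    for opt in options
    for event in ("drive_failure", "room_disaster")
}
```

In this model the mirrored drives in option 3 help only with a single drive failure. For an equipment room disaster, options 2 and 3 come out the same, which is why the offsite backup interval deserves its own line in the specification.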

Fault tolerance and disaster recovery are topics frequently omitted or under-described in control system specifications.


Michael Whitt is an ISA Senior Member and the Manager of Integrated Systems at Mesa Associates.