July 2008

Web Exclusive

Flirting with disaster

When your system crashes, you need a plan to bring you back up now

Fast Forward

  • Disaster recovery planning is critical to ensure operations quickly return.
  • A complete plan needs to address software, hardware, and the network.
  • Policy becomes management's agreement on the risk level it will assume.
 
By Michael Carey

An ounce of prevention is worth a pound of cure. Better safe than sorry. You can have your cake and eat it too. A stitch in time saves nine. A chain is only as strong as its weakest link.

A cliché fest for sure, but they are all vivid reminders that a manufacturer needs to work out a complete disaster recovery plan.

Disaster recovery planning is a critical process required to ensure operations return quickly after system failure, data loss, or system destruction. Users often mistakenly assume disaster recovery planning is just a backup and restoration plan, but backup and restoration are only part of it. A true disaster recovery plan must address not only software but also hardware and the network. It must also cover both prevention and recovery, and it must protect systems built on current technology as well as legacy systems for which replacement hardware may no longer be available.

Disaster recovery planning commences during the system specification stage because all required hardware, software, and services need to be in the plan for the project and then in the operational budget.

Accountability

As a rule, the disaster recovery plan must account for system failures, data loss, and system destruction. System failures can occur in any of the system components or in the network itself. A system disaster can result from a failure of the hardware, the software, or even the power supply.

System destruction is similar to system failures, but on a larger scale where multiple system components are destroyed by fire, flood, or demolition. Data loss can occur from system failures and system destruction but also from data corruption, user actions, malicious programs, etc. Thus, in order to address disaster prevention and disaster recovery, the disaster recovery plan must consist of policies, procedures, hardware, software, and services to account for any possible occurrence from user mishaps to natural disasters.

Defining a disaster recovery policy is the most critical task in disaster recovery planning. The disaster recovery policy becomes management's agreement on the risk level the organization is willing to assume, and it defines the spending guidelines for recovery. Such a policy should state, for example, whether the company purchases a spare controller or builds redundant controllers into the design. Assessing the cost and risk is management's responsibility. Once that assessment is complete, management needs to communicate the disaster recovery policy to the organization.
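The cost-versus-risk decision behind the spare-or-redundant question can be roughed out with simple arithmetic. The Python sketch below is only an illustration: the `Option` class, the failure rates, the repair times, and every dollar figure are hypothetical placeholders, not data from any real plant. It adds the yearly cost of each safeguard to the expected cost of the downtime that safeguard still allows.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One way of protecting a controller. All figures are hypothetical."""
    name: str
    annual_cost: float             # yearly cost of the safeguard itself
    failures_per_year: float       # expected failures that cause downtime
    hours_down_per_failure: float  # time to recover from one failure


def expected_annual_cost(option: Option, downtime_cost_per_hour: float) -> float:
    """Safeguard cost plus the expected cost of the downtime it still allows."""
    expected_hours = option.failures_per_year * option.hours_down_per_failure
    return option.annual_cost + expected_hours * downtime_cost_per_hour


def cheapest(options, downtime_cost_per_hour: float) -> Option:
    """Pick the option with the lowest expected annual cost."""
    return min(options, key=lambda o: expected_annual_cost(o, downtime_cost_per_hour))


# Hypothetical alternatives for a single controller.
options = [
    Option("no spare", 0.0, 0.2, 48.0),
    Option("spare controller on the shelf", 2000.0, 0.2, 4.0),
    Option("redundant controllers", 6000.0, 0.2, 0.1),
]

# Compare the options at two assumed downtime costs (dollars per hour).
for hourly_cost in (1000.0, 10000.0):
    best = cheapest(options, hourly_cost)
    print(f"downtime at {hourly_cost:.0f}/hour: {best.name}")
```

With these made-up inputs, the preferred option shifts as an hour of downtime becomes more expensive, which is exactly the trade-off the policy asks management to settle.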

First step

Designing fault tolerance into the system is the first step of prevention. A manufacturer can remedy power problems by using surge protectors or uninterruptible power supplies (UPSs). It can also bypass failures by adding redundancy through RAID drives, redundant controllers, redundant networks, redundant network interface cards (NICs), mirrored databases, and mirrored servers. Prevention built into a design must be smart and cost-effective. Safeguards need to provide effective prevention, not just a semblance of prevention. For example, redundant controller processors in the same rack sharing a single power supply are not very effective, because the power supply remains a single point of failure. Ideally, the entire controller should be redundant, and the redundant controller processors should sit in a separate rack with a different power supply. Another case of useless redundancy is a server with two NICs that both connect to the same non-redundant switch. Either spend the money and get redundant switches, or skip the second NIC and use that money somewhere it can provide more value.
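One way to check a design for this kind of useless redundancy is to list the components each redundant path depends on and look for anything common to all of them. The Python sketch below assumes a deliberately simple model in which the system stays up as long as one path has all of its components working; the function name and the component names are hypothetical.

```python
def single_points_of_failure(paths):
    """Return the components that every redundant path depends on.

    Each path is a list of component names. If a component appears in
    every path, losing it takes down all paths at once.
    """
    if not paths:
        return set()
    common = set(paths[0])
    for path in paths[1:]:
        common &= set(path)
    return common


# Two NICs plugged into the same non-redundant switch.
shared_switch = [["nic_a", "switch_1"], ["nic_b", "switch_1"]]

# Two NICs, each plugged into its own switch.
redundant_switches = [["nic_a", "switch_1"], ["nic_b", "switch_2"]]

# Two controller processors in one rack on one power supply.
shared_power = [["cpu_a", "psu_1"], ["cpu_b", "psu_1"]]

for label, design in [
    ("shared switch", shared_switch),
    ("redundant switches", redundant_switches),
    ("shared power supply", shared_power),
]:
    print(label, sorted(single_points_of_failure(design)))
```

A real system has more layers than this (power feeds, cabling, software), but even a short list like this one exposes the shared switch and the shared power supply.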

Testing procedures

Prevention also comes in maintenance procedures: procedures that verify UPS battery life, replace controller batteries, update antivirus software, and install security patches. Some procedures, such as backup procedures, seem an obvious part of disaster recovery, but they still need to be tested. It is a rude awakening to restore a SCADA server from a backup only to discover the recovery procedure does not work or, worse yet, the backup procedure was faulty. Other procedures, such as a change control procedure, play an important role in disaster recovery, but manufacturers rarely group them with the disaster recovery plan. The change control procedure matters because it ensures all changes made to the system are properly stored, so that during a system recovery the newest version is the one restored and old versions of hardware or software do not re-enter the system. Remember that change control must account for the custom software, the packaged software's updates and patches, and even hardware firmware upgrades.
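A restore test can be as simple as restoring to a scratch location and comparing file checksums against the original. The Python sketch below is a minimal illustration, not a replacement for the backup product's own verification: `verify_restore` and `tree_digests` are hypothetical helper names, the file names are made up, and `shutil.copytree` merely stands in for whatever backup and restore tooling is actually in use.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def tree_digests(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def verify_restore(source: Path, restored: Path) -> list:
    """Return the files that are missing, extra, or different after a restore."""
    src = tree_digests(source)
    dst = tree_digests(restored)
    return sorted(name for name in src.keys() | dst.keys()
                  if src.get(name) != dst.get(name))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical configuration folder standing in for real system data.
        source = Path(tmp) / "scada_config"
        source.mkdir()
        (source / "tags.csv").write_text("PT-101,psi\n")

        # Stand-in for "back up, then restore to a scratch location".
        restored = Path(tmp) / "restore_test"
        shutil.copytree(source, restored)
        print("differences after clean restore:", verify_restore(source, restored))

        # Simulate a faulty restore and check that it is detected.
        (restored / "tags.csv").write_text("corrupted")
        print("differences after bad restore:", verify_restore(source, restored))
```

An empty list means every file came back byte for byte. Anything else names the files to investigate before a real disaster forces the question.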

The final piece of prevention ensures a manufacturer can recover the system in the case of destruction. This normally requires that software, such as installation media, custom configuration, and data, remain stored off-site in a controlled environment to prevent media degradation. The manufacturer also needs to maintain spare hardware. Having spare hardware for legacy systems is especially critical because replacements may be unavailable or extremely hard to come by. For example, finding hardware for Windows NT 4.0 to run on is extremely difficult now. PCs no longer come with 3.5" floppy drives, some PCs have no PS/2 ports and rely on USB for the keyboard and mouse, and current storage media is too large for NT 4.0 to support. Thus an upgrade plan needs to be part of the disaster recovery plan to ensure that, in the case of system destruction, the system components are available.

A disaster recovery plan protects against system failures, data loss, and system destruction. Disaster recovery plans are all about prevention and quick recovery once a failure occurs. Organizations often implement their disaster recovery plan as a set of untested backup and restore procedures, but a disaster recovery plan encompasses more than just backup and restore. A disaster recovery plan may seem expensive to implement, but when it reduces unplanned downtime, it can pay for itself.

ABOUT THE AUTHOR

Michael Carey is director of MES and Information Systems at Panacea Technologies Inc. His e-mail is careym@panaceatech.com.

Disaster recovery tips

  • Design fault tolerance into the system.
  • Prevention also comes in maintenance procedures.
  • A change control procedure ensures all changes to the system are properly stored.
  • Software should remain stored off-site in a controlled environment.
 

