1 September 2006
A catastrophe is just around the bend; do you have a plan?
By Peggie W. Koon
A phone rings very early one morning, and after a groggy hello, you find it is the on-call employee.
You imagine a system crashed. She tells you about a call she received; there’s a problem with a warehouse printer. You utter a sigh of relief. She then says the problem is she is unable to get to the facility. All roads are blocked. You stay on the phone with her as she drives 10 miles via the local interstate highway in an attempt to access the plant via back roads, only to meet yet another roadblock. Finally, you turn on the local news. “Go home,” you tell the employee. “There has been a deadly chemical spill in the area. No one is being allowed into or out of the area, which has now been declared an evacuation zone.” You watch the news intently for details. A picture of the crash site appears on television.
According to the news report, the crash occurred near one of the company’s manufacturing plants. But a closer look reveals the building flashing across the screen is not a manufacturing facility. It’s your IT building, a building where you and almost every business system server reside, along with associated disk drives that house operating system and business application system software, corporate databases, and more. In addition, the company’s communications hubs for telephone, e-mail, Internet, and Intranet services reside there. The chemical gas that caused nine deaths and the evacuation of thousands has resulted in the worst imaginable scenario for disaster recovery for your company. All computers in the IT building have been rendered inoperable by the gas; all telephone lines to the company plant sites are down. You want to wake up from what must be a nightmare, but it’s impossible because it’s not a dream; it’s true.
The main question you have to ask is: Would you be able to respond proactively to the disaster? On top of that, would you have an executable disaster recovery or contingency plan in place to respond to such disaster IT? With the U.S. in the midst of hurricane season on the East Coast, tornadoes prevalent in the Midwest, and heavy earthquake potential on the West Coast, and with flooding or other natural disasters possible anywhere in the world, every manufacturer should have some type of disaster recovery plan on file.
Never happen? Wrong
That scenario sounds like something that would never happen in a million years, right? It happened at Avondale Mills in Graniteville, S.C. on 6 January 2005. A northbound train hit a parked train at 2:39 a.m., causing 14 cars, including three tankers that held 90 tons of liquid chlorine, to derail from the track near the IT building.
At atmospheric temperature, the liquid chlorine became a deadly gas that was heavier than air; the chlorine gas permeated every building within a 1-mile radius of the crash, destroying shrubbery, trees, and any metallic surfaces it encountered along the way. Cars that were once running stalled in the middle of nearby streets, fish in a local creek died, electrical switch gears and boxes corroded, motors and electrical components on machinery at the various plants suffered damage, electrical outlets sparked spontaneously, and computers in the company’s IT building came to a silent but immediately discernible halt. No one at the company had ever imagined such devastation could occur, and none of us were ready for the effect this accident would have—especially relative to IT.
Let’s face it: There isn’t one manufacturer that is not affected by IT. We define IT as an information system that is the combination of computers and people used to provide information to aid in making decisions and managing a firm. You could also define IT as an acronym used to inculcate in the minds of every level of corporate and middle management the concept that all of the computers or information systems in a company, whether used on the plant floor or in the board room, are a part of the company’s investment in technology.
At the most rudimentary level, we can assume any computer that malfunctions as a result of a disaster is included in disaster IT. For example, a computer-controlled drive system on a machine that suddenly fails due to corrosion of the drive system computer or the drive system electronics is a part of disaster IT. A multifunction processor in the plant’s process control system, used to control the machine’s steady state processing, that suddenly fails due to corrosion of the boards is a part of disaster IT. Whether it’s the computers located in an IT building, or the computers on the plant floor, in a control room, on a machine, or on a desktop, when computers and information systems are affected, the end result is disaster IT.
First step: Recovery
When a company attempts to recover from disaster, the first step usually includes assessing the damage to its infrastructure, which is at the core of its operations. The next steps include identifying the key people required to get the company’s operations back up and running, developing a strategy for restoration/recovery, and implementing the plan. The IT infrastructure will define the key people and resources required to support the IT organization; hence, the IT director or manager must take a similar approach toward systems recovery.
The IT organization is an IS strategy that includes all computer systems in a company, all transfer media, all software products, all databases, and all technology providers. The old expression, “the network is the computer,” applies when a company implements this strategy. And when disaster strikes IT, it affects every aspect of the operations, from payroll to customer service, to benefits, to plant floor production, to orders and sales, to inventory and shipments, to decision support and executive information. A tightly integrated IT organization/strategy is critical to the survival of today’s companies; however, such a strategy can increase IT vulnerability during disaster.
At the infrastructure level of IT is the communications backbone, or network, which usually includes any voice and/or data transfer media and associated servers. For example, the loss of a company’s Private Branch Exchange, or PBX, might affect internal and external communications. Cell phones can immediately go into service for people-to-people communications; in the plants, workers can use walkie-talkies. If the corporate wide area network is lost, communication between clients and servers is also lost. Communications between systems and peripheral devices (such as printers, thin net, and PC clients) connected to switches and hubs on the corporate wide area network will also be lost along with the network. The Internet is often integrated into the business (customer access via portals, e-mail, or internal corporate Intranet users), so loss of Net access also affects customer and company communications.
The first order of business, then, is to restore the network, so internal and external people-to-people and system-to-system communications can come back on line. You see, every server in a company is typically connected via the backbone. Without the network, there might as well be no servers.
In our case, network and system specialists worked around the clock to restore the corporate voice and data networks; PC and office automation experts worked in tandem with the communications experts to replace PCs and printers affected. In the interim, we established a wireless network to resume customer communications using an Internet Service Provider.
Remember, we defined IT as all computing platforms (Windows XP, Windows 2000, Windows NT, OS, OpenVMS, etc.), all communications protocols (TCP/IP, etc.), all software applications (business, process control, process automation), all databases (DB2, SQL Server, Sybase, RMS, etc.), and all technology providers (DBAs, programmers, analysts, network managers, system managers, help desk personnel, control system engineers, etc.). Every aspect of the IT organization is affected during disaster IT.
Back in the saddle
When the servers undergo complete destruction, the process of replacing the computer hardware is usually fairly uncomplicated. Suppliers, especially those with whom you partner, tend to make every practical effort to assist in an expedient replacement process. The real key to the effort lies in the availability of the technical IT experts and the existence of off-site software, system, and data backups necessary for server restoration. Notice the words software, system, and data. All too often companies invest in off-site backup storage for company data without considering the need for concomitant off-site operating system and application software backup storage. This can be a fatal mistake in disaster IT. Every aspect of the operations has the potential to be shut down when servers for business transaction processing, decision support, expert systems, and executive information systems are destroyed.
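The point about software, system, and data can be turned into a simple audit. The following is a minimal sketch, not a description of any tool used at Avondale Mills: the server names, the `Server` record, and the `missing_offsite_backups` helper are all hypothetical, and the inventory is made up for illustration. It flags every server whose off-site storage does not cover all three kinds of backup.

```python
from dataclasses import dataclass, field

# The three kinds of backup the article says must all be stored off-site.
REQUIRED_KINDS = {"system", "software", "data"}


@dataclass
class Server:
    name: str
    # Kinds of backup currently held off-site for this server
    # (hypothetical inventory, entered by hand or from a backup catalog).
    offsite_backups: set = field(default_factory=set)


def missing_offsite_backups(servers):
    """Return {server name: sorted list of missing backup kinds}."""
    gaps = {}
    for server in servers:
        missing = REQUIRED_KINDS - server.offsite_backups
        if missing:
            gaps[server.name] = sorted(missing)
    return gaps


# Illustrative inventory; the names are invented for this example.
inventory = [
    Server("payroll", {"data"}),
    Server("order-processing", {"system", "software", "data"}),
    Server("decision-support", {"data", "software"}),
]

gaps = missing_offsite_backups(inventory)
for name, kinds in gaps.items():
    print(f"{name}: no off-site backup of {', '.join(kinds)}")
```

In this made-up inventory, the payroll server is the classic case the article warns about: the data is safe off-site, but nothing exists to rebuild the operating system or the application that reads it.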
The applications for these servers are especially vulnerable if a manufacturer developed them in-house or purchased them with extensive customization. If the disks on the servers suffer damage, as might well be the case, the company’s operational systems (business operations) may only partially be available. The one hope in such a case lies in the knowledge and expertise of the people that developed and maintained the system. They must be available to re-create business processes using whatever information they can salvage from the system servers. Disk forensic experts are often invaluable in the restoration process.
Another fatal mistake in disaster IT is the lack of contingency planning. Almost every IT manager has developed or at least conceptualized a disaster recovery plan. But what if the recovery plan takes weeks or months to implement? What contingency plan is in place? If you cannot recover the server for payroll, is there a remote resource available to pay your employees? If the order processing system fails, will you be able to place orders from a remote site, to locate rolls in the warehouse, to print bills of lading, and make shipments to customers? Will you know your inventory levels and order position? Will you be able to receive raw materials and distribute the materials to the various manufacturing facilities to make products? Will you be able to schedule your remaining operating facilities? These are just a few of the questions you must answer during disaster IT.
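One way to work through these questions is to compare, system by system, how long recovery is expected to take against how long the business can go without the system. The sketch below assumes you can estimate both figures in days; the `BusinessSystem` record, the `needs_contingency` helper, and every number shown are hypothetical and serve only to illustrate the comparison.

```python
from dataclasses import dataclass


@dataclass
class BusinessSystem:
    name: str
    estimated_recovery_days: float  # expected time to restore the system
    tolerable_outage_days: float    # how long the business can do without it
    has_contingency: bool           # is a remote or manual fallback arranged?


def needs_contingency(systems):
    """Names of systems whose recovery outlasts the tolerable outage
    and that have no fallback in place."""
    return [
        s.name
        for s in systems
        if s.estimated_recovery_days > s.tolerable_outage_days
        and not s.has_contingency
    ]


# Illustrative figures only; none of these come from the article.
systems = [
    BusinessSystem("payroll", 30, 7, False),
    BusinessSystem("order processing", 21, 2, True),
    BusinessSystem("executive information", 30, 45, False),
]

exposed = needs_contingency(systems)
print("Systems that need a contingency plan:", exposed)
```

A system that appears in the result is one where the recovery plan alone is not enough, which is exactly the gap a contingency plan is meant to close.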
Manufacturers have long used distributed processing in facilities, especially where you find process control and process automation systems. The reason for the distributed control is to ensure the loss of computer resources in one facility does not affect the operations at another facility. The use of distributed processing at the plant floor level can be a critical saving grace during such a disaster. The plant systems located in each production facility are usually autonomous from the business servers; these plant systems can provide accurate real-time production information to management. The plant systems IT experts can develop applications to merge production data with warehouse location data to provide visibility of inventory levels to management, which in turn will facilitate the shipping, invoicing, and order entry processes during the disaster. These distributed plant systems can provide valuable real-time information on the state of the company’s operations. In addition, the plant systems will allow production to continue.
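The merge of production data with warehouse location data might look something like the following. This is a sketch under stated assumptions, not the application built at Avondale Mills: the record layout, roll IDs, product names, and location strings are all invented, and `merge_inventory` is a hypothetical helper. It totals located inventory per product and lists any rolls that production reports but the warehouse data cannot place.

```python
from collections import defaultdict

# Hypothetical production records from a plant system: one entry per roll.
production_records = [
    {"roll_id": "R-1001", "product": "product-A", "quantity": 500},
    {"roll_id": "R-1002", "product": "product-A", "quantity": 300},
    {"roll_id": "R-1003", "product": "product-B", "quantity": 450},
]

# Hypothetical warehouse location data, keyed by roll ID.
warehouse_locations = {
    "R-1001": "warehouse-1/bay-3",
    "R-1002": "warehouse-1/bay-7",
}


def merge_inventory(production, locations):
    """Join production records to warehouse locations by roll ID.

    Returns (per-product totals with locations, roll IDs with no known location).
    """
    totals = defaultdict(lambda: {"quantity": 0, "locations": []})
    unlocated = []
    for record in production:
        location = locations.get(record["roll_id"])
        if location is None:
            # Produced, but the warehouse data cannot say where it is.
            unlocated.append(record["roll_id"])
            continue
        entry = totals[record["product"]]
        entry["quantity"] += record["quantity"]
        entry["locations"].append(location)
    return dict(totals), unlocated


inventory, unlocated = merge_inventory(production_records, warehouse_locations)
print(inventory)
print("Rolls with no warehouse location:", unlocated)
```

Keeping the unlocated rolls in a separate list matters during a disaster: those are the items someone has to find by walking the warehouse before they can ship.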
Officials at the plant made several critical observations regarding disaster IT, the most significant of which are:
IT is a critical resource that affects every aspect of a company, especially during a disaster (such as the chlorine disaster in Graniteville, S.C.).
The most critical disaster IT resource is the IT staff.
The network is the computer; the communications network is critical to the company’s redeployment.
System, software, and data backups should occur on a regularly scheduled basis. You should store all backups at a remote off-site facility.
You should consider distributed processing at every practical level of systems configuration so a failure of one server at one location does not affect the entire company’s operations.
Development of an IT disaster recovery plan is not enough. Every company should invest in a pragmatic contingency plan for IT.
Partnering pays. The existence of strong alliances (with customers and suppliers alike) is critical.
It only takes one disaster of the magnitude of the chlorine spill at Graniteville for a company located in the heart of a disaster to underline the importance of IT to its business.
Katrina hits; a plan in place
After the storm surge hit, they had 20 feet of standing water. To their advantage, the company had a disaster recovery plan, which they put into action immediately after the storm. Because they documented everything, they were able to get a devastated plant up and running again in 11 weeks.
“Everything was done like we were supposed to do,” he said.
There is no way a company can survive today without planning for any kind of event, whether it is man-made or natural, he said.
As Wiles said, “Natural disasters can happen anytime, anywhere. You need a plan.”
Plan of attack
The following is a list of steps a manufacturer can use in disaster recovery:
Identify the key people required to get the systems back up and running.
If you’re a director or manager of IT, you will probably attend strategic meetings where a team is making decisions on the recovery from the catastrophe. Just as the company quickly assembles its key/critical personnel, the IT director or manager must quickly identify the key/critical resources required to restore the company’s IT.
Assess the level of infrastructure damage, and implement (or develop) a plan to restore it.
Just as the company needs to identify the damage to the infrastructure (buildings, utilities, equipment, etc.), the IT director or manager must remember that “the network is the computer.” Without the network, a company cannot perform critical business processes.
Identify the critical systems that must be in place to run the business.
At the lowest or operational level are the process control, process automation, and business transaction processing systems. Without these systems, the systems at the higher levels (EIS, etc.) are not functional. At the plant level, the process control and process automation systems should be at the various manufacturing facilities.
Implement the recovery or contingency plan for the critical systems.
This step may include the purchase of new equipment, the use of backup data, the recovery of data from disaster disks, the use of remote processing systems at remote sites, or any combination. The better the plan for recovery/contingency, the faster the recovery process is.
As you restore each critical business system, new issues will arise. The entire IT staff may quickly be overwhelmed by the tasks of deploying the new systems and responding to glitches and problems that may occur.
Redevelop the plan.
Once the company’s IT is up and running, it is critical the disaster plan, whether designed for recovery, contingency, or both, undergoes review and the company identifies its strengths and weaknesses.
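The steps above imply an order of restoration: the network first, then the operational systems, then the higher-level systems that depend on them. A dependency graph is one way to make that order explicit before a disaster. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the specific dependencies listed are assumptions made for illustration and should be replaced with those of your own systems.

```python
from graphlib import TopologicalSorter

# Each system maps to the set of systems that must be restored before it.
# These dependencies are illustrative assumptions, not a survey of any plant.
depends_on = {
    "network": set(),
    "process control": set(),  # assumed autonomous at the plant level
    "process automation": set(),  # assumed autonomous at the plant level
    "business transaction processing": {"network"},
    "decision support": {"network", "business transaction processing"},
    "executive information (EIS)": {
        "business transaction processing",
        "decision support",
    },
}

# static_order() yields every system after all of its prerequisites.
# It raises graphlib.CycleError if the dependencies are circular.
restore_order = list(TopologicalSorter(depends_on).static_order())

for step, system in enumerate(restore_order, start=1):
    print(f"{step}. {system}")
```

Several valid orders usually exist, so the exact sequence printed may vary; what the sort guarantees is that no system appears before something it depends on. Revisiting the graph is a natural part of redeveloping the plan once recovery is complete.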
A comprehensive disaster plan for IT is like an insurance policy that no one will ever use. After all, it never happened in 20 years, so it probably will not happen now, right?
IT disaster can happen to you. It happened in Graniteville, S.C.
Every company should analyze the risk of having a disaster without a feasible/pragmatic disaster recovery and/or contingency plan. If you’re not convinced of its importance, just ask anyone from Avondale Mills. They’ll tell you an investment in disaster IT is well worth the risk.
About the author
Peggie W. Koon, Ph.D., was the manager of IS plant systems at Avondale Mills, Inc. She is now the deputy chief operating officer of the Delivery Division of Morris Digital Works in Augusta, Ga.
IT Management in Century 21 www.isa.org/link/ITCentury21
When disaster ERUPTS: Plant safety measures notwithstanding, accidents happen. www.isa.org/link/disaster_erupts
Disaster Recovery or Tolerance — Your Choice www.isa.org/link/Recovery_Tolerance