Beyond controllers and capacitors
Understanding system redundancy
- Evaluate redundancy opportunities within system and subsystem software.
- Eliminate single points of failure in hardware infrastructures.
- Configure redundancy in servers, power, cooling, and networking.
By Josh Neland and Franklin Flint
Industrial automation requires a high degree of reliability, and this requirement extends to the computing hardware and software needed to run automated systems. High availability has been designed into low-level system components such as embedded controllers for many years. But today's automated equipment is expected to perform new tasks, such as communicating information to back-end databases, as part of an increasingly integrated and intelligent infrastructure.
A car maker, for example, might need to record the torque value for each screw used during airbag installation and send that information to the company's database. If the data becomes corrupted en route, causing a fault in the system, the consequences can be serious. To find out why the manufacturing process is not meeting specifications, the factory might have to shut down the production line while technicians troubleshoot the problem. For some processes, such as pharmaceutical manufacturing, corrupted data can mean an entire batch of manufactured material must be discarded.
As automated systems are required to do more and systems become more complex, companies have an opportunity to step back and evaluate where they can most effectively and cost-efficiently build redundancy into the computing infrastructure.
Building high availability
To evaluate redundancy opportunities within a system, it is best to start at a high level with the software environment. Consider high-level subsystems such as the database and the administrative console and prioritize where to invest in high availability. If a particular subsystem depends on continuous information updates, then the data store should be highly available to that subsystem. The administrative console, conversely, provides an interface with human operators who interact with information at a slower pace and do not require the highest availability.
The next step is to determine a strategy for building the required availability levels into the various subsystems. If the database must be highly available and is supported by multiple server nodes, how should demand be distributed across those nodes, especially if one of them fails? Several options are available for configuring nodes to provide high availability, including software and hardware solutions.
In the software realm, off-the-shelf operating systems and database applications may include clustering software that enables automatic failover from a faulty node to a different node in a cluster. Failover capability is also available in several virtualization products, enabling organizations to rapidly provision applications and automatically bring up new virtual instances that have been pre-configured to fulfill a certain role.
Virtualization allows different types of software stacks to be managed in a uniform way and is appropriate when legacy and modernized subsystems must coexist. In a homogeneous system, a single application framework can be responsible for provisioning the required services to a new server in the event of a failure.
The performance of each of these solutions may vary a great deal. While a fully redundant hardware clustering solution might take only a second or two to fail over, an application container may require 30 seconds to detect an issue and bring up a new server. Matching the responsiveness of a chosen solution to the requirements during the design phase is critical, as moving the layer of the architecture that ensures redundancy can require a complete system redesign once it is in place.
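As a rough illustration of matching responsiveness to requirements, the sketch below filters candidate solutions by the longest outage a subsystem can tolerate. The failover times are the approximate figures mentioned above; the function name and the five-second budget are assumptions made for the example, not measured values.

```python
# Approximate failover times from the examples in the text, in seconds.
FAILOVER_SECONDS = {
    "redundant hardware cluster": 2,
    "application container restart": 30,
}


def options_within_budget(options, max_outage_seconds):
    """Return the solutions whose failover time fits the allowed outage."""
    return sorted(
        name for name, seconds in options.items()
        if seconds <= max_outage_seconds
    )


# Hypothetical requirement: the subsystem tolerates at most 5 seconds of outage.
fast_enough = options_within_budget(FAILOVER_SECONDS, max_outage_seconds=5)
```

Running this kind of check during the design phase is far cheaper than discovering after deployment that the chosen layer cannot fail over quickly enough.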
Nodes that will operate within a clustered subsystem should adhere to certain principles such as minimizing state and failing gracefully. Stateful information within a service should be minimized, consolidated, and placed in shared storage when possible. For information that cannot be offloaded completely to shared storage, nonvolatile storage, combined with a replication strategy, can be employed to minimize the effects of node failure.
When a system hosting a database instance goes down, it may have already half-written the last transaction. As a new node steps in, the solution should be designed to roll back that partial transaction to avoid creating a double or corrupted entry and raising questions about the validity of the records. The effects of corrupt data can be catastrophic. If the history of a drug production process is found to be inaccurate, the manufacturer might have to recall an entire batch of product worth millions of dollars.
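The rollback behavior described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production design: the table names, the record_torque function, and the simulated failure are all assumptions, and a real node crash would be handled by the database engine's own recovery rather than by a Python exception.

```python
import sqlite3


def record_torque(conn, screw_id, torque_nm, fail_midway=False):
    """Write a reading and its audit entry as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO readings (screw_id, torque_nm) VALUES (?, ?)",
                (screw_id, torque_nm),
            )
            if fail_midway:
                # Stand-in for a node failing between the two writes.
                raise RuntimeError("simulated node failure")
            conn.execute("INSERT INTO audit (screw_id) VALUES (?)", (screw_id,))
        return True
    except RuntimeError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (screw_id TEXT PRIMARY KEY, torque_nm REAL)")
conn.execute("CREATE TABLE audit (screw_id TEXT)")
conn.commit()

# The first attempt fails halfway, so its partial write is rolled back.
ok_first = record_torque(conn, "A-001", 4.2, fail_midway=True)
# The retry succeeds without tripping over a leftover half-written row.
ok_retry = record_torque(conn, "A-001", 4.2)
rows = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

Because the half-written row is discarded, the retry produces exactly one reading and one audit entry instead of a duplicate or an orphaned record.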
Active/active and active/passive configurations
Once clustering software is selected, an administrator must determine the specific configuration for each cluster. Two basic approaches to configuration are available: active/active and active/passive.
Using a database as an example, an active/active approach means both the primary and reserve nodes are running the database at any given time. If database load suddenly surges, the other nodes simply pick up the extra load. While an active/active configuration has many benefits, such as low failover latency, easy expansion, and load balancing, it also requires nodes to be designed so they can run concurrently. Because rewriting existing services to accommodate this requirement is not usually feasible, active/passive may be a more appropriate choice in many cases.
An active/passive cluster configuration provides a fully redundant node for each operational node in the system, and the redundant node is brought online only if the active node fails. While active/passive is simpler to implement, the cost of providing a fully redundant set of nodes (and hardware) can be high.
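The difference between the two approaches can be sketched in a few lines. The Node class and both functions below are hypothetical and greatly simplified; real clustering software also handles health detection, fencing, and state synchronization.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


def active_active_targets(nodes):
    """Active/active: every healthy node serves traffic at the same time."""
    return [node.name for node in nodes if node.healthy]


def active_passive_target(active, standby):
    """Active/passive: the standby serves only after the active node fails."""
    if active.healthy:
        return active.name
    if standby.healthy:
        return standby.name
    raise RuntimeError("both nodes are down")


# Active/active: losing db1 leaves db2 and db3 sharing the load.
cluster = [Node("db1", healthy=False), Node("db2"), Node("db3")]
serving = active_active_targets(cluster)

# Active/passive: the idle standby is brought online when db1 fails.
takeover = active_passive_target(Node("db1", healthy=False), Node("db1-standby"))
```

In the active/passive case the standby does no useful work until a failure occurs, which is where the extra hardware cost comes from.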
It is also important for administrators to understand the difference between dedicated and multi-purposed nodes when determining whether an active/active or active/passive configuration is appropriate for a particular cluster. With dedicated nodes, the administrator must decide how many redundant nodes to provide for each interface. With multi-purposed nodes, a single shared pool of extra nodes can provide the redundancy for any of the interfaces, even if it is not known at the time of provisioning which interface will ultimately need to be supported.
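A shared pool of multi-purposed spares might be modeled as follows. The SparePool class and the node and interface names are assumptions for illustration only; the point is that a spare receives its role at the moment of failure rather than at provisioning time.

```python
class SparePool:
    """A shared pool of multi-purposed spare nodes (hypothetical model)."""

    def __init__(self, spare_names):
        self.free = list(spare_names)
        self.assigned = {}  # spare name -> interface it now serves

    def replace_failed(self, interface):
        """Assign the next free spare to whichever interface lost a node."""
        if not self.free:
            raise RuntimeError("spare pool exhausted")
        spare = self.free.pop(0)
        self.assigned[spare] = interface
        return spare


# Two spares cover failures on any interface, in whatever order they occur.
pool = SparePool(["spare1", "spare2"])
first = pool.replace_failed("database")
second = pool.replace_failed("web front end")
```

With dedicated nodes, covering the same two failures would require at least one spare reserved for every interface, whether or not that interface ever fails.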
Evaluating required interface behavior
Another key element of the software environment is the behavior of application interfaces. If a production database is heavily used, a querying device or web browser may not be able to establish contact with the database front end on the first try. It is possible to build an interface that is guaranteed to respond the first time by hiding the retry process across a set of redundant nodes, but building transparent failover into each subsystem interface might be extremely cumbersome. Instead, it is common to make external interfaces highly available by implementing a retry policy internally across redundant nodes of any subsystem.
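A retry policy of this kind can be sketched as a loop over redundant nodes. Everything named here is an assumption: nodes are modeled as plain callables that raise ConnectionError when busy, and the two-round limit is arbitrary.

```python
from typing import Callable, Sequence


class AllNodesFailed(Exception):
    """Raised when no redundant node answers within the retry limit."""


def query_with_retry(
    nodes: Sequence[Callable[[str], str]], request: str, rounds: int = 2
) -> str:
    """Try each redundant node in turn, hiding failed attempts from the caller."""
    last_error = None
    for _ in range(rounds):
        for node in nodes:
            try:
                return node(request)
            except ConnectionError as exc:
                last_error = exc  # remember the failure and try the next node
    raise AllNodesFailed(f"no node answered after {rounds} rounds") from last_error


def busy_node(request):
    raise ConnectionError("front end busy")


def healthy_node(request):
    return f"result for {request}"


# The caller sees one successful answer even though the first node was busy.
answer = query_with_retry([busy_node, healthy_node], "torque history")
```

From the outside the interface appears to respond on the first try; the retries stay internal to the subsystem.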
Once decisions have been made about the desired interface responses and redundancy requirements at the application level, it is necessary to design the system to achieve availability at the underlying storage level. In a redundant system, all of the subsystems should access shared, centralized storage so if one storage array fails, other resources are available from the shared pool. It is also critical to replicate databases and regularly update the copies so data can be quickly restored in the event of a system failure. Many commercial and open-source storage systems are available that provide built-in replication, backup, and restore capabilities.
Configuring hardware for redundancy
In addition to building high availability into systems at the software level, organizations should ensure hardware supporting those systems is not dependent on any single point of failure. Organizations can help prevent failure through hardware redundancy and repairability on the fly. At the server level, redundancy is essential to ensure continuous, reliable computing and data storage.
Hardware redundancy starts with the selection of server technology. Blade servers, which consist of a chassis that houses multiple servers operating as individual computers, offer built-in failover and redundancy features for maximum system uptime. Organizations with a growing number of servers should consider blade server technology. In addition to being more space-efficient and energy-efficient than an equivalent number of stand-alone rack servers, blades are easy to slide into a chassis to quickly add processing capacity or swap out a faulty unit.
A blade system chassis can host multiple server blades plus shared infrastructure components, and redundancy is built into the blade architecture by providing more than one of each component to avoid single points of failure. Additionally, these components are ideally shared through the midplane, which is passive to help ensure high reliability; that is, the midplane contains no active logic, only connectors and traces. Redundant hardware components can include:
- Redundant and hot-pluggable chassis cooling fans
- Redundant chassis management modules
- Standard N+1 redundant hot-plug power supplies
- Optional redundant RAM
- Redundant storage with hot-plug hard drives
- Battery backed-up RAID cache
- Redundant network interfaces
- Redundant hot-plug switch modules
Server configuration options
When administrators plan blade server deployment, they should carefully review the available configuration options for various functions ranging from power redundancy to chassis-management redundancy. Depending on the type of device and the design of the redundancy algorithm, setting up a redundant configuration can simply require installing modules and powering up the system, or it may require using a management interface to configure the redundancy options. When a failover occurs, most devices transmit informational events that alert IT management.
Among the most common points of failure in any server are the hard drives. Mirroring technology built into many blade servers is designed to protect against hard drive failure by creating an exact replica of the working system volume. This replica can take the place of the original if a drive goes down. Hot-plug connectivity to the hard drives allows the replica to be brought online and the faulty drive replaced, restoring a state of redundancy, without interrupting server operation.
Providing backup for power, cooling
Power supply is a critical element that must be protected at multiple levels. Organizations wanting a true high-availability solution should consider purchasing power from two different utility companies. If one utility experiences an outage, the other can continue to provide service. In addition to power source redundancies, organizations commonly deploy multiple uninterruptible power supply units to provide emergency power to the facility if the utility mains fail.
Each blade server enclosure contains multiple chassis power supplies, and redundancy allows uninterrupted system operation if one or more of these chassis power supplies fails. Server vendors provide a variety of redundancy configuration options, such as keeping the capacity of one or two power supplies in reserve while powering up the server blades. With this configuration in place, the failure of any one or two power supplies will not cause the entire chassis to power down.
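The arithmetic behind holding power supplies in reserve can be sketched as below. All wattages and counts are hypothetical round numbers chosen for the example; actual figures depend on the chassis and vendor.

```python
def blades_allowed(supplies, watts_per_supply, reserve_supplies,
                   watts_per_blade, chassis_overhead_watts=0):
    """Count the blades that can power up while some supplies stay in reserve."""
    if not 0 <= reserve_supplies <= supplies:
        raise ValueError("reserve must be between 0 and the number of supplies")
    usable = (supplies - reserve_supplies) * watts_per_supply
    usable -= chassis_overhead_watts  # fans, switches, management modules
    return max(usable // watts_per_blade, 0)


# Hypothetical chassis: six 2,000 W supplies, 1,000 W of shared overhead,
# and blades that draw up to 400 W each.
one_in_reserve = blades_allowed(6, 2000, 1, 400, chassis_overhead_watts=1000)
two_in_reserve = blades_allowed(6, 2000, 2, 400, chassis_overhead_watts=1000)
```

Under these assumed numbers, holding a second supply in reserve lowers the blade budget from 22 to 17, which is the trade-off between capacity and fault tolerance.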
In many cases, administrators can also configure the management interface to turn off server blades based on the organization's priorities if power supplies begin failing one at a time. This configuration ensures the most critical server remains functioning until the very last power supply goes out. In case an outage forces systems to run on backup batteries, configuring the chassis to shut off all nonessential blades will extend the life of the most critical blades.
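Priority-based shutdown might look like the following sketch. The blade names, priorities, and wattages are invented for illustration, and real management modules expose this as a configuration setting rather than as code.

```python
def shed_blades(blades, available_watts):
    """Power off the least critical blades until the load fits the supply.

    blades: list of (name, priority, watts); a lower priority number
    means a more critical blade.
    """
    running = sorted(blades, key=lambda blade: blade[1])  # most critical first
    shed = []
    while running and sum(blade[2] for blade in running) > available_watts:
        shed.append(running.pop()[0])  # drop the least critical blade
    return [blade[0] for blade in running], shed


# Hypothetical chassis whose remaining supplies can deliver only 900 W.
blades = [
    ("historian-db", 1, 400),
    ("line-control", 0, 400),
    ("reporting", 3, 400),
    ("test-bench", 2, 400),
]
kept, shed = shed_blades(blades, available_watts=900)
```

Here the reporting and test-bench blades are shut off first, leaving line control and the historian database running on the remaining power.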
Another important issue is the blade chassis cooling system, which consists of multiple fan modules. Fans, because they have moving parts, tend to fail more often than components without moving parts. The consequences of overheated server components can include corrupted data and even shutdown. To help protect against failure, many modules are designed with two fans per module. The chassis' power supplies may also contain fans to help cool the enclosure.
A well-designed blade chassis will provide hot-pluggable fans with more cooling potential than is necessary for the platform. If one fan does fail, the others can continue to cool the chassis. The fan that failed can quickly be replaced in a hot-plug environment before any servers need to be shut down due to thermal issues.
Well-designed blade servers also ensure network connection redundancy, with chassis I/O modules providing external connectivity and multiple internal ports for connecting to the blades. Each blade commonly includes dual Ethernet network interface cards (NICs) or embedded LAN on Motherboard (LOM) units.
Chassis designs also allow installation of dual Ethernet switch modules that can enable additional connectivity or network redundancy and fault tolerance. For example, by connecting LOM 1 on each server blade to switch module 1, and connecting LOM 2 on each server blade to the equivalent port of switch module 2, administrators can enable failover protection. Management module software can be configured to trigger a failover event if a NIC, LOM, cable, internal port, or external port fails.
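The dual-path wiring in this example can be modeled as two independent chains of components. This is only a conceptual sketch with made-up component labels; actual failover is performed by NIC teaming drivers and the management module, not by application code.

```python
# Each path lists the components traffic crosses from the blade to the network.
PATHS = [
    ("LOM 1", "switch module 1 internal port", "switch module 1 external port"),
    ("LOM 2", "switch module 2 internal port", "switch module 2 external port"),
]


def active_path(paths, failed):
    """Return the index of the first path with no failed component, or None."""
    for index, path in enumerate(paths):
        if not any(component in failed for component in path):
            return index
    return None


normal = active_path(PATHS, failed=set())
after_lom_failure = active_path(PATHS, failed={"LOM 1"})
total_outage = active_path(
    PATHS, failed={"LOM 1", "switch module 2 external port"}
)
```

A failure anywhere along the first chain moves traffic to the second, and connectivity is lost only when both chains contain a failed component.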
One distinction between a blade server and other types of servers is that the connection between the NIC or LOM and the internal ports of the switch module is hardwired through the midplane. This design enables the link between the NIC or LOM and the switch to remain in a connected state unless a NIC, LOM, or switch port fails. The link remains active even in the absence of a network connection between the external uplink ports on the integrated switch and the external network.
Deciding where to invest
Redundancy is essential for avoiding costly stoppages and downtime, especially where automated manufacturing and processing are concerned. But few organizations find it cost-effective to build the same level of redundancy and availability into every part of the operation. As infrastructure grows increasingly integrated and intelligent, companies have an opportunity to change the architecture of their systems so availability does not depend solely on every single processor and capacitor but is also built into databases, interfaces, and applications.
The key is to be selective about where to invest in redundancy. By starting with the overall system, assessing which subsystems are required to be highly available, and considering the software environment as well as the hardware domain, organizations can achieve the high availability they need while driving down costs.
ABOUT THE AUTHORS
Josh Neland and Franklin Flint are Technology Evangelists for the OEM Solutions group at Dell. You can follow their musings on Twitter @joshneland and @franklinAtDell, and read more of their work at http://blog.delloem.com/.