Redundancy in EtherNet/IP systems
By Alain Grenier
As with any Ethernet-based industrial protocol, redundancy in EtherNet/IP—the duplication of network paths, links, and hardware so the system can work around faults—is required to maintain maximum uptime while still enabling the system to ride through minor outages and environmental failures. Redundancy plays a critical role in determining the reliability of the entire system, from the very edge devices, through the network core, to the plant backbone.
This entire system should be held to the same set of standards for optimum uptime and supportability. The reliability of an industrial system includes the robustness of the application, how the system handles environmental stresses, how the data flows, and the integrity of the data.
The “Differences between Enterprise and Industrial networks” table identifies the main differences between Commercial or Enterprise-type networks and Industrial-rated networks. These core distinctions are important to keep in mind when considering building an EtherNet/IP network. Industrial networks require equipment optimized for the environment in which it is installed. Such environments can vary widely, spanning numerous geographic locations and many different climates. This article will explore the balance between the cost of ensuring a system's redundancy and the cost of failure within a system, with its inevitable lost production.
The three main areas of coverage to ensure redundancy for Ethernet-based industrial systems are physical, data link, and network, as shown in Figure 1 below. The lower in the OSI model a failure occurs, the greater the impact. For example, if you lose a cable that connects an end device to a switch port, there is no data movement of any kind. In this case, there is now a potential for impact throughout the entire process, depending on the importance of that end device. If the issue is at layer 3, where a router may have experienced a loss of power and connection to the plant backbone, localized plant processes can still operate; however, operation of distributed plant or business processes can be affected to a larger extent.
Figure 1: Areas of redundancy focus in the OSI model
Figure 2 shows the impact upon the network based on percentages. Again, notice that the lower in the OSI model the failures occur, the more impact they have, with 72% of failures occurring on the first three layers. These failures can include hardware failures, cabling failures, power losses, incorrect programming or configuration, etc.
Figure 2: Area of system failure with percentages
Physical redundancy: More than just the cabling
Physical redundancy encompasses the physical Ethernet network connections (plus Ethernet equipment) AND the physical hardware the connections go between. Network redundancy focuses on the multiple routes that can be used between edge devices. The more available routes edge to edge, the more failures the network can sustain while still keeping the process alive and functioning. Physical redundancy typically follows two scenarios:
- Diverse routing of cabling—if a cable tray or conduit is damaged in some way, cutting link on the cabling in it, there is another way to get data where it is needed by providing cabling via another route to maintain connection reliability.
- Redundant hardware—having multiple connections on a controller or other hardware allows the controller reliable connectivity in cases where a connection or port has suffered a failure. This also relies on redundant network hardware in cases of failures, including multiple power supplies, multiple CPU cards on controllers, etc.
When looking at redundancy in the EtherNet/IP network, it is fundamental to first look at the application and the area of coverage, including the number of devices attaching to the Ethernet network. Consider the following:
- Are they grouped according to location and function?
- What application is being performed?
- What is the device type?
- Will there be a requirement to connect to the existing plant backbone network?
Based on this knowledge, an understanding will emerge of how the Ethernet network will look and how many ports will be required on the Ethernet switches placed in those areas. This is needed to determine the number of cables, the physical routing of the cables, the location of the network nodes that terminate the cables, and other planning details.
A popular way to look at system connectivity needs is the Zone/Cell view. In this way, you have a zone of control divided up into functional cells.
Figure 3: Control zone/cell reference diagram
When considering Figure 3, assuming each line is a single cable connection, it would be very easy to isolate sections of the process with the loss of just one or two cables. At the physical layer, it is important to plan out redundant connections to devices that can support multiple connections. Many devices only have one data interface, but the Ethernet switches they connect to have multiple ports to support connections to other switches, forming redundant paths and creating the capability to work around port and cabling failures. In the following sections, we will discuss what network protocols are available to make the best use of these redundant paths between network nodes.
Once it has been decided how the devices will connect to the network, you must decide the level of redundancy needed for the maximum expected uptime of the system. This requires evaluating the cabling and Ethernet network hardware needs by considering a series of questions:
- Do we need redundant cabling between devices?
- Is running redundant cable by different routes needed to provide physical security of the cabling—in case of damage to one of the runs?
- Is there more than one Ethernet interface on the device to be used? (Many controllers have multiple Ethernet interfaces in case of Ethernet port or module failure.)
Figure 4: Ethernet network physical topologies
Data Link Layer: Using the Ethernet switches in the network to provide protocol redundancy and maintaining Ethernet network health
Layer 2 redundancy protocols do two things: They identify all the possible paths amongst the networking devices, and they place the redundant extra paths in a blocking state to remove network loops. Loops in an Ethernet network cause data duplication and will incapacitate a network in a short period of time. If a network segment fails, the protocol activates the appropriate ports that are in a blocking state to reestablish connectivity, the objective being to fix the issue before the process is even aware there is a problem.
Ethernet networks have redundancy protocols supported by identified Ethernet standards. These are supported in Layers 2 and 3 of the OSI model. We will begin by looking at Layer 2.
Standard Layer 2 Network redundancy protocols
1. Spanning Tree–There are several versions of Spanning Tree:
- STP (Spanning Tree Protocol)—Standardized in 1990 as IEEE 802.1D, it is the first and slowest of the Spanning Tree protocols. Failover times for STP are typically 30 seconds or more, making it far too slow for use in any industrial process.
- RSTP (Rapid Spanning Tree Protocol)—Standardized in 2001 as IEEE 802.1w, it was an evolutionary leap for STP. It is much more rapid, with failover times ranging from about 500 ms up to 12 seconds. The speed of failover can still be an issue for industrial processes.
- MSTP (Multiple Spanning Tree Protocol)—originally standardized as IEEE 802.1s and later incorporated into IEEE 802.1Q, this protocol allows multiple instances of Spanning Tree Protocol per Virtual LAN. This means that in a single physical network, there can be multiple virtual network groupings, each with its own instance of Spanning Tree Protocol.
- There are proprietary implementations of Spanning Tree that are optimized for use in Industrial Networks. They are based upon standard RSTP, but are not designated as a standard STP protocol.
2. LACP (Link Aggregation Control Protocol)—this protocol allows the user to configure multiple Ethernet ports between Ethernet switches into a single virtual “Link.” This allows load sharing of information between the links and is extremely fast at moving data between a failed port and an adjacent port in the event of a link failure.
The number of interconnections amongst the network elements dictates how many failures the network can sustain while still remaining capable of maintaining the process. (Figures 5 and 8 show examples of these protocols.)
Spanning Tree is a redundancy protocol that provides network-wide redundancy rather than just path redundancy while preventing loops in a network. For Ethernet to function properly, only one active path can exist between devices. To provide redundancy, Spanning Tree relies on having multiple paths or connections to different switches and configures some of these paths into a standby (blocked) state. If a network segment becomes unreachable, spanning tree reconfigures and reestablishes a connection by activating the “blocked” links.
All switches in the LAN gather information about each other through an exchange of data messages called Bridge Protocol Data Units (BPDUs). The exchange of messages causes the following:
- The election of a “Root” switch for stability
- The election of a designated switch
- The removal of loops by placing redundant switch ports in a backup state
The “Root” switch is considered to be the “logical” center of the Spanning Tree network. All paths that are not necessary to reach the “Root” switch from anywhere in the network are placed in backup mode. BPDUs contain information about the transmitting switch and its ports, including:
- Unique switch Identifier or MAC address
- Switch priority
- Port priority
- Port cost
Spanning Tree then uses this information to elect the “Root” switch and “Root” port for the switched network. The switches send configuration BPDUs to configure the spanning tree topology. All switches connected to the LAN receive the transmitted BPDU. The BPDUs are not forwarded by the switch, but the information contained in the BPDU can be used by the receiving switch to transmit a new BPDU.
The resulting action of this communication is:
- One switch is identified as the Root.
- The shortest distance to the Root is determined for each switch.
- A designated switch or switch closest to the Root is selected.
- An active port from each switch is selected, and the others are blocking.
If all the switches are enabled with default settings, the switch with the lowest MAC address becomes the Root by default. However, due to traffic patterns, the number of forwarding ports, or simply physical location, this may not be the best option. By increasing the priority (lowering the actual numerical value of the priority number) of the ideal switch so that it becomes the Root, spanning tree is forced to recalculate and form a new topology. The same applies when identifying which port is active and which port stays in standby.
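The election rule above can be sketched in a few lines of Python. The switch names, priorities, and MAC addresses are illustrative assumptions, not taken from any real configuration:

```python
# Sketch of STP root bridge election, assuming each switch is identified
# by a (priority, MAC) pair known as its Bridge ID. Lower values win.
# Names and addresses here are illustrative, not from a real switch API.

def elect_root(bridges):
    """Return the bridge with the lowest (priority, MAC) Bridge ID."""
    # Priority is compared first; the MAC address only breaks ties.
    return min(bridges, key=lambda b: (b["priority"], b["mac"]))

switches = [
    {"name": "SW1", "priority": 32768, "mac": "00:0a:00:00:00:03"},
    {"name": "SW2", "priority": 32768, "mac": "00:0a:00:00:00:01"},
    {"name": "SW3", "priority": 4096,  "mac": "00:0a:00:00:00:09"},
]

# SW3 wins despite its higher MAC, because lowering the priority number
# raises its precedence -- exactly the tuning described above.
root = elect_root(switches)
print(root["name"])  # -> SW3
```

With all priorities left at the default, the lowest MAC address would win instead, which is rarely the best choice of Root.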
Each port on a switch using spanning tree protocol exists in one of five states: blocking, listening, learning, forwarding, or disabled.
Each port moves through these five states as follows:
- From initialization to blocking
- From blocking to listening or disabled
- From listening to learning or to disabled
- From learning to forwarding or to disabled
- From forwarding to disabled
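The progression above can be modeled as a small state machine. This is an illustrative sketch of the transitions listed, not switch firmware:

```python
# Minimal sketch of the STP port state machine described above. The
# legal transitions between states are encoded as a lookup table.

TRANSITIONS = {
    "initialization": {"blocking"},
    "blocking": {"listening", "disabled"},
    "listening": {"learning", "disabled"},
    "learning": {"forwarding", "disabled"},
    "forwarding": {"disabled"},
    "disabled": set(),
}

class Port:
    def __init__(self):
        self.state = "initialization"

    def move_to(self, new_state):
        # Refuse any transition the protocol does not allow.
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

# A port must pass through listening and learning before forwarding;
# it cannot jump straight from blocking to forwarding.
p = Port()
for s in ("blocking", "listening", "learning", "forwarding"):
    p.move_to(s)
print(p.state)  # -> forwarding
```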
Figure 5: Example of a Spanning Tree Ethernet network
Spanning Tree networks can support either ring or mesh topologies. A ring topology connects the Ethernet switches together in a closed loop. A mesh topology uses two Ethernet switches at the top of the hierarchy, with the switches below connected to both of the upper switches. Mesh networks use more fiber than ring networks, but can typically survive more network hits intact. Figure 6 shows a typical mesh network example, while Figure 7 shows a ring network example.
Figure 6: Spanning Tree in a mesh network
Figure 7: Spanning Tree in a ring
Link Aggregation Control Protocol (IEEE 802.3ad) provides redundancy without the use of Spanning Tree. It enables users to bundle groups of ports between switches into one virtual link with the combined bandwidth of the member links. LACP provides several functions:
- Higher bandwidth
- Enhanced bandwidth granularity
- Load sharing to balance bandwidth across the member links
- Fault tolerance provided by offloading data to working member links when a member link fails
LACP is a method of providing needed extra bandwidth between Ethernet switches that have unused ports without buying a switch or switches with higher-bandwidth ports, such as moving from 100-Mbps switching to Gigabit Ethernet switches.
Figure 8: Example of an LACP-based Ethernet connection between switches
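The load-sharing and fault-tolerance behavior of an aggregated link can be sketched as follows. The hash function and port names are illustrative assumptions, not the actual frame-distribution algorithm of any switch:

```python
# Sketch of how an aggregated link might share load: frames belonging to
# the same conversation (keyed here by source/destination MAC) hash to
# the same member link, preserving frame order per conversation.
import zlib

member_links = ["gi0/1", "gi0/2", "gi0/3"]  # healthy ports in the bundle

def pick_link(src_mac, dst_mac, links):
    # Deterministic hash of the conversation key selects one member link.
    key = (src_mac + dst_mac).encode()
    return links[zlib.crc32(key) % len(links)]

flow = ("00:1d:9c:c0:00:01", "00:1d:9c:c0:00:02")
first = pick_link(*flow, member_links)

# If that member link fails, the flow simply rehashes onto a survivor;
# traffic keeps moving without waiting for a Spanning Tree reconvergence.
survivors = [l for l in member_links if l != first]
print(pick_link(*flow, survivors) in survivors)  # -> True
```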
Non-standard Layer 2 redundancy protocols
There are numerous vendor-specific, non-standard network redundancy protocols designed to provide a very fast fail-over mechanism for the controls network. These protocols typically provide faster recovery than Spanning Tree protocols; however, they are not standard. This can make interoperability difficult, at the least, amongst different systems and networks that may contain differing vendors' products. Those that use ring architectures break the ring to prevent loops through the use of a “Redundancy Manager,” which places one of its ring ports into a blocking state. If a link is broken in the ring, the blocked port is placed into a forwarding state to ensure the network connectivity is maintained. If more than one link is broken, then ring segments become isolated until the broken links are fixed. Figure 9 shows an example of this ring topology.
Figure 9: Example of non-standard redundancy protocol ring topology
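The Redundancy Manager behavior just described can be sketched as a simple model. The class, ring links, and port handling are hypothetical, not any vendor's implementation:

```python
# Sketch of a ring "Redundancy Manager": in a healthy ring it blocks one
# of its two ring ports to break the loop; when any ring link fails, it
# unblocks that port to restore connectivity. Names are illustrative.

class RedundancyManager:
    def __init__(self, ring_links):
        self.links_up = {link: True for link in ring_links}
        self.backup_port_blocking = True  # loop broken while ring is healthy

    def link_change(self, link, is_up):
        self.links_up[link] = is_up
        # Keep the backup port blocked only while every ring link is up.
        self.backup_port_blocking = all(self.links_up.values())

rm = RedundancyManager(["A-B", "B-C", "C-D", "D-A"])
print(rm.backup_port_blocking)  # -> True (loop is broken)

rm.link_change("B-C", False)    # a ring link is cut
print(rm.backup_port_blocking)  # -> False (backup port now forwards)

rm.link_change("B-C", True)     # repair restores the blocked state
print(rm.backup_port_blocking)  # -> True
```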
Network layer redundancy protocols—how routers talk to each other and fix breaks
As EtherNet/IP networks expand, the use of a single IP subnet will not be enough. In order to better facilitate communication between IP Subnets, you need to use a Layer 3 network device, namely, a router. Routers can provide data movement in two ways: Statically via routes that are mapped by hand (Static routing) or dynamically via designated routing protocols (Dynamic routing).
Static routing can be useful for small routing areas, but cannot provide fast failover because it requires user interaction to program an alternate route manually. Dynamic routing is required where hands-off failover is needed or the routing environment is too large to manage manually. Routing protocols are inherently slower on failover than Layer 2 protocols.
Routers support several communication protocols, such as OSPF (Open Shortest Path First) and RIP (Routing Information Protocol), that have communications redundancy built in as long as the physical network architecture remains in place.
There is also a router redundancy protocol that supports redundant router replacement. If one router fails, its designated backup is placed into service seamlessly. This is called Virtual Router Redundancy Protocol (VRRP).
Distance vector vs. Link state routing protocols
Distance vector protocols:
- Send routing table information only to neighbors, so propagating a change may take a minute per router
- Also called “routing by rumor”
- Easy to configure, but slow
Link state protocols:
- Flood routing information about themselves to all nodes, so changes are acknowledged immediately
- Efficient, but complex to configure
OSPF and RIP: Standard router communications protocols
OSPF and RIP protocols are used as a means of communication between routers in which they can tell each other what IP Subnets they have attached. By sending these routing table updates to each other, the routers build a map of how the network is constructed at Layer 3. This also identifies the redundant ways these routers can maintain connection to each other if a router loses connection on a port. If a router knows of another way to get to an IP Subnet it needs to send data to, it will use these alternate paths. Figures 10 and 11 illustrate some examples of these routing protocols.
OSPF is referred to as a Link-State Routing protocol, a class of routing algorithms in which each router broadcasts its connection information to all other routers on an internetwork. This spares the routers from checking for all available routes, but adds the memory requirement of storing all of the routing information. The algorithm relies upon the cost of the links between routers, not the number of hops; a lower cost on a connection indicates a higher bandwidth capability. OSPF keeps memory of ALL of the possible routes, not only the active ones.
Figure 10: OSPF routing protocol example
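OSPF's cost-based path selection can be illustrated with a plain shortest-path computation over link costs. The router names and costs below are assumptions for illustration; this is Dijkstra's algorithm, not an OSPF implementation:

```python
# Sketch of link-state route selection: every router knows the full
# topology, so it can compute shortest paths over link *costs* rather
# than hop counts. Topology and costs are illustrative.
import heapq

# Cost is typically derived from bandwidth: faster link, lower cost.
topology = {
    "R1": {"R2": 1, "R3": 10},
    "R2": {"R1": 1, "R3": 1},
    "R3": {"R1": 10, "R2": 1},
}

def shortest_costs(graph, source):
    """Dijkstra's algorithm: lowest total cost from source to each router."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

# R1 reaches R3 via R2 (total cost 2) rather than over the direct
# cost-10 link, even though the direct path has fewer hops.
print(shortest_costs(topology, "R1")["R3"])  # -> 2
```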
RIP and RIP II are types of distance vector protocols. Distance vector algorithms compute distances from a node by finding paths to all adjacent nodes and use this information to continue on the adjacent paths, router hop by router hop. Distance vector algorithms can be computationally intensive, a problem that is alleviated somewhat by defining different routing levels. They rely upon the number of hops in a particular direction between the source router and destination router. They do not take into consideration the speed of the physical media, so it is possible to move traffic across a suboptimal link.
Figure 11: RIP and RIP 2 routing protocol examples
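The hop-by-hop distance-vector update can be sketched as follows. Router names are illustrative, and real RIP adds timers and loop-prevention mechanisms beyond the metric cap of 16:

```python
# Sketch of a RIP-style distance-vector update: a router only hears its
# neighbors' hop counts and adds one for the hop to that neighbor. Link
# speed never enters the calculation -- the limitation noted above.

def dv_update(own_table, neighbor_table, infinity=16):
    """Merge a neighbor's distance table into ours, hop by hop."""
    updated = dict(own_table)
    for dest, hops in neighbor_table.items():
        candidate = min(hops + 1, infinity)  # one extra hop via the neighbor
        if candidate < updated.get(dest, infinity):
            updated[dest] = candidate
    return updated

# R1 initially knows only itself, then hears an advertisement from R2.
r1 = {"R1": 0}
r2_advert = {"R2": 0, "R3": 1}   # R2 reaches R3 in one hop
r1 = dv_update(r1, r2_advert)

# R1 now reaches R3 in two hops, regardless of how fast those links are.
print(r1["R3"])  # -> 2
```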
Table 2: OSPF and RIP comparison
VRRP is the way for routers to provide physical redundancy for one another. If one router dies or is unable to function in the appropriate manner, its designated backup takes over the former router's function. The routers maintain this relationship through the use of HELLO packets and regular updates to make sure that both routers have all of the same information. VRRP would be a valuable function to incorporate into an EtherNet/IP design if there is a requirement to attach to a corporate network while maintaining some sort of segregation between the plant-floor EtherNet/IP network and the corporate environment.
Figure 12: VRRP example
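The master/backup election at the heart of VRRP can be sketched as a priority comparison among live routers. The names, priorities, and virtual IP below are illustrative assumptions, not a VRRP implementation:

```python
# Sketch of VRRP master selection: routers in the group share one
# virtual IP; the highest-priority live router serves it, and the
# backup takes over when the master stops advertising.

class VrrpGroup:
    def __init__(self, virtual_ip, routers):
        self.virtual_ip = virtual_ip
        self.routers = dict(routers)               # name -> priority
        self.alive = {name: True for name in routers}

    def master(self):
        """Highest-priority router that is still alive owns the virtual IP."""
        live = [(prio, name) for name, prio in self.routers.items()
                if self.alive[name]]
        return max(live)[1] if live else None

    def fail(self, name):
        # In real VRRP the backup notices missing advertisements;
        # here we simply mark the router down.
        self.alive[name] = False

group = VrrpGroup("10.0.0.1", {"RtrA": 200, "RtrB": 100})
print(group.master())  # -> RtrA

group.fail("RtrA")
print(group.master())  # -> RtrB (backup takes over the virtual IP)
```

End devices never notice the change: their default gateway is the virtual IP, which simply moves to the surviving router.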
Determining the cost of redundancy: How much is too much?
Designing redundancy into a system requires carefully balancing several factors. You must consider how much to incorporate into the various areas: physical, network, and application. The first thing that has to be determined is the scope of the system being installed. The following questions should be asked when evaluating an EtherNet/IP design:
- Is this a new install or an upgrade to a previous installation?
- Is there any existing cable that can be reused?
- Is there any existing equipment that can be reused?
- Has the area for installation been determined?
- Use of copper or fiber optics? This is dependent on distance and the environment for the installation.
- Who is the control system vendor?
- Will there be a point of connection to the existing plant network? What sort of data is intended to be passed to this network from the plant floor?
- To what extent has redundancy been considered? Is the network a ring or mesh-based network?
- If Ethernet-network redundancy is not being considered, is it economically feasible to do without it? How many outages are you prepared to pay for in lost revenue? Balance this against the cost of managed vs. unmanaged switches or more advanced Ethernet networking devices like routers.
- How experienced are the plant controls support staff in regards to Ethernet networking and will the IT staff be involved in the support?
- What is the projected budget for the control system, including the network cabling and equipment?
Redundancy levels are dependent upon the operational expectations of the EtherNet/IP control system being installed. Discrete automation systems usually incorporate most of the redundancy into the Ethernet network, requiring devices that are smarter and more expensive but able to heal around network breaks. Process control systems rely upon the controllers for redundancy, meaning the Ethernet network itself is relatively dumb, but there is double the hardware expense due to the use of parallel, non-redundant networks.
The cost differential between managed and unmanaged Ethernet switches can be exceeded by the lost revenue of a single extended downtime event caused by a network outage. The ability to monitor a network and see the application in action can help predict events that cause outages. An unmanaged switch does not allow you to see how the network is performing or to carry out predictive maintenance based on evidence. Also, port mirroring on a managed switch can assist with troubleshooting Application Level issues, as you can use a protocol analyzer to see the EtherNet/IP application in operation.
On the other hand, using too many connections between Ethernet switches can cause slowdowns in re-convergence of a network if there is a lost link or switch. Ring topologies typically use two inter-switch links per switch, while mesh topologies can use three or more. The recommended norm is no more than three edge switches in a mesh network environment. Buying Ethernet switching hardware that exceeds the requirements for the network can cause cost increases as well.
Understanding the relationships between the physical structure of a network and its protocols is crucial to creating a truly maintainable network that can adapt to issues effectively. Consult with the vendor of the Ethernet switches that form the framework of the installed network to determine an effective balance, allowing for the design and implementation of an EtherNet/IP-based control system that operates successfully throughout its lifetime.
ABOUT THE AUTHOR
Alain Grenier is chief technologist at ODVA. He can be reached at firstname.lastname@example.org.