I spent the day today with a customer doing a proof-of-concept and failover testing demo on a Cisco UCS, VMware and NetApp environment. As I sit on the train heading back to Washington from NYC, I thought it might be a good time to put together a technical post on the failover behavior of UCS blades. UCS has some advanced availability features that should be highlighted; it also has some areas where failover behavior may not be obvious. In this post I'm going to cover server failover situations within the UCS system, without heading very deep into the connections upstream to the network aggregation layer (mainly because I'm hoping Brad Hedlund at http://bradhedlund.com will cover that soon, hurry up Brad 😉).
**Update** Brad has posted his UCS Networking Best Practices Post I was hinting at above. It’s a fantastic video blog in HD, check it out here: http://bradhedlund.com/2010/06/22/cisco-ucs-networking-best-practices/
To start this off, let's get up to a baseline level of understanding of how UCS moves server traffic. UCS is composed of a number of blade chassis and a pair of Fabric Interconnects (FIs). The blade chassis hold the blade servers, and the FIs handle all of the LAN and SAN switching, as well as the chassis/blade management that is typically done using six separate modules in each blade chassis in other implementations.
Note: When running redundant Fabric Interconnects you must configure them as a cluster using the L1 and L2 cluster links between the FIs. These ports carry only cluster heartbeat and high-level system messages, no data traffic or Ethernet protocols, and therefore I have not included them in the following diagrams.
UCS Network Connectivity
Each individual blade gets connectivity to the network(s) via mezzanine form factor I/O card(s). Depending on which blade type you select, each blade will have either one redundant set of connections to the FIs or two. Regardless of the type of I/O card you select, you will always have a 1x10GE connection to each FI through the blade chassis I/O module (IOM).
UCS Blade Connectivity
The diagram shows the blade connectivity for a blade with a single mezzanine slot. You can see that the blade is redundantly connected to both Fabric A and Fabric B via 2x10GE links. This connection occurs via the IOM, which is not a switch itself and instead acts as a remote device managed by the Fabric Interconnect. What this means is that all forwarding decisions are handled by the FIs, and frames are consistently scheduled within the system regardless of source and/or destination. The total switching latency of the UCS system is approximately equal to that of a top-of-rack switch or a blade form factor LAN switch in other blade products. Because the IOM is not making switching decisions, it needs another method to move the traffic of its 8 internal mid-plane ports upstream using its 4 available uplinks. The method it uses is static pinning. This method provides a very elegant switching behavior with extremely predictable failover scenarios. Let's first look at the pinning, then at what it means for UCS network failures.
The chart above shows the static pinning mechanism used within UCS. Given the configured number of uplinks from IOM to FI, you will know exactly which uplink port a particular mid-plane port is using. Each half-width blade attaches to a single mid-plane port and each full-width blade attaches to two. In the chart there is no pinning mechanism for three uplinks because three-link configurations are not supported; if three links are used, the 2-port method defines how the uplinks are utilized. This is because eight devices cannot be evenly load-balanced across three links.
The example above shows the numbering of the mid-plane ports. If you were using half-width blades, their numbering would match. When using full-width blades, each blade has access to a pair of mid-plane ports (1-2, 3-4, 5-6, 7-8). In the example above, blade three would utilize mid-plane port three in the left example and port one in the second, based on the static pinning in the chart.
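To make the pinning concrete, here is a small Python sketch of the rule described above. This is my own illustration, not Cisco code: the authoritative port-to-uplink map is the chart, and the round-robin rule below is an assumption consistent with it (and with the later example where blades one and five share uplink one), with the unsupported three-link case falling back to the two-link map.

```python
# Illustrative sketch (not Cisco code): static pinning of the 8 IOM
# mid-plane ports to the IOM->FI uplinks. A 3-uplink configuration is
# unsupported, so it falls back to the 2-uplink map.
def pinned_uplink(midplane_port: int, configured_uplinks: int) -> int:
    """Return the uplink (1-based) a mid-plane port is pinned to."""
    if not 1 <= midplane_port <= 8:
        raise ValueError("mid-plane ports are numbered 1-8")
    if configured_uplinks not in (1, 2, 3, 4):
        raise ValueError("an IOM has at most 4 uplinks")
    effective = 2 if configured_uplinks == 3 else configured_uplinks
    # Round-robin: port N pins to uplink ((N-1) mod active links) + 1
    return (midplane_port - 1) % effective + 1

# With all 4 uplinks, ports 1 and 5 share uplink 1, ports 2 and 6
# share uplink 2, and so on.
print([pinned_uplink(p, 4) for p in range(1, 9)])  # [1, 2, 3, 4, 1, 2, 3, 4]
```

With a single uplink every port pins to it; with two, odd ports use link one and even ports use link two.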
So now let's discuss how failover happens, starting at the operating system. We have two pieces of failover to discuss: NIC teaming and SAN multi-pathing. In order to understand them, we need a simple logical view of how a UCS blade sees the world.
UCS Logical Connectivity
In order to simplify your thinking when working with blade systems, reduce your logical diagram to the key components by removing the blade chassis itself from the picture. Remember that a blade is nothing more than a server connected to a set of switches; the only difference is that the first-hop link is on the mid-plane of the chassis rather than a cable. The diagram above shows that a UCS blade is logically cabled directly to redundant Storage Area Network (SAN) switches for Fibre Channel (FC) and to the FIs for Ethernet. Out of personal preference I leave the FIs out of the SAN side of the diagram because they operate in N_Port Virtualizer (NPV) mode, which means forwarding decisions are handled by the upstream NPIV-compliant SAN switch.
Starting at the Operating System (OS), we will work up the network stack to the FIs to discuss failover. We will assume FCoE is being used; if you are not using FCoE, ignore the FC piece of the discussion, as the Ethernet behavior remains the same.
SAN multi-pathing is the way we obtain redundancy in FC, FCoE, and iSCSI networks. It provides the OS with two separate paths to the same logical disk. This allows the server to access its data in the event of a failure and, in some cases, to load-balance traffic across both paths to the same disk. Multi-pathing comes in two general flavors: active/active or active/passive. Active/active load-balances and has the potential to use the full bandwidth of all available paths; active/passive uses one link as the primary and reserves the others for failover. Typically the deciding factor is cost vs. performance.
Multi-pathing is handled by software residing in the OS usually provided by the storage vendor. The software will monitor the entire path to the disk ensuring data can be written and/or read from the disk via that path. Any failure in the path will cause a multi-pathing failover.
Multi-Pathing Failure Detection
Any of the failures designated by the X's in the diagram above will trigger failover; this also includes failure of a storage controller itself, which is typically redundant in an enterprise-class array. SAN multi-pathing is an end-to-end failure detection system. This is much easier to implement in a SAN, as there is one constant target, as opposed to a LAN, where data may be sent to several different targets across the LAN and WAN. Within UCS, SAN multi-pathing does not change from the system used for standalone servers. Each blade is redundantly connected, and any path failure will trigger a failover.
NIC teaming is handled in one of three general ways: active/active load-balancing, active/passive failover, or active/active transmit with active/passive receive. The teaming type you use is dependent on the network configuration.
Supported teaming Configurations
In the diagram above we see two network configurations: one with a server dual-connected to two switches, and a second with a server dual-connected to a single switch using a bonded link. Bonded links act as a single logical link with the redundancy of the physical links within. Active/active load-balancing is only supported over a bonded link due to the MAC address forwarding decisions of the upstream switch. In order to load balance, an active/active team will share a logical MAC address; this will cause instability upstream and lost packets if the upstream switches don't see both links as a single logical link. This bonding is typically done using the Link Aggregation Control Protocol (LACP) standard.
If you glance back up at the UCS logical connectivity diagram, you'll see that UCS blades are connected in the method on the left of the teaming diagram. This means that our options for NIC teaming are active/passive failover and active/active transmit only. This assumes a bare-metal OS such as Windows or Linux installed directly on the hardware; when using virtualized environments such as VMware, all links can be actively used for transmit and receive because there is another layer of switching occurring in the hypervisor.
I typically get feedback that the lack of active/active NIC teaming on UCS bare-metal blades is a limitation. In reality this is not the case. Remember that active/active NIC teaming was traditionally used on 1GE networks to provide greater than 1GE of bandwidth. This was limited to a maximum of 8 aggregated links for a total of 8GE of bandwidth. A single UCS link at 10GE provides 25% more bandwidth than an 8-port active/active team.
NIC teaming, like SAN multi-pathing, relies on software in the OS; but unlike SAN multi-pathing, it typically detects only link failures and, in some cases, loss of a gateway. Due to the nature of the UCS system, NIC teaming in UCS will detect failures of the mid-plane path, the IOM, the utilized link from the IOM to the Fabric Interconnect, or the FI itself. This is because the IOM acts as a line card of the FI and the blade is logically connected directly to the FI.
UCS Hardware Failover:
UCS has a unique feature on several of the available mezzanine cards that provides hardware failure detection and failover on the card itself. Basically, some of the mezzanine cards have a mini-switch built in with the ability to fail over from path A to path B or vice versa. This provides additional failover functionality and improved bandwidth/failure management. This feature is available on the Generation 1 Converged Network Adapters (CNA) and the Virtual Interface Card (VIC), and is currently only available in UCS blades.
UCS Hardware Failover
UCS hardware failover provides greater failure visibility than traditional NIC teaming due to advanced intelligence built into the FI as well as the overall architecture of the system. In the diagram above, HW failover detects mid-plane path, IOM, and IOM uplink failures as link failures due to the architecture. Additionally, if the FI loses its upstream network connectivity to the LAN, it will signal a failure to the mezzanine card, triggering failover. In the diagram above, any failure at a point designated by an X will trigger the mezzanine card to divert Ethernet traffic to the B path. UCS hardware failover applies only to Ethernet traffic, as SAN networks are built as redundant independent networks and would not support this failover method.
Using UCS hardware failover provides two key advantages over other architectures:
- Allows redundancy for NIC ports in separate subnets/VLANs which NIC teaming cannot do.
- Provides the ability for network teams to define the failure capabilities and primary path for servers alleviating misconfigurations caused by improper NIC teaming settings.
IOM Link Failure:
The next piece of UCS server failover involves the I/O modules themselves. Each I/O module has a maximum of four 10GE uplinks serving 8x10GE mid-plane connections to the blades, at an oversubscription of 1:1 to 8:1 depending on configuration. As stated above, UCS uses a static, non-configurable pinning mechanism to assign a mid-plane port to a specific uplink from the IOM to the FI. This pinning system allows the IOM to operate as an extension of the FI without the need for Spanning Tree Protocol (STP) within the UCS system. Additionally, it provides a very clear framework for designing oversubscription in both nominal and failure situations.
For the discussion of IOM failover we will use an example of a max configuration of 8 half-width blades and 4 uplinks on each redundant IOM.
Fully Configured 8 Blade UCS Chassis
In this diagram each blade is redundantly connected via 2x10GE links, one through each IOM to each FI. Both IOMs and FIs operate in an active/active fashion from a switching perspective, so each blade in this scenario has a potential bandwidth of 20GE depending on the operating system configuration. The blade chassis as a whole is configured with 2:1 oversubscription in this diagram, as each IOM is using its max of 4x10GE uplinks while providing its max of 8x10GE mid-plane links for the 8 blades. If every blade were to attempt to push a sustained 20GE of throughput at the same time (a very unlikely scenario), each would receive only 10GE because of this oversubscription. The bandwidth can be finely tuned to ensure proper performance in congestion scenarios such as this one using Quality of Service (QoS) and Enhanced Transmission Selection (ETS) within the UCS system.
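The oversubscription arithmetic above can be checked with a quick back-of-the-envelope sketch (all figures in Gb/s, per fabric, since each IOM carries one fabric's path):

```python
# Oversubscription math for the fully configured chassis described above.
midplane_ports = 8    # 8 half-width blades, one 10GE mid-plane port each
uplinks = 4           # max IOM -> FI uplinks
link_speed = 10       # Gb/s

downstream = midplane_ports * link_speed  # 80 Gb/s toward the blades
upstream = uplinks * link_speed           # 40 Gb/s toward the FI
oversub = downstream / upstream           # 2.0, i.e. 2:1

# Worst case: every blade pushes a sustained 10GE on this fabric at once.
# Fair sharing of the uplinks leaves each blade 5 Gb/s per fabric, or
# 10GE total across both fabrics of an attempted 20GE.
per_blade = upstream / midplane_ports
print(oversub, per_blade)  # 2.0 5.0
```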
In the event that a link fails between the IOM and the FI, the servers pinned to that link will no longer have a path to that FI. The blade will still have a path to the redundant FI and will rely on SAN multi-pathing, NIC teaming, and/or UCS hardware failover to detect the failure and divert traffic to the active link.
For example, if link one on IOM A fails, blades one and five will lose connectivity through Fabric A, and any traffic using that path will fail over to link one on Fabric B, ensuring the blades are still able to send and receive data. When link one on IOM A is repaired or replaced, data traffic will immediately be able to start using the A path again.
IOM A will not automatically divert traffic from blades one and five to an operational link, nor is this possible through a manual process. The reason is that diverting blade one and five's traffic to the remaining links would further oversubscribe those links and degrade servers that should be unaffected by the failure of link one. In a real-world data center a failed link will be quickly replaced, and the only servers affected are blades one and five.
In the event that the link cannot be repaired quickly, there is a manual process called re-acknowledgement which an administrator can perform. This process adjusts the pinning of mid-plane ports to IOM-to-FI links based on the number of active links, using the same static pinning referenced above. In the above example, servers would be re-pinned based on two active ports, because three-port configurations are not supported.
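The re-acknowledgement example can be sketched as follows. This is a hypothetical helper of my own, not UCS Manager code; it assumes the round-robin pinning rule implied by the chart, which matches the example above (blades one and five pinned to uplink one).

```python
# Sketch of the re-acknowledgement example (hypothetical, not UCS code).
# Before re-ack, blades pinned to the failed uplink simply lose their
# Fabric A path; after re-ack, pinning is recomputed over 2 links,
# because 3-link configurations are not supported.
def pin_map(active_uplinks: int) -> dict:
    """Map each mid-plane port (1-8) to its pinned uplink."""
    effective = 2 if active_uplinks == 3 else active_uplinks
    return {port: (port - 1) % effective + 1 for port in range(1, 9)}

before = pin_map(4)                                  # original 4-link pinning
orphaned = [p for p, u in before.items() if u == 1]  # uplink 1 has failed
print(orphaned)               # [1, 5] -> blades one and five lose Fabric A

after = pin_map(3)            # re-ack with only 3 working links
print(sorted(set(after.values())))   # [1, 2] -> only 2 uplinks are used
```

Note that after re-acknowledgement every blade regains a Fabric A path, at the cost of higher oversubscription on the two uplinks now in use.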
Overall this failure method and static pinning mechanism provides very predictable bandwidth management as well as limiting the scope of impact for link failures.
The UCS system architecture is uniquely designed to minimize management points and maximize link utilization by removing dependence on STP internally. Because of its unique design network failure scenarios must be clearly understood in order to maximize the benefits provided by UCS. The advanced failure management tools within UCS will provide for increased application uptime and application throughput in failure scenarios if properly designed.