What’s the deal with Quantized Congestion Notification (QCN)

For the last several months there has been a lot of chatter in the blogosphere and Twitter about FCoE and whether full scale deployment requires QCN.  There are two camps on this:

  1. FCoE does not require QCN for proper operation at scale.
  2. FCoE does require QCN for proper operation at scale.

Typically the camps break down as follows (there are exceptions):

  1. HP camp stating they’ve not yet released a suite of FCoE products because QCN is not fully ratified and they would be jumping the gun.  The flip side of this is stating that Cisco jumped the gun with their suite of products and will have issues with full scale FCoE.
  2. Cisco camp stating that QCN is not required for proper FCoE frame flow and HP is using the QCN standard as an excuse for not having a shipping product.

For the purpose of this post I’m not camping with either side; I’m not even breaking out my tent.  What I’d like to do is discuss when and where QCN matters, what it provides and why.  The intent is that customers, architects, engineers, etc. can decide for themselves when and where they may need QCN.

QCN: QCN is a form of end-to-end congestion management defined in IEEE 802.1Qau.  The purpose of end-to-end congestion management is to ensure that congestion is controlled from the sending device to the receiving device in a dynamic fashion that can deal with changing bottlenecks.  The most common end-to-end congestion management tool is TCP window sizing.

TCP Window Sizing:

With window sizing TCP dynamically determines the number of frames to send at once without an acknowledgement.  It continuously ramps this number up dynamically if the pipe is empty and acknowledgements are being received.  If a packet is dropped due to congestion and an acknowledgement is not received TCP halves the window size and starts the process over.  This provides a mechanism in which the maximum available throughput can be achieved dynamically. 
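If you’d rather see the ramp-and-halve behavior as code, here is a minimal sketch.  It is a deliberately simplified model and not how any real TCP stack is implemented: the function name, the starting threshold of 16, and the idea of passing in the round trips where a loss occurs are all assumptions made for illustration.

```python
def simulate_window(round_trips, loss_at, ssthresh=16, start=1):
    """Return the window size (in packets) used on each round trip.

    loss_at is the set of round-trip indexes where a drop is detected.
    ssthresh and start are illustrative defaults, not protocol constants.
    """
    window = start
    history = []
    for rtt in range(round_trips):
        history.append(window)
        if rtt in loss_at:
            # Drop detected: halve the window (never below one packet)
            ssthresh = max(window // 2, 1)
            window = ssthresh
        elif window < ssthresh:
            # Initial fast ramp: double every round trip up to the threshold
            window = min(window * 2, ssthresh)
        else:
            # Gradual increase: one more packet per round trip
            window += 1
    return history


# A loss on round trip 8 halves the window from 20 to 10
print(simulate_window(12, loss_at={8}))
# [1, 2, 4, 8, 16, 17, 18, 19, 20, 10, 11, 12]
```

The point is the shape of the curve: a fast ramp, a slow climb, and a sharp cut when something is dropped.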

Below is a diagram showing the dynamic window size (total packets sent prior to acknowledgement) over the course of several round trips.  You can see the initial fast ramp up followed by a gradual increase until a packet is lost; from there the window is reduced and the slow ramp begins again.

image

If you prefer analogies, I always refer to TCP sliding windows as a Keg Stand (http://en.wikipedia.org/wiki/Keg_stand).

image

 

Photo from (http://en.wikipedia.org/wiki/Keg_stand)

In the photo we see several Jarheads (U.S. Marines) surrounding a keg, with one upside down performing a keg stand. 

Note: The haircuts and Republic of Korea (ROK) Marine PT shirt give them away as Marines.

To perform a keg stand:

  • Place both hands on top of the keg
  • 1-2 Friend(s) lift your feet over your head while you support your body weight on locked-out arms
  • Another friend places the keg’s nozzle in your mouth and turns it on
  • You swallow beer full speed for as long as you can

What the hell does this have to do with TCP Flow Control? I’m so glad you asked. 

During a keg stand your friend is trying to push as much beer down your throat as it can handle, much like TCP increasing the window size to fill the end-to-end pipe.  Both of your hands are occupied holding your own weight, and your mouth has a beer hose in it, so like TCP you have no native congestion signaling mechanism.  Just like TCP, the flow doesn’t slow until packets/beer drop; when you start to spill they stop the flow.

So that’s an example of end-to-end congestion management.  Within Ethernet and FCoE specifically we don’t have any native end-to-end congestion tools (remember TCP is up on L4 and we’re hanging out with the cool kids at L2.)  No problem though because we’re talking FCoE, right?  FCoE is just a L1-L2 replacement for Fibre Channel (FC) L0-L1, so we’ll just use FC end-to-end congestion management… Not so fast boys and girls: FC does not have a standard for end-to-end congestion management.  That’s right, our beautiful over-engineered lossless FC has no mechanism for handling network-wide, end-to-end congestion.  That’s because it doesn’t need it.

FC is moving SCSI data, and SCSI is sensitive to dropped frames; latency is important but lossless delivery is more important.  To ensure a frame is never dropped FC uses hop-by-hop flow control known as buffer-to-buffer (B2B) credits.  At a high level each FC device knows the number of buffer spaces available on the next hop device based on the agreed upon frame size (typically 2148 bytes.)  This means that a device will never send a frame to a next hop device that cannot handle the frame.  Let’s go back to my analogy and keep with my USMC theme.

Buffer-to-buffer credits:

The B2B credit system works the same way you’d have 10 Marines offload and stack a truckload of boxes (‘fork-lift, we don’t need no stinking forklift.’)  The best system to utilize 10 Marines to offload boxes is to line them up end-to-end, with one in the truck and one on the other end to stack.  Marine 1 in the truck initiates the send by grabbing a box and passing it to Marine 2, and the box moves down the line until it gets to the target, Marine 10, who stacks it.  Before any Marine hands another Marine a box they look to ensure that Marine’s hands are empty, verifying they can handle the box and it won’t be dropped.  Boxes move down the line until they are all offloaded and stacked.  If anyone slows down or gets backed up each Marine will hold their box until the congestion is relieved.

In this analogy the Marine in the truck is the initiator/server and the Marine doing the stacking is the target/storage with each Marine in between being a switch. 

When two FC devices bring up a link they go through link initialization and login (for example, a node port performs a fabric login, FLOGI, with its switch.)  During this process they agree on an FC frame size and exchange the number of dedicated frame buffer spaces available for the link.  A sender is always keeping track of available buffers on the receiving side of the link.  The only real difference between this and my analogy is each device (Marine) is typically able to handle more than one frame (box) at once.
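For those who think better in code than in Marines, the credit bookkeeping on a single hop can be sketched as below.  This is only an illustration of the idea: the class and method names are made up for this post, and a real FC port does this in hardware.  The one real term used is R_RDY, the FC signal a receiver sends when it has freed a buffer.

```python
class CreditedLink:
    """One hop where the sender tracks the receiver's free buffers."""

    def __init__(self, credits):
        self.credits = credits  # buffers the receiver advertised for this link
        self.in_flight = 0      # frames sent whose buffers are not yet freed

    def can_send(self):
        return self.credits > 0

    def send_frame(self):
        if not self.can_send():
            # Like a Marine whose neighbor still has full hands: hold the box
            raise RuntimeError("no credit: sender must hold the frame")
        self.credits -= 1
        self.in_flight += 1

    def receiver_ready(self):
        """The receiver freed a buffer (R_RDY), so one credit comes back."""
        if self.in_flight == 0:
            raise RuntimeError("no outstanding frame to release")
        self.in_flight -= 1
        self.credits += 1


link = CreditedLink(credits=2)
link.send_frame()
link.send_frame()
print(link.can_send())   # False: both buffers on the next hop are in use
link.receiver_ready()
print(link.can_send())   # True: one buffer was freed
```

Because the sender stops as soon as it runs out of credits, nothing is ever dropped; the frame simply waits at the previous hop.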

So if FC networks operate just fine without end-to-end congestion management, why do we need to implement a new mechanism in FCoE?  Well, therein lies the rub.  Do we need QCN?  The answer is really yes and no, and it will depend on design.  FCoE today provides the exact same flow control as FC using two standards defined within Data Center Bridging (DCB): Enhanced Transmission Selection (ETS) and Priority Flow Control (PFC) (for more info on these see my DCB blog: http://www.definethecloud.net/?p=31.)  Basically ETS provides a bandwidth guarantee without imposing a limit, and PFC provides lossless delivery on an Ethernet network.
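The ‘guarantee without a limit’ behavior of ETS is easier to see with numbers.  The sketch below is a toy model of that behavior only; it is not the scheduler defined by the standard, and the function name, class names, and percentages are assumptions chosen for the example.

```python
def ets_allocate(link_gbps, guarantees, demands):
    """Share a link among traffic classes.

    guarantees: class name -> guaranteed fraction of the link (sum <= 1.0)
    demands:    class name -> offered load in Gbps
    """
    # Step 1: each class gets what it asks for, up to its guaranteed share
    alloc = {c: min(demands[c], link_gbps * guarantees[c]) for c in guarantees}
    spare = link_gbps - sum(alloc.values())

    # Step 2: spare bandwidth goes to classes that still want more,
    # in proportion to their guarantees, until none is left or wanted
    while spare > 1e-9:
        hungry = [c for c in guarantees if demands[c] - alloc[c] > 1e-9]
        weight = sum(guarantees[c] for c in hungry)
        if not hungry or weight == 0:
            break
        granted = 0.0
        for c in hungry:
            offer = spare * guarantees[c] / weight
            take = min(offer, demands[c] - alloc[c])
            alloc[c] += take
            granted += take
        spare -= granted
    return alloc


# 10GE link split 50/50: FCoE only needs 2 Gbps, so LAN may borrow the rest
print(ets_allocate(10, {"fcoe": 0.5, "lan": 0.5}, {"fcoe": 2, "lan": 9}))
```

With both classes busy each one falls back to its guaranteed 5 Gbps, which is the ‘guarantee’ half of the story.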

Why QCN:

The reason QCN was developed is the difference in size, scale, and design between FC and Ethernet networks.  Ethernet networks are usually large mesh or partial mesh type designs with multiple switches.  FC designs fall into one of three major categories: collapsed core (single layer), core-edge (two layer), or in rare cases for very large networks edge-core-edge (three layer.)  This is because we typically have far fewer FC connected devices than we do Ethernet (not every device needs consolidated storage/backup access.)

If we were to design our FCoE networks so that every current Ethernet device supported FCoE and FCoE frames flowed end-to-end, QCN would be a benefit to ensure point congestion didn’t clog the entire network.  On the other hand, if we maintain a similar size and design for FCoE networks as we do for FC networks, there is no need for QCN.

Let’s look at some diagrams to better explain this:

image

 

image

In the diagrams above we see a couple of typical network designs.  The Ethernet diagram shows core at the top, aggregation in the middle, and edge on the bottom where servers would connect.  The Fibre Channel design shows a core at the top with an edge at the bottom.  Storage would attach to the core and servers would attach at the bottom.  In both diagrams I’ve also shown typical frame flow for each traffic type.  Within Ethernet, servers commonly communicate with one another as well as with network file systems, the WAN, etc.  In an FC network the frame flow is much simpler; typically only initiator-to-target (server to storage) communication occurs.  In this particular FC example there is little to no chance of a single frame flow causing a central network congestion point that could affect other flows, which is where end-to-end congestion management comes into play.

What does QCN do:

QCN moves congestion from the network center to the edge to avoid centralized congestion on DCB networks.  Let’s take a look at a centralized congestion example (FC only for simplicity):

image

In the above example two 2Gbps hosts are sending full rate frame flows to two storage devices.  One of the storage devices is a 2Gbps device and can handle the full speed; the other is a 1Gbps device and is not able to handle the full speed.  If these rates are sustained switch 3’s buffers will eventually fill and cause centralized congestion affecting frame flows to both switches 4 and 5.  This means that the full rate capable devices would be affected by the single slower device.  QCN is designed to detect this type of congestion and push it to the edge, therefore slowing the initiator on the bottom right and avoiding overall network congestion.

This example is obviously not a good design and is only used to illustrate the concept.  In fact in a properly designed FC network with multiple paths between end-points central congestion is easily avoidable. 
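To put rough numbers on how quickly the central switch in the congestion example gets into trouble, here is a back-of-the-napkin calculation.  The 1 MB buffer size is purely an assumption for illustration (I’m not quoting any real switch), and the function name is made up.

```python
def seconds_until_buffer_full(ingress_gbps, egress_gbps, buffer_bytes):
    """Time for a queue to fill when frames arrive faster than they drain.

    Returns None when the egress side keeps up and the queue never fills.
    """
    excess_bits_per_s = (ingress_gbps - egress_gbps) * 1e9
    if excess_bits_per_s <= 0:
        return None
    return buffer_bytes * 8 / excess_bits_per_s


# 2Gbps host sending toward the 1Gbps storage device, hypothetical 1 MB buffer
print(seconds_until_buffer_full(2, 1, 1_000_000))   # 0.008 (8 milliseconds)

# 2Gbps host sending toward the 2Gbps storage device: no backlog builds
print(seconds_until_buffer_full(2, 2, 1_000_000))   # None
```

Under those assumptions the buffer is full in a few milliseconds of sustained traffic, which is why a mismatch like this has to be pushed back toward the sender one way or another.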

When moving to FCoE, if the network is designed such that FCoE frames pass through the entire full-mesh network shown in the common Ethernet design above, there would be greater chances of central congestion.  If the central switches were DCB capable but not FCoE Forwarders (FCFs), QCN could play a part in pushing that congestion to the edge.

If on the other hand you design FCoE in a similar fashion to current FC networks QCN will not be necessary.  An example of this would be:

image

The above design incorporates FCoE into the existing LAN Core, Aggregation, Edge design without clogging the LAN core with unneeded FCoE traffic.  Each server is dual connected to the common Ethernet mesh, and redundantly connected to FCoE SAN A and B.  This design is extremely scalable and will provide more than enough ports for most FCoE implementations.

Summary:

QCN, like other congestion management tools before it such as FECN and BECN, has significant use cases.  As far as FCoE deployments go, QCN is definitely not a requirement and, depending on design, may provide no benefit for FCoE.  It’s important to remember that the DCB standards are there to enhance Ethernet as a whole, not just for FCoE.  FCoE utilizes ETS and PFC for lossless frame delivery and bandwidth control, but the FCoE standard is a separate entity from DCB.

Also remember that FCoE is an excellent tool for virtualization, which reduces physical server count.  This means that we will continue to require fewer and fewer FCoE ports overall, especially as 40Gbps and 100Gbps are adopted.  Scaling FCoE networks further than today’s FC networks will most likely not be a requirement.


Networking Showdown: UCS vs. HP Virtual Connect (Updated)

Note: I have made updates to reflect that Virtual Connect is two words, and technical changes to explain another method of network configuration within Virtual Connect that prevents the port blocking described below.  Many thanks to Ken Henault at HP who graciously walked me through the corrections, and beat them into my head until I understood them.

I’m sitting on a flight from Honolulu to Chicago en route home after a trip to Hawaii for customer briefings.  I was hoping to be asleep at this point but a comment Ken Henault left on my ‘FlexFabric – Small Step, Right Direction’ post is keeping me awake… that’s actually a lie, I’m awake because I’m cramped into a coach seat for 8 hours while my fiancé, who joined me for a couple of days, enjoys the first class luxuries of my auto upgrade.  All the comfort in the world wouldn’t make up for the looks I’d get when we landed if I was the one up front.

So, being that I’m awake anyway I thought I’d address the comment from the aforementioned post.  Before I begin I want to clarify that my last post had nothing to do with UCS.  I intentionally left UCS out because it was designed with FCoE in mind from the start so it has native advantages in an FCoE environment.  Additionally within UCS you can’t get away from FCoE; if you want Fibre Channel connectivity you’re using FCoE so it’s not a choice to be made (iSCSI, NFS, and others are supported but to connect to FC devices or storage it’s FCoE.)  The blog was intended to state exactly what it did: HP has made a real step into FCoE with FlexFabric but there is still a long way to go.  To see the original post click the link (http://www.definethecloud.net/?p=419.)

I’ve got a great deal of respect for both Ken and HP, whom he works for.  Ken knows his stuff; our views may differ occasionally but he definitely gets the technology.  The fact that Ken knows HP blades inside, outside, backwards, and forwards and has a strong grasp on Cisco’s UCS made his comment even more intriguing to me, because it highlights weak spots in the overall understanding of both UCS and server architecture/design as it pertains to network connectivity.

Scope:

This post will cover the networking comparison of HP C-Class using Virtual Connect (VC) modules and Virtual Connect (VC) management as it compares to the Cisco UCS Blade System.  This comparison is the closest ‘apples-to-apples’ comparison that can be done between Cisco UCS and HP C-Class.  Additionally I will be comparing the max blades in a single HP VC domain which is 64 (4 chassis x 16 blades) against 64 UCS blades which would require 8 Chassis.

Accuracy and Objectivity:

It is not my intent to use numbers favorable to one vendor or the other.  I will be as technically accurate as possible throughout, and I welcome all feedback, comments and corrections from both sides of the house.

HP Virtual Connect:

VC is an advanced management system for HP C-Class blades that allows 4 blade chassis to be interconnected and managed/networked as a single system.  In order to provide this functionality the LAN/SAN switch modules used must be VC and the chassis must be interconnected by what HP calls a stacking-link.  HP does not consider VC Ethernet modules to be switches, but for the purpose of this discussion they will be.  I make this decision based on the fact that they make switching decisions and they are the same hardware as the ProCurve line of blade switches.

Note: this is a networking discussion so while VC has other features they are not discussed here.

Let’s take a graphical view of a 4-chassis VC domain.

image

In the above diagram we see a single VC domain cabled for LAN and SAN connectivity.  You can see that each chassis is independently connected to SAN A and B for Fibre Channel access, but Ethernet traffic can traverse the stacking-links along with the domain management traffic.  This allows a reduced number of uplinks to be used from the VC domain to the network for each 4 chassis VC domain.  This solution utilizes 13 total links to provide 16 Gbps of FC per chassis (assuming 8Gb uplinks) and 20 Gbps of Ethernet for the entire VC domain (with blocking considerations discussed below.)  More links could be added to provide additional bandwidth.

This method of management and port reduction does not come without its drawbacks.  In the next graphic I add loop prevention and server to server communication.

image

The first thing to note in the above diagram is the blocked link.  When only a single vNet is configured across the VC Domain (1-4 chassis) only 1 link or link aggregate group may forward per VLAN.  This means that per VC domain there is only one ingress or egress point to the network per VLAN.  This is because VC is not ‘stacking’ 4 switches into one distributed switch control plane but instead ‘daisy-chaining’ four independent switches together using an internal loop prevention mechanism.  This means that to prevent loops from being caused within the VC domain only one link can be actively used for upstream connectivity per VLAN.

Because of this loop prevention system you will see multiple-hops for frames destined between servers in separate chassis, as well as frames destined upstream in the network.  In the diagram I show a worst case scenario for educational purposes where a frame from a server in the lower chassis must hop three times before leaving the VC domain.  Proper design and consideration would reduce these hops to two max per VC domain.

**Update**

This is only one of the methods available for configuring vNets within a VC domain.  The second method will allow both uplinks to be configured using separate vNets which allows each uplink to be utilized even within the same VLANs but segregates that VLAN internally.  The following diagram shows this configuration.

image

In this configuration server NIC pairs will be configured to each use one vNet and NIC teaming software will provide failover.  Even though both vNets use the same VLAN the networks remain separate internally which prevents looping, upstream MAC address instability etc.  For example a server utilizing only two onboard NICs would have one NIC in vNet1 and one in vNet2.  In the event of an uplink failure for vNet1 the NIC in that vNet would have no north/south access but NIC teaming software could be relied upon to force traffic to the NIC in vNet 2. 

While both methods have advantages and disadvantages this will typically be the preferred method to avoid link blocking and allow better bandwidth utilization.  In this configuration the center two chassis will still require an extra one or two hops to send/receive north/south traffic depending on which vNet is being used.

**End Update**

The last thing to note is that any Ethernet cable reduction will also result in lower available bandwidth for upstream/northbound traffic to the network.  For instance in the top example above only one link will be usable per VLAN.  Assuming 10GE links, that leaves 10G bandwidth upstream for 64 servers.  Whether that is acceptable or not depends on the particular I/O profile of the applications.  Additional links may need to be added to provide adequate upstream bandwidth.  That brings us to our next point:

Calculating bandwidth needs:

Before making a decision on bandwidth requirements it is important to note the actual characteristics of your applications.  Some key metrics to help in design are:

  • Peak Bandwidth
  • Average Bandwidth
  • East/West traffic
  • North/South Traffic

For instance, using the example above, if all of my server traffic is East/West within a single chassis then the upstream link constraints mentioned are moot points.  If the traffic must traverse multiple chassis the stacking-link must be considered.  Lastly if traffic must also move between chassis as well as North/South to the network, uplink bandwidth becomes critical.  With networks it is common to under-architect and over-engineer, meaning spend less time designing and throw more bandwidth at the problem.  This does not provide the best results at the right cost.
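As a quick sanity check on the single-uplink example (one usable 10GE link for 64 servers), the arithmetic looks like the sketch below.  The function name is made up, and the 10GE server NIC speed is my assumption for the worst case where every server tries to send North/South at line rate.

```python
def uplink_share(servers, active_uplinks, uplink_gbps, server_nic_gbps):
    """Return (per-server share in Gbps, oversubscription ratio)
    for north/south traffic leaving the domain."""
    upstream = active_uplinks * uplink_gbps
    per_server = upstream / servers
    oversubscription = (servers * server_nic_gbps) / upstream
    return per_server, oversubscription


# One forwarding 10GE uplink per VLAN shared by 64 servers
share, ratio = uplink_share(servers=64, active_uplinks=1,
                            uplink_gbps=10, server_nic_gbps=10)
print(share)   # 0.15625 Gbps, roughly 156 Mbps per server
print(ratio)   # 64.0, i.e. 64:1 if every server sends at line rate
```

Whether 64:1 is a problem depends entirely on the peak and average numbers you gathered from the list above; plenty of workloads never come close.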

Cisco Unified Computing System:

Cisco UCS takes a different approach to providing I/O to the blade chassis.  Rather than placing managed switches in the chassis, UCS uses a device called an I/O Module or Fabric Extender (IOM/FEX) which does not make switching decisions and instead passes traffic based on an internal pinning mechanism.  All switching is handled by the upstream Fabric Interconnects (UCS 6120 or 6140.)  Some will say the UCS Fabric Interconnect is ‘not-a-switch’, but using the same logic as I did above for HP VC devices the Fabric Interconnect is definitely a switch.  In both operational modes the interconnect will make forwarding decisions based on MAC address.

One major architectural difference between UCS and HP, Dell, IBM, Sun blade implementations is that the switching and management components are stripped from the individual chassis and handled in the middle of row by a redundant pair of devices (fabric interconnects.)  These devices replace the LAN access and SAN edge ports that other vendors’ blade devices connect to.  Another architectural difference is that the UCS system never blocks server links to prevent loops (all links are active from the chassis to the interconnects) and in the default mode, End Host mode, it will not block any upstream links to the network core.  For more detail on these features see my posts: ‘Why UCS is my A-Game Server Architecture’ (http://www.definethecloud.net/?p=301) and ‘UCS Server Failover’ (http://www.definethecloud.net/?p=359.)

A single UCS implementation can scale to a max of 40 chassis (320 servers) using a minimal bandwidth configuration, or 10 chassis (80 servers) using max bandwidth, depending on requirements.  There is also flexibility to mix and match bandwidth needs between chassis etc.  Current firmware limits a single implementation to 12 chassis (96 servers) for support and this increases with each major release.  Let’s take a look at the 8 chassis 64 server implementation for comparison to an equal HP VC domain.

image

In the diagram above we see an 8 chassis 64 server implementation utilizing the minimum number of links per chassis to provide redundancy (the same as was done in the HP example above.)  Here we utilize 16 links for 8 chassis providing 20Gbps of LAN and SAN traffic to each chassis.  Because there is no blocking required for loop-prevention all links shown are active.  Additionally because the Fabric Interconnects shown here in green are the access/edge switches for this topology all east/west traffic between servers in a single chassis or across chassis is fully accounted for.  Depending on bandwidth requirements additional uplinks could be added to each chassis.  Lastly there would be no additional management cables required from the interconnects to the chassis as all management is handled on dedicated, prioritized internal VLANs.

In the system above all traffic is aggregated upstream via the two fabric interconnects.  This means that accounting for North/South traffic is handled by calculating the bandwidth needs of the entire system and designing the appropriate number of links.
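The scaling figures quoted earlier (40 chassis at minimal bandwidth, 10 chassis at max bandwidth) fall out of simple port math.  The sketch below assumes each Fabric Interconnect has 40 ports available for chassis links and that each chassis uses the same number of links to each interconnect; both are assumptions inferred from those figures rather than stated limits, and a real design also has to reserve ports for uplinks.

```python
def max_chassis(fi_server_ports, links_per_iom, blades_per_chassis=8):
    """Return (chassis, servers) supported by one pair of interconnects.

    fi_server_ports: ports on EACH interconnect used for chassis links
    links_per_iom:   links from each chassis I/O module to its interconnect
    """
    chassis = fi_server_ports // links_per_iom
    return chassis, chassis * blades_per_chassis


print(max_chassis(40, 1))   # (40, 320): one 10GE link per fabric per chassis
print(max_chassis(40, 4))   # (10, 80): four 10GE links per fabric per chassis
```

The 8 chassis example above sits at the minimal end of that range: one link per fabric, so 16 links total.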

Side by Side Comparison:

image

In the diagram we see a maximum server scale VC Domain compared to an 8 chassis UCS domain.  The diagram shows both domains connected up to a shared two-tier SAN design (core/edge) and 3 tier network design (Access, Aggregation, Core.)  In the UCS domain all access layer connectivity is handled within the system.

In the next diagram we look at an alternative connectivity method for the HP VC domain utilizing the switch modules in the HP chassis as the access layer to reduce infrastructure.

image

In this method we have reduced the switching infrastructure by utilizing the onboard switching of the HP chassis as the access layer.  The issue here will be the bandwidth requirements and port costs at the LAN aggregation/SAN core.  Depending on application bandwidth requirements additional aggregation/core ports will be required which can be more costly/complex than access connectivity.  Additionally this will increase cabling length requirements in order to tie back to the data center aggregation/core layer. 

Summary:

When comparing UCS to HP blade implementations a base UCS blade implementation is best compared against a single VC domain in order to obtain comparable feature parity.  The total port and bandwidth counts from the chassis for a minimum redundant system are:

                   | HP                                    | Cisco
Total uplinks      | 13                                    | 16
Gbps FC            | 16 per chassis                        | N/A
Gbps Ethernet      | 10 per VLAN per VC Domain / 20 total  | N/A
Consolidated I/O   | N/A                                   | 20 Gbps per chassis
Total chassis I/O  | 21 Gbps for 16 servers                | 20 Gbps for 8 servers

 

This does not take into account the additional management ports required for the VC domain that will not be required by the UCS implementation.  An additional consideration will be scaling beyond 64 servers.  With this minimal consideration the Cisco UCS will scale to 40 chassis 320 servers where the HP system scales in blocks of 4 chassis as independent VC domains.  While multiple VC domains can be managed by a Virtual Connect Enterprise Manager (VCEM) server the network stacking is done per 4 chassis domain requiring North/South traffic for domain to domain communication.

The other networking consideration in this comparison is that in the default mode all links shown for the UCS implementation will be active.  The HP implementation will have one available uplink or port aggregate uplink per VLAN for each VC domain, further constraining bandwidth and/or requiring additional ports.


UCS Server Failover

I spent the day today with a customer doing a proof of concept and failover testing demo on a Cisco UCS, VMware and NetApp environment.  As I sit on the train heading back to Washington from NYC I thought it might be a good time to put together a technical post on the failover behavior of UCS blades.  UCS has some advanced availability features that should be highlighted; it additionally has some areas where failover behavior may not be obvious.  In this post I’m going to cover server failover situations within the UCS system, without heading very deep into the connections upstream to the network aggregation layer (mainly because I’m hoping Brad Hedlund at http://bradhedlund.com will cover that soon, hurry up Brad ;-)

**Update** Brad has posted his UCS Networking Best Practices Post I was hinting at above.  It’s a fantastic video blog in HD, check it out here: http://bradhedlund.com/2010/06/22/cisco-ucs-networking-best-practices/

To start this off let’s get up to a baseline level of understanding on how UCS moves server traffic.  UCS is made up of a number of blade chassis and a pair of Fabric Interconnects (FI.)  The blade chassis hold the blade servers and the FIs handle all of the LAN and SAN switching as well as the chassis/blade management that in other implementations is typically done using six separate modules in each blade chassis.

Note: When running redundant Fabric interconnects you must configure them as a cluster using L1 and L2 cluster links between each FI.  These ports carry only cluster heartbeat and high-level system messages no data traffic or Ethernet protocols and therefore I have not included them in the following diagrams.

UCS Network Connectivity

image

Each individual blade gets connectivity to the network(s) via mezzanine form factor I/O card(s.)  Depending on which blade type you select  each blade will either have one redundant set of connections to the FIs or two redundant sets.  Regardless of the type of I/O card you select you will always have 1x10GE connection to each FI through the blade chassis I/O module (IOM.)

UCS Blade Connectivity

image

In the diagram you’re seeing the blade connectivity for a blade with a single mezzanine slot.  You can see that the blade is redundantly connected to both Fabric A and Fabric B via 2x10GE links.  This connection occurs via the IOM, which is not a switch itself and instead acts as a remote device managed by the fabric interconnect.  What this means is that all forwarding decisions are handled by the FIs and frames are consistently scheduled within the system regardless of source and/or destination.  The total switching latency of the UCS system is approximately equal to that of a top-of-rack switch or blade form factor LAN switch within other blade products.  Because the IOM is not making switching decisions it needs another method to move the traffic of its 8 internal mid-plane ports upstream using its 4 available uplinks.  The method it uses is static pinning.  This method provides a very elegant switching behavior with extremely predictable failover scenarios.  Let’s first look at the pinning, and later at what this means for UCS network failures.

Static Pinning

image

The chart above shows the static pinning mechanism used within UCS.  Given the configured number of uplinks from IOM to FI you will know exactly which uplink port a particular mid-plane port is using.  Each half-width blade attaches to a single mid-plane port and each full width blade attaches to two.  In the diagram the use of three ports does not have a pinning mechanism because this is not supported.  If three links are used the 2 port method will define how uplinks are utilized.  This is because eight devices cannot be evenly load-balanced across three links.

IOM Connectivity

image

The example above shows the numbering of mid-plane ports.  If you were using half-width blades their numbering would match.  When using full-width blades each blade has access to a pair of mid-plane ports (1-2, 3-4, 5-6, 7-8.)  In the example above blade three would utilize mid-plane port three in the left example and one in the second based on the static pinning in the chart.
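Since the pinning chart itself is a picture, here is the same idea as a small lookup function.  The round-robin rule used below is my assumption about how the chart is laid out (it agrees with the blade three example, reading it as uplink three with four links and uplink one with two), so treat it as a sketch rather than the documented table; the function and constant names are made up.

```python
SUPPORTED_UPLINK_COUNTS = (1, 2, 4)   # three links fall back to the 2 port method


def pinned_uplink(midplane_port, uplinks_cabled):
    """Return the IOM uplink (1-based) that a mid-plane port is pinned to."""
    if not 1 <= midplane_port <= 8:
        raise ValueError("mid-plane ports are numbered 1 through 8")
    if not 1 <= uplinks_cabled <= 4:
        raise ValueError("an IOM has between 1 and 4 uplinks cabled")
    # Use the largest supported link count that does not exceed what is cabled
    effective = max(n for n in SUPPORTED_UPLINK_COUNTS if n <= uplinks_cabled)
    # Assumed rule: ports are spread round-robin across the usable uplinks
    return (midplane_port - 1) % effective + 1


print(pinned_uplink(3, 4))   # 3
print(pinned_uplink(3, 2))   # 1
print(pinned_uplink(3, 3))   # 1, because three links behave like two
```

The useful property is that the answer never depends on traffic: for a given link count you always know which uplink a blade is using, and therefore exactly which blades are affected when a link fails.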

So now let’s discuss how failover happens, starting at the operating system.  We have two pieces of failover to discuss: NIC teaming and SAN multi-pathing.  In order to understand that we need a simple logical connectivity view of how a UCS blade sees the world.

UCS Logical Connectivity

image

In order to simplify your thinking when working with blade systems, reduce your logical diagram to the key components; do this by removing the blade chassis itself from the picture.  Remember that a blade is nothing more than a server connected to a set of switches; the only difference is that the first hop link is on the mid-plane of the chassis rather than a cable.  The diagram above shows that a UCS blade is logically cabled directly to redundant Storage Area Network (SAN) switches for Fibre Channel (FC) and to the FI for Ethernet.  Out of personal preference I leave the FIs out of the SAN side of the diagram because they operate in N_Port Virtualizer (NPV) mode, which means forwarding decisions are handled by the upstream NPIV standard compliant SAN switch.

Starting at the Operating System (OS) we will work up the network stack to the FIs to discuss failover.  We will be assuming FCoE is being used, if you are not using FCoE ignore the FC piece of the discussion as the Ethernet will remain the same.

SAN Multi-Pathing:

SAN multi-pathing is the way we obtain redundancy in FC, FCoE, and iSCSI networks.  It is used to provide the OS with two separate paths to the same logical disk.  This allows the server to access the data in the event of a failure and in some cases load-balance traffic across two paths to the same disk.  Multi-pathing comes in two general flavors: active/active or active/passive.  Active/active load balances and has the potential to use the full bandwidth of all available paths.  Active/passive uses one link as a primary and reserves the others for failover.  Typically the deciding factor is cost vs. performance.

Multi-pathing is handled by software residing in the OS usually provided by the storage vendor.  The software will monitor the entire path to the disk ensuring data can be written and/or read from the disk via that path.  Any failure in the path will cause a multi-pathing failover.

Multi-Pathing Failure Detection

image

Any of the failures designated by the X’s in the diagram above will trigger failover.  This also includes failure of the storage controller itself; controllers are typically redundant in an enterprise class array.  SAN multi-pathing is an end-to-end failure detection system.  This is much easier to implement in a SAN as there is one constant target, as opposed to a LAN where data may be sent to several different targets across the LAN and WAN.  Within UCS, SAN multi-pathing does not change from the system used for standalone servers.  Each blade is redundantly connected and any path failure will trigger a failover.
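The path selection logic behind the two multi-pathing flavors can be sketched in a few lines.  This is only a conceptual model; real multi-pathing software is vendor specific, and the function name, path names, and the simple healthy/unhealthy flag are assumptions for illustration.

```python
def pick_paths(paths, mode):
    """Choose which paths to the disk carry I/O.

    paths: list of (name, healthy) tuples in priority order
    mode:  "active/active" or "active/passive"
    """
    healthy = [name for name, ok in paths if ok]
    if not healthy:
        raise RuntimeError("all paths to the disk are down")
    if mode == "active/active":
        return healthy        # load balance across every working path
    if mode == "active/passive":
        return healthy[:1]    # the first working path carries everything
    raise ValueError("unknown multi-pathing mode: " + mode)


both_up = [("fabric-a", True), ("fabric-b", True)]
a_failed = [("fabric-a", False), ("fabric-b", True)]

print(pick_paths(both_up, "active/active"))     # ['fabric-a', 'fabric-b']
print(pick_paths(both_up, "active/passive"))    # ['fabric-a']
print(pick_paths(a_failed, "active/passive"))   # ['fabric-b']
```

In practice the health flag is the interesting part: it reflects the whole path to the disk, so a failure at any of the X’s flips it.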

NIC-Teaming:

NIC teaming is handled in one of three general ways: active/active load-balancing, active/passive failover, or active/active transmit with active/passive receive.  The teaming type you use is dependent on the network configuration.

Supported teaming Configurations

image 

In the diagram above we see two network configurations: one with a server dual connected to two switches, and a second with a server dual connected to a single switch using a bonded link.  Bonded links act as a single logical link with the redundancy of the physical links within.  Active/active load-balancing is only supported using a bonded link due to the MAC address forwarding decisions of the upstream switch.  In order to load balance, an active/active team will share a logical MAC address; this will cause instability upstream and lost packets if the upstream switches don’t see both links as a single logical link.  This bonding is typically done using the Link Aggregation Control Protocol (LACP) standard.
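The shared-MAC problem is easy to see with a toy model of a learning switch.  The sketch below is a simplification under stated assumptions: a single upstream switch, made-up port names, and a team that alternates frames across its two links.  It only counts how often the switch has to move the MAC address from one port to another.

```python
class LearningSwitch:
    """Toy L2 switch that learns which port a source MAC lives on."""

    def __init__(self):
        self.mac_table = {}
        self.moves = 0  # how many times a MAC "flapped" between ports

    def receive(self, src_mac, port):
        previous = self.mac_table.get(src_mac)
        if previous is not None and previous != port:
            self.moves += 1
        self.mac_table[src_mac] = port


def send_alternating(switch, mac, ports, frames):
    """An active/active team alternating frames across its links."""
    for i in range(frames):
        switch.receive(mac, ports[i % len(ports)])


team_mac = "00:11:22:33:44:55"  # illustrative shared team MAC

# Two independent ports: the switch sees the MAC jump back and forth.
unbonded = LearningSwitch()
send_alternating(unbonded, team_mac, ["eth1/1", "eth1/2"], frames=6)

# Bonded: both physical links are one logical port (hypothetical "po1").
bonded = LearningSwitch()
send_alternating(bonded, team_mac, ["po1", "po1"], frames=6)

print(unbonded.moves, bonded.moves)
```

Every move in the unbonded case is a moment where return traffic may be sent down the wrong link, which is the instability and packet loss described above.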

If you glance back up at the UCS logical connectivity diagram you’ll see that UCS blades are connected in the method on the left of the teaming diagram.  This means that our options for NIC teaming are active/passive failover and active/active transmit with active/passive receive.  This assumes a bare metal OS such as Windows or Linux installed directly on the hardware.  In virtualized environments such as VMware, all links can be actively used for transmit and receive because there is another layer of switching occurring in the hypervisor.

I typically get feedback that the lack of active/active NIC teaming on UCS bare metal blades is a limitation.  In reality this is not the case.  Remember active/active NIC teaming was traditionally used on 1GE networks to provide greater than 1GE of bandwidth.  This was limited to a max of 8 aggregated links for a total of 8GE of bandwidth.  A single UCS link at 10GE provides 25% more bandwidth than an 8 port active/active team.

NIC teaming like SAN multi-pathing relies on software in the OS, but unlike SAN multi-pathing it typically only detects link failures and in some cases loss of a gateway.  Due to the nature of the UCS system NIC teaming in UCS will detect failures of the mid-plane path, the IOM, the utilized link from the IOM to the Fabric Interconnect or the FI itself.  This is because the IOM is a linecard of the FI and the blade is logically connected directly to the FI.

UCS Hardware Failover:

UCS has a unique feature on several of the available mezzanine cards to provide hardware failure detection and failover on the card itself.  Basically some of the mezzanine cards have a mini-switch built in with the ability to fail path A to path B or vice versa.  This provides additional failure functionality and improved bandwidth/failure management.  This feature is available on Generation I Converged Network Adapters (CNA) and the Virtual Interface Card (VIC) and is currently only available in UCS blades.

UCS Hardware Failover

image

UCS hardware failover will provide greater failure visibility than traditional NIC teaming due to advanced intelligence built into the FI as well as the overall architecture of the system.  In the diagram above, HW failover detects mid-plane path, IOM, and IOM uplink failures as link failures due to the architecture.  Additionally, if the FI loses its upstream network connectivity to the LAN it will signal a failure to the mezzanine card, triggering failover.  In the diagram above any failure at a point designated by an X will trigger the mezzanine card to divert Ethernet traffic to the B path.  UCS hardware failover applies only to Ethernet traffic, as SAN networks are built as redundant independent networks and would not support this failover method.

Using UCS hardware failover provides two key advantages over other architectures:

  • Allows redundancy for NIC ports in separate subnets/VLANs which NIC teaming cannot do.
  • Provides the ability for network teams to define the failure capabilities and primary path for servers alleviating misconfigurations caused by improper NIC teaming settings.

IOM Link Failure:

The next piece of UCS server failover involves the I/O modules themselves.  Each I/O module has a maximum of four 10GE uplinks providing 8x10GE mid-plane connections to the blades at an oversubscription of 1:1 to 8:1 depending on configuration.  As stated above UCS uses a static non-configurable pinning mechanism to assign a mid-plane port to a specific uplink from the IOM to the FI.  Using this pinning system allows the IOM to operate as an extension of the FI without the need for Spanning Tree Protocol (STP) within the UCS system.  Additionally this system provides a very clear network design for designing oversubscription in both nominal and failure situations.

For the discussion of IOM failover we will use an example of a max configuration of 8 half-width blades and 4 uplinks on each redundant IOM.

Fully Configured 8 Blade UCS Chassis

image

In this diagram each blade is currently redundantly connected via 2x10GE links, one link through each IOM to each FI.  Both IOMs and FIs operate in an active/active fashion from a switching perspective, so each blade in this scenario has a potential bandwidth of 20GE depending on the operating system configuration.  The overall blade chassis is configured with 2:1 oversubscription in this diagram, as each IOM is using its max of 4x10GE uplinks while providing its max of 8x10GE mid-plane links for the 8 blades.  If each blade were to attempt to push a sustained 20GE of throughput at the same time (a very unlikely scenario) it would receive only 10GE because of this oversubscription.  The bandwidth can be finely tuned to ensure proper performance in congestion scenarios such as this one using Quality of Service (QoS) and Enhanced Transmission Selection (ETS) within the UCS system.
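The oversubscription numbers here are simple arithmetic, sketched below in Python.  The function names are mine, and the worst-case figure assumes the uplink bandwidth is shared evenly among the blades, which is a simplification (QoS and ETS can change the split).

```python
from fractions import Fraction

LINK_GBPS = 10  # every mid-plane link and uplink in this example is 10GE


def oversubscription(midplane_links, uplinks):
    """Ratio of blade-facing bandwidth to uplink bandwidth on one IOM."""
    return Fraction(midplane_links * LINK_GBPS, uplinks * LINK_GBPS)


def worst_case_per_blade_gbps(blades, uplinks_per_iom, ioms=2):
    """Even share per blade if every blade saturates both fabrics at once."""
    blade_max = ioms * LINK_GBPS                       # 2x10GE per blade
    uplink_total = ioms * uplinks_per_iom * LINK_GBPS  # all IOM-to-FI links
    return min(blade_max, uplink_total / blades)


for uplinks in (4, 2, 1):
    ratio = oversubscription(8, uplinks)
    print(f"{uplinks} uplinks per IOM -> {ratio}:1 oversubscription")

print(worst_case_per_blade_gbps(blades=8, uplinks_per_iom=4))
```

With 8 blades and 4 uplinks per IOM this gives the 2:1 ratio and the 10GE per blade worst case from the paragraph above.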

In the event that a link fails between the IOM and the FI, the servers pinned to that link will no longer have a path to that FI.  The blade will still have a path to the redundant FI and will rely on SAN multi-pathing, NIC teaming and/or UCS hardware failover to detect the failure and divert traffic to the active link.

For example, if link one on IOM A fails, blades one and five would lose connectivity through Fabric A and any traffic using that path would fail over to link one on Fabric B, ensuring the blades were still able to send and receive data.  When link one on IOM A was repaired or replaced, data traffic would immediately be able to start using the A path again.

IOM A will not automatically divert traffic from blades one and five to an operational link, nor is this possible through a manual process.  The reason for this is that diverting traffic from blades one and five to the remaining links would further oversubscribe those links and degrade servers that should be unaffected by the failure of link one.  In a real world data center a failed link will be quickly replaced and the only servers that will have been affected are blades one and five.

In the event that the link cannot be repaired quickly there is a manual process called re-acknowledgement which an administrator can perform.  This process will adjust the pinning of IOM to FI links based on the number of active links using the same static pinning referenced above.  In the above example servers would be re-pinned based on two active ports because three port configurations are not supported. 
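A short sketch helps show why the pinning is so predictable.  The modulo mapping below is an assumption chosen because it reproduces the example in this post (with four uplinks, blades one and five share link one); it is not taken from Cisco documentation.  The supported uplink counts of 1, 2 and 4 follow from the statement above that three port configurations are not supported.

```python
SUPPORTED_UPLINK_COUNTS = (1, 2, 4)  # three-link configurations are not supported


def pin(slot, uplink_count):
    """Assumed static mapping of a blade slot (1-8) to an IOM uplink (1-based)."""
    return (slot - 1) % uplink_count + 1


def pinning_table(uplink_count, slots=8):
    return {slot: pin(slot, uplink_count) for slot in range(1, slots + 1)}


def affected_blades(failed_link, uplink_count, slots=8):
    """Blades that lose this fabric when one uplink fails."""
    table = pinning_table(uplink_count, slots)
    return [slot for slot, link in table.items() if link == failed_link]


def links_used_after_reack(active_links):
    """Re-acknowledgement re-pins using the largest supported count."""
    usable = [c for c in SUPPORTED_UPLINK_COUNTS if c <= active_links]
    if not usable:
        raise ValueError("no active uplinks left on this IOM")
    return max(usable)


print(affected_blades(failed_link=1, uplink_count=4))  # blades pinned to link one
print(links_used_after_reack(3))                       # three active links remain
print(pinning_table(links_used_after_reack(3)))        # new slot-to-link layout
```

The sketch does not model which physical ports remain in use after re-acknowledgement, only how many links the blades are spread across.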

Overall this failure method and static pinning mechanism provide very predictable bandwidth management and limit the scope of impact for link failures.

Summary:

The UCS system architecture is uniquely designed to minimize management points and maximize link utilization by removing dependence on STP internally.  Because of its unique design network failure scenarios must be clearly understood in order to maximize the benefits provided by UCS.  The advanced failure management tools within UCS will provide for increased application uptime and application throughput in failure scenarios if properly designed.


FCoE multi-hop; Do you Care?

There is a lot of discussion in the industry around FCoE’s current capabilities, and specifically around the ability to perform multi-hop transmission of FCoE frames and the standards required to do so.  A recent discussion between Brad Hedlund at Cisco and Ken Henault at HP (http://bit.ly/9Kj7zP) prompted me to write this post.  Ken proposes that FCoE is not quite ready and Brad argues that it is. 

When looking at this discussion remember that Cisco has had FCoE products shipping for about 2 years, and has a robust product line of devices with FCoE support including: UCS, Nexus 5000, Nexus 4000 and Nexus 2000, with more products on the road map for launch this year.  No other switching vendor has this level of current commitment to FCoE.  For any vendor with a less robust FCoE portfolio it makes no sense to drive FCoE sales and marketing at this point and so you will typically find articles and blogs like the one mentioned above.  The one quote from that blog that sticks out in my mind is:

“Solutions like HP’s upcoming FlexFabric can take advantage of FCoE to reduce complexity at the network edge, without requiring a major network upgrades or changes to the LAN and SAN before the standards are finalized.”

If you read between the lines here it would be easy to take this as ‘FCoE isn’t ready until we are.’  This is not unusual and if you take a minute to search through articles about FCoE over the last 2-3 years you’ll find that Cisco has been a big endorser of the protocol throughout (because they actually had a product to sell) and other vendors become less and less anti-FCoE as they announce FCoE products.

It’s also important to note that Cisco isn’t the only vendor out there embracing FCoE: NetApp has been shipping native FCoE storage controllers for some time, EMC has them road mapped for the very near future, QLogic is shipping a 2nd generation Converged Network Adapter, and Emulex has fully embraced 10Gig Ethernet as the way forward with their OneConnect adapter (10GE, iSCSI, FCoE all in one card).  Additionally, FCoE switching of native Fibre Channel storage is widely supported by the storage community.

Fibre Channel over Ethernet (FCoE) is defined in the T11 FC-BB-5 standard and requires the switches it traverses to support the IEEE Data Center Bridging (DCB) standards for proper traffic treatment on the network.  For more information on FCoE or DCB see my previous posts on the subjects (FCoE: http://www.definethecloud.net/?p=80, DCB: http://www.definethecloud.net/?p=31).

DCB has four major components, and the one in question in the above article is Quantized Congestion Notification (QCN), which the article states is required for multi-hop FCoE.  QCN is basically a regurgitation of FECN and BECN from frame relay.  It allows a switch to monitor its buffers and push congestion to the edge rather than clog the core.  In the comments Brad correctly states that QCN is not required for FCoE.  The reason is that Fibre Channel operates today without any native version of QCN, so when placing it on Ethernet you do not need to add functionality that wasn’t there to begin with.  Remember that Ethernet is just a new layer 1-2 for native FC layers 2-4; the FC secret sauce remains unmodified.  Remember also that not every standard defined by a standards body has to be adhered to by every device: some are required, some are optional.  Logical SANs are a great example of an optional standard.
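For readers who want a feel for what "monitor its buffers and push congestion to the edge" means, here is a heavily simplified Python sketch of the QCN feedback idea.  The feedback expression follows the general shape used in QCN descriptions (queue offset from a target plus a weighted queue growth term), but the constants, units, and function names are illustrative assumptions, and the sketch ignores quantization, sampling, and rate recovery entirely.

```python
def congestion_feedback(queue_len, prev_queue_len, q_eq, w=2.0):
    """Congestion point (switch): negative result means 'slow down'.

    q_eq is the target queue depth; w weights how fast the queue is growing.
    Units are arbitrary for this sketch.
    """
    q_off = queue_len - q_eq                # how far above the target we are
    q_delta = queue_len - prev_queue_len    # how fast the queue is growing
    return -(q_off + w * q_delta)


def react(rate_gbps, feedback, gd=1 / 128):
    """Reaction point (edge sender): cut the rate in proportion to feedback."""
    if feedback >= 0:
        return rate_gbps                    # no congestion message is sent
    decrease = min(gd * abs(feedback), 0.5) # never cut by more than half
    return rate_gbps * (1 - decrease)


fb = congestion_feedback(queue_len=40, prev_queue_len=30, q_eq=26)
print(fb)              # negative: the queue is above target and growing
print(react(10.0, fb)) # the edge sender slows down, relieving the core
```

Note that none of this exists in native Fibre Channel today, which is the point being made about QCN not being a requirement for FCoE.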

Rather than discuss what is or isn’t required for multi-hop FCoE I’d like to ask a more important question that we as engineers tend to forget: Do I care?  This question is key because it avoids having us argue the technical merits of something we may never actually need, or may not have a need for today.

Do we care?

First let’s look at why we do multi-hop anything: to expand the port-count of our network.  Take TCP/IP networks and the internet for example, we require the ability to move packets across the globe through multiple routers (hops.)  This is in order to attach devices on all corners of the globe.

Now let’s look at what we do with FC today: typically one or two hop networks (sometimes three) used to connect several hundred devices (occasionally but rarely more).  It’s actually quite common to find FC implementations with fewer than 100 attached ports.  This means that if you can hit the right port count without multiple hops you can remove complexity and decrease latency.  In Storage Area Networks (SAN) we call this the collapsed core design.

The second thing to consider is a hypothetical question: If FCoE were permanently destined for single hop access/edge only deployments (it isn’t) should that actually stop you from using it?  The answer here is an emphatic no, I would still highly recommend FCoE as an access/edge architecture even if it were destined to connect back to an FC SAN and Ethernet LAN for all eternity.  Let’s jump to some diagrams to explain.  In the following diagrams I’m going to focus on Cisco architecture because as stated above they are currently the only vendor with a full FCoE product portfolio.

 image

In the above diagram you can see a fairly dynamic set of FCoE connectivity options.  Nexus 5000 can be directly connected to servers, or to Nexus 4000 in IBM BladeCenter to pass FCoE.  It can also be connected to 10GE Nexus 2000s to increase its port density.

To use the Nexus 5000 + 2000 as an example, it’s possible to create a single-hop (the 2000 isn’t an L2 hop, it is an extension of the 5000) FCoE architecture of up to 384 ports with one point of switching management per fabric.  If you take server virtualization into the picture and assume 384 servers with a very modest V2P ratio of 10 virtual machines to 1 physical machine, that brings you to 3840 servers connected to a single hop SAN.  That is major scalability with minimal management, all without the need for multi-hop.  The diagram above doesn’t include the Cisco UCS product portfolio, which architecturally supports up to 320 FCoE connected servers/blades.

The next thing I’ve asked you to think about is whether or not you should implement FCoE in a hypothetical world where FCoE stays an access/edge architecture forever.  The answer would be yes.  In the following diagrams I outline the benefits of FCoE as an edge only architecture.

image

The first benefit is reducing the networks that are purchased, managed, powered, and cooled from 3 to 1 (2 FC and 1 Eth to 1 FCoE).  Even just at the access layer this is a large reduction in overhead and reduces the refresh points as I/O demands increase.

image

The second benefit is the overall infrastructure reduction at the access layer.  Taking a typical VMware server as an example, we reduce 6x 1GE ports, 2x 4GFC ports and the 8 cables required for them to 2x 10GE ports carrying FCoE.  This increases total bandwidth available while greatly reducing infrastructure.  Don’t forget the 4 top-of-rack switches (2x FC, 2x GE) reduced to 2 FCoE switches.
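The per-server consolidation is easy to check with a few lines of Python.  This is just the arithmetic from the example above using nominal link speeds; the dictionary layout and function name are mine.

```python
def summarize(ports):
    """Return (cable count, total nominal Gbps) for a set of server ports.

    ports maps a label to (port count, nominal Gbps per port).
    """
    cables = sum(count for count, _ in ports.values())
    bandwidth = sum(count * gbps for count, gbps in ports.values())
    return cables, bandwidth


before = {"1GE": (6, 1), "4G FC": (2, 4)}  # typical VMware host today
after = {"10GE FCoE": (2, 10)}             # same host on a converged edge

print(summarize(before))  # (cables, Gbps) before consolidation
print(summarize(after))   # (cables, Gbps) after consolidation
```

That works out to 8 cables and 14Gb of nominal bandwidth before versus 2 cables and 20Gb after.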

Since FCoE is fully compatible with both FC and pre-DCB Ethernet this requires 0 rip-and-replace of current infrastructure.  FCoE is instead used to build out new application environments or expand existing environments while minimizing infrastructure and complexity.

What if I need a larger FCoE environment?

If you require a larger environment than is currently supported extending your SAN is quite possible without multi-hop FCoE.  FCoE can be extended using existing FC infrastructure.  Remember customers that require an FCoE infrastructure this large already have an FC infrastructure to work with.

image 

What if I need to extend my SAN between data centers?

FCoE SAN extension is handled in exactly the same way as FC SAN extension: CWDM, DWDM, dark fiber, or FCIP.  Remember we’re still moving Fibre Channel frames.

image

Summary:

FCoE multi-hop is not an argument that needs to be had for most current environments.  FCoE is a supplemental technology to current Fibre Channel implementations.  Multi-hop FCoE will be available by the end of CY2010, allowing 2+ tier FCoE networks with multiple switches in the path, but there is no need to wait for it to begin deploying FCoE.  The benefits of an FCoE deployment at the access layer only are significant, and many environments will be able to scale to full FCoE roll-outs without ever going multi-hop.


Why Cisco UCS is my ‘A-Game’ Server Architecture

A-Game:

When I discuss my A-Game it’s my go to hardware vendor for a specific data center component.  For example I have an A-Game platform for:

  • Storage
  • SAN
  • LAN (access Layer LAN specifically, you don’t want me near your aggregation, core or WAN)
  • Servers and Blades (traditionally this has been one vendor for both)

As this post is in regards to my server A-Game I’ll leave the rest undefined for now and may blog about them later.

Over the last 4 years I’ve worked in some capacity or another as an independent customer advisor or consultant with several vendor options to choose from.  This has been either with a VAR or a strategic consulting firm (such as www.fireflycom.net).  In both cases there is typically a company lean one way or another, but my role has given me the flexibility to choose the right fit for the customer, not my company or the vendors, which is what I personally strive to do.  I’m not willing to stake my own integrity on what a given company wants to push today.  I’ve written about my thoughts on objectivity in a previous blog (http://www.definethecloud.net/?p=112).

Another rule in regards to my A-Game is that it’s not a rule, it’s a launching point.  I start with a specific hardware set in mind in order to visualize the customer need and analyze the best way to meet that need.  If I hit a point of contention that negates the use of my A-Game I’ll fluidly adapt my thinking and proposed architecture to one that better fits the customer.  These points of contention may be either technical, political, or business related:

  • Technical: My A-Game doesn’t fit the customer’s requirement due to some technical factor, support, feature, etc.
  • Political: My A-Game doesn’t fit the customer because they don’t want Vendor X (previous bad experience, hype, understanding, etc.)
  • Business: My A-Game isn’t on an approved vendor list, or something similar.

If I hit one of these roadblocks I’ll shift my vendor strategy for the particular engagement without a second thought.  The exception to this is if one of these roadblocks isn’t actually a roadblock and my A-Game definitely provides the best fit for the customer I’ll work with the customer to analyze actual requirements and attempt to find ways around the roadblock.

Basically my A-Game is a product or product line that I’ve personally tested, worked with and trust above the others that is my starting point for any consultative engagement.

A quick read through my blog page or a jump through my links will show that I work closely with Cisco products and it would be easy to assume that I am therefore inherently skewed towards Cisco.  In reality the opposite is true, over the last few years I’ve had the privilege to select my job(s) and role(s) based on the products I want to work with.

My sordid UCS history:

As anyone who’s worked with me can attest to I’m not one to pull punches, feign friendliness, or accept what you try and sell me based on a flashy slide deck or eloquent rhetoric.  If you’re presenting to me don’t expect me to swallow anything without proof, don’t expect easy questions, and don’t show up if you can’t put the hardware in my hands to cash the checks your slides write.  When I’m presenting to you, I expect and encourage the same.

Prior to my exposure to UCS I worked with both IBM and HP servers and blades.  I am an IBM Certified Blade Expert (although dated at this point.)  IBM was in fact my A-Game server and blade vendor.  This had a lot to do with the technology of the IBM systems as well as the overall product portfolio IBM brought with it.  That being said I’d also be willing to concede that HP blades have moved above IBM’s in technology and innovation, although IBM’s MAX5 is one way IBM is looking to change that.

When I first heard about Cisco’s launch into the server market I thought, and hoped, it was a joke.  I expected some Frankenstein of a product where I’d place server blades in Nexus or Catalyst chassis.  At the time I was working heavily with the Cisco Nexus product line primarily 5000, 2000, and 1000v.  I was very impressed with these products, the innovation involved, and the overall benefit they’d bring to the customer.  All the love in the world for the Nexus line couldn’t overcome my feeling that there was no way Cisco could successfully move into servers.

Early in 2009 my resume was submitted among several others by my company to Learning at Cisco and the business unit in charge of UCS.  This was part of an application process for learning partners in order to be invited to the initial external Train The Trainer (TTT) and participate in training UCS to: Cisco, partners, and customers worldwide.  Myself and two other engineer/trainers (Dave Alexander and Fabricio Grimaldi) were selected from my company to attend.  The first interesting thing about the process was that the three of us were selected above CCIEs, 2x CCIEs and more experienced instructors from our company based on our server backgrounds.  It seemed Cisco really was looking to push servers not some network adaptation.

During the TTT I remained very skeptical.  The product looked interesting but not ‘game-changing.’  The user interfaces were lacking and definitely showed their Alpha and Beta colors.  Hardware didn’t always behave as expected and the real business/technological benefits of the product didn’t shine through.  That being said remember that at this point the product was months away from launch and this was a very Beta version of hardware/software we were working with.  Regardless of the underlying reasons I walked away from the TTT feeling fully underwhelmed.

I spent the time on my flight back to the East Coast from San Jose looking through my notes and thinking about the system and components.  It definitely had some interesting concepts but I didn’t feel it was a platform I would stake my name to at this point.

Over the next couple of months Fabricio Grimaldi and I assisted Dave Alexander (http://theunifiedcomputingblog.com) in developing the UCS Implementation certification course.  Through this process I spent a lot of time digging into the underlying architecture, relating it back to my server admin days and white boarding the concepts and connections in my home office.  Additionally I got more and more time on the equipment to ‘kick-the-tires.’  During this process Dave, Fabricio and I began instructing an internal Cisco course known as UCS Bootcamp.  The course was designed for Cisco engineers from both pre-sales and post-sales roles and focused specifically on the technology as a product deep dive.

It was over these months having discussions on the product, wrapping my head around the technology, and developing training around the components that the lock cylinders in my brain started to click into place and finally the key turned: UCS changes the game for server architecture, the skeptic had become a convert.

UCS the game changer:

The term game changer gets thrown around all willy-nilly in this industry.  Every minor advancement is touted by its owner as a ‘Game Changer.’  In reality ‘Game Changers’ are few and far between.  In order to qualify you must actually shift the status quo, not just improve upon it.  To use vacuums as an example, if your vacuum sucks harder it just sucks harder, it doesn’t change the game.  A Dyson vacuum may vacuum better than anyone else’s but Roomba (http://www.irobot.com/uk/home_robots.cfm) is the one that changed the game.  With Dyson I still have to push the damn thing around the living room, with Roomba I watch it go.

In order to understand why UCS changes the game rather than improving upon it, you first need to define UCS:

UCS is NOT a blade system it is a server architecture

Cisco’s Unified Computing System (UCS) is not all about blades; it is about rack mount servers, blade servers, and management being used as a flexible pool of computing resources.  Because of this it has been likened to an x86-64 based mainframe system.

UCS takes a different approach from the original blade system designs.  It’s not a solution for data center point problems (power, cooling, management, space, cabling) in isolation; it’s a redefinition of the way we do computing.

Instead of asking ‘How can I improve upon current architectures?’

Cisco/Nuova asked

‘What’s the purpose of the server and what’s the best way to accomplish that goal?’

Many of the ideas UCS utilizes have been tried and implemented in other products before: Unified I/O, single point of management, modular scalability, etc., but never all in one cohesive design.

There are two major features of UCS that I call ‘the cake’ and three more that are really icing.  The two cake features are the reason UCS is my A-Game and the others just further separate it.

  • Unified Management
  • Workload Portability

Unified Management:

Blade architectures are traditionally built with space savings as a primary concern.  In order to do this a blade chassis is built with a shared LAN, SAN, power, cooling infrastructure and an onboard management system to control server hardware access, fan speeds, power levels, etc.  M. Sean McGee describes this much better than I could hope to in his article The “Mini-Rack” approach to Blade Design (http://bit.ly/bYJVJM.)  This traditional design saves space and can also save on overall power, cooling, and cabling but causes pain points in management among other considerations.

UCS was built from the ground up with a different approach, and Cisco has the advantage of zero legacy server investment which allows them to execute on this.  The UCS approach is:

  • Top-of-Rack networking should be Top-Of-Rack not repeated in each blade chassis.
  • Management should encompass the entire architecture not just a single chassis.
  • Blades are only 40% of the data center server picture, rack mounts should not be excluded.

The UCS Approach

image

The key difference here is that all management of the LAN, SAN, server hardware, and chassis itself is pulled into the access layer and performed on the UCS Fabric Interconnect which provides all of the switching and management functionality for the system.  The system itself was built from the ground up with this in mind, and as such this is designed into each hardware component.  Other systems that provide a single point of management do so by layering on additional hardware and software components in order to manage underlying component managers.  Additionally these other systems only manage blade enclosures while UCS is designed to manage both blades and traditional rack mounts from one point.  This functionality will be available in firmware by the end of CY10.

To put this in perspective Cisco UCS provides a very similar rapid repeatable physical server deployment model to the virtual server deployment model VMware provides.  Through the use of granular Role Based Access Control (RBAC) UCS ensures that organizational changes are not required, while at the same time providing the flexibility to streamline people and process if desired.

Workload Portability:

Workload portability has significant benefits within the data center, the concept itself is usually described as ‘statelessness.’  If you’re familiar with VMware this is the same flexibility VMware provides for virtual machines, i.e. there is no tie to the underlying hardware. One of the key benefits of UCS is the ability to apply this type of statelessness at the hardware level.  This removes the tie of the server or workload to the blade or slot it resides in, and provides major flexibility to maintenance and repair cycles, as well as deployment times for new or expanding applications.

Within UCS all management is performed on the Fabric Interconnect through the UCS Manager GUI or CLI.  This includes any network configuration for blades, chassis, or rack-mounts, and all server configuration including firmware, BIOS, NIC/HBA settings and boot order, among other things.  The actual blade is configured through an object called a ‘service profile.’  This profile defines the server on the network as well as the way in which the server hardware operates (BIOS/Firmware, etc.)

All of the settings contained within a service profile are traditionally configured, managed and stored in hardware on a server.  Because these are now defined in a configuration file, the underlying hardware tie is stripped away and a server workload can be quickly moved from one physical blade to another without requiring changes in the networks or storage arrays.  This decreases maintenance windows and speeds roll-out.
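One way to picture this statelessness is to treat the server identity as plain data that gets attached to whatever blade is available.  The Python sketch below is purely conceptual: the field names, example addresses, and the `move_profile` helper are hypothetical and are not the UCS Manager object model or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServiceProfile:
    """Identity and settings that traditionally live in server hardware."""
    name: str
    mac: str            # LAN identity
    wwpn: str           # SAN identity
    boot_order: tuple
    bios_policy: str


@dataclass
class Blade:
    chassis: int
    slot: int
    profile: Optional[ServiceProfile] = None


def move_profile(src, dst):
    """Re-home a workload: the identity follows the profile, not the blade."""
    if src.profile is None:
        raise ValueError("source blade has no service profile")
    if dst.profile is not None:
        raise ValueError("destination blade is already in use")
    dst.profile, src.profile = src.profile, None


web01 = ServiceProfile(
    name="web01",
    mac="00:25:b5:00:00:01",          # example values only
    wwpn="20:00:00:25:b5:00:00:01",
    boot_order=("san", "lan"),
    bios_policy="default",
)
failing = Blade(chassis=1, slot=3, profile=web01)
spare = Blade(chassis=2, slot=7)

move_profile(failing, spare)
print(spare.profile.mac, spare.profile.wwpn)  # same identity on new hardware
```

Because the MAC and WWPN travel with the profile, the LAN switches and storage arrays see the same server after the move, which is why no network or zoning changes are needed.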

Within UCS, Service Profiles can be created using templates or pools which is unique to UCS.  This further increases the benefits of service profiles and decreases the risk inherent with multiple configuration points, and case-by-case deployment models.
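To illustrate why templates and pools reduce configuration risk, here is a small conceptual sketch in Python.  The `AddressPool` class, the `instantiate` helper, and the address prefixes are all assumptions made up for this example; they do not mirror how UCS Manager stores or allocates identities.

```python
from itertools import count


class AddressPool:
    """Hands out unique addresses from a prefix, one per request."""

    def __init__(self, prefix, start=1):
        self.prefix = prefix
        self._counter = count(start)

    def allocate(self):
        n = next(self._counter)
        if n > 0xFF:
            raise RuntimeError("pool exhausted in this one-octet sketch")
        return f"{self.prefix}:{n:02x}"


def instantiate(template, name, mac_pool, wwpn_pool):
    """Build a profile: shared settings from the template, unique IDs from pools."""
    return {
        "name": name,
        "mac": mac_pool.allocate(),
        "wwpn": wwpn_pool.allocate(),
        **template,
    }


esx_template = {"boot_order": ("san", "lan"), "bios_policy": "virtualization"}
macs = AddressPool("00:25:b5:00:00")          # example prefix only
wwpns = AddressPool("20:00:00:25:b5:00:00")   # example prefix only

profiles = [instantiate(esx_template, f"esx{i:02d}", macs, wwpns) for i in (1, 2, 3)]
for p in profiles:
    print(p["name"], p["mac"], p["wwpn"], p["boot_order"])
```

Every server built this way gets identical settings and guaranteed-unique identities, instead of relying on someone typing them in correctly on a case-by-case basis.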

UCS Profiles and Templates

image

These two features and their real world applications and value are what place UCS in my A-Game slot.  These features will provide benefits to ANY server deployment model, and are unique to UCS.  While subcomponents exist within other vendors they are not:

  • Designed into the hardware
  • Fully integrated without the need for additional hardware and software and licensing
  • As robust

Icing on the cake:

  • Dual socket server memory scalability and flexibility (Cisco memory expander technology)
  • Integration with VMware and advanced networking for virtual switching
  • Unified fabric (I/O consolidation)

Each of these features also offers real world benefits, but the real heart of UCS is the unified management and server statelessness.  You can find more information on these other features through various blogs and Cisco documentation.

When is it time for my B-Game?:

By now you should have an understanding as to why I chose UCS as my A-Game (not to say you necessarily agree, but that you understand my approach.)  So what are the factors that move me towards my B-Game?  I will list three considerations and the qualifying question that would finalize a decision to position a B-Game system:

  • Infiniband: If the customer is using Infiniband for networking, UCS does not currently support it.  I would first assess whether there was an actual requirement for Infiniband or if it was just the best option at the time of last refresh.  If Infiniband is required I would move to another solution.
  • Non-Intel processors: A requirement for non-Intel processors would steer me towards another vendor, as UCS does not currently support non-Intel.  As above, I would first verify whether non-Intel was a requirement or a choice.
  • Requirement for chassis based storage: If a customer had a requirement for chassis based storage, there is no current Cisco offering for this within UCS.  This is however very much a corner case and only a configuration I would typically recommend for single chassis deployments with little need to scale.  In-chassis storage becomes a bottleneck rather than a benefit in multi-chassis configurations.

While there are other reasons I may have to look at another product for a given engagement they are typically few and far between.  UCS has the right combination of entry point and scalability to hit a great majority of server deployments.  Additionally as a newer architecture there is no concern with the architectural refresh cycle of other vendors.  As other blade solutions continue to age there will be an increased risk to the customer in regards to forward compatibility.

Summary:

UCS is not the only server or blade system on the market, but it is the only complete server architecture.  Call it consolidated, unified, virtualized, whatever but there isn’t another platform to combine rack-mounts and blades under a single architecture with a single management window and tools for rapid deployment.  The current offering is appropriate for a great majority of deployments and will continue to get better.

If you’re considering a server refresh or new deployment it would be a mistake not to take a good look at the UCS architecture.  Even if it’s not what you choose it may give you some ideas as to how you want to move forward, or features to ask your chosen vendor for.

Even if you never buy a UCS server you can still thank Cisco for launching UCS.  The lower pricing you’re getting today, and the features being put in place on other vendors’ product lines, are being driven by a new server player in the market and the innovation they launched with.

Comments, concerns, complaints always appreciated!


FCoE Initialization Protocol (FIP) Deep Dive

In an attempt to clarify my future posts I will begin categorizing a bit.  The following post will be part of a Technical Deep Dive series.

Fibre Channel over Ethernet (FCoE) is a protocol designed to move native Fibre Channel over 10 Gigabit Ethernet and faster links; I’ve described the protocol in a previous post (http://www.definethecloud.net/?p=80).  In order for FCoE to work we need a mechanism to carry the base Fibre Channel port / device login mechanisms over Ethernet.  These are the processes by which a port logs in and obtains a routable Fibre Channel address.  Let’s start with some background and definitions:

DCB: Data Center Bridging
FC: Native Fibre Channel Protocol
FCF: Fibre Channel Forwarder (an Ethernet switch capable of handling encapsulation/de-encapsulation of FCoE frames and some or all FC services)
FCID: Fibre Channel ID (24-bit routable address)
FCoE: Fibre Channel over Ethernet
FC-MAP: A 24-bit value identifying an individual fabric
FIP: FCoE Initialization Protocol
FLOGI: FC Fabric Login
FPMA: Fabric Provided MAC Address
PLOGI: FC Port Login
PRLI: Process Login
SAN: Storage Area Network (switching infrastructure)
SCSI: Small Computer Systems Interface
 
Now for the background.  You’ll never grasp FIP properly if you don’t first get the fundamentals of FC:
 
[Figure: N_Port Initialization]

 

When a node comes online its port is considered an N_Port.  When an N_Port connects to the SAN it will connect to a switch port defined as a Fabric Port, or F_Port (this assumes you’re using a switched fabric).  All N_Ports operate the same way when they are brought online:

  1. FLOGI – Used to obtain a routable FCID for use in FC frame exchange.  The switch will provide the FCID during a FLOGI exchange.
  2. PLOGI – Used to register the N_Port with the FC name server

At this point a target’s (disk or storage array) job is done; it can now sit and wait for requests.  An initiator (server), on the other hand, needs to perform a few more tasks to discover available targets:

  1. Query – Request available targets from the FC name server, zoning will dictate which targets are available.
  2. PLOGI – A 2nd port Login, this time into the target port.
  3. PRLI – Process login to exchange supported upper layer protocols (ULP) typically SCSI-3.

Once this process has been completed the initiator can exchange frames with the target, i.e. the server can write to disk.
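To make the ordering concrete, here is a minimal sketch of the login sequence described above.  It only models the order of the steps, not a real FC stack, and the names `COMMON_STEPS`, `INITIATOR_STEPS` and `login_sequence` are made up for the illustration.

```python
# Illustrative only: the login order described in this post, not a real FC stack.
COMMON_STEPS = [
    ("FLOGI", "obtain a routable FCID from the switch"),
    ("PLOGI", "register the N_Port with the FC name server"),
]

INITIATOR_STEPS = [
    ("Query", "ask the name server for the targets zoning allows"),
    ("PLOGI", "port login into the target port"),
    ("PRLI", "process login to exchange supported ULPs, typically SCSI-3"),
]


def login_sequence(role):
    """Return the ordered step names for a 'target' or an 'initiator'."""
    if role not in ("target", "initiator"):
        raise ValueError("role must be 'target' or 'initiator'")
    steps = list(COMMON_STEPS)
    if role == "initiator":
        # Only initiators go on to discover and log in to targets.
        steps += INITIATOR_STEPS
    return [name for name, _ in steps]


for role in ("target", "initiator"):
    print(role, "->", " -> ".join(login_sequence(role)))
```

A target stops after the first two steps, while an initiator runs all five.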

FIP:

The reason the FC login process is key to understanding FIP is that this is the process that FIP handles for FCoE networks.  FIP allows an Ethernet attached FC node (Enode) to discover existing FCFs and supports the FC login procedure over 10+GE networks.  Rather than just providing an FCID, FIP will provide an FPMA, which is a 48-bit MAC address composed of two parts: the 24-bit FC-MAP in the upper half and the 24-bit FCID in the lower half.
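As a rough sketch of how the two halves fit together, the snippet below builds an FPMA from an FC-MAP and an FCID.  The function name `build_fpma` is hypothetical, 0x0EFC00 is the commonly used default FC-MAP, and the FCID is a made-up example value.

```python
def build_fpma(fc_map, fcid):
    """Combine a 24-bit FC-MAP (upper half) and a 24-bit FCID (lower half)
    into a 48-bit FPMA, returned as a colon-separated MAC string."""
    for name, value in (("fc_map", fc_map), ("fcid", fcid)):
        if not 0 <= value <= 0xFFFFFF:
            raise ValueError(f"{name} must fit in 24 bits")
    fpma = (fc_map << 24) | fcid
    return ":".join(f"{octet:02x}" for octet in fpma.to_bytes(6, "big"))


# 0x0EFC00 is the usual default FC-MAP; the FCID here is only an example.
print(build_fpma(0x0EFC00, 0x010203))  # 0e:fc:00:01:02:03
```

Because the FCID is embedded in the MAC address, an FCF can recover it from the low three bytes of the address without a lookup table.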

[Figure: 48-bit FPMA (MAC address)]

[Figure: FIP]

So FIP provides an Ethernet MAC address, used by FCoE to traverse the Ethernet network, which contains the FCID required to be routed on the FC network.  FIP also passes the query and query response from the FC name server.  FIP uses a separate Ethertype from FCoE, and its frames are standard Ethernet size (1518 bytes, or 1522 bytes with an 802.1Q tag), whereas FCoE frames are 2242-byte jumbo frames.
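Since FIP and FCoE use different Ethertypes, a device can tell the two apart by looking at the Ethernet header alone.  The sketch below shows the idea; the Ethertype values (0x8914 for FIP, 0x8906 for FCoE, 0x8100 for an 802.1Q tag) are not stated in this post, and the helper name `classify` is made up.

```python
import struct

# EtherType values (not stated in the post itself).
ETHERTYPE_VLAN = 0x8100
ETHERTYPE_FCOE = 0x8906
ETHERTYPE_FIP = 0x8914


def classify(frame):
    """Return 'FIP', 'FCoE' or 'other' for a raw Ethernet frame (bytes)."""
    if len(frame) < 14:
        raise ValueError("frame is shorter than an Ethernet header")
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype == ETHERTYPE_VLAN:
        # Skip the 4-byte 802.1Q tag to reach the real EtherType.
        if len(frame) < 18:
            raise ValueError("truncated 802.1Q header")
        (ethertype,) = struct.unpack_from("!H", frame, 16)
    return {ETHERTYPE_FIP: "FIP", ETHERTYPE_FCOE: "FCoE"}.get(ethertype, "other")


# Twelve zero bytes stand in for the destination and source MAC addresses.
untagged_fip = bytes(12) + struct.pack("!H", ETHERTYPE_FIP)
tagged_fcoe = bytes(12) + struct.pack("!HHH", ETHERTYPE_VLAN, 3, ETHERTYPE_FCOE)
print(classify(untagged_fip), classify(tagged_fcoe))
```

This separation is what lets a switch inspect FIP control frames without touching the FCoE data path.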

FIP Snooping:

FIP snooping is used in multi-hop FCoE environments.  It is a frame inspection method that FIP snooping capable DCB devices can use to monitor FIP frames and apply policies based on the information in those frames.  This allows for:

  • Enhanced FCoE security (prevents FCoE MAC spoofing)
  • Creation of FC point-to-point links within the Ethernet LAN
  • Auto-configuration of ACLs based on name server information read in the FIP frames

[Figure: FIP Snooping]

Summary:

FIP snooping uses dynamic Access Control Lists to enforce Fibre Channel rules within the DCB Ethernet network.  This prevents Enodes from seeing or communicating with other Enodes without first traversing an FCF.
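Here is a toy model of that dynamic ACL behavior, assuming the bridge learns one (port, Enode MAC) to FCF mapping per accepted login.  The class and method names are hypothetical and the MAC addresses are made-up examples; a real bridge does this in hardware from the FIP frames it snoops.

```python
class FipSnoopingAcl:
    """Toy model of the dynamic ACL a FIP snooping bridge maintains."""

    def __init__(self, fcf_macs):
        self.fcf_macs = set(fcf_macs)  # trusted FCF MAC addresses
        self.sessions = {}             # (port, enode_mac) -> fcf_mac

    def learn_login(self, port, enode_mac, fcf_mac):
        """Record a login the bridge saw an FCF accept in a FIP frame."""
        if fcf_mac not in self.fcf_macs:
            raise ValueError("login accepted by an unknown FCF")
        self.sessions[(port, enode_mac)] = fcf_mac

    def permit_fcoe(self, port, src_mac, dst_mac):
        """Allow an FCoE frame from an Enode only toward its own FCF."""
        return self.sessions.get((port, src_mac)) == dst_mac


fcf = "00:0d:ec:aa:bb:cc"
enode = "0e:fc:00:01:02:03"
acl = FipSnoopingAcl({fcf})
acl.learn_login(1, enode, fcf)

print(acl.permit_fcoe(1, enode, fcf))                  # legitimate traffic
print(acl.permit_fcoe(2, enode, fcf))                  # same MAC, wrong port
print(acl.permit_fcoe(1, enode, "0e:fc:00:01:02:04"))  # Enode to Enode
```

Only the first frame is permitted: the spoofed MAC on another port and the direct Enode-to-Enode frame are both dropped.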

Feedback, corrections, updates, questions?


HP Flex-10, Cisco VIC, and Nexus 1000v

When discussing the underlying technologies for cloud computing topologies virtualization is typically a key building block.  Virtualization can be applied to any portion of the data center architecture from load-balancers to routers, and from servers to storage.  Server virtualization is one of the most widely adopted virtualization technologies, and provides a great deal of benefits to the server architecture. 

One of the most common challenges with server virtualization is the networking.  Virtualized servers typically consist of networks of virtual machines that are configured by the server team with little to no management/monitoring possible from the network/security teams.  This causes inconsistent policy enforcement between physical and virtual servers as well as limited network functionality for virtual machines.

[Figure: Virtual Networks]

The separate network management models for virtual servers and physical servers present challenges to policy enforcement, compliance, and security, and add complexity to the configuration and architecture of virtual server environments.  Because of this, many vendors are designing products and solutions to help draw these networks closer together.

The following is a discussion of three products that can be used for this: HP’s Flex-10 adapters, Cisco’s Nexus 1000v and Cisco’s Virtual Interface Card (VIC).

This is not a pro/con or discussion of which is better, just an overview of the technology and how it relates to VMware.

HP Flex-10 for Virtual Connect:

Using HP’s Virtual Connect switching modules for C-Class blades and either Flex-10 adapters or LAN-on-Motherboard (LOM), administrators can ‘partition the bandwidth of a single 10Gb pipeline into multiple “FlexNICs.” In addition, customers can regulate the bandwidth for each partition by setting it to a user-defined portion of the total 10Gb connection. Speed can be set from 100 Megabits per second to 10 Gigabits per second in 100 Megabit increments.’ (http://bit.ly/boRsiY)

This allows a single 10GE uplink to be presented to any operating system as 4 physical Network Interface Cards (NICs).

[Figure: FlexConnect]

In order to perform this interface virtualization FlexConnect uses internal VLAN mappings for traffic segregation within the 10GE Flex-10 port (the mid-plane blade chassis connection between the Virtual Connect Flex-10 10GbE interconnect module and the Flex-10 NIC device).  Each FlexNIC can present one or more VLANs to the installed operating system.

Some of the advantages with this architecture are:

  • A single 10GE link can be divided into 4 separate logical links each with a defined portion of the bandwidth.
  • More interfaces can be presented from fewer physical adapters which is extremely advantageous within the limited space available on blade servers.

When the installed operating system is VMware, this allows 2x10GE links to be presented to VMware as 8 separate NICs and used for different purposes such as vMotion, Fault Tolerance (FT), Service Console, VMkernel and data.
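As a quick way to sanity-check a bandwidth plan for one 10Gb port, here is a small sketch built on the numbers quoted above (up to 4 FlexNICs, 100Mb steps).  The function name and the example split are made up, and I am assuming the FlexNIC speeds on one port may not add up to more than the 10Gb link.

```python
LINK_MBPS = 10_000   # one 10Gb Flex-10 port
STEP_MBPS = 100      # allocation granularity from the HP description
MAX_FLEXNICS = 4     # FlexNICs per 10Gb port


def validate_flexnics(allocations):
    """Check a {name: Mbps} plan for one port and return the unallocated Mbps.

    Assumption: the per-FlexNIC speeds may not sum to more than the link.
    """
    if not 1 <= len(allocations) <= MAX_FLEXNICS:
        raise ValueError(f"a port carries 1 to {MAX_FLEXNICS} FlexNICs")
    for name, mbps in allocations.items():
        if mbps < STEP_MBPS or mbps > LINK_MBPS or mbps % STEP_MBPS:
            raise ValueError(f"{name}: speed must be 100..10000 Mbps in 100 Mbps steps")
    total = sum(allocations.values())
    if total > LINK_MBPS:
        raise ValueError(f"plan oversubscribes the link by {total - LINK_MBPS} Mbps")
    return LINK_MBPS - total


# Hypothetical split of one 10GE link for a VMware host.
plan = {"service_console": 500, "vmotion": 2000, "ft": 2500, "vm_data": 5000}
print(validate_flexnics(plan), "Mbps left unallocated")
```

The same check would be run once per physical port, so a 2x10GE blade gets two independent plans.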

The requirements for Flex-10 as described here are:

  • HP C-Class blade chassis
  • VC Flex-10 10GE interconnect module (HP blade switches)
  • Flex-10 LOM and/or Mezzanine cards

Cisco Nexus 1000v:

‘Cisco Nexus® 1000V Series Switches are virtual machine access switches that are an intelligent software switch implementation for VMware vSphere environments running the Cisco® NX-OS operating system. Operating inside the VMware ESX hypervisor, the Cisco Nexus 1000V Series supports Cisco VN-Link server virtualization technology to provide:

• Policy-based virtual machine (VM) connectivity

• Mobile VM security and network policy, and

• Non-disruptive operational model for your server virtualization, and networking teams’(http://bit.ly/b4JJX5.)

The Nexus 1000v is a Cisco software switch which is placed in the VMware environment and provides physical-type network control and monitoring for VMware virtual networks.  The Nexus 1000v is composed of two components: the Virtual Supervisor Module (VSM) and the Virtual Ethernet Module (VEM).  The Nexus 1000v does not have hardware requirements and can be used with any standards compliant physical switching infrastructure.  Specifically, the upstream switch should support 802.1q trunks and LACP.

[Figure: Cisco Nexus 1000v]

Using the Nexus 1000v, network teams have complete control over the virtual network and manage it using the same tools and policies used on the physical network.

Some advantages of the 1000v are:

  • Consistent policy enforcement for physical and virtual servers
  • vMotion aware policies migrate with the VM
  • Increased security, visibility and control of virtual networks

The requirements for Cisco Nexus 1000v are:

  • vSphere 4.0 or higher
  • VMware Enterprise Plus license
  • Per physical host CPU VEM license
  • Virtual Center Server

Cisco Virtual Interface Card (VIC):

The Cisco VIC provides interface virtualization similar to the Flex-10 adapter.  One 10GE port is able to be presented to an operating system as up to 128 virtual interfaces depending on the infrastructure. ‘The Cisco UCS M81KR presents up to 128 virtual interfaces to the operating system on a given blade. The virtual interfaces can be dynamically configured by Cisco UCS Manager as either Fibre Channel or Ethernet devices’ (http://bit.ly/9RT7kk.)

Fibre Channel interfaces are known as vFC and Ethernet interfaces are known as vEth; they can be used in any combination up to the architectural limits.  Currently the VIC is only available for Cisco UCS blades but will be supported on UCS rack mount servers as well by the end of 2010.  Interfaces are segregated using an internal tagging mechanism known as VN-Tag, which does not use VLAN tags and operates independently of VLAN operation.

[Figure: Virtual Interface Card]

Each virtual interface acts as if directly connected to a physical switch port and can be configured in Access or Trunk mode using 802.1q standard trunking. These interfaces can then be used by any operating system or VMware.  For more information on their use see my post Defining VN-Link (http://bit.ly/ddxGU7.)

VIC Advantages:

  • Granular configuration of multiple Fibre Channel and Ethernet ports on one 10GE link.
  • Single point of network configuration handled by a network team rather than a server team.

Requirements:

  • Cisco UCS B-series blades (until C-Series support is released)
  • Cisco Fabric interconnect access layer switches/managers.

Summary:

Each of these products has benefits in specific use cases and can reduce overhead and/or administration for server networks.  When combining two or more of these products you should carefully analyze the benefits of each and identify features that may be sacrificed by the combination.  For instance, using the Nexus 1000v along with FlexConnect adds a server-administered network management layer in between the physical network and virtual network.

[Figure: Nexus 1000v with Flex-10]

Comments and corrections are always welcome.


Post defining VN-Link

For the Cisco fans or those curious about Cisco’s VN-Link, see my post on my colleague’s Unified Computing Blog: http://bit.ly/dqIIQK.


Fibre Channel over Ethernet

Fibre Channel over Ethernet (FCoE) is a protocol standard ratified in June of 2009.  FCoE provides the tools for encapsulation of Fibre Channel (FC) in 10 Gigabit Ethernet frames.  The purpose of FCoE is to allow consolidation of low-latency, high performance FC networks onto 10GE infrastructures.  This allows for a single network/cable infrastructure which greatly reduces switch and cable count, lowering the power, cooling, and administrative requirements for server I/O.

FCoE is designed to be fully interoperable with current FC networks and requires little to no additional training for storage and IP administrators. FCoE operates by encapsulating native FC into Ethernet frames.  Native FC is considered a ‘lossless’ protocol, meaning frames are not dropped during periods of congestion.  This is by design in order to ensure the behavior expected by the SCSI payloads.  Traditional Ethernet does not provide the tools for lossless delivery on shared networks, so enhancements were defined by the IEEE to provide appropriate transport of encapsulated Fibre Channel on Ethernet networks.  These standards are known as Data Center Bridging (DCB), which I’ve discussed in a previous post (http://www.definethecloud.net/?p=31).  These Ethernet enhancements are fully backward compatible with traditional Ethernet devices, meaning DCB capable devices can exchange standard Ethernet frames seamlessly with legacy devices.  The full 2148 Byte FC frame is encapsulated in an Ethernet jumbo frame, avoiding any modification/fragmentation of the FC frame.

FCoE itself takes FC layers 2-4 and maps them to Ethernet layers 1-2; this replaces the FC-0 physical layer and the FC-1 encoding layer.  This mapping between Ethernet and Fibre Channel is done through a Logical End-Point (LEP) which can be thought of as a translator between the two protocols.  The LEP is responsible for providing the appropriate encoding and physical access for frames traveling from FC nodes to Ethernet nodes and vice versa.  There are two devices that typically act as FCoE LEPs: Fibre Channel Forwarders (FCF), which are switches capable of both Ethernet and Fibre Channel, and Converged Network Adapters (CNA), which provide the server-side connection for an FCoE network.  Additionally, the LEP operation can be done using a software initiator and traditional 10GE NICs, but this places extra workload on the server processor rather than offloading it to adapter hardware.

One of the major advantages of replacing FC layers 0-1 when mapping onto 10GE is the encoding overhead.  8Gb Fibre Channel uses 8b/10b encoding, which adds 25% overhead on top of the data, while 10GE uses 64b/66b encoding, which adds about 3%, dramatically reducing the protocol overhead and increasing throughput.  The second major advantage is that FCoE maintains FC layers 2-4, which allows seamless integration with existing FC devices and maintains the Fibre Channel tool set such as zoning, LUN masking etc.  In order to provide FC login capabilities, multi-hop FCoE networks, and FC zoning enforcement on 10GE networks, FCoE relies on another standard known as FCoE Initialization Protocol (FIP), which I will discuss in a later post.
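The encoding comparison is easy to check with a few lines of arithmetic.  This sketch compares the 8b/10b line code used by 8Gb Fibre Channel with the 64b/66b line code used by 10GE; the function name `encoding_overhead` is just for illustration.

```python
def encoding_overhead(data_bits, coded_bits):
    """Return (overhead relative to the data, efficiency on the wire)."""
    if not 0 < data_bits <= coded_bits:
        raise ValueError("expected 0 < data_bits <= coded_bits")
    extra = coded_bits - data_bits
    return extra / data_bits, data_bits / coded_bits


for name, data_bits, coded_bits in (("8b/10b", 8, 10), ("64b/66b", 64, 66)):
    overhead, efficiency = encoding_overhead(data_bits, coded_bits)
    print(f"{name}: {overhead:.1%} overhead, {efficiency:.1%} efficient")
# 8b/10b: 25.0% overhead, 80.0% efficient
# 64b/66b: 3.1% overhead, 97.0% efficient
```

In other words, an 8b/10b link spends 2 of every 10 bits on the wire on encoding, while a 64b/66b link spends 2 of every 66.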

Overall FCoE is one protocol to choose from when designing converged networks, or cable-once architectures.  The most important thing to remember is that a true cable-once architecture doesn’t make you choose your Upper Layer Protocol (ULP) such as FCoE, only your underlying transport infrastructure.  If you choose 10GE the tools are now in place to layer any protocol of your choice on top, when and if you require it.

Thanks to my colleagues who recently provided a great discussion on protocol overhead and frame encoding…


Data Center Bridging Exchange

Data Center Bridging Exchange (DCBX) is one of the components of the DCB standards.  These standards offer enhancements to standard Ethernet which are backwards compatible with traditional Ethernet and provide support for I/O Consolidation (http://www.definethecloud.net/?p=18).  The three purposes of DCBX are:

Discovery of DCB capability:

The ability for DCB capable devices to discover and identify capabilities of DCB peers as well as identify non-DCB capable legacy devices.  You can find more information on DCB in a previous post (http://www.definethecloud.net/?p=31.)

Identification of misconfigured DCB features:

The ability to discover misconfiguration of features that require symmetric configuration between DCB peers.  Some DCB features are asymmetric meaning they can be configured differently on each end of a link, other features must match on both sides to be effective (symmetric.)  This functionality allows detection of configuration errors for these symmetric features.

Configuration of Peers:

A capability allowing DCBX to pass configuration information to a peer.  For instance a DCB capable switch can pass Priority Flow Control (PFC) information on to a Converged Network Adapter (CNA) to ensure FCoE traffic is appropriately tagged and pause is enabled for the chosen Class of Service (CoS) value.  This PFC exchange is a symmetric exchange and must match on both sides of the link.  DCB features such as Enhanced Transmission Selection (ETS) otherwise known as bandwidth management can be configured asymmetrically (different on each side of the link.)
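The symmetric versus asymmetric distinction can be sketched as a simple comparison of two peers' settings.  The dictionary layout, the name `find_mismatches`, and the example values (priority 3, the ETS percentages) are all assumptions for illustration; this is not how DCBX TLVs are actually encoded.

```python
# Features that must match on both ends of the link (symmetric).
# ETS is deliberately absent: it may be configured differently per side.
SYMMETRIC_FEATURES = {"pfc"}


def find_mismatches(local, peer):
    """Return the sorted symmetric features whose settings differ."""
    return sorted(
        feature
        for feature in SYMMETRIC_FEATURES
        if local.get(feature) != peer.get(feature)
    )


switch = {"pfc": {"enabled_priorities": [3]}, "ets": {"fcoe_percent": 50}}
good_cna = {"pfc": {"enabled_priorities": [3]}, "ets": {"fcoe_percent": 40}}
bad_cna = {"pfc": {"enabled_priorities": [4]}, "ets": {"fcoe_percent": 50}}

print(find_mismatches(switch, good_cna))  # ETS differs, which is allowed
print(find_mismatches(switch, bad_cna))   # PFC differs, which is flagged
```

Only the PFC difference is reported, because PFC has to agree on both sides of the link to be effective.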

DCBX relies on Link Layer Discovery Protocol (LLDP) in order to pass this information and configuration.  LLDP is an industry-standard counterpart to Cisco Discovery Protocol (CDP) which allows devices to discover one another and exchange information about basic capabilities.  Because DCBX relies on LLDP and is an acknowledged protocol (2-way communication), any link which is intended to support DCBX must have LLDP enabled on both sides of the link for Tx and Rx.  When a port has LLDP disabled for either Rx or Tx, DCBX is disabled on the port and DCBX Type Length Values (TLVs) within received LLDP frames will be ignored.
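The dependency on LLDP boils down to a simple rule, sketched below.  The `Port` class and its field names are hypothetical and do not correspond to any particular switch's configuration model.

```python
from dataclasses import dataclass


@dataclass
class Port:
    lldp_tx: bool = True
    lldp_rx: bool = True
    dcbx_admin_enabled: bool = True  # DCBX itself can be switched off


def dcbx_operational(port):
    """DCBX runs only when it is enabled and LLDP is on in both directions."""
    return port.dcbx_admin_enabled and port.lldp_tx and port.lldp_rx


def link_supports_dcbx(local, peer):
    """Both ends of the link must be able to run DCBX."""
    return dcbx_operational(local) and dcbx_operational(peer)


print(link_supports_dcbx(Port(), Port()))               # both sides ready
print(link_supports_dcbx(Port(), Port(lldp_rx=False)))  # peer cannot receive LLDP
```

Disabling either LLDP direction on either end is enough to take DCBX down for the whole link.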

DCBX capable devices should have DCBX enabled by default with the ability to administratively disable it.  This allows for more seamless deployments of DCB networks with less tendency for error.
