VXLAN Deep Dive – Part II

In part one of this post I covered the basic theory of operations and functionality of VXLAN (http://www.definethecloud.net/vxlan-deep-dive.)  This post will dive deeper into how VXLAN operates on the network.

Let’s start with the basic concept that VXLAN is an encapsulation technique.  Basically the Ethernet frame sent by a VXLAN connected device is encapsulated in an IP/UDP packet.  The most important thing here is that it can be carried by any IP capable device.  The only time added intelligence is required in a device is at the network bridges known as VXLAN Tunnel End-Points (VTEP) which perform the encapsulation/de-encapsulation.  This is not to say that benefit can’t be gained by adding VXLAN functionality elsewhere, just that it’s not required.

image
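To make the encapsulation concrete, here is a minimal Python sketch of the 8-byte VXLAN header that prefixes the original Ethernet frame inside the UDP payload (field layout per RFC 7348; the frame bytes and VNI value here are made up for the example):

```python
import struct

VXLAN_PORT = 4789  # IANA-assigned UDP port for VXLAN

def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prefix an Ethernet frame with the 8-byte VXLAN header.

    A real VTEP would then wrap the result in outer UDP/IP/Ethernet
    headers addressed to the destination VTEP.
    """
    flags = 0x08  # 'I' bit set: the VNI field is valid
    # Header layout: flags (1B), reserved (3B), VNI (3B), reserved (1B)
    return struct.pack("!B3x", flags) + vni.to_bytes(3, "big") + b"\x00" + inner_frame

payload = vxlan_encapsulate(b"\xff" * 60, vni=1001)
```

Everything after that header is the tenant's original frame, untouched, which is why any IP-capable device in the path can carry it.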

Providing Ethernet Functionality on IP Networks:

As discussed in Part 1, the source and destination IP addresses used for VXLAN are the Source VTEP and destination VTEP.  This means that the VTEP must know the destination VTEP in order to encapsulate the frame.  One method for this would be a centralized controller/database.  That being said VXLAN is implemented in a decentralized fashion, not requiring a controller.  There are advantages and drawbacks to this.  While utilizing a centralized controller would provide methods for address learning and sharing, it would also potentially increase latency, require large software driven mapping tables and add network management points.  We will dig deeper into the current decentralized VXLAN deployment model.

VXLAN maintains backward compatibility with traditional Ethernet and therefore must maintain some key Ethernet capabilities.  One of these is flooding (broadcast) and ‘flood and learn’ behavior.  I cover some of this behavior here (http://www.definethecloud.net/data-center-101-local-area-network-switching) but the summary is that when a switch receives a frame for an unknown destination (a MAC not in its table) it will flood the frame to all ports except the one on which it was received.  Eventually the frame will reach the intended device and a reply will be sent, which allows the switch to learn the MAC’s location.  When switches see source MACs that are not in their table they will ‘learn’ (add) them.
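That flood-and-learn behavior can be modeled as a toy bridge (port numbers and MACs here are purely illustrative):

```python
class LearningSwitch:
    """Toy flood-and-learn bridge; ports and MACs are illustrative."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC -> port it was learned on

    def receive(self, src_mac, dst_mac, in_port):
        # Learn: remember which port the source MAC lives on.
        self.mac_table[src_mac] = in_port
        # Forward: known destinations go out one port; unknown
        # destinations flood everywhere except the ingress port.
        if dst_mac in self.mac_table:
            return {self.mac_table[dst_mac]}
        return self.ports - {in_port}
```

The first frame toward an unknown MAC floods; the reply populates the table, so subsequent frames go out a single port.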

VXLAN encapsulates over IP, and IP networks are typically designed for unicast traffic (one-to-one.)  This means there is no inherent flood capability.  In order to mimic flood and learn on an IP network VXLAN uses IP multicast, which provides a method for distributing a packet to a group.  This use of IP multicast can be a contentious point in VXLAN discussions because most networks aren’t designed for IP multicast, IP multicast support can be limited, and multicast itself can be complex depending on the implementation.

Within VXLAN each VXLAN segment ID is assigned to a multicast group.  Multiple VXLAN segments can share the same multicast group; this minimizes configuration but increases unneeded network traffic.  When a device attaches to a VXLAN segment on a VTEP where that segment was not previously in use, the VTEP will join the IP multicast group assigned to that segment and start receiving messages.
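On a host-based VTEP that join could be issued as a standard IGMP membership request through the socket API.  A hedged sketch, with a hypothetical VNI-to-group table (real deployments configure these mappings):

```python
import socket
import struct

# Hypothetical VNI-to-group assignments; groups may be shared
# across segments (less config, more unneeded traffic).
VNI_TO_GROUP = {
    1001: "239.1.1.1",
    1002: "239.1.1.2",
    1003: "239.1.1.1",  # shared with VNI 1001
}

def membership_request(vni: int) -> bytes:
    """Build the ip_mreq structure for the group backing this VNI."""
    group = VNI_TO_GROUP[vni]
    return struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))  # any local interface

def join_group_for_vni(sock: socket.socket, vni: int) -> None:
    """Issue an IGMP join so this VTEP receives floods for the segment."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(vni))
```

Once joined, the VTEP receives every flood sent to that group, including floods for other segments that happen to share it.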

image

In the diagram above we see the normal operation in which the destination MAC is known and the frame is encapsulated in IP using the source and destination VTEP address.  The frame is encapsulated by the source VTEP, de-encapsulated at the destination VTEP and forwarded based on bridging rules from that point.  In this operation only the destination VTEP will receive the frame (with the exception of any devices in the physical path, such as the core IP switch in this example.)

image

In the example above we see an unknown MAC address (the MAC-to-VTEP mapping does not exist in the table.)  In this case the source VTEP encapsulates the original frame in an IP multicast packet with the destination IP of the associated multicast group.  This frame will be delivered to all VTEPs participating in the group.  Ideally the VTEPs participating in the group will only be VTEPs with connected devices attached to that VXLAN segment, but because multiple VXLAN segments can use the same IP multicast group this is not always the case.  The VTEP with the connected device will de-encapsulate and forward normally, adding the mapping from the source VTEP if required.  Any other VTEP that receives the packet can learn the source VTEP/MAC mapping if required and then discard the packet.  This process is the same for other traditionally flooded frames such as ARP.  The diagram below shows the logical topologies for both traffic types discussed.

image
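The two forwarding paths, unicast for known MACs and multicast flood for unknown ones, reduce to a small lookup.  A sketch with illustrative table contents:

```python
# Illustrative VTEP tables; the (MAC, VNI) -> VTEP entry and the
# multicast group addresses are made-up example values.
mac_to_vtep = {("00:11:22:33:44:55", 1001): "10.1.1.1"}
vni_to_group = {1001: "239.1.1.1"}

def outer_destination(dst_mac, vni):
    # Known MAC: unicast to the owning VTEP; unknown: flood via the group.
    return mac_to_vtep.get((dst_mac, vni), vni_to_group[vni])

def learn(src_mac, vni, outer_src_ip):
    # Any VTEP receiving a packet can record the source VTEP/MAC mapping.
    mac_to_vtep[(src_mac, vni)] = outer_src_ip
```

After a reply comes back and `learn()` runs, subsequent frames to that MAC take the unicast path.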

As discussed in Part 1 VTEP functionality can be placed in a traditional Ethernet bridge.  This is done by placing a logical VTEP construct within the bridge hardware/software.  With this in place VXLANs can bridge between virtual and physical devices.  This is necessary for physical server connectivity, as well as to add network services provided by physical appliances.  Putting it all together the diagram below shows physical servers communicating with virtual servers in a VXLAN environment.  The blue links are traditional IP links and the switch shown at the bottom is a standard L3 switch or router.  All traffic on these links is encapsulated as IP/UDP and broken out by the VTEPs.

image

Summary:

VXLAN provides backward compatibility with traditional VLANs by mimicking broadcast and multicast behavior through IP multicast groups.  This functionality provides for decentralized learning by the VTEPs and negates the need for a VXLAN controller.


VXLAN Deep Dive

I’ve been spending my free time digging into network virtualization and network overlays.  This is part 1 of a 2 part series; part 2 can be found here: http://www.definethecloud.net/vxlan-deep-divepart-2.  By far the most popular virtualization technique in the data center is VXLAN.  This has as much to do with Cisco and VMware backing the technology as the tech itself.  That being said, VXLAN is targeted specifically at the data center and is one of many similar solutions, such as NVGRE and STT.  VXLAN’s goal is to allow dynamic, large scale, isolated virtual L2 networks to be created for virtualized and multi-tenant environments.  It does this by encapsulating frames in VXLAN packets.  The standard for VXLAN is under the scope of the IETF NVO3 working group.

 

VXLAN Frame

The VXLAN encapsulation method is IP based and provides for a virtual L2 network.  With VXLAN the full Ethernet Frame (with the exception of the Frame Check Sequence: FCS) is carried as the payload of a UDP packet.  VXLAN utilizes a 24-bit VXLAN header, shown in the diagram, to identify virtual networks.  This header provides for up to 16 million virtual L2 networks.
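The scale difference comes straight from the header widths, a quick bit of arithmetic:

```python
# 802.1Q carries a 12-bit VLAN ID; the VXLAN header carries a 24-bit VNI.
VLAN_ID_BITS = 12
VNI_BITS = 24

vlan_count = 2 ** VLAN_ID_BITS   # 4,096 traditional VLANs
vni_count = 2 ** VNI_BITS        # 16,777,216 -- the "16 million" figure
```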

Frame encapsulation is done by an entity known as a VXLAN Tunnel Endpoint (VTEP.)  A VTEP has two logical interfaces: an uplink and a downlink.  The uplink is responsible for receiving VXLAN frames and acts as a tunnel endpoint with an IP address used for routing VXLAN encapsulated frames.  These IP addresses are infrastructure addresses and are separate from the tenant IP addressing for the nodes using the VXLAN fabric.  VTEP functionality can be implemented in software, such as a virtual switch, or in the form of a physical switch.

VXLAN frames are sent to the IP address assigned to the destination VTEP; this IP is placed in the Outer IP DA.  The IP of the VTEP sending the frame resides in the Outer IP SA.  Packets received on the uplink are mapped from the VXLAN ID to a VLAN and the Ethernet frame payload is sent as an 802.1Q Ethernet frame on the downlink.  During this process the inner MAC SA and VXLAN ID are learned in a local table.  Packets received on the downlink are mapped to a VXLAN ID using the VLAN of the frame.  A lookup is then performed within the VTEP L2 table using the VXLAN ID and destination MAC; this lookup provides the IP address of the destination VTEP.  The frame is then encapsulated and sent out the uplink interface.

image

Using the diagram above for reference a frame entering the downlink on VLAN 100 with a destination MAC of 11:11:11:11:11:11 will be encapsulated in a VXLAN packet with an outer destination address of 10.1.1.1.  The outer source address will be the IP of this VTEP (not shown) and the VXLAN ID will be 1001.
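Those two mappings, VLAN-to-VNI on the downlink and the (VNI, destination MAC) L2 table, can be sketched in a few lines using the example's values:

```python
# The two lookups described above, seeded with the example's values:
# VLAN 100 maps to VNI 1001, and MAC 11:11:11:11:11:11 sits behind
# the VTEP at 10.1.1.1.
vlan_to_vni = {100: 1001}
l2_table = {(1001, "11:11:11:11:11:11"): "10.1.1.1"}  # (VNI, dst MAC) -> VTEP IP

def downlink_lookup(vlan, dst_mac):
    """Map a frame arriving on the downlink to (VNI, destination VTEP IP)."""
    vni = vlan_to_vni[vlan]
    return vni, l2_table.get((vni, dst_mac))  # None means unknown: flood
```

A `None` result is the miss case covered next: the frame must be flooded rather than unicast.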

In a traditional L2 switch a behavior known as flood and learn is used for unknown destinations (i.e. a MAC not stored in the MAC table.)  This means that if there is a miss when looking up the MAC, the frame is flooded out all ports except the one on which it was received.  When a response is sent the MAC is learned and written to the table.  The next frame for the same MAC will not incur a miss because the table will reflect the port it exists on.  VXLAN preserves this behavior over an IP network using IP multicast groups.

Each VXLAN ID has an assigned IP multicast group to use for traffic flooding (the same multicast group can be shared across VXLAN IDs.)  When a frame is received on the downlink bound for an unknown destination it is encapsulated using the IP of the assigned multicast group as the Outer DA; it’s then sent out the uplink.  Any VTEP with nodes on that VXLAN ID will have joined the multicast group and therefore receive the frame.  This maintains the traditional Ethernet flood and learn behavior.

VTEPs are designed to be implemented as a logical device on an L2 switch.  The L2 switch connects to the VTEP via a logical 802.1Q VLAN trunk.  This trunk contains a VXLAN infrastructure VLAN in addition to the production VLANs.  The infrastructure VLAN is used to carry VXLAN encapsulated traffic to the VXLAN fabric.  The only member interfaces of this VLAN will be the VTEP’s logical connection to the bridge itself and the uplink to the VXLAN fabric.  This interface is the ‘uplink’ described above, while the logical 802.1Q trunk is the downlink.

image

Summary

VXLAN is a network overlay technology designed for data center networks.  It provides massively increased scalability over VLAN IDs alone while allowing for L2 adjacency over L3 networks.  The VXLAN VTEP can be implemented in both virtual and physical switches, allowing the virtual network to map to physical resources and network services.  VXLAN currently has both wide support and hardware adoption in switching ASICs and hardware NICs, as well as in virtualization software.


Digging Into the Software Defined Data Center

The software defined data center is a relatively new buzzword embraced by the likes of EMC and VMware.  For an introduction to the concept see my article over at Network Computing (http://www.networkcomputing.com/data-center/the-software-defined-data-center-dissect/240006848.)  This post is intended to take it a step deeper as I seem to be stuck at 30,000 feet for the next five hours with no internet access and no other decent ideas.  For the purpose of brevity (read: laziness) I’ll use the acronym SDDC for Software Defined Data Center whether or not it is being used elsewhere.

First let’s look at what you get out of a SDDC:

Legacy Process:

In a traditional legacy data center the workflow for implementing a new service would look something like this:

  1. Approval of the service and budget
  2. Procurement of hardware
  3. Delivery of hardware
  4. Rack and stack of new hardware
  5. Configuration of hardware
  6. Installation of software
  7. Configuration of software
  8. Testing
  9. Production deployment

This process would vary greatly in overall time but 30-90 days is probably a good ballpark (I know, I know, some of you are wishing it happened that fast.)

Not only is this process complex and slow but it has inherent risk.  Your users are accustomed to on-demand IT services in their personal life.  They know where to go to get it and how to work with it.  If you tell a business unit it will take 90 days to deploy an approved service they may source it from outside of IT.  This type of shadow IT poses issues for security, compliance, backup/recovery etc. 

SDDC Process:

As described in the link above an SDDC provides a complete decoupling of the hardware from the services deployed on it.  This provides a more fluid system for IT service change: growing, shrinking, adding and deleting services.  Conceptually the overall infrastructure would maintain an agreed upon level of spare capacity and would be added to as thresholds were crossed.  This would provide an ability to add services and grow existing services on the fly in all but the most extreme cases.  Additionally the management and deployment of new services would be software driven through intuitive interfaces rather than hardware driven and disparate CLI based.

The process would look something like this:

  1. Approval of the service and budget
  2. Installation of software
  3. Configuration of software
  4. Testing
  5. Production deployment

The removal of four steps is not the only benefit.  The remaining five steps are streamlined into automated processes rather than manual configurations.  Change management systems and trackback/chargeback are incorporated into the overall software management system providing a fluid workflow in a centralized location.  These processes will be initiated by authorized IT users through self-service portals.  The speed at which business applications can be deployed is greatly increased providing both flexibility and agility.

Isn’t that cloud?

Yes, no and maybe.  Or as we say in the IT world: ‘It depends.’  SDDC can be cloud; with on-demand self-service, flexible resource pooling, metered service, etc., it fits the cloud model.  The difference is really in where and how it’s used.  A public cloud based IaaS model, or any given PaaS/SaaS model, does not lend itself to legacy enterprise applications.  For instance you’re not migrating your Microsoft Exchange environment onto Amazon’s cloud.  Those legacy applications and systems still need a home.  Additionally those existing hardware systems still have value.  SDDC offers an evolutionary approach to enterprise IT that can support both legacy applications and new applications written to take advantage of cloud systems.  This provides a migration approach as well as investment protection for traditional IT infrastructure.

How it works:

The term ‘Cloud operating System’ is thrown around frequently in the same conversation as SDDC.  The idea is that compute, network and storage are raw resources consumed by the applications and services we run to drive our businesses.  Rather than look at these resources individually, and manage them as such, we plug them into a management infrastructure that understands them and can utilize them as services require them.  Forget the hardware underneath and imagine a dashboard of your infrastructure something like the following graphic.

image

 

The hardware resources become raw resources to be consumed by the IT services.  For legacy applications this can be very traditional virtualization or even physical server deployments.  New applications and services may be deployed in a PaaS model on the same infrastructure, allowing for greater application scale and redundancy and even less of a tie to the hardware underneath.

Lifting the kimono:

Taking a peek underneath the top level reveals a series of technologies both new and old.  Additionally there are some requirements that may or may not be met by current technology offerings.  We’ll take a look through the compute, storage and network requirements of SDDC one at a time, starting with compute and working our way up.

Compute is the layer that requires the least change.  Years ago we moved to commodity x86 hardware, which will be the base of these systems.  The compute platform itself will be differentiated by CPU and memory density, platform flexibility and cost.  Differentiators traditionally built into the hardware, such as availability and serviceability features, will lose value.  Features that will continue to add value will be related to infrastructure reduction and enablement of upper level management and virtualization systems.  Hardware that provides flexibility and programmability will be king here and at other layers, as we’ll discuss.

Other considerations at the compute layer will tie closely into storage.  As compute power itself has grown by leaps and bounds, our networks and storage systems have become the bottleneck.  Our systems can process our data faster than we can feed it to them.  This causes issues for power, cooling efficiency and overall optimization.  Dialing down performance for power savings is not the right answer.  Instead we want to fuel our processors with data rather than starve them.  This means having fast local data in the form of SSD, flash and cache.

Storage will require significant change, but changes that are already taking place or foreshadowed in roadmaps and startups.  The traditional storage array will become more and more niche as it has limited capacities of both performance and space.  In its place we’ll see new options including, but not limited to migration back to local disk, and scale-out options.  Much of the migration to centralized storage arrays was fueled by VMware’s vMotion, DRS, FT etc.  These advanced features required multiple servers to have access to the same disk, hence the need for shared storage.  VMware has recently announced a combination of storage vMotion and traditional vMotion that allows live migration without shared storage.  This is available in other hypervisor platforms and makes local storage a much more viable option in more environments.

Scale-out systems on the storage side are nothing new.  LeftHand and EqualLogic pioneered much of this market before being bought by HP and Dell respectively.  The market continues to grow with products like Isilon (acquired by EMC) making a big splash in the enterprise as well as plays in the Big Data market.  NetApp’s cluster mode is now in full effect with ONTAP 8.1, allowing their systems to scale out.  In the SMB market new players with fantastic offerings like Scale Computing are making headway and bringing innovation to the market.  Scale-out provides a more linear growth path as both I/O and capacity increase with each additional node.  This is contrary to traditional systems, which are always bottlenecked by the storage controller(s).

We will also see moves to central control, backup and tiering of distributed storage, such as storage blades and server cache.  Having fast data at the server level is a necessity but solves only part of the problem.  That data must also be made fault tolerant as well as available to other systems outside the server or blade enclosure.  EMC’s VFCache is one technology poised to help with this by adding the server as a storage tier for software tiering.  Software such as this places the hottest data directly next to the processor, with tier options all the way back to SAS, SATA, and even tape.

By now you should be seeing the trend of software-based features and control.  The last stage is the network, which will require the most change.  Networking has held strong to proprietary hardware and high margins for years while the rest of the market has moved to commodity.  Companies like Arista look to challenge the status quo by providing software feature sets, or open programmability, layered onto fast commodity hardware.  Additionally Software Defined Networking (http://www.definethecloud.net/sdn-centralized-network-command-and-control) has been validated by both VMware’s acquisition of Nicira and Cisco’s spin-off of Insieme, which by most accounts will expand upon the Cisco ONE concept with a Cisco flavored SDN offering.  In any event the race is on to build networks based on software flows that are centrally managed rather than the port-to-port configuration nightmare of today’s data centers.

This move is not only for ease of administration, but is also required to push our systems to the levels required by cloud and SDDC.  These multi-tenant systems running disparate applications at various service tiers require tighter quality of service controls and bandwidth guarantees, as well as more intelligent routing.  Today’s physically configured networks can’t provide these controls.  Additionally applications will benefit from network visibility, allowing them to request specific flow characteristics from the network based on application or user requirements.  Multiple service levels can be configured on the same physical network, allowing traffic to take appropriate paths based on type rather than physical topology.  These network changes are required to truly enable SDDC and cloud architectures.

Further up the stack from the Layer 2 and Layer 3 transport networks comes a series of other network services that will be layered in via software.  Features such as load-balancing, access-control and firewall services will be required for the services running on these shared infrastructures.  These network services will need to be deployed with new applications and tiered to the specific requirements of each.  As with the L2/L3 services, manual configuration will not suffice and a ‘big picture’ view will be required to ensure that network services match application requirements.  These services can be layered in from both physical and virtual appliances, but will require configurability via the centralized software platform.

Summary:

By combining current technology trends, emerging technologies and layering in future concepts the software defined data center will emerge in evolutionary fashion.  Today’s highly virtualized data centers will layer on technologies such as SDN while incorporating new storage models bringing their data centers to the next level.  Conceptually picture a mainframe pooling underlying resources across a shared application environment.  Now remove the frame.


Thoughts From a Tech Leadership Summit

This week I attended a tech leadership summit in Vail, Colorado for the second time.  The event is always a fantastic series of discussions and brings together some of the top minds in the technology industry.  Here are some thoughts on the trends and thinking that were common at the event.

Virtualization and VDI:

There was a lot less talk of VDI and virtualization than in 2011.  These conversations were replaced with more discussion of cloud and app delivery.  Overall the consensus seemed to be that getting the application to the right native environment on a given device is a far better approach than getting the desktop there.

Hypervisors were barely mentioned except in a recurring theme that the hypervisor itself has become commodity.  This means that management and the upper layer feature set are the differentiators.  Parallel to this thought was that VMware no longer has the best hypervisor, yet their management system is still far superior to the competition (KVM was touted as the best hypervisor several times.)

The last piece of the virtualization discussion was around VMware’s acquisition of Nicira.  Some bullet points on that:

  • VMware paid too much for Nicira but that was unavoidable for the startup-to-be in the valley and it’s a great acquisition overall.
  • It’s no surprise VMware moved into networking; everyone is moving that way.
  • While this is direct competition with Cisco it is currently in a small niche of service provider business.  Nicira’s product requires significant custom integration to deploy and will take time for VMware to productize in a fashion usable for the enterprise.  Best guess: two years to real product.
  • Overall the Cisco VMware partnership is very lucrative on both sides and should not be affected by this in the near term.
  • A seldom discussed portion of this involves the development expertise that comes with the acquisition.  With the hypervisor being commodity, and differentiation moving into the layers above that, we’ll see more and more variety in hypervisors.  This means multi-hypervisor support will be a key component of the upper level management products where virtualization vendors will compete.  Nicira’s team has proven capabilities in this space and can accelerate VMware’s multi-hypervisor strategy.

Storage:

There was a lot of talk about both the vision and execution of EMC over the past year or more.  I personally used ‘execution machine’ more than once to describe them (coming from a typically non-EMC Kool-Aid guy.)  Some key points that resonated over the past few days:

  • EMC’s execution on the VNX/VNXe product lines is astounding.  EMC launched a product and went on direct attack into a portion of NetApp’s business that nobody could really touch.  Through both sales and marketing excellence they’ve taken an increasingly large chunk out of this portion of the market.  This shores up a breach in their product line NetApp was using to gain share.
  • EMC’s Isilon acquisition was not only a fantastic choice, but was quickly integrated well.  Isilon is a fantastic product and has big data potential which is definitely a market that will generate big revenue in coming years.
  • EMC’s cloud vision is sound and they are executing well on it.  Additionally they were ahead of their pack of hardware vendor peers in this regard. EMC is embracing a software defined future.

I also participated in several discussions around flash and flash storage.  Some highlights:

  • PCIe based flash storage is definitely increasing in sales and enterprise consumption.  This market is expected to continue to grow as we strive to move the data closer to the processor.  There are two methods for this: storage in the server, servers in the storage.  PCIe flash plays in the server side and EMC Isilon will eventually play on the storage side.  Also look for an announcement in the SMB storage space around this during VMworld.
  • One issue in this space is that the expensive fast server based flash becomes trapped capacity if a server can’t drive enough I/O to it.  Additionally there are data loss concerns with this data trapped in the server.
  • Both of these issues are looking to be solved by EMC and IBM who intend to add server based flash into the tiering of shared storage.
  • Most traditional storage vendors’ flash options are ‘bolt-ons’ to traditional array architecture.  This can leave the expensive flash I/O starved, limiting its performance benefit.  Several all-flash startups intend to use this as an inflection point with flash based systems designed from the ground up for the performance the media offers.
  • Flash is still not an answer to every problem, and never will be.

The last point that struck me was a potential move away from shared storage as a whole.  Microsoft would rather have you use local storage, clusters and big data apps like Hadoop thrive on local storage, and one last big shared storage draw is going away: vMotion.  Once shared storage is no longer needed for live virtual machine migration there will be far less draw for expensive systems.

Cloud:

The major cloud discussion I was a part of (mainly observer) involved OpenStack.  Overall OpenStack has a ton of buzz, and a plethora of developers.  What it’s lacking is customers, leadership and someone driving it who can lead a revolution.  Additionally it’s suffering from politics and bureaucracy.  It was described as impossible to support by one individual who would definitely know one way or another.  My thinking is that if you have CloudStack sitting there with real customers, an easily deployed system, support and leadership why waste cycles continuing down the OpenStack path?  The best answer I heard for that: Ego.  Everyone wants to build the next Amazon and CloudStack is too baked to make as much of a mark.

Overall it’s an interesting topic but my thought is: with limited developers the industry should be getting behind the best horse and working together.

Big Data:

Big Data was obviously another fun topic.  The quote of the week was ‘There are ten people, not companies, that understand Big Data.  6 of them are at Cloudera and the other 4 are locked in Google writing their own checks.’  Basically Big Data knowledge is rare and hiring consultants is not typically a viable option because you need people holding three things: knowledge of big data processing, knowledge of your data, and knowledge of your business.  These data scientists aren’t easy to come by.  Additionally, contrary to popular hype, Hadoop is not the end-all be-all of big data; it’s a tool in a large tool chest.  Especially when talking about real-time you’ll need to look elsewhere.  The consensus was that we are with big data where we were with cloud 2-3 years ago.  That being said CIOs may still need to show big data initiatives (read: spend) so you should see $$ thrown at well packaged big data solutions geared toward plug-n-play in the enterprise.

All in all it was an excellent event and I was humbled as usual to participate in great conversations with so many smart people who are out there driving the future of technology.  What I’ve written here is a summary from my perspective on the one summit portion I had time to participate in.  There is always a good chance I misquoted/misunderstood something so feel free to call me out.  As always I’d love your feedback, contradictions or hate mail comments.


Forget Multiple Hypervisors

The concept of managing multiple hypervisors in the data center isn’t new–companies have been doing so or thinking about doing so for some time. Changes in licensing schemes and other events bring this issue to the forefront as customers look to avoid new costs. VMware recently acquired DynamicOps, a cloud automation/orchestration company with support for multiple hypervisors, as well as for Amazon Web Services. A hypervisor vendor investing in multihypervisor support brings the topic back to the forefront.  To see the full article visit: http://www.networkcomputing.com/virtualization/240003355


Ignoring BYOD Spells Disaster!

Last week in a blog titled "BYOD–Bring Your Own Disaster," I urged caution and scope for BYOD projects. This week I’m playing devil’s advocate with myself. A conversation with Greg Knieriemen (@knieriemen) got me thinking of the consequences of ignoring BYOD. Let’s dive into the risk of burying your head in the sand and ignoring the BYOD push.  To see the full article visit: http://www.networkcomputing.com/private-cloud/232300959.


The Idle Cycle Conundrum

One of the advantages of a private cloud architecture is the flexible pooling of resources that allows rapid change to match business demands. These resource pools adapt to the changing demands of existing services and allow for new services to be deployed rapidly. For these pools to maintain adequate performance, they must be designed to handle peak periods and this will also result in periods with idle cycles… To see the full article visit Network Computing: http://www.networkcomputing.com/private-cloud/231903031.


Hypervisors are not the Droids You Seek

Long ago, in a data center far, far away, we as an industry moved away from big iron and onto commodity hardware. That move brought with it many advantages, such as cost and flexibility. The change also brought along with it higher hardware and operating system software failure rates. This change in application stability forced us to change our deployment model and build the siloed application environment: One application, one operating system, one server….

To see the full post visit Network Computing: http://www.networkcomputing.com/private-cloud/231901662


Server Networking With gen 2 UCS Hardware

** this post has been slightly edited thanks to feedback from Sean McGee**

In previous posts I’ve outlined:

If you’re not familiar with UCS networking, I suggest you start with those for background.  This post updates them, focusing on UCS B-Series server to Fabric Interconnect communication using the new hardware options announced at Cisco Live 2011.  First, a recap of the new hardware:

The UCS 6248UP Fabric Interconnect

The 6248 is a 1RU model that provides 48 universal ports (1G/10G Ethernet or 1/2/4/8G FC).  This provides 20 additional ports over the 6120 in the same 1RU form factor.  Additionally, the 6248 lowers latency to 2.0us from the previous 3.2us.

The UCS 2208XP I/O Module

The 2208 doubles the uplink bandwidth per I/O module, providing a total of 160Gbps throughput per 8-blade chassis.  It quadruples the number of internal 10G connections to the blades, allowing for 80Gbps per half-width blade.

UCS 1280 VIC

The 1280 VIC provides 8x10GE ports total, 4x to each IOM, for a total of 80Gbps per half-width slot (160Gbps with 2x in a full-width blade).  It also doubles the VIF count of the previous VIC, allowing for 256 (theoretical) vNICs or vHBAs.  The new VIC also supports port-channeling to the UCS 2208 IOM and iSCSI boot.

The other addition that affects this conversation is the ability to port-channel the uplinks from the 2208 IOM, which could not be done before (each link on a 2104 IOM operated independently).  All of the new hardware is backward compatible with all existing UCS hardware.  For more detailed information on the hardware and software announcements visit Sean McGee’s blog, where I stole these graphics: http://www.mseanmcgee.com/2011/07/ucs-2-0-cisco-stacks-the-deck-in-las-vegas/.

Let’s start by discussing the connectivity options from the Fabric Interconnects to the IOMs in the chassis focusing on all gen 2 hardware.

There are two modes of operation for the IOM: Discrete and Port-Channel.  In both modes it is possible to configure 1, 2, 4, or 8 uplinks from each IOM, either non-bundled (Discrete mode) or bundled (Port-Channel mode).

UCS 2208 Fabric Interconnect Failover

[diagram]

Discrete Mode:

In Discrete mode a static pinning mechanism is used, mapping each blade to a given port dependent on the number of uplinks in use.  This means each blade has an assigned uplink on each IOM for inbound and outbound traffic.  In this mode, if a link failure occurs the blade will not ‘re-pin’ on the side of the failure, but will instead rely on NIC teaming/bonding or Fabric Failover to fail over to the redundant IOM/fabric.  The pinning behavior is as follows, with the exception of 1 uplink (not shown), in which all blades use the only available port:

2 Uplinks

Blade 1: Port 1
Blade 2: Port 2
Blade 3: Port 1
Blade 4: Port 2
Blade 5: Port 1
Blade 6: Port 2
Blade 7: Port 1
Blade 8: Port 2

4 Uplinks

Blade 1: Port 1
Blade 2: Port 2
Blade 3: Port 3
Blade 4: Port 4
Blade 5: Port 1
Blade 6: Port 2
Blade 7: Port 3
Blade 8: Port 4

8 Uplinks

Blade 1: Port 1
Blade 2: Port 2
Blade 3: Port 3
Blade 4: Port 4
Blade 5: Port 5
Blade 6: Port 6
Blade 7: Port 7
Blade 8: Port 8
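The static pinning above follows a simple pattern: blade slot modulo the number of uplinks. A minimal Python sketch of the rule (illustrative only, not Cisco code) could look like this:

```python
# Illustrative sketch of Discrete-mode static pinning: each blade slot
# maps to one IOM uplink based on the number of uplinks configured.

def pinned_uplink(blade: int, uplinks: int) -> int:
    """Return the uplink port a blade is pinned to in Discrete mode."""
    if uplinks not in (1, 2, 4, 8):
        raise ValueError("Discrete mode uses 1, 2, 4, or 8 uplinks")
    return ((blade - 1) % uplinks) + 1

# With 4 uplinks, blades 1-8 pin to ports 1, 2, 3, 4, 1, 2, 3, 4
print([pinned_uplink(b, 4) for b in range(1, 9)])
```

With 8 uplinks the mapping is simply blade n to port n, which is why server 3 in the failover example below is pinned to port 3 on both IOMs.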

The same port pinning is used on both IOMs; in a redundant configuration each blade is therefore uplinked via the same port number on separate IOMs to redundant fabrics.  The draw of Discrete mode is that bandwidth is predictable in link-failure scenarios.  If a link fails on one IOM, that server fails over to the other fabric rather than adding bandwidth draw to the remaining active links on the failed side.  In summary, it forces NIC teaming/bonding or Fabric Failover to handle failure events rather than network-based load balancing.  The following diagram depicts the failover behavior for server 3 in an 8-uplink scenario.

Discrete Mode Failover

[diagram]

In the previous diagram, port 3 on IOM A has failed.  With the system in Discrete mode, NIC teaming/bonding or Fabric Failover handles failover to the secondary path on IOM B (the same port, 3, based on static pinning).

Port-Channel Mode:

In Port-Channel mode all available links are bonded, and a port-channel hashing algorithm (TCP/UDP + Port VLAN, non-configurable) is used to load-balance server traffic.  In this mode all server links are still ‘pinned,’ but they are pinned to the logical bundle rather than to individual IOM uplinks.  The following diagram depicts this mode.

Port-Channel Mode

[diagram]
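The exact hash the IOM uses is fixed and not published in detail, but the general behavior can be sketched: every flow hashes deterministically onto one member of the bundle, and on a link failure the flow simply re-hashes onto a surviving member. A rough Python illustration, using zlib.crc32 as a stand-in for the real hash:

```python
import zlib

# Stand-in for the IOM's fixed hash: deterministically map a flow
# identifier onto one active member link of the port-channel.
def member_link(flow: tuple, active_links: list) -> int:
    digest = zlib.crc32(repr(flow).encode())
    return active_links[digest % len(active_links)]

links = list(range(1, 9))                             # 8 bundled uplinks
flow = ("10.0.0.5", "10.0.0.9", "tcp", 49152, 2049)   # hypothetical flow
first = member_link(flow, links)
links.remove(first)              # simulate failure of that member link
print(member_link(flow, links))  # the flow re-hashes onto a survivor
```

The key property this illustrates: a given flow always lands on exactly one link, so failover is a re-hash rather than a NIC-teaming event.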

In this scenario, when a port fails on an IOM, the port-channel load-balancing algorithm moves the affected server traffic flows to another available port in the channel.  This failover will typically be faster than NIC-teaming/bonding failover.  It decreases the potential throughput for all flows on the side with the failure, but will only affect performance if the links are saturated.  The following diagram depicts this behavior.

[diagram]

In the diagram above, Blade 3 was pinned to port 1 on the A side.  When port 1 failed, port 4 was selected (depicted in green), while fabric B port 6 remains active, leaving a potential of 20Gbps.

Note: Actual used ports will vary dependent on port-channel load-balancing.  These are used for example purposes only.

As you can see, Port-Channel mode enables additional redundancy and potential per-server bandwidth, as it leaves two paths open.  In high-utilization situations where the links are fully saturated, a failure will degrade throughput for all blades on the affected side.  This is not necessarily a bad thing (it happens with all port-channel mechanisms), but it is a design consideration.  Additionally, port-channeling in all forms can only provide the bandwidth of a single link per flow (think of a flow as a conversation).  This means each flow can utilize at most 10Gbps even though 8x10Gbps links are bundled.  For example, a single FTP transfer would max out at 10Gbps, while 8 FTP transfers could potentially use 80Gbps (10 per link), dependent on load balancing.
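The per-flow cap reduces to simple arithmetic: aggregate throughput is bounded by both the flow count and the link count. A back-of-the-envelope helper (hypothetical, for illustration only):

```python
LINK_GBPS = 10  # each member link of the bundle

# Each flow is capped at one link's bandwidth, so the best case is
# one flow per link: min(flows, links) * per-link bandwidth.
def max_throughput_gbps(flows: int, links: int) -> int:
    return min(flows, links) * LINK_GBPS

print(max_throughput_gbps(1, 8))  # one FTP transfer: 10
print(max_throughput_gbps(8, 8))  # eight transfers, ideal hashing: 80
```

Note this is the ideal case; real hashing can land two flows on the same link, so actual aggregate throughput may be lower.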

Next let’s discuss server-to-IOM connectivity (yes, I use ‘discuss’ to describe me monologuing in print, get over it, and yes, I know ‘monologuing’ isn’t a word).  I’ll focus on the new UCS 1280 VIC because all other current cards maintain the same connectivity.  The following diagram depicts the 1280 VIC’s connectivity.

[diagram]

The 1280 VIC utilizes 4x10Gbps links across the midplane per IOM to form two 40Gbps port-channels.  This provides 80Gbps of total potential throughput per card, meaning a half-width blade has a total potential of 80Gbps with this card and a full-width blade can receive 160Gbps (dependent upon design, of course).  As with any port-channel, link bonding, trunking, or whatever you may call it, any flow (conversation) can only utilize the bandwidth of one physical link (or backplane trace).  This means every flow from any given UCS server has a max potential bandwidth of 10Gbps, but with 8 total uplinks, 8 different flows could potentially utilize 80Gbps.

This becomes very important with things like NFS-based storage within hypervisors.  Typically the hypervisor handles storage connectivity for all VMs, which means only one flow (conversation) occurs between host and storage.  In these typical configurations only 10Gbps is available for all VM NFS data traffic, even though the host may have 80Gbps of potential bandwidth.  Again, this is not necessarily a concern but a design consideration, as most current and near-future hosts will never use more than 10Gbps of storage I/O.

Summary:

The new UCS hardware packs a major punch when it comes to bandwidth, port density, and failover options.  That being said, it’s important to understand the frame flow, port usage, and potential bandwidth in order to properly design solutions for maximum efficiency.  As always, comments, complaints, and corrections are quite welcome!


WWT’s Geek Day 2012

A BrightTALK Channel
