Intel’s Betting the Storage I/O Farm on the CPU


I had the privilege of attending Tech Field Day 4 in San Jose this week as a delegate thanks to Stephen Foskett and Gestalt IT.  It was a great event and a lot of information was covered in two days of presentations.  I’ll be discussing the products and vendors that sponsored the event over the next few blogs starting with this one on Intel.  Check out the official page to view all of the delegates and find links to the recordings etc.

Intel presented both their Ethernet NIC and storage I/O strategy as well as a processor update and public road map, this post will focus on the Ethernet and I/O presentation.

Intel began the presentation with an overview of the data center landscape and a description of the move towards converged I/O infrastructure, meaning storage, traditional LAN and potentially High Performance Computing (HPC) on the same switches and cables.  Anyone familiar with me or this site knows that I am a fan and supporter of converging the network infrastructure to reduce overall cost and complexity as well as provide more flexibility to data center I/O so I definitely liked this messaging.  Next was a discussion of iSCSI and its tradition of being used as a consolidation tool.


iSCSI has been used for years in order to provide a mechanism for consolidated block storage data without the need for a separate physical network.  Most commonly iSCSI has been deployed as a low-cost alternative to Fibre Channel.  Its typically been used in the SMB space and for select applications in larger datacenters.  iSCSI was previously limited to 1 Gigabit pipes (prior to the 10GE ratification) and it also suffers from higher latency and lower throughput than Fibre Channel.  The beauty of iSCSI is the ability to use existing LAN infrastructure and traditional NICs to provide block access to shared disk, the Achilles heal is performance.  Because of this cost has always been the primary deciding factor to use iSCSI. For more information on iSCSI see my post on storage protocols:

In order to increase the performance of iSCSI and decrease the overhead on the system processor(s) the industry developed iSCSI Host Bus Adapters (HBA) which offload the protocol overhead to the I/O card hardware.  These were not widely adopted due to the cost of the cards, this means that a great deal of iSCSI implementations rely on a protocol stack in the operating system (OS.) 

Intel then drew parallels to doing the same with FCoE via the FCoE software stack available for Windows and included in current Linux kernels.  The issue with drawing this parallel is that iSCSI is a mid-market technology that sacrifices some performance and reliability for cost, whereas FCoE is intended to match/increase the performance and reliability of FC while utilizing Ethernet as the transport.  This means that when looking at FCoE implementations the additional cost of specialized I/O hardware makes sense to gain the additional performance and reduce the CPU overhead.

Intel also showed some performance testing of FCoE software stack versus hardware offload using a CNA.  The IOPS they showed were quite impressive for a software stack, but IOPS isn’t the only issue.  The other issue is protocol overhead on the processor.Their testing showed an average of about 6% overhead for the software stack.  6% is low but we were only being shown one set of test criteria for a specific workload.  Additionally we were not provided the details of the testing criteria.  Other tests I’ve seen of the software stack are about 2 years old and show very comparable CPU utilization for FCoE software stack and Generation I CNAs for 8 KB reads, but a large disparity as the block size increased (CPU overhead became worse and worse for the software stack.)  In order to really understand the implications of utilizing a software stack Intel will need to publish test numbers under multiple test conditions:

  • Sequential and random
  • Various read and write combinations
  • Various block sizes
  • Mixed workloads of FCoE and other Ethernet based traffic

I’ve since located the test Intel referenced from Demartek.  It can be obtained here (  Notice that in the forward Demartek states the importance of CPU utilization data and stresses that they don’t cherry pick data then provides CPU utilization data only for the Microsoft Exchange simulation through JetStress, not for the SQLIO simulation at various block sizes.  I find that you can learn more from the data not shown in vendor sponsored testing, than the data shown.

Even if we were to make two big assumptions: Software stack IOPS are comparable to CNA hardware, and additional CPU utilization is less than or equal to 6% would you want to add an additional 6% CPU overhead to your virtual hosts?  The purpose of virtualization is to come as close as possible to full hardware utilization via placing multiple workloads on a single server.  In that scenario adding additional processor overhead seems short sighted.

The technical argument for doing this is two fold:

  • Saving cost on specialized I/O hardware
  • Processing capacity evolves faster than I/O offload capacity and speeds mainly due to economies of scale therefore your I/O performance will increase with each processor refresh using a software stack

If you’re looking to save cost and are comfortable with the processor and performance overhead then there is no major issue with using the software stack.  That being said if you’re really trying to maximize performance and or virtualization ratios you want to squeeze every drop you can out of the processor for the virtual machines.  As far as the second point of processor capacity goes, it most definitely rings true but with each newer faster processor you buy you’re losing that assumed 6% off the top for protocol overhead.  That isn’t acceptable to me.

The Other Problem:

FC and FCoE have been designed to carry native SCSI commands and data and treat them as SCSI expects, most importantly frames are not dropped (lossless network.)  The flow control mechanism FC uses for this is called buffer-to-buffer credits (B2B.)  This is a hop-to-hop mechanism implemented in hardware on HBAs/CNAs and FC switches.  In this mechanism when two ports initialize a link they exchange a number of buffer spaces they have dedicated to the device on the other side of the link based on agreed frame size. When any device sends a frame it is responsible for keeping track of the buffer space available on the receiving device based on these credits.  When a device receives a frame and has processed it (removing it from the buffer) it returns an R_RDY similar to a TCP ACK which lets the sending device know that a buffer has been freed.  For more information on this see the buffer credits section of my previous post:  This mechanism ensures that a device never sends a frame that the receiving device does not have sufficient buffer space for and this is implemented in hardware. 

On FCoE networks we’re relying on Ethernet as the transport so B2B credits don’t exist.  Instead we utilize Priority Flow Control (PFC) which is a priority based implementation of 802.3x pause.  For more information on DCB see my previous post:  PFC is handled by DCB capable NICs and will handle sending a pause before the NIC buffers overflow.  This provides for a lossless mechanism that can be translated back into B2B credits at the FC edge. 

The issue here with the software stack is that while the DCB capable NIC ensures the frame is not dropped on the wire via PFC it has to pass processing across the PCIe bus to the processor and allow the protocol to be handled by the OS kernel.  This adds layers in which the data could be lost or corrupted that don’t exist with a traditional HBA or CNA.


FCoE software stack is not a sufficient replacement for a CNA.  Emulex, Broadcom, Qlogic and Brocade are all offloading protocol to the card to decrease CPU utilization and increase performance.  HP has recently announced embedding Emulex OneConnect adapters, which offload iSCSI, TCP and FCoE, on the system board.  That’s a lot of backing for protocol offload with only Intel standing on the other side of the fence.  My guess is that Intel’s end goal is to sell more processors, and utilizing more cycles for protocol processing makes sense.  Additionally Intel doesn’t have a proven FC stack to embed on a card and the R/D costs would be significant, so throwing it in the kernel and selling their standard NIC makes sense to the business.  Lastly don’t forget storage vendor qualification, Intel has an uphill battle getting an FCoE software stack on the approved list for the major storage vendors.

Full Discloser:  Tech Field Day is organized by the folks at Gestalt IT and paid for by the presenters of the event.  My travel, meals and accommodations were paid for by the event but my opinions negative or positive are all mine.

GD Star Rating

Inter-Fabric Traffic in UCS

It’s been a while since my last post, time sure flies when you’re bouncing all over the place busy as hell.  I’ve been invited to Tech Field Day next week and need to get back in the swing of things so here goes.

In order for Cisco’s Unified Computing System (UCS) to provide the benefits, interoperability and management simplicity it does, the networking infrastructure is handled in a unique fashion.  This post will take a look at that unique setup and point out some considerations to focus on when designing UCS application systems.  Because Fibre Channel traffic is designed to be utilized with separate physical fabrics exactly as UCS does this post will focus on Ethernet traffic only.   This post focuses on End Host mode, for the second art of this post focusing on switch mode use this link:  Let’s start with taking a look at how this is accomplished:

UCS Connectivity


In the diagram above we see both UCS rack-mount and blade servers connected to a pair of UCS Fabric Interconnects which handle the switching and management of UCS systems.  The rack-mount servers are shown connected to Nexus 2232s which are nothing more than remote line-cards of the fabric interconnects known as Fabric Extenders.  Fabric Extenders provide a localized connectivity point (10GE/FCoE in this case) without expanding management points by adding a switch.  Not shown in this diagram are the I/O Modules (IOM) in the back of the UCS chassis.  These devices act in the same way as the Nexus 2232 meaning they extend the Fabric Interconnects without adding management or switches.  Next let’s look at a logical diagram of the connectivity within UCS.

UCS Logical Connectivity

imageIn the last diagram we see several important things to note about UCS Ethernet networking:

  • UCS is a Layer 2 system meaning only Ethernet switching is provided within UCS.  This means that any routing (L3 decisions) must occur upstream.
  • All switching occurs at the Fabric Interconnect level.  This means that all frame forwarding decisions are made on the Fabric Interconnect and no intra-chassis switching occurs.
  • The only connectivity between Fabric Interconnects is the cluster links.  Both Interconnects are active from a switching perspective but the management system known as UCS Manger (UCSM) is an Active/Standby clustered application.  This clustering occurs across these links.  These links do not carry data traffic which means that there is no inter-fabric communication within the UCS system and A to B traffic must be handled upstream.

At first glance handling all switching at the Fabric Interconnect level looks as though it would add latency (inter-blade traffic must be forwarded up to the fabric interconnects then back to the blade chassis.)  While this is true, UCS hardware is designed for low latency environments such as High Performance Computing (HPC.)  Because of this design goal all components operate at very low latency.  The Fabric Interconnects themselves operate at approximately 3.2us (micro seconds), and the Fabric Extenders operate at about 1.5us.  This means total roundtrip time blade to blade is approximately 6.2us right inline or lower than most Access Layer solutions.  Equally as important with this design switching between any two blades/servers in the system will occur at the same speed regardless of location (consistent predictable latency.)

The question then becomes how is traffic between fabrics handled?  The answer is that traffic between fabrics must be handled upstream (next hop device(s) shown in the diagrams as the LAN cloud.)  This is an important consideration when designing UCS implementations and selecting a redundancy/load-balancing behavior for server NICs.

Let’s take a look at two examples, first a bare-metal OS (Windows, Linux, etc.) next a VMware server.

Bare-Metal Operating System

image In the diagram above we see two blades which have been configured in an active/passive NIC teaming configuration using separate fabrics (within UCS this is done within the service profile.)  This means that blade 1 is using Fabric A as a primary path with B available for failover and blade 2 is doing the opposite.  In this scenario any traffic sent from blade 1 to blade 2 would have to be handled by the upstream device depicted by the LAN cloud.  This is not necessarily an issue for the occasional frame but will impact performance for servers that communicate frequently.


For bare-metal operating systems analyze the blade to blade communication requirements and ensure chatty server to server applications are utilizing the same fabric as a primary:

  • When using a card that supports hardware failover provide only one vNIC (made redundant through HW failover) and place its primary path on the same fabric as any other servers that communicate frequently.
  • When using cards that don’t support HW failover use active/passive NIC teaming and ensure that the active side is set to the same fabric for servers that communicate frequently.

VMware Servers


In the above diagram we see that the connectivity is the same from a physical perspective but in this case we are using VMware as the operating system.  In this case a vSwitch, vDS, or Cisco Nexus 1000v will be used to connect the VMs within the Hypervisor.  Regardless of VMware switching option the case will be the same.  It is necessary to properly design the the virtual switching environment to ensure that server to server communication is handled in the most efficient way possible.


  • For half-width blades requiring 10GE or less total throughput, or full-width blades requiring 20GE or less total throughput provide a single vNIC with hardware failover if available or use an active/passive NIC configuration for the VMware switching.
  • For blades requiring the total active/active throughput of available NICs determine application profiles and utilize port-groups (port-profiles with Nexus 1000v) to ensure active paths are the same for application groups which communicate heavily.


UCS utilizes a unique switching design in order to provide high bandwidth, low-latency switching with a greatly reduced management architecture compared to competing solutions.  The networking requires a  thorough understanding in order to ensure architectural designs provide the greatest available performance.  Ensuring application groups that utilize high levels of server to server traffic are placed on the same path will provide maximum performance and minimal additional overhead on upstream networking equipment.

GD Star Rating