Intel’s Betting the Storage I/O Farm on the CPU


I had the privilege of attending Tech Field Day 4 in San Jose this week as a delegate thanks to Stephen Foskett and Gestalt IT.  It was a great event and a lot of information was covered in two days of presentations.  I'll be discussing the products and vendors that sponsored the event over the next few blogs, starting with this one on Intel.  Check out the official page to view all of the delegates and find links to the recordings, etc.: http://gestaltit.com/field-day/2010-san-jose/.

Intel presented their Ethernet NIC and storage I/O strategy as well as a processor update and public road map; this post will focus on the Ethernet and I/O presentation.

Intel began the presentation with an overview of the data center landscape and a description of the move towards converged I/O infrastructure, meaning storage, traditional LAN, and potentially High Performance Computing (HPC) traffic on the same switches and cables.  Anyone familiar with me or this site knows that I am a fan and supporter of converging the network infrastructure to reduce overall cost and complexity as well as provide more flexibility to data center I/O, so I definitely liked this messaging.  Next was a discussion of iSCSI and its tradition of being used as a consolidation tool.

iSCSI:

iSCSI has been used for years to provide a mechanism for consolidated block storage data without the need for a separate physical network.  Most commonly iSCSI has been deployed as a low-cost alternative to Fibre Channel.  It's typically been used in the SMB space and for select applications in larger data centers.  iSCSI was previously limited to 1 Gigabit pipes (prior to the 10GE ratification) and it also suffers from higher latency and lower throughput than Fibre Channel.  The beauty of iSCSI is the ability to use existing LAN infrastructure and traditional NICs to provide block access to shared disk; the Achilles' heel is performance.  Because of this, cost has always been the primary deciding factor for using iSCSI. For more information on iSCSI see my post on storage protocols: http://www.definethecloud.net/storage-protocols.

In order to increase the performance of iSCSI and decrease the overhead on the system processor(s), the industry developed iSCSI Host Bus Adapters (HBAs), which offload the protocol processing to the I/O card hardware.  These were not widely adopted due to the cost of the cards, which means that a great many iSCSI implementations rely on a protocol stack in the operating system (OS.)

Intel then drew parallels to doing the same with FCoE via the FCoE software stack available for Windows and included in current Linux kernels.  The issue with drawing this parallel is that iSCSI is a mid-market technology that sacrifices some performance and reliability for cost, whereas FCoE is intended to match or exceed the performance and reliability of FC while utilizing Ethernet as the transport.  This means that when looking at FCoE implementations, the additional cost of specialized I/O hardware makes sense in order to gain the additional performance and reduce the CPU overhead.

Intel also showed some performance testing of the FCoE software stack versus hardware offload using a CNA.  The IOPS they showed were quite impressive for a software stack, but IOPS isn't the only issue.  The other issue is protocol overhead on the processor.  Their testing showed an average of about 6% overhead for the software stack.  6% is low, but we were only being shown one set of test criteria for a specific workload, and we were not provided the details of the testing criteria.  Other tests I've seen of the software stack are about two years old and show very comparable CPU utilization for the FCoE software stack and Generation 1 CNAs for 8 KB reads, but a large disparity as the block size increased (CPU overhead became worse and worse for the software stack.)  In order to really understand the implications of utilizing a software stack, Intel will need to publish test numbers under multiple test conditions.

I've since located the test Intel referenced from Demartek.  It can be obtained here (http://www.demartek.com/Reports_Free/Demartek_Intel_10GbE_FCoE_iSCSI_Adapter_Performance_Evaluation_2010-09.pdf.)  Notice that in the foreword Demartek states the importance of CPU utilization data and stresses that they don't cherry-pick data, then provides CPU utilization data only for the Microsoft Exchange simulation through JetStress, not for the SQLIO simulation at various block sizes.  I find that you can learn more from the data not shown in vendor-sponsored testing than from the data shown.

Even if we were to make two big assumptions, that software stack IOPS are comparable to CNA hardware and that the additional CPU utilization is less than or equal to 6%, would you want to add an additional 6% CPU overhead to your virtual hosts?  The purpose of virtualization is to come as close as possible to full hardware utilization by placing multiple workloads on a single server.  In that scenario adding additional processor overhead seems short-sighted.
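To put that in perspective, here's a quick back-of-the-envelope calculation.  The host size and consolidation ratios below are hypothetical numbers I picked purely for illustration, not figures from Intel's or Demartek's testing:

```python
# Hypothetical numbers for illustration only; adjust for your own hosts.
cores_per_host = 16          # e.g. a 2-socket, 8-core-per-socket server
vcpus_per_vm = 2             # average vCPU count per virtual machine
overcommit_ratio = 4         # vCPU-to-physical-core overcommit

protocol_overhead = 0.06     # the ~6% CPU cost cited for the FCoE software stack

usable_cores = cores_per_host * (1 - protocol_overhead)
vms_without_overhead = cores_per_host * overcommit_ratio / vcpus_per_vm
vms_with_overhead = usable_cores * overcommit_ratio / vcpus_per_vm

print(f"VM capacity without software FCoE: {vms_without_overhead:.0f}")   # 32
print(f"VM capacity with ~6% protocol overhead: {vms_with_overhead:.1f}") # ~30
# Roughly two VMs per host lost at these ratios, multiplied across every host in the cluster.
```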

The technical argument for doing this is twofold: one, the software stack saves the cost of specialized CNA hardware; two, modern processors have plenty of spare capacity to absorb the protocol processing.

If you're looking to save cost and are comfortable with the processor and performance overhead, then there is no major issue with using the software stack.  That being said, if you're really trying to maximize performance and/or virtualization ratios, you want to squeeze every drop you can out of the processor for the virtual machines.  As far as the second point of processor capacity goes, it most definitely rings true, but with each newer, faster processor you buy you're losing that assumed 6% off the top for protocol overhead.  That isn't acceptable to me.

The Other Problem:

FC and FCoE have been designed to carry native SCSI commands and data and treat them as SCSI expects; most importantly, frames are not dropped (lossless network.)  The flow control mechanism FC uses for this is called buffer-to-buffer credits (B2B.)  This is a hop-by-hop mechanism implemented in hardware on HBAs/CNAs and FC switches.  When two ports initialize a link they exchange the number of buffer spaces each has dedicated to the device on the other side of the link, based on an agreed frame size. When a device sends a frame it is responsible for keeping track of the buffer space available on the receiving device based on these credits.  When a device receives a frame and has processed it (removing it from the buffer) it returns an R_RDY, similar to a TCP ACK, which lets the sending device know that a buffer has been freed.  For more information on this see the buffer credits section of my previous post: http://www.definethecloud.net/whats-the-deal-with-quantized-congestion-notification-qcn.  This mechanism ensures that a device never sends a frame the receiving device does not have sufficient buffer space for, and it is implemented in hardware.
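For illustration, here's a rough software sketch of the sending side of that credit accounting.  The names and structure are mine; real HBAs and switches implement this logic in silicon, which is exactly the point:

```python
class FcLink:
    """Toy model of Fibre Channel buffer-to-buffer credit flow control.

    Credits are exchanged at link initialization; the sender spends one
    credit per frame transmitted and only regains it when the receiver
    returns an R_RDY after freeing the corresponding buffer.
    """

    def __init__(self, advertised_credits):
        # Number of receive buffers the far side advertised at link init.
        self.credits = advertised_credits

    def can_send(self):
        # A frame may only be transmitted if at least one credit remains,
        # guaranteeing the receiver has buffer space (lossless behavior).
        return self.credits > 0

    def send_frame(self):
        if not self.can_send():
            raise RuntimeError("No B2B credits: sender must wait, frame is never dropped")
        self.credits -= 1

    def receive_r_rdy(self):
        # Receiver processed a frame and freed a buffer; credit is returned.
        self.credits += 1


link = FcLink(advertised_credits=8)
for _ in range(8):
    link.send_frame()      # all advertised buffers now in use
print(link.can_send())     # False -- sender stalls instead of overrunning the receiver
link.receive_r_rdy()       # R_RDY frees one buffer
print(link.can_send())     # True
```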

On FCoE networks we're relying on Ethernet as the transport, so B2B credits don't exist.  Instead we utilize Priority Flow Control (PFC), which is a priority-based implementation of 802.3x pause.  For more information on DCB see my previous post: http://www.definethecloud.net/data-center-bridging-exchange.  PFC is handled by DCB-capable NICs and will send a pause before the NIC buffers overflow.  This provides a lossless mechanism that can be translated back into B2B credits at the FC edge.
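For contrast with the credit model above, here's a loose sketch of what the receiving NIC does with PFC.  The watermark values are made up for illustration; real implementations tune these per buffer and traffic class:

```python
def check_pfc(buffer_fill_bytes, buffer_size_bytes, priority,
              high_watermark=0.8, low_watermark=0.5):
    """Decide whether to send a PFC pause or resume for one priority class.

    Unlike B2B credits, the receiver reacts to its own buffer fill level and
    pauses the sender for that traffic class before the buffer can overflow;
    other priorities on the same link keep flowing.
    """
    fill = buffer_fill_bytes / buffer_size_bytes
    if fill >= high_watermark:
        return f"send PAUSE for priority {priority}"   # stop this class only
    if fill <= low_watermark:
        return f"send RESUME for priority {priority}"  # pause frame with time 0
    return "no action"


print(check_pfc(90_000, 100_000, priority=3))   # send PAUSE for priority 3
print(check_pfc(40_000, 100_000, priority=3))   # send RESUME for priority 3
```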

The issue here with the software stack is that while the DCB-capable NIC ensures the frame is not dropped on the wire via PFC, it then has to pass processing across the PCIe bus to the processor and allow the protocol to be handled by the OS kernel.  This adds layers in which the data could be lost or corrupted that don't exist with a traditional HBA or CNA.

Summary:

An FCoE software stack is not a sufficient replacement for a CNA.  Emulex, Broadcom, QLogic and Brocade are all offloading protocol processing to the card to decrease CPU utilization and increase performance.  HP has recently announced embedding Emulex OneConnect adapters, which offload iSCSI, TCP and FCoE, on the system board.  That's a lot of backing for protocol offload with only Intel standing on the other side of the fence.  My guess is that Intel's end goal is to sell more processors, and utilizing more cycles for protocol processing serves that goal.  Additionally, Intel doesn't have a proven FC stack to embed on a card and the R&D costs would be significant, so throwing it in the kernel and selling their standard NIC makes sense to the business.  Lastly, don't forget storage vendor qualification: Intel has an uphill battle getting an FCoE software stack on the approved list for the major storage vendors.

Full Disclosure:  Tech Field Day is organized by the folks at Gestalt IT and paid for by the presenters of the event.  My travel, meals and accommodations were paid for by the event but my opinions, negative or positive, are all mine.

Storage Protocols

Storage is a major consideration for cloud initiatives: what type of disk, which vendor, and, as importantly, which protocol?  Experts will tout one over the other based on cost, performance, throughput, etc.  Let's take a look at the major storage protocols at play in the data center:

Small Computer System Interface (SCSI):

SCSI is the dominant block-level access method for disk in the data center.  Blocks are typically the smallest unit that can be read from or written to on a disk; they exist in various sizes depending on disk type and usage.  Block-level access means that the server can directly access the disk blocks without the need for a file system in place on top of them; this is the opposite of the file-based storage discussed later.

SCSI has been in use since the early 1980s and was originally used to move data within a single server.  The operating system writes data using the SCSI protocol to a SCSI drive controller, which manages one or more devices on a SCSI cable within a system chassis.  The SCSI controller ensures that only one device is active on the cable at any time, which prevents contention on the SCSI bus.  Because SCSI was managed by a single controller and contained within a system, the chances of data loss or contention were minimal; this meant that SCSI did not require control mechanisms to handle data loss or contention as networked protocols do. SCSI itself is still widely used in its native format but it has also been encapsulated into other protocols for use within storage networks for consolidated storage.
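To make the block-level idea concrete, here's a minimal sketch of raw block access from user space.  The device path is hypothetical and this needs root privileges; real applications normally reach blocks through the SCSI driver stack rather than like this:

```python
import os

# Hypothetical block device path; block-level access addresses raw blocks
# directly, with no file system interpreting the bytes.
DEVICE = "/dev/sdb"
BLOCK_SIZE = 512            # common logical block size; 4 KB on newer disks

fd = os.open(DEVICE, os.O_RDONLY)
try:
    # Read logical block 2048 by seeking to its byte offset.
    data = os.pread(fd, BLOCK_SIZE, 2048 * BLOCK_SIZE)
    print(f"Read {len(data)} bytes from block 2048")
finally:
    os.close(fd)
```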

Fibre Channel (FC):

Fibre Channel was designed to extend the functionality of SCSI into point-to-point, loop, and switched topologies.  This allows for longer distances as well as storage consolidation.  FC encapsulates SCSI data and Command Descriptor Blocks (CDBs) into the payload of Fibre Channel frames.  Fibre Channel networks provide the addressing, routing, and flow control required to support SCSI data.  Additionally, Fibre Channel networks are designed to meet the needs of SCSI by providing 'lossless,' in-order delivery.  This means that in a stable network FC frames will not be dropped and are delivered in order, ensuring that the Upper Layer Protocols (ULP) will not be forced to reorder or resend frames.
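As a loose illustration of that encapsulation (the header here is just a placeholder and the layout is heavily simplified; a real FC frame and FCP information unit carry far more structure):

```python
import struct

# SCSI READ(10) CDB: opcode 0x28, LBA 2048, transfer length 8 blocks.
# This is the command FC carries -- 10 bytes of plain SCSI.
scsi_cdb = struct.pack(">BBIBHB", 0x28, 0, 2048, 0, 8, 0)

# The CDB rides in the payload of a Fibre Channel frame, which adds its own
# 24-byte header (R_CTL, D_ID, S_ID, TYPE, sequence/exchange fields, etc.).
fc_header = b"\x00" * 24          # placeholder only, not a real header layout
fc_frame = fc_header + scsi_cdb   # FC supplies addressing and flow control around SCSI

print(len(fc_frame), "byte frame carrying a", len(scsi_cdb), "byte SCSI CDB")
```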

Fibre Channel networks are typically carried over fiber-optic links on dedicated infrastructures.  These infrastructures are traditionally built in pairs as exact mirrors of one another, which provides complete physical redundancy end-to-end.  These networks also provide high bandwidth and low latency.  FC networks come in 1/2/4/8 Gbps speeds with 16/32 Gbps in the works.  Additionally, 10 Gbps FC links are typically available on a proprietary basis for links between switches.

Internet Small Computer System Interface (iSCSI):

iSCSI takes SCSI data and CDBs and places them in the payload of TCP/IP packets.  This allows the SCSI protocol to be extended across existing IP infrastructures.  While IP is routable within the data center and across the WAN, iSCSI is not traditionally used/supported across routed boundaries (exceptions do exist.)  The draw of iSCSI has been that storage data can be extended across the existing infrastructure with minimal additional cost.

iSCSI has not gained the market share many have predicted over the years due to flaws in the protocol and limitations of traditional Ethernet-based data center networks.  Until the standardization of 10 Gigabit Ethernet, most data centers relied on 1GE links which were typically already saturated.  This meant implementing iSCSI required new switching infrastructure.  10GE has changed the bandwidth limits but still not catapulted iSCSI into the mainstream.  There are several reasons for this, one being the large existing investment in Fibre Channel, and another being the iSCSI protocol itself.

The problem with iSCSI from a protocol standpoint is that it takes the SCSI protocol, which expects lossless, in-order delivery, and places it in TCP/IP packets, which are designed to support heterogeneous WAN networks and frequently experience packet loss and out-of-order delivery.  This is done without providing any additional tools to either SCSI or TCP/IP for handling the SCSI payloads in the expected fashion.  This in no way means iSCSI is unusable or should be written off; it just means that additional considerations must be made when designing for iSCSI, especially in Enterprise or larger environments.

In order to provide proper performance for iSCSI on shared networks, Quality of Service (QoS), physical architecture, and jumbo frame support must be taken into account.  Because of these considerations many iSCSI networks have traditionally been placed on separate network hardware from the data center LAN (isolated iSCSI networks.)  This has minimized some of the benefits of consolidating on a single protocol.  With 10 Gigabit Ethernet and the standardization of Data Center Bridging (DCB), iSCSI looks more promising for a greater audience.  For more information on DCB see my previous post (http://www.definethecloud.net/?p=31.)
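As one small example of that kind of design consideration, here's a quick sanity check that jumbo frames are actually enabled on the interfaces carrying iSCSI.  The interface names are hypothetical; on Linux the configured MTU is exposed under /sys:

```python
from pathlib import Path

# Hypothetical interface list for an iSCSI host; a mismatched MTU anywhere in
# the path causes fragmentation or silently negotiates standard frame sizes.
ISCSI_INTERFACES = ["eth2", "eth3"]
JUMBO_MTU = 9000

for iface in ISCSI_INTERFACES:
    mtu_file = Path(f"/sys/class/net/{iface}/mtu")
    mtu = int(mtu_file.read_text()) if mtu_file.exists() else None
    status = "OK" if mtu and mtu >= JUMBO_MTU else "check switch/NIC config"
    print(f"{iface}: mtu={mtu} -> {status}")
```

The same check belongs on every switch port in the path, not just the host NICs.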

Fibre Channel over Ethernet (FCoE):

FCoE was ratified in 2009 and provides the functionality for moving native Fibre Channel across consolidated Ethernet networks.  FCoE relies on the DCB standards referenced above.  FCoE encapsulates full Fibre Channel frames inside Ethernet jumbo frame payloads.  Utilizing jumbo frames ensures that the FC frame is not fragmented or changed in any way.  The FCoE and DCB standards provide a robust tool set for consolidating existing Fibre Channel workloads on shared 10GE networks while providing the lossless, in-order delivery SCSI expects.  FCoE does not modify the existing Fibre Channel protocol suite and allows for the same management model, including zoning, LUN masking, etc.  FCoE has started gaining ground over the last two years, pushed by several large hardware vendors in the storage, network, and server markets.  For more information on FCoE see my post (http://www.definethecloud.net/?p=80.)

Common Internet File System (CIFS):

CIFS is a file-based storage protocol based on Server Message Block (SMB.)  This is a shared storage protocol typically used in Microsoft environments for file sharing.  Windows-based file shares rely on CIFS as the transfer protocol for file-level data.  File-based storage relies on an underlying file system such as FAT32, XFS, NTFS or otherwise, which differs from block-based storage, which does not.  File-level storage is an excellent medium for some applications but is not traditionally effective for others.  When an application needs direct block access to disk, file-based storage is not appropriate.  Deployments that fall into this category include some databases and most operating systems.

Network File System (NFS):

NFS is another file-based storage protocol.  NFS is traditionally used in Linux and Unix environments.  NFS is also a widely used protocol for VMware environments and can offer several benefits for virtual machine storage.  As a file-based storage protocol, NFS experiences many of the same limitations stated for CIFS above.

Hypertext Transfer Protocol (HTTP) and others:

When the cloud discussion leaves the data center (private/internal cloud) and moves up to the service provider level, such as Google, Amazon, or the telcos, the protocols listed above may not have the necessary scalability.  When you begin talking about supporting thousands of customers with multiple terabytes each, traditional storage protocols may not suffice.  This has to do with both the scalability of the systems and the administration of the disk.  iSCSI and FC both require a fair amount of management for the RAID, volumes, and LUNs, whereas CIFS and NFS require a fair amount for security and volumes.  Protocols such as HTTP-based storage are being used to simplify storage configuration and increase its scalability.
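As a rough sketch of what HTTP-based storage looks like in practice (the endpoint and object names below are made up; real services such as Amazon S3 layer authentication on top of this basic pattern):

```python
import requests  # widely used third-party HTTP library

# Hypothetical object-storage endpoint; there are no LUNs, volumes, or file
# systems to manage -- each object is simply written and read by URL over HTTP.
BASE_URL = "https://storage.example.com/my-bucket"
OBJECT_URL = f"{BASE_URL}/backups/db-dump.tar.gz"

# Store an object.
with open("db-dump.tar.gz", "rb") as f:
    resp = requests.put(OBJECT_URL, data=f)
    resp.raise_for_status()

# Retrieve it again.
resp = requests.get(OBJECT_URL)
resp.raise_for_status()
print(f"Fetched {len(resp.content)} bytes over plain HTTP")
```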

Which is the right protocol to use when moving to the cloud?  Obviously there is only one answer!  As always in IT, 'it depends.'  Each protocol has its uses, benefits and drawbacks.  The most important thing to remember is that most environments can benefit from more than one, or all, of these protocols.  Every application is different and any given protocol may have advantages for a particular app.  The only universal truth in cloud storage is that protocol flexibility will be key.