I had the privilege of attending Tech Field Day 4 in San Jose this week as a delegate thanks to Stephen Foskett and Gestalt IT. It was a great event and a lot of information was covered in two days of presentations. I'll be discussing the products and vendors that sponsored the event over the next few blogs, starting with this one on Intel. Check out the official page to view all of the delegates and find links to the recordings: http://gestaltit.com/field-day/2010-san-jose/.
Intel presented both their Ethernet NIC and storage I/O strategy as well as a processor update and public road map; this post will focus on the Ethernet and I/O presentation.
Intel began the presentation with an overview of the data center landscape and a description of the move towards converged I/O infrastructure, meaning storage, traditional LAN, and potentially High Performance Computing (HPC) traffic on the same switches and cables. Anyone familiar with me or this site knows that I am a fan and supporter of converging the network infrastructure to reduce overall cost and complexity, as well as to provide more flexibility for data center I/O, so I definitely liked this messaging. Next was a discussion of iSCSI and its tradition of being used as a consolidation tool.
iSCSI has been used for years to provide consolidated block storage without the need for a separate physical network. Most commonly it has been deployed as a low-cost alternative to Fibre Channel, typically in the SMB space and for select applications in larger data centers. iSCSI was previously limited to 1 Gigabit pipes (prior to the 10GE ratification), and it also suffers from higher latency and lower throughput than Fibre Channel. The beauty of iSCSI is the ability to use existing LAN infrastructure and traditional NICs to provide block access to shared disk; its Achilles' heel is performance. Because of this, cost has always been the primary deciding factor in choosing iSCSI. For more information on iSCSI see my post on storage protocols: http://www.definethecloud.net/storage-protocols.
In order to increase the performance of iSCSI and decrease the overhead on the system processor(s), the industry developed iSCSI Host Bus Adapters (HBAs), which offload the protocol processing to the I/O card hardware. These were not widely adopted due to the cost of the cards, which means that a great deal of iSCSI implementations rely on a protocol stack in the operating system (OS).
Intel then drew parallels to doing the same with FCoE via the FCoE software stack available for Windows and included in current Linux kernels. The issue with drawing this parallel is that iSCSI is a mid-market technology that sacrifices some performance and reliability for cost, whereas FCoE is intended to match or exceed the performance and reliability of FC while utilizing Ethernet as the transport. This means that for FCoE implementations, the additional cost of specialized I/O hardware makes sense in order to gain the additional performance and reduce the CPU overhead.
Intel also showed some performance testing of the FCoE software stack versus hardware offload using a CNA. The IOPS they showed were quite impressive for a software stack, but IOPS isn't the only issue; the other issue is protocol overhead on the processor. Their testing showed an average of about 6% overhead for the software stack. 6% is low, but we were only shown one set of test criteria for a specific workload, and we were not provided the details of the testing criteria. Other tests I've seen of the software stack are about two years old and show very comparable CPU utilization for the FCoE software stack and Generation I CNAs for 8 KB reads, but a large disparity as the block size increased (CPU overhead became worse and worse for the software stack). In order to really understand the implications of utilizing a software stack, Intel will need to publish test numbers under multiple test conditions:
- Sequential and random
- Various read and write combinations
- Various block sizes
- Mixed workloads of FCoE and other Ethernet based traffic
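To give a sense of scope, the conditions above multiply out into a sizable test matrix. The sketch below is purely illustrative; the specific read/write mixes and block sizes are my own assumptions, not Intel's or Demartek's test parameters:

```python
from itertools import product

# Hypothetical test matrix for the conditions listed above.
# All parameter values are illustrative assumptions.
access_patterns = ["sequential", "random"]
read_percentages = [100, 70, 50, 30, 0]      # read/write combinations
block_sizes_kb = [4, 8, 64, 256, 512]        # block sizes in KB
traffic_mixes = ["FCoE only", "FCoE + LAN"]  # mixed Ethernet workloads

matrix = list(product(access_patterns, read_percentages,
                      block_sizes_kb, traffic_mixes))
print(f"{len(matrix)} test cases")  # 2 * 5 * 5 * 2 = 100 combinations
for pattern, reads, block, mix in matrix[:3]:
    print(f"{pattern}, {reads}% read, {block} KB blocks, {mix}")
```

Even a modest sampling of each dimension yields a hundred test cases, which is why a single published data point tells us so little.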
I've since located the test Intel referenced, from Demartek. It can be obtained here: http://www.demartek.com/Reports_Free/Demartek_Intel_10GbE_FCoE_iSCSI_Adapter_Performance_Evaluation_2010-09.pdf. Notice that in the foreword Demartek states the importance of CPU utilization data and stresses that they don't cherry-pick data, yet the report provides CPU utilization data only for the Microsoft Exchange simulation through JetStress, not for the SQLIO simulation at various block sizes. I find that you can learn more from the data not shown in vendor-sponsored testing than from the data shown.
Even if we were to make two big assumptions, that software stack IOPS are comparable to CNA hardware and that the additional CPU utilization is less than or equal to 6%, would you want to add 6% CPU overhead to your virtual hosts? The purpose of virtualization is to come as close as possible to full hardware utilization by placing multiple workloads on a single server. In that scenario, adding additional processor overhead seems short-sighted.
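To put that 6% in concrete terms, here's a quick back-of-the-envelope calculation. The host size and consolidation target are assumed numbers for illustration, not figures from Intel's testing:

```python
# Back-of-the-envelope: what a fixed ~6% protocol overhead costs a
# consolidated virtualization host. Core count and utilization target
# are illustrative assumptions.
total_cores = 16
target_utilization = 0.90   # aggressive consolidation target
protocol_overhead = 0.06    # software-stack overhead cited in the demo

cores_lost = total_cores * protocol_overhead
cores_for_vms = total_cores * (target_utilization - protocol_overhead)

print(f"Core-equivalents consumed by protocol processing: {cores_lost:.2f}")
print(f"Core-equivalents left for guest workloads: {cores_for_vms:.2f}")
```

On these assumed numbers, roughly a full core of a 16-core host is burned on protocol processing, capacity that would otherwise run virtual machines.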
The technical argument for doing this is twofold:
- Saving cost on specialized I/O hardware
- Processing capacity evolves faster than I/O offload capacity and speeds, mainly due to economies of scale; therefore your I/O performance will increase with each processor refresh when using a software stack
If you're looking to save cost and are comfortable with the processor and performance overhead, then there is no major issue with using the software stack. That being said, if you're really trying to maximize performance and/or virtualization ratios, you want to squeeze every drop you can out of the processor for the virtual machines. As far as the second point about processor capacity goes, it most definitely rings true, but with each newer, faster processor you buy you're losing that assumed 6% off the top to protocol overhead. That isn't acceptable to me.
The Other Problem:
FC and FCoE have been designed to carry native SCSI commands and data and treat them as SCSI expects; most importantly, frames are not dropped (a lossless network). The flow control mechanism FC uses for this is called buffer-to-buffer credits (B2B). This is a hop-to-hop mechanism implemented in hardware on HBAs/CNAs and FC switches. In this mechanism, when two ports initialize a link they exchange the number of buffer spaces each has dedicated to the device on the other side of the link, based on an agreed frame size. When a device sends a frame it is responsible for keeping track of the buffer space available on the receiving device based on these credits. When a device receives a frame and has processed it (removing it from the buffer), it returns an R_RDY, similar to a TCP ACK, which lets the sending device know that a buffer has been freed. For more information on this see the buffer credits section of my previous post: http://www.definethecloud.net/whats-the-deal-with-quantized-congestion-notification-qcn. This mechanism ensures that a device never sends a frame for which the receiving device lacks sufficient buffer space, and it is implemented in hardware.
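The credit exchange can be sketched in a few lines. This is a toy model for illustration only; the class name and credit count are my own, not anything from the FC specifications:

```python
# Toy model of Fibre Channel buffer-to-buffer (B2B) credit flow control.
# Names and numbers are illustrative assumptions.
class B2BLink:
    def __init__(self, credits):
        # Buffer spaces advertised by the receiver at link initialization.
        self.credits = credits

    def send_frame(self):
        # The sender may only transmit while it holds at least one credit,
        # so a frame is never put on the wire without a buffer waiting.
        if self.credits == 0:
            return False  # must wait; the frame is held, not dropped
        self.credits -= 1
        return True

    def r_rdy(self):
        # Receiver has processed a frame and freed a buffer.
        self.credits += 1

link = B2BLink(credits=2)
print(link.send_frame())  # True  -- first credit consumed
print(link.send_frame())  # True  -- second credit consumed
print(link.send_frame())  # False -- no credits left, sender waits
link.r_rdy()              # receiver frees a buffer
print(link.send_frame())  # True  -- transmission resumes
```

The key property the sketch shows is that loss is prevented by the sender, in hardware, before a frame ever leaves the port.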
On FCoE networks we're relying on Ethernet as the transport, so B2B credits don't exist. Instead we utilize Priority Flow Control (PFC), which is a priority-based implementation of the 802.3x pause. For more information on DCB see my previous post: http://www.definethecloud.net/data-center-bridging-exchange. PFC is handled by DCB-capable NICs, which send a pause before the NIC buffers overflow. This provides a lossless mechanism that can be translated back into B2B credits at the FC edge.
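The pause behavior can be sketched the same way. The buffer size and pause threshold below are illustrative assumptions, not values from the 802.1Qbb specification:

```python
# Toy sketch of Priority Flow Control (PFC): the receiving port pauses one
# traffic class before its buffer can overflow, so frames in that class are
# never dropped. Buffer size and threshold are illustrative assumptions.
BUFFER_SIZE = 10
PAUSE_THRESHOLD = 8  # pause while headroom remains for in-flight frames

buffered = 0
pause_sent_at = None
for frame in range(12):  # 12 frames offered to the port
    if buffered >= PAUSE_THRESHOLD:
        pause_sent_at = frame
        break  # a well-behaved sender stops transmitting on this priority
    buffered += 1

print(f"PAUSE sent before frame {pause_sent_at}")
print(f"Buffered: {buffered} of {BUFFER_SIZE} (no drops)")
```

Note that, unlike B2B credits, the pause applies per priority, so storage traffic can be held while other Ethernet traffic classes keep flowing.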
The issue with the software stack here is that while the DCB-capable NIC ensures the frame is not dropped on the wire via PFC, it then has to pass processing across the PCIe bus to the processor and allow the protocol to be handled by the OS kernel. This adds layers in which the data could be lost or corrupted that don't exist with a traditional HBA or CNA.
The FCoE software stack is not a sufficient replacement for a CNA. Emulex, Broadcom, QLogic and Brocade are all offloading protocol processing to the card to decrease CPU utilization and increase performance. HP has recently announced embedding Emulex OneConnect adapters, which offload iSCSI, TCP and FCoE, on the system board. That's a lot of backing for protocol offload, with only Intel standing on the other side of the fence. My guess is that Intel's end goal is to sell more processors, and utilizing more cycles for protocol processing serves that goal. Additionally, Intel doesn't have a proven FC stack to embed on a card and the R&D costs would be significant, so throwing it in the kernel and selling their standard NIC makes sense for the business. Lastly, don't forget storage vendor qualification: Intel has an uphill battle getting an FCoE software stack onto the approved lists of the major storage vendors.
Full Disclosure: Tech Field Day is organized by the folks at Gestalt IT and paid for by the presenters of the event. My travel, meals and accommodations were paid for by the event, but my opinions, negative or positive, are all mine.