Blades are Not the Future

Kevin Houston, Founder of Blades Made Simple and all-around server and blade rocket surgeon, posted an excellent, thought-provoking article titled ‘Why Blade Servers Will Be the Core of Future Data Centers’ (http://bladesmadesimple.com/2011/10/why-blade-servers-will-be-the-core-of-future-data-centers/).  The article lays out his predictions and thoughts on the direction the server industry will move.  Kevin walks through several stages of blade server evolution he believes could be coming:

  1. Less I/O expansion, basically less switching infrastructure required in the chassis due to increased bandwidth.
  2. More on-board storage options, possibly utilizing the space reclaimed from I/O modules.
  3. External I/O expansion options such as those offered by Aprius and Xsigo.
  4. Going fully modular at the rack level, extending the concept of a blade chassis to rack size and adding shelves of PCIe, storage and servers.

I jokingly replied to him that he’d invented the ‘rack-mount’ server: blades that aren’t in a blade chassis but are inserted into a rack, access external storage in the same rack and have connections to shared resources (PCIe) in that rack.  The reality is Kevin’s vision is closer to a mainframe than a rack-mount.

Overall, while I enjoyed Kevin’s post as a thought experiment, I think his vision of the data center future is way off from where we’re headed.  To start, I don’t think blades are the solution for every problem now.  I’ve previously summarized my thoughts on that, along with some bad Shakespeare prose, in a blog on my friend Thomas Jones’ site: http://niketown588.com/2010/09/08/to-blade-or-not-to-blade/.  The short version: blades aren’t the right tool for every job.

Additionally, I don’t see blades as the long-term future of enterprise and above computing.  I look to the way Microsoft, Google, Amazon and Facebook do their computing as the future: cheap commodity rack-mounts en masse, and I see the industry transitioning that way.  Blades (as we use them today) don’t hold water in that model due to cost, complexity, proprietary nature, etc.  Blades are designed to save space and to be highly available; as we start to build our data centers to scale and design our applications for cloud platforms with reliability built into the software, highly available server hardware becomes irrelevant.  No service is lost when one of the thousands of servers handling Bing search fails; a new server is put in its place and joins the pool of available resources.

If blades, or some transformation of them, were the future, I don’t see it playing out the way Kevin does.  I think Kevin’s end concept is built on a series of shaky assumptions: external I/O appliances and blade chassis storage.

Let’s start with chassis-based storage (i.e., shared storage in the blade chassis).  This is something I’ve never been a fan of because it limits access to the shared disk to a single chassis, meaning 14 blades max… wait, less than 14 blades, because it uses blade slots to provide the disk.  At very small scale this may make sense because you have an ‘all-in-one’ chassis, but the second you outgrow that (oops, my business got bigger) you’re stuck with small silos of data.

The advantage of this approach, however, is low-latency access and high bandwidth across the blade back/mid-plane.  That makes it a more interesting option now that we have lightning-fast SSDs and cache options: you can have extremely high-performance storage within the blade chassis, which provides a lot of options for demanding applications.  In these instances local storage in the chassis will be a big hit, but it will not be the majority of deployments without additional features such as EMC’s ‘Project Lightning’ (http://www.emc.com/about/news/press/2011/20110509-05.htm) to free the trapped data from the confines of the chassis.

Next we have external I/O appliances… These have been on my absolute hate list since the first time I saw them.  Kevin suggests a device based on industry standards, but current versions are fully proprietary and require not only the vendor’s appliance but also the vendor’s cards in either the appliance or the server; this is the first nightmare.  Beyond that, these devices create a single point of failure for the I/O devices of multiple servers, and they sit directly in the I/O path.  You’re basically adding complexity, cost and failure points, and for what?  Let’s look at that:

From Aprius’s perspective: ‘Aprius PCI Express over Ethernet technology extends a server’s internal PCIe bus out over the Ethernet network, enabling groups of servers to access and share network-attached pools of PCIe Express devices, such as flash solid state storage, at PCIe performance levels’ (www.aprius.com).  I’d really like to know how you get ‘PCIe performance levels’ over Ethernet infrastructure???

And from Xsigo: ’In the Xsigo wire-once infrastructure you connect each server to the I/O Director via a single cable. Then connect networks and storage to the I/O Director. You’re now ready to provision Ethernet and Fibre channel connections from servers to data center resource in real time’ (http://xsigo.com/).  Basically you plug all your I/O into their server/appliance and then cable it to your server via InfiniBand or Ethernet.  Why???  You’re adding an in-band device in order to consolidate storage and LAN traffic?  FCoE, NFS, iSCSI, etc. already do that on standards-based 10GE or 40GE, with no in-band appliance.

Kevin mentions this as a way to allow more space in the blades for future memory and processor options.  This makes sense, as HP, IBM, Dell and Sun designs have already run into barriers where the height of their blades restricts processor options.  That’s because their blade form factors were designed years ago and didn’t account for today’s larger processors and heat sinks.  Their only workaround is using two blade slots, which consumes too much space per blade.  Newer blade architectures like Cisco UCS take modern processors into account and don’t have this limitation, so they don’t require I/O offloading to free space.

Lastly, I/O offloading as a whole just stinks to me.  You still have to get the I/O into the server, right?  Which means you’ll still have I/O adapters of some type in the server.  With 40GE to the blade shipping this year, why would you require anything else?  The GPU and cache storage argument?  Sure, go that direction, and then explain to me why you’d want to pull those types of devices off the local PCIe bus and use them remotely, adding latency.

Finally, to end my rant, a rack-size blade enclosure presents a whole lot of lock-in.  You’re at the mercy of the vendor you purchase it from for new hardware and support until it’s fully utilized.  Sounds a lot like the reason we left mainframes for x86 in the first place, doesn’t it?

Thoughts, corrections, comments and sheer hate mail are always appreciated!  What do you think?

Cloud Success Factor: Rethink Application Development

You’ve been driving a perfectly suitable family sedan for the last ten years. It’s highly rated by all the gurus who rate such things; it’s safe, reliable and gets acceptable gas mileage. You’ve never loved it in any way, although you did have a moment of pure capitalist joy when you drove it off the lot, and you’ve never disliked it in any way. Then one day you woke up and, out of the blue, you were bored and needed a change, a big change…

To see the full post visit Network Computing: http://www.networkcomputing.com/private-cloud/231901846

Hypervisors are not the Droids You Seek

Long ago, in a data center far, far away, we as an industry moved away from big iron and onto commodity hardware. That move brought with it many advantages, such as cost and flexibility. The change also brought along with it higher hardware and operating system software failure rates. This change in application stability forced us to change our deployment model and build the siloed application environment: One application, one operating system, one server….

To see the full post visit Network Computing: http://www.networkcomputing.com/private-cloud/231901662

Server Networking With gen 2 UCS Hardware

** this post has been slightly edited thanks to feedback from Sean McGee**

In previous posts I’ve outlined:

If you’re not familiar with UCS networking, I suggest you start with those for background.  This post is an update to those, focused on UCS B-Series server to Fabric Interconnect communication using the new hardware options announced at Cisco Live 2011.  First, a recap of the new hardware:

The UCS 6248UP Fabric Interconnect

The 6248 is a 1RU model that provides 48 universal ports (1G/10G Ethernet or 1/2/4/8G FC).  This provides 20 additional ports over the 6120 in the same 1RU form factor.  Additionally, the 6248 has lower latency: 2.0us, down from 3.2us previously.

The UCS 2208XP I/O Module

The 2208 doubles the uplink bandwidth per I/O module, providing a total of 160Gbps of throughput per 8-blade chassis.  It quadruples the number of internal 10G connections to the blades, allowing for 80Gbps per half-width blade.

UCS 1280 VIC

The 1280 VIC provides 8x10GE ports total, 4x to each IOM, for a total of 80Gbps per half-width slot (160Gbps with 2x in a full-width blade).  It also doubles the VIF count of the previous VIC, allowing for 256 (theoretical) vNICs or vHBAs.  The new VIC also supports port-channeling to the UCS 2208 IOM and iSCSI boot.
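
For anyone who likes to sanity-check the math, here’s a quick back-of-the-envelope sketch (plain Python, purely illustrative) showing where the 160Gbps-per-chassis and 80Gbps-per-half-width-blade figures above come from.

```python
# Back-of-the-envelope math for the gen 2 bandwidth figures quoted above.
# Purely illustrative; the inputs are the specs listed in this post.

LINK_GBPS = 10           # every uplink / backplane trace is 10GE
IOMS_PER_CHASSIS = 2     # redundant A and B fabrics

# UCS 2208XP: 8 fabric uplinks per IOM
uplinks_per_iom = 8
chassis_uplink_bw = IOMS_PER_CHASSIS * uplinks_per_iom * LINK_GBPS
print(chassis_uplink_bw)      # 160 -> Gbps total uplink throughput per chassis

# UCS 2208XP: 4 internal 10G connections per half-width slot, per IOM
internal_links_per_slot = 4
half_width_bw = IOMS_PER_CHASSIS * internal_links_per_slot * LINK_GBPS
print(half_width_bw)          # 80 -> Gbps potential per half-width blade (1280 VIC)

# A full-width blade with 2x 1280 VICs doubles that
print(2 * half_width_bw)      # 160 -> Gbps potential per full-width blade
```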

The other addition that affects this conversation is the ability to port-channel the uplinks from the 2208 IOM, which could not be done before (each link on a 2104 IOM operated independently).  All of the new hardware is backward compatible with all existing UCS hardware.  For more detailed information on the hardware and software announcements visit Sean McGee’s blog, where I stole these graphics: http://www.mseanmcgee.com/2011/07/ucs-2-0-cisco-stacks-the-deck-in-las-vegas/.

Let’s start by discussing the connectivity options from the Fabric Interconnects to the IOMs in the chassis focusing on all gen 2 hardware.

There are two modes of operation for the IOM uplinks: Discrete (non-bundled) and Port-Channel (bundled).  In either mode it is possible to configure 1, 2, 4, or 8 uplinks from each IOM.

[Image: UCS 2208 Fabric Interconnect Failover]

Discrete Mode:

In discrete mode a static pinning mechanism is used, mapping each blade to a given port depending on the number of uplinks used.  This means that each blade will have an assigned uplink on each IOM for inbound and outbound traffic.  In this mode, if a link failure occurs the blade will not ‘re-pin’ on the side of the failure but will instead rely on NIC teaming/bonding or Fabric Failover to fail over to the redundant IOM/fabric.  The pinning behavior is as follows, with the exception of 1 uplink (not shown), in which all blades use the only available port:

2 Uplinks: [image: blade-to-port pinning table for 2 active uplinks per IOM]

4 Uplinks: [image: blade-to-port pinning table for 4 active uplinks per IOM]

8 Uplinks: [image: blade-to-port pinning table for 8 active uplinks per IOM]
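
The pinning tables above are images in the original post, but the pattern is easy to express in a few lines.  Here’s a minimal sketch, assuming the slot-number-modulo-uplink-count pattern (which is consistent with the 8-uplink failover example below, where blade 3 rides port 3); treat the official UCS documentation as the authoritative source for the exact mapping.

```python
def discrete_mode_pin(blade_slot: int, active_uplinks: int) -> int:
    """Return the IOM uplink port a blade slot is statically pinned to in
    discrete mode.

    Assumes a slot-modulo-uplink pattern (an illustration of the tables above,
    not pulled from UCS Manager): with 1 uplink every slot uses port 1, with
    2 uplinks odd slots use port 1 and even slots use port 2, and so on up to
    8 uplinks where slot N uses port N.
    """
    if active_uplinks not in (1, 2, 4, 8):
        raise ValueError("discrete mode supports 1, 2, 4 or 8 uplinks")
    return ((blade_slot - 1) % active_uplinks) + 1

# Both IOMs use the same mapping, so blade 3 rides port 3 on fabric A and
# port 3 on fabric B when all 8 uplinks are cabled.
for uplinks in (2, 4, 8):
    print(uplinks, [discrete_mode_pin(slot, uplinks) for slot in range(1, 9)])
# 2 [1, 2, 1, 2, 1, 2, 1, 2]
# 4 [1, 2, 3, 4, 1, 2, 3, 4]
# 8 [1, 2, 3, 4, 5, 6, 7, 8]
```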

The same port-pinning is used on both IOMs, so in a redundant configuration each blade will be uplinked via the same port number on separate IOMs to redundant fabrics.  The draw of discrete mode is that bandwidth is predictable in link-failure scenarios.  If a link fails on one IOM, that server fails over to the other fabric rather than adding an additional bandwidth draw on the remaining active links on the failed side.  In summary, it forces NIC teaming/bonding or Fabric Failover to handle failure events rather than network-based load-balancing.  The following diagram depicts the failover behavior for server three in an 8-uplink scenario.

[Image: Discrete Mode Failover]

In the previous diagram, port 3 on IOM A has failed.  With the system in discrete mode, NIC teaming/bonding or Fabric Failover handles failover to the secondary path on IOM B (the same port, 3, based on static pinning).

Port-Channel Mode:

In Port-Channel mode all available links are bonded, and a port-channel hashing algorithm (TCP/UDP port + VLAN, non-configurable) is used for load-balancing server traffic.  In this mode all server links are still ‘pinned’, but they are pinned to the logical bundle rather than to individual IOM uplinks.  The following diagram depicts this mode.

[Image: Port-Channel Mode]

In this scenario, when a port fails on an IOM the port-channel load-balancing algorithm handles failing the server traffic flows over to another available port in the channel.  This failover will typically be faster than NIC teaming/bonding failover.  It will decrease the potential throughput for all flows on the side with a failure, but will only affect performance if the links are saturated.  The following diagram depicts this behavior.

[Image: Port-Channel Mode link failure]

In the diagram above, Blade 3 was pinned to Port 1 on the A side.  When port 1 failed, port 4 was selected (depicted in green), while fabric B port 6 is still active, leaving a potential of 20Gbps.

Note: The actual ports used will vary depending on port-channel load-balancing; these are for example purposes only.

As you can see, port-channel mode enables additional redundancy and potential per-server bandwidth because it leaves two paths open.  In high-utilization situations where the links are fully saturated, a failure will degrade throughput for all blades on the side experiencing it.  This is not necessarily a bad thing (it happens with all port-channel mechanisms), but it is a design consideration.  Additionally, port-channeling in all forms can only provide the bandwidth of a single link per flow (think of a flow as a conversation).  This means that each flow can only utilize 10Gbps max even though 8x10Gbps links are bundled.  For example, a single FTP transfer would max out at 10Gbps, while 8x FTP transfers could potentially use 80Gbps (10 per link), depending on load-balancing.
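
To make the ‘pinned to the bundle’ and single-link-per-flow points concrete, here’s a small illustrative sketch.  The real hash is a fixed function of TCP/UDP port and VLAN baked into the hardware; the stand-in hash below just shows how flows spread across member links and get re-hashed when a link drops.

```python
# Illustrative only: the actual IOM hash (TCP/UDP port + VLAN) is fixed in
# hardware and not configurable; this stand-in just demonstrates the behavior.
from zlib import crc32

def member_link(flow, active_links):
    """Hash a flow (src port, dst port, VLAN) onto one active member link."""
    key = "{}:{}:{}".format(*flow).encode()
    return active_links[crc32(key) % len(active_links)]

links = list(range(1, 9))            # 8x10G uplinks bundled on IOM A
ftp_flow = (51000, 21, 100)          # one conversation = one member link = 10Gbps max

pinned = member_link(ftp_flow, links)
print(pinned)                        # the single link this flow rides

# If that link fails, the hash simply lands the flow on a surviving member;
# other flows sharing the failed link move as well, with no NIC teaming event.
surviving = [l for l in links if l != pinned]
print(member_link(ftp_flow, surviving))
```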

Next let’s discuss server-to-IOM connectivity (yes, I use ‘discuss’ to describe me monologuing in print, get over it, and yes, I know ‘monologuing’ isn’t a word).  I’ll focus on the new UCS 1280 VIC because all other current cards maintain the same connectivity.  The following diagram depicts the 1280 VIC connectivity.

[Image: UCS 1280 VIC connectivity]

The 1280 VIC utilizes 4x10Gbps links across the mid-plane per IOM to form two 40Gbps port-channels.  This provides 80Gbps of total potential throughput per card, meaning a half-width blade has a total potential of 80Gbps using this card and a full-width blade can receive 160Gbps (of course this is dependent upon design).  As with any port-channel, link bonding, trunking or whatever you may call it, any flow (conversation) can only utilize a maximum of one physical link (or backplane trace) of bandwidth.  This means every flow from any given UCS server has a max potential bandwidth of 10Gbps, but with 8 total uplinks, 8 different flows could potentially utilize 80Gbps.

This becomes very important with things like NFS-based storage within hypervisors.  Typically a virtualization hypervisor handles storage connectivity for all VMs, which means that only one flow (conversation) occurs between host and storage.  In these typical configurations only 10Gbps will be available for all VM NFS data traffic even though the host may have a potential 80Gbps of bandwidth.  Again, this is not necessarily a concern, but it is a design consideration, as most current and near-future hosts will never use more than 10Gbps of storage I/O.
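
A rough way to frame that design consideration, using the same numbers as above (illustrative only):

```python
def usable_gbps(flows, link_gbps=10, total_links=8):
    """Rough throughput ceiling: each flow tops out at one link's worth of
    bandwidth, and the host tops out at the sum of its links (80Gbps for a
    half-width blade with a 1280 VIC)."""
    return min(flows * link_gbps, total_links * link_gbps)

print(usable_gbps(1))   # 10 -> a single hypervisor-to-NFS-array conversation
print(usable_gbps(8))   # 80 -> eight independent flows could fill the card
```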

Summary:

The new UCS hardware packs a major punch when it comes to bandwidth, port density and failover options.  That being said, it’s important to understand the frame flow, port usage and potential bandwidth in order to properly design solutions for maximum efficiency.  As always, comments, complaints and corrections are quite welcome!

Choosing The Right Private Cloud Storage

One of the key decisions in architecting an infrastructure for private cloud is selecting a storage platform for the deployment. Storage is a key component of the infrastructure and will play a major role in the overall performance of the private cloud. The storage decision carries additional weight due to its larger investment and typically longer refresh-cycle…..

To view the full article visit Network Computing: http://www.networkcomputing.com/private-cloud/231901384

WWT’s Geek Day 2012

A BrightTALK Channel

Build For IT Nirvana

In many data centers, large and small, there is a history of making short-term decisions that affect long-term design. These may be based on putting out immediate fires, such as rolling out a new application, expanding an old one, or replacing failed hardware. They may also be driven by short-sighted or near-sighted policies, or more commonly old policies that aren’t questioned in the light of new technology. These types of decisions can range from costly to crippling for data center operations…

To see the full post visit NetworkComputing: http://www.networkcomputing.com/private-cloud/231700329
