UCS Server Failover

I spent the day today with a customer doing a proof of concept and failover testing demo on a Cisco UCS, VMware and NetApp environment.Â As I sit on the train heading back to Washington from NYC I thought it might be a good time to put together a technical post on the failover behavior of UCS blades.Â UCS has some advanced availability features that should be highlighted, it additionally has some areas where failover behavior may not be obvious.Â In this post Iâ€™m going to cover server failover situations within the UCS system, without heading very deep into the connections upstream to the network aggregation layer (mainly because Iâ€™m hoping Brad Hedlund at http://bradhedlund.com will cover that soon, hurry up Brad 😉

**Update** Brad has posted his UCS Networking Best Practices Post I was hinting at above.Â It's a fantastic video blog in HD, check it out here: http://bradhedlund.com/2010/06/22/cisco-ucs-networking-best-practices/

To start this off letâ€™s get up to a baseline level of understanding on how UCS moves server traffic.Â UCS is comprised of a number of blade chassis and a pair of Fabric Interconnects (FI.)Â The blade chassis hold the blade servers and the FIs handle all of the LAN and SAN switching as well as chassis/blade management that is typically done using six separate modules in each blade chassis in other implementations.

Note: When running redundant Fabric interconnects you must configure them as a cluster using L1 and L2 cluster links between each FI.Â These ports carry only cluster heartbeat and high-level system messages no data traffic or Ethernet protocols and therefore I have not included them in the following diagrams.

UCS Network Connectivity

Each individual blade gets connectivity to the network(s) via mezzanine form factor I/O card(s.)Â Depending on which blade type you selectÂ each blade will either have one redundant set of connections to the FIs or two redundant sets.Â Regardless of the type of I/O card you select you will always have 1x10GE connection to each FI through the blade chassis I/O module (IOM.)

UCS Blade Connectivity

In the diagram your seeing the blade connectivity for a blade with a single mezzanine slot.Â You can see that the blade is redundantly connected to both Fabric A and Fabric B via 2x10GE links.Â This connection occurs via the IOM which is not a switch itself and instead acts as a remote device managed by the fabric interconnect. What this means is that all forwarding decisions are handled by the FIs and frames are consistently scheduled within the system regardless of source and or destination.Â The total switching latency of the UCS system is approximately equal to a top-of-rack switch or blade form factor LAN switch within other blade products.Â Because the IOM is not making switching decisions it will need another method to move 8 internal mid-plane ports traffic upstream using itâ€™s 4 available uplinks.Â the method it uses is static pinning.Â This method provides a very elegant switching behavior with extremely predictable failover scenarios. Letâ€™s first look at the pinning later what this means for the UCS network failures.

Static Pinning

The chart above shows the static pinning mechanism used within UCS.Â Given the configured number of uplinks from IOM to FI you will know exactly which uplink port a particular mid-plane port is using.Â Each half-width blade attaches to a single mid-plane port and each full width blade attaches to two.Â In the diagram the use of three ports does not have a pinning mechanism because this is not supported.Â If three links are used the 2 port method will define how uplinks are utilized.Â This is because eight devices cannot be evenly load-balanced across three links.

IOM Connectivity

The example above shows the numbering of mid-plane ports.Â If you were using half width blades their numbering would match.Â When using full-width blades each blade has access to a pair of mid-plane ports (1-2, 3-4, 5-6, 7-8.)In the example above blade three would utilize mid-plane port three in the left example and one in the second based on the static pinning in the chart.

So now letâ€™s discuss how failover happens, starting at the operating system.Â We have two pieces of failover to discuss, NIC teaming, and SAN multi-pathing.Â In order to understand that we need a simple logical connectivity view of how a UCS blade seeâ€™s the world.

UCS Logical Connectivity

In order to simplify your thinking when working with blade systems reduce your logical diagram to the key components, do this by removing the blade chassis itself from the picture.Â Remember that a blade is nothing more than a server connected to a set of switches, the only difference is that the first hop link is on the mid-plane of the chassis rather than a cable.Â The diagram above shows that a UCS blade is logically cabled directly to redundant Storage Area Network (SAN) switches for Fibre Channel (FC) and to the FI for Ethernet.Â Out of personal preference I leave the FIs out of the SAN side of the diagram because they operate in N_Port Virtualizer (NPV) mode which means forwarding decisions are handled by the upstream NPiV standard compliant SAN switch.

Starting at the Operating System (OS) we will work up the network stack to the FIs to discuss failover.Â We will be assuming FCoE is being used, if you are not using FCoE ignore the FC piece of the discussion as the Ethernet will remain the same.

SAN Multi-Pathing:

SAN multi-pathing is the way we obtain redundancy in FC, FCoE, and iSCSI networks.Â It is used to provide the OS with two separate paths to the same logical disk.Â This allows the server to access the data in the event of a failure and in some cases load-balance traffic across two paths to the same disk.Â Multi-pathing comes in two general flavors: active/active, or active passive.Â Active/active load balances and has the potential to use the full bandwidth of all available paths.Â Active/Passive uses one link as a primary and reserves the others for failover.Â Typically the deciding factor is cost vs. performance.

Multi-pathing is handled by software residing in the OS usually provided by the storage vendor.Â The software will monitor the entire path to the disk ensuring data can be written and/or read from the disk via that path.Â Any failure in the path will cause a multi-pathing failover.

Multi-Pathing Failure Detection

Any of the failures designated by the Xâ€™s in the diagram above will trigger failover, this also includes failure of the storage controller itself which are typically redundant in an enterprise class array.Â SAN multi-pathing is an end-to-end failure detection system.Â This is much easier to implement in SAN as there is one constant target as opposed to a LAN where data may be sent to several different targets across the LAN and WAN.Â Within UCS SAN multi-pathing does not change from the system used for standalone servers.Â Each blade is redundantly connected and any path failure will trigger a failover.

NIC-Teaming:

NIC teaming is handled in one of three general ways: active/active load-balancing, active/passive failover, or active/active transmit with active/passive receive.Â The teaming type you use is dependant on the network configuration.

Supported teaming Configurations

In the diagram above we see two network configurations, one with a server dual connected to two switches, and a second with a server dual connected to a single switch using a bonded link.Â Bonded links act as a single logical link with the redundancy of the physical links within.Â Active/Active load-balancing is only supported using a bonded link due to MAC address forwarding decisions of the upstream switch.Â In order to load balance an active/active team will share a logical MAC address, this will cause instability upstream and lost packets if the upstream switches donâ€™t see both links as a single logical link.Â This bonding is typically done using the Link Aggregation Control Protocol (LACP) standard.

If you glance back up at the UCS logical connectivity diagram youâ€™ll see that UCS blades are connected in the method on the left of the teaming diagram.Â This means that our options for NIC teaming are Active/Passive failover and Active Active transmit only.Â This is assuming a bare metal OS such as Windows or Linux installed directly on the hardware, when using virtualized environments such as VMware all links can be actively used for transmit and receive because there is another layer of switching occurring in the hypervisor.Â

I typically get feedback that the lack of active/active NIC teaming on UCS bare metal blades is a limitation.Â In reality this is not the case.Â Remember Active/Active NIC teaming was traditionally used on 1GE networks to provide greater than 1GE of bandwidth.Â This was limited to a max of 8 aggregated links for a total of 8GE of bandwidth.Â A single UCS link at 10GE provides 20% more bandwidth than an 8 port active/active team.

NIC teaming like SAN multi-pathing relies on software in the OS, but unlike SAN multi-pathing it typically only detects link failures and in some cases loss of a gateway.Â Due to the nature of the UCS system NIC teaming in UCS will detect failures of the mid-plane path, the IOM, the utilized link from the IOM to the Fabric Interconnect or the FI itself.Â This is because the IOM is a linecard of the FI and the blade is logically connected directly to the FI.

UCS Hardware Failover:

UCS has a unique feature on several of the available mezzanine cards to provide hardware failure detection and failover on the card itself.Â Basically some of the mezzanine cards have a mini-switch built in with the ability to fail path A to path B or vice versa.Â This provides additional failure functionality and improved bandwidth/failure management.Â This feature is available on Generation I Converged network Adapters (CNA) and the Virtual Interface Card (VIC) and is currently only available in UCS blades.

UCS Hardware Failover

UCS Hardware failover will provide greater failure visibility than traditional NIC teaming due to advanced intelligence built into the FI as well as the overall architecture of the system.Â In the diagram above HW failover detects: mid-plane path, IOM and IOM uplink failures as link failures due to the architecture.Â Additionally if the FI loses itâ€™s upstream network connectivity to the LAN it will signal a failure to the mezzanine card triggering failure.Â In the diagram above any failure at a point designated by an X will trigger the mezzanine card to divert Ethernet traffic to the B path.Â UCS hardware failover applies only to Ethernet traffic as SAN networks are built as redundant independent networks and would not support this failover method.

Using UCS hardware failover provides two key advantages over other architectures:

Allows redundancy for NIC ports in separate subnets/VLANs which NIC teaming cannot do.
Provides the ability for network teams to define the failure capabilities and primary path for servers alleviating misconfigurations caused by improper NIC teaming settings.

IOM Link Failure:

The next piece of UCS server failover involves the I/O modules themselves.Â Each I/O module has a maximum of four 10GE uplinks providing 8x10GE mid-plane connections to the blades at an oversubscription of 1:1 to 8:1 depending on configuration.Â As stated above UCS uses a static non-configurable pinning mechanism to assign a mid-plane port to a specific uplink from the IOM to the FI.Â Using this pinning system allows the IOM to operate as an extension of the FI without the need for Spanning Tree Protocol (STP) within the UCS system.Â Additionally this system provides a very clear network design for designing oversubscription in both nominal and failure situations.

For the discussion of IOM failover we will use an example of a max configuration of 8 half-width blades and 4 uplinks on each redundant IOM.

Fully Configured 8 Blade UCS Chassis

In this diagram each blade is currently redundantly connected via 2x10GE links.Â One link through each IOM to each FI.Â Both IOMs and FIs operate in an active/active fashion from a switching perspective so each blade in this scenario has a potential bandwidth of 20GE depending on the operating system configuration.Â The overall blade chassis is configured with 2:1 oversubscription in this diagram as each IOM is using its max of 4x10GE uplinks while providing its max of 8x10GE mid-plane links for the 8 blades.Â If each blade were to attempt to push a sustained 20GE of throughput at the same time (very unlikely scenario) it would receive only 10GE because of this oversubscription.Â The bandwidth can be finely tuned to ensure proper performance in congestion scenarios such as this one using Quality of Service (QoS) and Enhanced Transmission Selection (ETS) within the UCS system.

In the event that a link fails between the IOM and the FI the servers pinned to that link will no longer have a path to that FI.Â The blade will still have a path to the redundant FI and will rely on SAN multi-pathing, NIC teaming and or UCS hardware failover to detect the failure and divert traffic to the active link.

For example if link one on IOM A fails blades one and five would lose connectivity through Fabric A and any traffic using that path would fail to link one on Fabric B ensuring the blade was still able to send and receive data.Â When link one on IOM A was repaired or replaced data traffic would immediately be able to start using the A path again.

IOM A will not automatically divert traffic from Blade one and five to an operational link, nor is this possible through a manual process.Â The reason for this is that diverting blade one and fives traffic to available links would further oversubscribe those links and degrade servers that should be unaffected by the failure of link one.Â In a real world data center a failed link will be quickly replaced and the only servers that will have been affected are blade one and five.Â

In the event that the link cannot be repaired quickly there is a manual process called re-acknowledgement which an administrator can perform.Â This process will adjust the pinning of IOM to FI links based on the number of active links using the same static pinning referenced above.Â In the above example servers would be re-pinned based on two active ports because three port configurations are not supported.Â

Overall this failure method and static pinning mechanism provides very predictable bandwidth management as well as limiting the scope of impact for link failures.

Summary:

The UCS system architecture is uniquely designed to minimize management points and maximize link utilization by removing dependence on STP internally.Â Because of its unique design network failure scenarios must be clearly understood in order to maximize the benefits provided by UCS.Â The advanced failure management tools within UCS will provide for increased application uptime and application throughput in failure scenarios if properly designed.

Building a Private Cloud

Private clouds are currently one of the most popular concepts in cloud computing. They promise the flexibility of cloud infrastructures without sacrificing the control of owning and maintaining your own data center. For a definition of cloud architectures see my previous blog on Cloud Types (http://www.definethecloud.net/?p=109.)

Private clouds are an architecture that is owned by an individual company typically for internal use. in order to be considered a true cloud architecture it must layer automation and orchestration over robust scalable architectures. The intent of private clouds is the ability to have an infrastructure that reacts fluidly to business changes by scaling up and scaling down as applications and requirements change. Typically consolidation and virtualization are the foundation of these architectures and advanced management, monitoring and automation systems are layered on top. In some cases this can be taken a step further by loading cloud management software suites on the underlying infrastructure to provide an internal self service Software as a Service (SaaS) or Platform as a Service (PaaS) environment.

Private cloud architectures provide the additional benefit of being an excellent way to test the ability for a company to migrate to public cloud architecture. Additionally if designed correctly private clouds also act as a migration step to public clouds by migrating applications onto cloud based platforms without exporting them to a cloud service host. Private clouds can also be used in conjunction with public clouds in order to leverage public cloud resources for extra capacity, failover, or disaster recovery purposes. This use is known as a hybrid cloud.

Private cloud architectures can be done in a roll-your-own fashion, selecting best of breed hardware, software, and services to build the appropriate architecture. This can maximize the reuse of existing equipment while providing a custom tailored solution to achieve specific goals. The drawback with roll-your-own solutions is that it requires extensive in-house knowledge in order to architect the solution properly.

A more common practice for migration into private clouds is to use packaged solutions offered by the major IT vendors, companies like IBM, Sun, Cisco, and HP have announced cloud or cloud-like architecture solutions and initiatives. These provide a more tightly coupled solution, and in some cases a single point of contact and expertise on the complete solution. These types of solutions can expedite your migration to the private cloud.

When selecting the hardware and software for private cloud infrastructures ensure you do your homework. Work with a trusted integrator or reseller with expertise in the area, gather multiple vendor proposals and read the fine print. These solutions are not all created equal. Some of the offered solutions are no more than vaporware and a good number are just repackaging of old junk in a shiny new part number. Some will support a staged migration and others will require rip-and-replace or at least a new build out.

There are several key factors I would focus on when selecting a solution:

Compatibility and Support:

Tested compatibility and simplified support are key factors that should be considered when choosing a solution. If you use products from multiple vendors that donâ€™t work together youâ€™ll need to tie the support pieces together in-house and may need to open and maintain several support tickets when things go awry. Additionally if compatibility hasn't been tested or support isnâ€™t in place for a specific configuration you may be up a creek without a paddle when something comes up.

Flexibility vs. Guaranteed Performance:

Some of the available solutions are very strict on hardware types and quantities but in return provide performance guarantees that have been thoroughly tested. This is a trade off that must be considered.

Hardware subcomponents of the solution:

Building a private cloud is a large commitment to both architectural and organizational changes. Real Return On Investment (ROI) wonâ€™t be seen without both. When making that kind of investment you donâ€™t want to end up with a subpar component of your infrastructure (software or hardware) because your vendor tried to bundle their best of breed X and best of breed Y with their so so Z. Getting everything under one corporate logo has its pros and cons.

Hardware Virtualization and Abstraction:

A great statement Iâ€™ve heard about cloud computing was that when defining it if you start talking about hardware youâ€™re already wrong (I donâ€™t remember the source so if you know it please comment.) This is because cloud is more about the process and people than the equipment. When choosing hardware/software for private cloud keep this in mind. You donâ€™t want to end up with a private cloud that canâ€™t flex because your software and process is tied to the architecture or equipment underneath.

Summary:

Private cloud architectures provide a fantastic set of tools to regain control of the data center and turn it back into a competitive advantage rather than a hole to throw money into. Many options and technologies exist to accelerate your journey to private cloud but they must be carefully assessed. If you donâ€™t have the in-house expertise but are serious about cloud there are lots of consultant and integrator options out there to help walk you through the process.

FCoE multi-hop; Do you Care?

There is a lot of discussion in the industry around FCoEâ€™s current capabilities, and specifically around the ability to perform multi-hop transmission of FCoE frames and the standards required to do so. A recent discussion between Brad Hedlund at Cisco and Ken Henault at HP (http://bit.ly/9Kj7zP) prompted me to write this post. Ken proposes that FCoE is not quite ready and Brad argues that it is.

When looking at this discussion remember that Cisco has had FCoE products shipping for about 2 years, and has a robust product line of devices with FCoE support including: UCS, Nexus 5000, Nexus 4000 and Nexus 2000, with more products on the road map for launch this year. No other switching vendor has this level of current commitment to FCoE. For any vendor with a less robust FCoE portfolio it makes no sense to drive FCoE sales and marketing at this point and so you will typically find articles and blogs like the one mentioned above. The one quote from that blog that sticks out in my mind is:

â€œSolutions like HPâ€™s upcoming FlexFabric can take advantage of FCoE to reduce complexity at the network edge, without requiring a major network upgrades or changes to the LAN and SAN before the standards are finalized.â€

If you read between the lines here it would be easy to take this as â€˜FCoE isnâ€™t ready until we are.â€™ This is not unusual and if you take a minute to search through articles about FCoE over the last 2-3 years youâ€™ll find that Cisco has been a big endorser of the protocol throughout (because they actually had a product to sell) and other vendors become less and less anti-FCoE as they announce FCoE products.

Itâ€™s also important to note that Cisco isnâ€™t the only vendor out there embracing FCoE: NetApp has been shipping native FCoE storage controllers for some time, EMC has them road mapped for the very near future, Qlogic is shipping a 2nd generation of Converged Network adapter, and Emulex has fully embraced 10Gig Ethernet as the way forward with their OneConnect adapter (10GE, iSCSI, FCoE all in one card.) Additionally support for FCoE switching of native Fibre Channel storage is widely supported by the storage community.

Fibre Channel over Ethernet (FCoE) is defined in IEEE FC-BB5 and requires the switches it traverses to support the IEEE Data Center Bridging (DCB)standards for proper traffic treatment on the network. For more information on FCoE or DCB see my previous posts on the subjects (FCoE: http://www.definethecloud.net/?p=80, DCB: http://www.definethecloud.net/?p=31.)

DCB Has four major components, and the one in question in the above article is Quantized Congestion Notification (QCN) which the article states is required for multi-hop FCoE. QCN is basically a regurgitation of FECN and BECN from frame relay. It allows a switch to monitor itâ€™s buffers and push congestion to the edge rather than clog the core. In the comments Brad correctly states that QCN is not required for FCoE, the reason for this is that Fibre Channel operates today without any native version of QCN, therefore when placing it on Ethernet you will not need to add functionality that wasnâ€™t there to begin with, remember Ethernet is just a new layer 1-2 for native FC layers 2-4, the FC secret sauce remains unmodified. Remember that not every standard defined by a standards body has to be adhered to by every device, some are required, some are optional. Logical SANs are a great example of an optional standard.

Rather than discuss what is or isnâ€™t required for multi-hop FCoE Iâ€™d like to ask a more important question that we as engineers tend to forget: Do I care? This question is key because it avoids having us argue the technical merits of something we may never actually need, or may not have a need for today.

Do we care?

First letâ€™s look at why we do multi-hop anything: to expand the port-count of our network. Take TCP/IP networks and the internet for example, we require the ability to move packets across the globe through multiple routers (hops.) This is in order to attach devices on all corners of the globe.

Now letâ€™s look at what we do with FC today: typically one or two hop networks (sometimes three) used to connect several hundred devices (occasionally but rarely more.) Itâ€™s actually quite common to find FC implementations with less than 100 attached ports. This means that if you can hit the right port count without multiple hops you can remove complexity and decrease latency, in Storage Area Networks (SAN) we call this the collapsed core design.

The second thing to consider is a hypothetical question: If FCoE were permanently destined for single hop access/edge only deployments (it isnâ€™t) should that actually stop you from using it? The answer here is an emphatic no, I would still highly recommend FCoE as an access/edge architecture even if it were destined to connect back to an FC SAN and Ethernet LAN for all eternity. Letâ€™s jump to some diagrams to explain. In the following diagrams Iâ€™m going to focus on Cisco architecture because as stated above they are currently the only vendor with a full FCoE product portfolio.

In the above diagram you can see a fairly dynamic set of FCoE connectivity options. Nexus 5000 can be directly connected to servers, or to Nexus 4000 in IBM BladeCenter to pass FCoE. It can also be connected to 10GE Nexus 2000s to increase its port density.

To use the nexus 5000 + 2000 as an example itâ€™s possible to create a single-hop (2000 isnâ€™t an L2 hop it is an extension of the 5000) FCoE architecture of up to 384 ports with one point of switching management per fabric. If you take server virtualization into the picture and assume 384 servers with a very modest V2P ratio of 10 virtual machines to 1 physical machine that brings you to 3840 servers connected to a single hop SAN. That is major scalability with minimal management all without the need for multi-hop. The diagram above doesnâ€™t include the Cisco UCS product portfolio which architecturally supports up to 320 FCoE connected servers/blades.

The next thing Iâ€™ve asked you to think about is whether or not you should implement FCoE in a hypothetical world where FCoE stays an access/edge architecture forever. The answer would be yes. In the following diagrams I outline the benefits of FCoE as an edge only architecture.

The first benefit is reducing the networks that are purchased, managed, power, and cooled from 3 to 1 (2 FC and 1 Eth to 1 FCoE.) Even just at the access layer this is a large reduction in overhead and reduces the refresh points as I/O demands increase.

The second benefit is the overall infrastructure reduction at the access layer. Taking a typical VMware server as an example we reduce 6x 1GE ports, 2x 4GFC ports and the 8 cables required for them to 2x 10GE ports carrying FCoE. This increases total bandwidth available while greatly reducing infrastructure. Donâ€™t forget the 4 top-of-rack switches (2x FC, 2x GE) reduced to 2 FCoE switches.

Since FCoE is fully compatible with both FC and pre-DCB Ethernet this requires 0 rip-and-replace of current infrastructure. FCoE is instead used to build out new application environments or expand existing environments while minimizing infrastructure and complexity.

What if I need a larger FCoE environment?

If you require a larger environment than is currently supported extending your SAN is quite possible without multi-hop FCoE. FCoE can be extended using existing FC infrastructure. Remember customers that require an FCoE infrastructure this large already have an FC infrastructure to work with.

What if I need to extend my SAN between data centers?

FCoE SAN extension is handled in the exact same way as FC SAN extension, CWDM, DWDM, Dark Fiber, or FCIP. Remember weâ€™re still moving Fibre Channel frames.

Summary:

FCoE multi-hop is not an argument that needs to be had for most current environments. FCoE is a supplemental technology to current Fibre Channel implementations. Multi-hop FCoE will be available by the end of CY2010 allowing 2+ tier FCoE networks with multiple switches in the path, but there is no need to wait for them to begin deploying FCoE. The benefits of an FCoE deployment at the access layer only are significant, and many environments will be able to scale to full FCoE roll-outs without ever going mutli-hop.

The Cloud Storage Argument

The argument over the right type of storage for data center applications is an ongoing battle. This argument gets amplified when discussing cloud architectures both private and public. Part of the reason for this disparity in thinking is that there is no â€˜one size fits all solution.â€™ The other part of the problem is that there may not be a current right solution at all.

When we discuss modern enterprise data center storage options there are typically five major choices:

Fibre Channel (FC)
Fibre Channel over Ethernet (FCoE)
Internet Small Computer System Interface (iSCSI)
Network File System (NFS)
Direct Attached Storage (DAS)

In a Windows server environment these will typically be coupled with Common internet File Service (CIFS) for file sharing. Behind these protocols there are a series of storage arrays and disk types that be used to meet the applications I/O requirements.

As people move from traditional server architectures to virtualized servers, and from static physical silos to cloud based architectures they will typically move away from DAS into one of the other protocols listed above to gain the advantages, features and savings associated with shared storage. For the purpose of this discussion we will focus on these four: FC, FCoE, iSCSI, NFS.

The issue then becomes which storage protocol to use for transport of your data from the server to the disk? Iâ€™ve discussed the protocol differences in a previous post (http://www.definethecloud.net/?p=43) so I wonâ€™t go into the details here. Depending on who youâ€™re talking to itâ€™s not uncommon to find extremely passionate opinions. There a quite a few consultants and engineers that are hard coded to one protocol or another. That being said most end-users just want something that works, performs adequately and isnâ€™t a headache to manage.

Most environments currently work on a combination of these protocols, plenty of FC data centers rely on DAS to boot the operating system and NFS/CIFS for file sharing. The same can be said for iSCSI. With current options a combination of these protocols is probably always going to be best, iSCSI, FCoE, and NFS/CIFS can be used side by side to provide the right performance at the right price on an application by application basis.

The one definite fact in all of the opinions is that running separate parallel networks as we do today with FC and Ethernet is not the way to move forward, it adds cost, complexity, management, power, cooling and infrastructure that isnâ€™t needed. Combining protocols down to one wire is key to the flexibility and cost savings promised by end-to-end virtualization and cloud architectures. If thatâ€™s the case which wire do we choose, and which protocol rides directly on top to transport the rest?

10 Gigabit Ethernet is currently the industries push for a single wire and with good reason:

Itâ€™s currently got enough bandwidth/throughput to do it (10gigabits using 64b/66b encoding as opposed to FC/Infiniband which currently use 8b/10b with 20% overhead)
Itâ€™s scaling fast 40GE and 100GE are well on their way to standardization (As opposed to 16G and 32G FC)
Everyone already knows and uses it, yes that includes you.

For the sake of argument letâ€™s assume we all agree on 10GE as the right wire/protocol to carry all of our traffic, what do we layer on top? FCoE, iSCSI, NFS, something else? Well that is a tough question. the first part of the answer is you donâ€™t have to decide, this is very important because none of these protocols is mutually exclusive. The second part of the answer is, maybe none of these is the end-all-be-all long-term solution. Each current protocol has benefits and draw backs so letâ€™s take a quick look:

iSCSI: Block level protocol carrying SCSI over IP. Works with standard Ethernet but can have performance issues on congested networks, also incurs IP protocol overhead. iSCSI is great on standard Ethernet networks until congestion occurs, once the network becomes fully utilized iSCSI performance will tend to drop.
FCoE: Block level protocol which maintains Fibre Channel reliability and security while using underlying Ethernet. Requires 10GE or above and DCB (http://www.definethecloud.net/?p=31) capable switches. FCoE is currently well proven and reliable at the access layer and a fantastic option there, but no current solutions exist to move it up further into the network. Products are on the road map to push FCoE further into the network but that may not necessarily be the best way forward.
NFS: File level protocol which runs on top of UDP or TCP and IP.

And a quick look at comparative performance:

Protocol Performance

While the above performance model is subjective and network tuning and specific equipment will play a big role the general idea holds sound.

One of the biggest factors that needs to be considered when choosing these protocols is block vs. file. Some applications require direct block access to disk, many databases fall into this category. As importantly if you want to boot an operating system from disk block level protocol (iSCSI, FCoE) are required. This means that for most diskless configurations youâ€™ll need to make a choice between FCoE and iSCSI (still within the assumption of consolidating on 10GE.) Diskless configurations have major benefits in large scale deployments including power, cooling, administration, and flexibility so you should at least be considering them.

If you chosen a diskless configuration and settled on iSCSI or FCoE for your boot disks now you still need to figure out what to do about file shares? CIFS or NFS are your next decision, CIFS is typically the choice for Windows, and NFS for Linux/UNIX environments. Now youâ€™ve wound up with 2-3 protocols running to get your storage settled and your stacking those alongside the rest of your typical LAN data.

Now to look at management step back and take a look at block data as a whole. If youâ€™re using enterprise class storage youâ€™ve got several steps of management to configure the disk in that array. It varies with vendor but typically something to the effect of:

Configure the RAID for groups of disks
Pool multiple RAID groups
Logically sub divide the pool
Assign the logical disks to the initiators/servers
Configure required network security (FC zoning/ IP security/ACL, etc)

While this is easy stuff for storage and SAN administrators itâ€™s time consuming, especially when you start talking about cloud infrastructures with lots and lots of moves adds and changes. It becomes way to cumbersome to scale into petabytes with hundreds or thousands of customers. NFS has more streamlined management but it canâ€™t be used to boot an OS. This makes for extremely tough decisions when looking to scale into large virtualized data center architectures or cloud infrastructure.

There is a current option that allows you to consolidate on 10GE, reduce storage protocols and still get diskless servers. I
tâ€™s definitely not the solution for every use case (there isnâ€™t one), and itâ€™s only a great option because there arenâ€™t a whole lot of other great options.

In a fully virtualized environment NFS is a great low management overhead protocol for Virtual Machine disks. Because it canâ€™t boot we need another way to get the operating system to server memory. Thatâ€™s where PXE Boot comes in. Pre eXecutionEnvironment (PXE) is a network OS boot that works well for small operating systems, typically terminal clients or Linux images. It allows for a single instance of the operating system to be stored on a PXE server attached to the network, and a diskless server to retrieve that OS at boot time. Because some virtualization operating systems (Hypervisors) are light weight, they are great candidates for PXE boot. This allows the architecture below.

PXE/NFS 100% Virtualized Environment

Summary:

While there are several options for data center storage none of them solves every need. Current options increase in complexity and management as the scale of the implementation increases. Looking to the future we need to be looking for better ways to handle storage. Maybe block based storage has run itâ€™s course, maybe SCSI has run itâ€™s course, either way we need more scalable storage solutions available to the enterprise in order to meet the growing needs of the data center and maintain manageability and flexibility. New deployments should take all current options into account and never write off the advantages of using more than one, or all of them where they fit.

HP Flex-10, Cisco VIC, and Nexus 1000v

When discussing the underlying technologies for cloud computing topologies virtualization is typically a key building block.Â Virtualization can be applied to any portion of the data center architecture from load-balancers to routers, and from servers to storage.Â Server virtualization is one of the most widely adopted virtualization technologies, and provides a great deal of benefits to the server architecture.Â

One of the most common challenges with server virtualization is the networking.Â Virtualized servers typically consist of networks of virtual machines that are configured by the server team with little to no management/monitoring possible from the network/security teams.Â This causes inconsistent policy enforcement between physical and virtual servers as well as limited network functionality for virtual machines.

Virtual Networks

The separate network management models for virtual servers and physical servers presents challenges to: policy enforcement, compliance, and security, as well as adds complexity to the configuration and architecture of virtual server environments.Â Due to this fact many vendors are designing products and solutions to help draw these networks closer together.

The following is a discussion of three products that can be used for this, HPâ€™s Flex-10 adapters, Ciscoâ€™s Nexus 1000v and Ciscoâ€™s Virtual interface Card (VIC.)Â

This is not a pro/con or discussion of which is better, just an overview of the technology and how it relates to VMware.

HP Flex-10 for Virtual Connect:

Using HPâ€™s Virtual Connect switching modules for C-Class blades and either Flex-10 adapters or Lan-On-Motherboard (LOM) administrators can â€˜partition the bandwidth of a single 10GbÂ pipeline into multiple â€œFlexNICs.â€ In addition, customers can regulate the bandwidth for each partition by setting it to a user-defined portion of the total 10Gb connection. Speed can be set from 100 Megabits per second to 10 Gigabits per second in 100 Megabit increments.â€™ (http://bit.ly/boRsiY)

This allows a single 10GE uplink to be presented to any operating system as 4 physical Network Interface Cards (NIC.)

FlexConnect

In order to perform this interface virtualization FlexConnectÂ uses internal VLANÂ mappings for traffic segregation within the 10GEÂ Flex-10 port (mid-plane blade chassis connection from the Virtual Connect Flex-10 10GbEÂ interconnect module and the Flex-10 NIC device.)Â Each FlexNICÂ can present one or more VLANs to the installed operating system.

Some of the advantages with this architecture are:

A single 10GE link can be divided into 4 separate logical links each with a defined portion of the bandwidth.
More interfaces can be presented from fewer physical adapters which is extremely advantageous within the limited space available on blade servers.

When the installed operating system is VMware this allows for 2x10GEÂ links to be presented to VMware as 8x separate NICsÂ and used for different purposes such as vMotion, Fault Tolerance (FT), Service Console, VM kernel and data.

The requirements for Flex-10 as described here are:

HP C-Class blade chassis
VC Flex-10 10GE interconnect module (HP blade switches)
Flex-10 LOM and or Mezzanine cards

Cisco Nexus 1000v:

â€˜Cisco Nexus^Â®Â 1000V Series Switches are virtual machine access switches that are an intelligent software switch implementation for VMware vSphere environments running the Cisco^Â®Â NX-OS operating system. Operating inside the VMware ESXÂ hypervisor, the Cisco Nexus 1000V Series supports Cisco VN-Link server virtualization technology to provide:

â€¢ Policy-based virtual machine (VM) connectivity

â€¢ Mobile VM security and network policy, and

â€¢ Non-disruptive operational model for your server virtualization, and networking teamsâ€™(http://bit.ly/b4JJX5.)

The Nexus 1000vÂ is a Cisco software switch which is placed in the VMware environment and provides physical type network control/monitoring to VMware virtual networks.Â The Nexus 1000v is comprised ofÂ two components the Virtual Supervisor Module (VSM) and Virtual Ethernet Module (VEM.)Â The Nexus 1000vÂ does not have hardware requirements and can be used with any standards compliant physical switching infrastructure.Â Specifically the upstream switch should support 802.1q trunks and LACP.

Cisco Nexus 1000v

Using the Nexus 1000v Network teams have complete control over the virtual network and manage it using the same tools and policies used on the physical network.

Some advantages of the 1000v are:

Consistent policy enforcement for physical and virtual servers
vMotion aware policies migrate with the VM
Increased, security, visibility and control of virtual networks

The requirements for Cisco Nexus 1000v are:

vSphere 4.0 or higher
Enterprise + VMware license
Per physical host CPU VEM license
Virtual Center Server

Cisco Virtual interface Card (VIC):

The Cisco VIC provides interface virtualization similar to the Flex-10 adapter.Â One 10GEÂ port is able to be presented to an operating system as up to 128 virtual interfaces depending on the infrastructure. â€˜The Cisco UCS M81KR presents up to 128 virtual interfaces to the operating system on a given blade. The virtual interfaces can be dynamically configured by Cisco UCS Manager as either Fibre Channel or Ethernet devicesâ€™ (http://bit.ly/9RT7kk.)

Fibre Channel interfaces are known as vFCÂ and Ethernet interfaces are known as vEth, they can be used in any combination up to the architectural limits.Â Currently the VIC is only available for Cisco UCS blades but will be supported on UCS rack mount servers as well by the end of 2010.Â Interfaces are segregated using an internal tagging mechanism known as VN-Tag which does not use VLANÂ tags and operates independently of VLAN operation.

Virtual Interface Card

Each virtual interface acts as if directly connected to a physical switch port and can be configured in Access or Trunk mode using 802.1q standard trunking. These interfaces can then be used by any operating system or VMware.Â For more information on their use see my post Defining VN-Link (http://bit.ly/ddxGU7.)

VIC Advantages:

Granular configuration of multiple Fibre Channel and Ethernet ports on one 10GE link.
Single point of network configuration handled by a network team rather than a server team.

Requirements:

Cisco UCS B-series blades (until C-Series support is released)
Cisco Fabric interconnect access layer switches/managers.

Summary:

Each of these products has benefits in specific use cases and can reduce overhead and/or administration for server networks.Â When combining one or more of these products you should carefully analyze the benefits of each and identify features that may be sacrificed by combining the two.Â For instance using the Nexus 1000vÂ along with FlexConnect adds a Server administered network management layer in between the physical network and virtual network.

Nexus 1000v with Flex-10

Comments and corrections are always welcome.

Virtualization

While not a new concept virtualization has hit the main stream over the last few years and become a uncontrollable buzz word driven by VMware, and other server virtualization platforms. Virtualization has been around in many forms for much longer than some realizes, things like Logical partitions (LPAR) on IBM Mainframes have been around since the 80's and have been extended to other non-mainframe platforms. Networks have been virtualized by creating VLANs for years. The virtualization term now gets used for all sorts of things in the data center. like it or love the term doesn't look like it's going away anytime soon.

Virtualization in all of its forms is a pillar of Cloud Computing especially in the private/internal cloud architecture. To define it loosely for the purpose of this discussion let's use 'The ability to divide a single hardware device or infrastructure into separate logical components.

Virtualization is key to building cloud based architectures because it allows greater flexibility and utilization of the underlying equipment. Rather than requiring separate physical equipment for each 'Tenant' multiple tenants can be separated logically on a single underlying infrastructure. This concept is also known as 'multi-tenancy.' Depending on the infrastructure being designed a tenant can be an individual application, internal team/department, or external customer. There are three areas to focus on when discussing a migration to cloud computing, servers, network, and storage.

Server Virtualization:

Within the x86 server platform (typically the Windows/Linux environment.) VMware is the current server virtualization leader. Many competitors exist such as Microsoft's HyperV and Zen for Linux, and they are continually gaining market share. The most common server virtualization allows a single physical server to be divided into logical subsets by creating virtual hardware, this virtual hardware can then have an Operating System and application suite installed and will operate as if it were an independent server. Server virtualization comes in two major flavors, Bare metal virtualization and OS based virtualization.

Bare metal virtualization means that a lightweight virtualization capable operating system is installed directly on the server hardware and provides the functionality to create Virtual Servers. OS based virtualization operates as an application or service within an OS such as Microsoft Windows that provides the ability to create virtual servers. While both methods are commonly used Bare Metal virtualization is typically preferred for production use due to the reduced overhead involved.

Server virtualization provides many benefits but the key benefits to cloud environments are: increased server utilization, and operational flexibility. Increased utilization means that less hardware is required to perform the same computing tasks which reduces overall cost. The increased flexibility of virtual environments is key to cloud architectures. When a new application needs to be brought online it can be done without procuring new hardware, and equally as important when an application is decommissioned the physical resources are automatically available for use without server repurposing. Physical servers can be added seamlessly when capacity requirements increase.

Network Virtualization:

Network virtualization comes in many forms. VLANs, LSANs, VSANs allow a single physical LAN or SAN architecture to be carved up into separate networks without dependence on the physical connection. Virtual Routing and Forwarding (VRF) allows separate routing tables to be used on a single piece of hardware to support different routes for different purposes. Additionally technologies exist which allow single network hardware components to be virtualized in a similar fashion to what VMware does on servers. All of these tools can be used together to provide the proper underlying architecture for cloud computing. The benefits of network virtualization are very similar to server virtualization, increased utilization and flexibility.

Storage Virtualization:

Storage virtualization encompasses a broad range of topics and features. The term has been used to define anything from the underlying RAID configuring and partitioning of the disk to things like IBMs SVC, and NetApp's V-Series both used for managing heterogeneous storage. Without getting into what's right and wrong when talking about storage virtualization, let's look at what is required for cloud.

First consolidated storage itself is a big part of cloud infrastructures in most applications. Having the data in one place to manage can simplify the infrastructure, but also increases the feature set especially when virtualizing servers. At a top-level looking at storage for cloud environments there are two major considerations: flexibility and cost. The storage should have the right feature set and protocol options to support the initial design goals, it should also offer the flexibility to adapt as the business requirements change. Several vendors offer great storage platforms for cloud environments depending on the design goals and requirements. Features that are typically useful for the cloud (and sometimes lumped into virtualization) are:

De-Duplication - Maintaining a single copy of duplicate data, reducing overall disk usage.

Thin-provisioning - Optimizes disk usage by allowing disks to be assigned to servers/applications based on predicted growth while consuming only the used space. Allows for applications to grow without pre-consuming disk.

Snapshots - Low disk use point in time record which can be used in operations like point-in-time restores.

Overall virtualization from end-to-end is the foundation of cloud environments, allowing for flexible high utilization infrastructures.

What's a cloud?

So to start things off I thought I'd take a stab at defining the cloud. This is always an interesting subject because so many people have placed very different labels and definitions on the cloud. YouTube is filled with videos of high dollar IT talking heads spitting up non-sensical answers as to what cloud is, or in many cases diverting the question and discussing something they understand. So before we get into what it is let's talk about why it's so hard to define?

Part of the difficulty in defining cloud comes from the fact that the term gets its power from being loosely defined. If you put a strict definition on it, the term becomes useless. For example put yourself in the shoes of an IT vendor account manager (sales rep), if you are an account manager stay in your own shoes for the next excercise and forget I said sales rep.

Now from those shoes imagine yourself in a meeting with a CxO discussing the data center. A question such as 'Have you looked into implementing a cloud strategy and if so what are your goals' can be quite powerful. It's an open-ended question that leaves plenty of room for discussion. Within that discussion there is a large opportunity to identify business challenges, and begin to narrow down solutions which equates to potential product sales to meet the customers requirements.

If cloud had a strict definition such as 'Providing an off-site hosted service at a subscription rate' it would only be applicable to a handful of customers. Any strict definition of cloud based on size, location, infrastructure requirements, etc. reduces the overall usability of the term. This doesn't imply that cloud is just a sales term and should be ignored, but it does make the definition more complicated.

From a sales perspective the value of cloud is the flexibility of the term, which is quite interestingly one of the technical values to the customer (we'll discuss that later.) From a customer perspective the real challenge is defining what it means to you. Service providers such as BT and AT&T will have very different definitions of cloud from Amazon and Google. Amazon and Google will define cloud differently than SalesForce.com, and they'll all have totally different definitions than the average enterprise customer. This is because the definition of cloud is in the eye of the beholder, it's all about how the concept can be effectively applied to the individual business demands.

Part of the reason engineers tend to cringe when they hear the term cloud is that it has no real meaning to an engineer. Engineers work with defined terms that are quantifiable, bandwidth, bus speed, port count, disk space. If you use any of those terms in your definition you've already missed the point. To make matters worse the more cloud gets discussed the more confusing it becomes from an engineering standpoint, we started with just cloud and now have: private cloud, internal cloud, public cloud, secure cloud, hybrid cloud, semi-private cloud. It's akin to LAN, MAN, WAN, CAN, PAN, GAN to describe various 'Area networks' except at least those can move data and cloud is just a term.

Cloud is not an engineering term, cloud is a business term.

Cloud does not solve IT problems, cloud solves business problems.

So to bring a definition to the term let's stay at 10,000 feet and think conceptually, because that's what it's really all about. Think about the last time you saw a PDF, or drew a whiteboard that discussed moving data across the internet. Did you draw the complex series of routers and switches, optical mux's and de-mux's that moved your packet from London to Beijing? Of course not, but what did you draw instead, why? If you're like most of the world you drew a cloud, you drew that cloud because you don't care about the underlying infrastructure or complexity. You know that the web has already been built and it works. You know that if you put a packet in one end, it will eventually come out the other. You don't care how it gets there, only that it does. The term cloud is no more complex than that, it's all about putting together an infrastructure that gets the job done without having to dwell on the underlying components.

The point of moving to a cloud architecture is a rethinking of what the data center really does and how it does it in order to alleviate current data center issues without causing new ones. Start by asking what is the purpose of the data center, and really take some time to think about it. The entire data center, from the input power through the distribution system and UPS, the cooling, the storage, the network, and the servers are all there to do one thing, run applications. The application truly is the center of the data center universe. If you've ever been on a help desk team and got a call from a user saying 'the network is slow' it wasn't because they ran a throughput test and found a bottleneck, it was because their email wouldn't load or it took them an extra 15 seconds to access a database. The data center itself is there to support the applications that run the business.

That tends to be a hard pill to swallow for people who have spent their lives in networks, or storage, or even server hardware because in reality the only Oscar they could ever receive is best supporting actor, applications are the star of the show. No company built a network and then decided they should find an application to use up some of that bandwidth. We've built our infrastructures to support the applications we choose to run our business, and that's where the problems came from.

We've built data center infrastructures one app at a time as our businesses grew, adding on where necessary and removing when/if possible. What we've ended up with is a Frankenstein type mess of siloed architecture and one trick pony infrastructure. We've consolidated, and scaled in or out, we've virtualized all to try to fix this and we've failed. We've failed because we've taken a view of the current problems while wearing blinders that prevented seeing the big picture. The big picture is that businesses require apps, apps come and go, and apps require infrastructure. The solution is building a flexible, available, high performance platform that adapts to our immediate businesses needs. A dynamic infrastructure that ties directly to the business needs without the need for procurement or decommission when an app comes up or hits its end-of-life. That applies to any type of cloud you care to talk about.

In future blogs I'll describe the technical and business drivers behind cloud solutions, the types of clouds, the risks of cloud, and the technology that enables a move to cloud. I'll do my best to remain vendor neutral and keep my opinions out of it as much as possible. I welcome any and all feedback, comments and corrections.

The cloud doesn't have to be a pig