Thoughts From a Global Technology Leadership Forum

I recently had the privilege of attending and participating in a global technology leadership forum.  The forum consisted of technology investors, vendors, and thought leaders, and was an excellent event.  The tracks I focused on were VDI, Big Data, Data Center Infrastructure, Data Center Networks, Cloud, and Collaboration.  The following are my notes from the event:

VDI:

There was a lot of discussion around VDI and a track dedicated to it.  The overall feeling was that VDI has not lived up to its hype over the last few years: while it continues to gain market share, it never reaches the predicted numbers or the breakout growth forecast for it.  For the most part the technical experts were in agreement on the key points.

There was some disagreement on whether VDI is the right next step for the enterprise.  The split I saw was nearly 50/50: half thought it is the way forward and will be deployed at greater and greater scale, while the other half saw it as one of many viable current solutions that may not be the right 3-5 year goal.  I've expressed my thoughts previously: http://www.definethecloud.net/vdi-the-next-generation-or-the-final-frontier. Lastly, we agreed that the key leaders in this space are still VMware and Citrix.  While each has pros and cons, the consensus was that the two solutions are close enough that both are viable, and that VMware's market share and muscle make it quite possible for it to pull into a dominant lead.  Other players in this space were complete afterthoughts.

Big Data:

Let me start by saying I know nothing about big data; I sat in these expert sessions to understand more about it, and they were quite interesting.  Big data sets are being built, stored, and analyzed.  Customer data, click traffic, etc. are being housed to gather all types of information and insight.  Hadoop clusters are being used to process the data, and cloud storage such as Amazon S3 is being utilized alongside on-premises solutions.  The main questions were where the data should be stored and where it should be processed, as well as the compliance issues that may arise with both.  Another interesting question was the ability to leave the public cloud if your startup grows big enough to beat the costs of public cloud with a private one.  For example, if you have a lot of data you can mail Amazon disks to get it into S3 faster than WAN speed, but to our knowledge they can't/won't mail your disks back if you want to leave.
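To put the mail-the-disks option in perspective, here is a back-of-the-envelope calculation; the data size, link speed, and shipping time are illustrative assumptions, not Amazon's numbers:

```python
# Rough math: when does shipping disks beat pushing data over the WAN?
# All figures are assumptions for illustration only.
data_tb = 10      # data set size in terabytes
wan_mbps = 100    # available WAN bandwidth in megabits per second

data_bits = data_tb * 1e12 * 8                   # terabytes -> bits
wan_days = data_bits / (wan_mbps * 1e6) / 86400  # transfer time in days
shipping_days = 2                                # assumed courier + load time

print(f"{data_tb} TB over {wan_mbps} Mbps: ~{wan_days:.1f} days")  # ~9.3 days
print("Mail the disks" if shipping_days < wan_days else "Use the WAN")
```

At these numbers the courier wins by about a week, which is exactly why the one-way nature of the service matters if you ever want your data back.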

Data Center Infrastructure:

Overall there was agreement that very few data center infrastructure (defined here as compute, network, and storage) conversations occur without talk of cloud.  Cloud is a consideration for IT leaders from the SMB to the large global enterprise.  That being said, while cloud may frame the discussion, the majority of current purchases are still focused on consolidation and virtualization, with some automation sprinkled in.  Private-cloud stacks from the major vendors also come into play, helping to accelerate the journey, but many are still not true private clouds (see: http://www.definethecloud.net/the-difference-between-private-cloud-and-converged-infrastructure.)

Data Center Networks:

I moderated a session on flattening data center networks, currently referred to as building ‘fabrics.’  The majority of the large network players have announced or are shipping ‘fabric’ solutions.  These solutions build multiple active paths at Layer 2, alleviating the blocked links traditional Spanning Tree requires.  This is necessary as we converge our data and ask more of our networks.  The panel agreed that these tools are necessary, but that standards are required to push this forward and avoid vendor lock-in.  As an industry we don’t want to downgrade our vendor independence to move to a fabric concept.  That being said, most agreed that pre-standard proprietary deployments are acceptable as long as the vendor is committed to the standard and the hardware is intended to be standards compliant.
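To make the multipath idea concrete, here is a toy sketch of flow-based path selection, a simplification of what fabrics do rather than any vendor's actual algorithm; with classic Spanning Tree the paths list would effectively hold one entry, with the rest blocked:

```python
import hashlib

# Toy flow-hash path selection across multiple active Layer 2 paths.
# Hashing the whole flow keeps its packets in order on one path, while
# different flows spread across links Spanning Tree would otherwise block.
def choose_path(src_mac, dst_mac, src_ip, dst_ip, paths):
    key = f"{src_mac}|{dst_mac}|{src_ip}|{dst_ip}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return paths[digest % len(paths)]

paths = ["spine-1", "spine-2", "spine-3", "spine-4"]  # hypothetical fabric links
print(choose_path("00:1b:21:00:aa:01", "00:1b:21:00:bb:02",
                  "10.0.0.5", "10.0.1.9", paths))
```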

Cloud:

One of the main conversations I had was in regard to PaaS.  While many agree that PaaS and SaaS are the end goals of public and private clouds, the PaaS market is not yet fully mature (see: http://www.networkcomputing.com/private-cloud/231300278.)  Compatibility, interoperability, and lock-in were the major overall concerns for PaaS.  Additionally, while there are many PaaS leaders, the market is so immature that leadership could change at any time, making it hard to pick which horse to back.

Another big topic was open standards and open source: OpenStack, OpenFlow, and open source players like Red Hat.  With Red Hat’s impressive year-over-year growth they are tough to ignore, and there is a lot of push for open source solutions as we move to larger and larger cloud systems.  The feeling is that larger, more technically adept IT shops will look to these solutions first when building private clouds.

Collaboration:

Yet another subject I’m not an expert on but wanted to learn more about.  The first part of the discussion entailed deciding what we were actually discussing, i.e. ‘What is collaboration?’  With the term encompassing voice, video, IM, conferencing, messaging, social media, etc., depending on who you talk to, this was needed.  We settled on a focus on enterprise productivity tools, messaging, information repositories, and the like.  The overall feeling was that there are more questions than answers in this space: great tools exist, but there is no clear leader.  Additionally, integration between enterprise tools and public tools was a topic, including the question of ensuring compliance.  One of the major discussions was building internal adoption and maintaining momentum.  The concern with a collaboration tool rollout is an initial boom of interest followed by a lull and the eventual death of the tool, as users get bored with the novelty before finding any ‘stickiness.’

Why FCoE Standards Matter

Mike Fratto at Network Computing recently wrote an article titled ‘FCoE: Standards Don’t Matter; Vendor Choice Does’ (http://www.networkcomputing.com/storage-networking-management/231002706.)

I definitely differ from Mike’s opinion on the subject.  While I’m no fan of the process of making standards (it puts sausage making to shame), or of the idea of slowing progress to wait on standards, I do feel they are an absolutely necessary part of FCoE’s future.  It’s all about the timing at which we expect them, the way in which they’re written, and most importantly the way in which they’re adhered to.

Mike bases his opinion on Fibre Channel history and accurately describes the stranglehold the storage vendors have had on the customer.  The vendor’s Hardware Compatibility List (HCL) dictates which vendors you can connect to, and which models and firmware you can use.  Slip off the list and you lose support.  This means that in the FC world customers typically went with the Storage Area Network (SAN) their VAR or storage vendor recommended, and stuck with it.  While not ideal, this worked fine in the small network environment of a SAN, with its specialized, dedicated purpose of delivering block data from array to server.  These extreme restrictions based on storage vendors and protocol compatibility will not fly as we converge networks.

As worried as storage/SAN admins may be about moving their block data onto Ethernet networks, the traditional network admins may be more worried, because of interoperability.  For years network admins have been able to intermix disparate vendors’ technology to build the networks they desired, best-of-breed or not.  A load balancer here, a firewall there, a data center switch here, and presto: everything works.  They may have had to sacrifice some features (proprietary value-add that isn’t compatible), but they could safely connect the devices.  More importantly, they didn’t have to answer to an HCL dictated by some endpoint (a storage disk) or another on their network.

For converged networking to work, this freedom must remain.  Adding FCoE to consolidate infrastructure cannot lock network admins into storage HCLs and extreme hardware incompatibility.  This means the standards must exist, be agreed upon, be specific enough, and be adhered to.  While Mike is correct that you probably won’t want to build multi-vendor networks on day one, you will want the opportunity to incorporate other services and products, migrate from one vendor to another, and so on.  You’ll want an interoperable standard that allows you to buy third-party FCoE appliances for things like de-duplication, compression, encryption, or whatever you may need down the road.  We’re not talking about building an Ethernet network dedicated to FCoE; we’re talking about building one network to rule them all (hopefully we never have to take it to Mordor and toss it into the molten lava.)  To run one network we need the standards and compatibility that provide us flexibility.

There is no reason for storage vendors to hold the keys to what you can deploy any longer.  Hardware is stable, and if standards are in place the network will properly transport the blocks.  Customers and resellers shouldn’t accept lock-in and HCL dictation just because that has been the status quo.  We’re moving the technology forward; move your thinking forward.  The issue in the past has been the looseness with which the T11 FC-BB-5 standard is written in some aspects since its inception.  This leaves room for interpretation, which is where interoperability issues arise between vendors who are both ‘standards based.’  The onus is on us as customers, resellers, and an IT community to demand that the standards be well defined, and that the vendors adhere to them in an interoperable fashion.

Do not accept incompatibility and lack of interoperability in your FCoE switching just because we made the mistake of allowing that to happen with pure FC SANs.  Next time your storage vendor wants a few hundred thousand dollars for your next disk array, tell them it isn't happening unless you can plug it into any standards-compliant network without fear of their HCL and loss of support.

FlexPod Discussion with Vaughn Stewart and Abhinav Joshi

I enjoyed a great conversation with NetApp's Vaughn Stewart and Cisco's Abhinav Joshi about FlexPod last week during Cisco Live 2011. Check out the video below.

VDI, the Next Generation or the Final Frontier?

After sitting through a virtualization sales pitch focused on Virtual Desktop Infrastructure (VDI) this afternoon, I had several thoughts on the topic that seemed blog worthy.

VDI has been a constant buzzword for a few years now, riding the coattails of server virtualization.  For most of those years you can search back and find predictions from the likes of Gartner touting ‘This is the year for VDI’ or making similar statements, typically with projected growth rates that never materialize.  What you won’t see is those same analyst organizations reaching back the year after and answering for why they over-hyped it or were blatantly incorrect. (Great idea for a yearly blog here: analyzing the previous year’s failed predictions.)

The reasons they’ve been incorrect vary over the years, starting with the technical inadequacy of the infrastructures and a lack of real understanding as an industry.  When VDI first hit the forefront, many of us (myself included) assumed desktops could be virtualized the same as servers (Windows is Windows, right?)  What we neglected to account for was the plethora of varying user applications, the difficulty of video and voice, and other factors such as boot storms, which are unique to, or amplified within, VDI environments compared to their server counterparts.  From there, VDI rollout horror stories and memories of failed proofs of concept slowed adoption and interest for a short period.

Now we’re at a point where the technology can overcome the challenges and the experts are battle-hardened with knowledge of successes and failures in various environments; yet still adoption is slow.  Users are bringing new devices into the workplace and expecting them to interface with enterprise services; yet still adoption is slow.  We supposedly have a more demanding influx of younger-generation employees who demand remote access from their chosen devices; yet still adoption is slow.  This doesn’t mean that VDI isn’t being adopted, nor that the market share numbers aren’t increasing across the board; it’s just slow.

The reason for this is that our thinking and capabilities for service delivery have surpassed the need for VDI in many environments. VDI wasn’t an end goal but an improvement over individually managed, monitored, and secured local end-user OS environments.  The end goal isn’t removing the OS’s tie to the hardware on the endpoint (which is what VDI does) but removing the application’s tie to the OS; or, more simply put, removing any local requirements for access to the services.  Starting to sound like cloud?

Cloud is the reason enterprise IT hasn’t been diving into VDI head first; the movement to cloud services has shown that, for many, we may have passed the point where VDI could show true Return On Investment (ROI) before being obsoleted.  Cloud is about delivering the service to any web-connected endpoint, on demand, regardless of platform (OS.)  If you can push the service to my iOS, Android, Windows, Linux, etc. device without requiring a particular OS, then what’s the need for VDI?

To use a real-world example: I am a Microsoft zealot.  I use Windows 7, Bing for search, and only IE for browsing on my work and personal computers (call me retro.)  I also own an iPad, mainly due to the novelty and the fact that I got addicted to ‘Flight Control’ on a friend’s iPad at the release of the original.  I occasionally use the iPad for what I’d call ‘productivity work’ related to my primary role or side projects.  Using my iPad I do things like access corporate email for the company I work for and my own, review files, access Salesforce and Salesforce Chatter, and even perform some remote equipment demos, and my files are seamlessly synced between my various other computers.  I do all of this without a Windows 7 virtual desktop running on my iPad; it’s all done through apps connected to these services directly.  In fact, the only reason I have VDI client applications on my iPad is to demo VDI, not to actually work.

Now, an iPad is not a perfect example: I’d never use it for developing content (slides, reports, spreadsheets, etc.), but I do use it for consuming content, email, and the like.  To develop, I turn to a laptop with a full keyboard, screen, and monitor outputs.  This laptop may be a case for VDI, but in reality, why?  If the services I use are cloud based, public or private, and the data I utilize is as well, then the OS is irrelevant again.  With office applications moving to the cloud (Microsoft Office 365, Google Docs, etc.), along with many other services and applications already there, what is the need for a VDI infrastructure?

VDI, like server virtualization, is really a band-aid for an outdated application deployment process that uses local applications tied to a local OS and hardware.  Virtualizing the hardware doesn’t change that model, but it can provide benefits such as resource optimization, power and cooling savings, flexibility, and rapid deployment.

Once the wound of our current application deployment model has fully healed, the band-aid comes off and we have service delivery from cloud computing environments free of any OS or hardware ties.

So, friends don’t let friends virtualize desktops, right?

Not necessarily.  As shown above, VDI can have significant advantages over standard desktop deployment.  Those advantages can drive business flexibility and reduce costs.  The difficult questions become ones of timing and fit: how quickly will a VDI deployment reach ROI, and could a pure service delivery model get there first?

Many organizations will still see benefits from deploying VDI today, because the ROI of VDI will arrive more quickly than the ability to deliver all business apps as a service.  Additionally, VDI is an excellent way to get your feet wet with the concepts of supporting any device under organizational controls and delivering services remotely.  Coupling VDI with things like thin apps will put you one step closer while providing additional flexibility in your IT environment.

When assessing a VDI project, you’ll want to take a close look at the time it will take your organization to hit ROI with the deployment, and weigh that against the time it would take to move to a pure service delivery model (if your organization is capable of such a move.)  VDI is a fantastic tool in the data center tool bag, but like all the others, it’s not the right tool for every job.  VDI is definitely the Next Generation, but it is not The Final Frontier.

Additional fun:

Here are some sales statements that are commonly used when pitching VDI, all of which I consider to be total hogwash.  Try out or modify a few of my one-line answers next time your vendor is there telling you about the wonderful world of VDI and why you need it now.

Vendor: ‘<Insert analyst here (Gartner, etc.)> says that 2011 is the year for VDI.’  Alternatively ‘<Insert analyst here (Gartner, etc.)> predicts VDI to grow X amount this year.’

My answer: ‘That’s quite interesting; let’s adjourn for now and reconvene when you’ve got data on <Insert analyst here (Gartner, etc.)>’s VDI predictions for the previous five years.’

Vendor: ‘The next generation of workers coming from college demand to use the devices and services they are used to, to do their job.’

My answer: ‘Excellent, they’ll enjoy working somewhere that allows that; we have corporate policies and rules to protect our data and network.’  This won’t work in every case, as Mike Stanley (@mikestanley) pointed out to me: universities, for example, have student IT consumers who are the paying customers, and this approach would be much more difficult in such cases.

Vendor: ‘People want a Bring Your Own (BYO) device model.’

My Answer: ‘If I bring my own device, and the fact that I want to matters, what makes you think I’ll want your desktop?  Just give me the application or service.’

Server/Desktop Virtualization–A Best of Breed Band-Aid

Virtualization is a buzzword that has moved beyond hype into mainstream use and enterprise deployment.  A few years back vendors were ‘virtualization-washing’ their products and services the way many ‘cloud-wash’ them today.  Now a good majority of enterprises are well into their server virtualization efforts and moving into Virtual Desktop Infrastructure (VDI) and cloud deployments.  This is no accident: hardware virtualization comes with a myriad of advantages, such as resource optimization, power and cooling savings, flexibility, and rapid deployment.  That being said, we dove into server/desktop virtualization with the same blinders on we’ve worn as an industry since we broke away from big iron: we effectively fix point problems while ignoring the big picture, and create new problems in the process.

The underlying issue is the way in which we design our applications.  When we moved to commodity servers we built an application model with a foundation of one application, one operating system (OS), one server, and we’ve maintained that model ever since.  Server/desktop virtualization provides benefits but does not change this model; it just virtualizes the underlying server and places more silos on a single piece of hardware to increase utilization.  Our applications, and the services they deliver, are locked into this model and suffer for it when we look at scale, flexibility, and business continuance.

This is not a sustainable model, or at best not the most efficient model for service delivery.  Don’t take my word for it: jump on Bing and search for recent VMware acquisitions and partnerships.  The dominant giant in virtualization is acquiring, or partnering with, companies poised to make it the dominant giant in PaaS and SaaS.  Cloud computing as a whole offers the opportunity to rethink service delivery, or, possibly more importantly, brings the issues of service delivery and IT costing to the front of our minds.

Moving applications and services to robust, highly available, flexible architectures is the first step in transforming IT into a department that enables the business.  The second step is removing the application-OS silo and building services that can scale up and down independent of the underlying OS stack.  When you talk about zero-downtime business continuance, massively scalable applications, global accessibility, and similar requirements, the current model is an anchor.

That being said, transforming these services is no small task.  Redesigning applications to new architectures can be monumental; redesigning organizations and processes and retraining people can be even more difficult.  The technical considerations for designing globally available services touch on every aspect of application and architecture design: storage, network, web access, processing, etc.  Still, the tools are either available or rapidly emerging.

Any organization looking to make significant IT purchases or changes should consider all of the options and look at the big picture as much as possible.  The technology is available to transform the way we do business.  It may not be right for every organization or application, but it’s not an all-or-nothing proposition.  There’s no fault in virtualizing servers and desktops today, but the end goal on the roadmap should be efficient service delivery optimized to the way you do business.

For more of my thoughts on this see my post on www.networkcomputing.com: http://www.networkcomputing.com/private-cloud/230600012.

The Cloud Rules

Cloud Computing Concepts:

These are Twitter-sized quick thoughts.  If you’d like more elaboration, or have a comment, participation is highly encouraged.  As I’ve run out of steam on this, I’ve decided to move it into a blog post rather than a page.

OTV and VPLEX: Plumbing for Disaster Avoidance

High availability, disaster recovery, business continuity, etc. are all key concerns of any data center design. They each describe a separate component of the big concept: 'When something does go wrong, how do I keep doing business?'

Very public real-world disasters have taught us as an industry valuable lessons in what real business continuity requires. The concepts of off-site archives and Disaster Recovery (DR) can be at least partially attributed to the Oklahoma City bombing. Prior to that, having only local or off-site tape archives was commonly acceptable: data gets lost, I get the tape and restore it. That worked well until we saw what happens when you have all the data and no data center to restore to.

September 11th, 2001 taught us another lesson, about distance. There were companies with a primary data center in one tower and the DR data center in the other. While that may seem laughable now, it wasn't unreasonable then: there were latency and locality gains from the setup, and the idea that both world-class engineering marvels could come down was far-fetched.

With lessons learned, we're now all experts in the needs of DR, right up until the next unthinkable happens ;-). Sarcasm aside, we now have a better set of recommended practices for DR solutions to provide Business Continuity (BC.) It's commonly accepted that sites should be a minimum of 50 km apart. 50 km will protect from an explosion, a power outage, and several other events, but it probably won't protect from a major natural disaster such as an earthquake or hurricane. If those are concerns the distance increases, and you may end up with more than two data centers.

There are obviously significant costs involved in running a DR data center. Due to these costs, the concept of running a 'dark' standby data center has gone away: if we pay for compute, storage, and network, we want to be utilizing it. Running test/dev systems or other non-frontline mission-critical applications is one option, but ideally both data centers could be used actively for production workloads, with the ability to fail over for disaster recovery or avoidance.

While solutions for this exist within high-end Unix platforms and mainframes, it has been a tough cookie to crack in the x86/x64 commodity server market. The reason is that we've designed our commodity server environments as individual application silos directly tied to the operating system and underlying hardware. This makes it extremely complex to decouple the application and allow it to live resident in two physical locations, or at least migrate non-disruptively between the two.

In steps VMware and server virtualization, with the ability to decouple the operating system and application from the hardware they reside on.  With the direct hardware tie removed, applications running in operating systems on virtual hardware can be migrated live (without disruption) between physical servers; this is known as vMotion.  This application mobility puts us one step closer to active/active data centers from a Disaster Avoidance (DA) perspective, but it doesn’t come without some catches: bandwidth, latency, Layer 2 adjacency, and shared storage.

The first two challenges can be addressed between data centers using two tools: distance and money.  You can always spend more money to buy more WAN/MAN bandwidth, but you can’t beat physics, so latency depends on the speed of light and therefore on distance.  Even with those two problems solved, there has traditionally been no good way to solve the Layer 2 adjacency problem.  By Layer 2 adjacency I’m talking about same VLAN/broadcast domain, i.e. MAC-based forwarding.  Solutions have existed and still exist to provide this adjacency across MAN and WAN boundaries (EoMPLS and VPLS), but they are typically complex and difficult to manage at scale.  Additionally, these protocols tend to be cumbersome due to L2 flooding behaviors.
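The physics side is easy to quantify.  Here is a minimal sketch, assuming light in fiber travels at roughly two-thirds of c (about 200 km per millisecond) and ignoring serialization, queuing, and equipment delay:

```python
# Propagation delay only: real latency adds serialization, queuing,
# and per-hop equipment delay on top of this floor.
KM_PER_MS_IN_FIBER = 200.0  # assumed: ~2/3 the speed of light in a vacuum

def round_trip_ms(distance_km):
    return 2 * distance_km / KM_PER_MS_IN_FIBER

for km in (50, 500, 5000):
    print(f"{km:>5} km: ~{round_trip_ms(km):4.1f} ms RTT minimum")
# 50 km costs ~0.5 ms RTT; 5,000 km costs ~50 ms, no matter what you spend.
```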

Up next is Cisco with Overlay Transport Virtualization (OTV.)  OTV is a Layer 2 extension technology that utilizes MAC routing to extend Layer 2 boundaries between physically separate data centers.  OTV offers both simplicity and efficiency as an L2 extension technology, by bringing routing behavior to Layer 2 and negating the flooding of unknown unicast.  With OTV in place, a VLAN can safely span a MAN or WAN boundary, providing Layer 2 adjacency to hosts in separate data centers.  This leaves us with one last problem to solve.
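As a rough mental model (a toy sketch, not Cisco’s actual implementation), an OTV edge device treats MAC reachability like a routing table populated by a control protocol, so an unknown destination is dropped at the edge rather than flooded across the WAN:

```python
# Toy model of MAC routing at an overlay edge device.
# Remote-MAC entries come from control-plane advertisements,
# not from data-plane flooding and learning.
mac_table = {
    "00:50:56:aa:00:01": ("local",   "Eth1/1"),
    "00:50:56:bb:00:02": ("overlay", "DC-2"),  # advertised by the remote edge
}

def forward(dst_mac):
    entry = mac_table.get(dst_mac)
    if entry is None:
        # Classic L2 floods unknown unicast out every port, WAN included;
        # a MAC-routing edge simply declines to flood the overlay.
        return f"{dst_mac}: drop (no unknown-unicast flooding into the overlay)"
    scope, next_hop = entry
    return f"{dst_mac}: forward ({scope}) via {next_hop}"

print(forward("00:50:56:bb:00:02"))
print(forward("00:50:56:cc:00:03"))
```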

The last step in putting the plumbing together for long-distance vMotion is shared storage.  In order for the magic of vMotion to work, both the server the Virtual Machine (VM) is currently running on and the server the VM will be moved to must see the same disk.  Regardless of protocol or disk type, both servers need to see the files that comprise the VM.  This can be accomplished in many ways depending on the storage protocol you’re using, but traditionally you end up with one of the following two scenarios:

[Diagram: servers in both data centers accessing a single shared array across the MAN/WAN boundary]

In the diagram above we see that both servers can access the same disk, but the server in DC 2 must access the disk across the MAN or WAN boundary, increasing latency and decreasing performance.  The second option is:

[Diagram: storage replication between arrays, with each copy active/primary (P) in only one data center]

In the next diagram, shown above, we see storage replication at work.  At first glance it looks like this would solve our problem, as the data would be available in both data centers; however, this is not the case.  With existing replication technologies the data is only active, or primary, in one location, meaning it can only be read from and written to on a single array.  The replicated copy is available only for failover scenarios.  This is depicted by the P in the diagram.  While each controller/array may own active disk as shown, it’s only accessible on a single side at a single time; that is, until VPLEX.

EMC’s VPLEX provides the ability to have active/active, read/write copies of the same data in two places at the same time.  This solves our problem of having to cross the MAN/WAN boundary for every disk operation.  Using VPLEX, the virtual machine data can be accessed locally within each data center.

[Diagram: OTV providing the Layer 2 extension and VPLEX providing active/active storage across both data centers]

Putting both pieces together we have the infrastructure necessary to perform a Long Distance vMotion as shown above.

Summary:

OTV and VPLEX provide an excellent and unique infrastructure for enabling long-distance vMotion.  They are the best available ‘plumbing’ for use with VMware for disaster avoidance.  I use the term plumbing because they are just part of the picture: the pipes.  Many other factors come into play, such as rerouting incoming traffic, backup, and disaster recovery.  When properly designed and implemented for the correct use cases, OTV and VPLEX provide a powerful tool for increasing the productivity of active/active data center designs.

World Wide Technology’s Upcoming Geek Day

Coming up very quickly is World Wide Technology’s (www.wwt.com) annual Geek Day, March 10th, 2011 (http://www.wwt.com/geekday/.)  I’m very much looking forward to the event for two reasons:

  1. It’s free to customers
  2. It’s totally focused on geeks interacting with geeks.

The event is focused around live, interactive demos from sponsor technology companies, with breakout sessions chosen by the attendees via online voting.  My favorite parts are that the sponsors aren’t allowed to do lead collecting (the badge scanning you know from conferences), gimmicky swag giveaways, or booths stocked with gobs of marketing fluff.  Its true focus is the demo and engineer-to-engineer discussion.  See the link above for more information, and the video below for some customer feedback on the events.  I hope to see you here in St. Louis in March!

Cisco Unified Computing System (UCS) High-Level Overview

I’ve been looking for tools to supplement PowerPoint, the whiteboard, etc., and Brian Gracely (@bgracely) suggested I try Prezi (www.prezi.com.)  Prezi is a very slick tool for non-slide-based presentations.  I don’t think it will replace slides or the whiteboard for me, but it’s a great supplement, and it has a fairly quick learning curve if you watch the short tutorials.  Additionally, it works quite well for mind mapping: I just throw all of my thoughts on the canvas and then start tying them together, whereas slides are very linear and take more planning.  My favorite feature of Prezi is the ability to break out of the flow, and quickly return to it, at any time during a presentation.  I love this because real-world discussions never go the way you mapped them out in advance.  To start learning the tool I created the following high-level overview of the Cisco Unified Computing System (UCS.)  This content is fully usable and recyclable, so do with it what you want!

Disaster Recovery and the Cloud

It goes without saying that modern business relies on information technology.  As a result, it is essential that operations personnel consider the business impact of outages and plan accordingly.  As an illustration, Virgin Blue recently experienced a twenty-hour outage in its reservation system that resulted in losses of up to $20 million.  The cloud provides both considerable opportunities and significant challenges relating to disaster recovery.

In general, organizations must currently build multiple levels of redundancy into their systems to reach high-availability targets and to protect themselves from catastrophic outages during a natural or man-made disaster.  A disaster recovery strategy requires that data and critical application infrastructure be duplicated at a separate location, away from the primary datacenter.  Cutting over to a disaster recovery site is usually not instantaneous, and redundancy is often lost while operating under the contingency plan.  For this reason, site-local redundancy mechanisms - such as highly available network systems, failover for portions of the application stack, and SAN-level redundancy - are also required to achieve availability goals.  Public clouds often further complicate disaster recovery planning, as the organization’s critical systems may now be spread across its own infrastructure and a multitude of outside vendors, each with their own data model and recovery practices.

Business requirements and application criticality should guide the approach chosen for business continuity.  Consider the concepts of RPO (Recovery Point Objective) and RTO (Recovery Time Objective). The RPO of a system is the specified amount of data that may be lost in the event of a failure, while the RTO is the amount of time it will take to bring the system back online after a failure.  In general, site-local mechanisms provide near-instantaneous RPO and RTO, while disaster recovery systems often have an RPO of several hours or days of information and an RTO measured in tens of minutes. Through increasingly sophisticated (and costly) infrastructure, these times can be reduced but not entirely eliminated.

[Figure: timeline illustrating RPO and RTO in a backup system]
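A small worked example may make the two terms concrete; the timeline below is hypothetical:

```python
from datetime import datetime

# Hypothetical failure timeline to make RPO and RTO concrete.
last_good_copy = datetime(2011, 6, 1, 2, 0)    # nightly backup completed
failure        = datetime(2011, 6, 1, 14, 30)  # primary site lost
back_online    = datetime(2011, 6, 1, 18, 0)   # contingency site serving users

data_loss_window = failure - last_good_copy  # anything written here is gone
downtime         = back_online - failure     # how long the business was down

print(f"Data loss window: {data_loss_window}")  # 12:30:00 -> needs RPO >= 12.5 h
print(f"Downtime:         {downtime}")          # 3:30:00  -> needs RTO >= 3.5 h
```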

Dedicated redundancy infrastructure, both site-local and for disaster recovery purposes, must be regularly tested.  Additionally, it is essential to ensure that the disaster recovery environment is compatible with the existing infrastructure and capable of running the critical application.  This is an area where change management procedures are important, to ensure that critical changes to the production infrastructure are made in the standby environment as well.  Otherwise, the standby environment may not be able to correctly run the application when the disaster recovery plan is activated.

The primary factor that determines RTO and RPO is the approach used to move data to the contingency site.  The easiest and lowest-cost approach is tape backup.  In this case, the RPO is the time between successive backups moved off-site (perhaps a week or more) and the RTO is the amount of time necessary to retrieve and restore the backups and activate the contingency site.  This may be a significant amount of time, especially if personnel are not readily available during the disaster scenario.  Alternatively, a hot contingency site may be maintained, and database log shipping or volume snapshotting/replication can be used to send business data to the secondary site.  These systems are costly, but readily attain an RTO of under an hour and an RPO of perhaps one day.  With substantial investment and complexity, RPO can even be reduced to the range of minutes.  However, organizations have often been surprised to find that the infrastructure doesn't work when it is called upon, often because of the complexity of the infrastructure and the difficulties involved in testing a standby site.
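The trade-offs in that paragraph can be lined up side by side; the figures below simply restate the rough ranges above and are illustrative, not vendor benchmarks:

```python
# Approximate contingency options and their trade-offs, per the text above.
approaches = [
    # (approach,                 typical RPO,        typical RTO,     cost)
    ("off-site tape",            "a week or more",   "hours to days", "$"),
    ("log shipping / snapshots", "up to a day",      "under an hour", "$$$"),
    ("synchronous replication",  "minutes or less",  "minutes",       "$$$$$"),
]
for name, rpo, rto, cost in approaches:
    print(f"{name:<26} RPO: {rpo:<16} RTO: {rto:<14} cost: {cost}")
```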

When procuring IaaS (Infrastructure as a Service) or SaaS (Software as a Service), it is essential for the organization to perform due diligence regarding what disaster recovery mechanisms the service vendor uses. The stakes are too high to trust service level agreements alone (in the case of a catastrophic failure during a disaster, will the vendor be solvent and will the compensation received be sufficient to compensate for business losses?).

Disaster Recovery as a Service, or DRaaS, is an emerging category for organizations that wish to control their own infrastructure but not maintain the disaster recovery systems themselves.  With a DRaaS offering, an IT organization does not directly build a contingency site, but instead relies on a vendor to do so on a dedicated or utility computing infrastructure.  The cloud's advantages in elasticity and cost-reduction are significant benefits in a disaster recovery scenario, and service offerings allow organizations to outsource portions of contingency planning to vendors with expertise in the area.  However, many of the complexities remain and it is essential to perform the due diligence to ensure that the contingency plan will work and provide a sufficient level of service if called upon.

Finally, there are emerging technologies that combine site-local redundancy and disaster recovery into a unified system.  For example, distributed synchronous multi-master databases allow an application to be spread across multiple locations, including cloud availability zones, with the application active and processing transactions in all of them.  A specified portion of the system can be lost without any downtime or recovery effort.  These emerging systems offer the prospect of dramatically reducing costs and minimizing the risk of contingency sites not functioning properly.
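As a sketch of why such a system can lose a 'specified portion' with no recovery step: under a majority-quorum scheme (one common approach, assumed here purely for illustration), N synchronous replicas tolerate the loss of floor((N-1)/2) of them while continuing to process transactions:

```python
# Majority quorum: a write is durable once more than half the replicas
# acknowledge it, so the system stays available while a majority survives.
def tolerated_failures(n_replicas):
    return (n_replicas - 1) // 2

for n in (3, 5, 7):
    print(f"{n} replicas across zones -> survives losing {tolerated_failures(n)}")
# 3 -> 1, 5 -> 2, 7 -> 3: the tolerable loss is chosen by sizing the cluster.
```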

About the Author

Michael Lyle (@MPLyle) is CTO and co-founder of Translattice, and is responsible for the company's strategic technical direction.  He is a recognized leader in developing new technologies and has extensive experience in datacenter operations and distributed systems.