OTV and Vplex: Plumbing for Disaster Avoidance

High availability, disaster recovery, business continuity, etc. are all key concerns of any data center design. They all describe separate components of the big concept: ‘When something does go wrong, how do I keep doing business?’

Very public real-world disasters have taught us as an industry valuable lessons in what real business continuity requires. The concepts of off-site archives and Disaster Recovery (DR) can be at least partially attributed to the Oklahoma City bombing. Prior to that, having only local or off-site tape archives was commonly accepted: if data gets lost, I get the tape and restore it. That worked well until we saw what happens when you have all the data and no data center to restore to.

September 11th, 2001 taught us another lesson about distance. There were companies with primary data centers in one tower and the DR data center in the other. While that may seem laughable now, it wasn’t unreasonable then. There were latency and locality gains from the setup, and the idea that both world-class engineering marvels could come down was far-fetched.

With lessons learned we’re now all experts in the needs of DR, right up until the next unthinkable happens ;-). Sarcasm aside, we now have a better set of recommended practices for DR solutions to provide Business Continuity (BC). It’s commonly accepted that the minimum distance between sites should be 50 km. A 50 km separation will protect from an explosion, a power outage, and several other events, but it probably won’t protect from a major natural disaster such as an earthquake or hurricane. If those are concerns the distance increases, and you may end up with more than two data centers.

There are obviously significant costs involved in running a DR data center. Due to these costs the concept of running a ‘dark’ standby data center has gone away. If we pay for compute, storage, and network, we want to be utilizing them. Running Test/Dev systems or other non-frontline mission-critical applications is one option, but ideally both data centers could be used in an active fashion for production workloads, with the ability to fail over for disaster recovery or avoidance.

While solutions for this exist within the high-end Unix platforms and mainframes, it has been a tough nut to crack in the x86/x64 commodity server system market. The reason for this is that we’ve designed our commodity server environments as individual application silos directly tied to the operating system and underlying hardware. This makes it extremely complex to decouple the application and allow it to live resident in two physical locations, or at least migrate non-disruptively between the two.

In steps VMware and server virtualization, with the ability to decouple the operating system and application from the hardware they reside on.  With the direct hardware tie removed, applications running in operating systems on virtual hardware can be migrated live (without disruption) between physical servers; this is known as vMotion.  This application mobility puts us one step closer to active/active datacenters from a Disaster Avoidance (DA) perspective, but doesn’t come without some catches: bandwidth, latency, Layer 2 adjacency, and shared storage.

The first two challenges can be addressed between data centers using two tools: distance and money.  You can always spend more money to buy more WAN/MAN bandwidth, but you can’t beat physics, so latency is dependent on the speed of light and therefore distance.  Even with those two problems solved there has traditionally been no good way to solve the Layer 2 adjacency problem.  By Layer 2 adjacency I’m talking about the same VLAN/broadcast domain, i.e. MAC-based forwarding.  Solutions have existed and still exist to provide this adjacency across MAN and WAN boundaries (EoMPLS and VPLS), but they are typically complex and difficult to manage at scale.  Additionally, these protocols tend to be cumbersome due to L2 flooding behaviors.
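To put rough numbers on the physics, here is a minimal Python sketch of propagation delay over fiber. It assumes light travels through fiber at roughly two-thirds of its vacuum speed (about 200,000 km/s) and ignores switching, queuing, and indirect cable paths, so real-world latency will be higher. The function name and sample distances are illustrative only.

```python
# Best-case propagation delay over optical fiber.
SPEED_IN_FIBER_KM_PER_S = 200_000.0  # assumption: roughly 2/3 the speed of light in vacuum


def round_trip_ms(distance_km: float) -> float:
    """Return the best-case round-trip propagation delay in milliseconds."""
    one_way_s = distance_km / SPEED_IN_FIBER_KM_PER_S
    return 2 * one_way_s * 1000.0


if __name__ == "__main__":
    for km in (50, 100, 400):
        print(f"{km:>4} km: {round_trip_ms(km):.2f} ms round trip")
```

No amount of bandwidth changes these figures; only shortening the path does.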

Up next is Cisco with Overlay Transport Virtualization (OTV).  OTV is a Layer 2 extension technology that utilizes MAC routing to extend Layer 2 boundaries between physically separate data centers.  OTV offers both simplicity and efficiency in an L2 extension technology by pushing this routing to Layer 2 and negating the flooding behavior of unknown unicast.  With OTV in place a VLAN can safely span a MAN or WAN boundary, providing Layer 2 adjacency to hosts in separate data centers.  This leaves us with one last problem to solve.
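The difference between flooding and MAC routing can be illustrated with a toy model. This is a conceptual Python sketch, not Cisco’s implementation: the class and method names are hypothetical, and the control-plane advertisement is reduced to a dictionary update.

```python
class OverlayEdge:
    """Toy edge device that forwards across the overlay only to advertised MACs."""

    def __init__(self, site: str):
        self.site = site
        self.remote_macs: dict[str, str] = {}  # MAC address -> remote site name

    def learn_advertisement(self, mac: str, remote_site: str) -> None:
        # Stand-in for a control-plane update received from another site's edge device.
        self.remote_macs[mac] = remote_site

    def forward(self, dst_mac: str) -> str:
        # Known MAC: send to exactly one site. Unknown MAC: do not flood the WAN.
        site = self.remote_macs.get(dst_mac)
        return f"send to {site}" if site else "drop (no overlay flooding)"


if __name__ == "__main__":
    edge = OverlayEdge("DC1")
    edge.learn_advertisement("00:50:56:aa:bb:cc", "DC2")
    print(edge.forward("00:50:56:aa:bb:cc"))
    print(edge.forward("00:50:56:00:00:01"))
```

The point of the sketch is that reachability is learned before traffic is sent, so an unknown destination never turns into a broadcast across the expensive link.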

The last step in putting the plumbing together for Long Distance vMotion is shared storage.  In order for the magic of vMotion to work, both the server the Virtual Machine (VM) is currently running on, and the server the VM will be moved to must see the same disk.  Regardless of protocol or disk type both servers need to see the files that comprise a VM.  This can be accomplished in many ways dependent on the storage protocol you’re using, but traditionally what you end up with is one of the following two scenarios:

In the diagram above we see that both servers can access the same disk, but that the server in DC 2 must access the disk across the MAN or WAN boundary, increasing latency and decreasing performance.  The second option is:


In the next diagram shown above we see storage replication at work.  At first glance it looks like this would solve our problem, as the data would be available in both data centers; however, this is not the case.  With existing replication technologies the data is only active or primary in one location, meaning it can only be read from and written to on a single array.  The replicated copy is available only for failover scenarios.  This is depicted by the P in the diagram.  While each controller/array may own active disk as shown, it’s only accessible on a single side at a single time.  That is, until Vplex.

EMC’s Vplex provides the ability to have active/active read/write copies of the same data in two places at the same time.  This solves our problem of having to cross the MAN/WAN boundary for every disk operation.  Using Vplex the virtual machine data can be accessed locally within each data center.


Putting both pieces together we have the infrastructure necessary to perform a Long Distance vMotion as shown above.


OTV and Vplex provide an excellent and unique infrastructure for enabling long-distance vMotion.  They are the best available ‘plumbing’ for use with VMware for disaster avoidance.  I use the term plumbing because they are just part of the picture, the pipes.  Many other factors come into play such as rerouting incoming traffic, backup, and disaster recovery.  When properly designed and implemented for the correct use cases OTV and Vplex provide a powerful tool for increasing the productivity of Active/Active data center designs.


Redundancy in Data Storage: Part 1: RAID Levels

I recently read Joe Onisick’s piece, “Have We Taken Data Redundancy Too Far?”  I think Joe raises a good point, and this is a natural topic to dissect in detail after my previous article about cloud disaster recovery and business continuity.  I, too, am concerned by the variety of data redundancy architectures used in enterprise deployments and the duplication of redundancy on top of redundancy that often results.  In a series of articles beginning here, I will focus on architectural specifics of how data is stored, the performance implications of different storage techniques, and likely consequences to data availability and risk of data loss.

The first technology that comes to mind for most people when thinking of data redundancy is RAID, which stands for Redundant Array of Independent Disks.  There are a number of different RAID technologies, but here I will discuss just a few.  The first is mirroring, or RAID-1, which is generally employed with pairs of drives.  Each drive in a RAID-1 set contains the exact same information.  Mirroring generally provides double the random access read performance of a single disk, while providing approximately the same sequential read performance and write performance.  The resulting disk capacity is the capacity of a single drive.  In other words, half the disk capacity is sacrificed.

RAID-1, or Mirroring; courtesy Colin M. L. Burnett

A useful figure of merit for data redundancy architectures is MTTDL, or Mean Time To Data Loss, which can be calculated for a given storage technology using the underlying MTBF, Mean Time Between Failures, and MTTR, Mean Time To Repair/Restore redundancy.  All “mean time” metrics really specify an average rate over an operating lifetime; in other words, if the MTTDL of an architecture is 20 years, there is a 1/20 = approximately 5% chance in any given year of suffering data loss.  Similarly, MTBF specifies the rate of underlying failures.   MTTDL includes only failures in the storage architecture itself, and not the risk of a user or application corrupting data.
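The conversion from MTTDL to annual risk can be sketched in a few lines of Python. The simple reciprocal used above is a good approximation when MTTDL is many years; the `exact` option assumes failures arrive at a constant rate (an exponential model), which is an assumption of this sketch rather than something the metric itself guarantees.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days; leap years ignored


def annual_loss_probability(mttdl_hours: float, exact: bool = False) -> float:
    """Chance of data loss in one year for a given MTTDL in hours."""
    expected_losses_per_year = HOURS_PER_YEAR / mttdl_hours
    if exact:
        # Probability of at least one loss event under a constant-rate model.
        return 1.0 - math.exp(-expected_losses_per_year)
    # Reciprocal approximation; only meaningful when MTTDL is well over a year.
    return expected_losses_per_year


if __name__ == "__main__":
    twenty_years = 20 * HOURS_PER_YEAR
    print(f"approximate: {annual_loss_probability(twenty_years):.2%}")
    print(f"exact:       {annual_loss_probability(twenty_years, exact=True):.2%}")
```

For a 20-year MTTDL the two answers differ only slightly, which is why the reciprocal shortcut is used throughout this article.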

For a two-drive mirror set, the classical calculation is:
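As a numerical sketch, one commonly quoted form of this calculation is MTTDL = MTBF² / MTTR. That exact form is an assumption here (textbook versions for a two-drive mirror often carry an extra factor of 2 in the denominator); it is used because it reproduces the figures quoted in this article. The sample inputs are the 100,000-hour MTBF and 12-hour rebuild discussed below.

```python
HOURS_PER_YEAR = 8760


def mirror_mttdl_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Assumed classical form: data is lost only if the second drive
    fails while the first is still being repaired."""
    return mtbf_hours ** 2 / mttr_hours


if __name__ == "__main__":
    mttdl = mirror_mttdl_hours(100_000, 12)
    print(f"MTTDL: {mttdl:,.0f} hours, about {mttdl / HOURS_PER_YEAR:,.0f} years")
```

Because MTTR sits alone in the denominator, halving the repair time doubles the result.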

This is a common reason to have hot-spares in drive arrays; allowing an automatic rebuild significantly reduces MTTR, which would appear to also significantly increase MTTDL.  However…

While hard drive manufacturers claim very large MTBFs, studies such as this one have consistently found numbers closer to 100,000 hours.  If recovery/rebuilding the array takes 12 hours, the MTTDL would be very large, implying an annual risk of data loss of less than 1 in 95,000.  Things don’t work this well in the real world, for two primary reasons:

  • The optimistic assumption that the risk of drive failure for two drives in an array is uncorrelated.  Because disks in an array were likely sourced at the same time and have experienced similar loading, vibration, and temperature over their working life, they are more likely to fail at the same time.  Also, some failure modes have a risk of simultaneously eliminating both disks, such as a facility fire or a hardware failure in the enclosure or disk controller operating the disks.
  • It is also assumed that the repair will successfully restore redundancy if a further drive failure doesn’t occur.  Unfortunately, a mistake may happen if personnel are involved in the rebuild.  Also, the still-functioning drive is under heavy load during recovery and may experience an increased risk of failure.  But perhaps the most important factor is that as capacities have increased, the Unrecoverable Read Error rate, or URE, has become significant.  Even without a failure of the drive mechanism, drives will permanently lose blocks of data at this specified (very low) rate, which generally varies between 1 error per 10^14 bits read for low-end SATA drives to 1 per 10^16 for enterprise drives.  Assuming that the drives in the mirror are 2 TB low-end SATA drives, and there is no risk of a rebuild failure other than by unrecoverable read errors, the rebuild failure rate is 17%.
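The rebuild failure rate from unrecoverable read errors can be estimated with a short Python sketch. It assumes read errors are independent and occur at exactly the quoted rate per bit, and the function name is illustrative. The result for a 2 TB drive lands at roughly 15% to 18% depending on whether a terabyte is counted as 10^12 or 2^40 bytes and whether the simpler bits-times-rate approximation is used, which is in line with the 17% figure above.

```python
import math


def rebuild_failure_rate(bytes_read: float, ure_per_bit: float) -> float:
    """Probability of at least one unrecoverable read error while reading bytes_read."""
    bits = bytes_read * 8
    # Same as 1 - (1 - p) ** bits, but numerically stable for a tiny p.
    return -math.expm1(bits * math.log1p(-ure_per_bit))


if __name__ == "__main__":
    for label, rate in (("low-end SATA", 1e-14), ("enterprise", 1e-16)):
        decimal_tb = rebuild_failure_rate(2 * 10**12, rate)
        binary_tb = rebuild_failure_rate(2 * 2**40, rate)
        print(f"{label}: {decimal_tb:.2%} (2 x 10^12 bytes), {binary_tb:.2%} (2 x 2^40 bytes)")
```

The hundredfold difference in error rate between the two drive classes carries straight through to the rebuild failure rate.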

RAID 1+0: Mirroring and Striping; courtesy MovGP

With the latter in mind, the MTTDL becomes:

When the rebuild failure rate is very large compared to 1/MTBF:

In this case, MTTDL is approximately 587,000 hours, or a 1 in 67 risk of losing data per year.
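These figures can be reproduced with a short Python sketch. The combined form used here, MTTDL = 1 / (MTTR/MTBF² + RFR/MTBF), is an assumption chosen because it matches the numbers quoted above; RFR is the rebuild failure rate, and the inputs are the 100,000-hour MTBF, 12-hour rebuild, and 17% rebuild failure rate from the text.

```python
HOURS_PER_YEAR = 8760


def mirror_mttdl_with_rfr(mtbf_hours: float, mttr_hours: float, rfr: float) -> float:
    """Assumed form: loss occurs through a second drive failure during repair
    or through a failed rebuild after the first drive failure."""
    double_failure_rate = mttr_hours / mtbf_hours ** 2
    failed_rebuild_rate = rfr / mtbf_hours
    return 1.0 / (double_failure_rate + failed_rebuild_rate)


if __name__ == "__main__":
    mttdl = mirror_mttdl_with_rfr(100_000, 12, 0.17)
    print(f"MTTDL: {mttdl:,.0f} hours, about 1 in {mttdl / HOURS_PER_YEAR:.0f} per year")
    print(f"Approximation MTBF / RFR: {100_000 / 0.17:,.0f} hours")
```

With these inputs the rebuild-failure term is more than a thousand times larger than the double-failure term, so shortening the rebuild window barely moves the result.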

RAID-1 can be extended to many drives with RAID-1+0, where data is striped across many mirrors.  In this case, capacity and often performance scales linearly with the number of stripes.  Unfortunately, so does failure rate.  When one moves to RAID-1+0, the MTTDL can be determined by dividing the above by the number of stripes.  A ten drive (five stripes of two-disk mirrors) RAID-1+0 set of the above drives would have a 15% chance of losing data in a year (again without considering correlation in failures.)  This is worse than the failure rate of a single drive.


RAID-5 and RAID-6; courtesy Colin M. L. Burnett

Because of the amount of storage required for redundancy in RAID-1, it is typically only used for small arrays or applications where data availability and performance are critical.  RAID levels using parity are widely used to trade off some performance for additional storage capacity.

RAID-5 stripes blocks across a number of disks in the array (minimum 3, but generally 4 or more), storing parity blocks that allow one drive to be lost without losing data.  RAID-6 works similarly (with more complicated parity math and more storage dedicated to redundancy) but allows up to two drives to be lost.  Generally, when a drive fails in a RAID-5 or RAID-6 environment, the entire array must be reread to restore redundancy (during this time, application performance usually suffers.)
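The single-parity idea can be shown with XOR in a few lines of Python. This is a minimal sketch of one stripe only: real RAID-5 rotates the parity block across drives, and RAID-6 adds a second, differently computed redundancy block that is not shown here. The block contents are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]      # one stripe across three data drives
    parity = xor_blocks(data)               # stored on a fourth drive
    survivors = [data[0], data[2], parity]  # the drive holding data[1] has failed
    rebuilt = xor_blocks(survivors)
    print(rebuilt == data[1])
```

Rebuilding the lost block requires reading the corresponding block from every surviving drive, which is why the whole array must be reread to restore redundancy.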

While SAN vendors have attempted to improve performance for parity RAID environments, significant penalties remain.  Sequential writes can be very fast, but random writes generally entail reading neighboring information to recalculate parity.  This burden can be partially eased by remapping the storage/parity locations of data using indirection.
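The random-write penalty is easier to see in code. One common approach, assumed here since vendors differ, is read-modify-write: read the old data block and the old parity block, then derive the new parity without touching the rest of the stripe. The function names and block values in this Python sketch are illustrative.

```python
def parity_of(blocks: list[bytes]) -> bytes:
    """Full-stripe parity: byte-wise XOR of every data block."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def updated_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Parity after overwriting one block, using only that block's old contents."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))


if __name__ == "__main__":
    stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
    old_parity = parity_of(stripe)
    new_block = b"\xaa\xbb"
    shortcut = updated_parity(old_parity, stripe[1], new_block)
    full = parity_of([stripe[0], new_block, stripe[2]])
    print(shortcut == full)
```

Even with this shortcut, one small logical write turns into two reads and two writes on disk, whereas a full-stripe sequential write can compute parity from data already in memory.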

For RAID-5, the MTTDL is as follows:

Again, when the RFR is large compared to 1/MTBF, the rate of double complete drive failure can be ignored:

However, here RFR is much larger as it is calculated over the entire capacity of the array.  For example, achieving an equivalent capacity to the above ten-drive RAID-1+0 set would require 6 drives with RAID-5.  The RFR here would be over 80%, yielding little benefit from redundancy, and the array would have a 63% chance of failing in a year.

Properly calculating the RAID-6 MTTDL requires either Markov chains or very long series expansions, and there is significant difference in rebuild logic between vendors.  However, it can be estimated, when RFR is relatively large, and an unrecoverable read error causes the array to entirely abandon using that disk for rebuild, as:

Evaluating an equivalent, 7-drive RAID-6 array yields an MTTDL of approximately 100,000 hours, or a 1 in 11 chance of array loss per year.
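The drive counts used in these comparisons follow from usable capacity. A minimal Python sketch, assuming identical drives, two-way mirrors for RAID-1+0, and hypothetical level names as dictionary keys:

```python
def usable_drives(level: str, drives: int) -> int:
    """Number of drives' worth of usable capacity for a given RAID level."""
    redundancy = {
        "raid10": drives // 2,  # half of the drives hold mirror copies
        "raid5": 1,             # one drive's worth of parity
        "raid6": 2,             # two drives' worth of redundancy
    }
    return drives - redundancy[level]


if __name__ == "__main__":
    for level, drives in (("raid10", 10), ("raid5", 6), ("raid6", 7)):
        print(f"{level} with {drives} drives: {usable_drives(level, drives)} drives usable")
```

All three configurations provide five drives’ worth of usable space, which is what makes the MTTDL figures comparable.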

The key things I note about RAID are:

  • The odds of data loss are improved, but not wonderful, even under favorable assumptions.
  • Achieving high MTTDL with RAID requires the use of enterprise drives (which have a lower unrecoverable error rate).
  • RAID only protects against independent failures.  Additional redundancy is needed to protect against correlated failures (a natural disaster, a cabinet or backplane failure, or significant covariance in disk failure rates.)
  • RAID only provides protection of the data written to the disk.  If the application, users, or administrators corrupt data, RAID mechanisms will happily preserve that corrupted data.  Therefore, additional redundancy mechanisms are required to protect against these scenarios.

Because of these factors, additional redundancy is required in conventional application deployments, which I will cover in subsequent articles in this series.

Images in this article created by MovGP (RAID-1+0, public domain) and Colin M. L. Burnett (all others, CC-SA) from Wikipedia.

This series is continued in Redundancy in Data Storage: Part 2: Geographical Replication.

About the Author

Michael Lyle (@MPLyle) is CTO and co-founder of Translattice, and is responsible for the company’s strategic technical direction.  He is a recognized leader in developing new technologies and has extensive experience in datacenter operations and distributed systems.


Disaster Recovery and the Cloud

It goes without saying that modern business relies on information technology.  As a result, it is essential that operations personnel consider the business impact of outages and plan accordingly.  As an illustration, Virgin Blue recently experienced a twenty-hour outage in its reservation system that resulted in losses of up to $20 million.  The cloud provides both considerable opportunities and significant challenges relating to disaster recovery.

In general, organizations must currently build multiple levels of redundancy into their systems to reach high-availability targets and to protect themselves from catastrophic outages during a natural or man-made disaster.  A disaster recovery strategy requires that data and critical application infrastructure be duplicated at a separate location, away from the primary datacenter.  Cutting over to a disaster recovery site is usually not instantaneous and redundancy is often lost during the contingency operating plan.  For this reason, site-local redundancy mechanisms (such as high availability network systems, failover for portions of the application stack, and SAN-level redundancy) are also required to achieve availability goals.  Public clouds often further complicate disaster recovery planning, as the organization’s critical systems may now be spread across their own infrastructure and a multitude of outside vendors, each with their own data model and recovery practices.

Business requirements and application criticality should guide the approach chosen for business continuity.  Consider the concepts of RPO (Recovery Point Objective) and RTO (Recovery Time Objective). The RPO of a system is the specified amount of data that may be lost in the event of a failure, while the RTO of a system is the amount of time that it will take to bring the system back online after a failure.  In general, site-local mechanisms will provide near-instantaneous RPO and RTO, while disaster recovery systems often will have an RPO of several hours or days of information, and an RTO measured in tens of minutes. Through increasingly sophisticated (and costly) infrastructures, these times can be reduced but not entirely eliminated.
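Both measures are simple differences between points on a timeline, as this Python sketch shows. The timestamps are hypothetical and the function name is illustrative; the values returned are what a given incident actually achieved, which can then be compared against the objectives.

```python
from datetime import datetime, timedelta


def achieved_recovery(last_good_copy: datetime,
                      failure: datetime,
                      service_restored: datetime) -> tuple[timedelta, timedelta]:
    """Return (data loss window, downtime) for a single incident."""
    data_loss_window = failure - last_good_copy  # compare against the RPO
    downtime = service_restored - failure        # compare against the RTO
    return data_loss_window, downtime


if __name__ == "__main__":
    loss, downtime = achieved_recovery(
        last_good_copy=datetime(2010, 10, 1, 2, 0),     # nightly backup completed
        failure=datetime(2010, 10, 1, 14, 30),          # primary site lost
        service_restored=datetime(2010, 10, 1, 15, 15)  # contingency site live
    )
    print(f"Data written in the last {loss} is lost; service was down for {downtime}")
```

Shortening the interval between copies improves the first number, while faster activation of the contingency site improves the second.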

Timeline illustrating the concepts of RPO and RTO in a backup system

Dedicated redundancy infrastructure, both site-local and for disaster recovery purposes, must be regularly tested.  Additionally, it is essential to ensure that the disaster recovery environment is compatible with the existing infrastructure and capable of running the critical application.  This is an area where change management procedures are important, to ensure that critical changes to the production infrastructure are made in the standby environment as well.  Otherwise, the standby environment may not be able to correctly run the application when the disaster recovery plan is activated.

The primary factor that determines RTO and RPO is the approach used to move data to the contingency site.  The easiest and lowest cost approach is tape backup.  In this case, the RPO is the time between successive backups moved off-site (perhaps a week or more) and the RTO is the amount of time necessary to retrieve the backups, restore the backups, and activate the contingency site.  This may be a significant amount of time, especially if personnel are not readily available during the disaster scenario.  Alternatively, a hot contingency site may be maintained, and database log-shipping or volume snapshotting/replication can be used to send business data to the secondary site.  These systems are costly, but readily attain an RTO of under an hour, and an RPO of perhaps one day.  With substantial investment and complexity, RPO can even be reduced to the range of minutes.  However, organizations have often been surprised to find that the infrastructure doesn’t work when it is called upon, often because of the complexity of the infrastructure and the difficulties involved in testing a standby site.

When procuring IaaS (Infrastructure as a Service) or SaaS (Software as a Service), it is essential for the organization to perform due diligence regarding what disaster recovery mechanisms the service vendor uses. The stakes are too high to trust service level agreements alone (in the case of a catastrophic failure during a disaster, will the vendor be solvent, and will the compensation received be sufficient to cover business losses?).

Disaster Recovery as a Service, or DRaaS, is an emerging category for organizations that wish to control their own infrastructure but not maintain the disaster recovery systems themselves.  With a DRaaS offering, an IT organization does not directly build a contingency site, but instead relies on a vendor to do so on a dedicated or utility computing infrastructure.  The cloud’s advantages in elasticity and cost-reduction are significant benefits in a disaster recovery scenario, and service offerings allow organizations to outsource portions of contingency planning to vendors with expertise in the area.  However, many of the complexities remain and it is essential to perform the due diligence to ensure that the contingency plan will work and provide a sufficient level of service if called upon.

Finally, there are emerging technologies that combine site-local redundancy and disaster recovery into a unified system.  For example, distributed synchronous multi-master databases allow an application to be spread across multiple locations, including cloud availability zones, with the application active and processing transactions in all of them.  A specified portion of the system can be lost without any downtime or recovery effort.  These emerging systems offer the prospect of dramatically reducing costs and minimizing the risk of contingency sites not functioning properly.

About the Author

Michael Lyle (@MPLyle) is CTO and co-founder of Translattice, and is responsible for the company’s strategic technical direction.  He is a recognized leader in developing new technologies and has extensive experience in datacenter operations and distributed systems.
