This is a follow-up to my previous article, Redundancy in Data Storage: Part 1: RAID Levels, where I discussed various site-local data redundancy technologies. Here, I will detail many of the choices available for providing redundancy beyond the data center, which organizations use to address disaster recovery, business continuity, and continuity of operations (COOP).
It’s obvious that site-local redundancy isn’t enough for critical applications. The threat of natural disasters is always looming, regional power outages occur, building electrical and mechanical systems fail, and backhoes seem to hate fiber optic cable. Enterprises therefore use geographic redundancy to ensure that, even when these things happen, critical applications and data remain available. At the heart of making an application geographically redundant is making sure the application’s data resides in more than one geographical location. There are a number of technology and architectural choices that can be used to achieve this geographical replication of data. These solutions are often evaluated in terms of cost, RTO (recovery time objective), and RPO (recovery point objective), as I outlined in Disaster Recovery and the Cloud.
One obvious place to build redundancy is at the storage area network level. There are a variety of technologies available to replicate SAN volumes between geographic locations. Synchronous replication tightly couples the primary and backup sites and does not acknowledge a write as successful until it completes in both locations, providing a zero RPO. However, synchronous replication requires very fast network connections and requires that the backup site be located very close to the primary location, because otherwise latency will severely reduce storage performance. To allow the sites to be farther apart, asynchronous replication can be used, where the changes are streamed to the backup site but completion of the I/O is signalled before an acknowledgement is received from that site. Finally, point-in-time replication takes periodic snapshots of the storage and sends the delta between consecutive snapshots.
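To get a feel for why distance matters so much for synchronous replication, here is a rough back-of-the-envelope sketch in Python. It assumes light travels through fiber at about 200 km per millisecond (roughly two thirds of its speed in a vacuum), ignores switching, queuing, and protocol overhead, and assumes the remote array takes as long as the local one to commit a write. The function names and figures are illustrative assumptions, not measurements of any particular product.

```python
# Assumption: signal speed in optical fiber is about 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def round_trip_ms(distance_km: float) -> float:
    """Best-case network round trip between two sites, in milliseconds."""
    return 2.0 * distance_km / FIBER_KM_PER_MS


def write_latency_ms(local_ms: float, distance_km: float, synchronous: bool) -> float:
    """Latency the application sees for one write under this simplified model.

    Synchronous: the write is not acknowledged until the backup site has it,
    so the round trip is added to the local commit time.
    Asynchronous: the write is acknowledged after the local commit only.
    """
    if synchronous:
        return local_ms + round_trip_ms(distance_km)
    return local_ms
```

Under these assumptions, a 0.5 ms local write stays at about 1 ms with the backup site 50 km away, but grows to more than 10 ms at 1,000 km when replicated synchronously, while the asynchronous figure stays at 0.5 ms regardless of distance. Real links will be slower than this best case.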
All of these SAN replication approaches are bandwidth intensive. Applications make many changes to the disk as part of their ordinary functioning, and these changes are almost certainly not encoded in a dense fashion that allows them to cross networks efficiently. An application might make small updates to the same disk block many times in short order, and every one of these changes has to be sent across the network under asynchronous or synchronous replication. Point-in-time replication lowers this overhead by a small amount (because redundant changes between snapshots are not sent) at the cost of a worse RPO.
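The effect of repeated writes to the same block can be illustrated with a toy model in Python. It treats a workload as a list of block numbers in write order, counts one transfer per write for synchronous or asynchronous replication, and counts one transfer per distinct block in each snapshot interval for point-in-time replication. The function names and the idea of measuring a snapshot interval in writes rather than in time are simplifying assumptions made for this sketch.

```python
from typing import List


def blocks_sent_continuous(writes: List[int]) -> int:
    """Synchronous or asynchronous replication: every write crosses the network."""
    return len(writes)


def blocks_sent_point_in_time(writes: List[int], writes_per_snapshot: int) -> int:
    """Point-in-time replication: each snapshot delta carries a block at most once.

    Assumption: a snapshot is taken after every `writes_per_snapshot` writes.
    """
    if writes_per_snapshot <= 0:
        raise ValueError("writes_per_snapshot must be positive")
    sent = 0
    for start in range(0, len(writes), writes_per_snapshot):
        window = writes[start:start + writes_per_snapshot]
        # Repeated writes to the same block within one interval collapse to one.
        sent += len(set(window))
    return sent
```

For the write sequence [7, 7, 7, 3, 7, 3, 9, 9], continuous replication sends 8 blocks, snapshots every 4 writes send 5, and a single snapshot over all 8 writes sends 3. A workload this small and this repetitive exaggerates the saving, and the longer interval that produces it is exactly what worsens the RPO.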
Redundancy can also be implemented through database replication. Just as in SAN replication, synchronous, asynchronous, and snapshot-based techniques are available. Many of the same tradeoffs apply, although generally database changes can be sent more efficiently across a WAN. Unfortunately, effectively using database replication to provide geographic redundancy is difficult. For one, database replication can only stand on its own if all of the critical application data resides within the database. This is often not the case. Moreover, sophisticated database deployments involving data partitioning, federation, and integration often greatly complicate replication to the point that effective configuration becomes prohibitive.
Finally, the application itself can handle data redundancy. Often the highest-end applications (for instance financial, logistics, and reservation systems) require the federation of data at the application level. This allows extreme top-end performance to be reached and also allows compliance with various types of data jurisdiction requirements (for instance, national directives requiring customer-identifiable information to remain in the country of origin). Unfortunately, building redundancy into the application this way is very difficult and error-prone.
Data redundancy is only one piece of the business continuity problem. Applications require other infrastructure to run, such as the network and application servers. Some organizations are using virtualized approaches here with some success to build geographically redundant architectures. Others rely on configuration management technologies to ensure that the disaster recovery sites remain synchronized and ready to handle workload. Another important point to consider is how to move the active instance of the application to the backup site, and also how to re-establish redundancy after a failure and move applications back to the primary. Any approach to providing geographic redundancy must be designed carefully and tested continually, because today’s complicated application architectures provide too many opportunities for mistakes in provisioning redundancy.
These replication techniques still require the site-local mechanisms like RAID discussed in part 1, because otherwise the facilities involved would be far too unreliable. They also require significant investments in network links, replication technologies, and personnel effort. In addition, for the most part, these technologies require the duplication of infrastructure for disaster recovery purposes. In my forthcoming part 3, I will discuss emerging approaches in cloud architectures that unify redundancy mechanisms and significantly simplify the effort involved in implementing resilient business systems.
About the Author
Michael Lyle (@MPLyle) is CTO and co-founder of Translattice, and is responsible for the company’s strategic technical direction. He is a recognized leader in developing distributed systems technologies and has extensive experience in datacenter and information technology operations.