An End User’s Cloud Security Question

I recently received an email with a question about the security of cloud computing environments.  The question comes from a knowledgeable user and boils down to ‘Isn’t my data safer on my own systems?’  I thought this would be a great question to open up to the wider community.  Does anyone have thoughts or feedback on the question from ‘Gramps’ below?

Joe, I’m not a college grad, but a 70-year-old grandfather who began programming on a Color Computer using an audio tape recorder for storage.  I wrote some corporate code for Owens Corning Fiberglas before I retired, so I’ve been around the keyboard for a while. <grin>  To make a point, notice how you’ve told me what your email address is on your blog (see the about page.)  Hackers and scammers are so efficient that you and I can’t even put our actual email out there.  Now you are in high gear with putting almost your heart and soul on servers that can be anywhere on the planet… even where there are few or no laws (enforced) governing data piracy.  Joe, I’m not trying to pick a fight, no need to, but look at WikiLeaks, etc.  I guess I could cope with using cloud software for doing my things… but can you tell me you are willing to even leave your emails or data files out there too?  Somehow, I just feel a whole lot safer having my critical stuff on my flash drive…  Talk to me, buddy…

Jim ‘Gramps’, Hillsboro, OH


Redundancy in Data Storage: Part 1: RAID Levels

I recently read Joe Onisick’s piece, “Have We Taken Data Redundancy Too Far?”  I think Joe raises a good point, and this is a natural topic to dissect in detail after my previous article about cloud disaster recovery and business continuity.  I, too, am concerned by the variety of data redundancy architectures used in enterprise deployments and the duplication of redundancy on top of redundancy that often results.  In a series of articles beginning here, I will focus on the architectural specifics of how data is stored, the performance implications of different storage techniques, and the likely consequences for data availability and the risk of data loss.

The first technology that comes to mind for most people when thinking of data redundancy is RAID, which stands for Redundant Array of Independent Disks.  There are a number of different RAID technologies, but here I will discuss just a few.  The first is mirroring, or RAID-1, which is generally employed with pairs of drives.  Each drive in a RAID-1 set contains exactly the same information.  Mirroring generally provides double the random-access read performance of a single disk, while providing approximately the same sequential read performance and write performance.  The resulting capacity is that of a single drive; in other words, half the raw disk capacity is sacrificed.

RAID-1, or Mirroring; Courtesy Colin M. L. Burnett

A useful figure of merit for data redundancy architectures is MTTDL, or Mean Time To Data Loss, which can be calculated for a given storage technology from the underlying MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair/Restore redundancy).  All “mean time” metrics really specify an average rate over an operating lifetime; in other words, if the MTTDL of an architecture is 20 years, there is roughly a 1-in-20 (about 5%) chance in any given year of suffering data loss.  Similarly, MTBF specifies the rate of underlying failures.  MTTDL includes only failures in the storage architecture itself, not the risk of a user or application corrupting data.
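As a rough back-of-the-envelope sketch of this conversion (the helper below is purely illustrative, and the function name is my own), the annual risk implied by a given MTTDL can be computed as follows:

    import math

    HOURS_PER_YEAR = 24 * 365   # 8,760 hours

    def annual_loss_probability(mttdl_hours):
        # Probability of at least one data-loss event in a year, assuming
        # failures arrive at a constant average rate of 1/MTTDL.
        rate_per_year = HOURS_PER_YEAR / mttdl_hours
        return 1 - math.exp(-rate_per_year)   # roughly rate_per_year when small

    # The example above: an MTTDL of 20 years is roughly a 5% annual risk.
    print(annual_loss_probability(20 * HOURS_PER_YEAR))   # ~0.049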

For a two-drive mirror set, the classical calculation is:
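A minimal sketch of the reasoning, assuming independent drive failures (textbook derivations differ by small constant factors, for example a factor of two for whichever drive happens to fail first): data is lost when one drive fails and its partner also fails before redundancy is restored, which gives roughly

    MTTDL ≈ MTBF^2 / MTTR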

This is a common reason to have hot-spares in drive arrays; allowing an automatic rebuild significantly reduces MTTR, which would appear to also significantly increase MTTDL.  However…

While hard drive manufacturers claim very large MTBFs, studies such as this one have consistently found numbers closer to 100,000 hours.  If recovering/rebuilding the array takes 12 hours, the calculated MTTDL is enormous (on the order of 10^8 hours), implying an annual risk of data loss of less than 1 in 95,000.  Things don’t work this well in the real world, for two primary reasons:

  • The optimistic assumption that the risk of drive failure for two drives in an array is uncorrelated.  Because disks in an array were likely sourced at the same time and have experienced similar loading, vibration, and temperature over their working life, they are more likely to fail at the same time.  Also, some failure modes have a risk of simultaneously eliminating both disks, such as a facility fire or a hardware failure in the enclosure or disk controller operating the disks.
  • It is also assumed that the repair will successfully restore redundancy as long as a further drive failure doesn’t occur.  Unfortunately, mistakes happen when personnel are involved in the rebuild.  Also, the still-functioning drive is under heavy load during recovery and may experience an increased risk of failure.  But perhaps the most important factor is that as capacities have increased, the Unrecoverable Read Error rate, or URE, has become significant.  Even without a failure of the drive mechanism, drives will permanently lose blocks of data at this specified (very low) rate, which generally ranges from 1 error per 10^14 bits read for low-end SATA drives to 1 per 10^16 for enterprise drives.  Assuming that the drives in the mirror are 2 TB low-end SATA drives, and that there is no risk of a rebuild failure other than by unrecoverable read errors, the rebuild failure rate is 17% (a back-of-the-envelope version of this arithmetic appears in the sketch below).
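The following sketch reproduces the rough arithmetic behind these figures, using the numbers quoted above; it is illustrative only, and real drives and controllers differ in the details:

    # Unrecoverable read errors during a mirror rebuild (figures from the text).
    drive_bits = 2e12 * 8          # a 2 TB low-end SATA drive, in bits
    ure_rate = 1e-14               # one unrecoverable read error per 10^14 bits read
    expected_ures = drive_bits * ure_rate
    print(expected_ures)           # ~0.16: roughly the ~17% rebuild failure rate
                                   # quoted above (the exact figure depends on
                                   # whether "2 TB" is decimal or binary)

    # The idealized mirror MTTDL from the earlier paragraph, ignoring rebuild failures.
    mtbf = 100_000                 # hours, per the studies cited above
    mttr = 12                      # hours to restore redundancy
    mttdl = mtbf**2 / mttr         # the sketch formula given earlier
    print(mttdl / 8760)            # ~95,000 years between data-loss events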
RAID 1+0: Mirroring and Striping; Courtesy MovGP

With the latter in mind, the MTTDL becomes:
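One form consistent with the figures that follow (constant factors again indicative) adds the chance that the rebuild itself fails, RFR (the rebuild failure rate), to the chance of a second complete drive failure during the repair window:

    MTTDL ≈ MTBF^2 / (MTTR + MTBF × RFR)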

When the rebuild failure rate dominates the chance of a second complete drive failure during the repair window (that is, when RFR is large compared to MTTR/MTBF), the MTTR term can be neglected:
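In the same back-of-the-envelope notation as above (constants again indicative):

    MTTDL ≈ MTBF / RFR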

In this case, MTTDL is approximately 587,000 hours, or a 1 in 67 risk of losing data per year.

RAID-1 can be extended to many drives with RAID-1+0, where data is striped across many mirrors.  In this case, capacity and often performance scale linearly with the number of stripes.  Unfortunately, so does the failure rate.  When one moves to RAID-1+0, the MTTDL can be determined by dividing the above by the number of stripes.  A ten-drive (five stripes of two-disk mirrors) RAID-1+0 set of the above drives would have a 15% chance of losing data in a year (again, without considering correlated failures).  This is worse than the failure rate of a single drive.

RAID-5 and RAID-6; Courtesy Colin M. L. Burnett

Because of the amount of storage consumed by redundancy in RAID-1, it is typically used only for small arrays or for applications where data availability and performance are critical.  RAID levels using parity are widely used to trade off some performance for additional usable capacity.

RAID-5 stripes blocks across a number of disks in the array (minimum 3, but generally 4 or more), storing parity blocks that allow one drive to be lost without losing data.  RAID-6 works similarly (with more complicated parity math and more storage dedicated to redundancy) but allows up to two drives to be lost.  Generally, when a drive fails in a RAID-5 or RAID-6 environment, the entire array must be reread to restore redundancy (during this time, application performance usually suffers).
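As a simplified illustration of how single-parity protection works (a sketch only; real controllers rotate parity across drives and operate on fixed-size stripes), the parity block is the XOR of the data blocks in a stripe, so any one missing block can be recomputed from the survivors:

    # Single-parity (RAID-5 style) reconstruction, sketched in Python.
    from functools import reduce

    stripe = [b"block-A", b"block-B", b"block-C"]   # data blocks in one stripe

    def xor_blocks(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    parity = reduce(xor_blocks, stripe)             # stored on the parity drive

    # If one drive is lost, its block is the XOR of the surviving blocks and parity.
    lost = 1
    survivors = [blk for i, blk in enumerate(stripe) if i != lost]
    recovered = reduce(xor_blocks, survivors + [parity])
    assert recovered == stripe[lost]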

While SAN vendors have attempted to improve performance for parity RAID environments, significant penalties remain.  Sequential writes can be very fast, but random writes generally entail reading neighboring information to recalculate parity.  This burden can be partially eased by remapping the storage/parity locations of data using indirection.
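To make the random-write penalty concrete (this is the standard read-modify-write argument, not any particular vendor’s implementation): updating a single data block on a parity RAID set typically requires reading the old data block and the old parity block, then writing the new data and the new parity, because

    new parity = old parity XOR old data XOR new data

so one logical random write becomes roughly four disk operations (two reads plus two writes) unless the controller can cache, coalesce, or remap writes to avoid it.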

For RAID-5, the MTTDL is as follows:
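A commonly used textbook form for an N-drive RAID-5 set, considering only complete drive failures (and with the same caveat as before that derivations differ in their constant factors), is:

    MTTDL ≈ MTBF^2 / (N × (N - 1) × MTTR)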

Again, when the RFR dominates the chance of a further complete drive failure during the rebuild window, the double-drive-failure term can be ignored:
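In the same back-of-the-envelope style as the mirror case (this simple form can differ by constant factors from more detailed treatments), the dominant term is the rate of a first drive failure multiplied by the chance that the rebuild read hits an unrecoverable error:

    MTTDL ≈ MTBF / (N × RFR)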

However, here the RFR is much larger, as it is calculated over the data that must be reread from the entire array.  For example, achieving an equivalent capacity to the above ten-drive RAID-1+0 set would require six drives with RAID-5.  The RFR here would be over 80%, yielding little benefit from the redundancy, and the array would have a 63% chance of failing in a year.

Properly calculating the RAID-6 MTTDL requires either Markov chains or very long series expansions, and rebuild logic differs significantly between vendors.  However, when RFR is relatively large and an unrecoverable read error causes the array to entirely abandon using that disk for the rebuild, it can be estimated as:

Evaluating an equivalent 7-drive RAID-6 array yields an MTTDL of approximately 100,000 hours, or a 1 in 11 chance of array loss per year.

The key things I note about RAID are:

  • The odds of data loss are improved, but not wonderful, even under favorable assumptions.
  • Achieving high MTTDL with RAID requires the use of enterprise drives (which have a lower unrecoverable error rate).
  • RAID only protects against independent failures.  Additional redundancy is needed to protect against correlated failures (a natural disaster, a cabinet or backplane failure, or significant covariance in disk failure rates).
  • RAID only provides protection of the data written to the disk.  If the application, users, or administrators corrupt data, RAID mechanisms will happily preserve that corrupted data.  Therefore, additional redundancy mechanisms are required to protect against these scenarios.

Because of these factors, additional redundancy is required in conventional application deployments, which I will cover in subsequent articles in this series.

Images in this article created by MovGP (RAID-1+0, public domain) and Colin M. L. Burnett (all others, CC-SA) from Wikipedia.

This series is continued in Redundancy in Data Storage: Part 2: Geographical Replication.

About the Author

Michael Lyle (@MPLyle) is CTO and co-founder of Translattice, and is responsible for the company’s strategic technical direction.  He is a recognized leader in developing new technologies and has extensive experience in datacenter operations and distributed systems.


Cisco Unified Computing System Value: A Customer’s Perspective

For those who do not know me, I’ve been blogging since April 2010.  My primary subject has been Cisco UCS and everything around it.  Now, when I say “everything around it”, I really mean “everything around it as it pertains to my organization”.  If you want to read a lot, you can check out my blog, http://itvirtuality.wordpress.com.  Be sure to start with the early April postings.  Here, however, I am going to summarize how we came to purchase Cisco UCS, what our business drivers were, and what my current impressions of the product are after being in production for a bit over 90 days.

Some history:  My organization exclusively used rack mounts until a friend of a C-level executive suggested we go blade.  It was a fair suggestion.  The only reason we’ve stayed rack mount so long is that the dollar savings just wasn’t there.  We own our real estate; we are the power company, etc.   All those areas where it is easy to quantify dollar savings either just don’t exist for us, or they aren’t significant.

So we started down the road of researching blades.  In the beginning, blades were scary to us.  There were so many different options to configure that it was a bit numbing.  It was at this juncture that we decided to look around and see if there was something else in the market that could simplify things for us.  It just happens that the press was still writing about Cisco UCS so naturally we homed in on it.

It wasn’t love at first sight.  We are not a Cisco shop, it was a new product, and we were a bit dubious about some of the claims.  So we did some serious research into it.  I personally read the Project California book, all the published manuals and tech notes available at that time, and more.  I did the same for HP.  Just to be sure we were doing our due diligence, we were asked to consider IBM and a few others as well.  After about four months of research, we put together a technical specifications sheet just to see what the differences were.  Then we listed out all our pain points, strategic direction, and business challenges/opportunities, and finally scored each vendor’s offerings against those items.  In almost every case, Cisco UCS came out on top.

Our first major comparison was price.  Believe it or not, based on our five-year estimates, Cisco was the most cost-effective.  Part of what made UCS the price leader was its effect on our data center: it is pretty much a cable-once type of deal.  To top it off, we did not (and will not for the foreseeable future) need to purchase any additional SAN or network ports.  By choosing any of the other blade brands, I would have needed to purchase additional ports.

Once we completed a cost analysis, we focused on our pain points.  The top two were cable management and complexity.  We (meaning my team) hate cabling and managing those cables.  We are server administrators, not cable administrators.  Moving, adding, or removing a server means at least an extra hour of work undoing nicely bundled cables in the server cabinets and data center cable trays, tracing the appropriate cable (we don’t trust labels 100%), and then cleaning up.  We also have to ensure that we have cables of various lengths in stock so we can accommodate changes fairly quickly.  Otherwise, we have to order the appropriate lengths.  With UCS, just run the cables once and be done with it.

As for complexity, with UCS one pretty much does the bulk of the management through a single interface and at one device level (fabric interconnects).  If we went with HP, we would need to manage the blade, manage the SAN switch in the chassis, manage the network switch in the chassis, etc.  If your shop is already understaffed and overworked, why add more complexity and/or work into the environment?

On to strategic initiatives… A major strategic initiative of ours requires multi-tenancy support.  We envision ourselves as being service providers to other government entities in our area.  We arguably have the best data center in our area (as far as local governments compare).  In the last two years we have upgraded our PDUs, UPS/batteries, air conditioning, and physical security systems.  While not perfect, we found the RBAC capabilities of UCS to be ahead of the others.

A second major initiative focuses on DR.  We are in the process of building out a DR site.  Once it is completed, we will be able to replicate our UCS configuration (Service Profiles and such) and data over to the DR site.  Not only will we have a DR site, but having a mirrored UCS configuration will also provide a nice dev environment.

So, now that we have been in production for a while, what do I think of UCS?  Overall, I am happy.  I haven’t had any outages related to UCS.  Performance has been better than expected, and the system gets easier to use as time goes by.  Like all new systems, you just need to get used to it.

Support has generally been excellent.  I use the word “generally” because I have noticed that sometimes a tech will give direction that will solve the issue at hand, but that direction may cause other problems.  It’s as if the technicians don’t always see the “big picture”.  Once we inform the tech that we are not so sure about the direction, a new solution will be provided that does the trick.

So what hasn’t been perfect?  Integrating into our environment was not easy.  I mentioned early on that we are not a Cisco shop.  This had ramifications in network design that we were not aware of, but we believed that the UCS product was the right way to go and redesigned part of our network to accommodate it.

We have also opened a number of tech support tickets for errors that ended up being cosmetic only.  In other words, we were not having real problems, just spurious errors being reported.  Most of these have been fixed in subsequent firmware upgrades.

A caveat to would-be buyers: I think Cisco oversells some of the capabilities.  For example, Cisco markets that one set of fabric interconnects can support up to forty chassis.  While there may be a few customers who can do this, I am not so sure that this is a supported configuration.  Go read the current release notes and you will see that only fourteen chassis are currently supported.  The forty is a future feature, and each major firmware release ups the number of chassis supported.

The same can also be said for the much-ballyhooed Palo adapter.  Cisco throws around the capability to provision up to 128 vNICs.  The real number right now is 58, and it is based on the number of cables used to connect the chassis IOM to the fabric interconnect.  If you use two cables per IOM, you are limited to 28 vNICs.  Again, the 128 is a future feature.

Given all that we now know, all that we have gone through, all that has occurred… I would not hesitate to purchase UCS again.  I think it is a great system that has clearly been designed to overcome today’s data center challenges (cabling, cooling, etc.).  It’s a system that has been designed to grow, both in features and capacity, without major modifications.  And it is a system designed for the future.


Virtualizing the PCIe bus with Aprius

One of the vendors that presented during Gestalt IT’s Tech Field Day 2010 in San Jose was Aprius (http://gestaltit.com/field-day/) (http://www.aprius.com/).  Aprius’s product virtualizes the PCIe I/O bus and pushes that PCIe traffic over 10GE to the server.  In Aprius’s model you have an Aprius appliance that houses multiple off-the-shelf PCIe cards and a proprietary Aprius initiator that resides in the server.  The concept is to be able not only to share PCIe devices among multiple servers but also to allow the use of multiple types of PCIe cards on servers with limited slots.  Additionally, there would be some implications for VMware virtualized servers, as you could potentially utilize VMware DirectPath I/O to present these cards directly to a VM.  Aprius’s main competitor is Xsigo, which provides a similar benefit using a PCIe appliance containing proprietary PCIe cards and pushing the I/O over standard 10G Ethernet or InfiniBand to the server NIC.  I look at the PCIe I/O virtualization space as very niche, with limited use cases; let’s take a look at this with reference to Aprius.

With the industry moving more and more toward x64 server virtualization using VMware, Hyper-V, and Xen, hardware compatibility lists come very much into play.  If a card is not on the list, it most likely won’t work and is definitely not supported.  Aprius skates around this issue by using a card that appears transparent to the operating system and instead presents only the I/O devices assigned to a given server via the appliance.  This means that the Aprius appliance should work with any given virtualization platform, but support will be another issue.  Until Aprius is on the Hardware Compatibility List (HCL) for a given hypervisor, I wouldn’t recommend it to my customers for virtualization.  Additionally, the biggest benefit I’d see for using Aprius in a virtualization environment would be passing VMs PCIe devices that aren’t traditionally virtualized (think fax modems and the like).  This still wouldn’t be possible with the Aprius device, because those cards aren’t on the virtualization HCL.

The next problem with these types of products is that the industry is moving to consolidate storage, network, and HPC traffic on the same wire.  This can be done with FCoE, iSCSI, NFS, CIFS, etc., or any combination you choose.  That move is minimizing the I/O card requirements in the server, and the need for specialized PCIe devices is getting smaller every day.  With fewer PCIe devices needed for any given server, what is the purpose of a PCIe aggregator?

Another use case of Aprius’s technology they shared with us was sharing a single card, for example a 10GE NIC, among several servers as a failover path rather than buying redundant cards per server.  This seems like a major stretch.  It adds an Aprius appliance as a point of failure in your redundant path, and it still requires an Aprius adapter in each server instead of the redundant NIC.

My main issue with both Aprius and Xsigo is that they both require me to put their boxes in my data path as an additional single point of failure.  You’re purchasing their appliance and their cards and using them to aggregate all of your server I/O, leaving their appliance as a single point of failure for multiple servers’ I/O requirements.  I just can’t swallow that unless I have some one-off type of need that can’t be solved any other way.

The question I neglected to ask Aprius’s CEO during the short period he joined us is whether the company was started with the intent to sell a product, or the intent to sell a company.  My thinking is that the real answer is they’re only interested in selling enough appliances to get the company as a whole noticed and purchased.  The downside of that is they don’t seem to have enough secret sauce that can’t be easily copied to be valuable as an acquisition.

The technology both Aprius and Xsigo market would really only be of use if purchased by a larger server vendor with a big R&D budget and some weight with the standards community.  It could then be used to push a PCIeoE standard to drive adoption.  Additionally, the appliances may have a play within that vendor’s blade architecture as a way of minimizing required blade components and increasing I/O flexibility, i.e., a PCIe slot blade/module that could be shared across the chassis.

Summary:

Aprius seems to be a fantastic product with a tiny little market that will continue to shrink.  This will never be a mainstream data center product, but it will fit the bill for niche issues and one-off deployments.  In their shoes, my goal would be to court the server vendors and find a buyer before the technology becomes irrelevant or copied.  Their only competition I’m aware of in this space is Xsigo, and I think Xsigo has a better shot based on its deployment model: its proprietary card in each server becomes a non-issue if a server vendor buys the company and builds the card into the system board.


Disaster Recovery and the Cloud

It goes without saying that modern business relies on information technology.  As a result, it is essential that operations personnel consider the business impact of outages and plan accordingly.  As an illustration, Virgin Blue recently experienced a twenty-hour outage in its reservation system that resulted in losses of up to $20 million.  The cloud provides both considerable opportunities and significant challenges relating to disaster recovery.

In general, organizations must currently build multiple levels of redundancy into their systems to reach high-availability targets and to protect themselves from catastrophic outages during a natural or man-made disaster.  A disaster recovery strategy requires that data and critical application infrastructure be duplicated at a separate location, away from the primary datacenter.  Cutting over to a disaster recovery site is usually not instantaneous, and redundancy is often lost while operating under the contingency plan.  For this reason, site-local redundancy mechanisms (such as high-availability network systems, failover for portions of the application stack, and SAN-level redundancy) are also required to achieve availability goals.  Public clouds often further complicate disaster recovery planning, as the organization’s critical systems may now be spread across its own infrastructure and a multitude of outside vendors, each with their own data model and recovery practices.

Business requirements and application criticality should guide the approach chosen for business continuity.  Consider the concepts of RPO (Recovery Point Objective) and RTO (Recovery Time Objective). The RPO of a system is the specified amount of data that may be lost in the event of a failure, while the RTO of a system is the amount of time that it will take to bring the system back online after a failure.  In general, site-local mechanisms will provide near-instantaneous RPO and RTO, while disaster recovery systems often will have an RPO of several hours or days of information, and an RTO measured in tens of minutes. Through increasingly sophisticated (and costly) infrastructures, these times can be reduced but not entirely eliminated.

Illustration of RTO and RPO in a backup system

Dedicated redundancy infrastructure, both site-local and for disaster recovery purposes, must be regularly tested.  Additionally, it is essential to ensure that the disaster recovery environment is compatible with the existing infrastructure and capable of running the critical application.  This is an area where change management procedures are important, to ensure that critical changes to the production infrastructure are made in the standby environment as well.  Otherwise, the standby environment may not be able to correctly run the application when the disaster recovery plan is activated.

The primary factor that determines RTO and RPO is the approach used to move data to the contingency site.  The easiest and lowest cost approach is tape backup.  In this case, the RPO is the time between successive backups moved off-site (perhaps a week or more) and the RTO is the amount of time necessary to retrieve the backups, restore the backups, and activate the contingency site.  This may be a significant amount of time, especially if personnel are not readily available during the disaster scenario.  Alternatively, a hot contingency site may be maintained, and database log-shipping or volume snapshotting/replication can be used to send business data to the secondary site.  These systems are costly, but readily attain an RTO of under an hour, and an RPO of perhaps one day.  With substantial investment and complexity, RPO can even be reduced to the range of minutes.  However, organizations have often been surprised to find that the infrastructure doesn’t work when it is called upon, often because of the complexity of the infrastructure and the difficulties involved in testing a standby site.

When procuring IaaS (Infrastructure as a Service) or SaaS (Software as a Service), it is essential for the organization to perform due diligence regarding what disaster recovery mechanisms the service vendor uses. The stakes are too high to trust service level agreements alone (in the case of a catastrophic failure during a disaster, will the vendor be solvent and will the compensation received be sufficient to compensate for business losses?).

Disaster Recovery as a Service, or DRaaS, is an emerging category for organizations that wish to control their own infrastructure but not maintain the disaster recovery systems themselves.  With a DRaaS offering, an IT organization does not directly build a contingency site, but instead relies on a vendor to do so on a dedicated or utility computing infrastructure.  The cloud’s advantages in elasticity and cost-reduction are significant benefits in a disaster recovery scenario, and service offerings allow organizations to outsource portions of contingency planning to vendors with expertise in the area.  However, many of the complexities remain and it is essential to perform the due diligence to ensure that the contingency plan will work and provide a sufficient level of service if called upon.

Finally, there are emerging technologies that combine site-local redundancy and disaster recovery into a unified system.  For example, distributed synchronous multi-master databases allow an application to be spread across multiple locations, including cloud availability zones, with the application active and processing transactions in all of them.  A specified portion of the system can be lost without any downtime or recovery effort.  These emerging systems offer the prospect of dramatically reducing costs and minimizing the risk of contingency sites not functioning properly.

About the Author

Michael Lyle (@MPLyle) is CTO and co-founder of Translattice, and is responsible for the company’s strategic technical direction.  He is a recognized leader in developing new technologies and has extensive experience in datacenter operations and distributed systems.
