Danny's Lab

Engineering the World

The big RAID mistake

Published on: Aug 9, 2012
Reading time: 5 minutes

You come back from lunch one day and get back onto your computer only to find it unresponsive. After some time of fiddling with cables and making sure your coworkers aren't playing a prank on you, the sinking feeling starts to settle in. Your hard disk has crashed and you've lost all your data.

Maybe this has happened to you or maybe it's been a nightmare you've been playing in your head. Either way, you've learned your lesson: redundancy. So you buy a couple extra hard drives and a brand new computer that sports "hardware RAID 1 - mirroring". And now you're safe! Or are you?

Pretty much everyone goes through this, even many IT professionals. However, it takes some experience to realize that all you've really bought yourself is a false sense of security. By doing the above, you've actually created additional risks that you didn't have before:

  • The introduction of a new "single point of failure"
  • The loss of data agility (i.e., being able to recover quickly and/or easily)
  • Hardware failures that are only minimally mitigated

Of course, RAID is by no means a backup solution. By "risk" I'm referring to the downtime incurred while recovering from your backup system (typically a lengthy process), plus any incremental data lost between backup sessions, i.e., lost work hours.

Before we continue, let's review some of the common RAID levels that people use (0, 1, 5):

RAID 0 - Striping. No redundancy is provided in this mode. This is purely for performance.

RAID 1 - Mirroring. Every drive in the set holds a complete copy of the data, so each drive you add gives you another full copy to fall back on.

RAID 5 - Striping with distributed parity. Here your data is spread across at least 3 drives, and you can tolerate at most 1 drive failure at a time. Typically this is chosen instead of RAID 1 to make better use of disk space.
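
To make the trade-off between these levels concrete, here is a minimal sketch of the capacity and fault-tolerance arithmetic. The function names are made up for illustration, and it makes the simplifying assumption that every member is treated as if it were the size of the smallest disk, which some implementations do not do for RAID 0.

```python
def usable_capacity_gb(level, disk_sizes_gb):
    """Usable capacity for RAID 0, 1 or 5.

    Simplifying assumption: every member is truncated to the size of
    the smallest disk in the set.
    """
    n = len(disk_sizes_gb)
    smallest = min(disk_sizes_gb)
    if level == 0:
        if n < 2:
            raise ValueError("RAID 0 needs at least 2 disks")
        return n * smallest          # all space used, no redundancy
    if level == 1:
        if n < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        return smallest              # every disk holds a full copy
    if level == 5:
        if n < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (n - 1) * smallest    # one disk's worth goes to parity
    raise ValueError("only levels 0, 1 and 5 are handled here")


def tolerated_failures(level, n_disks):
    """How many simultaneous drive failures the array survives."""
    return {0: 0, 1: n_disks - 1, 5: 1}[level]


if __name__ == "__main__":
    for level, disks in [(0, [1000, 1000]), (1, [1000, 1000]), (5, [1000, 1000, 1000])]:
        print(level, usable_capacity_gb(level, disks), tolerated_failures(level, len(disks)))
```

Three 1000 GB drives give 2000 GB usable in RAID 5 versus 1000 GB in a three-way mirror, which is why RAID 5 is picked when space matters more than the number of failures survived.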

The mistake people make is to follow the marketing hype and believe that picking a RAID level covers all the risk in the system, and that the hard disks are the most likely point of failure. Let me go through each point in more detail.

The Single Point of Failure

The RAID standard unfortunately says nothing about how the data is stored on the disk. This means that implementations of RAID generally lay out their data differently and are NOT compatible with one another. The problem can be worse with hardware RAID solutions, as even the same model of controller may not implement (or support) the algorithms of an earlier version (I say this from experience).

The upshot is that if you care about recovering from a RAID controller failure, you must buy an identical spare controller at the original time of purchase and keep it handy. If one fails, you must then try to get another spare and, most importantly, test it! If the replacement is not interchangeable with what you have, you'll have to buy two new controllers (one to run and one as a spare) and swap over as soon as possible.

Loss of Data Agility

By agility, I mean the ability to move your hard drives to other hardware. Let's say you have a motherboard with an integrated RAID controller. If that motherboard fails for any reason and you must replace it, you've just lost your RAID controller. If you don't have a spare IDENTICAL motherboard, then you've also quite likely lost all your data and must now recover from backups.

Even if you're using RAID 1 mirroring (the simplest of the redundant RAID levels), most hardware RAID controllers insist on formatting the hard drive in a special way. This means that if you've lost your RAID controller, you cannot simply move the hard drive into another computer and expect it to work.

Only minimally mitigated hardware failures

Most people buying drives for a RAID set buy them all together: same make, same model, and so on. Hardware RAID controller manuals will even tell you to do this, and often don't support mismatched disks at all. The problem is that what we're trying to mitigate is hardware failure, and hardware failures come from two primary sources: minor defects at the time of manufacture, which can shorten the life of the device, and the running environment (heat, stress placed on the platters, etc.).

There's little you can do about the environmental factors, because a live redundant system by its nature requires multiple drives to run in the same environment. However, by purchasing drives of the same model from the same vendor at the same time, you've all but guaranteed they'll come from the same lot, which means they'll likely share the same manufacturing defects. So once one drive fails, you can expect the other to fail quite soon as well. In fact, the likelihood of the second failure actually increases, because these are not independent variables.
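
The "not independent" point can be shown with a toy model. This is only a sketch under stated assumptions: a lot is either defective or not, both drives in a mirror come from the same lot, and the three probabilities below are invented for illustration, not measured failure rates.

```python
def same_lot_failure_odds(p_lot_bad, p_fail_bad, p_fail_good):
    """Return (marginal, conditional) failure probabilities for a
    two-drive mirror whose drives share a manufacturing lot.

    marginal:    chance that any one drive fails in some time window
    conditional: chance that the second drive fails in that window,
                 given that the first one already has
    """
    # Chance a single drive fails, averaged over good and bad lots.
    marginal = p_lot_bad * p_fail_bad + (1 - p_lot_bad) * p_fail_good
    # Chance both fail: the drives are independent only once the
    # quality of their shared lot is known.
    both = p_lot_bad * p_fail_bad ** 2 + (1 - p_lot_bad) * p_fail_good ** 2
    conditional = both / marginal
    return marginal, conditional


if __name__ == "__main__":
    # Hypothetical inputs: 5% of lots are bad, a drive from a bad lot
    # fails 30% of the time, a drive from a good lot 1% of the time.
    marginal, conditional = same_lot_failure_odds(0.05, 0.30, 0.01)
    print(f"one drive fails:               {marginal:.4f}")
    print(f"second fails, given the first: {conditional:.4f}")
```

With these made-up inputs the conditional probability comes out several times higher than the marginal one: the first failure is evidence that the lot is bad, and the surviving drive came from that same lot.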

Solution

So with the RAID situation so bleak, what can you do to ensure the safety of your data and minimal downtime? Quite simply: software RAID. Linux software RAID in particular handles all of these situations quite elegantly:

  • While you do still have a single hard drive controller, these are commodity items that are interchangeable and easily replaced. So you've removed that single point of failure.
  • Linux software RAID can be used on any computer that can run Linux. If using RAID 1, you can take a single drive out of a set, put it into another computer, and it will be recognized and readily accessible. Your data is just as agile as it was without RAID.
  • Linux software RAID can support any combination of disks. Your usable space will only be as large as your smallest disk, of course, but disk space is cheap, your data isn't, and typically the difference in capacity between vendors is negligible. My recommendation here is to buy disks from different manufacturers, or at least through different vendors, so that you'll have some hope of getting them from different lots.
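
Part of what keeps software RAID agile is that the Linux kernel reports array state as plain text in /proc/mdstat, so any machine the disks land in can be checked the same way. Below is a minimal sketch of spotting a degraded array from that text. The sample text and the function name are illustrative assumptions, and the exact layout may vary between kernel versions.

```python
import re

# Illustrative sample only; on a real system read /proc/mdstat instead.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdd1[1] sdc1[0](F)
      488254464 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return names of arrays with fewer active members than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line carries "[expected/active] [UU]", where "_"
        # marks a missing or failed member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[[U_]+\]", line)
        if status and current is not None:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded.append(current)
            current = None
    return degraded


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

In the sample, md1 has lost one of its two members and is the only array reported.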

Your primary risk factor then, as it relates to downtime, is the time it takes Linux to rebuild a RAID set after a failure: the longer this window, the more exposed you are to a second failure that causes a disruption. I don't have numbers comparing this with hardware RAID, and it depends largely on your equipment, since many RAID controllers really are a software implementation anyway. In a RAID 1 configuration, I would expect the difference to be small.
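
For a rough feel for the size of that window, a back-of-the-envelope estimate is simply disk size divided by sustained copy speed. This sketch assumes a RAID 1 rebuild copies the whole disk at a constant rate with no competing I/O, so it is a lower bound. The function name and the figures in the example are hypothetical, not measurements.

```python
def rebuild_hours(disk_size_gb, sustained_mb_per_s):
    """Lower-bound estimate of the time to re-mirror one whole disk.

    Assumes the full disk is copied at a constant rate and that
    1 GB = 1000 MB.
    """
    if sustained_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = disk_size_gb * 1000 / sustained_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical example: a 2000 GB disk copied at 100 MB/s.
    print(f"{rebuild_hours(2000, 100):.1f} hours")
```

Real rebuilds usually take longer because the array is still serving normal traffic while it resyncs.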

I can't speak for other implementations of software RAID, but I do know from experience that if data integrity and low downtime are your primary motivators, you'll find Linux Software RAID mirroring to be quite effective.
