The big RAID mistake

You come back from lunch one day and sit down at your computer, only to find it unresponsive.  After some time spent fiddling with cables and making sure your coworkers aren’t playing a prank on you, a sinking feeling settles in.  Your hard disk has crashed and you’ve lost all your data.

Maybe this has happened to you, or maybe it’s a nightmare you’ve been replaying in your head.  Either way, you’ve learned your lesson: redundancy.  So you buy a couple of extra hard drives and a brand new computer that sports “hardware RAID 1 – mirroring”.  And now you’re safe!  Or are you?

Pretty much everyone goes through this, even many IT professionals.  However, it takes some experience to realize that all you’ve really bought yourself is a false sense of security.  By doing the above, you’ve actually created additional risk that you didn’t have before:

  • The introduction of a new “single point of failure”
  • The loss of data agility (i.e. the ability to recover quickly and/or easily)
  • Only minimally mitigated hardware failures

Of course, RAID is by no means a backup solution, and by “risk” I’m referring to the downtime incurred recovering from your backup system (typically a lengthy process), in addition to any incremental data you may lose between backup sessions; in other words, work hours lost.

Before we continue, let’s review some of the common RAID levels that people use (0, 1, 5):

RAID 0 – Striping.  No redundancy is provided in this mode.  This is purely for performance.

RAID 1 – Mirroring.  You have complete redundancy for every drive added to the system.

RAID 5 – Striping with distributed parity.  Here your data is spread across at least 3 drives, and you can tolerate at most 1 drive failure at a time.  Typically this is chosen instead of RAID 1 to make better use of disk space.
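Under Linux, these levels map directly onto mdadm commands.  As a sketch (the device names /dev/sdb through /dev/sdd are placeholders for your own disks, and every command here requires root and destroys any existing data on the member disks):

```shell
# RAID 0 -- striping across two disks, no redundancy, purely for performance:
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc

# RAID 1 -- mirroring: every byte is duplicated on both disks:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

# RAID 5 -- striping with distributed parity across at least three disks,
# tolerating one drive failure at a time:
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
```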

The mistake people make is to follow all the marketing hype and believe that these RAID levels address the only risk in the system, i.e. that the hard disks are the greatest point of failure.  Let me go through each point in more detail.

The Single Point of Failure

The RAID standard unfortunately says nothing about how the data is laid out on disk.  What this means is that every implementation of RAID is done differently and will NOT be compatible with the others.  The problem can get worse with hardware RAID solutions, as even the same model of controller may not implement (or support) the algorithms from a previous revision (I say this from experience).

The upshot of this is that if you care about recovering from a failure of your RAID controller, you must buy an identical spare RAID controller at the original time of purchase and keep it handy.  If one fails, you must then acquire another spare and, most importantly, test it!  If it is not interchangeable, you’ll have to buy two new RAID controllers (a new primary and a new spare) and swap them in as soon as possible.

Loss of Data Agility

By agility, I mean the ability to move your hard drives to other hardware.  Let’s say you have a motherboard with an integrated RAID controller.  If that motherboard fails for any reason and you must replace it, you’ve just lost your RAID controller.  If you don’t have a spare IDENTICAL motherboard, then you’ve also quite likely lost all your data and must now recover from backups.

Even if you’re using RAID 1 mirroring (the simplest of the redundant RAID levels), most hardware RAID controllers insist on formatting the hard drive in a proprietary way.  This means that if you’ve lost your RAID controller, you cannot simply take the hard drive, place it in another computer, and expect it to work.

Only minimally mitigated hardware failures

Most people, when buying drives for a RAID set, will typically buy all their hard drives together: same make, same model, etc.  The hardware RAID controller manuals will even tell you to do this, often not even supporting divergent disks.  The problem here is that we’re attempting to mitigate a hardware failure.  Hardware failures come from two primary sources: minor defects at time of manufacture that affect the longevity of the device, and the running environment (heat, stress placed on the platters, etc.).  There’s little you can do about the environmental factors, because a live redundant system by its nature requires that multiple drives run in the same environment.  However, by purchasing drives of the same vendor and model at the same time, you’ve effectively guaranteed they’ll come from the same lot, which means they’ll likely share the same manufacturing defects.  So once one drive fails, you can expect the others to fail quite soon as well (in fact, the likelihood of failure actually increases; these are not independent variables).

Solution

So with the RAID situation so bleak, what can you do to ensure the safety of your data and minimal down time?  Quite simply: software RAID.  Linux software RAID in particular handles all of these situations quite elegantly:

  • While you do still have a single hard drive controller, these are commodity items that are interchangeable and easily replaced.  So you’ve removed that single point of failure.
  • Linux software RAID can be used on any computer that can run Linux.  If you’re using RAID 1, you can take a single drive out of a set, put it into another computer, and it will be readily recognized and accessible.  Your data is just as agile as it was without RAID.
  • Linux software RAID can support any combination of disks.  Your redundancy will only be as large as your smallest disk, of course, but disk space is cheap, your data isn’t, and the difference in capacity between vendors is typically negligible.  My recommendation here is to buy disks from different manufacturers, or at least through different vendors, so that you’ll have some hope of getting them from different lots.
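To illustrate the agility point: recovering a RAID 1 member on any other Linux machine looks roughly like this with mdadm (the device and mount point names are hypothetical, and the commands require root):

```shell
# Inspect the md superblock on the transplanted disk to confirm it is
# a RAID member and see which array it belonged to:
mdadm --examine /dev/sdb1

# Assemble and start the mirror in degraded mode from its single member:
mdadm --assemble --run /dev/md0 /dev/sdb1

# The filesystem is now mountable as usual:
mount /dev/md0 /mnt
```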

Your primary risk factor then, as it relates to downtime, is the amount of time it takes Linux to rebuild a RAID set after a failure; the longer this window, the more exposed you are to a second failure resulting in disruption.  I don’t have numbers comparing this to hardware controllers, and it’s largely dependent on your equipment, but since many RAID controllers really are software implementations anyway, in a RAID 1 configuration I would expect the difference to be small.
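You can watch that exposure window directly.  A few commands I’d reach for (the array name /dev/md0 is a placeholder):

```shell
# Overall state of all md arrays, including resync/rebuild progress:
cat /proc/mdstat

# Detailed status of a single array (state, failed/spare devices,
# rebuild percentage):
mdadm --detail /dev/md0

# The kernel throttles rebuild speed to protect foreground I/O;
# raising the minimum (in KB/s per device) shortens the window
# at the cost of slower regular I/O during the rebuild:
echo 50000 > /proc/sys/dev/raid/speed_limit_min
```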

I can’t speak for other implementations of software RAID, but I do know from experience that if data integrity and low downtime are your primary motivators, you’ll find Linux Software RAID mirroring to be quite effective.


2 thoughts on “The big RAID mistake”

  1. Hi,
    Good article, I was looking for disk mirroring ‘without raid’, when I stumbled across it.

    However, my search was because of something I think you fail to mention: the appalling support for Linux software RAID on boot.  There is so much ‘black magic’ around getting it working, often with conflicting or outdated instructions.

    After much struggling with Ubuntu 10.04 I implemented RAID 1 for my boot device.  Then, during the upgrade to 12.04 a couple of years ago, I struggled to make the upgrade boot.  Now I’ve just updated to 14.04 and I cannot make the new array boot.  I can build a new array OK, but then of course I’ve lost all my settings and installed software, and all the detritus of years of developing on the same machine.

    The real problem is that as a developer, my ‘data’ is not a bunch of documents or spreadsheets that I could keep on a separate raid array while booting from a non-raid disk. MY data is the operating system and the applications and drivers I’ve tweaked. I need THOSE to be on a raid-like system so I don’t lose the environment when I lose a disk.

    All the well-stated points you make about single points of failure apply, IMHO, to Linux RAID also – it’s just that the failure moves from hardware to lack of documentation and support.

    My point is this: I’m wondering if the time (48 hours and counting) that I’m losing trying to upgrade my linux raid array is more or less than the time I would lose if I didn’t bother with raid and used a ‘slow’ backup system…

  2. Yes, RAID on boot is a little weird and complex. I personally try to keep my boot partition as simple as possible. I don’t know what kind of hardware you use, but I’ve found little need for driver tweaks or other types of customization for Linux in a long, long time. I have, however, used configuration management tools in the past to help keep track of system changes as well as installed packages. cfengine 2 was the best one for me. I’ve looked into things like puppet and chef but I find them rather deficient. However, the advantage with those types of tools is that you can then place all those changes in a repository, which can be RAIDed. Due to the complexity of most configuration management tools and my lack of need of such complexity, lately I’m just using etckeeper (https://joeyh.name/code/etckeeper/).

    Sadly, Linux desktop just stopped going in a useful direction for me about 10 years ago. I find I’m much happier now with a Mac as my front-end and just using Linux headless. But that’s a whole other story.

    In summary, my approach is:

    • Yes, keep your boot partitions as simple, non-RAIDed journaled filesystems.
    • Use configuration management tools, and keep their output in a version-controlled repository.
    • Use RAID for your home directory and any other critical data partitions.
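    For what it’s worth, the basic etckeeper workflow is quite small.  A sketch (the install command assumes a Debian-family system; requires root):

    ```shell
    # One-time setup: put /etc under version control via etckeeper
    apt-get install etckeeper
    etckeeper init
    etckeeper commit "Initial import of /etc"

    # Day to day: review and commit configuration drift
    etckeeper vcs status
    etckeeper commit "Tuned sshd_config"
    ```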
