Use Linux software RAID? Cron RAID scrubbing or lose your data.

Feb 23, 2009 14:15

This post is bought to you by the fun of unnecessary wasted time and work rebuilding a server after a double-disk RAID array failure. RAID scrubbing is essential - and is finally supported by Linux's software RAID, but not used without explicit user action.



I had a disk fail in the backup server last week. No hassle - replace it, trigger a rebuild, and off I go. Unfortunately, during the rebuild another disk was flagged as faulty, rendering the array useless as it had a half-rebuilt spare and a second failed drive.

You'd think the chances of this were pretty low, but the trouble is that the second failed drive will have developed just a couple of defective sectors (a SMART check confirms this) that weren't detected because those sectors weren't being read. Until the drive was sequentially read during the rebuild, that is.

To reduce the chance of this, you can periodically verify your arrays and if bad sectors are discovered, attempt to force them to be remapped (by rewriting them from redundant data) or failing that fail the drive. Unfortunately, Linux's software RAID doesn't do this automatically.

A simple shell script like this, dropped in /etc/cron.weekly and flagged executable, will save you a LOT of hassle:

#!/bin/bash
for f in /sys/block/md? ; do
echo check > $f/md/sync_action
done

Make sure to TEST YOUR EMAIL NOTIFICATION from mdadm, too. If a drive fails and you never get notified, you're stuffed.

linux, work

Previous post Next post
Up