on drives

Jan 16, 2007 19:34

My friend Henry's post about RAID failure reminded me of something that happened at a job I had down in Southern California:

We had an old AS/400 system (picture four refrigerators side by side; the boot "disk" was a reel-to-reel tape) that kept all of our accounting, invoicing, and financial data. Everything.

I went in to run a full backup one Sunday. When a full backup is running, the whole system is inaccessible until it completes. The routine: start the first set of tapes, wait about four hours for them to fill and for the system to prompt for more, load the second set, make sure the backup is still going, then go home.

So I put in the second set of tapes, waited, everything looked good, and I got ready to go. One hand on the doorknob, the other on the light switch. The power went out. Fuck. I could have left right then and nobody would have known I was still there when the power went out.

I stayed. The UPS kicked in, but the display said it only had about 10 minutes of juice. I waited about 5; it still said 10 minutes, so I called my boss. He left the bar and came in to see what we could do.

Essentially we could do nothing but wait for either the power to come back on or the UPS to fail. We wanted to shut down the AS/400 safely, but the backup was still running and there was no way to stop it.

The boss made an executive decision and hit the switch on the AS/400, shutting it down. At the time, we didn't think about the fact that a) the UPS STILL said 10 minutes left after about 30, and b) hitting the switch was just as bad as letting the UPS fail.

At any rate, about 10 minutes later the power came back on. We waited a bit, then switched the monster back on. It booted up and all looked good. We waited about 20 more minutes to make sure everything seemed well, then we left.

I was the first one in Monday morning. I came in to check the machine... flashing amber lights and bad beeping noises. We had four drive arrays with multiple disks, some live and one spare, so if a drive failed, the array would automatically write to the spare and keep trucking, and we'd replace the bad drive on the fly later. Except this time it had started writing to the spare... and the spare failed.

We lost a total of three drives that time, but two on the same array meant the data was unrecoverable. And the way the damn thing worked, that meant reinstalling the OS from the reel-to-reel and restoring from our last full backup, which (due to the power outage) was two weeks old. We completely lost two weeks' worth of financial data, because we hadn't realized that the backup procedures we had in place were woefully inadequate.