Setting up an email alert for a Linux software RAID failure

Recently I had a drive fail in a software RAID1 array on a CentOS 5.x server. I decided to write a simple cronjob/bash script that would nag me if it detected that things weren’t running correctly. This should work on almost any other Linux distro as well.

I used vi to edit these files; you’ll have to look elsewhere if you need help with using vi.

Here is the entry for the root’s crontab (added while logged in as root using “crontab -e”):

*/15 * * * * sh /root/raidstat.sh > /dev/null 2>&1
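
Before relying on the alert, it’s worth confirming that outbound mail actually works from the server. A quick test, assuming the mail command from mailx is installed and substituting your own address for the placeholder:

echo "mail test" | mail -s "mail test from `hostname`" you@example.com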

And here are the contents of /root/raidstat.sh (on the mail line, substitute your own email address for the placeholder):

#!/bin/sh
# Count the healthy arrays - each clean RAID1 array shows up as "[UU]"
TEST=`cat /proc/mdstat | grep -o "\[UU\]" | wc -w`
if [ "$TEST" = "3" ]; then
    # RAID OK - nothing to do
    MDSTAT=`cat /proc/mdstat`
else
    # RAID NOT OK - send out email (replace the address with your own)
    MDSTAT=`cat /proc/mdstat`
    echo "$MDSTAT" | mail -s "*WARNING* - RAID FAILURE DETECTED ON `hostname`" you@example.com
fi
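
To confirm the alert works end to end, you can temporarily change the expected count of 3 to a number that can’t match (say 99), run the script by hand, and check that the warning email arrives before changing it back:

sh /root/raidstat.sh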

Put simply, this script searches for 3 instances of “[UU]” (which indicates 3 RAID1 software arrays). If it doesn’t find 3, that indicates there is a problem with one or more of my RAID arrays. The output of “cat /proc/mdstat” is then emailed out to me, so I can determine what is wrong before I even log into the server.
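
You can run the same check by hand to see the count the script is working with:

cat /proc/mdstat | grep -o "\[UU\]" | wc -w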

The cron job will repeat this email alert every 15 minutes until the issue is resolved.

Here is an example of my output for “cat /proc/mdstat”, showing the 3 arrays (with one in the process of recovering). Since only 2 instances of “[UU]” are present, I will get emails until the array is rebuilt and 3 instances are found.

Personalities : [raid1]
md0 : active raid1 xvdb1[1] xvda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 xvdb2[1] xvda2[0]
      2096384 blocks [2/2] [UU]

md2 : active raid1 xvdb5[2] xvda5[0]
      484086528 blocks [2/1] [U_]
      [=====>...............]  recovery = 26.7% (129286388/484086528) finish=90.2min speed=65553K/sec

unused devices: <none>

Software RAID5 arrays will have more U’s in the status, so you’ll have to adjust the script accordingly. If you have a mix of RAID5 and RAID1, I suggest using two copies of the script, one for each RAID level, each searching only for the specific number of [UU] or [UUUUU] instances.
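
If you’d rather not maintain two copies, a level-agnostic variation is possible: any missing device shows up as an underscore inside the status brackets of /proc/mdstat, regardless of RAID level. Here is a rough sketch along those lines (I’ve only used the RAID1 version myself, and the email address is again a placeholder):

#!/bin/sh
# Flag any array whose status brackets contain an "_" (a missing device)
DEGRADED=`grep -c "\[.*_.*\]" /proc/mdstat`
if [ "$DEGRADED" -gt 0 ]; then
    MDSTAT=`cat /proc/mdstat`
    echo "$MDSTAT" | mail -s "*WARNING* - DEGRADED RAID ARRAY ON `hostname`" you@example.com
fi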

There will always be one U for each drive present and functioning in the array. An underscore indicates that a drive is missing: [U_] means the second device is missing, while [_U] would indicate the first device is missing.

In my case, after restarting the server the 2nd drive came back up, so I could re-add it to the array and let it rebuild. A similar process would be followed after replacing a disk completely (you’ll have to search elsewhere for a full replacement scenario).

I was able to bring my missing device back with the following command, which started the rebuild process using the existing disk:

mdadm --manage /dev/md2 --add /dev/xvdb5
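
If you want to keep an eye on the rebuild, either of these will show the progress:

watch cat /proc/mdstat
mdadm --detail /dev/md2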

Hope this helps someone else out there.

Paul
