When a RAID device fails, the hard drive containing the failed device must be removed from the array and replaced with a new drive. With Linux software RAID this is fairly simple using the mdadm command.
First, let's look at an existing RAID 1 setup with a pair of RAID devices configured. The computer used in this example is emstools2b.
The fdisk -l command shows your RAID devices and the disk partitions that make them up.
[root@emstools2b ~]# fdisk -l

Disk /dev/sda: 120.0 GB, 120034123776 bytes
255 heads, 63 sectors/track, 14593 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          33      265041   fd  Linux raid autodetect
/dev/sda2              34       14593   116953200   fd  Linux raid autodetect

Disk /dev/sdb: 120.0 GB, 120034123776 bytes
255 heads, 63 sectors/track, 14593 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          33      265041   fd  Linux raid autodetect
/dev/sdb2              34       14593   116953200   fd  Linux raid autodetect

Disk /dev/md1: 119.7 GB, 119759962112 bytes
2 heads, 4 sectors/track, 29238272 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md1 doesn't contain a valid partition table

Disk /dev/md0: 271 MB, 271319040 bytes
2 heads, 4 sectors/track, 66240 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md0 doesn't contain a valid partition table
In this case there are two hard drives, /dev/sda and /dev/sdb. Two RAID 1 devices are built from them: /dev/md0, which is 271 MB and used for /boot, and /dev/md1, which is 119.7 GB and serves as the LVM physical volume for the VolGroup00 volume group holding the rest of the Linux file systems. The df -h command shows how those file systems are laid out.
Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-rootVol      1.4G  292M  1.1G  22% /
/dev/mapper/VolGroup00-varVol       1.9G  250M  1.6G  14% /var
/dev/mapper/VolGroup00-usrVol       3.8G  2.1G  1.6G  57% /usr
/dev/mapper/VolGroup00-usrlocalVol  1.9G   36M  1.8G   2% /usr/local
/dev/mapper/VolGroup00-tmpVol       4.8G  138M  4.4G   4% /tmp
/dev/mapper/VolGroup00-homeVol      961M   18M  895M   2% /home
/dev/mapper/VolGroup00-optVol        48G  180M   45G   1% /opt
/dev/md0                            251M   25M  214M  11% /boot
tmpfs                               2.0G     0  2.0G   0% /dev/shm
You can use the mdadm command to view the status of a RAID device.
[root@emstools2b ~]# mdadm -D /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Sun Jan 7 08:58:58 2007
     Raid Level : raid1
     Array Size : 116953088 (111.54 GiB 119.76 GB)
    Device Size : 116953088 (111.54 GiB 119.76 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Tue Jan 8 08:59:06 2008
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 8ba0b0a5:274b0bc5:253d75af:d75ac7b5
         Events : 0.8

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
This shows the details of the device including the current status and the component devices that make up the /dev/md1 RAID array.
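For a quicker summary of every md device on the system you can also read /proc/mdstat; in a healthy RAID 1 array both members show as up, which appears as [UU], while a failed or missing member shows as an underscore (as in the notification mail further below).

cat /proc/mdstat    # [UU] = both mirrors active, [U_] = degraded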
The mdadm command can be used to simulate the failure of a RAID device. Let’s use this command to fail the /dev/sdb2 device of the /dev/md1 array.
[root@emstools2b ~]# mdadm -f /dev/md1 /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
Note that when a RAID device fails, whether it is failed manually like this or fails for real, the mdmonitor service detects the failure and sends an email to root.
This is an automatically generated mail message from mdadm
running on emstools2b.cisco.com

A Fail event had been detected on md device /dev/md1.

It could be related to component device /dev/sdb2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      264960 blocks [2/2] [UU]

md1 : active raid1 sdb2[2](F) sda2[0]
      116953088 blocks [2/1] [U_]

unused devices: <none>
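The destination for these alerts is controlled by the MAILADDR line in /etc/mdadm.conf. A minimal sketch, assuming you want the mail delivered to a monitored mailbox rather than the local root account (the address below is only an example):

# /etc/mdadm.conf (excerpt)
MAILADDR raid-alerts@example.com

After changing the file, restart the monitoring service so it picks up the new address, for example with service mdmonitor restart on a system of this vintage.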
The mdadm command now shows the degraded status of the array and indicates which device has failed.
[root@emstools2b ~]# mdadm -D /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Sun Jan 7 08:58:58 2007
     Raid Level : raid1
     Array Size : 116953088 (111.54 GiB 119.76 GB)
    Device Size : 116953088 (111.54 GiB 119.76 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Tue Jan 8 09:14:37 2008
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 8ba0b0a5:274b0bc5:253d75af:d75ac7b5
         Events : 0.24

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       0        0        1      removed

       2       8       18        -      faulty spare   /dev/sdb2
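Before pulling hardware it is worth confirming which physical disk /dev/sdb actually is. One option, assuming the smartmontools package is installed, is to read the drive's model and serial number and match them against the label on the disk itself:

smartctl -i /dev/sdb    # prints the vendor, model and serial number of the drive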
The actions you need to take to recover are:
- Remove the damaged device from the array.
- Remove any other RAID component devices located on the same physical drive as the failed device from their arrays.
- Replace the defective hard drive.
- Create the RAID partitions on the new physical hard drive.
- Create the new RAID devices.
- Add the RAID devices into the array.
See the document Configuring Software RAID 1 Arrays With Linux for details of this procedure.
Remove the device from the array using the mdadm command. Also remove any other RAID component devices on this physical hard drive from their arrays.
[root@emstools2b ~]# mdadm -r /dev/md1 /dev/sdb2
mdadm: hot removed /dev/sdb2
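The output above covers only /dev/sdb2 in /dev/md1. The other partition on the same physical drive, /dev/sdb1 in /dev/md0, must come out as well before the drive is pulled; because it has not actually failed, it first has to be marked faulty. A sketch of those commands, assuming the layout shown earlier:

mdadm -f /dev/md0 /dev/sdb1    # mark the still-working member faulty
mdadm -r /dev/md0 /dev/sdb1    # then hot remove it from the array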
At this time you can remove the defective physical hard drive, replace it with a new one, and recreate the required RAID partitions on it. Each new RAID device must be at least as large as the device it replaces in the array it will join. Then simply use the mdadm command again to add the new devices into their arrays.
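One common way to recreate the partitions is to copy the partition table from the surviving drive. This is only a sketch: it assumes /dev/sda is the healthy drive and /dev/sdb is the blank replacement, and it overwrites the partition table of the target disk, so double-check the device names first:

sfdisk -d /dev/sda | sfdisk /dev/sdb    # copy the partition layout, including the fd (Linux raid autodetect) type
partprobe /dev/sdb                      # have the kernel re-read the new partition table

With the partitions in place, adding /dev/sdb2 back into /dev/md1 looks like this: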
[root@emstools2b ~]# mdadm -a /dev/md1 /dev/sdb2
mdadm: re-added /dev/sdb2
The mdadm command can be used to monitor the progress of the rebuild. The rebuild begins as soon as the device is added to the array; no further commands are required.
[root@emstools2b ~]# mdadm -D /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Sun Jan 7 08:58:58 2007
     Raid Level : raid1
     Array Size : 116953088 (111.54 GiB 119.76 GB)
    Device Size : 116953088 (111.54 GiB 119.76 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Tue Jan 8 09:16:06 2008
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 21% complete

           UUID : 8ba0b0a5:274b0bc5:253d75af:d75ac7b5
         Events : 0.42

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       2       8       18        1      spare rebuilding   /dev/sdb2
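For a lighter-weight view of the same rebuild, /proc/mdstat reports the percentage complete and an estimated finish time, and it can be watched continuously, for example:

watch -n 5 cat /proc/mdstat    # refresh the RAID status every 5 seconds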