Posts Tagged mdadm

Linux Software RAID and LVM

I wanted to create a RAID 5 array with four 1 TB drives. I only have three, but I have a 750 GB and a 320 GB drive lying around. I figured there was probably a way to combine them into a single 1 TB device that I could use alongside the others.

Using Linux’s LVM, I can create a logical volume from the two smaller drives that is as big as a 1 TB drive.

$ pvdisplay

This lists all physical volumes managed by LVM. Before the new drives show up there, we have to create physical volumes on them. I prefer to create the volumes out of partitions, although you can do it from raw drives too.

First let’s use fdisk to create “Linux LVM” partitions on the two drives.

$ fdisk /dev/sdb
Press "n" to create a new partition
Press "t" to set the partition type, and enter "8e" for Linux LVM.

When creating the partition, accepting the defaults makes it use the whole drive. I want to use the entire 750 GB drive and only part of the 320 GB drive, so that together they have the same number of blocks as one of the 1 TB drives. So I first created the “Linux RAID” partitions on the 1 TB drives to see how many cylinders fdisk listed, which turned out to be 121601. Then I created a partition the full size of the 750 GB drive (91201 cylinders) and a 121601 - 91201 = 30400 cylinder partition on the 320 GB drive.
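The cylinder arithmetic is easy to double-check in the shell. This is only a sketch using the numbers fdisk reported on my drives (255 heads, 63 sectors/track); the 512-byte sector size is an assumption, and the variable names are made up for illustration.

```shell
# Cylinder counts taken from the fdisk output in this post.
target_cyl=121601   # size of the "Linux RAID" partition on a 1 TB drive
big_cyl=91201       # the whole 750 GB drive

# Cylinders still needed from the 320 GB drive.
needed_cyl=$((target_cyl - big_cyl))

# Bytes per cylinder with 255 heads, 63 sectors/track,
# assuming 512-byte sectors.
cyl_bytes=$((255 * 63 * 512))

echo "need $needed_cyl cylinders from the 320 GB drive"
echo "target size: $((target_cyl * cyl_bytes)) bytes"
```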

$ fdisk -l /dev/sdb
Disk /dev/sdb: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       30400   244187968+  8e  Linux LVM
/dev/sdb2           30401       38913    68380672+  83  Linux
 
$ fdisk -l /dev/sdj
Disk /dev/sdj: 750.1 GB, 750156374016 bytes
255 heads, 63 sectors/track, 91201 cylinders
   Device Boot      Start         End      Blocks   Id  System
/dev/sdj1               1       91201   732572001   8e  Linux LVM

Now that I have the two partitions ready, I moved on to LVM setup.

$ pvcreate /dev/sdb1
$ pvcreate /dev/sdj1

This sets up the two partitions as physical volumes for LVM.

Next is creating a volume group:

$ vgcreate vg_tb /dev/sdb1 /dev/sdj1

This creates a volume group called “vg_tb” using the two physical volumes sdb1 and sdj1.

Let’s take a look at what we have so far:

$ pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdb1
  VG Name               vg_tb
  PV Size               232.88 GB / not usable 832.50 KB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              59616
  Free PE               59616
  Allocated PE          0
  PV UUID               70hPKX-n11U-RcB6-0Kyt-1SOP-ni7E-2Y9hcE

  --- Physical volume ---
  PV Name               /dev/sdj1
  VG Name               vg_tb
  PV Size               698.64 GB / not usable 2.34 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              178850
  Free PE               178850
  Allocated PE          0
  PV UUID               PzFb9b-lapG-KdT3-78nh-Gq75-F0Lo-I3xCrl

  --- Physical volume ---
  PV Name               /dev/sda2
  VG Name               VolGroup00
  PV Size               111.60 GB / not usable 2.86 MB
  Allocatable           yes
  PE Size (KByte)       32768
  Total PE              3571
  Free PE               1
  Allocated PE          3570
  PV UUID               1wM65Z-3QGd-vDiq-mq1R-YhEE-Ackp-hh3g13

$ vgdisplay
  --- Volume group ---
  VG Name               vg_tb
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               931.51 GB
  PE Size               4.00 MB
  Total PE              238466
  Alloc PE / Size       0 / 0
  Free  PE / Size       238466 / 931.51 GB
  VG UUID               olg9GP-x1sC-sFAD-TgWY-KIIx-YWNt-kL763n

  --- Volume group ---
  VG Name               VolGroup00
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  3
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               111.59 GB
  PE Size               32.00 MB
  Total PE              3571
  Alloc PE / Size       3570 / 111.56 GB
  Free  PE / Size       1 / 32.00 MB
  VG UUID               IZ25LV-oMOG-DwKK-QuFN-bqqp-ClUe-d71k5l

So far so good. The next step is to create a logical volume in the new volume group:

$ lvcreate vg_tb -n onetb -l 100%VG
  Logical volume "onetb" created

This creates a new logical volume called “onetb” in the “vg_tb” group using 100% of the group’s available space. Now let’s take a look at the list of logical volumes:

$ lvdisplay
  --- Logical volume ---
  LV Name                /dev/vg_tb/onetb
  VG Name                vg_tb
  LV UUID                sQKaq9-D6Mv-p8it-vWGW-O7DX-FGmC-cl5FSh
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                931.51 GB
  Current LE             238466
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

  --- Logical volume ---
  LV Name                /dev/VolGroup00/LogVol00
  VG Name                VolGroup00
  LV UUID                G5hlbb-tA3S-qhTS-03us-f9dl-1Vxy-9vDSU5
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                109.62 GB
  Current LE             3508
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0

  --- Logical volume ---
  LV Name                /dev/VolGroup00/LogVol01
  VG Name                VolGroup00
  LV UUID                7NZrY9-1wSJ-4fRp-VPnM-V07u-9rj8-vmtxTx
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                1.94 GB
  Current LE             62
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:1

You can ignore the VolGroup00 entries; those are the volumes Fedora created automatically when I installed it.

At this point, I have a new device at /dev/vg_tb/onetb which is essentially the same size as my 1 TB drives, and I can use it exactly as I would use the 1 TB partition at /dev/sdf1.
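As a sanity check on the size, the extent counts from pvdisplay can be multiplied out by hand. This is a rough sketch with the numbers from the output above; the variable names are just for illustration.

```shell
# Values copied from the pvdisplay output above.
pe_kib=4096            # PE Size (KByte)
extents_sdb1=59616     # Total PE on /dev/sdb1
extents_sdj1=178850    # Total PE on /dev/sdj1

total_extents=$((extents_sdb1 + extents_sdj1))
lv_kib=$((total_extents * pe_kib))

echo "total extents: $total_extents"
echo "logical volume size: $lv_kib KiB"
```

The total matches the 238466 extents that vgdisplay and lvdisplay report. The result is also within 128 KiB of the per-device size mdadm prints when the array is created (976756608K), which suggests the logical volume ends up being the smallest member of the array, though that reading is my guess.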

Now it’s time to create the RAID 5 array from these four volumes.

$ mdadm -v --create /dev/md1 --chunk=128 --level=5 --raid-devices=4 /dev/sdf1 /dev/sdh1 /dev/sdi1 /dev/vg_tb/onetb
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: size set to 976756608K
mdadm: array /dev/md1 started.

The array will begin syncing, and you can watch it by running:

$ watch -n 1 "cat /proc/mdstat"
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 dm-2[4] sdi1[2] sdh1[1] sdf1[0]
      2930269824 blocks level 5, 128k chunk, algorithm 2 [4/3] [UUU_]
      [=>...................]  recovery =  5.5% (54017536/976756608) finish=593.1min speed=25925K/sec
 
md0 : active raid5 sdg1[0] sdc1[3] sde1[2] sdd1[1]
      2197715712 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
 
unused devices: <none>
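The finish estimate in the recovery line can be reproduced from the other numbers on the same line. Here is a small sketch that parses the line I pasted above; it assumes the two numbers in parentheses are 1 KiB blocks done and total, and it works on a hard-coded sample rather than reading /proc/mdstat directly.

```shell
# Sample recovery line copied from the /proc/mdstat output above.
line='      [=>...................]  recovery =  5.5% (54017536/976756608) finish=593.1min speed=25925K/sec'

# Pull out blocks done, total blocks, and the speed in K/sec.
done_k=$(printf '%s\n' "$line" | sed -n 's/.*(\([0-9]*\)\/[0-9]*).*/\1/p')
total_k=$(printf '%s\n' "$line" | sed -n 's/.*([0-9]*\/\([0-9]*\)).*/\1/p')
speed_k=$(printf '%s\n' "$line" | sed -n 's/.*speed=\([0-9]*\)K\/sec.*/\1/p')

# Remaining blocks divided by speed gives seconds; convert to whole minutes.
remaining_min=$(( (total_k - done_k) / speed_k / 60 ))
echo "about $remaining_min minutes left"
```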

While this is syncing, we can create the ext3 filesystem.

$ mke2fs -j -b 4096 -m 0 -E stride=32,stripe-width=96 /dev/md1

This creates an ext2 filesystem with journaling (ext3), with a 4 KB block size and 0% of the blocks reserved for the superuser. The stride is calculated as the RAID chunk size divided by the filesystem block size (128k / 4k = 32). The stripe width is calculated as the stride times the number of data disks in the array (32 x 3 = 96). A 4-disk RAID 5 array has three disks' worth of data; the fourth disk's worth of space holds parity, which is spread across all the drives.
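If you use a different chunk size or number of drives, the same calculation can be scripted. This is a minimal sketch of the arithmetic described above; the variable names are my own, and it only prints the option string rather than running mke2fs.

```shell
chunk_kb=128       # value passed to mdadm --chunk
block_kb=4         # filesystem block size (-b 4096)
raid_devices=4     # value passed to mdadm --raid-devices

# RAID 5 gives up one disk's worth of space to parity.
data_disks=$((raid_devices - 1))

stride=$((chunk_kb / block_kb))
stripe_width=$((stride * data_disks))

echo "-E stride=$stride,stripe-width=$stripe_width"
```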

This will take some time, and will significantly slow down the sync process. Mine dropped from 25 MB/s to around 1 MB/s. I figure I’ll let it create the filesystem so I can start copying data to it right away, and let it finish its sync on its own time.

While you’re at it, you should set up munin to monitor the SMART data from all the drives as well as the status of the array using my munin-raid-monitor plugin.


Another raid5 scare, and how to recover an apparently trashed array

This morning after waking up to lots of thunder and lightning, I got a text message saying my raid5 array had failed. Only this time, 2 of the 3 drives were missing. Since both of those drives were actually mounted via a vblade share (on a different physical machine), I assumed that the other server had freaked out during a power surge. I quickly rebooted the machine to bring back the vblade shares, but then the trouble started.

At some point, the array was “started” but had two faulty drives. I tried --remove and --add to remove and re-add the “faulty” drives. This had the effect of bringing the array back “online” with all the drives as spares. I removed the drives again, and tried the trick I used last time:

mdadm --assemble -f /dev/md0 /dev/sda2 /dev/etherd/e4.1 /dev/etherd/e4.2

However, this also didn’t work. It showed the array with /dev/sda2 and /dev/etherd/e4.2 as spares, and e4.1 was nowhere to be seen. At this point I was a little more than worried that I had done something to trash the array. That’s when a google search led me to this handy command:

mdadm -E /dev/sda2

This prints out the superblock information that is present on the hard drive. This told me that the e4.2 drive had not been damaged, since I was able to see information there. Also, the UUIDs on all three drives still matched. However, the bottom section of the report differed on all the drives.
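In hindsight, it would have been worth saving that superblock output for every drive before trying anything else. Something along these lines would do it. Treat it as a sketch: the function name, the output directory, and the MDADM override variable (handy for a dry run) are all things I made up for illustration.

```shell
# Save the output of "mdadm -E" for each device into its own file.
# MDADM is a hypothetical override so the function can be dry-run
# with a stand-in command; by default it calls the real mdadm.
save_superblocks() {
  outdir=$1
  shift
  mkdir -p "$outdir"
  for dev in "$@"; do
    # Turn /dev/etherd/e4.1 into _dev_etherd_e4.1 for the file name.
    name=$(printf '%s' "$dev" | tr '/' '_')
    ${MDADM:-mdadm} -E "$dev" > "$outdir/$name.txt"
  done
}

# Usage (as root), with the devices from this post:
#   save_superblocks /root/md0-superblocks /dev/sda2 /dev/etherd/e4.1 /dev/etherd/e4.2
```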

A few google searches later, and I came across this:

mdadm --create --assume-clean --level=5 --raid-devices=3 /dev/md0 /dev/sda2 /dev/etherd/e4.2 /dev/etherd/e4.1

The --assume-clean flag tells mdadm to skip the initial resync, so it does not rewrite the data on the drives. It does still write new superblocks, though, and what I didn’t realize was that this would reset the UUIDs. That command brought the array back online, at least according to /proc/mdstat, but when I tried to mount it, it couldn’t figure out the filesystem.

That’s when I realized that the *order* in which you specify the drives to the --create command actually matters. I re-ran the command like this:

mdadm --create --assume-clean --level=5 --raid-devices=3 /dev/md0 /dev/sda2 /dev/etherd/e4.1 /dev/etherd/e4.2

The array came back online, and I was able to mount it!
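To see why the order matters, it helps to look at how RAID 5 lays out data and parity chunks across the drives. The sketch below prints the first few stripes for a 3-drive array, assuming the left-symmetric layout (mdadm's default for RAID 5, as far as I know; I never checked which layout this array used). The function name is made up for illustration.

```shell
# Print which chunk each drive slot holds for one stripe of a RAID 5
# array, assuming the left-symmetric layout.
# "P" is parity, "D0", "D1", ... are the data chunks of that stripe.
raid5_row() {
  stripe=$1
  n=$2
  # Parity starts on the last slot and moves one slot left per stripe.
  pd=$(( (n - 1) - stripe % n ))
  row=""
  disk=0
  while [ "$disk" -lt "$n" ]; do
    if [ "$disk" -eq "$pd" ]; then
      cell="P"
    else
      # Data chunks start on the slot after parity and wrap around.
      cell="D$(( (disk - pd - 1 + n) % n ))"
    fi
    row="$row${row:+ }$cell"
    disk=$((disk + 1))
  done
  printf '%s\n' "$row"
}

for s in 0 1 2; do
  printf 'stripe %s: %s\n' "$s" "$(raid5_row "$s" 3)"
done
```

Each column is a slot in the array, and the slot a drive gets comes from its position on the --create command line. Swap two drives and every stripe is read with chunks in the wrong places, which would explain why the filesystem was unrecognizable the first time.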

So while RAID 5 protects against a single hard drive failing, it does not protect against me running stupid commands on the array. I’m going to have to start backing up my raid arrays onto other drives…
