Shoehorning Proxmox into mdadm RAID1

by Simon

My home server has always been somewhat of a Frankenstein’s Monster when it comes to hardware. Until about 2 years ago, it consisted of a large JBOD array of several different drive sizes with no kind of redundancy whatsoever – for both the OS and data disks. That all changed when I decided to invest in some 3TB disks and build a “proper” RAID array for my data; coincidentally, this was around the same time my storage array lost a drive, resulting in about 800GB of data loss. I was unable to source a decent hardware RAID card at the time, so I went with software RAID (mdadm) instead.

The chassis which houses my home server happens to be an old X-Case model that has 10 internal bays, a couple of 5.25″ bays for optical drives and a 5.25″ to 3.5″ caddy that seems to be designed for a floppy drive (it has a cut-out at the front for access). I use the floppy slot to house my OS disk and the rest are for my data disks, which are all in RAID6 for redundancy. This gives me a total of 11 hard drives in this chassis in its default configuration.

When I rebuilt my server to make use of Proxmox, I put in a known-working drive as the OS drive; it was an old drive at the time but didn’t have much in the way of power-on hours, so I figured it’d last me a while. Here I am a few years later and the drive is still going, but its SMART data is showing some figures that make me worried it’s going to start having problems soon, so I set about replacing it.

Step 1 – Replace the problem drive

Sure, I could have just bought a new drive and re-installed my server from scratch, but frankly I couldn’t be bothered with all the niggly applications and configurations I’d need to get right to make it all work nicely again.

Besides, where’s the fun in that?

I bought a 3.5″ to Dual 2.5″ adaptor off eBay (In-Win 3RAMZF078700) which, as you might guess, should nicely house a couple of 2.5″ drives where my old 3.5″ OS disk used to live. Having already made use of mdadm on this system, and now having two drives, I should be able to use them in RAID1 for my OS disk.

I chose to get two HGST Travelstar 7K1000 drives because a friend uses a lot of them and has (so far) had no problems with them. They also spin at 7200RPM, so I should get reasonable performance compared to more conventional 2.5″ drives which operate at 5400RPM, given that RAID1 limits me to essentially a single drive’s performance anyway.

There were a couple of things I needed to consider before embarking on this project though – most of which were potential physical constraints inside the chassis itself.

Do I have enough SATA data ports?

My SuperMicro X8DTL-3F motherboard has 6 SATA ports and 8 SAS ports. I learned the hard way that you can only boot from the SATA ports, and that the on-board LSI SAS controller does not support drives larger than 2TB. For my configuration this leaves me with just the 6 on-board SATA ports, so I invested in an LSI 9211-8i HBA (it’s actually a cross-flashed Dell H200A – http://blog.michael.kuron-germany.de/2014/11/crossflashing-dell-perc-h200-to-lsi-9211-8i/comment-page-1/).

The 9211-8i supports 8 drives, so combining the 6 SATA ports on the motherboard with the 8 on the HBA gives me a theoretical 14 ports available to use with my 2 boot drives (RAID1) and 10 data drives (RAID6).

I can only fit 11 disks in the chassis (12 if you count the 2.5″ disks instead) so I should have enough.

Do I have enough SATA power connectors?

My PSU is an EVGA 650 GQ which has 9 native SATA power ports and a further 3 molex connectors to which I can attach some 2xSATA splitters. I use one of the molex connectors for cooling though so I only get 4 SATA connectors from molex and 9 from native. This should be enough for my purposes.

Will the dual 2.5″ bracket even fit where a 3.5″ disk would normally go?

The HGST Travelstar disks come in at 9.5mm in height. The disk that sits on the bottom of the bracket has enough clearance for a 12mm drive, but the top one was a potential problem as it could make the bracket sit taller than a standard 3.5″ drive.

I fitted the 2.5″ drives and stood the bracket next to two different 3.5″ models and, as you can see, it’s about the same height. It is slightly taller than the drive on the right (WD800JD) but about the same height as the one on the left (WD10EADS).

I have open air above my drives in this chassis so height restrictions are not a problem for me; however, even if you wanted to use this bracket in a more conventional 3.5″ bay, you should not have any issues fitting two 9.5mm drives.

The placement of these drives in the “floppy” bay unfortunately means there is no airflow from the case fans, so I did have to consider the gap between the drives to allow for cooling – which, as you can see, should be ample.

Step 2 – Cloning the original

As my plan was to go from a single 3.5″ drive to two 2.5″ drives, that does raise the question of how I get the data onto the new drives. I could have skipped this step and jumped straight to Step 4, assembling the RAID array using the old drive and one of the new ones, then later replacing the old drive with the remaining new one.

Instead, I opted to use a tool called Clonezilla, which will clone the original drive onto the new one (although you could just use dd) whilst leaving the original untouched should I need to revert (or re-clone) at a later date. This was a personal choice, but I like having the safety net – should anything go wrong further down the line, at least I could start again.

First we connect one of the new HGST drives and temporarily plug in the old OS drive so that we can clone it. Next I booted Clonezilla and followed the on-screen instructions to clone the drives.

You might need to zero the destination disk first so that everything clones over nicely.

It’s very important that you make sure your destination drive is correct, because Clonezilla will overwrite anything that’s already there.
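
If you prefer the dd route (or just want to zero the destination first), something like the following does the same job – a minimal sketch, with /dev/sdOLD and /dev/sdNEW as placeholders for your actual source and destination devices:

# Optional: zero the destination drive first (this destroys anything on it!)
dd if=/dev/zero of=/dev/sdNEW bs=1M

# Clone the old OS disk onto the new drive, block for block
dd if=/dev/sdOLD of=/dev/sdNEW bs=1M conv=noerror,sync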

Once the clone is complete, you should be able to disconnect the old drive and boot from the new one. If all goes well, your system should look the same from the OS’s point of view, other than the fact that you’re now on a larger disk (in my case, I went from 160GB to 1TB).

Step 3 – Fixing the partitions on the larger disk

Before we delve too deep into this, I realize my partitions are not standard for Proxmox these days, in that I have a /boot partition outside of the main root partition (for me, /dev/sdd2 is /boot). Nonetheless, this should not affect the overall process if your partitions are different – just skip the parts that are specific to me and adjust according to your configuration.

Naturally, your mileage may vary depending on how your system is set up. Don’t just blindly paste commands into a terminal as you could cause irreversible data loss 🙂 – make sure you understand what you’re doing!

Now that we’ve cloned the old OS disk onto one of the new (larger) disks, I thought now would be a good time to fix the primary data partition (where the “pve” volume group lives) so that it takes up the entirety of the disk. I went from a 160GB to a 1TB drive, so to do this I simply loaded parted and took a look at the partitions whilst telling it to display sectors instead of the default units – it showed me a warning about the GPT size being wrong and offered to correct it for me.

root@yuzuki:~# parted /dev/sdd
GNU Parted 3.2
Using /dev/sdd
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) unit s
(parted) print free
Warning: Not all of the space available to /dev/sdd appears to be used, you can fix the GPT
to use all of the space (an extra 1640943360 blocks) or continue with the current setting? 
Fix/Ignore? Fix
Model: ATA HGST HTS721010A9 (scsi)
Disk /dev/sdd: 1953525168s
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: 

Number  Start       End          Size         File system  Name     Flags
        34s         2047s        2014s        Free Space
 1      2048s       4095s        2048s                     primary  bios_grub
 2      4096s       1048575s     1044480s     ext3         primary  boot, esp
 3      1048576s    312580095s   311531520s                primary  lvm
        312580096s  1953525134s  1640945039s  Free Space

Here we can see the current starting sector for our last partition (partition 3, aka /dev/sdd3) is 1048576 and the last usable sector of free space is 1953525134. We should simply be able to remove the partition entry (it won’t remove the data) and then re-create it with a larger size. As long as the starting position is the same, everything should line up.

(parted) rm 3
Error: Partition(s) 3 on /dev/sdd have been written, but we have been unable to inform the
kernel of the change, probably because it/they are in use.  As a result, the old partition(s)
will remain in use.  You should reboot now before making
further changes.
Ignore/Cancel? i
(parted) mkpart primary 1048576 1953525134
(parted) toggle 3 lvm
(parted) print
Model: ATA HGST HTS721010A9 (scsi)
Disk /dev/sdd: 1953525168s
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: 

Number  Start     End          Size         File system  Name     Flags
 1      2048s     4095s        2048s                     primary  bios_grub
 2      4096s     1048575s     1044480s     ext3         primary  boot, esp
 3      1048576s  1953525134s  1952476559s               primary  lvm

(parted) quit
Information: You may need to update /etc/fstab.

Finally, we need to make sure that LVM is aware of the new partition size, lest we forget about it later on.

root@yuzuki:~# pvdisplay /dev/sdd3
  --- Physical volume ---
  PV Name               /dev/sdd3
  VG Name               pve
  PV Size               148.55 GiB / not usable 3.00 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              38028
  Free PE               4095
  Allocated PE          33933
  PV UUID               6p74nl-SxZp-4xu7-j3lH-ACsg-Hs9I-5rNwpq

root@yuzuki:~# pvresize /dev/sdd3
  Physical volume "/dev/sdd3" changed
  1 physical volume(s) resized / 0 physical volume(s) not resized
  
root@yuzuki:~# pvdisplay /dev/sdd3
  --- Physical volume ---
  PV Name               /dev/sdd3
  VG Name               pve
  PV Size               931.01 GiB / not usable 312.50 KiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              238339
  Free PE               204406
  Allocated PE          33933
  PV UUID               6p74nl-SxZp-4xu7-j3lH-ACsg-Hs9I-5rNwpq

If you didn’t want to implement RAID1 then this is where you’d stop – we’ve moved our OS to a new disk without reinstalling.
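
In that case the extra space just sits unused in the pve volume group until you grow a logical volume into it – roughly along these lines, assuming the LV you want to grow carries an ext4 file system:

# Hand all remaining free extents to the root LV (pick whichever LV you want to grow)
lvextend -l +100%FREE /dev/pve/root

# Grow the ext4 file system to fill the enlarged LV (ext file systems can be grown online)
resize2fs /dev/pve/root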

I however do want to implement RAID1 so in Step 4 I’ll walk through how to prepare the second RAID member and begin the process of moving towards having a complete array.

Step 4 – Preparing for RAID

The purpose of this article is to document how I implemented software RAID1 on a live system without having to reinstall. In the previous steps, I took a little detour in that I cloned the original disk onto one of the new replacement disks. Primarily this was to preserve the original data should anything not go according to plan. The following steps are where you’d start if you wanted to skip the cloning process and work on the source disk without a “spare” copy of your data.

It’s important that you correctly identify which disks you intend to use for your array. A useful tool here is smartctl; with it you can query each disk until you find the ones you want to work with (especially useful if you have a lot of disks).

root@yuzuki:~# smartctl -i /dev/sdd
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.35-2-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Travelstar 7K1000
Device Model:     HGST HTS721010A9E630
Serial Number:    JR1000********
LU WWN Device Id: 5 000cca 8c8d62953
Firmware Version: JB0OA3U0
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Mar  6 21:16:45 2017 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
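
If you do have a lot of disks, a short loop saves querying them one at a time – a quick sketch, assuming your drives all show up as /dev/sd*:

for d in /dev/sd[a-z]; do
    echo "== $d =="
    smartctl -i "$d" | grep -E 'Device Model|Serial Number'
done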

In my case, /dev/sdd is the non-RAID disk which currently houses my OS and /dev/sde is the blank disk I want to initialize my array onto. Firstly we need to copy the current partition table from /dev/sdd to /dev/sde. As the partition table is GPT, we should use sgdisk to do this.

root@yuzuki:~# sgdisk -R=/dev/sde /dev/sdd
The operation has completed successfully.
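
Worth noting: -R copies the disk and partition GUIDs verbatim, so both disks end up with identical identifiers. If you’d rather the copy had its own, sgdisk can randomise them:

# Give /dev/sde a fresh disk GUID and fresh partition GUIDs
sgdisk -G /dev/sde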

Once the partition table has been replicated, we should change the type code for /dev/sde2 and /dev/sde3 to reflect that these partitions are members of a software RAID array. You can do this with any partition tool that supports GPT, but in my case I’ll use sgdisk to change the types.

root@yuzuki:~# sgdisk -t 2:fd00 /dev/sde
The operation has completed successfully.

root@yuzuki:~# sgdisk -t 3:fd00 /dev/sde
The operation has completed successfully.

This should nicely result in /dev/sde having the same partition structure as /dev/sdd, but with both partitions now typed as Linux RAID.

root@yuzuki:~# fdisk -l /dev/sde

Disk /dev/sde: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: A4EF094E-5546-4DC1-A262-2B08B9C586B1

Device       Start        End    Sectors  Size Type
/dev/sde1     2048       4095       2048    1M BIOS boot
/dev/sde2     4096    1048575    1044480  510M Linux RAID
/dev/sde3  1048576 1953525134 1952476559  931G Linux RAID

Step 5 – Initialize RAID

We’re now at the stage where both our disks should be nearly identically partitioned and we can actually start creating our arrays. We’ll first do this in a degraded state which will allow us to move data around.

In my case /dev/sdd2 is /boot and /dev/sdd3 houses my LVM data (ultimately containing / and swap).

I already have /dev/md0 as my primary data storage RAID, so /dev/md1 will be for /boot and /dev/md2 for everything else.

Let’s start by creating both mdadm arrays, using the currently empty disk (/dev/sde) and the keyword missing in place of the second member – I chose to ignore the notice displayed, as my version of grub understands md/v1.x metadata out of the box.

root@yuzuki:~# mdadm --create /dev/md1 --level=1 --raid-disks=2 missing /dev/sde2
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.

root@yuzuki:~# mdadm --create /dev/md2 --level=1 --raid-disks=2 missing /dev/sde3
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md2 started.
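
At this point a quick look at the arrays should show both running in a degraded state, with one active member and one missing slot each; something along these lines:

cat /proc/mdstat
mdadm --detail /dev/md1 | grep -E 'State|Active Devices|Working Devices'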

Next we need to ensure that the new arrays are started at boot. Typically this is done by adding them to mdadm.conf. The location of this file may vary depending on your OS, but on Debian it’s in /etc/mdadm. I simply scan the two arrays and append them to the end of the config file.

root@yuzuki:~# mdadm --detail --scan /dev/md{1,2} >> /etc/mdadm/mdadm.conf
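
The appended lines end up looking something like this (the name and UUID values below are placeholders – yours will differ):

ARRAY /dev/md1 metadata=1.2 name=yourhost:1 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
ARRAY /dev/md2 metadata=1.2 name=yourhost:2 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx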

You should now be in a position where /dev/md1 and /dev/md2 exist and both start automatically on boot. Next we’ll look at making use of them!

Step 6 – Configure grub and move /boot

We should now have our degraded arrays, which we can immediately start to use. I’ll begin by creating a new file system on /dev/md1 for use as /boot and then move the data to it. Obviously, if your /boot is not a separate partition but is included in your root partition then you don’t need to worry about this part.

root@yuzuki:~# mkfs.ext4 /dev/md1
mke2fs 1.42.12 (29-Aug-2014)
Creating filesystem with 521920 1k blocks and 130560 inodes
Filesystem UUID: 37e651bc-ffc1-41d8-a9c7-99ca9273b829
Superblock backups stored on blocks: 
        8193, 24577, 40961, 57345, 73729, 204801, 221185, 401409

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done 

root@yuzuki:~# mkdir /mnt/tmp
root@yuzuki:~# mount /dev/md1 /mnt/tmp/
root@yuzuki:~# cp -ax /boot/* /mnt/tmp/
root@yuzuki:~# umount /mnt/tmp 
root@yuzuki:~# rmdir /mnt/tmp

Once all this is done, let’s edit /etc/fstab to tell our system to mount /boot from the new array. Your fstab will be different to mine, but find where /boot is mounted and change it to /dev/md1 (I kept my original UUID line around, commented out, just in case).

root@yuzuki:~# cat /etc/fstab 
# <file system> <mount point> <type> <options> <dump> <pass>
#UUID=f6f3e14e-a2fe-468f-80ee-8824dab51e36 /boot ext3 defaults 0 1

/dev/md1 /boot ext4 defaults 0 1
/dev/pve/root / ext4 errors=remount-ro 0 1
/dev/data/wdred /mnt/storage ext4 defaults,barrier=0 0 1
/dev/pve/swap none swap sw 0 0
proc /proc proc defaults 0 0

It’s critical that you make sure there are no errors here; getting this wrong could result in an unbootable system!
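
If you’d rather keep mounting by UUID, blkid will tell you the new array’s file system UUID; either way, you can test the new entry before rebooting by remounting /boot straight from fstab:

# Find the UUID of the new array's file system, if you prefer UUID= entries
blkid /dev/md1

# Re-mount /boot using the updated fstab entry as a quick test
umount /boot
mount /boot
mount | grep boot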

At this point I chose to reboot my system to ensure it boots correctly and mounts /boot from the correct place.

root@yuzuki:~# mount | grep boot
/dev/md1 on /boot type ext4 (rw,relatime,stripe=4,data=ordered)

mdadm needs to be told to start all arrays at boot (as we intend to host / in an array), so edit /etc/default/mdadm and change INITRDSTART to ‘all’ (mine was set to ‘none’ by default).

root@yuzuki:~# grep ^INITRDSTART /etc/default/mdadm
INITRDSTART='all'

Followed by some grub and initrd changes

root@yuzuki:~# echo 'GRUB_DISABLE_LINUX_UUID=true' >> /etc/default/grub
root@yuzuki:~# echo raid1 >> /etc/modules
root@yuzuki:~# echo raid1 >> /etc/initramfs-tools/modules

And then make sure grub (the boot loader) is installed on both new RAID disks, followed by a grub configuration update and a RAM disk (initramfs) update.

root@yuzuki:~# grub-install /dev/sdd
Installing for i386-pc platform.
Installation finished. No error reported.
root@yuzuki:~# grub-install /dev/sde
Installing for i386-pc platform.
Installation finished. No error reported.

root@yuzuki:~# update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.4.35-2-pve
Found initrd image: /boot/initrd.img-4.4.35-2-pve
Found memtest86+ image: /memtest86+.bin
Found memtest86+ multiboot image: /memtest86+_multiboot.bin
done

root@yuzuki:~# update-initramfs -u
update-initramfs: Generating /boot/initrd.img-4.4.35-2-pve

Now we’re in the final stretch!

Step 7 – Assembling the arrays

If everything has gone to plan, our system should now be mounting /boot from /dev/md1, which means the old /boot partition on /dev/sdd2 is no longer needed. We should set its type to Linux RAID and add it to our array.

root@yuzuki:~# sgdisk -t 2:fd00 /dev/sdd
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot.
The operation has completed successfully.

root@yuzuki:~# mdadm --add /dev/md1 /dev/sdd2
mdadm: added /dev/sdd2

root@yuzuki:~# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] [raid1] 
md2 : active raid1 sde3[1]
      976107200 blocks super 1.2 [2/1] [_U]
      bitmap: 1/8 pages [4KB], 65536KB chunk

md1 : active raid1 sdd2[2] sde2[1]
      521920 blocks super 1.2 [2/1] [_U]
      [===========>.........]  recovery = 55.2% (288896/521920) finish=0.0min speed=144448K/sec
      
md0 : active raid6 sdh1[0] sdc1[6] sda1[7] sdg1[5] sdf1[4] sdi1[2] sdb1[1]
      14650670080 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/7] [UUUUUUU]
      
unused devices: <none>

And voilà, we now have (once it has recovered) a working software RAID for /boot.

Next we’ll add /dev/md2 to LVM and make the space available to our pve volume group so that we can simply pvmove the remaining data to our RAID array.

root@yuzuki:~# pvcreate /dev/md2
  Physical volume "/dev/md2" successfully created
root@yuzuki:~# pvdisplay 
  --- Physical volume ---
  PV Name               /dev/sdd3
  VG Name               pve
  PV Size               931.01 GiB / not usable 312.50 KiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              238339
  Free PE               204406
  Allocated PE          33933
  PV UUID               6p74nl-SxZp-4xu7-j3lH-ACsg-Hs9I-5rNwpq
   
  "/dev/md2" is a new physical volume of "930.89 GiB"
  --- NEW Physical volume ---
  PV Name               /dev/md2
  VG Name               
  PV Size               930.89 GiB
  Allocatable           NO
  PE Size               0   
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               PzVPqV-Rn1V-ASA5-Q4Ni-cXhS-LUCu-QZUTiC
  
  
root@yuzuki:~# vgextend pve /dev/md2
  Volume group "pve" successfully extended

You might notice that /dev/sdd3 and /dev/md2 aren’t the same size according to LVM – I suspect this is because of the RAID metadata overhead. Thankfully, because I moved from a smaller to a larger disk, I have more than enough free PEs to move everything without issue. If, however, you have no free PEs and your disks are not equal in size as shown above, you might need to shrink one of your logical volumes (shrinking the file system inside it first) to free up space.
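
If you want to check your own headroom before committing to the move, pvs gives a quick per-PV view of size and free space:

pvs -o pv_name,vg_name,pv_size,pv_free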

I’m happy that I can proceed here, so I move the contents of /dev/sdd3 to /dev/md2 within LVM.

root@yuzuki:~# pvmove /dev/sdd3 /dev/md2
  /dev/sdd3: Moved: 100.0%
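
If you’d rather not leave a terminal tied up while a large move runs, pvmove can also be started in the background and checked on periodically – a small aside, using standard LVM options:

# Start the move in the background instead of watching it in the foreground
pvmove -b /dev/sdd3 /dev/md2

# Check how far along it is
lvs -a -o name,copy_percent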

Either way, the pvmove will take a while depending on how much data there is to move (the Allocated PE count), but once it’s done, all that’s left is to remove the old partition from LVM.

root@yuzuki:~# vgreduce pve /dev/sdd3
  Removed "/dev/sdd3" from volume group "pve"

root@yuzuki:~# pvremove /dev/sdd3
  Labels on physical volume "/dev/sdd3" successfully wiped

And then we add the old partition to the corresponding mdadm array.

root@yuzuki:~# mdadm --add /dev/md2 /dev/sdd3
mdadm: added /dev/sdd3

root@yuzuki:~# cat /proc/mdstat 
Personalities : [raid1] [raid6] [raid5] [raid4] 
md2 : active raid1 sdd3[2] sde3[1]
      976107200 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  1.3% (12725760/976107200) finish=133.5min speed=120174K/sec
      bitmap: 2/8 pages [8KB], 65536KB chunk

md1 : active raid1 sdd2[2] sde2[1]
      521920 blocks super 1.2 [2/2] [UU]
      
md0 : active raid6 sdh1[0] sdb1[6] sda1[7] sdg1[5] sdf1[4] sdi1[2] sdc1[1]
      14650670080 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/7] [UUUUUUU]
      
unused devices: <none>

A final reboot to make sure everything comes up correctly and, if so, it’s just a matter of waiting for the RAID recovery to complete.
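
If you want to keep an eye on that recovery, watching mdstat is the simplest way:

watch -n 60 cat /proc/mdstat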