title: Replacing Failed Disk
description: Guide on removing an old yeller from a BtrFS RAID 1 array (for a new yeller)
published: true
date: 2022-04-30T20:10:40.002Z
tags: btrfs, storage, nas, filesystem
editor: markdown
dateCreated: 2022-04-04T16:25:48.663Z

One of the old 3TB yellers has started playing dirty. We do not negotiate with terrorists, so a pair of 8TB drives was called in as reinforcements that very same day.

Below, I will write this page as I replace first the failing disk, and then the non-failing one, in the BtrFS RAID1 array on Takahe.

If all goes well, this will be a nice, cozy page. If I cause catastrophic data loss (again), this shall be a monument of my failure.

Do NOT use this method on a filesystem that already has errors! It will copy them over and they will be unrecoverable! {.is-danger}
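One cheap way to look for trouble before starting is the per-device error counters that btrfs keeps. The sketch below is only that, a sketch: `nonzero_stats` is a made-up helper name, it assumes the usual `[device].counter  value` layout of `btrfs device stats`, and clean counters do not prove the filesystem is healthy (a scrub is the more thorough check).

```shell
# Sketch: print any non-zero btrfs error counters and fail if one is found.
# Reads `btrfs device stats <mountpoint>` output on stdin, where each line
# is assumed to look like:  [/dev/sdb].read_io_errs    0
# nonzero_stats is a hypothetical helper, not part of btrfs-progs.
nonzero_stats() {
    awk 'NF == 2 && $2 != 0 { print; bad = 1 } END { exit bad ? 1 : 0 }'
}

# Usage (as root; /Red-Vol is the mountpoint used on this page):
#   btrfs device stats /Red-Vol | nonzero_stats && echo "counters are clean"
```

If anything is printed, stop and deal with that first.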

Crossing Disk Serial with Device Name

Ever so pretentious, smartd names a disk by its serial. See the example below:

➜  ~ systemctl status smartd
● smartd.service - Self Monitoring and Reporting Technology (SMART) Daemon
     Loaded: loaded (/usr/lib/systemd/system/smartd.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2022-04-04 08:01:55 IDT; 11h ago
       Docs: man:smartd(8)
             man:smartd.conf(5)
   Main PID: 1014 (smartd)
     Status: "Next check of 2 devices will start at 19:31:55"
      Tasks: 1 (limit: 4915)
        CPU: 85ms
     CGroup: /system.slice/smartd.service
             └─1014 /usr/sbin/smartd -n

Apr 04 17:01:55 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 2 Currently unreadable (pending) sectors
Apr 04 17:01:55 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 4 Offline uncorrectable sectors
Apr 04 17:31:55 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 2 Currently unreadable (pending) sectors
Apr 04 17:31:55 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 4 Offline uncorrectable sectors
Apr 04 18:01:56 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 2 Currently unreadable (pending) sectors
Apr 04 18:01:56 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 4 Offline uncorrectable sectors
Apr 04 18:31:55 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 2 Currently unreadable (pending) sectors
Apr 04 18:31:55 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 4 Offline uncorrectable sectors
Apr 04 19:01:55 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 2 Currently unreadable (pending) sectors
Apr 04 19:01:55 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 4 Offline uncorrectable sectors

That's wonderful, honey. But who is /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY?

btrfs sure as hell doesn't know:

➜  ~ btrfs filesystem show /Red-Vol
Label: none  uuid: c2d98db0-b903-4cc2-947c-4c4c944da026
        Total devices 2 FS bytes used 2.21TiB
        devid    1 size 2.73TiB used 2.21TiB path /dev/sdc
        devid    2 size 2.73TiB used 2.21TiB path /dev/sdb

udevadm to the rescue! I even looped it nicely for ya :)

➜  ~ for disk in $(btrfs filesystem show /Red-Vol/ | awk '{print $NF}' | grep "/dev"); do echo $disk && udevadm info --query=all --name=$disk | grep ID_SERIAL; done
/dev/sdc
E: ID_SERIAL=WDC_WD30EFRX-68EUZN0_WD-WCC4N3YN0903
E: ID_SERIAL_SHORT=WD-WCC4N3YN0903
/dev/sdb
E: ID_SERIAL=WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY
E: ID_SERIAL_SHORT=WD-WCC4N7UEPSDY

A-ha! /dev/sdb, you bastard!
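The lookup also works in the other direction. The names under /dev/disk/by-id are symlinks maintained by udev, so the path smartd prints can be followed straight to the kernel device. A minimal sketch, with `resolve_disk` being a name I made up for illustration:

```shell
# Sketch: resolve a persistent /dev/disk/by-id name to its kernel device node.
# by-id entries are symlinks, so readlink -f follows them to e.g. /dev/sdX.
# resolve_disk is a hypothetical helper name.
resolve_disk() {
    readlink -f -- "$1"
}

# Usage, with the serial smartd was complaining about:
#   resolve_disk /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY
```

Keep in mind that /dev/sdX names can change between boots, so resolve again right before doing anything destructive.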

Crossing Device Name With devid (pointless)

But wait, there's more! The btrfs replace command expects the devid (or the device name which we already know, making this section utterly insignificant, but what the heck).

To find it, check btrfs filesystem show [mountpoint]:

➜  ~ btrfs filesystem show /Red-Vol/                    
Label: none  uuid: c2d98db0-b903-4cc2-947c-4c4c944da026
	Total devices 3 FS bytes used 2.21TiB
	devid    1 size 2.73TiB used 2.21TiB path /dev/sdc
	devid    2 size 2.73TiB used 2.21TiB path /dev/sdb

A-ha! devid 2, you bastard!
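If squinting at the table is not your thing, the devid can be pulled out with awk. This is a rough sketch that assumes the `devid N size ... path /dev/...` line layout shown above; `devid_for_path` is a hypothetical helper name.

```shell
# Sketch: print the devid that `btrfs filesystem show` lists for a device path.
# Reads the command's output on stdin; $1 is the device path to look for.
# Assumes lines of the form: devid <N> size <S> used <U> path <device>
devid_for_path() {
    awk -v dev="$1" '$1 == "devid" && $NF == dev { print $2 }'
}

# Usage:
#   btrfs filesystem show /Red-Vol/ | devid_for_path /dev/sdb
```

It prints nothing when the path is not part of the filesystem, which is a decent hint to stop and re-check.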

Replacing The Bastard

Now, run btrfs replace:

➜  ~ btrfs replace start 2 /dev/sda /Red-Vol/ -f

The -f was thrown in because I had formatted the new disk with BtrFS beforehand, and btrfs replace refuses to overwrite a target that already holds a filesystem unless forced. I had formatted the new disk with BtrFS beforehand because I am very stupid. {.is-info}

Now, all that is left is watching in panic:

➜  ~ btrfs replace status /Red-Vol             
1.4% done, 0 write errs, 0 uncorr. read errs

Will it work? Will it destroy ALL my data?

We shall see.
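For anyone who would rather script the panic, the percentage can be scraped from the status line. This is a sketch under assumptions: it expects the `N% done, ...` format shown above, uses the `-1` flag of `btrfs replace status` to print once instead of continuously, and `replace_percent` is a name invented here.

```shell
# Sketch: extract the progress percentage from a `btrfs replace status` line.
# Reads the status text on stdin and prints e.g. 1.4; prints nothing when the
# line does not start with "<number>% done" (for instance after completion).
# replace_percent is a hypothetical helper name.
replace_percent() {
    sed -n 's/^\([0-9][0-9.]*\)% done.*/\1/p'
}

# Usage (-1 makes btrfs print the status once and exit):
#   btrfs replace status -1 /Red-Vol | replace_percent
```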

Resizing The Bastards

Success! Now, assuming we are replacing with larger disks (go big or go home, shmub), the filesystem still only uses the old capacity on each new disk, so every device has to be resized. First, look up your devids with btrfs filesystem show:

➜  ~ btrfs filesystem show /Red-Vol/
Label: none  uuid: c2d98db0-b903-4cc2-947c-4c4c944da026
	Total devices 2 FS bytes used 2.21TiB
	devid    1 size 7.28TiB used 2.21TiB path /dev/sdb
	devid    2 size 2.73TiB used 2.21TiB path /dev/sda

Now, run btrfs filesystem resize [devid]:max [mountpoint]:

➜  ~ btrfs filesystem resize 1:max /Red-Vol
Resize device id 1 (/dev/sdb) from 7.28TiB to max
➜  ~ btrfs filesystem show /Red-Vol/       
Label: none  uuid: c2d98db0-b903-4cc2-947c-4c4c944da026
	Total devices 2 FS bytes used 2.21TiB
	devid    1 size 7.28TiB used 2.21TiB path /dev/sdb
	devid    2 size 2.73TiB used 2.21TiB path /dev/sda

➜  ~ btrfs filesystem resize 2:max /Red-Vol
Resize device id 2 (/dev/sda) from 2.73TiB to max
➜  ~ btrfs filesystem show /Red-Vol/       
Label: none  uuid: c2d98db0-b903-4cc2-947c-4c4c944da026
	Total devices 2 FS bytes used 2.21TiB
	devid    1 size 7.28TiB used 2.21TiB path /dev/sdb
	devid    2 size 7.28TiB used 2.21TiB path /dev/sda

Finally, to see your changes, remount the filesystem:

➜  ~ mount -o remount,rw /Red-Vol
➜  ~ btrfs filesystem show /Red-Vol/ 
Label: none  uuid: c2d98db0-b903-4cc2-947c-4c4c944da026
	Total devices 2 FS bytes used 2.21TiB
	devid    1 size 7.28TiB used 2.21TiB path /dev/sdb
	devid    2 size 7.28TiB used 2.21TiB path /dev/sda

Hurrah!
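With more than two disks, typing one resize per devid gets old. The sketch below only prints the commands so they can be reviewed before running; it assumes the `devid N ...` line layout from btrfs filesystem show, and `resize_all_cmds` is a made-up name.

```shell
# Sketch: print one `btrfs filesystem resize <devid>:max <mountpoint>` command
# per device listed by `btrfs filesystem show`. Nothing is executed here.
# Reads the show output on stdin; $1 is the mountpoint.
# resize_all_cmds is a hypothetical helper name.
resize_all_cmds() {
    awk -v mnt="$1" '$1 == "devid" { printf "btrfs filesystem resize %s:max %s\n", $2, mnt }'
}

# Usage:
#   btrfs filesystem show /Red-Vol | resize_all_cmds /Red-Vol
# Review the printed lines, then run them by hand (or pipe them to sh).
```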

Mounting The Bastards

Do not go there. You know what you did. {.is-warning}

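The best way to mount your new pool is by the filesystem UUID. Every disk in the array reports the same UUID (the per-disk identifier is UUID_SUB), so it does not matter which member you read it from.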

Finding the UUID is easy with blkid:

➜  ~ blkid | grep /dev/sda
/dev/sda: UUID="c2d98db0-b903-4cc2-947c-4c4c944da026" UUID_SUB="19f4df76-f50b-48c2-ad4b-1f71936440cd" BLOCK_SIZE="4096" TYPE="btrfs"

Now, go fish:

➜  ~ cat /etc/fstab
...
...
...
UUID=c2d98db0-b903-4cc2-947c-4c4c944da026	 /Red-Vol/         btrfs  defaults,compress=zstd:11     0  0
#    ^ This friendo right here from blkid
...
...
...

Or you can go by the /dev/disk/by-id path of one of the disks, which is how openSUSE did it. I do not know why, but I know they know better, you know?

...
...
...
/dev/disk/by-id/ata-TOSHIBA_HDWG480_71Q0A0PDFR0H /Red-Vol/	   btrfs  defaults,compress=zstd:11	0  0
...
...
...
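For the UUID variant, the line can be generated instead of copy-pasted. A small sketch, assuming the same field layout as the entry above; `fstab_entry` is a hypothetical helper, and the `findmnt --verify` call in the usage notes is util-linux's fstab sanity check, which may be missing on older systems.

```shell
# Sketch: build a btrfs fstab line from a filesystem UUID.
# $1 = filesystem UUID, $2 = mountpoint, $3 = mount options.
# fstab_entry is a hypothetical helper name; it only prints, never edits fstab.
fstab_entry() {
    printf 'UUID=%s\t%s\tbtrfs\t%s\t0  0\n' "$1" "$2" "$3"
}

# Usage:
#   fstab_entry "$(blkid -s UUID -o value /dev/sda)" /Red-Vol defaults,compress=zstd:11
#   findmnt --verify    # check /etc/fstab for mistakes before rebooting
```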

Now, reboot and hope for the best.

Keep An Eye On The Bastards

Now, we add the new disk(s) to smartd. Edit /etc/smartd.conf and add one line per disk:

#DEVICESCAN
/dev/disk/by-id/ata-TOSHIBA_HDWG480_71Q0A0PDFR0H        -a
/dev/disk/by-id/ata-TOSHIBA_HDWG480_71Q0A0SHFR0H        -a

Uncommenting DEVICESCAN also works, but we do not trust it.
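The entries are regular enough to generate. A minimal sketch that prints one line per by-id name in the same shape as above; `smartd_entries` is a name I made up, and it deliberately only prints so the result can be checked before it goes into the config.

```shell
# Sketch: print smartd.conf lines for the given /dev/disk/by-id names.
# Each argument is the file name under /dev/disk/by-id (not the full path).
# smartd_entries is a hypothetical helper name.
smartd_entries() {
    for id in "$@"; do
        printf '/dev/disk/by-id/%s\t-a\n' "$id"
    done
}

# Usage:
#   smartd_entries ata-TOSHIBA_HDWG480_71Q0A0PDFR0H ata-TOSHIBA_HDWG480_71Q0A0SHFR0H
# Append the output to /etc/smartd.conf, then: systemctl restart smartd
```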

Balance The Bastards & Scrub The Bastards

You're not assuming nothing went wrong, are you?

Anyway, if you got this far, run btrfs balance start [mountpoint]. If that checks out, run btrfs scrub start [mountpoint]. Each of these will take many, many hours.
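The two steps can be chained so the scrub only starts if the balance succeeds. This is a sketch, not gospel: `balance_then_scrub` and the `BTRFS` override variable are my own inventions (the override exists so the function can be dry-run with `echo`), and it relies on `btrfs scrub start -B`, which keeps the scrub in the foreground and prints statistics when it finishes.

```shell
# Sketch: run a balance, then a foreground scrub, stopping if the balance fails.
# $1 = mountpoint. Set BTRFS=echo for a dry run that only prints the commands.
# balance_then_scrub and BTRFS are hypothetical names, not btrfs-progs features.
balance_then_scrub() {
    mnt=$1
    btrfs_bin=${BTRFS:-btrfs}
    "$btrfs_bin" balance start "$mnt" || return 1
    # -B: do not background; wait and print scrub statistics at the end
    "$btrfs_bin" scrub start -B "$mnt"
}

# Usage (as root, preferably inside tmux or screen):
#   balance_then_scrub /Red-Vol
```

From another terminal, btrfs balance status [mountpoint] and btrfs scrub status [mountpoint] show how far along things are.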

Enjoy the rest of your day.