---
title: Replacing Failed Disk
description: Guide on removing an old yeller from a BtrFS RAID 1 array (for a new yeller)
published: true
date: 2022-04-30T20:10:40.002Z
tags: btrfs, storage, nas, filesystem
editor: markdown
dateCreated: 2022-04-04T16:25:48.663Z
---
One of the old 3TB yellers has started playing dirty.
We do not negotiate with terrorists - a pair of 8TB disks was called in for reinforcement that very same day.
Below, I will write this page as I replace the failing disk, followed by the non-failing one, in the BtrFS RAID1 array on Takahe.
If all goes well, this will be a nice, cozy page. If I cause catastrophic data loss (again), this shall be a monument of my failure.
> Do **NOT** use this method to replace a disk in a filesystem with errors! It ***will*** copy them over and they ***will*** be unrecoverable!
{.is-danger}
# Crossing Disk Serial with Device Name
Ever so pretentious, `smartd` will name a disk by its serial - see this example below:
```zsh
➜ ~ systemctl status smartd
● smartd.service - Self Monitoring and Reporting Technology (SMART) Daemon
Loaded: loaded (/usr/lib/systemd/system/smartd.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2022-04-04 08:01:55 IDT; 11h ago
Docs: man:smartd(8)
man:smartd.conf(5)
Main PID: 1014 (smartd)
Status: "Next check of 2 devices will start at 19:31:55"
Tasks: 1 (limit: 4915)
CPU: 85ms
CGroup: /system.slice/smartd.service
└─1014 /usr/sbin/smartd -n
Apr 04 17:01:55 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 2 Currently unreadable (pending) sectors
Apr 04 17:01:55 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 4 Offline uncorrectable sectors
Apr 04 17:31:55 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 2 Currently unreadable (pending) sectors
Apr 04 17:31:55 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 4 Offline uncorrectable sectors
Apr 04 18:01:56 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 2 Currently unreadable (pending) sectors
Apr 04 18:01:56 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 4 Offline uncorrectable sectors
Apr 04 18:31:55 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 2 Currently unreadable (pending) sectors
Apr 04 18:31:55 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 4 Offline uncorrectable sectors
Apr 04 19:01:55 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 2 Currently unreadable (pending) sectors
Apr 04 19:01:55 Takahe smartd[1014]: Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY [SAT], 4 Offline uncorrectable sectors
```
That's wonderful, honey.
But who is `/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY`?
`btrfs` sure as hell doesn't know:
```zsh
➜ ~ btrfs filesystem show /Red-Vol
Label: none uuid: c2d98db0-b903-4cc2-947c-4c4c944da026
Total devices 2 FS bytes used 2.21TiB
devid 1 size 2.73TiB used 2.21TiB path /dev/sdc
devid 2 size 2.73TiB used 2.21TiB path /dev/sdb
```
`udevadm` to the rescue! I even looped it nicely for ya :)
```zsh
➜ ~ for disk in $(btrfs filesystem show /Red-Vol/ | awk '{print $NF}' | grep "/dev"); do echo $disk && udevadm info --query=all --name=$disk | grep ID_SERIAL; done
/dev/sdc
E: ID_SERIAL=WDC_WD30EFRX-68EUZN0_WD-WCC4N3YN0903
E: ID_SERIAL_SHORT=WD-WCC4N3YN0903
/dev/sdb
E: ID_SERIAL=WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY
E: ID_SERIAL_SHORT=WD-WCC4N7UEPSDY
```
A-ha! `/dev/sdb`, you bastard!
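If you only need to go the other way (from the name `smartd` complains about to the kernel device), the `by-id` entry is just a symlink, so `readlink -f` can resolve it directly. A minimal sketch; `resolve_disk` is a made-up helper name, not part of any tool:

```shell
#!/bin/sh
# resolve_disk is a hypothetical helper: print the canonical path that a
# /dev/disk/by-id/... symlink points to (a kernel name such as /dev/sdX).
resolve_disk() {
    readlink -f -- "$1"
}

# Example call, commented out because it only works where that disk exists:
# resolve_disk /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7UEPSDY
```

On Takahe this should land on the same device the `udevadm` loop pointed at.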
# Crossing Device Name With devid (pointless)
But wait, there's more!
The `btrfs replace` command expects the `devid` (or the device name which we already know, making this section utterly insignificant, but what the heck).
To find it, check `btrfs filesystem show [mountpoint]`:
```zsh
➜ ~ btrfs filesystem show /Red-Vol/
Label: none uuid: c2d98db0-b903-4cc2-947c-4c4c944da026
Total devices 3 FS bytes used 2.21TiB
devid 1 size 2.73TiB used 2.21TiB path /dev/sdc
devid 2 size 2.73TiB used 2.21TiB path /dev/sdb
```
A-ha! `devid 2`, you bastard!
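If eyeballing the table feels too manual, the lookup can be scripted. A small sketch, assuming the `devid ... path /dev/...` line layout shown above; `devid_of` is a hypothetical helper that reads the `btrfs filesystem show` output on stdin:

```shell
#!/bin/sh
# devid_of is a hypothetical helper: given a device path as $1 and the output
# of `btrfs filesystem show` on stdin, print the matching devid.
devid_of() {
    # Device lines start with "devid"; the id is field 2, the path is the last field.
    awk -v dev="$1" '$1 == "devid" && $NF == dev { print $2 }'
}

# Example call (needs root and a mounted BtrFS filesystem):
# btrfs filesystem show /Red-Vol/ | devid_of /dev/sdb
```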
# Replacing The Bastard
Now, run `btrfs replace`:
`➜ ~ btrfs replace start 2 /dev/sda /Red-Vol/ -f`
> The `-f` was thrown in because I have chosen to format the new disk with BtrFS beforehand. I have chosen to format the new disk with Btrfs beforehand because I am very stupid.
{.is-info}
Now, all that is left is watching in panic:
```zsh
➜ ~ btrfs replace status /Red-Vol
1.4% done, 0 write errs, 0 uncorr. read errs
```
Will it work? Will it destroy ALL my data?
We shall see.
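While waiting, the two error counters in that status line are the part worth watching. A rough sketch that adds them up, assuming the `N write errs, N uncorr. read errs` wording shown above; `replace_errors` is a hypothetical helper:

```shell
#!/bin/sh
# replace_errors is a hypothetical helper: read a `btrfs replace status` line
# on stdin and print the sum of its write and uncorrectable read error counts.
replace_errors() {
    awk '{
        w = 0; r = 0
        for (i = 1; i < NF; i++) {
            if ($(i + 1) == "write")   w = $i   # number right before "write errs"
            if ($(i + 1) == "uncorr.") r = $i   # number right before "uncorr. read errs"
        }
        print w + r
    }'
}

# Example call. The -1 flag (print once and exit) is an assumption about your
# btrfs-progs version; check `btrfs replace status --help` first.
# btrfs replace status -1 /Red-Vol | replace_errors
```

Anything other than `0` is a good reason to stop and read `dmesg` before trusting the new disk.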
# Resizing The Bastards
Success! Now, assuming we are replacing with larger disks (go big or go home, shmub), you will have to resize the filesystem on each disk, or the extra space stays unused.
First, see your `devid`s with `btrfs filesystem show`:
```
➜ ~ btrfs filesystem show /Red-Vol/
Label: none uuid: c2d98db0-b903-4cc2-947c-4c4c944da026
Total devices 2 FS bytes used 2.21TiB
devid 1 size 7.28TiB used 2.21TiB path /dev/sdb
devid 2 size 2.73TiB used 2.21TiB path /dev/sda
```
Now, run `btrfs filesystem resize [devid]:max [mountpoint]`:
```
➜ ~ btrfs filesystem resize 1:max /Red-Vol
Resize device id 1 (/dev/sdb) from 7.28TiB to max
➜ ~ btrfs filesystem show /Red-Vol/
Label: none uuid: c2d98db0-b903-4cc2-947c-4c4c944da026
Total devices 2 FS bytes used 2.21TiB
devid 1 size 7.28TiB used 2.21TiB path /dev/sdb
devid 2 size 2.73TiB used 2.21TiB path /dev/sda
➜ ~ btrfs filesystem resize 2:max /Red-Vol
Resize device id 2 (/dev/sda) from 2.73TiB to max
➜ ~ btrfs filesystem show /Red-Vol/
Label: none uuid: c2d98db0-b903-4cc2-947c-4c4c944da026
Total devices 2 FS bytes used 2.21TiB
devid 1 size 7.28TiB used 2.21TiB path /dev/sdb
devid 2 size 7.28TiB used 2.21TiB path /dev/sda
```
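With more than a couple of disks, typing one `resize` per `devid` gets old. A sketch that only prints the commands (a dry run) so you can read them before running anything; `resize_cmds` is a hypothetical helper and assumes the same `btrfs filesystem show` layout as above:

```shell
#!/bin/sh
# resize_cmds is a hypothetical helper: $1 is the mountpoint, stdin is the
# output of `btrfs filesystem show`. It prints one resize command per devid
# and does not execute anything itself.
resize_cmds() {
    awk -v mnt="$1" '$1 == "devid" {
        printf "btrfs filesystem resize %s:max %s\n", $2, mnt
    }'
}

# Example call (review the printed commands, then run them by hand):
# btrfs filesystem show /Red-Vol/ | resize_cmds /Red-Vol
```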
Finally, to see your changes, remount the filesystem:
```
➜ ~ mount -o remount,rw /Red-Vol
➜ ~ btrfs filesystem show /Red-Vol/
Label: none uuid: c2d98db0-b903-4cc2-947c-4c4c944da026
Total devices 2 FS bytes used 2.21TiB
devid 1 size 7.28TiB used 2.21TiB path /dev/sdb
devid 2 size 7.28TiB used 2.21TiB path /dev/sda
```
Hurrah!
# Mounting The Bastards
> Do not go there. You know what you did.
{.is-warning}
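The best method to mount your new pool is by its filesystem `UUID` - every disk in the array reports the same one (the per-disk value is `UUID_SUB`), and it does not change when device names get shuffled.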
Finding the `UUID` is easy with `blkid`:
```zsh
➜ ~ blkid | grep /dev/sda
/dev/sda: UUID="c2d98db0-b903-4cc2-947c-4c4c944da026" UUID_SUB="19f4df76-f50b-48c2-ad4b-1f71936440cd" BLOCK_SIZE="4096" TYPE="btrfs"
```
Now, go fish:
```
➜ ~ cat /etc/fstab
...
...
...
UUID=c2d98db0-b903-4cc2-947c-4c4c944da026 /Red-Vol/ btrfs defaults,compress=zstd:11 0 0
# ^ This friendo right here from blkid
...
...
...
```
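To avoid copy-paste accidents, the `fstab` line can be assembled from the `blkid` output. A sketch only: `fstab_line` is a hypothetical helper, and the mount options are simply the ones used on this page, not a recommendation:

```shell
#!/bin/sh
# fstab_line is a hypothetical helper: $1 is the mountpoint, stdin is one line
# of blkid output. It prints an fstab entry and does not touch /etc/fstab.
fstab_line() {
    # Grab the value of UUID="..." (not UUID_SUB="...") from the blkid line.
    fs_uuid=$(sed -n 's/.* UUID="\([^"]*\)".*/\1/p')
    if [ -z "$fs_uuid" ]; then
        echo "no UUID found in input" >&2
        return 1
    fi
    printf 'UUID=%s %s btrfs defaults,compress=zstd:11 0 0\n' "$fs_uuid" "$1"
}

# Example call (needs root):
# blkid /dev/sda | fstab_line /Red-Vol/
```

If your util-linux ships it, `findmnt --verify` is a cheap way to sanity-check the edited `fstab` before rebooting.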
Or you can go by one disk's `by-id` path, which is how openSUSE did it. I do not know why, but I know they know better, you know?
```
...
...
...
/dev/disk/by-id/ata-TOSHIBA_HDWG480_71Q0A0PDFR0H /Red-Vol/ btrfs defaults,compress=zstd:11 0 0
...
...
...
```
Now, reboot and hope for the best.
# Keep An Eye On The Bastards
Now, we add the new disk(s) to `smartd`. Edit `/etc/smartd.conf` and add each disk:
```conf
#DEVICESCAN
/dev/disk/by-id/ata-TOSHIBA_HDWG480_71Q0A0PDFR0H -a
/dev/disk/by-id/ata-TOSHIBA_HDWG480_71Q0A0SHFR0H -a
```
Uncommenting `DEVICESCAN` also works, but we do not trust it.
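If you would rather generate those lines than type serials by hand, here is a tiny sketch. `smartd_lines` is a hypothetical helper; it only prints, so pasting the result into `/etc/smartd.conf` is still on you:

```shell
#!/bin/sh
# smartd_lines is a hypothetical helper: for each by-id name given as an
# argument, print a smartd.conf entry that monitors it with -a.
smartd_lines() {
    for id in "$@"; do
        printf '/dev/disk/by-id/%s -a\n' "$id"
    done
}

# Example call:
# smartd_lines ata-TOSHIBA_HDWG480_71Q0A0PDFR0H ata-TOSHIBA_HDWG480_71Q0A0SHFR0H
```

`smartd` only reads its config at startup or reload, so follow up with `systemctl restart smartd`.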
# Balance The Bastards & Scrub The Bastards
You're not assuming nothing went wrong, are you?
Anyway, if you got this far, run `btrfs balance start [mountpoint]`. If that checks out, run `btrfs scrub start [mountpoint]`. Each of these will take many, many hours.
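Once the scrub is done, the per-device error counters from `btrfs device stats` tell you whether anything went sideways. A minimal sketch that fails loudly on any non-zero counter, assuming the usual `[/dev/sdX].some_errs  N` output layout; `stats_clean` is a hypothetical helper:

```shell
#!/bin/sh
# stats_clean is a hypothetical helper: read `btrfs device stats` output on
# stdin, print every non-zero counter, and return 1 if there was at least one.
stats_clean() {
    awk 'NF == 2 && $2 != 0 { print "non-zero counter: " $0; bad = 1 }
         END { exit bad }'
}

# Example call (needs root; -B keeps the scrub in the foreground until it ends):
# btrfs scrub start -B /Red-Vol && btrfs device stats /Red-Vol | stats_clean
```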
Enjoy the rest of your day.