There was a fail event reported on rosemary from one side of a pair of 60GB SSDs, which hold Rosemary_Data. Typical of my installations, this mirror set holds the stuff the system needs beyond the OS install: /home, databases, certificates, repositories, mail, samba, local bin, etc. It’s a mirror set with an encrypted container, containing a btrfs filesystem. The older versions of these setups have separate btrfs subvolumes for the different directories; newer ones have only one subvolume for that, and another for snapshots. This is an old one.
Rosemary doesn’t have an extensive set of services – really only /home and the databases. No real need for much of anything else. The local bin comes out of the repo anyway, and there is no mail, no repository, no certs. However, without that volume the system won’t come up in a usable way. So lesson one learned here: when you get a fail event, attend to it. I let it go for a few days, because I knew I was going to have to pull the case out of the rack mount to get at the SSDs.
Day before yesterday, Saturday, I paid attention. It looked like the failed SSD was still physically present, and identified itself properly. Often I get transient errors, and I’ve gotten into the habit of trying a simple re-add: mdadm /dev/mdx --re-add /dev/sdy. If that doesn’t work, then I try mdadm --zero-superblock /dev/sdy followed by mdadm /dev/mdx --add /dev/sdy. I tried that, and it seemed to work: it started a re-sync; but when I came back later the array was stopped, the re-sync had not completed, and only the new “spare” was in the array.
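The way to keep an eye on a re-sync, incidentally, is the usual pair (mdx here being whichever array is rebuilding):
cat /proc/mdstat
mdadm --detail /dev/mdx
Had I been watching, I might have caught the second failure while the rebuild was still running.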
Lesson 2: if there is an option, it is probably better not to re-add failed media. I’ve done it many times. Sometimes it works, sometimes it doesn’t. Here it appears the system got a failure on the good media before it had completed re-syncing the bad one back in.
This was an odd recovery, because /etc and /root were still intact. The three main things that needed to be recovered were /home, samba, and the databases. The local bin was lost, because it was on the rosemary data volume, but it was easy to get it back from the repo directly into /usr/local/bin, and I just abandoned the idea of having it in rosemary data at all.
I decided that I would just replace the media, create a new array with new media, and then recover the array from the previous night’s backups. It so happened I had a pair of brand new 60GB SSDs that I had bought on sale, because it is increasingly hard to find these little SSDs – everybody wants to buy enormous ones, which I understand. So when I saw some cheap little ones, I got them for the stash. Good thing.
Pulled the case, removed the old ones, put in the new ones, and returned the case to the rack.
mdadm --create /dev/md60 --level=mirror -n 2 /dev/sde /dev/sdf
cryptsetup luksFormat /dev/md60
cryptsetup luksOpen /dev/md60 rosemary_data
mkfs.btrfs -L rosemary_data /dev/mapper/rosemary_data
mount /dev/mapper/rosemary_data /mnt/rosemarydata
btrfs sub create /mnt/rosemarydata/home
btrfs sub create /mnt/rosemarydata/mysql
btrfs sub create /mnt/rosemarydata/samba
btrfs sub create /mnt/rosemarydata/system
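Each of those subvolumes then gets mounted into place at startup; the equivalent by hand looks something like this (the mount points follow my layout, the exact invocations are illustrative):
mount -o subvol=home /dev/mapper/rosemary_data /home
mount -o subvol=mysql /dev/mapper/rosemary_data /var/lib/mysql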
I neglected to add the root keyfile as a luks key, which came back to haunt me later, when I eventually got to a reboot.
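The fix, when I finally made it, is a one-liner – the keyfile path here is illustrative, it is whatever crypttab references:
cryptsetup luksAddKey /dev/md60 /etc/keys/rosemary_data.key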
I didn’t recreate localbin or certs. I’m not much using the svn certificates stuff now that everything is on letsencrypt, and putting localbin on the data volume proved to be a bad idea, as discussed. Then I changed the UUIDs in the startup script, crypttab, and mdadm.conf.
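The new UUIDs come from the usual places; the mdadm line can go straight into mdadm.conf:
blkid /dev/md60          # LUKS UUID for crypttab
mdadm --detail --scan    # ARRAY line for mdadm.conf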
Getting back /home was only a matter of rsyncing from /backups/rosemary/home to /mnt/rosemarydata/home.
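That is, something along these lines (the -H, -A and -X flags preserve hard links, ACLs and xattrs, which matters for a home directory):
rsync -aHAX /backups/rosemary/home/ /mnt/rosemarydata/home/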
The biggest problem was trying to recover the databases. I had a completely empty /var/lib/mysql, so the mysql server would not start at all, which meant I couldn’t create the databases with mysqladmin, and of course I couldn’t populate them unless I could get the server up.
I tried doing mysql_install_db, but ran into various problems. In the end, I decided to completely apt remove mysql and reinstall, and while I was at it I switched from mysql to mariadb. There were many little problems along the way, involving /etc/my* and /etc/alternatives/my*.
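The swap itself was roughly this (package names as on Debian/Ubuntu; purge, so the broken configs go too):
apt remove --purge mysql-server mysql-client
apt install mariadb-server mariadb-client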
I also had some trouble with apparmor, and had to copy an apparmor profile from the net which allowed the startup to send the notify back to systemd. Then /etc/init.d/apparmor reload.
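The heart of that profile change, as best I understand it, is a rule in the mysqld profile permitting the write to systemd’s notify socket:
/{,var/}run/systemd/notify w,
with the reload making it take effect.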
Eventually, I got a brand new mariadb server installed, got my few changes into /etc/my*, and got the server up. Then I was able to do mysql_install_db -u mysql and mysql_secure_installation. (I ended up not putting in a root password; I think that was interfering with the debian stuff on startup. I added the password later, and I had to change the password for debian-sys-maint to match what was in /etc/my*.)
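Getting debian-sys-maint back in line is done from a mysql root prompt, using the password already sitting in /etc/mysql/debian.cnf:
SET PASSWORD FOR 'debian-sys-maint'@'localhost' = PASSWORD('whatever-debian-cnf-says');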
Then I was able to mysqladmin -v create all the databases, and then for each one:
bunzip2 -c -k <database>.sql.bz2 | mysql <database> > /tmp/dbrestore<database>
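Wrapped in a loop, the whole restore comes to something like this (the backup directory name is assumed):
for f in /backups/rosemary/mysql/*.sql.bz2; do
  db=$(basename "$f" .sql.bz2)
  mysqladmin create "$db"
  bunzip2 -c -k "$f" | mysql "$db" > /tmp/dbrestore"$db"
done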
I did not restore the mysql database this way. Perhaps I should have done. But I put in most of the users and grants by hand, because I didn’t have a grants file. If I ever do this again, I think it is worth trying to do the mysql database the same way.
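A grants file is exactly what Percona’s pt-show-grants produces – a replayable SQL file of all the users and grants (output path illustrative):
pt-show-grants > /backups/rosemary/grants.sql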
After I got back up, I found I had picked up an extra default route on the vpn interface, which had me chasing my tail for a while before I tracked it down.
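Once spotted, it was simple enough to see and remove (the interface name here is illustrative – mine is whatever the vpn creates):
ip route show
ip route del default dev tun0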
Had a little trouble getting samba back up because I didn’t have /var/lib/samba. I ended up removing and reinstalling samba. I have changed the rosemary backup so it backs up /var/lib/samba.
The main problems:
1. I had only mysqldump copies, not a copy of /var/lib/mysql, so recovery required rebuilding the databases, which ended up requiring a complete re-installation of mysql and starting over from an empty data directory.
2. I didn’t have a grants file on rosemary. I have added Percona’s pt-show-grants and put it into the backup script.
3. I had /usr/local/bin in a btrfs subvolume, which was unnecessary since I get it from the repo anyway. It is better not to have /usr/local/bin on the data volume at all.
4. I didn’t have a copy of /var/lib/samba anywhere. It wasn’t captured in the backup scripts.
5. After I got back up, I couldn’t restart the couchpotato server, because it was running out of /home/dee/git and I had excluded git from what got backed up. I have fixed that.