Tarragon Rebuild 2019

This server, on Amazon, hosts my website and a dozen others; provides mail service for several people, including me, with postfix, dovecot, opendkim, amavis, spamassassin and clamd; provides contacts and calendar service using radicale; provides vpn service with openvpn; runs a tor relay; provides nextcloud service; and hosts my svn repository.

The server was last rebuilt in 2017. Long, long ago when I built the first version of it, I was most familiar with Red Hat/Fedora, and since then it has been easiest just to upgrade it with Fedora, always grumbling to myself that someday I’m going to change it. The problem with being on Fedora, of course, is that Fedora changes every 6 months, so I’m constantly behind. And after a year I’m at end of life. This is dumb for a server that I don’t want to be messing with all the time.

So this time I changed it, but only to Centos. Because it runs a lot of services and other people rely on it, I don’t like to have it down much. I thought a move within the Red Hat family might be less stressful than moving to Debian or Arch.

In September Centos 8 was released, and silly me, I thought I would be able to move to Centos 8, and be good until 2029 or something. But the Centos folks didn’t manage to get a new free Centos community AMI out there within the last month, so in the end I went ahead with Centos 7. My Amazon reserved instance is only good for another year anyway, so I’ll probably end up having to revisit this next year.

A word about objectives. A big one was to get off of the Fedora treadmill. It makes no sense for this server box to be on a distro that has the specific aim of adopting cutting edge stuff, issuing a new release every six months, and having releases only last a year. This isn’t a knock on Fedora (though I have a lot of knocks on Fedora ready, if you want to hear them). It is just the wrong distro for this purpose.

A second objective was to change the way I managed the “tarragon data” drive (see this post, and this one), basically the drive that contains all the personality of tarragon other than the kernel. In particular, I wanted to alter the way I use snapshots to keep backups. Previously all the different components were in their own btrfs subvolumes (/home, databases, webdata, repositories, certificates, websites, mail, local bin, opt, backup), and they were individually snapped and backed up. This worked, but I ended up with a very large number of snapshots. And this made it impossible to manage the overall size of the volume by adding and deleting additional btrfs devices. The btrfs device delete simply can’t reconcile that many snapshots. And furthermore, there is no reason to snap the stuff individually. Instead, I will now have one subvolume of live stuff with subdirectories for all the components, and I snap the whole subvolume, and keep the snaps for 30 days. So 30 snaps max, rather than 300.
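
The 30-day retention rule can be sketched roughly as follows. This is a minimal illustration, not my actual backup script: the snapshot naming scheme (`live-YYYY-MM-DD`) and the function name are assumptions made up for the example. It only decides which snapshots have aged out; it does not touch the filesystem.

```python
from datetime import date, timedelta

SNAPSHOT_PREFIX = "live-"   # hypothetical naming scheme: live-YYYY-MM-DD
KEEP_DAYS = 30              # retention period described in the text


def snapshots_to_delete(names, today, keep_days=KEEP_DAYS):
    """Return the snapshot names that fall outside the retention window, oldest first."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        if not name.startswith(SNAPSHOT_PREFIX):
            continue  # not one of the daily snapshots; leave it alone
        try:
            taken = date.fromisoformat(name[len(SNAPSHOT_PREFIX):])
        except ValueError:
            continue  # date part is unparseable; leave it alone
        if taken <= cutoff:
            expired.append((taken, name))
    return [name for _, name in sorted(expired)]


if __name__ == "__main__":
    existing = ["live-2019-10-01", "live-2019-10-02", "live-2019-10-03", "live-2019-11-01"]
    for name in snapshots_to_delete(existing, date(2019, 11, 1)):
        print("would delete", name)
```

With one snapshot a day, this keeps at most 30 snapshots. Each name returned would then be removed with `btrfs subvolume delete` on the corresponding snapshot path.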

This necessitated big changes in the backup scripts, which also enabled making some conceptual changes that make the backups more straightforward.

A third (fourth?), somewhat smaller but still important objective has to do with the authorization database I use for so many things (email access, httpd access, radicale, nextcloud). That database still had (hangs head in shame again) some old MD5 hashed passwords. I had added blowfish passwords some time ago, but I couldn’t convert the last of the “password consuming” software, dovecot authentication, until I got a newer version of dovecot which would support bcrypt. I really wanted to get rid of those MD5 password hashes. Unlikely that serious crackers are targeting my box, but it is a matter of principle.
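
One way to check that the last MD5 hashes are really gone is to classify each stored hash by its format. The sketch below is an illustration only: it assumes the hashes are stored either in the usual crypt-style format (`$1$` for MD5-crypt, `$2a$`/`$2b$`/`$2y$` for bcrypt) or as bare 32-character hex MD5 digests, and that they have already been fetched as (user, hash) pairs. The function names are made up for the example.

```python
import re
from collections import Counter

BCRYPT_PREFIXES = ("$2a$", "$2b$", "$2y$")


def hash_scheme(stored):
    """Classify a stored password hash by its format."""
    if stored.startswith(BCRYPT_PREFIXES):
        return "bcrypt"
    if stored.startswith("$1$"):
        return "md5-crypt"
    if re.fullmatch(r"[0-9a-fA-F]{32}", stored):
        return "md5-hex"
    return "unknown"


def audit(rows):
    """Count schemes across (user, hash) rows and list users still on MD5."""
    counts = Counter()
    legacy_users = []
    for user, stored in rows:
        scheme = hash_scheme(stored)
        counts[scheme] += 1
        if scheme.startswith("md5"):
            legacy_users.append(user)
    return counts, legacy_users
```

An empty list of legacy users would mean every consumer of the database can be switched to bcrypt-only verification.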

I built a careful plan, and I worked very hard on the restore script. I worked on it for over a month. I did sixteen full-scale rebuilds of the server onto new instances (plus 7 more that were abandoned early without doing the full restore). I learned a great deal about building new instances quickly. I ended up creating a 31 page journal of the journey, which contains a lot of detail about what I learned.

Here are some of the major points:

  • The EC2 feature “build more like this one”, together with adding “user-data” (containing my user and my ssh key) to the EC2 build, saves a lot of steps.
  • The advantage of Centos 7 is stability, and that it will get maintenance releases for a long time to come. The big disadvantage, when coming from Fedora 24, was that there were a number of packages which were just too old in Centos 7, and I had to resort to adding third party or upstream repositories (for svn, php, mariadb, nextcloud, bugzilla and some others).
  • I had planned ahead of time not to attempt to copy things which could just as easily be reconstructed, and that was a good decision. I rebuilt databases from mysql dumps. I checked out websites from svn where possible. I checked out the local bin and the backup scripts.
  • There was a lot of inconsistency in my handling of functions that have an http interface, but which aren’t really user websites: for example, bugzilla, nextcloud, subversion, git, rainloop, phpmyadmin, webalizer, maia, radicale, and mail configuration. This was a good opportunity to clean a lot of that up.
  • I was able to clean up all the certificate handling. Since moving to letsencrypt I had a lot of unnecessary machinery lying around from the old days which could be dispensed with.
  • Another important lesson learned was how to get the mysql database information across. It is a bad idea to move the eponymous mysql database of course. But I found a tool (percona toolkit) which enabled me to dump the grants information. So start with an empty /var/lib/mysql, let mysql/maria build a totally new environment, then use mysqladmin to create all the (empty) databases, pull the grants information out of the file created with the toolkit, and then source all the mysql dumps.
  • I quickly learned to move (disassociate and reassociate) ip addresses using the Amazon EC2 “elastic ip” feature, so that having rebuilt a new instance, I move the “test” ip address to it and I have access to it via dns right away.
  • I should have extrapolated from that learning at the end, when I went live. But I didn’t. When I finally went live, I updated all the dns for all the domains with the new ip address. And then had problems because (a) I had failed to think about it ahead of time and reduce the TTL, so some people couldn’t get mail for many hours, and (b) the new ip address had no mail reputation, so several mail providers rejected mail from the new address. Then my friend said “…can’t you just use the same ip address?”. Duh. I was moving ip addresses back and forth the whole time, because Amazon makes it so easy, and then I failed to do that for going live!? Dumb. Stop the old instance and move its address to the new instance (and put the “test” address onto the old instance, so it can be reached). Easy.
  • Towards the end I decided it would be better to split the restore script up into 3 parts, instead of 2. Right now it runs the first part of the restore, does a reboot to pick up all the fstab entries for the tarragondata components, and then does a bunch more restoration. I need to separate out stuff that can really wait, such as reinstalling bugzilla and reinstalling some test websites with their databases. Leaving that stuff in makes the restore process a lot longer, especially on the go-live day, when the system is down for users. Installing all the perl modules for bugzilla takes an age, and doesn’t help any users. But I haven’t gone back and made this change to the script.
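
The order of the mysql restore described in the list above can be sketched as a small plan generator. This is an illustration under assumptions, not an excerpt from my restore script: the dump directory layout (one `<database>.sql` file per database), the grants file name, and the function name are all hypothetical, and it assumes the mysql client tools pick up credentials from the usual client configuration. The grants file is the one produced on the old server with the percona toolkit (presumably its `pt-show-grants` tool). The function only builds the commands; it does not run them.

```python
import shlex


def mysql_restore_plan(databases, dump_dir, grants_file):
    """Return shell commands, in order, for rebuilding databases on a fresh server."""
    commands = []
    # 1. Create every database empty, so the grants and dumps have a target.
    for db in databases:
        commands.append(f"mysqladmin create {shlex.quote(db)}")
    # 2. Replay the saved GRANT statements (users and privileges).
    commands.append(f"mysql < {shlex.quote(grants_file)}")
    # 3. Load each dump into its own database.
    for db in databases:
        dump = f"{dump_dir}/{db}.sql"
        commands.append(f"mysql {shlex.quote(db)} < {shlex.quote(dump)}")
    return commands


if __name__ == "__main__":
    plan = mysql_restore_plan(["bugzilla", "nextcloud"], "/backup/dumps", "/backup/grants.sql")
    print("\n".join(plan))
```

The point of the ordering is that the system `mysql` database is never copied: the fresh server builds its own, and users and privileges arrive only through the replayed grants.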

There were three major issues:

  • The first has already been mentioned, that I had a long TTL on the dns, and I stupidly tried to change the dns A record (instead of moving the ip address), so some folks had to wait a long time for the change to propagate.
  • Clamd signature loading. This reserved instance only has 3.75 GB of RAM, and I want to stay on the reserved instance for all the 3 years I bought, so another whole year. But after the box has been running for a while and apache has blimped up to its full size, there isn’t enough free memory left for clamd to load 6 million signatures, and it fails trying to do its periodic update of the signatures. There just isn’t enough RAM. This still isn’t completely handled, and right at this minute virus protection is (gulp) turned off. I think I have a separate post to write about this.
  • The nextcloud configuration has been troublesome to get right, once I adopted the upstream version; and I’m not sure still that I have it right. There are a lot of log records being generated.
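
The TTL part of the first issue comes down to simple arithmetic, sketched below. The numbers in the example (a 24-hour TTL, a 5-minute TTL, the cutover time) are hypothetical, not my actual settings, and the function names are made up. It also assumes resolvers honour the TTL, which not all of them do.

```python
from datetime import datetime, timedelta


def lower_ttl_deadline(cutover, old_ttl_seconds):
    """Latest moment to publish a reduced TTL so that every cache has it by cutover."""
    return cutover - timedelta(seconds=old_ttl_seconds)


def stale_until(record_changed, ttl_seconds):
    """Latest moment a well-behaved resolver may still serve the old record."""
    return record_changed + timedelta(seconds=ttl_seconds)


if __name__ == "__main__":
    cutover = datetime(2019, 11, 2, 9, 0)
    print("lower the TTL no later than:", lower_ttl_deadline(cutover, 86400))
    print("old address visible until (TTL 86400):", stale_until(cutover, 86400))
    print("old address visible until (TTL 300):", stale_until(cutover, 300))
```

Moving the elastic ip instead of changing the A record sidesteps all of this, because the DNS never changes.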

I was really stressed when I went live with this change, particularly the first day when I was getting a lot of duplicate mail as a result of amavisd being unable to find the crashed virus scanner. But I have to say when all was said and done, I didn’t have that many big problems. Of course, it has only been a couple of days. There may still be beasts lurking waiting to bite me as soon as I become complacent.