I don’t know if I actually know enough to write this post. But I want to record what little I do know about this.
The symptom is that my tarragondata volume on this system, tarragon, claims to be out of space. This is a btrfs volume, about which there are other posts. It contains most of the dynamic parts of the system. The root volume ‘/’ is very small, about 20GB. Just enough to install the Centos code and keep a few little things. The great majority of the information needed to run the system is symlinked out of /, which includes /home, mail, databases, websites and their data, the repositories, certificates, local scripts etc.
This is a 180GB disk, and it currently is running about 55% full, i.e. almost 100GB used. Among the information on this disk are snapshots of all the tarragondata, every night for 30 days. This isn’t disaster backup/disk failure backup (which is elsewhere), this is “operator error” backup.
A couple of weeks ago I began to experience a new kind of failure. In the middle of the night, suddenly this btrfs volume would report that it was out of space – usage 100%, although the amount of storage in use was, still the roughtly 100GB that it normally uses. It manifestly was not actually out of space.
Continue reading Out of space on btrfs
After implementing the new tarragon the biggest problem I had involved the clamav package, and its loading of signatures. If clamd doesn’t come up and open its socket, then amavisd (the daemon who is consulted by postfix to handle all the checking of each piece of mail on input and output) will fail (assuming he is configured to do virus checking), This results in various problems. Amavis will mark the mail as “unchecked”, but worse, it will report failure back to postfix who gets confused and very often the message is delivered two or three times.
Clamd, the clamav daemon, now has over 6 million signatures. There are a lot of bad boys out there. The signatures are loaded by clamd from its database (in /var/lib/clamav) on startup, into memory. As a result, clamd has a large memory footprint, almost 800Mb on my system. The first issue, discovered before going live, was that systemd’s default parameters expect any daemon he starts to load within 90 seconds. If it fails to check in within that time, systemd considers it broken and terminates it. Clamd takes at least 3 minutes to load. I had to set a special TimeoutStartSec value in the systemd service script for clamd@.service.
Whew! I thought, boy I’m glad I figured that out. Hah!
Continue reading Clamd signatures and Apache memory
This server, on Amazon, hosts my website and a dozen others, provides mail service for several people’s email including my own with postfix, dovecot, opendkim, amavis, spamassassin and clamd, provides contacts and calendar service using radicale, provides vpn service with openvpn, provides a tor relay, provides nextcloud service, and hosts my svn repository.
The server was last rebuilt in 2017. Long, long ago when I built the first version of it, I was most familiar with Red Hat/Fedora, and since then it has been easiest just to upgrade it with Fedora, always grumbling to myself that someday I’m going to change it. The problem with being on Fedora, of course, is that Fedora changes every 6 months, so I’m constantly behind. And after a year I’m at end of life. This is dumb for a server that I don’t want to be messing with all the time.
Continue reading Tarragon Rebuild 2019
I now have 8 of these gateway boxes out there. This morning as I was checking backups on one of them, I observed that it took quite a long time to respond. I ran a top on it and was horrified to see that its memory use was 100% and so was its swap. Holy @#$%!@ Batman!
Most of the memory was being used by the lxpanel. And (hangs head in embarrassment) there were actually two lxpanels running – one for the console and one in the vnc window I launch at startup.
It seems the lxpanels leak. I don’t know how badly, but it doesn’t matter. These boxes are meant to run forever so even a tiny leak is eventually fatal.
Well this was simple. I will seldom, if ever, need to get into a graphical environment remotely, and if I do I can always start vnc from the command line. So I took out the startvnc from the startup script. And I have even LESS need for a graphical console since there is not even a monitor on these things. So I set the default systemd target to multi-user.target.
Did this on all the gateways that are running on pi-zeros. Those few running on bigger ubuntu boxes I didn’t really have the problem anyway.
After rebooting them they come up with no lxpanels. I’ll watch the memory use, but I think this will fix the problem.
I found out today something I am sure to forget.
In every zfs dataset there is an invisible directory (by invisible, I mean that it does NOT show up with ls -a) name .zfs. In side this directory are two subdirectories, shares and snapshots.
The snapshots subdirectory is a perfectly serviceable read-only access to all the snapshots. Viz:
Continue reading Invisible zfs snapshot directory
I want to be able to get to my wife’s mac, in another city. She is an unsophisticated user, and I’d like to be able to help her when she needs help, but I can’t ask her to do very much setup. I also want to be able to provide backup for her files.
The first step was to outfit her with one of the little gateway pis previously described. Once that was done, we managed, together, to enable me to get to her mac with ssh, by way of the pi tunnel. And we managed to set up an account on her mac under my name.
Continue reading Setting up a mac remotely
I realized as I was writing a new post that I had never documented the gateway pi undertaking.
This started when a friend in the mountains got a new internet service where the ISP would not allow him (and therefore me) access to his router. As a result I could no longer use ssh to connect to his systems.
I solved this problem by setting his systems up to use a tool called autossh, with which I could have his system start, monitor, and keep running an ssh daemon with reverse tunnels open to my system. I could then reach him by attaching through the reverse tunnels.
Continue reading Gateway pi
At one point I was getting low on space on tarragondata, so I added an additional physical device to the btrfs filesystem containing tarragondata.
[root@tarragon backup_scripts]# btrfs fi show
Label: 'tarragon_data' uuid: d6e4b6fc-8745-4e6e-b6b4-8548142b5154
Total devices 2 FS bytes used 92.04GiB
devid 1 size 120.00GiB used 120.00GiB path /dev/xvdf1
devid 2 size 30.00GiB used 30.00GiB path /dev/xvdg
This is fine, but there are a couple of problems. The main one is that I can no longer use the EC2 snapshot capability on tarragondata, which meant that the nightly EC2 snapshot feature I was using had to be deimplemented.
But now I am about to create a new tarragon instance, and it would be really helpful to be able to snapshot tarragondata (Amazon snapshot, not btrfs snapshot) and then create a new Amazon volume with a consistent snapshot for testing. Continue reading Adjusting the size of the tarragondata volume
On ubuntu it seems there is an automatic mdadm array check provided in /etc/cron.d/mdadm, automatically installed with mdadm. This invokes a utility /usr/share/mdadm/checkarray and the cron is set to run this on the first Sunday of every month at 12:57am. And it is set to do this check on all arrays at one time.
This is horrible! So with 5 arrays, totalling 25TB, when this sucker fires up it quickly saturates the i/o capacity of cinnamon, slows to a crawl and settles in to run forever.
I’ve commented that out, and added my own /etc/cron.d/dee_mdadm which doesn’t do all the goofy shenanigans to try to ensure the thing runs on a Sunday (WHY?! Because the guy who wrote it doesn’t work on Sunday?). Instead, my version simply runs on the first of the month, at 12:57am, and on each month it starts the consistency check on a different array. I have 5 arrays, so 3 are checked twice a year, and 2 are checked thrice. Checking just one at a time means there is a good chance it will be done before morning, at least for the small arrays.
I don’t really think the whole consistency check idea is doing me much good, but at least this doesn’t unaccountably bring the system to its knees on the first Sunday of every month.
I’ve used a variety of certificate providers over the years, Thawte, CA-Cert, Verisign, Comodo, Startcom. Until about six months ago I was using Startcom, and had spent a fair amount of energy setting that up for my own site (this one) as well as all the other sites I manage.
Then Wo-Sign acquired Startcom, and browsers starting distrusting Startcom. I ended up buying a cert from Comodo for this site.
But then I found out about Let’s Encrypt. Not only are they free, but they have this whole ACME auto update thing worked out, using various ACME clients. I’ve been using Certbot from EFF. Continue reading Updating certificates to “Let’s Encrypt” with ACME