Making MySQL serve UTF8 correctly

If the MySQL server decides that its default environment is UTF8, and that its client actually wants Latin1, it will translate the return values.

I’ve never before had to be careful of the distinction. Perhaps once in a blue moon I would have a record with a “real” quotation mark or a character with an accent, but if it didn’t work correctly, it was never much of a bother.

Once it became important, I had to understand what was happening. I have a table that has filenames in it, and some of those filenames contain characters (a acute, e grave, o umlaut, etc). The actual files on disk have the names encoded in utf8. The records in the database are also recorded in utf8. But the records were being translated by mysql from utf8 to latin1 as they came in. So “Mamá” was recorded on disk as Mam\xc3\xa1 in the directory, and in the database, but when I got the row into memory, the filename field said Mam\xe1. The difference between latin1 and utf8 for this purpose, is that all these many “western/latin special characters” were actually mapped in latin1 to values within the 256 characters available with 1 byte. So the first 128 in the latin1 codespace were ordinary ascii, and the high order 128 had as many of the western/latin diacriticals as possible crammed in there. And in latin1, e9 is a-acute.

But on the web these days, utf8 is much preferred. Latin1 is ok if all you want is the carefully selected subset of 128 characters that can be shoehorned into the high end of the code-point space. But utf8 is a far more general solution. Using a multibyte sequence to represent over a million characters and special symbols.

Turns out mysql has a bunch of variables to control character set and collating sequence. With phpmyadmin, one can look at database->Variables and see characters_set_database, character_set_filesystem, character_set_result, character_set_server, character_set_system, and a bunch more. Or in mysql client one can show varaiables like ‘%character_set%’;

My problem was that the server had come up believing that some of these were set to utf8 and some were set to latin1. I haven’t tried to figure out the logic of how it figures out its default – I don’t want it to default, I want to tell it what I want. So the solution was to add the directive: “character-set-server=utf8” to the mysql configuration file (on cinnamon it was /etc/mysql/mysql.conf.d/mysqld.conf. After restarting all of the relevant character_set_xxx variables come up as utf8.

Update: 8/9/17

I used these changes also on oregano and tarragon, but it results in a different problem for me on blogforacure data. The blogforacure database, built a long time ago, has lots of tables in latin1. There are not a lot of non-ascii characters, but there are a few. One frequent source is people typing double space after a period, which the ckeditor tries to preserve by creating a non-breaking space, which is hex A0 in latin 1. When the site reads this back, if mysql is told that the database is actually utf8, then it displays this as Â. So if I see a bunch of A circumflex in the output, it means I actually have latin 1 characters in the database, which I am interpreting as if they were utf-8.

Removing the specification of character-set-server=utf8 causes the negotiation to give the right result, and the latin1 non-breaking space appears correctly in the output.