Blog "about python 3"

Fri Jan 3 10:32:14 EST 2014

On Sat, Jan 4, 2014 at 1:57 AM, Roy Smith <roy at panix.com> wrote:
> I was doing a project a while ago importing 20-something million records
> into a MySQL database.  Little did I know that FOUR of those records
> contained astral characters (which MySQL, at least the version I was
> using, couldn't handle).
>
> My way of dealing with those records was to nuke them.  Longer term we
> ended up switching to Postgres.

Look! Postgres means you don't lose data!!

Seriously though, that's a much better long-term solution than
destroying data. But MySQL does support the full Unicode range - just
not in its "UTF8" type. You have to specify "UTF8MB4" - that is,
"maximum bytes 4" rather than the default of 3. According to [1],
UTF8MB4 covers the same range as UTF-16 (supplementary characters
included), while UTF8 stops at the BMP, the same range as UCS-2. And
according to [2], it's even possible to explicitly choose that
mindblowing BMP-only behaviour for a data type that calls itself
"UTF8", under the alias "UTF8MB3", so that a vague theoretical
subsequent version of MySQL might be able to make "UTF8" mean UTF-8,
and people who want the old behaviour can choose to use the other
alias.
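
To make the 3-versus-4 byte difference concrete, here's a rough
Python 3 sketch. The helper names are made up for illustration and it
never talks to MySQL; it only shows which characters need a fourth
byte in UTF-8, which I'm assuming is what decides whether a value fits
in a 3-byte "UTF8" column.

```python
def max_utf8_bytes(text):
    """Return the largest number of UTF-8 bytes any one character needs."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


def fits_in_three_bytes(text):
    """True if every character is in the BMP (code point <= U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# Hypothetical sample strings, not taken from any real data set.
bmp_sample = "na\u00efve \u20ac"      # U+00EF needs 2 bytes, U+20AC needs 3
astral_sample = "music \U0001D11E"    # U+1D11E MUSICAL SYMBOL G CLEF needs 4

print(max_utf8_bytes(bmp_sample), fits_in_three_bytes(bmp_sample))        # 3 True
print(max_utf8_bytes(astral_sample), fits_in_three_bytes(astral_sample))  # 4 False
```

A check like fits_in_three_bytes() could be run before an import to
find the handful of records that need a 4-byte column, rather than
discovering them when the insert fails.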

To my mind, this is a bug with backward-compatibility concerns. That
means it can't be fixed in a point release. Fine. But the behaviour
change is "this used to throw an error, now it works". Surely that can
be fixed in the next release. Or, failing that, surely there could be
a version or two of deprecating "UTF8" in favour of the two "MB?"
types (and never ever returning "UTF8" from any query), followed by a
reintroduction of "UTF8" as an alias for MB4, and the deprecation of
MB3. Or am I
spoiled by the quality of Python (and other) version numbering, where
I can (largely) depend on functionality not changing in point
releases?

ChrisA

[1] http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb4.html
[2] http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb3.html