Putting Unicode characters in JSON

Chris Angelico rosuav at gmail.com
Fri Mar 23 06:35:34 EDT 2018


On Fri, Mar 23, 2018 at 9:29 PM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> On Fri, 23 Mar 2018 18:35:20 +1100, Chris Angelico wrote:
>
>> That doesn't seem to be a strictly-correct Latin-1 decoder, then. There
>> are a number of unassigned byte values in ISO-8859-1.
>
> That's incorrect, but I don't blame you for getting it wrong. Who thought
> that it was a good idea to distinguish between "ISO 8859-1" and
> "ISO-8859-1" as two related but distinct encodings?
>
> https://en.wikipedia.org/wiki/ISO/IEC_8859-1
>
> The old ISO 8859-1 standard, the one with undefined values, is mostly of
> historical interest. For the last twenty years or so, anyone talking
> about either Latin-1 or ISO-8859-1 (with or without dashes) almost
> certainly means the 1992 IANA superset version, which defines all 256
> characters:
>
>     "In 1992, the IANA registered the character map ISO_8859-1:1987,
>     more commonly known by its preferred MIME name of ISO-8859-1
>     (note the extra hyphen over ISO 8859-1), a superset of ISO
>     8859-1, for use on the Internet. This map assigns the C0 and C1
>     control characters to the unassigned code values thus provides
>     for 256 characters via every possible 8-bit value."
>
>
> Either that, or they actually mean Windows-1252, but let's not go there.
>

Wait, whaaa.......

Though in my own defense, MySQL itself seems to have a bit of a
problem with encoding names. Its "utf8" is actually "UTF-8 with a
maximum of three bytes per character", in contrast to "utf8mb4" which
is, well, UTF-8.
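
A rough way to see which strings are affected, sketched in Python: only
characters outside the Basic Multilingual Plane need four bytes in UTF-8,
so those are the ones a three-byte limit cannot hold. The helper name
fits_in_three_byte_utf8 is made up for this illustration and is not part
of MySQL or any library.

```python
def fits_in_three_byte_utf8(text: str) -> bool:
    """Hypothetical helper: True if every character in `text` encodes to
    at most three bytes in UTF-8 (i.e. nothing above U+FFFF)."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


bmp_only = "na\u00efve caf\u00e9 \u20ac"   # accented letters and the euro sign
with_astral = "snake \U0001F40D"           # U+1F40D needs four bytes

print(fits_in_three_byte_utf8(bmp_only))     # True
print(fits_in_three_byte_utf8(with_astral))  # False
```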

In any case, abusing "Latin-1" to store binary data is still wrong.
That's what BLOB is for.
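
For what it's worth, Python's own "latin-1" codec behaves like the
256-character version described in the quote: every byte value decodes,
and the result round-trips. A minimal sketch of that, which also shows
why the trick is tempting and why the resulting "text" is not really
text (the 0x80-0x9F range comes out as control characters):

```python
import unicodedata

raw = bytes(range(256))

# Every byte value decodes, each to the code point with the same number.
text = raw.decode("latin-1")
code_points = [ord(ch) for ch in text]

# Encoding again gives back exactly the original bytes.
round_tripped = text.encode("latin-1")

# The byte 0x80 becomes U+0080, which is a control character (category Cc).
category_of_0x80 = unicodedata.category(text[0x80])

print(code_points == list(range(256)))
print(round_tripped == raw)
print(category_of_0x80)
```

The round trip only shows that no bytes are lost at the Python level. It
says nothing about what a database will do to a text column during
collation or character set conversion.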

ChrisA


