Putting Unicode characters in JSON

Thu Mar 22 21:05:34 EDT 2018

On Fri, Mar 23, 2018 at 11:39 AM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> On Fri, 23 Mar 2018 11:08:56 +1100, Chris Angelico wrote:
>> Okay. Give me a good reason for the database itself to be locked to
>> Latin-1. Make sure you explain how potentially saving the occasional
>> byte of storage (compared to UTF-8) justifies limiting the available
>> character set to the ones that happen to be in Latin-1, yet it's
>> essential to NOT limit the character set to ASCII.
>
> I'll do better than that, I'll give multiple good reasons to use Latin-1.
>
> It's company policy to only use Latin-1, because the CEO was once
> employed by the Unicode Consortium, and fired in disgrace after
> embezzling funds, and ever since then he has refused to use Unicode.

You clearly can't afford to quit your job, so I won't mention that
possibility. (Oops too late.) But a CEO is not God, and you *can*
either dispute or subvert stupid orders. I don't consider this a
*good* reason. Maybe a reason, but not a good one.

> Compatibility with other databases, systems or tools that require Latin-1.

Change them one at a time. When you have to pass data to something
that has to receive Latin-1, you encode it to Latin-1. The database
can still store UTF-8. Leaving it at Latin-1 is not a "good reason for
using Latin-1", so much as "we haven't gotten around to changing it
yet".
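
A minimal sketch of that boundary conversion. The helper name
to_legacy_bytes is hypothetical and stands in for whatever code hands
data to the Latin-1-only consumer; the strict error handling is an
assumption about what you would want there.

```python
def to_legacy_bytes(text: str) -> bytes:
    """Encode text for a consumer that only accepts Latin-1.

    Hypothetical helper: the name and errors="strict" are assumptions
    made for illustration, not anything from the original system.
    """
    # "strict" raises UnicodeEncodeError for anything outside Latin-1.
    return text.encode("latin-1", errors="strict")


stored = "café"                      # what the database holds, as text
as_utf8 = stored.encode("utf-8")     # bytes a UTF-8 database would store
as_latin1 = to_legacy_bytes(stored)  # converted only at the boundary
```

The text itself never changes; only the bytes produced at each edge
differ, and a character outside Latin-1 fails loudly at the point
where it matters.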

> The database has to send information to embedded devices that don't
> include a full Unicode implementation, but do support Latin-1.

Okay, that's a valid reason, if an incredibly rare one. You have to
specifically WANT an encoding error if you try to store something
that, later on, will cause problems. It's like asking for a 32-bit
signed integer type in Python, because you're using it for something
where it's eventually going to be sent to something that can't use
larger numbers. That's not something that usually warrants a core
feature.
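
To make the analogy concrete, here is a rough sketch of doing both
checks in application code rather than in the storage layer. The
function names check_latin1 and check_int32 are made up for this
example.

```python
def check_latin1(text: str) -> str:
    """Return text unchanged, or raise ValueError if it won't fit Latin-1."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError as exc:
        # exc.object/start/end identify the offending characters.
        bad = exc.object[exc.start:exc.end]
        raise ValueError(f"not representable in Latin-1: {bad!r}") from exc
    return text


def check_int32(n: int) -> int:
    """Return n unchanged, or raise ValueError if it won't fit in 32 signed bits."""
    if not -2**31 <= n <= 2**31 - 1:
        raise ValueError(f"out of 32-bit signed range: {n}")
    return n
```

Either check can run just before the value is stored, so the error
shows up early instead of when the embedded device chokes on it.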

> The data doesn't actually represent text, but Python 2 style
> byte-strings, and Latin-1 is just a convenient, easy way to get that
> while ensuring ASCII bytes look like ASCII characters.

The OP is talking about JSON. That reason makes no sense in that context.
And if it really is a byte string, why store it as a Latin-1 string?
Store it as the type BLOB instead. Latin-1 is not "arbitrary bytes".
It is a specific character encoding: as defined by ISO/IEC 8859-1, it
assigns no characters to 0x00-0x1F or 0x7F-0x9F, even though codecs
such as Python's 'latin-1' will map all 256 byte values. Using Latin-1
to store arbitrary bytes is just as wrong as using ASCII to store
eight-bit data.
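
For illustration, a small sketch of the BLOB approach using the
standard library's sqlite3 module. The table and column names are
invented for the example, and SQLite is only an assumption here; the
thread doesn't say which database is involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one BLOB column for the raw byte payload.
conn.execute("CREATE TABLE payloads (id INTEGER PRIMARY KEY, data BLOB)")

raw = bytes(range(256))  # every possible byte value, no encoding involved
conn.execute("INSERT INTO payloads (data) VALUES (?)", (raw,))

# A Python bytes object is stored as a BLOB and comes back as bytes.
(roundtrip,) = conn.execute("SELECT data FROM payloads").fetchone()
conn.close()
```

No text encoding is ever named, so there is nothing to get wrong.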

So, you've given me one possible reason that is EXTREMELY situational
and, even there, could be handled differently. And it's only valid
when you're working with something that supports more than ASCII and
no more than Latin-1, and moreover, you have the need for non-ASCII
characters. (Otherwise, just use ASCII, which you can declare as UTF-8
if you wish.)
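
As a sketch of that last point in the JSON case: Python's json.dumps
escapes non-ASCII characters by default (ensure_ascii=True), so its
output is pure ASCII and reads identically whether it is declared as
UTF-8 or Latin-1. The sample data is made up.

```python
import json

data = {"name": "Zoë", "price": "€5"}  # made-up sample data

# ensure_ascii=True is the default: non-ASCII becomes \uXXXX escapes.
ascii_json = json.dumps(data)
payload = ascii_json.encode("ascii")   # succeeds: nothing above 0x7F

# The same bytes decode to the same text under either declaration.
from_utf8 = payload.decode("utf-8")
from_latin1 = payload.decode("latin-1")
restored = json.loads(from_latin1)     # escapes turn back into characters
```

That even lets a character like the euro sign, which Latin-1 lacks,
pass through a Latin-1-only channel untouched.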

ChrisA