Putting Unicode characters in JSON

Thu Mar 22 20:39:37 EDT 2018

On Fri, 23 Mar 2018 11:08:56 +1100, Chris Angelico wrote:

> On Fri, Mar 23, 2018 at 10:47 AM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> On Fri, 23 Mar 2018 07:09:50 +1100, Chris Angelico wrote:
>>
>>>> I was reading though, that JSON files must be encoded with UTF-8.  So
>>>> should I be doing string.decode('latin-1').encode('utf-8')?  Or does
>>>> the json module do that for me when I give it a unicode object?
>>>
>>> Reconfigure your MySQL database to use UTF-8. There is no reason to
>>> use Latin-1 in the database.
>>
>> You don't know that. You don't know what technical, compatibility,
>> policy or historical constraints are on the database.
> 
> Okay. Give me a good reason for the database itself to be locked to
> Latin-1. Make sure you explain how potentially saving the occasional
> byte of storage (compared to UTF-8) justifies limiting the available
> character set to the ones that happen to be in Latin-1, yet it's
> essential to NOT limit the character set to ASCII.

I'll do better than that: I'll give multiple good reasons to use Latin-1.

It's company policy to only use Latin-1, because the CEO was once 
employed by the Unicode Consortium, and fired in disgrace after 
embezzling funds, and ever since then he has refused to use Unicode.

Compatibility with other databases, systems or tools that require Latin-1.

The database has to send information to embedded devices that don't 
include a full Unicode implementation, but do support Latin-1.

The data doesn't actually represent text, but Python 2 style 
byte-strings, and Latin-1 is just a convenient, easy way to store those 
bytes that ensures ASCII bytes look like ASCII characters.
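
To illustrate that last point, here is a minimal sketch (the variable 
names are mine, purely for illustration). It shows that Latin-1 maps each 
of the 256 possible byte values to the code point with the same number, so 
arbitrary bytes survive a decode/encode round trip, and ASCII bytes come 
out as the matching ASCII characters:

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Latin-1 decodes byte N to code point N, so no byte is lost or rejected.
as_text = all_bytes.decode('latin-1')
code_points = [ord(c) for c in as_text]

# Encoding back gives exactly the original bytes.
round_tripped = as_text.encode('latin-1')

# ASCII bytes look like the ASCII characters you would expect.
ascii_sample = b'Hello'.decode('latin-1')

print(code_points == list(range(256)))   # True
print(round_tripped == all_bytes)        # True
print(ascii_sample)                      # Hello
```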

>>> If that isn't an option, make sure your JSON files are pure ASCII,
>>> which is the common subset of UTF-8 and Latin-1.
>>
>> And that's utterly unnecessary, since any character which can be stored
>> in the Latin-1 MySQL database can be stored in the Unicode JSON.
>>
>>
> Irrelevant; if you fetch eight-bit data out of the database, it isn't
> going to be a valid JSON file unless (1) it's really ASCII, like I
> suggest; (2) you re-encode it to UTF-8; or (3) it was actually UTF-8 all
> along, despite being declared as Latin-1.

As Tobiah pointed out in his question, he's fetching the data from the 
database, calling decode('latin-1'), and placing the resulting Unicode 
string into the JSON. There's no need to explicitly encode the Unicode 
string to a UTF-8 byte string; in fact it is the wrong thing to do, since 
json.dumps() doesn't accept byte strings:

# The right way is to use a Unicode string.

py> import json
py> json.dumps("Hello ü")
'"Hello \\u00fc"'

# The wrong way is to encode to UTF-8 first.

py> json.dumps("Hello ü".encode('utf-8'))
Traceback (most recent call last):
  ...
TypeError: b'Hello \xc3\xbc' is not JSON serializable

(Results in Python 3; the exact wording of the TypeError varies between 
versions. Python 2 may be doing something shonky.)
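
For the record, here is a rough sketch of the whole round trip in Python 3. 
The byte string standing in for the database value, the 'greeting' key and 
the variable names are all made up for illustration, and I'm assuming the 
database hands back Latin-1 encoded bytes. The only place UTF-8 comes into 
it is at the very end, when the finished JSON text is turned into bytes for 
a file or the network:

```python
import json

# Hypothetical value as fetched from the Latin-1 database: "Hello ü".
raw = b'Hello \xfc'

# Decode to a Unicode string first.
text = raw.decode('latin-1')

# Default behaviour: non-ASCII characters are escaped, so the JSON text is
# pure ASCII (and therefore valid UTF-8 as well).
escaped = json.dumps({'greeting': text})

# With ensure_ascii=False the character is kept as-is in the JSON text.
# Encode the *result* to UTF-8 when writing it out, not the input.
literal = json.dumps({'greeting': text}, ensure_ascii=False)
utf8_bytes = literal.encode('utf-8')

# Both forms decode back to the same data.
same = json.loads(escaped) == json.loads(utf8_bytes.decode('utf-8'))

print(escaped)
print(utf8_bytes)
print(same)
```

Either form is fine as far as JSON is concerned; the escaped one has the 
advantage of surviving tools that mangle non-ASCII bytes.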

-- 
Steve