Putting Unicode characters in JSON

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Mar 23 20:11:45 EDT 2018


On Fri, 23 Mar 2018 07:46:16 -0700, Tobiah wrote:

> If I changed my database tables to all be UTF-8 would this work cleanly
> without any decoding?

Not reliably or safely. It will appear to work so long as you have only 
pure ASCII strings from the database, and then crash when you don't:

py> text_from_database = u"hello wörld".encode('latin1')
py> print text_from_database
hello w�rld
py> json.dumps(text_from_database)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/lib/python2.7/json/encoder.py", line 195, in encode
    return encode_basestring_ascii(o)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 7: invalid start byte
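
Explicitly decoding the bytes to unicode before serialising does work; a 
minimal sketch, assuming the bytes really are Latin-1:

py> json.dumps(text_from_database.decode('latin-1'))
'"hello w\\u00f6rld"'

But that explicit decoding step is exactly what you were hoping to avoid.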


> Whatever people are doing to get these characters
> in, whether it's foreign keyboards, or fancy escape sequences in the web
> forms, would their intended characters still go into the UTF-8 database
> as the proper characters? Or now do I have to do a conversion on the way
> in to the database?

There is no way to answer that, because it depends on how you are getting 
the characters, what you are doing to them, and how you put them in the 
database.

In the best possible scenario, your process is:

- user input comes in as UTF-8;
- you store it in the database;
- the database converts it to Latin-1 (sometimes losing data: see below)

in which case, changing the database field to utf8mb4 (NOT plain utf8: 
thanks to a ludicrously idiotic design flaw, MySQL's "utf8" tops out at 
three bytes per character, so it is not actually UTF-8) should work 
nicely.
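
Once the column really is utf8mb4, the bytes you get back are UTF-8, and 
the decode step is trivial. A minimal sketch, assuming your driver hands 
you raw UTF-8 bytes rather than unicode:

py> text_from_database = u'hello wörld'.encode('utf-8')
py> json.dumps(text_from_database.decode('utf-8'))
'"hello w\\u00f6rld"'

(Many drivers can return unicode directly, e.g. MySQLdb's use_unicode 
flag together with charset='utf8mb4', in which case there is nothing to 
decode.)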

I mentioned losing data: if your user enters, say, the Greek letters 
'αβγ' (or emoji, or any of about a million other characters), then 
Latin-1 cannot represent them. Presumably your database is throwing them 
away:


py> s = u'αβγ'  # what the user wanted
py> db = s.encode('latin-1', errors='replace')  # what the database recorded
py> json.dumps(db.decode('latin-1'))  # what you end up with
'"???"'


Or, worse, you're getting mojibake:

py> s = u'αβγ'  # what the user wanted
py> json.dumps(s.encode('utf-8').decode('latin-1'))
'"\\u00ce\\u00b1\\u00ce\\u00b2\\u00ce\\u00b3"'



> We also get import data that often comes in .xlsx format.  What encoding
> do I get when I dump a .csv from that?  Do I have to ask the sender?  I
> already know that they don't know.

They never do :-(

In Python 2, I believe the csv module will assume ASCII-plus-random-crap, 
and it will work fine so long as the data actually is ASCII. Otherwise 
you'll get random crap: possibly an exception, possibly mojibake.
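
The csv module in Python 2 traffics in byte strings, so any decoding is 
up to you. A minimal sketch, assuming the dump is UTF-8 (the filename 
'import.csv' is made up):

import csv

with open('import.csv', 'rb') as f:
    for row in csv.reader(f):
        # cells arrive as bytes; decode them yourself
        row = [cell.decode('utf-8') for cell in row]
        # a UnicodeDecodeError here means the guess was wrong

If the sender can't tell you the encoding, trying 'utf-8' first and 
falling back to 'latin-1' (which never raises) at least keeps you 
running, at the cost of possible mojibake.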

The sad truth is that as soon as you leave the nice, clean world of pure 
Unicode, and start dealing with legacy encodings, everything turns to 
quicksand.

If you haven't already done so, you really should start by reading Joel 
Spolsky's introduction to Unicode:

http://global.joelonsoftware.com/English/Articles/Unicode.html

and Ned Batchelder's post on dealing with Unicode and Python:

https://nedbatchelder.com/text/unipain.html



-- 
Steve



