Putting Unicode characters in JSON

Chris Angelico rosuav at gmail.com
Fri Mar 23 11:28:31 EDT 2018


On Sat, Mar 24, 2018 at 1:46 AM, Tobiah <toby at tobiah.org> wrote:
> On 03/22/2018 12:46 PM, Tobiah wrote:
>>
>> I have some mailing information in a Mysql database that has
>> characters from various other countries.  The table says that
>> it's using latin-1 encoding.  I want to send this data out
>> as JSON.
>>
>> So I'm just taking each datum and doing 'name'.decode('latin-1')
>> and adding the resulting Unicode value right into my JSON structure
>> before doing .dumps() on it.  This seems to work, and I can consume
>> the JSON with another program and when I print values, they look nice
>> with the special characters and all.
>>
>> I was reading, though, that JSON files must be encoded with UTF-8.  So
>> should I be doing string.decode('latin-1').encode('utf-8')?  Or does
>> the json module do that for me when I give it a unicode object?
>
>
>
> Thanks for all the discussion.  A little more about our setup:
> We have used a LAMP stack system for almost 20 years to deploy
> hundreds of websites.  The database tables are latin-1 only because
> at the time we didn't know how or care to change it.
>
> More and more, 'special' characters caused a problem.  They would
> not come out correctly in a .csv file or wouldn't print correctly.
> Lately, I noticed that a JSON file I was sending out was delivering
> unreadable characters.  That's when I started to look into Unicode
> a bit more.  From the discussion, and my own guesses, it looks
> as though all I have to do is string.decode('latin-1'), and stuff
> that Unicode object right into my structure that gets handed to
> json.dumps().

Yep, this is sounding more and more like you need to go UTF-8 everywhere.
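
Just to make that concrete, here's a rough sketch of that flow
(Python 2, untested; the sample values are invented):

# -*- coding: utf-8 -*-
import json

# Byte strings as they might come back from a latin-1 table.
row = {'name': 'Jos\xe9', 'city': 'M\xfcnchen'}

# Decode every byte string to a unicode object first.
record = {key: value.decode('latin-1') for key, value in row.items()}

# json.dumps() accepts unicode objects directly; ensure_ascii=False
# keeps the real characters, and encoding the result as UTF-8 gives
# you bytes that any JSON consumer should accept.
payload = json.dumps(record, ensure_ascii=False).encode('utf-8')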

> If I changed my database tables to all be UTF-8 would this
> work cleanly without any decoding?  Whatever people are doing
> to get these characters in, whether it's foreign keyboards,
> or fancy escape sequences in the web forms, would their intended
> characters still go into the UTF-8 database as the proper characters?
> Or now do I have to do a conversion on the way in to the database?

The best way to do things is to let your Python-MySQL bridge do the
decoding for you; you'll simply store and get back Unicode strings.
That's how things happen by default in Python 3 (I believe; been a
while since I used MySQL, but it's like that with PostgreSQL). My
recommendation is to give it a try; most likely, things will just
work.
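
For instance (a sketch only, assuming the pymysql driver; MySQLdb
takes the same charset argument, and the connection details below
are placeholders):

import pymysql

# charset='utf8mb4' tells the driver to talk UTF-8 to the server and
# hand rows back as unicode strings, so no manual .decode() is needed.
conn = pymysql.connect(host='localhost', user='webuser',
                       password='secret', database='mailing',
                       charset='utf8mb4')
try:
    with conn.cursor() as cur:
        cur.execute("SELECT name, city FROM contacts")
        for name, city in cur.fetchall():
            print(name, city)  # already unicode, ready for json.dumps
finally:
    conn.close()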

> We also get import data that often comes in .xlsx format.  What
> encoding do I get when I dump a .csv from that?  Do I have to
> ask the sender?  I already know that they don't know.

Ah, now, that's a potential problem. A CSV file can't tell you what
encoding it's in. Fortunately, UTF-8 is designed to be fairly
dependable: byte sequences from other eight-bit encodings almost
never happen to form valid UTF-8, so if you attempt to decode
something as UTF-8 and it works, you can be reasonably confident
that it really is UTF-8. But ultimately, you just have to ask the
person who exports it: "please export it in UTF-8".
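
If you want a fallback on the import side, one sketch (the file name
is made up, and treating latin-1 as the fallback is an assumption
that fits your tables):

def decode_csv_bytes(data):
    # Valid UTF-8 rarely happens by accident, so try it first.
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so it never fails --
        # though it may silently produce the wrong characters.
        return data.decode('latin-1')

with open('import.csv', 'rb') as f:
    text = decode_csv_bytes(f.read())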

Generally, things should "just work" as long as you're consistent with
encodings, and the easiest way to be consistent is to use UTF-8
everywhere. It's a simple rule that everyone can follow. (Hopefully.
:) )

ChrisA


