Unicode

Sun Sep 17 09:13:09 EDT 2017

Leam Hall wrote:

> On 09/17/2017 08:30 AM, Chris Angelico wrote:
>> On Sun, Sep 17, 2017 at 9:38 PM, Leam Hall <leamhall at gmail.com> wrote:
>>> Still trying to keep this Py2 and Py3 compatible.
>>>
>>> The Py2 error is:
>>>          UnicodeEncodeError: 'ascii' codec can't encode character
>>>          u'\xf6' in position 8: ordinal not in range(128)
>>>
>>> even when the string is manually converted:
>>>          name    = unicode(self.name)
>>>
>>> Same sort of issue with:
>>>          name    = self.name.decode('utf-8')
>>>
>>>
>>> Py3 doesn't like either version.
>> 
>> You got a Unicode *EN*code error when you tried to *DE* code. That's a
>> quirk of Py2's coercion behaviours, so the error's a bit obscure, but
>> it means that you (most likely) actually have a Unicode string
>> already. Check what type(self.name) is, and see if the problem is
>> actually somewhere else.
>> 
>> (It's hard to give more specific advice based on this tiny snippet,
>> sorry.)
>> 
>> ChrisA
>> 
> 
> Chris, thanks! I see what you mean.

I don't think so. You get a unicode from the database, 

$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
>>> db = sqlite3.connect(":memory:")
>>> cs = db.cursor()
>>> cs.execute("select 'foo';").fetchone()
(u'foo',)
>>>

and when you try to decode it (which is superfluous, as you already have
unicode!) Python 2 does what you ask for. But decoding needs a byte string,
so it has to encode the unicode first, and by default it uses the ascii codec
for that attempt. For an all-ascii string

u"foo".encode("ascii") --> "foo"

and thus

u"foo".decode("utf-8")

implemented as

u"foo".encode("ascii").decode("utf-8") --> u"foo"

is basically a noop. However

u"äöü".encode("ascii") --> raises UnicodeENCODEError

and thus

u"äöü".decode("utf-8")

fails with that. Unfortunately, hardly anybody realizes that it was the
implicit encoding step that failed, so people try, unsuccessfully, to
specify other encodings for the decoding step

u"äöü".decode("latin1")  # also fails
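
To make the hidden step visible, here is a rough sketch that spells out the
implicit encode by hand. py2_style_decode is a made-up helper for
illustration, not anything Python provides; it only mimics what Python 2
does behind the scenes, and it runs on Python 3 as well because the encode
is written out explicitly.

```python
def py2_style_decode(text, encoding):
    # Hypothetical helper: mimics Python 2's unicode.decode(), which first
    # encodes with the default codec (ascii) and only then decodes.
    return text.encode("ascii").decode(encoding)

results = {}
for encoding in ("utf-8", "latin1"):
    try:
        py2_style_decode(u"\xe4\xf6\xfc", encoding)  # u"äöü"
        results[encoding] = "ok"
    except UnicodeEncodeError as exc:
        # The ascii encode fails before the requested codec is ever used.
        results[encoding] = type(exc).__name__
```

Whatever codec you pass in, the failure comes from the ascii encode, so
results ends up holding UnicodeEncodeError for both entries.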

Solution: if you already have unicode, leave it alone.
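
If the same code has to run on Py2 and Py3, one possible way to "leave it
alone" is to decode only when you really have bytes. This is a minimal
sketch, not a drop-in fix for your script: to_text and the sample name are
invented for the example, and utf-8 is only assumed as the encoding for
byte strings.

```python
import sqlite3

def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass text through as-is."""
    # On Python 2, bytes is an alias for str; on Python 3 it is a
    # separate type, so this check works on both.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

db = sqlite3.connect(":memory:")
# sqlite3 hands back text (unicode on Py2, str on Py3) by default.
name = db.execute("select ?", (u"J\xf6rg",)).fetchone()[0]
safe_name = to_text(name)            # already text, returned unchanged
raw_name = to_text(b"J\xc3\xb6rg")   # utf-8 bytes, decoded exactly once
db.close()
```

The value from the database goes through untouched, and only the byte
string is decoded.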

> The string source is a SQLite3 database with a bunch of names. Some have
> non-ASCII characters. The database is using varchar which seems to be
> utf-8, utf-16be or utf-16le. I probably need to purge the data.
> 
> What I find interesting is that utf-8 works in the Ruby script that
> pulls from the same database. That's what makes me think it's utf-8.
> 
> I've tried different things in lines 45 and 61.
> 
> https://gist.github.com/LeamHall/054f9915af17dc1b1a33597b9b45d2da
> 
> Leam