Unicode

Sun Sep 17 13:27:35 EDT 2017

On Mon, Sep 18, 2017 at 2:20 AM, leam hall <leamhall at gmail.com> wrote:
> On Sun, Sep 17, 2017 at 9:13 AM, Peter Otten <__peter__ at web.de> wrote:
>
>> Leam Hall wrote:
>>
>> > On 09/17/2017 08:30 AM, Chris Angelico wrote:
>> >> On Sun, Sep 17, 2017 at 9:38 PM, Leam Hall <leamhall at gmail.com> wrote:
>> >>> Still trying to keep this Py2 and Py3 compatible.
>> >>>
>> >>> The Py2 error is:
>> >>>          UnicodeEncodeError: 'ascii' codec can't encode character
>> >>>          u'\xf6' in position 8: ordinal not in range(128)
>> >>>
>> >>> even when the string is manually converted:
>> >>>          name    = unicode(self.name)
>> >>>
>> >>> Same sort of issue with:
>> >>>          name    = self.name.decode('utf-8')
>> >>>
>> >>>
>> >>> Py3 doesn't like either version.
>> >>
>> >> You got a Unicode *EN*code error when you tried to *DE*code. That's a
>> >> quirk of Py2's coercion behaviours, so the error's a bit obscure, but
>> >> it means that you (most likely) actually have a Unicode string
>> >> already. Check what type(self.name) is, and see if the problem is
>> >> actually somewhere else.
>> >>
>> >> (It's hard to give more specific advice based on this tiny snippet,
>> >> sorry.)
>> >>
>> >> ChrisA
>> >>
>> >
>> > Chris, thanks! I see what you mean.
>>
>> I don't think so. You get a unicode from the database,
>>
>> $ python
>> Python 2.7.6 (default, Oct 26 2016, 20:30:19)
>> [GCC 4.8.4] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import sqlite3
>> >>> db = sqlite3.connect(":memory:")
>> >>> cs = db.cursor()
>> >>> cs.execute("select 'foo';").fetchone()
>> (u'foo',)
>> >>>
>>
>> and when you try to decode it (which is superfluous as you already have
>> unicode!) Python does what you ask for. But to be able to decode, it has to
>> encode first, and by default it uses the ascii codec for that attempt. For
>> an all-ascii string
>>
>> u"foo".encode("ascii") --> "foo"
>>
>> and thus
>>
>> u"foo".decode("utf-8")
>>
>> implemented as
>>
>> u"foo".encode("ascii").decode("utf-8") --> u"foo"
>>
>> is basically a noop. However
>>
>> u"äöü".encode("ascii") --> raises UnicodeENCODEError
>>
>> and thus
>>
>> u"äöü".decode("utf-8")
>>
>> fails with that. Unfortunately, few people realize that it is the implicit
>> encoding step that failed, so they try in vain to specify other encodings
>> for the decoding step
>>
>> u"äöü".decode("latin1")  # also fails
>>
>> Solution: if you already have unicode, leave it alone.
>>
>
> Doesn't seem to work. The failing code takes the strings as is from the
> database. It will occasionally fail when a name comes up that uses
> a non-ascii character.
>
> Lines 44, 60, 66, 67.
>
> https://github.com/makhidkarun/py_tools/blob/master/lib/character.py

This doesn't make it easy:

https://github.com/makhidkarun/py_tools/blob/master/lib/character_tools.py#L40

Whatever exception occurs, you go to your fallback method. So if
something's going wrong, it's harder to figure out.
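
As a rough sketch of the alternative (the names load_names, primary and
fallback are made up for illustration and are not taken from
character_tools.py), the fallback could be limited to the one failure
you actually expect, so that a Unicode error still surfaces with its
traceback:

```python
import logging


def load_names(primary, fallback):
    """Call primary(); use fallback() only for an expected I/O failure."""
    try:
        return primary()
    except (IOError, OSError) as exc:
        # Narrow handler: only missing/unreadable sources end up here.
        # Anything else (e.g. UnicodeEncodeError) propagates to the caller.
        logging.warning("primary source failed (%r); using fallback", exc)
        return fallback()
```

With a blanket "except:" the same call would quietly return the fallback
data and the real error would never be seen.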

But the thing to do would be to check the types of everything that's
involved. You probably still have a mixture of text and bytes. It's
hard to pin down without actually running all the code, and with all
your "from X import *" lines, it's not easy to track down all the
code; I *think* that you will get different behaviour from SQLite3 vs
the list_from_file function, but I can't be certain.
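
To illustrate the kind of difference I mean, here is a small sketch. It
assumes the file is UTF-8 and it does not use your list_from_file; the
temporary file and the name "Jörg" are stand-ins. SQLite hands back
text, a file opened in binary mode hands back bytes, and io.open with an
explicit encoding hands back text on both Py2 and Py3:

```python
import io
import os
import sqlite3
import tempfile

# From SQLite: text values come back as text (unicode on Py2, str on Py3).
db = sqlite3.connect(":memory:")
from_db = db.execute(u"select 'J\xf6rg'").fetchone()[0]
db.close()

# Write a small UTF-8 file to read back in two different ways.
fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"J\xf6rg\n")

# Binary mode: the lines are bytes and still need decoding.
with open(path, "rb") as handle:
    raw_lines = handle.read().splitlines()

# io.open with an encoding: the lines are already text.
with io.open(path, encoding="utf-8") as handle:
    text_lines = handle.read().splitlines()

os.remove(path)

print(type(from_db), type(raw_lines[0]), type(text_lines[0]))
```

If the two sources disagree like this, the names from one of them need
decoding and the names from the other must be left alone.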

As always, print is your friend. In this case, print(type(...)) will be helpful.
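
Something along these lines might do as a starting point. The helper
names describe and to_text are hypothetical, and UTF-8 is an assumption
about your data. It should run unchanged on Py2 and Py3:

```python
text_type = type(u"")  # unicode on Py2, str on Py3


def describe(label, value):
    """Print the type of value and classify it as text, bytes or other."""
    if isinstance(value, text_type):
        kind = "text"
    elif isinstance(value, bytes):
        kind = "bytes"
    else:
        kind = "other"
    print("%s: %s (%s)" % (label, type(value).__name__, kind))
    return kind


def to_text(value, encoding="utf-8"):
    """Decode bytes; leave text that is already text alone."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


describe("from db", u"J\xf6rg")
describe("from file", b"J\xc3\xb6rg")
```

Once the types are known, a helper like to_text at the point where data
enters the program keeps everything downstream as text, which matches
the "if you already have unicode, leave it alone" advice quoted above.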

ChrisA