Python2.7 unicode conundrum

Richard Damon Richard at Damon-Family.org
Sun Nov 25 13:13:13 EST 2018


On 11/25/18 12:51 PM, Robert Latest via Python-list wrote:
> Hi folks,
> what seemingly started out as a weird database character encoding mix-up
> could be boiled down to a few lines of pure Python. The source-code
> below is real utf8 (as evidenced by the UTF-8 byte pair 'c3 a4' in the
> third line of the hexdump). When just printed, the string "s" is
> displayed correctly as 'ä' (a umlaut), but the string representation
> shows that it seems to have been converted to latin-1 'e4' somewhere on
> the way.
> How can this be avoided?
>
> dh at jenna:~/python$ cat unicode.py
> # -*- encoding: utf8 -*-
>
> s = u'ä'
>
> print(s)
> print((s, ))
>
> dh at jenna:~/python$ hd unicode.py 
> 00000000  23 20 2d 2a 2d 20 65 6e  63 6f 64 69 6e 67 3a 20  |# -*- encoding: |
> 00000010  75 74 66 38 20 2d 2a 2d  0a 0a 73 20 3d 20 75 27  |utf8 -*-..s = u'|
> 00000020  c3 a4 27 0a 0a 70 72 69  6e 74 28 73 29 0a 70 72  |..'..print(s).pr|
> 00000030  69 6e 74 28 28 73 2c 20  29 29 0a 0a              |int((s, ))..|
> 0000003c
> dh at jenna:~/python$ python unicode.py
> ä
> (u'\xe4',)
> dh at jenna:~/python$
>
>
>
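Why do you say it has been converted to 'Latin'? The repr shows a
Unicode string (note the u prefix). Internally Python doesn't store
strings as UTF-8, but as plain Unicode code points (UCS-2 or UCS-4,
depending on how the interpreter was built), and code point E4 is the
character you want. Latin-1 happens to use the same value for that
character, which is why '\xe4' looks like a Latin-1 byte.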
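
A small sketch of that point, in case it helps. The variable names are
just for illustration, and the escape u'\xe4' is used in place of the
literal character so the snippet does not depend on the file's encoding:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata

# u'\xe4' is the same one-character string as u'ä'; the escape simply
# spells out the code point U+00E4.
s = u'\xe4'

code_point = ord(s)                   # the Unicode code point as an integer
name = unicodedata.name(s)            # its official Unicode character name
utf8_bytes = s.encode('utf-8')        # two bytes on disk: c3 a4
latin1_bytes = s.encode('latin-1')    # one byte: e4 (same value as the code point)

print(hex(code_point))
print(name)
print(repr(utf8_bytes))
print(repr(latin1_bytes))
print(len(s))                         # one character, whatever the encoding
```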

The encoding declaration tells Python how your source file is encoded;
it does not affect how strings are stored or how repr() displays them.
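
As a rough illustration of what the declaration controls, here is what
happens when the two bytes from your hexdump are decoded with the right
and the wrong codec. This is only an analogy for what the parser does
with a u'' literal, not the interpreter's actual code path, and the
variable names are made up for the example:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The two bytes that appear between the quotes in the hexdump.
raw = b'\xc3\xa4'

# Decoded as UTF-8 (what the declaration asks for): one character, U+00E4.
as_utf8 = raw.decode('utf-8')

# Decoded as Latin-1 (what a wrong declaration would give): two characters,
# U+00C3 and U+00A4, the classic mojibake.
as_latin1 = raw.decode('latin-1')

print(repr(as_utf8))
print(repr(as_latin1))
print(len(as_utf8), len(as_latin1))
```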

-- 
Richard Damon



