Python2.7 unicode conundrum
Richard Damon
Richard at Damon-Family.org
Sun Nov 25 13:13:13 EST 2018
On 11/25/18 12:51 PM, Robert Latest via Python-list wrote:
> Hi folks,
> what seemingly started out as a weird database character encoding mix-up
> could be boiled down to a few lines of pure Python. The source-code
> below is real utf8 (as evidenced by the UTF code point 'c3 a4' in the
> third line of the hexdump). When just printed, the string "s" is
> displayed correctly as 'ä' (a umlaut), but the string representation
> shows that it seems to have been converted to latin-1 'e4' somewhere on
> the way.
> How can this be avoided?
>
> dh@jenna:~/python$ cat unicode.py
> # -*- encoding: utf8 -*-
>
> s = u'ä'
>
> print(s)
> print((s, ))
>
> dh@jenna:~/python$ hd unicode.py
> 00000000 23 20 2d 2a 2d 20 65 6e 63 6f 64 69 6e 67 3a 20 |# -*- encoding: |
> 00000010 75 74 66 38 20 2d 2a 2d 0a 0a 73 20 3d 20 75 27 |utf8 -*-..s = u'|
> 00000020 c3 a4 27 0a 0a 70 72 69 6e 74 28 73 29 0a 70 72 |..'..print(s).pr|
> 00000030 69 6e 74 28 28 73 2c 20 29 29 0a 0a |int((s, ))..|
> 0000003c
> dh@jenna:~/python$ python unicode.py
> ä
> (u'\xe4',)
> dh@jenna:~/python$
>
>
>
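Why do you say it has been converted to Latin-1? The second print shows
the repr of the string, and the u prefix marks it as Unicode. Internally
Python doesn't store unicode strings as UTF-8, but as plain code points
(UCS-2 or UCS-4, depending on how the interpreter was built), and code
point U+00E4 is the character you want. It only looks like Latin-1
because the first 256 Unicode code points coincide with Latin-1.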
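To make the difference between a code point and its encodings concrete, here is a minimal sketch (the variable names are mine, not from the original post). It takes the same one-character string and shows its code point next to the bytes that UTF-8 and Latin-1 would use for it:

```python
# Sketch only: variable names are illustrative, not from the original post.
s = u'\xe4'  # a-umlaut, written as an escape so the file stays pure ASCII

code_point = ord(s)                              # the abstract Unicode code point
utf8_bytes = bytearray(s.encode('utf-8'))        # bytes as stored in a UTF-8 file
latin1_bytes = bytearray(s.encode('latin-1'))    # bytes Latin-1 would use

print(hex(code_point))
print([hex(b) for b in utf8_bytes])
print([hex(b) for b in latin1_bytes])
```

The string holds a single code point, E4. UTF-8 needs the two bytes c3 a4 to represent it, while Latin-1 happens to use the single byte e4, which is why the repr can be mistaken for Latin-1.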
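The encoding declaration only tells Python how your source file is
encoded, so it can decode the bytes in the file; it has no effect on how
the resulting string is stored or displayed.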
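As a rough illustration of what that decoding step amounts to (a sketch, not what the interpreter literally runs), take the two bytes from the hexdump and decode them once with the declared encoding and once with a wrong one. The names below are made up for the example:

```python
# Sketch only: names are illustrative. These are the two bytes found
# between the quotes in unicode.py, per the hexdump (c3 a4).
source_bytes = b'\xc3\xa4'

# Decoded with the declared encoding (UTF-8): a single character.
as_utf8 = source_bytes.decode('utf-8')

# Decoded with a wrong encoding (Latin-1): two unrelated characters.
as_latin1 = source_bytes.decode('latin-1')

print("%d %d" % (len(as_utf8), len(as_latin1)))
print([hex(ord(c)) for c in as_utf8])
print([hex(ord(c)) for c in as_latin1])
```

With the correct declaration the two bytes become the one code point E4, which is exactly what the repr in the original output shows. With a wrong declaration they would become two characters, C3 and A4, and that would be the real mix-up.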
--
Richard Damon