Python2.7 unicode conundrum

Sun Nov 25 13:09:22 EST 2018

On 25/11/2018 18:51, Robert Latest via Python-list wrote:
> Hi folks,
> what semmingly started out as a weird database character encoding mix-up
> could be boiled down to a few lines of pure Python. The source-code
> below is real utf8 (as evidenced by the UTF code point 'c3 a4' in the
> third line of the hexdump). When just printed, the string "s" is
> displayed correctly as 'ä' (a umlaut), but the string representation
> shows that it seems to have been converted to latin-1 'e4' somewhere on
> the way.

It's not being converted to latin-1. It's a unicode string, as evidences
by the 'u'.

u'\xe4' is a unicode string with one character, U+00E4 (ä)

> How can this be avoided?
> 
> dh at jenna:~/python$ cat unicode.py
> # -*- encoding: utf8 -*-
> 
> s = u'ä'
> 
> print(s)
> print((s, ))
> 
> dh at jenna:~/python$ hd unicode.py 
> 00000000  23 20 2d 2a 2d 20 65 6e  63 6f 64 69 6e 67 3a 20  |# -*- encoding: |
> 00000010  75 74 66 38 20 2d 2a 2d  0a 0a 73 20 3d 20 75 27  |utf8 -*-..s = u'|
> 00000020  c3 a4 27 0a 0a 70 72 69  6e 74 28 73 29 0a 70 72  |..'..print(s).pr|
> 00000030  69 6e 74 28 28 73 2c 20  29 29 0a 0a              |int((s,))..|
> 0000003c
> dh at jenna:~/python$ python unicode.py
> ä
> (u'\xe4',)
> dh at jenna:~/python$
> 
> 
>