[Tutor] Retain UTF-8 Character in Python List

Steven D'Aprano steve at pearwood.info
Mon Jun 1 13:22:22 CEST 2015


On Mon, Jun 01, 2015 at 09:39:03AM +0700, Boy Sandy Gladies Arriezona wrote:
> Hi, it's my first time in here. I hope you don't mind if I straight to the
> question.
> I do some work in python 2 and my job is to collect some query and then
> send it to java program via json. We're doing batch update in Apache
> Phoenix, that's why I collect those query beforehand.

In Python 2, regular strings "" are actually byte strings, not text, and 
cannot hold Unicode characters as such. If you type a non-ASCII character 
into one, you get whatever bytes your terminal or source file encoding 
produces, which may be UTF-8, but could be something else.

So for example:

py> s = "a©b"  # Not a Unicode string
py> len(s)  # Expecting 3.
4
py> for c in s: print c, repr(c)
...
a 'a'
� '\xc2'
� '\xa9'
b 'b'


Not what you want! Instead, you have to use Unicode strings, u"".

py> s = u"a©b"  # Unicode string
py> len(s)
3
py> for c in s: print c, repr(c)
...
a u'a'
© u'\xa9'
b u'b'
py> print s
a©b


Remember, the u is not part of the string, it is part of the delimiter:

Byte strings use the delimiters " " or ' '

Unicode strings use the delimiters u" " or u' '
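
To make the relationship between the two kinds of string concrete, here 
is a small sketch that goes from a Unicode string to its UTF-8 bytes and 
back. The names text and data are just for illustration, and I have 
written the © as the escape \u00a9 so the source stays pure ASCII and the 
same code runs on Python 2.7 and Python 3.

```python
# Runs on Python 2.7 and Python 3; the escape keeps the source pure ASCII.
text = u"a\u00a9b"             # Unicode string: a, COPYRIGHT SIGN, b
data = text.encode("utf-8")    # byte string holding the UTF-8 encoding

print(len(text))               # counts characters
print(len(data))               # counts bytes; the copyright sign takes two
print(repr(data))

# Decoding with the same encoding recovers the original text.
assert data.decode("utf-8") == text
```

The two bytes \xc2 and \xa9 from the first example are exactly the UTF-8 
encoding of ©, which is why the byte string had length 4.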



> My question is:
> *Can we retain utf-8 character in list without changing its form into \xXX
> or \u00XX?* The reason is because that java program insert it directly "as
> is" without iterating the list. So, my query will be the same as we print
> the list directly.

What do you mean, the Java program inserts it directly? Inserts it into 
what?


> Example:
> c = 'sffs © fafd'
> l = list()
> l.append(c)
> print l
> ['sffs \xc2\xa9 fafd']  # this will be inserted, not ['sffs © fafd']

Change the string 'sffs...' to a Unicode string u'sffs...' and your 
example will work.

*However*, don't be fooled by Python's list display:

py> mylist = [u'a©b']
py> print mylist
[u'a\xa9b']

"Oh no!", you might think, "Python has messed up my string and converted 
the © into \xa9 which is exactly what I don't want!"

But don't be fooled, that's just the list's default display. The string 
is actually still exactly what you want. Printing a list shows the repr() 
of each item, and in Python 2 the repr() of a string displays anything 
which is not ASCII as an escape sequence. But if you print the string 
directly, you will see it as you intended:

py> print mylist[0]  # just the string inside the list
a©b
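
Since you mention sending the queries via json, what probably matters is 
the JSON text rather than how Python prints the list. I don't know how 
your Java program reads it, so treat this as a sketch under that 
assumption; the names queries, escaped and literal are made up for the 
example. By default json.dumps escapes non-ASCII characters, and 
ensure_ascii=False keeps them as they are.

```python
import json

# Same text as your example, with the copyright sign written as an escape.
queries = [u"sffs \u00a9 fafd"]

# Default: non-ASCII characters are escaped, so the output is pure ASCII.
escaped = json.dumps(queries)

# ensure_ascii=False keeps the character itself in the output.
literal = json.dumps(queries, ensure_ascii=False)

# Either form parses back to the same list.
assert json.loads(escaped) == queries
assert json.loads(literal) == queries
```

Any conforming JSON parser should turn both forms back into the same 
string. If you use the second form, the result still has to be encoded 
(for example as UTF-8) before you write it to a file or a socket.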



By the way, if you can use Python 3 instead of Python 2, you may find 
the Unicode handling is a bit simpler and less confusing. For example, 
in Python 3, the way lists are printed is a bit more sensible:

py> mylist = ['a©b']  # No need for the u' delimiter in Python 3.
py> print(mylist)
['a©b']

Python 2.7 will do the job, if you must, but it will be a bit harder and 
more confusing. Python 3.3 or higher is a better choice for Unicode.
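
If you want to convince yourself that the two displays describe the same 
string, Python 3 lets you ask for either form. This is only a sketch for 
Python 3 (ascii() does not exist as a builtin in Python 2), and the names 
shown and escaped are just for illustration.

```python
# Python 3 only.
mylist = ['a\u00a9b']      # the same string as ['a©b']

shown = repr(mylist)       # what print(mylist) displays in Python 3
escaped = ascii(mylist)    # ASCII-only form, like the Python 2 display

print(shown)
print(escaped)
```

Both are just different ways of writing down the same three-character 
string.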



-- 
Steve

