[Tutor] Retain UTF-8 Character in Python List
Steven D'Aprano
steve at pearwood.info
Mon Jun 1 13:22:22 CEST 2015
On Mon, Jun 01, 2015 at 09:39:03AM +0700, Boy Sandy Gladies Arriezona wrote:
> Hi, it's my first time here. I hope you don't mind if I go straight to the
> question.
> I do some work in Python 2, and my job is to collect some queries and then
> send them to a Java program via JSON. We're doing batch updates in Apache
> Phoenix, which is why I collect those queries beforehand.
In Python 2, regular strings "" are actually byte strings, not Unicode
text. If you put a non-ASCII character in one, you get whatever bytes
your terminal or source file encoding produces, which may be UTF-8, but
could be something else.
So for example:
py> s = "a©b" # Not a Unicode string
py> len(s) # Expecting 3.
4
py> for c in s: print c, repr(c)
...
a 'a'
� '\xc2'
� '\xa9'
b 'b'
Not what you want! Instead, you have to use Unicode strings, u"".
py> s = u"a©b" # Unicode string
py> len(s)
3
py> for c in s: print c, repr(c)
...
a u'a'
© u'\xa9'
b u'b'
py> print s
a©b
Remember, the u is not part of the string, it is part of the delimiter:
Byte string uses delimiters " " or ' '
Unicode string uses delimiters u" " or u' '
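
To make the byte/character distinction concrete, here is a minimal sketch that should behave the same on Python 2.7 and Python 3.3 or later. The variable names are only illustrative, and the \u00a9 escape is used so the result does not depend on the source file or terminal encoding.

```python
# -*- coding: utf-8 -*-
# Sketch only; the names "text" and "data" are illustrative.
text = u"a\u00a9b"            # Unicode string: a, ©, b
data = text.encode("utf-8")   # the same text as UTF-8 encoded bytes

print(len(text))   # 3 characters
print(len(data))   # 4 bytes, because © takes two bytes in UTF-8

# Decoding the bytes with the matching codec recovers the original text.
assert data.decode("utf-8") == text
```

The two bytes that © turns into are the same '\xc2' and '\xa9' seen in the first session above.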
> My question is:
> *Can we retain UTF-8 characters in a list without them changing into \xXX
> or \u00XX form?* The reason is that the Java program inserts it directly "as
> is" without iterating over the list. So my query will be the same as when we
> print the list directly.
What do you mean, the Java program inserts it directly? Inserts it into
what?
> Example:
> c = 'sffs © fafd'
> l = list()
> l.append(c)
> print l
> ['sffs \xc2\xa9 fafd'] # this will be inserted, not ['sffs © fafd']
Change the string 'sffs...' to a Unicode string u'sffs...' and your
example will work.
*However*, don't be fooled by Python's list display:
py> mylist = [u'a©b']
py> print mylist
[u'a\xa9b']
"Oh no!", you might think, "Python has messed up my string and converted
the © into \xa9 which is exactly what I don't want!"
But don't be fooled, that's just the list's default display, which shows
the repr() of each item. The string is actually still exactly what you
want; repr() just displays anything which is not ASCII as an escape
sequence. But if you print the string directly, you will see it as you
intended:
py> print mylist[0] # just the string inside the list
a©b
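
Since the queries are sent on as JSON, the list display matters less than what the JSON encoder writes. The thread does not say how the JSON is produced, so the following is only a sketch assuming the standard library json module, with a made-up one-item query list. By default json.dumps escapes non-ASCII characters as \uXXXX, and passing ensure_ascii=False keeps them as real characters.

```python
# -*- coding: utf-8 -*-
import json

queries = [u"sffs \u00a9 fafd"]   # hypothetical query list

escaped = json.dumps(queries)                      # default: ASCII-only output
literal = json.dumps(queries, ensure_ascii=False)  # keeps © as a real character

# Both forms decode back to the same list.
assert json.loads(escaped) == json.loads(literal) == queries

# Encode explicitly before writing to a socket or a file.
payload = literal.encode("utf-8")
```

If the receiving program parses the JSON properly, either form should give it the same string. The ensure_ascii=False form only matters if the text is inserted "as is" without being parsed.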
By the way, if you can use Python 3 instead of Python 2, you may find
the Unicode handling is a bit simpler and less confusing. For example,
in Python 3, the way lists are printed is a bit more sensible:
py> mylist = ['a©b'] # No need for the u' delimiter in Python 3.
py> print(mylist)
['a©b']
Python 2.7 will do the job, if you must, but it will be a bit harder and
more confusing. Python 3.3 or higher is a better choice for Unicode.
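
For comparison, a small Python 3 only sketch of the two display styles. repr() is what print() uses for a list and its items, and in Python 3 it keeps printable non-ASCII characters. The built-in ascii() escapes them, much like the Python 2 list display but without the u prefix.

```python
# Python 3 only: ascii() does not exist as a built-in in Python 2.
mylist = ["a\u00a9b"]

shown = repr(mylist)      # what print(mylist) displays in Python 3
escaped = ascii(mylist)   # ASCII-only form with escape sequences

assert shown == "['a\u00a9b']"     # the © character is kept
assert escaped == "['a\\xa9b']"    # the © character is escaped as \xa9

print(escaped)
```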
--
Steve