Problem processing Chinese

Fri Oct 14 03:24:13 EDT 2005

Anthony Liu wrote:

> I believe that topic related to Chinese processing was
> discussed before.  I could not dig out the info I want
> from the mail list archive.
> 
> My Python script reads some Chinese text and then
> split a line delimited by white spaces.  I got lists
> like
> 
> ['\xbc\xc7\xd5\xdf', '\xd0\xbb\xbd\xf0\xbb\xa2',
> '\xa1\xa2']
> 
> I had
> 
> #-*- coding: gbk -*-
> 
> on top of the script.
> 
> My Windows 2000 system's default language is Chinese
> (GB2312) and  displays Chinese perfectly.
> 
> I don't know how to configure python or what else I
> need to properly process such two-byte-character text.
> 
> Thanks.

Suppose you have a file with the following contents:

>>> file("chinese.txt").read()
'\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2'

Then it's best to open it via codecs -- of course you have to know the
encoding:

>>> import codecs
>>> codecs.open("chinese.txt", "r", "gbk").read()
u'\u8bb0\u8005 \u8c22\u91d1\u864e \u3001'

This may still look strange to you, but it is only the unicode string's repr().
If sys.stdout.encoding is set properly on your system, you can just print it:

>>> u = codecs.open("chinese.txt", "r", "gbk").read()
>>> print u
记者 谢金虎 、

If that fails, provide the encoding explicitly:

>>> print u.encode("utf-8")  # probably "gbk" instead of "utf-8" on your system
记者 谢金虎 、
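
The explicit-encoding fallback can be wrapped in a small helper. This is only
a sketch: encode_for_console is a hypothetical name, and falling back to UTF-8
with errors="replace" is an assumption made for illustration, not something
Python does for you.

```python
import sys


def encode_for_console(text, encoding=None):
    """Encode a unicode string for output (hypothetical helper)."""
    if encoding is None:
        # sys.stdout.encoding may be None, e.g. when output is piped.
        encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
    # "replace" substitutes "?" for characters the target encoding lacks
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, "replace")


gbk_bytes = encode_for_console(u"\u8bb0\u8005", "gbk")
ascii_bytes = encode_for_console(u"\u8bb0\u8005", "ascii")
```

With "gbk" the result is the same byte sequence the original script saw;
with "ascii" each Chinese character degrades to a question mark.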

Because you are now working with unicode, all further operations are performed
on characters rather than bytes. Processing Chinese is no more difficult than
processing a language that confines itself to plain ASCII.
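
The difference between characters and bytes can be seen directly on the first
word of the example. A minimal sketch; the variable names are chosen only for
illustration:

```python
# GBK bytes for the first word of the example, and the same word decoded.
raw = b"\xbc\xc7\xd5\xdf"
word = raw.decode("gbk")

byte_count = len(raw)    # counts bytes
char_count = len(word)   # counts characters

# Slicing the unicode string yields a whole character ...
first_char = word[:1]
# ... while slicing the byte string cuts a two-byte character in half.
half_char = raw[:1]
```

Here the byte string has length 4, while the unicode string has length 2.
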
But if you split your text into a list

>>> u.split()
[u'\u8bb0\u8005', u'\u8c22\u91d1\u864e', u'\u3001']

you probably think you are back to square one. That is because Python prints
the repr() of the list items (otherwise a comma inside an item would give the
impression that the list contains more items than it actually does). To get the
actual characters, choose an item explicitly

>>> items = u.split()
>>> print items[0]
记者

or convert the entire list to a string of your liking, e.g.:

>>> print u"[%s]" % u", ".join(items)
[记者, 谢金虎, 、]
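
Putting the pieces together for the original task (read a file, split each
line on whitespace), a sketch might look like the following. split_lines is a
hypothetical helper, and the scratch file merely stands in for chinese.txt;
the encoding still has to be known in advance.

```python
import codecs
import os
import tempfile


def split_lines(path, encoding):
    """Return one list of unicode words per line (hypothetical helper)."""
    f = codecs.open(path, "r", encoding)
    try:
        return [line.split() for line in f.read().splitlines()]
    finally:
        f.close()


# Scratch file holding the GBK bytes from the example.
fd, path = tempfile.mkstemp(suffix=".txt")
out = os.fdopen(fd, "wb")
try:
    out.write(b"\xbc\xc7\xd5\xdf \xd0\xbb\xbd\xf0\xbb\xa2 \xa1\xa2\n")
finally:
    out.close()

try:
    rows = split_lines(path, "gbk")
finally:
    os.remove(path)

# One readable unicode string per line, ready to be printed or encoded.
formatted = [u"[%s]" % u", ".join(words) for words in rows]
```

Each entry of formatted is a unicode string, so it can be printed directly or
encoded explicitly if the console encoding is not set.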

Peter