Unicode is driving me nuts!

Anthony Liu antonyliu2002 at yahoo.com
Fri Mar 12 18:19:04 EST 2004


I am trying to parse a huge Chinese corpus using
Python, and I am having a hard time handling the
Chinese characters.

I need to get some particular Chinese characters that
meet a certain standard one by one from the corpus.

Before I parse a sentence and try to locate the
character, I decode the whole string I read in to
Unicode, like so:

str = unicode(raw_str, myencoding)

For example, I have tried the 'gbk' and 'cp936'
encodings.
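
A minimal sketch of that decoding step, assuming
Python 3, where unicode(raw_str, myencoding) is
written raw_bytes.decode(myencoding). The sample
text and the name decode_text are made up for
illustration and are not from my actual script:

```python
# Hypothetical sample: eight Chinese characters encoded as GBK bytes.
# In GBK each of these characters takes two bytes.
sample_bytes = "我在处理中文语料".encode("gbk")


def decode_text(raw_bytes, encoding="gbk"):
    """Decode raw bytes into a Unicode string (illustrative helper)."""
    return raw_bytes.decode(encoding)


text = decode_text(sample_bytes)

# Once decoded, the string can be walked one character at a time.
chars = list(text)
print(len(sample_bytes), len(chars))
```

On a small, complete sample like this the decode
succeeds, which matches what I see with my small
test document.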

This works just fine with a small sample Chinese
document.

But when I run the script on the entire
corpus, I get the typical "incomplete multibyte
sequence" error or "UnicodeEncodeError: 'ascii' codec
can't encode characters in position 0-23: ordinal not
in range(128)".
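
A sketch of one way both kinds of error can be
reproduced, again assuming Python 3. It is only a
guess that the corpus is read in chunks that can
split a two-byte character; the sample text and
the name decode_chunks are hypothetical. The last
part uses the standard library's incremental
decoder, which buffers a trailing half character
until the next chunk arrives:

```python
import codecs

sample = "中文语料".encode("gbk")  # 8 bytes, 2 per character

# 1. Decoding a slice that ends in the middle of a character fails.
try:
    sample[:3].decode("gbk")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# 2. Encoding Chinese text with the ascii codec fails as well.
try:
    "中文语料".encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc


def decode_chunks(chunks, encoding="gbk"):
    """Decode byte chunks that may split a multibyte character."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True makes a truly truncated input raise instead of passing silently.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# The same bytes, split at the awkward position, now decode cleanly.
joined = decode_chunks([sample[:3], sample[3:]])
print(joined == "中文语料")
```

If the real cause is something else, for example
bytes in the corpus that are not valid GBK at all,
this sketch does not address it.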

I am at my wit's end and very frustrated with
handling non-ASCII text.

Any hint would be highly appreciated.





More information about the Python-list mailing list