Unicode is driving me nuts!
Anthony Liu
antonyliu2002 at yahoo.com
Fri Mar 12 18:19:04 EST 2004
I am trying to parse a huge Chinese corpus using
python, and I am having a hard time handling the
Chinese characters.
I need to get some particular Chinese characters that
meet a certain standard one by one from the corpus.
Before I parse a sentence and try to locate the
character, I convert the whole string I read in to
Unicode, like so:
str = unicode(raw_str, myencoding)
I used 'gbk' and 'cp936' encoding for example.
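For reference, the conversion step can be sketched on a tiny input. This is only an illustration under assumptions: the sample bytes and the variable names text and chars are made up here, and it uses the bytes.decode() spelling, which should behave like the unicode(raw_str, myencoding) call above while also running on newer interpreters.

```python
# Hypothetical sample input: the GBK bytes for the two characters U+4E2D U+6587.
raw_str = b"\xd6\xd0\xce\xc4"
myencoding = "gbk"  # 'cp936' is the other encoding mentioned above

# Decode the byte string into a Unicode string.
text = raw_str.decode(myencoding)

# Once decoded, the string can be walked one character at a time.
chars = list(text)
```

After decoding, each element of chars is a single Chinese character rather than a single byte, which is what locating characters one by one relies on.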
This works just fine with a small sample Chinese
document.
But when I run the script on the entire corpus, I get
the typical "incomplete multibyte sequence" error or
"UnicodeEncodeError: 'ascii' codec can't encode
characters in position 0-23: ordinal not in
range(128)"
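Both messages can be reproduced on a small input, which may help narrow down where they come from. The sketch below is an assumption-laden illustration, not a diagnosis of the corpus: the truncated byte string is invented, and whether the real data is cut off mid-character or contains bytes outside the encoding is not known. It also assumes an interpreter recent enough to accept the "except ... as" syntax.

```python
# Hypothetical input: a complete GBK character followed by only the
# first byte of a second two-byte character (e.g. a read cut off mid-character).
truncated = b"\xd6\xd0\xce"

# Decoding a cut-off multibyte sequence raises UnicodeDecodeError.
try:
    truncated.decode("gbk")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc.reason

# Encoding Chinese text with the ascii codec raises UnicodeEncodeError;
# this can also happen implicitly when Unicode text is printed or written
# without an explicit encoding.
try:
    u"\u4e2d\u6587".encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc.reason

# Passing "replace" as the error handler substitutes a replacement
# character for undecodable bytes instead of raising.
lenient = truncated.decode("gbk", "replace")
```

The first error comes from the decoding direction (bytes to Unicode), the second from the encoding direction (Unicode to bytes), so they likely point at two different places in the script.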
I am at my wit's end, so frustrated at handling
non-ascii texts.
Any hint would be highly appreciated.