a question about Chinese characters in a Python Program

John Machin sjmachin at lexicon.net
Tue Oct 21 04:44:45 EDT 2008


On Oct 21, 1:45 am, Paul Boddie <p... at boddie.org.uk> wrote:
> From the Wikipedia page, it appears that you need to convert GB2312
> values to EUC-CN by a relatively straightforward process, and can then
> output the resulting byte sequence in an ASCII compatible way,
> provided that you filter out all the byte values greater than 127:
> these filtered bytes would produce nonsense for anyone using a program
> not expecting EUC-CN. UTF-8 has some similar properties, but as I
> noted above, you wouldn't want to read most of the output if your
> program wasn't expecting UTF-8.

What the Wikipedia page doesn't say is that the number of people who
grok the concept of a GB2312 codepoint is vanishingly small, and the
number of people who would actually have GB2312 codepoints in a file
is smaller still. When people say their data is GB2312, they mean
"GB<something> encoded as EUC-CN". So the relatively straightforward
process is not required in practice.
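
For reference, the conversion Paul mentions can be sketched as below.
This is a minimal illustration, assuming Python 3 and assuming the
"GB2312 codepoint" is given as a (row, cell) pair with each value in
1..94. The helper name kuten_to_euc_cn is made up for this example;
Python's "gb2312" codec already works on the EUC-CN form, which is why
the step is seldom needed.

```python
def kuten_to_euc_cn(row, cell):
    """Map a GB2312 (row, cell) pair, each in 1..94, to its two EUC-CN bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be in 1..94")
    # EUC-CN adds 0xA0 to each value, so both bytes end up in 0xA1..0xFE.
    return bytes([row + 0xA0, cell + 0xA0])

# Row 54, cell 48 is the character U+4E2D (中).
euc = kuten_to_euc_cn(54, 48)
print(euc)                    # b'\xd6\xd0'
print(euc.decode("gb2312"))   # 中
```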

I don't understand the point or value of filtering out all byte values
greater than 127:

If the data is really GB2312 (encoded as EUC-CN, where both bytes of
every Chinese character are greater than 127), this would throw out
all the Chinese characters.
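
A small sketch of that first case, assuming Python 3 and its built-in
"gb2312" codec (which produces EUC-CN bytes). The sample string is
invented for the illustration.

```python
text = "Python 中文 ok"
raw = text.encode("gb2312")      # EUC-CN bytes

# Every byte of a Chinese character has the high bit set in EUC-CN.
print(all(b > 127 for b in "中文".encode("gb2312")))   # True

# Drop every byte value greater than 127, as the proposed filter would.
filtered = bytes(b for b in raw if b < 128)
print(filtered)                  # b'Python  ok'
```

Only the ASCII text survives; both Chinese characters are gone.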

If the GB<something> is, as is likely, really GBK aka cp936 (a
superset of GB2312), then the second byte of a Chinese character may
be in the ASCII range, and the result of the filter would comprise the
true ASCII characters plus some garbage ASCII characters.



