Encode exception for chinese text

John Machin sjmachin at lexicon.net
Fri May 19 08:40:53 EDT 2006


1. *By definition*, you can encode *any* Unicode string into utf-8.
A successful utf-8 encode therefore proves nothing about the text.
2. \u00a0 [no-break space] has no equivalent in gb2312, nor in the
later gbk (alias cp936). It does have an equivalent in the latest
Chinese encoding, gb18030.
3. gb2312 is outdated. It is not really an "appropriate" charset for
anything much these days. You need to check out what your requirements
really are. The unknowing will cheerfully use "gb" to mean one or more
of gb2312, gbk and gb18030, or to mean "anything that's not big5" :-)
4. The slab of text you supplied is genuine unicode and encodes happily
into all those gb* charsets. It does *not* contain \u00a0.
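
Points 1, 2 and 4 are easy to check at the interpreter. What follows is
a minimal sketch, assuming Python 3 string syntax (in the Python 2 of
this thread the literals would be written u"..."). The helper name
can_encode and the two-character sample are made up for illustration;
the sample is not the text from the original question.

```python
# Which codecs can encode U+00A0 NO-BREAK SPACE?
NBSP = "\u00a0"
CODECS = ("utf-8", "gb2312", "gbk", "cp936", "gb18030")


def can_encode(text, codec):
    """Return True if text encodes into codec without an error."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# Point 2: try the no-break space against each codec.
nbsp_results = {codec: can_encode(NBSP, codec) for codec in CODECS}
for codec, ok in nbsp_results.items():
    print(codec, "ok" if ok else "UnicodeEncodeError")

# Point 4: ordinary Chinese text (hypothetical sample) has no such problem.
sample = "\u4e2d\u6587"  # the two characters for "Chinese language"
sample_results = {codec: can_encode(sample, codec) for codec in CODECS}
has_nbsp = NBSP in sample  # a quick way to look for the offending character
```

On CPython this reports an error for gb2312, gbk and cp936, and success
for utf-8 and gb18030. If the exception shows up with real data, a test
like NBSP in text will tell you whether the no-break space is the culprit.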

I do hope some of this helps ....

Cheers,
John



