problem with cjkcodecs on Mandrake linux +++

Skip Montanaro skip at pobox.com
Wed Mar 17 09:59:54 EST 2004


    Anthony> s = 'abc'
    Anthony> unicode(s, 'gbk')
    Anthony> print s # prints 'abc'

    [fails]

Anthony,

The above is a bit nonsensical, since you didn't actually modify s:
unicode() returns a new object, and you threw the result away.  I assume you
really meant:

    s = 'abc'
    s = unicode(s, 'gbk')
    print s

Remember the basic rule of Unicode?  If you don't know the encoding, you
don't know nuthin'.  Unicode objects themselves are encoding-neutral.  The
print statement has to encode s somehow (Unicode objects aren't displayed
directly), so it uses the system's default encoding, which from your earlier
messages appears to be "latin-1".
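
To make the "encoding-neutral" point concrete, here is a minimal sketch.  It
assumes Python 3 (the snippets in this message are Python 2), where a byte
string is a bytes object and unicode(s, 'gbk') is spelled s.decode('gbk').
The names raw and text are made up for illustration.

```python
# Assumption: Python 3, where unicode(raw, 'gbk') is written raw.decode('gbk').
raw = b'abc'                # byte string, e.g. as read from a file
text = raw.decode('gbk')    # interpret the bytes as GBK, get a text object

# Decoding returns a new object; the original bytes are untouched.
assert raw == b'abc'
assert isinstance(text, str)

# The text object does not remember that it came from 'gbk'.
# Pure-ASCII text happens to encode to the same bytes in all of these codecs.
for codec in ('ascii', 'latin-1', 'gbk', 'utf-8'):
    assert text.encode(codec) == b'abc'
```

Nothing here goes wrong because 'abc' is plain ASCII; the choice of codec
only starts to matter once the text contains other characters.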

Perhaps you're confused by

    s = unicode(s, 'gbk')

This says, "Convert the string s to a Unicode object assuming the string is
encoded using the 'gbk' charset, then bind the resulting object to s."  Note
that 'gbk' doesn't become an attribute of the Unicode object, so later on
when you try to print it

    print s

it needs to decide how to encode the object, and for that it uses the current
default encoding, typically "ascii".  In the case of "abc" that's no
problem.  For other code points in other character sets (I'm not sure I'm
using the terminology quite right there) you need to be explicit:

    print s.encode('gbk')

or use an appropriate system-wide default encoding.
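
A rough sketch of what "be explicit" buys you, again assuming Python 3
spellings rather than the Python 2 ones used above.  The sample bytes are the
GBK encoding of two Chinese characters, and the variable names are only for
illustration.

```python
# Assumption: Python 3.  These four bytes are two Chinese characters in GBK.
text = b'\xd6\xd0\xce\xc4'.decode('gbk')

# Explicit encoding: you choose the bytes that come out.
gbk_bytes = text.encode('gbk')      # round-trips to the original bytes
utf8_bytes = text.encode('utf-8')   # same characters, different bytes

# An ASCII-only encoding cannot represent these characters at all.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The same text yields different byte sequences under 'gbk' and 'utf-8', and
none at all under 'ascii', which is why the encoding has to be named at the
point where the text is turned back into bytes.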

Skip




