Some questions about decode/encode

"Martin v. Löwis" martin at v.loewis.de
Sun Jan 27 15:32:09 EST 2008


>> Is there any way to solve this better?
>> I mean if I shouldn't convert the GBK string to unicode string, what
>> should I do to make SAX work?
> 
> Decode it and then encode it to utf-8 before feeding it to the parser.

The tricky part is that you also need to change the encoding declaration
in doing so, but in this case, it should be fairly simple:

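# original_doc is the raw GBK-encoded byte string
unicode_doc = original_doc.decode("gbk")
# Rewrite only the first "gbk", which should be the one in the XML declaration
unicode_doc = unicode_doc.replace("gbk", "utf-8", 1)
utf8_doc = unicode_doc.encode("utf-8")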

This assumes that the first occurrence of the string "gbk" in the
document is the one in the encoding declaration, spelled exactly as in

<?xml version="1.0" encoding="gbk"?>

If the encoding name has a different spelling (e.g. GBK), you need to
cater for that as well. You might want to try replacing the entire
XML declaration (i.e. everything between <? and ?>), or just the
encoding= parameter. Notice that the encoding declaration may include
' instead of ", and may have additional spaces, e.g.

<?xml         version = '1.0'
              encoding= 'gbK' ?>
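
A more tolerant variant of the replacement could look roughly like the
sketch below. It is only an illustration, not a complete XML declaration
parser: the names recode_to_utf8, DECL_RE and ENC_RE are made up for this
example, and it assumes that original_doc is a byte string and that the
XML declaration, if there is one, sits at the very start of the document.

```python
import re

# Hypothetical helper names; nothing here comes from a library.
# The XML declaration, assumed to be the very first thing in the document.
DECL_RE = re.compile(r"\A<\?xml\s.*?\?>", re.DOTALL)
# The encoding pseudo-attribute: optional spaces around "=", either quote
# style, and any spelling of the encoding name.
ENC_RE = re.compile(r"""(encoding\s*=\s*)(["'])[A-Za-z][A-Za-z0-9._-]*\2""")


def recode_to_utf8(original_doc, source_encoding="gbk"):
    """Decode a byte string, point its XML declaration at utf-8, re-encode."""
    text = original_doc.decode(source_encoding)
    match = DECL_RE.match(text)
    if match:
        # Touch only the declaration, so a "gbk" in the body is left alone.
        new_decl = ENC_RE.sub(r"\g<1>\g<2>utf-8\g<2>", match.group(0), count=1)
        text = new_decl + text[match.end():]
    # With no declaration (or none naming an encoding) nothing is rewritten;
    # XML parsers then fall back to UTF-8, which matches the output.
    return text.encode("utf-8")


sample = u"<?xml         version = '1.0'\n              encoding= 'gbK' ?>\n<doc>\u4e2d\u6587 gbk</doc>".encode("gbk")
utf8_doc = recode_to_utf8(sample)
```

The result can then be fed to the parser in place of the original
string. Restricting the substitution to the declaration avoids touching
a "gbk" that happens to appear in the document text.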

HTH,
Martin


