Some questions about decode/encode

John Machin sjmachin at lexicon.net
Mon Jan 28 01:31:17 EST 2008


On Jan 28, 2:53 pm, glacier <rong.x... at gmail.com> wrote:
>
> Thanks,John.
> It's no doubt that you proved SAX didn't support GBK encoding.
> But can you give some suggestion on how to make SAX parse some GBK
> string?

Yes, the same suggestion as was given to you by others very early in
this thread, the same as I demonstrated in the middle of proving that
SAX doesn't support a GBK-encoded input file.

Suggestion: Recode your input from GBK to UTF-8. Ensure that the XML
declaration doesn't have an unsupported encoding. Your handler will
get data encoded as UTF-8. Recode that to GBK if needed.
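
For readers on Python 3, the three steps above can be sketched roughly as follows. This is only an illustration, not the script from this thread: the `TextCollector` handler and the `parse_gbk_fragment` helper are hypothetical names, and the sketch assumes the GBK text contains no `<` or `&` that would need escaping.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data as str."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces; keep them all.
        self.chunks.append(content)


def parse_gbk_fragment(gbk_bytes):
    """Recode GBK bytes to UTF-8, parse with SAX, return the text as GBK."""
    # Step 1: recode the input from GBK to UTF-8.
    utf8_bytes = gbk_bytes.decode('gbk').encode('utf-8')
    # Step 2: make sure the XML declaration names a supported encoding.
    doc = (b'<?xml version="1.0" encoding="utf-8"?><data>'
           + utf8_bytes + b'</data>')
    handler = TextCollector()
    xml.sax.parseString(doc, handler)
    # Step 3: recode what the handler received back to GBK if needed.
    return ''.join(handler.chunks).encode('gbk')


gbk_in = '\u4e00W\u4e01X'.encode('gbk')
gbk_out = parse_gbk_fragment(gbk_in)
print(gbk_out == gbk_in)
```

In Python 3 the handler receives `str` rather than UTF-8 bytes, so the final recoding is a single `encode('gbk')`.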

Here's a cut-down version of the previous script, focussed on
demonstrating that the recoding strategy works.

C:\junk>type gbksax2.py
import xml.sax, xml.sax.saxutils
import cStringIO
# Four CJK characters (U+4E00..U+4E03) interleaved with 'W'..'Z'
unistr = u''.join(unichr(0x4E00+i) + unichr(ord('W')+i) for i in range(4))
gbkstr = unistr.encode('gbk')
print 'This is a GBK-encoded string: %r' % gbkstr
# Recode GBK -> unicode -> UTF-8 before the parser sees it
utf8str = gbkstr.decode('gbk').encode('utf8')
print 'Now recoded as UTF-8 to be fed to a SAX parser: %r' % utf8str
xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""
utf8doc = xml_template % ('utf-8', utf8str)
f = cStringIO.StringIO()
# XMLGenerator echoes the SAX events back out as UTF-8-encoded XML
handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
xml.sax.parseString(utf8doc, handler)
result = f.getvalue()
f.close()
# Pull out the text between <data> and </data>
start = result.find('<data>') + 6
end = result.find('</data>')
mydata = result[start:end]
print "SAX output (UTF-8): %r" % mydata
print "SAX output recoded to GBK: %r" % mydata.decode('utf8').encode('gbk')

C:\junk>gbksax2.py
This is a GBK-encoded string: '\xd2\xbbW\xb6\xa1X\x81@Y\xc6\xdfZ'
Now recoded as UTF-8 to be fed to a SAX parser: '\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z'
SAX output (UTF-8): '\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z'
SAX output recoded to GBK: '\xd2\xbbW\xb6\xa1X\x81@Y\xc6\xdfZ'
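
The byte strings in this transcript can also be checked without SAX. The following is a small Python 3 sketch, not part of the original script; it rebuilds the same test string and confirms that GBK to UTF-8 and back is lossless for it.

```python
# Same test string as the script: U+4E00..U+4E03 interleaved with 'W'..'Z'.
unistr = ''.join(chr(0x4E00 + i) + chr(ord('W') + i) for i in range(4))

gbk_bytes = unistr.encode('gbk')
utf8_bytes = gbk_bytes.decode('gbk').encode('utf-8')
print(gbk_bytes)
print(utf8_bytes)

# Recoding the UTF-8 form back to GBK should reproduce the original bytes.
round_trip = utf8_bytes.decode('utf-8').encode('gbk')
print(round_trip == gbk_bytes)
```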

HTH,
John


