Some questions about decode/encode

glacier rong.xian at gmail.com
Mon Jan 28 06:10:25 EST 2008


On Jan 28, 2:31 pm, John Machin <sjmac... at lexicon.net> wrote:
> On Jan 28, 2:53 pm, glacier <rong.x... at gmail.com> wrote:
>
>
>
> > Thanks, John.
> > There's no doubt that you proved SAX doesn't support GBK encoding.
> > But can you suggest how to make SAX parse a GBK-encoded string?
>
> Yes, the same suggestion as was given to you by others very early in
> this thread, the same as I demonstrated in the middle of proving that
> SAX doesn't support a GBK-encoded input file.
>
> Suggestion: Recode your input from GBK to UTF-8. Ensure that the XML
> declaration doesn't have an unsupported encoding. Your handler will
> get data encoded as UTF-8. Recode that to GBK if needed.
>
> Here's a cut down version of the previous script, focussed on
> demonstrating that the recoding strategy works.
>
> C:\junk>type gbksax2.py
> import xml.sax, xml.sax.saxutils
> import cStringIO
> unistr = u''.join(unichr(0x4E00+i) + unichr(ord('W')+i) for i in range(4))
> gbkstr = unistr.encode('gbk')
> print 'This is a GBK-encoded string: %r' % gbkstr
> utf8str = gbkstr.decode('gbk').encode('utf8')
> print 'Now recoded as UTF-8 to be fed to a SAX parser: %r' % utf8str
> xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""
> utf8doc = xml_template % ('utf-8', unistr.encode('utf8'))
> f = cStringIO.StringIO()
> handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
> xml.sax.parseString(utf8doc, handler)
> result = f.getvalue()
> f.close()
> start = result.find('<data>') + 6
> end = result.find('</data>')
> mydata = result[start:end]
> print "SAX output (UTF-8): %r" % mydata
> print "SAX output recoded to GBK: %r" % mydata.decode('utf8').encode('gbk')
>
> C:\junk>gbksax2.py
> This is a GBK-encoded string: '\xd2\xbbW\xb6\xa1X\x81@Y\xc6\xdfZ'
> Now recoded as UTF-8 to be fed to a SAX parser: '\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z'
> SAX output (UTF-8): '\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z'
> SAX output recoded to GBK: '\xd2\xbbW\xb6\xa1X\x81@Y\xc6\xdfZ'
>
> HTH,
> John

Thanks a lot, John :)
I'll try it.
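
For anyone reading this on Python 3, the recoding strategy John describes (decode the GBK input, make sure the XML declaration names a supported encoding, feed UTF-8 to SAX, re-encode the handler's data to GBK if needed) might look roughly like the sketch below. It is not the script from the thread: TextCollector and gbk_xml_to_utf8 are hypothetical names made up for illustration, and the regular expression only copes with a simple XML declaration at the very start of the document.

```python
import re
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler: gathers all character data it is given."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # In Python 3 the handler receives str, not UTF-8 bytes.
        self.chunks.append(content)


def gbk_xml_to_utf8(gbk_doc):
    """Hypothetical helper: recode a GBK XML document (bytes) to UTF-8."""
    text = gbk_doc.decode('gbk')
    # Assumption: the declaration, if present, starts the document and
    # uses a plain quoted encoding name. Point it at UTF-8 instead.
    text = re.sub(r'^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2',
                  r'\g<1>\g<2>utf-8\g<2>', text, count=1)
    return text.encode('utf-8')


# Same test data as in the thread: four CJK characters interleaved with W..Z.
unistr = ''.join(chr(0x4E00 + i) + chr(ord('W') + i) for i in range(4))
gbk_doc = ('<?xml version="1.0" encoding="gbk"?><data>%s</data>'
           % unistr).encode('gbk')

handler = TextCollector()
xml.sax.parseString(gbk_xml_to_utf8(gbk_doc), handler)

text = ''.join(handler.chunks)   # what SAX delivered, as str
gbk_again = text.encode('gbk')   # recoded to GBK, if that is what you need
print('SAX output recoded to GBK: %r' % gbk_again)
```

The only GBK-aware steps are the decode before parsing and the encode afterwards; the parser itself only ever sees UTF-8.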


