Some questions about decode/encode

Sun Jan 27 16:50:15 EST 2008

On Jan 28, 7:47 am, "Mark Tolonen" <mark.e.tolo... at mailinator.com>
wrote:
> >"John Machin" <sjmac... at lexicon.net> wrote in message
> >news:eeb3a05f-c122-4b8c-95d8-d13741263374 at h11g2000prf.googlegroups.com...
> >On Jan 27, 9:17 pm, glacier <rong.x... at gmail.com> wrote:
> >> On Jan 24, 3:29 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
> >> wrote:
>
> >*IF* the file is well-formed GBK, then the codec will not mess up when
> >decoding it to Unicode. The usual cause of mess is a combination of a
> >human and a text editor :-)
>
> SAX uses the expat parser.  From the pyexpat module docs:
>
> Expat doesn't support as many encodings as Python does, and its repertoire
> of encodings can't be extended; it supports UTF-8, UTF-16, ISO-8859-1
> (Latin1), and ASCII. If encoding is given it will override the implicit or
> explicit encoding of the document.
>
> --Mark

Thank you for pointing out where that list of encodings had been
cunningly concealed. However the relevance of dropping it in as an
apparent response to my answer to the OP's question about decoding
possibly butchered GBK strings is .... what?

In any case, expat (as driven by xml.sax) seems to support other 8-bit
encodings that are not in that list, e.g. iso-8859-2 and koi8-r ...

C:\junk>type gbksax.py
import xml.sax, xml.sax.saxutils
import cStringIO

unistr = u''.join(unichr(0x4E00+i) + unichr(ord('W')+i) for i in range(4))
print 'unistr=%r' % unistr
gbkstr = unistr.encode('gbk')
print 'gbkstr=%r' % gbkstr
unistr2 = gbkstr.decode('gbk')
assert unistr2 == unistr

print "latin1 FF -> utf8 = %r" % '\xff'.decode('iso-8859-1').encode('utf8')
print "latin2 FF -> utf8 = %r" % '\xff'.decode('iso-8859-2').encode('utf8')
print "koi8r FF -> utf8 = %r" % '\xff'.decode('koi8-r').encode('utf8')

xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""

asciidoc = xml_template % ('ascii', 'The quick brown fox etc')
utf8doc = xml_template % ('utf-8', unistr.encode('utf8'))
latin1doc = xml_template % ('iso-8859-1', 'nil illegitimati carborundum' + '\xff')
latin2doc = xml_template % ('iso-8859-2', 'duo secundus' + '\xff')
koi8rdoc = xml_template % ('koi8-r', 'Moskva' + '\xff')
gbkdoc = xml_template % ('gbk', gbkstr)

for doc in (asciidoc, utf8doc, latin1doc, latin2doc, koi8rdoc, gbkdoc):
    f = cStringIO.StringIO()
    handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
    xml.sax.parseString(doc, handler)
    result = f.getvalue()
    f.close()
    print repr(result[result.find('<data>'):])

C:\junk>gbksax.py
unistr=u'\u4e00W\u4e01X\u4e02Y\u4e03Z'
gbkstr='\xd2\xbbW\xb6\xa1X\x81@Y\xc6\xdfZ'
latin1 FF -> utf8 = '\xc3\xbf'
latin2 FF -> utf8 = '\xcb\x99'
koi8r FF -> utf8 = '\xd0\xaa'
'<data>The quick brown fox etc</data>'
'<data>\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z</data>'
'<data>nil illegitimati carborundum\xc3\xbf</data>'
'<data>duo secundus\xcb\x99</data>'
'<data>Moskva\xd0\xaa</data>'
Traceback (most recent call last):
  File "C:\junk\gbksax.py", line 27, in <module>
    xml.sax.parseString(doc, handler)
  File "C:\Python25\lib\xml\sax\__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Python25\lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "C:\Python25\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:30: unknown encoding

C:\junk>
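
One way around the "unknown encoding" failure above is to let Python do
the GBK decoding and hand the parser UTF-8 instead. What follows is only
a minimal sketch, not part of the script above: it is written for
Python 3 rather than the Python 2.5 used in the transcript, the names
transcode_gbk_xml and TextCollector are made up for illustration, and it
assumes the declaration is spelled exactly encoding="gbk" in the prolog.

```python
import xml.sax
import xml.sax.handler


class TextCollector(xml.sax.handler.ContentHandler):
    """Hypothetical helper: gathers all character data seen by the parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so accumulate them.
        self.chunks.append(content)

    @property
    def text(self):
        return ''.join(self.chunks)


def transcode_gbk_xml(doc_bytes):
    """Hypothetical helper: re-encode a GBK XML document as UTF-8."""
    # Python's own codec does the GBK decoding (strict error handling).
    text = doc_bytes.decode('gbk')
    # Assumption: the declaration is spelled exactly encoding="gbk".
    text = text.replace('encoding="gbk"', 'encoding="utf-8"', 1)
    return text.encode('utf-8')


# Same test string as in gbksax.py: four CJK characters interleaved with W..Z.
unistr = ''.join(chr(0x4E00 + i) + chr(ord('W') + i) for i in range(4))
gbkdoc = ('<?xml version="1.0" encoding="gbk"?><data>%s</data>'
          % unistr).encode('gbk')

collector = TextCollector()
xml.sax.parseString(transcode_gbk_xml(gbkdoc), collector)
print(ascii(collector.text))
```

If the input is not well-formed GBK, the decode step raises
UnicodeDecodeError before the parser ever sees the document, which at
least points at the data rather than at the XML machinery.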