Daily WTF with XML, or error handling in SAX

mrkafk at gmail.com
Sat May 3 16:50:10 EDT 2008


So I set out to learn how to handle three-letter-acronym files (XML) in
Python. SAX worked nicely until I ran into badly formed XML, such as
files with bad characters in them (Unicode is supposed to handle it all,
but apparently doesn't). My example data is
http://dchublist.com/hublist.xml.bz2, and the goal is to extract the
Users and Address attributes of every hub whose number of Users is
greater than a given number.

So I extended my First XML Example with an error handler:

# ========= snip ===========
from xml.sax import make_parser
from xml.sax.handler import ContentHandler
from xml.sax.handler import ErrorHandler

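class HubHandler(ContentHandler):
    def __init__(self, hublist):
        ContentHandler.__init__(self)
        self.Address = ''
        self.Users = ''
        # Keep a reference to the caller's list instead of relying on the global.
        self.hl = hublist
    def startElement(self, name, attrs):
        # Only <Hub> elements carry the attributes we want.
        if name == "Hub":
            self.Address = attrs.get('Address', "")
            self.Users = attrs.get('Users', "")
    def endElement(self, name):
        # isdigit() guards against a missing or non-numeric Users attribute.
        if name == "Hub" and self.Users.isdigit() and int(self.Users) > 2000:
            #print self.Address, self.Users
            self.hl.append({self.Address: int(self.Users)})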

class HubErrorHandler(ErrorHandler):
    def error(self, exception):
        print "Error, exception: %s\n" % exception
    def fatalError(self, exception):
        # Only prints; the exception is not re-raised.
        print "Fatal Error, exception: %s\n" % exception

hl = []

parser = make_parser()

hHandler = HubHandler(hl)
errHandler = HubErrorHandler()

parser.setContentHandler(hHandler)
parser.setErrorHandler(errHandler)

fh = open('hublist.xml')
parser.parse(fh)

# Each entry is a one-item dict {Address: Users}; sort by user count, largest first.
hl.sort(key=lambda d: d.values()[0], reverse=True)

for h in hl:
    print h.keys()[0], "   ", h.values()[0]
# ========= snip ===========

And then BAM, Pythonwin hit me with this:


>>> execfile('ph.py')
Fatal Error, exception: hublist.xml:2247:11: not well-formed (invalid
token)

Fatal Error, exception: hublist.xml:2247:11: not well-formed (invalid
token)

Fatal Error, exception: hublist.xml:2247:11: not well-formed (invalid
token)

Fatal Error, exception: hublist.xml:2247:11: not well-formed (invalid
token)

Fatal Error, exception: hublist.xml:2247:11: not well-formed (invalid
token)


>>> ================================ RESTART ================================

Just before the "RESTART" line, Windows announced that it had killed
the pythonw.exe process (I suppose it was a child process).

WTF is happening here? Wasn't the fatalError method in HubErrorHandler
supposed to handle the invalid tokens? And why is the message repeated
so many times? My method apparently gets called, but then something in
SAX goes awry and the interpreter crashes.
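
For reference, the sketch below shows one way to make the parser stop at
the first fatal error: a handler whose fatalError re-raises the exception
it receives, which is what the base ErrorHandler.fatalError does by
default. It is written for Python 3 (unlike the Python 2 script above),
and RaisingErrorHandler, parse_hubs and the sample document are made-up
names and data for illustration, not part of the original script. A
plausible reading of the repeated message is that, when fatalError
returns normally, the parser keeps feeding the remaining chunks of the
file to expat and each chunk reports the same error again; that is an
assumption, and the sketch does not try to explain the crash itself.

```python
import io
from xml.sax import make_parser, SAXParseException
from xml.sax.handler import ContentHandler, ErrorHandler


class RaisingErrorHandler(ErrorHandler):
    """Hypothetical handler: records recoverable errors, re-raises fatal ones."""

    def __init__(self):
        self.messages = []

    def error(self, exception):
        # Recoverable error: record it and let the parser continue.
        self.messages.append("Error, exception: %s" % exception)

    def fatalError(self, exception):
        # Record it, then raise so that parse() stops at the first fatal error.
        self.messages.append("Fatal Error, exception: %s" % exception)
        raise exception


def parse_hubs(stream, min_users=2000):
    """Return (hubs, messages); hubs seen before a fatal error are kept."""
    hubs = []

    class HubHandler(ContentHandler):
        def startElement(self, name, attrs):
            users = attrs.get('Users', '')
            if name == 'Hub' and users.isdigit() and int(users) > min_users:
                hubs.append((attrs.get('Address', ''), int(users)))

    err_handler = RaisingErrorHandler()
    parser = make_parser()
    parser.setContentHandler(HubHandler())
    parser.setErrorHandler(err_handler)
    try:
        parser.parse(stream)
    except SAXParseException:
        pass  # already recorded by fatalError
    return hubs, err_handler.messages


# Made-up sample: the second Hub contains a control character (0x01),
# which is not allowed in XML 1.0.
sample = (b'<Hublist><Hubs>'
          b'<Hub Address="a.example.org" Users="3000"/>'
          b'<Hub Address="b\x01.example.org" Users="5000"/>'
          b'</Hubs></Hublist>')
hubs, messages = parse_hubs(io.BytesIO(sample))
print(hubs)
print(messages)
```

With this arrangement the hubs read before the bad character are still
available, and the fatal error is recorded once instead of being printed
for every remaining chunk of input.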




