[XML-SIG] stripping 8-bit ASCII from an XML stream using encode?

Kevin Altis altis@semi-retired.com
Fri, 4 Jan 2002 22:20:37 -0800


The code at the end of this message demonstrates a problem I'm having
parsing XML files downloaded from SourceForge. The specific issue is that
there are characters in the XML stream above ASCII 127 (decimal 233 and 246
in the example file).
The cleanText function is supposed to strip the problem characters. Some of
the files contain non-printing ASCII characters below decimal 32, so I'm
trying to strip those manually after the XML is converted. Once the file is
parsed I'm using the fields in a GUI interface to the SourceForge tracker
database.
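
To make the goal concrete, here is a minimal sketch of the stripping I have
in mind, working on raw bytes rather than going through encode. It is written
for a current Python 3 interpreter, not the Python 2 used in the code at the
end of this message, and the names strip_bytes and KEEP are made up for
illustration. Which bytes to keep (tab, newline, carriage return, and
printable 7-bit ASCII) is my assumption, not anything SourceForge specifies.

```python
# Byte values to keep: tab (9), newline (10), carriage return (13),
# and printable 7-bit ASCII (32 through 126).
KEEP = frozenset([9, 10, 13]) | frozenset(range(32, 127))

def strip_bytes(data):
    """Return data with every byte outside KEEP removed.

    data must be a bytes object; iterating over bytes yields integers.
    """
    return bytes(b for b in data if b in KEEP)

if __name__ == '__main__':
    # 0xE9 is decimal 233; 0x13 is decimal 19.
    print(strip_bytes(b'caf\xe9 \x13ok\n'))
```

This throws the accented characters away entirely instead of converting
them, which is acceptable for my GUI but loses information.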

Here is the URL for the XML version of the Python Feature Requests:
http://sourceforge.net/export/sf_tracker_export.php?atid=355470&group_id=5470

I thought that encode could be used to strip these characters, but I
sometimes get the following traceback.

    t = t.encode('ascii', 'ignore')
UnicodeError: ASCII decoding error: ordinal not in range(128)

I haven't done much XML processing, so this could be a FAQ, but I haven't
been able to find the answer so far. What is the proper way to strip the
8-bit values? Is there another issue at work here?

Thanks,

ka
---
Kevin Altis
altis@semi-retired.com
---
example code by Mark Pilgrim:

def cleanText(t, collapseWhitespace=0):
    t = t.encode('ascii', 'ignore')
    t = t.replace(chr(19), '')
    if collapseWhitespace:
        t = t.replace('\t', '').replace('\n', '')
    return t

def getText(node, collapseWhitespace=0):
    return cleanText("".join([c.data for c in node.childNodes
                              if c.nodeType == c.TEXT_NODE]),
                     collapseWhitespace)

def doParse(xml):
    from xml.dom import minidom
    xml = cleanText(xml)
    xmldoc = minidom.parseString(xml)
    artifacts = xmldoc.getElementsByTagName('artifact')
    trackerDict = {}
    for a in artifacts:
        trackerDict[a.attributes["id"].value] = \
            {"summary":getText(a.getElementsByTagName("summary")[0],
                               collapseWhitespace=1),
             "detail":getText(a.getElementsByTagName('detail')[0])}
    return trackerDict

if __name__ == '__main__':
    # KEA
    # added code to download URL and save the file
    # comment out once you have the XML
    # the XML file is approximately 174K
    import urllib
    url = ('http://sourceforge.net/export/sf_tracker_export.php'
           '?atid=355470&group_id=5470')
    fp = urllib.urlopen(url)
    xml = fp.read()
    fp.close()
    filename = r'Python_FeatureRequests.xml'
    op = open(filename, 'wb')
    op.write(xml)
    op.close()

    fsock = open(filename)
    xml = fsock.read()
    fsock.close()
    trackerDict = doParse(xml)
    import pprint
    pprint.pprint(trackerDict)