Demo/xml/roundtrip.py

Richard West rwest at opti.cgi.net
Sat Sep 8 16:02:02 EDT 2001


I'm trying to parse the ODP (dmoz.org) RDF files with Python but I
seem to be running into some problems.  I've compiled Expat 1.2 in but
I'm getting errors that I believe are related to the character set.
The sample testing document can be found here:

http://dmoz.org/rdf/structure.example.txt

I'm just trying to run the document through roundtrip.py to check it
out before I attempt to use the data.

As is, Python dumps a traceback at line 674 of the document:

<Topic r:id="Top/World">
  <tag catid="16"></tag>
  <d:Title>World</d:Title>
  Traceback (most recent call last):
  File "roundtrip.py", line 45, in ?
    parser.parse(sys.argv[1])
  File "/usr/local/lib/python2.1/xml/sax/expatreader.py", line 43, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/local/lib/python2.1/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/local/lib/python2.1/xml/sax/expatreader.py", line 92, in feed
    self._err_handler.fatalError(exc)
  File "/usr/local/lib/python2.1/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: test.txt:675:2: not well-formed (invalid token)


If I remove line 674 entirely, it fails again at line 678, this time
with a different error:

<Topic r:id="Top/World">
  <tag catid="16"></tag>
  <d:Title>World</d:Title>
  <narrow r:resource="Top/World/Chinese"></narrow>
  <narrow r:resource="Top/World/Deutsch"></narrow>
  <narrow r:resource="Top/World/Czech"></narrow>
  <narrow r:resource="Top/World/Bulgarian"></narrow>
  Traceback (most recent call last):
  File "roundtrip.py", line 45, in ?
    parser.parse(sys.argv[1])
  File "/usr/local/lib/python2.1/xml/sax/expatreader.py", line 43, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/local/lib/python2.1/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/local/lib/python2.1/xml/sax/expatreader.py", line 87, in feed
    self._parser.Parse(data, isFinal)
UnicodeError: UTF-8 decoding error: invalid data



According to dmoz.org the files are UTF-8 encoded.  I've been using
Python for a while now, but I'm new to this whole XML and Unicode stuff.
Can anyone point me in the right direction?
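
One way to test the UTF-8 claim independently of Expat might be to
decode the raw bytes directly and see where decoding first fails.  The
following is only a minimal sketch, not part of roundtrip.py: it assumes
a current Python 3 interpreter rather than 2.1, and the function name
find_invalid_utf8 is made up for illustration.

```python
def find_invalid_utf8(data):
    """Locate the first byte sequence in `data` that is not valid UTF-8.

    `data` must be a bytes object (read the file in binary mode).
    Returns (line, column, offending_bytes) with a 1-based line number
    and a 0-based byte column, or None if everything decodes cleanly.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset where decoding failed.
        line = data.count(b"\n", 0, exc.start) + 1
        line_start = data.rfind(b"\n", 0, exc.start) + 1
        column = exc.start - line_start
        return line, column, data[exc.start:exc.end]
    return None
```

Calling it on the contents of the sample file, opened with mode "rb",
would show whether the lines the parser complains about really contain
bytes that are not UTF-8 (for example Latin-1 characters), which would
point at the data rather than at the parser.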



--Richard West




