Demo/xml/roundtrip.py
Richard West
rwest at opti.cgi.net
Sat Sep 8 16:02:02 EDT 2001
I'm trying to parse the ODP (dmoz.org) RDF files with Python but I
seem to be running into some problems. I've compiled Expat 1.2 in but
I'm getting errors that I believe are related to the character set.
The sample testing document can be found here:
http://dmoz.org/rdf/structure.example.txt
I'm just trying to run the document through roundtrip.py to check it
out before I attempt to use the data.
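
For reference, the kind of well-formedness check described here can be sketched as follows. This is not the roundtrip.py script itself. It assumes a modern Python 3 with the standard xml.sax module, and the names CountingHandler and check_well_formed are hypothetical, introduced only for illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class CountingHandler(ContentHandler):
    """Hypothetical stand-in for a real handler: it only counts start tags."""

    def __init__(self):
        super().__init__()
        self.elements = 0

    def startElement(self, name, attrs):
        self.elements += 1


def check_well_formed(data):
    """Parse bytes with SAX.

    Returns (True, element_count) on success, or
    (False, "line:column: message") if the parser reports a fatal error.
    """
    handler = CountingHandler()
    try:
        xml.sax.parse(io.BytesIO(data), handler)
    except xml.sax.SAXParseException as exc:
        where = "%s:%s" % (exc.getLineNumber(), exc.getColumnNumber())
        return False, "%s: %s" % (where, exc.getMessage())
    return True, handler.elements
```

Reading the file in binary mode and passing its contents to check_well_formed() gives the parser's line and column for the first fatal error without a full traceback.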
As is, the parser fails with a traceback at line 674 of the document:
<Topic r:id="Top/World">
<tag catid="16"></tag>
<d:Title>World</d:Title>
Traceback (most recent call last):
  File "roundtrip.py", line 45, in ?
    parser.parse(sys.argv[1])
  File "/usr/local/lib/python2.1/xml/sax/expatreader.py", line 43, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/local/lib/python2.1/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/local/lib/python2.1/xml/sax/expatreader.py", line 92, in feed
    self._err_handler.fatalError(exc)
  File "/usr/local/lib/python2.1/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: test.txt:675:2: not well-formed (invalid token)
If I remove line 674 entirely, it fails again at line 678, this time
with a different error:
<Topic r:id="Top/World">
<tag catid="16"></tag>
<d:Title>World</d:Title>
<narrow r:resource="Top/World/Chinese"></narrow>
<narrow r:resource="Top/World/Deutsch"></narrow>
<narrow r:resource="Top/World/Czech"></narrow>
<narrow r:resource="Top/World/Bulgarian"></narrow>
Traceback (most recent call last):
  File "roundtrip.py", line 45, in ?
    parser.parse(sys.argv[1])
  File "/usr/local/lib/python2.1/xml/sax/expatreader.py", line 43, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/local/lib/python2.1/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/local/lib/python2.1/xml/sax/expatreader.py", line 87, in feed
    self._parser.Parse(data, isFinal)
UnicodeError: UTF-8 decoding error: invalid data
According to dmoz.org the files are UTF-8 encoded. I've been using
Python for a while now but I'm new to this whole XML and Unicode stuff.
Can anyone point me in the right direction?
--Richard West
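
One way to test the claim that the file is UTF-8 is to decode it line by line and report where decoding fails. The sketch below is only a diagnostic under stated assumptions: it targets modern Python 3 (not the Python 2.1 shown in the tracebacks), the function name find_invalid_utf8 is hypothetical, and it reports only the first bad byte sequence on each line.

```python
def find_invalid_utf8(data):
    """Scan bytes line by line for sequences that are not valid UTF-8.

    Returns a list of (line_number, byte_offset_in_line, bad_bytes) tuples,
    with line numbers starting at 1. Only the first failure per line is kept.
    """
    problems = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start and exc.end delimit the offending bytes in this line.
            problems.append((lineno, exc.start, line[exc.start:exc.end]))
    return problems
```

If this reports any lines, the data is not valid UTF-8 at those positions, whatever encoding the source claims; a single high byte such as 0xE4 would be consistent with Latin-1 text, though that is a guess and not something the tracebacks confirm.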