XML file parsing with SAX

Uche Ogbuji uche.ogbuji at gmail.com
Sat Apr 23 15:48:49 EDT 2005


On Sat, 2005-04-23 at 15:20 +0200, Willem Ligtenberg wrote:
> I decided to use SAX to parse my XML file.
> But the parser crashes on:
>   File "/usr/lib/python2.3/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
>     raise exception
> xml.sax._exceptions.SAXParseException: NCBI_Entrezgene.dtd:8:0: error in processing external entity reference
> 
> This is caused by:
> <!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN"
> "NCBI_Entrezgene.dtd">
> 
> If I remove it, it parses normally.
> I've created my parser like this:
> import sys
> from xml.sax import make_parser
> from handler import EntrezGeneHandler
> 
> fopen = open("mouse2.xml", "r")
> ch = EntrezGeneHandler()
> saxparser = make_parser()
> saxparser.setContentHandler(ch)
> saxparser.parse(fopen)
> 
> And the handler is:
> from xml.sax import ContentHandler
> 
> class EntrezGeneHandler(ContentHandler):
> 	"""
> 	A handler to deal with EntrezGene in XML
> 	"""
> 	
> 	def startElement(self, name, attrs):
> 		print "Start element:", name
> 
> So it doesn't do much yet. And still it crashes...
> How can I tell the parser not to look at the DOCTYPE declaration?
> On a website:
> http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/1/
> it states that the SAX parsers are not validating, so this error shouldn't
> even occur?

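Just because the parser is not validating doesn't mean that it won't try
to read the external entity. Here that entity is the DTD named in your
DOCTYPE declaration, and the error says the parser tried to process it
and failed.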

Maybe you're looking for this feature:

"""
feature_external_ges
        Value: "http://xml.org/sax/features/external-general-entities" 
        true: Include all external general (text) entities. 
        false: Do not include external general entities. 
        access: (parsing) read-only; (not parsing) read/write
"""

Quoted from:

http://docs.python.org/lib/module-xml.sax.handler.html
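
As a rough sketch of how that feature could be switched off, assuming the
stock expat-backed parser that make_parser() normally returns: the
NameCollector class, the parse_without_external_entities helper and the
inline sample document are hypothetical stand-ins for your handler and
mouse2.xml, not part of any library. Recent Python releases already default
this feature to off, so there the explicit call mostly documents intent.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, feature_external_ges


class NameCollector(ContentHandler):
    """Hypothetical handler: records element names instead of printing them."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


# Stand-in for mouse2.xml; the referenced DTD does not exist on disk.
SAMPLE = (
    b'<?xml version="1.0"?>\n'
    b'<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN"\n'
    b'"NCBI_Entrezgene.dtd">\n'
    b'<Entrezgene-Set><Entrezgene/></Entrezgene-Set>'
)


def parse_without_external_entities(source):
    """Parse a file-like object with external entity loading turned off."""
    handler = NameCollector()
    parser = make_parser()
    # Features can only be changed before parsing starts.
    parser.setFeature(feature_external_ges, False)
    parser.setContentHandler(handler)
    parser.parse(source)
    return handler.names


names = parse_without_external_entities(io.BytesIO(SAMPLE))
print(names)
```

With the feature off, the parser should skip the missing DTD rather than
raise SAXParseException. Other SAX drivers may not honour the feature in
the same way.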

But you're on pretty shaky ground in any XML 1.x toolkit when the document
type declaration (DTDecl) points at a DTD that can't be resolved, as it
does here.  Why go through the hassle?  Why not use a catalog, or remove
the DTDecl?
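
If removing the declaration by hand is not practical, one possible sketch
is to strip it in memory before parsing. This assumes the declaration has
no internal subset (no "[ ... ]" part), which holds for the DOCTYPE quoted
above. DOCTYPE_RE, strip_doctype and ElementNames are hypothetical names
made up for this illustration, and the inline document stands in for the
real file.

```python
import re
import xml.sax
from xml.sax.handler import ContentHandler

# Matches a DOCTYPE declaration that has no internal subset.
# The negated class also matches newlines, so a declaration
# split across lines is covered.
DOCTYPE_RE = re.compile(br"<!DOCTYPE[^>\[]*>")


class ElementNames(ContentHandler):
    """Hypothetical handler: records the name of every start tag."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def strip_doctype(data):
    """Return the document bytes with the first DOCTYPE declaration removed."""
    return DOCTYPE_RE.sub(b"", data, count=1)


# Stand-in for the contents of the real file.
RAW = (
    b'<?xml version="1.0"?>\n'
    b'<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN"\n'
    b'"NCBI_Entrezgene.dtd">\n'
    b'<Entrezgene-Set><Entrezgene/></Entrezgene-Set>'
)

cleaned = strip_doctype(RAW)
handler = ElementNames()
xml.sax.parseString(cleaned, handler)
print(handler.names)
```

Note that any entities the DTD would have defined are lost this way, so
this only suits documents that don't rely on them.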


-- 
Uche Ogbuji                               Fourthought, Inc.
http://uche.ogbuji.net                    http://fourthought.com
http://copia.ogbuji.net                   http://4Suite.org
Use CSS to display XML, part 2 - http://www-128.ibm.com/developerworks/edu/x-dw-x-xmlcss2-i.html
XML Output with 4Suite & Amara - http://www.xml.com/pub/a/2005/04/20/py-xml.html
Use XSLT to prepare XML for import into OpenOffice Calc - http://www.ibm.com/developerworks/xml/library/x-oocalc/
Schema standardization for top-down semantic transparency - http://www-128.ibm.com/developerworks/xml/library/x-think31.html



