[XML-SIG] parsers and XML

travish travish@realtime.net
Mon, 7 Aug 2000 18:52:44 -0500 (CDT)


Hi... I was taking a look at some of the docs, code, and examples,
and was a bit surprised about a number of things.  Below are some
comments, problems, diffs, etc.  You may already know some of this.

a) most of the XML "parsers" appear to act as lexers

b) none of the examples are of sufficient/substantial complexity
   (e.g. recursive nesting, deep/complex hierarchy)

   If anyone has suggestions on what kind of parser to use as a back end
   (yapps?  kjParsing?  etc.) I'd be interested to hear it.

c) SGMLOP's description is substantially misleading:

 http://www.garshol.priv.no/download/xmltools/prod/sgmlop.html

 sgmlop is meant to behave identically with the sgmllib and xmllib
 modules and replace them invisibly if it is present, so that one
 does not have to change any code to use them. The saxlib package
 has a SAX 1.0 driver for sgmlop.

This does not appear to be correct; see the diffs below.
In addition, sgmlop passes a single complete string to the character
data callback rather than (string, offset, length).
You may thank Zope for the info that led to a working sgmlop driver.

d) xmltok: no driver drv_xmltok
e) XMLtoolkit: no module named XMLFactory
f) xmldc: no module named xml_dc
g) Relative speeds on my hardware (normalized to xmllib = 1.00):
	sgmlop (C module)      4.15
	pyexpat (C module)     2.60
	xmlproc (python)       1.21
	xmllib (python)        1.00
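
On point c), the mismatch in the character data callback can be bridged
with a small adapter.  What follows is only a sketch: CharDataAdapter and
RecordingHandler are made-up names, and it assumes the SAX 1.0 convention
that the document handler receives characters(ch, start, length) while
the underlying parser delivers one complete string per call.

```python
class RecordingHandler:
    """Stand-in for a SAX 1.0 style document handler (hypothetical name)."""

    def __init__(self):
        self.chunks = []

    def characters(self, ch, start, length):
        # SAX 1.0 style: the text is the slice ch[start:start + length].
        self.chunks.append(ch[start:start + length])


class CharDataAdapter:
    """Turns a whole-string data callback into characters(ch, start, length)."""

    def __init__(self, doc_handler):
        self.doc_handler = doc_handler

    def handle_data(self, data):
        # The parser hands over a complete string, so the offset is 0
        # and the length is the length of the whole string.
        self.doc_handler.characters(data, 0, len(data))


handler = RecordingHandler()
adapter = CharDataAdapter(handler)
adapter.handle_data("hello ")
adapter.handle_data("world")
result = "".join(handler.chunks)
```

Registering the adapter rather than the document handler itself keeps
the handler's interface unchanged, at the cost of one extra call per
chunk of text.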


Here you go:

--- drv_pyexpat.py.orig Mon Jul 17 11:17:56 2000
+++ drv_pyexpat.py      Mon Jul 17 12:39:25 2000
@@ -34,6 +34,7 @@
         self.parser.EndElementHandler = self.endElement
         self.parser.CharacterDataHandler = self.characters
         self.parser.ProcessingInstructionHandler = self.processingInstruction
+        self.unfed_so_far = 1
 
     def startElement(self,name,attrs):
         # Backward compatibility code, for older versions of the 
@@ -118,12 +119,18 @@
         self.parser.EndElementHandler = self.endElement
         self.parser.CharacterDataHandler = self.characters
         self.parser.ProcessingInstructionHandler = self.processingInstruction
+        self.unfed_so_far = 1
     
     def feed(self,data):
+        if self.unfed_so_far:
+            self.doc_handler.startDocument()
+            self.unfed_so_far = 0
+
         if not self.parser.Parse(data):
             self.__report_error()
 
     def close(self):
+        self.doc_handler.endDocument()
         if not self.parser.Parse("",1):
             self.__report_error()
         self.parser = None
--- drv_sgmlop.py.orig  Mon Jul 17 12:57:21 2000
+++ drv_sgmlop.py       Mon Jul 17 16:39:39 2000
@@ -7,28 +7,20 @@
 from xml.parsers import sgmlop
 from xml.sax import saxlib,saxutils
 import urllib
-
-class DHWrapper:
-
-    def __init__(self,real_dh):
-        self.real_dh=real_dh
-
-    def __getattr__(self,attr):
-        return getattr(self.real_dh,attr)
-
-    def startElement(self,name,attrs):
-        self.real_dh.startElement(name,saxutils.AttributeMap(attrs))
         
 class Parser(saxlib.Parser):
 
     def __init__(self):
         saxlib.Parser.__init__(self)
         self.parser = sgmlop.XMLParser()
+        self.unfed_so_far = 1
     
-    def setDocumentHandler(self, dh):
-        #self.parser.register(DHWrapper(dh), 1)
-        #self.parser.register(DHWrapper(dh))
-        self.parser.register(dh)
+    def setDocumentHandler(self, dh):
+        # setup callbacks
+        self.finish_starttag = dh.startElement
+        self.finish_endtag = dh.endElement
+        self.handle_data = dh.data
+        self.parser.register(self)
         self.doc_handler=dh
 
     def parse(self, url):
@@ -64,8 +56,13 @@
 
     def reset(self):
         self.parser=sgmlop.XMLParser()
+        self.unfed_so_far = 1
     
     def feed(self,data):
+        if self.unfed_so_far:
+            self.doc_handler.startDocument()
+            self.unfed_so_far = 0
+
         self.parser.feed(data)
 
     def close(self):
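
The idea shared by both patches (fire startDocument on the first feed,
endDocument on close) can be tried in isolation.  This is a minimal
sketch, not the saxlib driver itself: it uses the standard library's
xml.parsers.expat, and IncrementalDriver and EventLog are made-up names.
Unlike the pyexpat patch above, it calls endDocument after the final
Parse call so that no parser events can arrive after it; that ordering
is a choice made for this sketch.

```python
import xml.parsers.expat


class EventLog:
    """Hypothetical document handler that records the events it receives."""

    def __init__(self):
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        self.events.append("start:" + name)

    def endElement(self, name):
        self.events.append("end:" + name)


class IncrementalDriver:
    """Wraps an expat parser and adds document start/end notifications."""

    def __init__(self, doc_handler):
        self.doc_handler = doc_handler
        self.reset()

    def reset(self):
        # A fresh parser is needed for every document.
        self.parser = xml.parsers.expat.ParserCreate()
        self.parser.StartElementHandler = self.doc_handler.startElement
        self.parser.EndElementHandler = self.doc_handler.endElement
        self.unfed_so_far = True

    def feed(self, data):
        # Announce the document exactly once, on the first chunk.
        if self.unfed_so_far:
            self.doc_handler.startDocument()
            self.unfed_so_far = False
        self.parser.Parse(data, False)

    def close(self):
        # Flush the parser first, then announce the end of the document.
        self.parser.Parse("", True)
        self.doc_handler.endDocument()
        self.parser = None


log = EventLog()
driver = IncrementalDriver(log)
driver.feed("<a><b>")
driver.feed("</b></a>")
driver.close()
events = log.events
```

Feeding the document in two chunks shows that startDocument is reported
only once, however many times feed is called.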