[XML-SIG] Mr. Nitpicker looks at saxlib

Fredrik Lundh fredrik@pythonware.com
Thu, 28 May 1998 22:58:25 +0100


(sorry, but I've got distracted.  Darn those paying customers ;-)

Here are some comments on saxlib, based on the HOWTO document,
a quick look at the sources, and some experiences from the sgmlop-
based coreXML parser I've written for our RDE and MIOW projects.

(I really should have looked closer at the sources, and read the SAX
spec again, but will probably not get around to doing that before the
weekend...  feel free to flame away if I've misunderstood everything)

important issues
-------------------

1. Performance #1: Should the "characters" method really take start/length
  arguments?

  I suppose this is a direct mapping of the Java SAX spec, but it has one
  serious drawback: the string slicing operator copies the string, which
  means that you'll end up with an extra string copy when you use fast
  parsers like sgmlop and pyexpat:

    - parser copies data into a python string
    - driver calls "characters" with string, start=0, and length=len(string)
    - user-defined class does string[start:start+length], which copies
      the string again
    (- the user class does self.data = self.data + string[...], which copies
      the string yet another time. sigh...)

  I'd say we might as well get rid of those two arguments, and leave it
  to the parser to slice and dice.

  Or if you insist, you could at least change start/length to start/end...
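
  To make the extra copies concrete, here is a minimal sketch. Both handler
  classes and the drive() helper are hypothetical and not part of saxlib:
  the first mimics the start/length signature, the second takes a ready-made
  string and keeps the chunks in a list so they are joined only once.

```python
class SlicingHandler:
    """Hypothetical handler using the Java-style (data, start, length) call."""

    def __init__(self):
        self.data = ""

    def characters(self, data, start, length):
        # Slicing a sub-range creates a new string object, and the
        # concatenation then creates yet another one.
        self.data = self.data + data[start:start + length]


class DirectHandler:
    """Hypothetical handler that receives an already-sliced string."""

    def __init__(self):
        self.chunks = []

    def characters(self, data):
        # No slicing here; the chunk is stored as it arrives.
        self.chunks.append(data)

    def text(self):
        # One join at the end instead of a concatenation per callback.
        return "".join(self.chunks)


def drive(slicing_handler, direct_handler, pieces):
    # Stand-in for a driver: each piece is what the parser hands over.
    for piece in pieces:
        slicing_handler.characters(piece, 0, len(piece))
        direct_handler.characters(piece)
```

  Both handlers end up with the same text; the difference is only in how
  many intermediate strings get built along the way.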

2. Usability: There's no "feed" method.  While it is perfectly valid to assume
  threading for Java, I don't think this is a valid requirement for Python code.
  Since sgmlop, xmllib, and pyexpat all support incremental parsing (and since
  our stuff is event-driven...), it would be good if saxlib exposed these
  methods in some way.
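
  A rough sketch of what a "feed"-style interface could look like. The
  FeedParser and Recorder classes are hypothetical names, not saxlib API;
  the sketch assumes the expat binding is importable as xml.parsers.expat
  (as it is in current Python) and uses its Parse(data, isfinal) call.

```python
import xml.parsers.expat


class FeedParser:
    """Hypothetical wrapper exposing feed()/close() on top of expat."""

    def __init__(self, handler):
        self._parser = xml.parsers.expat.ParserCreate()
        # Route expat callbacks to SAX-style handler methods.
        self._parser.StartElementHandler = handler.startElement
        self._parser.EndElementHandler = handler.endElement
        self._parser.CharacterDataHandler = handler.characters

    def feed(self, data):
        # Parse one chunk; more data may follow later.
        self._parser.Parse(data, False)

    def close(self):
        # Tell expat that the document is complete.
        self._parser.Parse("", True)


class Recorder:
    """Hypothetical handler that just records the events it sees."""

    def __init__(self):
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs)))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, data):
        self.events.append(("chars", data))
```

  An event loop could then call feed() whenever a chunk arrives from a
  socket or a file, and close() at end of input, without any threads.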

somewhat important issues
--------------------------------

3. Performance: Is the AttributeList class really necessary?  Wouldn't
   it be enough to use a good ole dictionary?

4. Performance and usability: sgmllib and xmllib currently allow you to
   implement a "static DTD" via start_xxx, end_xxx, and do_xxx methods.
   While this cannot be used to handle all kinds of DTDs, it sure makes
   it easier to implement simple parsers.

   consider:

    def startElement(self, name, attrs):
        # If it's a comic element, save the title and issue
        if name == 'comic':
            self.this_title = attrs.get('title', "")
            self.this_number = attrs.get('number', "")

        # If it's the start of a writer element, note that fact
        elif name == 'writer':
            self.inWriterContent = 1
            self.writerName = ""

    def endElement(self, name):
        if name == 'writer':
            self.inWriterContent = 0
            if self.search_name == self.writerName:
                print 'Found:', self.this_title, self.this_number

   vs.

   def start_comic(self, attrs):
        self.this_title = attrs.get("title", "")
        self.this_number = attrs.get("number", "")

   def start_writer(self, attrs):
        self.inWriterContent = 1
        self.writerName = ""

   def end_writer(self):
        self.inWriterContent = 0
        if self.search_name == self.writerName:
            print 'Found:', self.this_title, self.this_number

   or even:

   def start_comic(self, number="", **attrs):
        self.this_number = number

   (etc)

   This also makes it possible to speed things up (the parser can cache
   the bound methods to minimize the number of lookups and extra
   comparisons)
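
   One way such a dispatcher with cached bound methods might look. This is
   only a sketch: DispatchingHandler, _lookup and ComicHandler are
   hypothetical names, and passing the attributes as keyword arguments
   assumes the attribute names do not clash with "self".

```python
class DispatchingHandler:
    """Hypothetical base class mapping SAX events to start_xxx/end_xxx."""

    def __init__(self):
        # Cache of bound methods (or None if missing), keyed by method name.
        self._cache = {}

    def _lookup(self, method_name):
        try:
            return self._cache[method_name]
        except KeyError:
            # First time we see this tag: look it up once and remember it.
            method = getattr(self, method_name, None)
            self._cache[method_name] = method
            return method

    def startElement(self, name, attrs):
        method = self._lookup("start_" + name)
        if method is not None:
            # Attributes become keyword arguments, so a handler can declare
            # the ones it cares about with default values.
            method(**attrs)

    def endElement(self, name):
        method = self._lookup("end_" + name)
        if method is not None:
            method()


class ComicHandler(DispatchingHandler):
    def __init__(self):
        DispatchingHandler.__init__(self)
        self.this_number = None

    def start_comic(self, number="", **attrs):
        self.this_number = number
```

   Elements without a matching method are silently skipped, and the miss
   is cached too, so unknown tags cost one dictionary lookup each.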

5. Usability: the coreXML parser exposed the internal tag stack used to
   check that elements are properly closed.  The result is that you can
   write things like:

   def startElement(self, name, attrs):
        if self.tags[-2:] == ["comic", "writer"]:
            ...

   which is, IMHO, pretty cool.
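
   A small sketch of how a handler base class could maintain such a stack.
   StackHandler and WriterSpotter are hypothetical names, and the sketch
   assumes the new element is pushed before user code looks at the stack;
   the actual coreXML code may differ.

```python
class StackHandler:
    """Hypothetical base class that tracks the stack of open elements."""

    def __init__(self):
        self.tags = []

    def startElement(self, name, attrs):
        # Push first, so subclasses see the new element on top.
        self.tags.append(name)

    def endElement(self, name):
        # Check that the element being closed is the innermost open one.
        if not self.tags or self.tags[-1] != name:
            raise ValueError("unexpected end tag: %s" % name)
        self.tags.pop()


class WriterSpotter(StackHandler):
    """Counts writer elements that sit directly inside a comic element."""

    def __init__(self):
        StackHandler.__init__(self)
        self.hits = 0

    def startElement(self, name, attrs):
        StackHandler.startElement(self, name, attrs)
        if self.tags[-2:] == ["comic", "writer"]:
            self.hits = self.hits + 1
```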

6. Usability: htmllib (!) provides save_bgn and save_end methods in the
   base class, which implement the self.data = self.data + ... stuff that
   everyone has to write anyway...  should saxlib provide something
   similar?

   def start_writer(self, attrs):
        self.save_bgn()

   def end_writer(self):
        writer = self.save_end()
        ...
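
   For illustration, a minimal sketch of such a base class. SavingHandler
   is a hypothetical name; it borrows only the save_bgn/save_end idea from
   htmllib and collects chunks in a list instead of concatenating strings.

```python
class SavingHandler:
    """Hypothetical base class with htmllib-style save_bgn/save_end."""

    def __init__(self):
        # None means "not collecting"; a list means "collecting".
        self._saved = None

    def save_bgn(self):
        # Start collecting character data.
        self._saved = []

    def save_end(self):
        # Stop collecting and return everything seen since save_bgn().
        chunks = self._saved or []
        self._saved = None
        return "".join(chunks)

    def characters(self, data):
        # Character data outside a save_bgn/save_end pair is dropped.
        if self._saved is not None:
            self._saved.append(data)
```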

7. Should the API be tweaked to adhere to the Python style guidelines?
   That is, should startElement be start_element instead?

        http://www.python.org/doc/essays/styleguide.html

8. Shipping: while it's obvious that saxlib with all drivers and utilities should
   be included in the big everything-in-a-single-package XML add-on, I'm not
   sure everything that could fit into that package should be distributed with
   the Python core (at least if Guido still adheres to the "if I cannot hack it, I
   don't want it in the core" principle).

   But I think saxlib+xmllib+sgmlop should be part of the standard library in
   future releases.  What do you think?

9. Should sgmlop perhaps be renamed to xmlop?

10. May I go home now?

Cheers /F
fredrik@pythonware.com
http://www.pythonware.com