[XML-SIG] Mr. Nitpicker looks at saxlib
Fredrik Lundh
fredrik@pythonware.com
Thu, 28 May 1998 22:58:25 +0100
(sorry, but I got distracted. Darn those paying customers ;-)
Here are some comments on saxlib, based on the HOWTO document,
a quick look at the sources, and some experience from the
sgmlop-based coreXML parser I've written for our RDE and MIOW projects.
(I really should have looked closer at the sources, and read the SAX
spec again, but will probably not get around to doing that before the
weekend... feel free to flame away if I've misunderstood everything)
important issues
-------------------
1. Performance #1: Should the "characters" method really take start/length
arguments?
I suppose this is a direct mapping of the Java SAX spec, but it has one
serious drawback: the string slicing operator copies the string, which
means that you'll end up with an extra string copy when you use fast
parsers like sgmlop and pyexpat:
- parser copies data into a python string
- driver calls "characters" with string, start=0, and length=len(string)
- user-defined class does string[start:start+length], which copies
  the string again
(- the user class does self.data = self.data + string[...], which copies
  the string yet another time. sigh...)
I'd say we might as well get rid of those two arguments, and leave it
to the parser to slice and dice.
Or if you insist, you could at least change start/length to start/end...
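
To make the difference concrete, here is a minimal sketch of the two calling conventions. The class names (OffsetHandler, PlainHandler) and the text() helper are made up for illustration and are not part of saxlib, and the code is written for a current Python 3 interpreter rather than the Python of this post. Both handlers collect the pieces in a list and join once at the end, which also avoids the repeated self.data = self.data + ... pattern.

```python
class OffsetHandler:
    """Hypothetical handler using the (data, start, length) convention."""

    def __init__(self):
        self.pieces = []

    def characters(self, data, start, length):
        # The handler has to slice, even when the driver passes the whole string.
        self.pieces.append(data[start:start + length])

    def text(self):
        return "".join(self.pieces)


class PlainHandler:
    """Hypothetical handler where the driver hands over ready-to-use text."""

    def __init__(self):
        self.pieces = []

    def characters(self, data):
        # Nothing to slice: the parser/driver already did that work.
        self.pieces.append(data)

    def text(self):
        return "".join(self.pieces)


chunks = ["Peter ", "David"]  # stand-ins for what a parser would deliver

offset_handler = OffsetHandler()
plain_handler = PlainHandler()
for chunk in chunks:
    offset_handler.characters(chunk, 0, len(chunk))
    plain_handler.characters(chunk)
```

Both handlers end up with the same text; the second just has one less thing to get wrong.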
2. Usability: There's no "feed" method. While it is perfectly valid to assume
threading for Java, I don't think this is a valid requirement for Python code.
Since sgmlop, xmllib, and pyexpat all support incremental parsing (and since
our stuff is event-driven...), it would be good if saxlib exposed these
methods in some way.
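
A rough sketch of what a "feed"-style driver could look like, assuming a thin wrapper around an incremental parser. It uses xml.parsers.expat from today's standard library as a stand-in for pyexpat; FeedDriver and Recorder are hypothetical names, not saxlib classes, and the handler method signatures are simplified (no start/length).

```python
import xml.parsers.expat


class FeedDriver:
    """Hypothetical driver exposing feed()/close() on top of expat."""

    def __init__(self, handler):
        self._parser = xml.parsers.expat.ParserCreate()
        self._parser.StartElementHandler = handler.startElement
        self._parser.EndElementHandler = handler.endElement
        self._parser.CharacterDataHandler = handler.characters

    def feed(self, data):
        # Second argument False: more data may follow.
        self._parser.Parse(data, False)

    def close(self):
        # An empty final chunk tells expat the document is complete.
        self._parser.Parse("", True)


class Recorder:
    """Hypothetical handler that just records what it is told."""

    def __init__(self):
        self.starts = []
        self.ends = []
        self.text = []

    def startElement(self, name, attrs):
        self.starts.append((name, dict(attrs)))

    def endElement(self, name):
        self.ends.append(name)

    def characters(self, data):
        # Text may arrive in several pieces, so only collect it here.
        self.text.append(data)


recorder = Recorder()
driver = FeedDriver(recorder)
# Data arrives in arbitrary pieces, e.g. from a socket or an event loop.
for part in ["<comic num", "ber='1'><writer>Pet", "er</writer></comic>"]:
    driver.feed(part)
driver.close()
```

The caller decides when data is handed over, so no thread has to block inside the parser.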
somewhat important issues
--------------------------------
3. Performance: Is the AttributeList class really necessary? Wouldn't
it be enough to use a good ole dictionary?
4. Performance and usability: sgmllib and xmllib currently allow you to
implement a "static DTD" via start_xxx, end_xxx, and do_xxx methods.
While this cannot be used to handle all kinds of DTDs, it sure makes
it easier to implement simple parsers.
consider:
    def startElement(self, name, attrs):
        # If it's a comic element, save the title and issue
        if name == 'comic':
            self.this_title = attrs.get('title', "")
            self.this_number = attrs.get('number', "")
        # If it's the start of a writer element, note that fact
        elif name == 'writer':
            self.inWriterContent = 1
            self.writerName = ""

    def endElement(self, name):
        if name == 'writer':
            self.inWriterContent = 0
            if self.search_name == self.writerName:
                print 'Found:', self.this_title, self.this_number
vs.
    def start_comic(self, attrs):
        self.this_title = attrs.get("title", "")
        self.this_number = attrs.get("number", "")

    def start_writer(self, attrs):
        self.inWriterContent = 1
        self.writerName = ""

    def end_writer(self):
        self.inWriterContent = 0
        if self.search_name == self.writerName:
            print 'Found:', self.this_title, self.this_number
or even:
    def start_comic(self, number="", **attrs):
        self.this_number = number
    (etc)
This also makes it possible to speed things up (the parser can cache
the bound methods to minimize the number of lookups and extra
comparisons).
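
For what it's worth, the dispatching and caching could be sketched roughly like this. MethodDispatcher, WriterSearch, the found list, and the characters method are assumptions added for illustration (they are not taken from sgmllib, xmllib, or saxlib), and the code targets a current Python 3 interpreter.

```python
class MethodDispatcher:
    """Hypothetical base class mapping element names to start_xxx/end_xxx."""

    def __init__(self):
        self._methods = {}  # cache: method name -> bound method (or None)

    def _find(self, method_name):
        try:
            return self._methods[method_name]
        except KeyError:
            # Look the method up once, then remember the result,
            # including "no such method" (None).
            method = getattr(self, method_name, None)
            self._methods[method_name] = method
            return method

    def startElement(self, name, attrs):
        method = self._find("start_" + name)
        if method is not None:
            method(attrs)

    def endElement(self, name):
        method = self._find("end_" + name)
        if method is not None:
            method()


class WriterSearch(MethodDispatcher):
    def __init__(self, search_name):
        MethodDispatcher.__init__(self)
        self.search_name = search_name
        self.inWriterContent = 0
        self.writerName = ""
        self.found = []

    def start_comic(self, attrs):
        self.this_title = attrs.get("title", "")
        self.this_number = attrs.get("number", "")

    def start_writer(self, attrs):
        self.inWriterContent = 1
        self.writerName = ""

    def characters(self, data):
        # Assumed here: text inside <writer> is accumulated into writerName.
        if self.inWriterContent:
            self.writerName = self.writerName + data

    def end_writer(self):
        self.inWriterContent = 0
        if self.search_name == self.writerName:
            self.found.append((self.this_title, self.this_number))


# Events fed by hand, standing in for a real parser (made-up sample data).
search = WriterSearch("A. Writer")
search.startElement("comic", {"title": "Example Comic", "number": "1"})
search.startElement("writer", {})
search.characters("A. Writer")
search.endElement("writer")
search.endElement("comic")
```

After the first lookup, each element name costs one dictionary access instead of a chain of string comparisons.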
5. Usability: the coreXML parser exposed the internal tag stack used to
check that elements are properly closed. The result is that you can
write things like:
    def startElement(self, name, attrs):
        if self.tags[-2:] == ["comic", "writer"]:
            ...
which is, IMHO, pretty cool.
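
A minimal sketch of how a driver could maintain such a stack, assuming the current element is pushed before startElement is called. TagStackDriver and WriterSpotter are hypothetical names, and this is not the coreXML code itself; it is written for a current Python 3 interpreter.

```python
class TagStackDriver:
    """Hypothetical driver that tracks open elements in a shared list."""

    def __init__(self, handler):
        self.handler = handler
        self.tags = []
        handler.tags = self.tags  # the handler sees the same list object

    def start(self, name, attrs):
        self.tags.append(name)
        self.handler.startElement(name, attrs)

    def end(self, name):
        # The same stack doubles as the well-formedness check.
        if not self.tags or self.tags[-1] != name:
            raise ValueError("unexpected end tag: " + name)
        self.handler.endElement(name)
        self.tags.pop()


class WriterSpotter:
    """Hypothetical handler counting <writer> directly inside <comic>."""

    def __init__(self):
        self.hits = 0

    def startElement(self, name, attrs):
        if self.tags[-2:] == ["comic", "writer"]:
            self.hits = self.hits + 1

    def endElement(self, name):
        pass


spotter = WriterSpotter()
driver = TagStackDriver(spotter)
driver.start("collection", {})
driver.start("comic", {})
driver.start("writer", {})
driver.end("writer")
driver.end("comic")
driver.start("writer", {})  # a writer outside any comic: not counted
driver.end("writer")
driver.end("collection")
```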
6. Usability: htmllib (!) provides save_bgn and save_end methods in the
base class, which implement that self.data = self.data + ... stuff that
everyone has to implement anyway... should saxlib provide something
similar?
    def start_writer(self, attrs):
        self.save_bgn()

    def end_writer(self):
        writer = self.save_end()
        ...
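
A possible shape for such helpers, loosely modelled on the htmllib idea but not copied from it. SavingHandler and WriterCollector are hypothetical names, the characters signature is the simplified one-argument form, and the code targets a current Python 3 interpreter.

```python
class SavingHandler:
    """Hypothetical base class offering save_bgn()/save_end()."""

    def __init__(self):
        self._saved = None  # None means "not currently saving"

    def save_bgn(self):
        # Start collecting character data.
        self._saved = []

    def save_end(self):
        # Stop collecting and return everything seen since save_bgn().
        pieces, self._saved = self._saved, None
        return "".join(pieces or [])

    def characters(self, data):
        if self._saved is not None:
            self._saved.append(data)


class WriterCollector(SavingHandler):
    def __init__(self):
        SavingHandler.__init__(self)
        self.writers = []

    def start_writer(self, attrs):
        self.save_bgn()

    def end_writer(self):
        self.writers.append(self.save_end())


# Events fed by hand, standing in for a real parser (made-up sample data).
collector = WriterCollector()
collector.characters("ignored, nothing is being saved yet")
collector.start_writer({})
collector.characters("A. ")
collector.characters("Writer")
collector.end_writer()
```

Collecting into a list and joining once also avoids the repeated string copies complained about in item 1.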
7. Should the API be tweaked to adhere to the Python style guidelines?
That is, should startElement be start_element instead?
http://www.python.org/doc/essays/styleguide.html
8. Shipping. While it's obvious that saxlib with all drivers and utilities should
be included in the big everything-in-a-single-package XML add-on, I'm not
sure everything that could fit into that package should be distributed with
the Python core (at least if Guido still adheres to the "if I cannot hack it, I
don't want it in the core" principle).
But I think saxlib+xmllib+sgmlop should be part of the standard library in
future releases. What do you think?
9. Should sgmlop perhaps be renamed to xmlop?
10. May I go home now?
Cheers /F
fredrik@pythonware.com
http://www.pythonware.com