[XML-SIG] Re: Mr. Nitpicker looks at saxlib

Lars Marius Garshol larsga@ifi.uio.no
29 May 1998 00:14:22 +0200


* Fredrik Lundh
| 
| feel free to flame away if I've misunderstood everything

*straps on flamethrower*

| 1. Performance #1: Should the "characters" method really take start/length
|   arguments?
| 
|   I suppose this is a direct mapping of the Java SAX spec, 

It is.

|   but it has one serious drawback: the string slicing operator
|   copies the string, 

Just a thought: should that be changed, since strings are supposed to
be mutable anyway and this is such a common operation?

|   which means that you'll end up with an extra string copy when you
|   use fast parsers like sgmlop and pyexpat

However, it has the advantage that parsers like xmlproc don't have to
do that extra string copy, since what xmlproc gives you is a
restricted view of the internal xmlproc data buffer. I timed the
parser before and after I implemented this and the speed increase was
significant.

Is it impossible to do something similar in sgmlop to avoid this
overhead?

We should also think about where speed is most critical: in slow
Python parsers or fast C parsers...

|   Or if you insist, you could at least change start/length to
|   start/end...

Well, that means JPython integration goes down the drain and
introduces a subtle, but important difference in the API that's likely
to cause major pain.

I should know. In the first SAX translation I got this backwards in
the xmlproc driver, but correctly in the others. Nobody ever
complained, but it bit me, and it took me a while to figure out where
the problem was.

IMHO, having two so similar APIs when many people are likely to be
using both is just asking for trouble.
 
| 2. Usability: There's no "feed" method.  While it is perfectly valid
|   to assume threading for Java, I don't think this is a valid
|   requirement for Python code.  Since sgmlop, xmllib, and pyexpat
|   all support incremental parsing (and since our stuff is
|   event-driven...), it would be good if saxlib exposed these methods
|   in some way.

Well, the only parsers that can't support this are nsgmls wrappers
(which will happen at some point) and XML-Toolkit. However, adding it
means extending SAX and knowing that some parsers will not support it.

I'm thinking of extending the Parser interface with some more methods
that are not part of SAX 1.0 anyway, so perhaps we can do this in a
more controlled fasion. My plan was to keep saxlib pure, but to add a
number of optional methods in a subclass of saxlib.Parser in saxexts
and implement these in all parser drivers.

How about this:

# --- Experimental extension to Parser interface

class ExtendedParser(saxlib.Parser):
    "Experimental unofficial SAX level 2 extended parser interface."

    def get_parser_name(self):
        "Returns a single-word parser name."
        raise saxlib.SAXException("Method not supported.",None)

    def get_parser_version(self):
        """Returns the version of the imported parser, which may not be the
        one the driver was implemented for."""
        raise saxlib.SAXException("Method not supported.",None)

    def get_driver_version(self):
        "Returns the version number of the driver."
        raise saxlib.SAXException("Method not supported.",None)        
    
    def is_validating(self):
        "True if the parser is validating, false otherwise."
        raise saxlib.SAXException("Method not supported.",None)

    def is_dtd_reading(self):
        """True if the parser is non-validating, but conforms to the spec by
        reading the DTD."""
        raise saxlib.SAXException("Method not supported.",None)

    def reset(self):
        "Makes the parser start parsing afresh."
        raise saxlib.SAXException("Method not supported.",None)
    
    def feed(self,data):
        "Feeds data to the parser."
        raise saxlib.SAXException("Method not supported.",None)  

    def get_stack(self):
        "Returns the current element stack."
        raise saxlib.SAXException("Method not supported.",None)        

If we decide to do this I'll write a more formal specification for
this interface.

I'm also thinking that ParserFactory should have four lists of
parsers: the current list, a list of HTML parsers, a list of SGML
parsers and a list of extended SAX drivers.
 
| 3. Performance: Is the AttributeList class really necessary?  Wouldn't
|    it be enough to use a good ole dictionary?

Using dictionaries would mean losing attribute type information. This
is important to be able to identify the different attribute types. DOM
will have to do this and many other kinds of system as well.

At present, only xmlproc supports this, but by the sound of it Pyexpat
will also be validating at some point.

Also, the current Python version of AttributeList has the added
advantage that it can be used like this:

print "<%s" % name
for attr in attrs:
    print " %s=%s" % (attr,attrs[attr])

Then there's JPython integration and all that. IMHO AttributeList
should stay. Those who want all-out maximum overdrive __raw__ speed
above all else should not use SAX (or even Python) anyway.
 
| 4. Performance and usability: sgmllib and xmllib currently allows
|    you to implement a "static DTD" via start_xxx, end_xxx, and
|    do_xxx methods.  While this cannot be used to handle all kinds of
|    DTD's, it sure makes it easier to implement simple parsers.

There is of course very good sense in this. I think the best way to go
about this is to add it as a separate layer on top of SAX and not as a
part of the SAX interface required to be implemented by parsers. 

That would slow us down, and many important SAX users like DOM (and my
sax2obj) would never use it at all. I think the best solution would be
to make a DispatchDocHandler subclass of DocumentHandler and let those
who want to use this instead.
 
|    This also makes it possible to speed things up (the parser can
|    cache the bound methods to minimize the number of lookups and
|    extra comparisions)
 
I agree, but this shouldn't be the responsibility of parser drivers,
but rather of a single class. DispatchDocHandler can very well be
implemented in this way.

| 5. Usability: the coreXML parser exposed the internal tag stack used
|    to check that elements are properly closed.

This was cool, I agree, and much more so in Python than in Java. All
parsers must keep this information anyway, and in special cases like
the nsgmls wrapper where it is not available the driver can keep track
of the stack behind the scenes.
 
| 6. Usability: htmllib (!) provides save_bgn and save_end methods in
|    the baseclass which implements that self.data = self.data +
|    ... stuff that everyone has to implement anyway...  should saxlib
|    provide something similar?

Well, this could be implemented in the extended drivers and might have
some advantages, but I personally don't really want this feature. Any
other opinions?

| 7. Should the API be tweaked to adhere to the Python style
|    guidelines?

IMHO: no. It's too late now. I've got lots of code that uses this
style, you've got code that in it, the DOM has it, the tutorial uses
it, Paul Prescod has used it and probably many others.

saxlib has been downloaded 85 times in the past month so there's
probably quite a bit of code built on it already. I'm sorry, but I
think this particular train has left.

| 8. Shipping. [...] I think saxlib+xmllib+sgmlop should be part of
|    the standard library in future releases.  What do you think?

OK for me. :-)
 
| 9. Should sgmlop perhaps be renamed to xmlop?

XML is SGML, so if you want to support both I think you should keep
the name. If you decide to skip SGML support altogether a name change
would be more appropriate.
 
| 10. May I go home now?

Certainly. *puts away flamethrower*

Thank you ever so much for the feedback. I've wanted feedback on most
of the issues you raise here, but never got it. I'm glad to see it
come now, even if it is a little late.

-- 
"These are, as I began, cumbersome ways / to kill a man. Simpler, direct, 
and much more neat / is to see that he is living somewhere in the middle /
of the twentieth century, and leave him there."     -- Edwin Brock

 http://www.stud.ifi.uio.no/~larsga/      http://birk105.studby.uio.no/