[XML-SIG] SAX Namespaces

Paul Prescod paul@prescod.net
Thu, 06 Jul 2000 14:53:17 -0500


"Fred L. Drake, Jr." wrote:
> 
> Greg Stein wrote:
>  >   iv)  {(URI, localname) : (qname, value), ...}
>  >
>  > Using (iv) means that the passed attribute dictionary is immediately usable.
>  > The other forms require some initial processing, yet provide no value-add.
> 
> Paul Prescod writes:
>  > It is immediately usable as a dictionary, but it must be converted to a
>  > list for apps that want to iterate over attributes. Examples include
> ...
>  > 1. Many (most?) apps turn the dictionary into a list immediately.
> 
>   An unexpected observation!  When we were working on Grail, the lists
> of attributes returned by sgmllib/htmllib were a substantial
> nuisance, and we *really* wanted dictionaries.  

Agreed. I've always said that dictionary-based lookup is important and
must be provided.

To get that, you would use code like this:

def startElement(self, name, attrs):
	# Wrap the incoming attributes in the proposed AttributeList class
	# to get name-based lookup.
	attrs = sax.AttributeList(attrs)
	a = attrs["abc"]
	b = attrs["def"]

The question is not which mode should be available, but which should be
default.

> The problem of
> looping over the attributes to get the ones we wanted was sufficient
> to fork the modules from the standard library and create the code
> that's in the later versions of Grail (see the grail/src/sgml
> directory in the Grail CVS tree at SourceForge), which was *much*
> easier to work with.

I would have suggested you use a wrapper approach rather than forking
the codebase!

>   Perhaps there's a split here between general tools that work on
> arbitrary XML and "applications" that don't care about the XML but
> only need to extract the information to solve some specific problem?
> That actually seems fairly likely to me, on first thought.

I agree. We need both list-indexing and name-based indexing. The
question is which should be default and which should require a method
call or object construction. In the pre-namespace world I was so much
in favor of name-based indexing that I actually changed PyExpat to use
dictionaries myself. 

But in the post-namespace world, it isn't clear what to index upon,
because it depends on what the application is interested in. A lot will
care about localname/URI pairs. A lot will care about rawnames. A few
(e.g. search engines) may want lists of attributes with a particular
namespace.
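
To make the three cases concrete, here is a rough sketch of how one
attribute list could be indexed each way. The tuple layout
(uri, localname, qname, value) and all of the variable names are my own
assumptions for illustration, not part of any proposed SAX interface.

```python
# Assumed layout for illustration: (uri, localname, qname, value).
XLINK = "http://www.w3.org/1999/xlink"

attrs = [
    (XLINK, "href", "xlink:href", "chapter2.xml"),
    (XLINK, "type", "xlink:type", "simple"),
    (None, "id", "id", "c2"),
]

# Index by URI/localname pair, for namespace-aware applications.
by_pair = {(uri, local): value for uri, local, qname, value in attrs}

# Index by rawname (qname), for applications that ignore namespaces.
by_qname = {qname: value for uri, local, qname, value in attrs}

# List every attribute in one namespace, as a search engine might.
in_xlink = [(local, value) for uri, local, qname, value in attrs if uri == XLINK]
```

None of the three is the obviously "right" default, and each one is a
single pass over the list.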

> The usage pattern we observed in Grail was that we'd set up default
> values in locals, loop over the attributes list to set up locals, and
> then use the locals while doing whatever we needed to do.  It was a
> real pain if we needed to branch on one attribute and then only use
> some others in one branch or another; we still had to loop and extract
> first, and then do the application work.  We couldn't branch on the
> one that mattered, and then get the others only as needed.  Unless we
> looped more than once, which is heinous.

I would encapsulate this behavior in the wrapper class. That's what I
meant by "lazy indexing." You ask for one attribute and an internal
dictionary "remembers" where it found it. You ask for another and it
remembers where it found that. If you only ask by URI/localname pair,
then you don't incur the cost of indexing by qname, and if you only ask
by qname, then you don't incur the cost of indexing by pair.
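
A minimal sketch of that lazy indexing, assuming the parser hands over
a list of (uri, localname, qname, value) tuples. The class name
LazyAttributes and its method names are hypothetical, chosen only for
this example; this is not the AttributeList class itself.

```python
class LazyAttributes:
    """Hypothetical wrapper over (uri, localname, qname, value) tuples."""

    def __init__(self, items):
        self._items = list(items)
        self._by_pair = {}    # (uri, localname) -> value, filled on demand
        self._by_qname = {}   # qname -> value, filled on demand
        self._pair_pos = 0    # how far the pair index has scanned
        self._qname_pos = 0   # how far the qname index has scanned

    def get_by_pair(self, uri, localname):
        key = (uri, localname)
        # Scan only as far as needed, remembering what we pass on the way.
        while key not in self._by_pair and self._pair_pos < len(self._items):
            u, local, _qname, value = self._items[self._pair_pos]
            self._by_pair[(u, local)] = value
            self._pair_pos += 1
        return self._by_pair[key]  # KeyError if the attribute is absent

    def get_by_qname(self, qname):
        while qname not in self._by_qname and self._qname_pos < len(self._items):
            _u, _local, q, value = self._items[self._qname_pos]
            self._by_qname[q] = value
            self._qname_pos += 1
        return self._by_qname[qname]  # KeyError if the attribute is absent

    def __iter__(self):
        # List-style access costs nothing extra: no index is built.
        return iter(self._items)


attrs = LazyAttributes([
    ("http://example.org/ns", "abc", "x:abc", "1"),
    (None, "def", "def", "2"),
])
first = attrs.get_by_pair("http://example.org/ns", "abc")
```

After that single lookup the qname index is still empty, and an
application that only iterates never builds either dictionary.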

I haven't benchmarked this or any strategy. It depends on the
application. If you find that attributes are slow in your application,
you could benchmark and replace the AttributeList class with something
that is more appropriate for it.

> Lists of
> attributes seem really hard to work with.

It depends on what you are trying to do. There are vast classes of
applications that I would expect to use the AttributeList class. I'm
just trying to allow applications to NOT use it if they don't want it.

Here is one way of looking at it. Let's say that there are four popular
APIs out there:

SAX
DOM
Pyxie
QP_xml

Let's say an equal number of people use all four. Finally, let's say
that all but SAX are built on SAX. 

If this is the case, then 75% of all Python SAX users are going to be
using APIs that immediately copy attributes into a list (which all of
the non-SAX APIs do today) and 25% are going to be using the
dictionary structure directly. And even some subset of the 25% may find
that the dictionary is sub-optimal because it is indexed based on the
wrong property or properties.

-- 
 Paul Prescod - Not encumbered by corporate consensus
The distinction between the real twentieth century (1914-1999) and the
calendrical one (1900-2000) is based on the convincing idea that the 
century's bouts of unprecedented violence, both within nations and between 
them, possess a definite historical coherence -- that they constitute, to 
put it simply, a single story. 
	- The Unfinished Twentieth Century, Jonathan Schell 
		Harper's Magazine, January 2000