[XML-SIG] Expat strategy

Paul Prescod paul@prescod.net
Mon, 31 Jan 2000 02:46:59 -0600


Jack Jansen wrote:
> 
> A suggestion to whoever is going to implement this: if we're going to
> include a private version of expat it's probably a good idea to change
> all the C global symbols. Expat is pretty popular, and I've been
> bitten a few times by global symbol name clashes where Python used one
> version of a library and a package used (or embedded in an application
> in which Python was also embedded) had incorporated a different version.

1. Exports

There was some debate about what would happen if we statically linked
pyexpat to xmlparse.dll. I am confident that we could, on most
reasonable platforms, export only the symbols Python needs to bootstrap
the module and not all of Expat's global symbols. It is routine on Windows to
statically link to a C library without worrying about conflicts with
"open". 

Perl's expat.dll exports exactly two names: boot_XML__Parser__Expat and
_boot_XML__Parser__Expat. BTW, it's 112K.

Anyhow, I count 49 exported symbols and all of them begin with the
prefix XML_ so they can be safely renamed with 49 #defines if we decide
it is necessary. That's ugly but safe and effective.
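
As a rough sketch of the #define approach, the rename header could be
generated rather than typed by hand. The prefix PyExpat_ and the helper
rename_header are made up for illustration, and the list below is only a
handful of Expat's entry points standing in for the full set of 49:

```python
# Hypothetical private prefix; nothing in this thread settles on a name.
PRIVATE_PREFIX = "PyExpat_"

# A few real Expat entry points, standing in for the full exported list.
EXPORTED = [
    "XML_ParserCreate",
    "XML_ParserFree",
    "XML_Parse",
    "XML_SetElementHandler",
    "XML_SetCharacterDataHandler",
    "XML_ErrorString",
]


def rename_header(symbols, prefix=PRIVATE_PREFIX):
    """Return the text of a C header that renames each symbol via #define."""
    lines = ["/* Generated: map public Expat names to private ones. */"]
    for name in symbols:
        # Every exported Expat symbol is expected to start with XML_.
        if not name.startswith("XML_"):
            raise ValueError("unexpected symbol name: %s" % name)
        lines.append("#define %s %s%s" % (name, prefix, name))
    return "\n".join(lines) + "\n"


header = rename_header(EXPORTED)
print(header)
```

The generated header would have to be included before Expat's own headers
in every file of the private copy, so that definitions and callers are
renamed together.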

2. API

We had talked of embedding SAX directly in PyExpat but in retrospect I
don't think that there is any need to do so. We can layer SAX 1 and 2 on
top of a transliterated Expat API without any loss of performance. This
is true because of Expat's handler architecture. Even if you layer
xmllib on top of sax 1 on top of another implementation of xmllib on top
of another layer of sax 2 on top of expat, you get high performance if
the "handler" is the same method at all levels. In other words, we can
"wrap" expat at the Python level without doing any proxying of events.

I'm only mentioning xmllib to emphasize the point that the number of
layers doesn't matter because you don't lose performance in the layers.
I'm not proposing that we layer xmllib on top of Expat.

If you pass a method "foo" to xmllib as finish_starttag and it passes it
to sax 2 as SAX2_StartElement and it passes it to SAX1 as
SAX1_StartElement which passes it to Expat as XML_SetElementHandler, you
still only get one Python function call per element in the document.

So let's expose the raw Expat API and build SAX 1 and SAX 2 layers on
top of it. 
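
The layering argument can be sketched as follows. The three wrapper
classes are hypothetical and exist only to show a handler being handed
down unchanged. The example is written for a current Python, where the
module is spelled xml.parsers.expat rather than xml.parsers.pyexpat:

```python
from xml.parsers import expat


class ExpatLayer:
    """Hypothetical thin transliteration of the Expat API."""

    def __init__(self):
        self._parser = expat.ParserCreate()

    def set_element_handler(self, start):
        # The callable is installed directly on the parser, with no proxy.
        self._parser.StartElementHandler = start

    def parse(self, data):
        self._parser.Parse(data, True)


class Sax1Layer:
    """Hypothetical SAX 1 style layer that just forwards the handler."""

    def __init__(self):
        self._lower = ExpatLayer()

    def set_start_element(self, handler):
        self._lower.set_element_handler(handler)

    def parse(self, data):
        self._lower.parse(data)


class Sax2Layer:
    """Hypothetical SAX 2 style layer on top of the SAX 1 layer."""

    def __init__(self):
        self._lower = Sax1Layer()

    def set_start_element(self, handler):
        self._lower.set_start_element(handler)

    def parse(self, data):
        self._lower.parse(data)


calls = []


def foo(name, attrs):
    # Called by the parser itself; none of the layers sit in between.
    calls.append(name)


top = Sax2Layer()
top.set_start_element(foo)
top.parse("<abc><def></def></abc>")
print(calls)
```

The layers are only involved when the handler is registered. During the
parse, foo is called once per start tag no matter how many layers were
stacked.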

3. Error handling

PyExpat is one of a very few modules in the library to use setjmp. It
uses it for error handling and I'm not sure if there is any way around
it so I won't advocate its removal unless someone can propose a better
way.  I'm not clear how to signal to expat that it should quit parsing
other than through setjmp/longjmp.

In general, though, error handling doesn't seem to work for me:

>>> from xml.parsers.pyexpat import ParserCreate, ErrorString
>>> p=ParserCreate()
>>> p.foo="abc"
Traceback (innermost last):
  File "<stdin>", line 1, in ?
SystemError: error return without exception set
>>> p.StartElementHandler=junk
>>> p.Parse( "<a></a>" )
0
>>> from xml.parsers.pyexpat import ParserCreate, ErrorString
>>> p=ParserCreate()
>>> def junk2(a,b):
...     print a,b
...     assert 0
...
>>> p.StartElementHandler=junk2
>>> print p.Parse( "<abc><def></def></abc>", 1 )
abc []
def []
1

Errors in the Python handlers do not abort the parse as they should,
despite the setjmp/longjmp. I am guessing that this is because the call
goes across Windows DLL boundaries. If that's really all it is, then it
will work better once we statically link to expat. I'd still rather not
use setjmp/longjmp if there were a way around it. Here is what the
CharacterDataHandler wrapper does when the Python call fails:

if (rv == NULL) {
	if (self->jmpbuf_valid)
		longjmp(self->jmpbuf, 1);
	My_WriteStderr("Exception in CharacterDataHandler()\n");
	PyErr_Clear();
}

One funny thing is the code after the longjmp. I guess maybe it's a
fallback for when the longjmp doesn't work. It doesn't seem to work on
Windows, though.
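
For comparison, here is a small check of the behaviour this section is
asking for: an exception raised inside a handler should come back out of
Parse instead of being dropped. It is a sketch for a current CPython,
where the module is xml.parsers.expat, and it says nothing about how the
abort is implemented underneath:

```python
from xml.parsers import expat

seen = []


def junk2(name, attrs):
    # Record the element, then fail on purpose inside the handler.
    seen.append(name)
    assert 0, "deliberate failure inside a handler"


p = expat.ParserCreate()
p.StartElementHandler = junk2

try:
    p.Parse("<abc><def></def></abc>", True)
    aborted = False
except AssertionError:
    # The handler's exception reached the caller of Parse.
    aborted = True

print(seen, aborted)
```

Run against the session shown earlier, the same handler printed both
elements and Parse returned 1, so the failure was silently lost.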

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
The new revolutionaries believe the time has come for an aggressive 
move against our oppressors. We have established a solid beachhead 
on Friday. We now intend to fight vigorously for 'casual Thursdays.'
  -- who says America's revolutionary spirit is dead?