[Expat-discuss] Pull API Status?

Tue Mar 11 12:14:06 EST 2003

> > No one is currently working on it, as we plan to release 
> > Expat 2.0 first.
> 
> This is understandable.  I didn't read the roadmap closely enough, and was
> going on the August posts referring to suspension perhaps being part of
> 1.96.xxx

Yes, our first look at this was triggered by the fact that Mozilla uses
an older version of Expat with such modifications. But after thinking some
more about it, we concluded (or at least I did) that this is not the way to go.

Also, if you look at the code check-ins in the last year or so, you will
find that we haven't had a lot of manpower available.
It's mostly just Fred and me.

> > - Expat is already Pull based internally. So a lot of code 
> > can be re-used.
> >   One does not need to completely re-implement the layer on 
> > top of xmltok.
> > - The main things to change are:
> >   - instead of "pushing" buffers (with XML_ParseBuffer), have the main
> >     parsing loop pull buffers with an XML_GetNextBuffer callback.
> 
> Just to get my terminology straight... By "main parsing loop" you mean any
> of the prologProcessor, contentProcessor, externalEntityProcessor family of
> functions, yes?

Yes, basically doProlog and doContent. Those "processors" above are
basically used to call the parsing loop based on the parser's state.
The input can come in chunks of any size, so part of the state
is stored in a "processor" function pointer, to avoid too many
conditional checks. That is a heavily used approach in Expat.
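
To illustrate the function-pointer idea, here is a rough, self-contained
sketch. It is not Expat's code: MiniParser, miniPrologProcessor and the
other names are made up for this example, and the "prolog" is crudely
defined as everything up to the first '>'. The point is only that the
current state is the function pointer itself, so each incoming chunk is
dispatched without testing a state variable.

```c
#include <stddef.h>

/* Hypothetical miniature of the "processor" idea; none of these names
   exist in Expat. */
typedef struct MiniParser MiniParser;
typedef int (*Processor)(MiniParser *p, const char *start, const char *end);

struct MiniParser {
  Processor processor;   /* the current state is this function pointer */
  int prologBytes;
  int contentBytes;
};

static int miniContentProcessor(MiniParser *p, const char *start,
                                const char *end)
{
  p->contentBytes += (int)(end - start);
  return 0;
}

static int miniPrologProcessor(MiniParser *p, const char *start,
                               const char *end)
{
  const char *s = start;
  /* Treat everything up to and including the first '>' as prolog. */
  while (s < end && *s != '>')
    s++;
  if (s == end) {
    /* Chunk ended inside the prolog: stay in this state. */
    p->prologBytes += (int)(end - start);
    return 0;
  }
  s++;
  p->prologBytes += (int)(s - start);
  p->processor = miniContentProcessor;   /* state transition */
  return p->processor(p, s, end);        /* hand over the rest of the chunk */
}

static void miniInit(MiniParser *p)
{
  p->processor = miniPrologProcessor;
  p->prologBytes = 0;
  p->contentBytes = 0;
}

/* Entry point: one indirect call, no switch on a state variable. */
static int miniParse(MiniParser *p, const char *buf, size_t len)
{
  return p->processor(p, buf, buf + len);
}
```

Feeding "<?xml" and then "?><a/>" leaves the parser in the content state,
wherever the chunk boundary happened to fall.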

> So, contentProcessor (or doContent) would change to call the
> XML_GetNextBuffer callback whenever they needed more bytes.

Yes.

> Pull mode would work by calling XML_NextNode() (or whatever we name it) and
> would -never- call XML_Parse or XML_ParseBuffer.

Correct.

> Push mode would continue to work in that XML_Parse and XML_ParseBuffer would
> provide pointers and lengths to the main parser structure just like they do
> today, and the absence of an XML_GetNextBuffer callback would ensure that
> control returns to XML_Parse callers just like it does now. 

How we do this in detail is not settled yet.
One could simply have different "processors" (without the XML_GetNextBuffer)
called from XML_Parse(Buffer). It might even be possible to mix
push and pull operation, but I am not sure; I haven't really thought it through.

Or, one could, for instance, imagine that we leave the XML_GetBuffer callback,
and simply call the XML_NextNode function, with defaults set to skip all nodes,
and instead of the internal callbacks, the programmer supplies his/her
own (just like today), with one change: there will still be return codes,
and if the programmer wants to terminate parsing prematurely he/she just
returns XML_STOP or XML_ERROR, or some other code.
This would give us the additional benefit of enabling C programmers
to terminate parsing without having to resort to setjmp/longjmp.

In simpler words, push mode could be achieved by simply having the
programmer supply his/her own callbacks instead of the internal ones,
and by defaulting to XML_SKIP.
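
A minimal sketch of what handler return codes could look like. Everything
here is hypothetical: the enum reuses the names floated in this thread
(XML_SKIP, XML_USE, XML_STOP, XML_ERROR), but none of them exist in any
released Expat, and runLoop is only a stand-in for the parsing loop that
walks a list of element names instead of tokenizing XML.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical return codes, named after the ones proposed in this thread. */
enum HandlerResult { XML_SKIP, XML_USE, XML_STOP, XML_ERROR };

typedef enum HandlerResult (*StartHandler)(void *userData, const char *name);

/* Stand-in for the parsing loop. Returns the number of elements visited,
   or -1 if a handler returned XML_ERROR. */
static int runLoop(const char *const *names, size_t count,
                   StartHandler handler, void *userData)
{
  size_t i;
  for (i = 0; i < count; i++) {
    /* No handler installed: behave as if it had returned XML_SKIP. */
    enum HandlerResult r = handler ? handler(userData, names[i]) : XML_SKIP;
    if (r == XML_STOP)
      return (int)(i + 1);   /* orderly early exit, no setjmp/longjmp */
    if (r == XML_ERROR)
      return -1;
  }
  return (int)count;
}

/* Example handler: count elements and stop once "title" has been seen. */
static enum HandlerResult stopAtTitle(void *userData, const char *name)
{
  int *seen = (int *)userData;
  (*seen)++;
  return strcmp(name, "title") == 0 ? XML_STOP : XML_SKIP;
}
```

The early exit is just a normal return path through the loop, so the parser
gets a chance to leave its own state consistent.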

One thing: this push API would not be compatible with the current one.
Therefore this should be part of a branch that builds up to Expat 3.0.

> >   - add return codes to all the callbacks (like XML_SKIP, 
> > XML_USE, XML_ERROR, ...)
> >   - supply internal callbacks which perform the PULL API specific
> >     data preparation and also do any required filtering
> 
> Return codes seem to be the cleanest way to do this, especially for an Expat
> 2.1 or 3.0 release.  
> However, should we preserve compatibility with "push-era" callbacks better
> by providing a XML_SetNodeHandling() function to be used in a callback where
> the user would send the XML_SKIP, XML_USE flags appropriately?
> 
> The default could always be "XML_SKIP".  Callbacks can continue to have void
> return type and for push mode users, the main loop would continue until the
> buffer is exhausted, just like today.

Well, IMO, if we break from the old API anyway, we should do it properly.
However, if we can keep the old API completely the same, then it would make
sense to have something like XML_SetNodeHandling().
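
For comparison, a rough sketch of the setter-based alternative. All names
are made up (DemoParser, demoSetNodeHandling, demoReportStart); the only
thing taken from the proposal is the idea that callbacks keep their void
return type and call a function on the parser to flag the current node,
with XML_SKIP as the default.

```c
#include <stddef.h>

enum NodeHandling { XML_SKIP, XML_USE };   /* hypothetical flags */

struct DemoParser {
  enum NodeHandling handling;   /* reset before every callback */
  void (*startHandler)(struct DemoParser *parser, const char *name);
};

/* Hypothetical setter in the spirit of the proposed XML_SetNodeHandling(). */
static void demoSetNodeHandling(struct DemoParser *parser,
                                enum NodeHandling handling)
{
  parser->handling = handling;
}

/* New-style callback: still returns void, but flags the node as wanted. */
static void useEverything(struct DemoParser *parser, const char *name)
{
  (void)name;
  demoSetNodeHandling(parser, XML_USE);
}

/* Push-era callback: unchanged, never calls the setter. */
static void legacyHandler(struct DemoParser *parser, const char *name)
{
  (void)parser;
  (void)name;
}

/* Returns 1 if the node would be handed to the pull caller, else 0. */
static int demoReportStart(struct DemoParser *parser, const char *name)
{
  parser->handling = XML_SKIP;   /* default when the callback says nothing */
  if (parser->startHandler)
    parser->startHandler(parser, name);
  return parser->handling == XML_USE;
}
```

Old handlers compile and behave as before, at the price of an extra piece of
per-node state inside the parser.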

But since I have already had some feedback on the namespace processing
API changes, it does not look like the old API will survive completely
unmodified, and so I would say we should go with return codes.
Depends on what kind of feedback we will have, of course.

> > In addition we would also want to improve the API with regards to
> > complete entity reporting (currently the same restrictions as SAX2)
> > and namespace reporting (it seems better to return names as separate
> > localName, prefix and uri parameters).
> 
> My project does not have complex namespace reporting needs, but I would
> certainly want any new pull api functions to provide as much namespace
> support as the rest of Expat.

What I was aiming at was that the way qualified names are reported
is cumbersome to parse and extra work to do within Expat.
Also, the SAX2 API has a separation into localName, QName and uri,
which is confusing. E.g., what is the value of localName if ns-processing
is turned on, but the name has no prefix? etc. We would want to
avoid that.
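
To show what "cumbersome to parse" means: with namespace processing on,
Expat hands the handler a single string, the URI and local name joined by
the separator character given to XML_ParserCreateNS (with the prefix
appended as a third part when XML_SetReturnNSTriplet is enabled). The
application then has to take it apart again, roughly like this. The helper
and struct names are made up for the example; this is not part of Expat.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: uri and prefix are NULL when absent. */
struct SplitName {
  const char *uri;
  const char *local;
  const char *prefix;
};

/* Splits "local", "uri<sep>local" or "uri<sep>local<sep>prefix" in place
   by overwriting the separators with '\0'. Hypothetical helper. */
static void splitName(char *name, char sep, struct SplitName *out)
{
  char *first = strchr(name, sep);
  char *second;

  out->uri = NULL;
  out->prefix = NULL;
  out->local = name;
  if (first == NULL)
    return;                     /* name is not in any namespace */

  *first = '\0';
  out->uri = name;
  out->local = first + 1;

  second = strchr(first + 1, sep);
  if (second != NULL) {
    *second = '\0';
    out->prefix = second + 1;   /* only present in triplet mode */
  }
}
```

Reporting the three parts as separate parameters would make this step
unnecessary on the application side.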

> > I think if we want your API re-usable we should put a lot of 
> > thought into it.
> 
> I agree, especially with regards to namespace handling.  However, I would
> like to scope out a level of effort based on the following first phase
> requirements:
> 
> 1.  Provide pull parsing that can handle element, attribute, and text nodes
> for well formed XML documents.  Use a minimal subset of the Java API at
> www.xmlpull.org as a guideline Sample API.  Subset of this API is included
> at the bottom of the email.

Yes, existing efforts can serve as a starting point; why re-invent the wheel?

> 2.  The real assessment is this - how extensive are the changes to the main
> parsing loop(s)?  They have to :
> A.  Handle pulling buffers from users when needed.  If this logic
> can stay "outside" in XML_Next(), it will be helpful, I would imagine.

This might be something to do just before doContent is called,
in the various "processors", like for instance:

static enum XML_Error PTRCALL
contentProcessor(XML_Parser parser,
                 const char *start,
                 const char *end,
                 const char **endPtr)
{
  enum XML_Error result =
    doContent(parser, 0, encoding, start, end, endPtr);
  if (result != XML_ERROR_NONE)
    return result;
  if (!storeRawNames(parser))
    return XML_ERROR_NO_MEMORY;
  return result;
}

replacing the doContent call with something like

while (XML_GetNextBuffer(start, end, endPtr)) {
  enum XML_Error result =
    doContent(parser, 0, encoding, start, end, endPtr);
  if (result != XML_ERROR_NONE)
    return result;
}

Haven't looked at this closely, so this may not be a good approach,
but you get the idea.
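
A slightly fuller, self-contained version of the same idea. All names are
hypothetical: getNextBuffer stands in for the proposed XML_GetNextBuffer
callback, and demoDoContent is a dummy that only counts '<' characters.
One detail the fragment above glosses over is that the callback has to be
able to update start and end, so the sketch passes them by address. It
also ignores tokens that straddle two buffers, which a real version would
have to handle.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer source: the application hands out one chunk per call. */
struct ChunkSource {
  const char *const *chunks;
  size_t count;
  size_t next;
};

/* Hypothetical callback in the spirit of XML_GetNextBuffer: points *start
   and *end at the next chunk and returns 1, or returns 0 when the input
   is exhausted. */
static int getNextBuffer(void *userData, const char **start, const char **end)
{
  struct ChunkSource *src = (struct ChunkSource *)userData;
  if (src->next >= src->count)
    return 0;
  *start = src->chunks[src->next++];
  *end = *start + strlen(*start);
  return 1;
}

/* Dummy stand-in for doContent: counts '<' characters.
   A nonzero return value would mean an error. */
static int demoDoContent(int *tagCount, const char *start, const char *end)
{
  for (; start < end; start++)
    if (*start == '<')
      (*tagCount)++;
  return 0;
}

/* A "processor" that pulls its own input instead of having it pushed. */
static int pullingContentProcessor(void *userData, int *tagCount)
{
  const char *start;
  const char *end;
  while (getNextBuffer(userData, &start, &end)) {
    int result = demoDoContent(tagCount, start, end);
    if (result != 0)
      return result;
  }
  return 0;
}
```

The caller no longer drives the loop; it only answers requests for more
bytes until it runs out.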

> B.  Handle "XML_USE/XML_SKIP/XML_ERROR" values set during callbacks.
> C.  If XML_USE is returned, we have to store information about the
> current node for retrieval and return.

> 3.  Default callbacks that match the capabilities of the Pull API should be
> provided.  That is, "XML_USE" should be the default for StartElement,
> EndElement, and Character callbacks.

Btw, if we use return codes, then one way to default them would be
by passing them by reference (instead of a "real" return value),
and have that value defaulted accordingly. That way the programmer
only needs to set it in the "unusual" case.
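
A small sketch of the by-reference variant, with made-up names throughout
(NodeAction, dispatchCharacters, and the handlers are for illustration
only). The parser presets the action before the call, so a handler that
is happy with the default never touches it.

```c
#include <stddef.h>

enum NodeAction { XML_SKIP, XML_USE, XML_ERROR };   /* hypothetical codes */

/* Handlers keep a void return type; the action is an in/out parameter. */
typedef void (*CharHandler)(void *userData, const char *s, int len,
                            enum NodeAction *action);

/* Stand-in for the spot in the parser where character data is reported. */
static enum NodeAction dispatchCharacters(CharHandler handler, void *userData,
                                          const char *s, int len)
{
  enum NodeAction action = XML_USE;   /* default, preset by the parser */
  if (handler)
    handler(userData, s, len, &action);
  return action;
}

/* Usual case: the handler ignores the action and the default stands. */
static void countChars(void *userData, const char *s, int len,
                       enum NodeAction *action)
{
  (void)s;
  (void)action;
  *(int *)userData += len;
}

/* Unusual case: skip text nodes that consist only of whitespace. */
static void skipBlank(void *userData, const char *s, int len,
                      enum NodeAction *action)
{
  int i;
  (void)userData;
  for (i = 0; i < len; i++)
    if (s[i] != ' ' && s[i] != '\t' && s[i] != '\n' && s[i] != '\r')
      return;   /* real text: keep the default */
  *action = XML_SKIP;
}
```

Compared with a real return value, the default can differ per callback type
without every handler having to spell it out.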

> Seeing the roadmap, I can see why the main expat distribution is holding off
> on these things...  My goal would be to provide a proof of concept
> implementation that minimizes impact to the main parsing loops so that any
> changes to the main parsing loop that occur during the 1.95 -> 2.0
> transition are easy to merge with changes necessary for pull parsing.

I think if we can get away with something as simple as the contentProcessor
approach from above, coupled with custom callbacks, it might not be that bad.
But the devil is in the detail ...

The API you posted looks good for a starting point. Signature details
can potentially change. But let's take it one step at a time.
I am interested in this myself (a lot), but lack of time is my enemy.

Karl