[Expat-discuss] New Expat functionality and API proposal

Scott Bronson bronson@rinspin.com
Fri Aug 9 20:20:02 2002


> Requests for a pull-based API for Expat have surfaced a few times over
> (at least) the last couple of years; there is a feature request for
> this on SourceForge (issue #544682):


> Expat could provide the basis for an efficient pull-based API if it
> offered an opportunity to suspend parsing temporarily, allowing
> parsing to resume when the application is ready for additional
> information from the document.  A .NET-like API could easily be built
> on top of such a feature.

Why?  What does suspend have to do with pulling?


> Karl Waclawek and I have been having discussions about this, and think
> we have a good idea of how to introduce such a feature into Expat.
> There are questions and issues regarding the possible API that would
> need to be exposed; I've summarized our ideas and analysis below in the
> form of two alternate API proposals.
> 
> We welcome feedback and discussion, including the introduction of
> additional API proposals, on the expat-discuss list.

I actually thought about this a while ago but never went anywhere
with it (due to other problems with the project that was to use it).
I did, however, send the following set of files to a friend.  Glad
they're still in my sent-mail directory.


After seeing how ugly push got, I wrote a shim to implement pull.
Exrub (stupid name -- I was tired) is pretty easy to use: you
just keep asking it for tokens until it returns EOF.

So, to slam together an example, here is how you would parse
an arbitrary number of section elements like the following:

<section name='abc'>
  <other><elems go='here'/></other>
</section>
<section name='def'>
</section>
<section name='ggg'>
...


using Exrub.  The (off the top of my head) Pythonish code:


  parser = exrub.Exrub()
  file = open('file.xml', 'r')
  parser.SetFile(file)

  ...

  while 1:
    tok = parser.GetNextNonWSToken()   # ignore whitespace
    if tok.type == START:
      if tok.name == 'section':
        if tok.attrs.has_key('name'):
          NewSection(tok.attrs['name'])
        else:
          Error("All sections must have a name attribute")
      else:
        Error("This element can only contain sections.")
    elif tok.type == END:
      break
    else:                              # token is character data
      print tok.data

  (NewSection would keep asking for more tokens and parsing
   sub-elements until it gets an end section tag, whereupon
   it would return)
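The recursive pull pattern that parenthetical describes can be sketched like so. The `Token` class and the list-backed token stream here are hypothetical stand-ins for the real Exrub objects in the attachment, not its actual code:

```python
# Sketch of the "keep pulling until the end tag, recursing on nested
# start tags" pattern.  A plain list stands in for the parser's token
# stream; real Exrub tokens would come from GetNextNonWSToken().

class Token:
    def __init__(self, type, name=None, data=None):
        self.type = type   # 'START', 'END', or 'CDATA'
        self.name = name   # element name for START/END tokens
        self.data = data   # text for CDATA tokens

def read_element(stream):
    """Pull tokens until this element's end tag; return nested content."""
    children = []
    while stream:
        tok = stream.pop(0)
        if tok.type == 'START':
            # a nested element: recurse, then record (name, contents)
            children.append((tok.name, read_element(stream)))
        elif tok.type == 'END':
            # our own end tag: this element is complete
            return children
        else:
            children.append(tok.data)
    return children
```

The call stack does the bookkeeping that a push-style handler would have to track by hand.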

An exrub token has a type (start tag / end tag / character data)
and a name.  If it's a start tag, it also has all of the attributes
in a hash.  If it's char data, it contains the data in a string.
Pretty simple.
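For anyone without the attachment, a token interface like that can be sketched on top of Python's bundled Expat binding (xml.parsers.expat). This is my own guess at the shape, not the Exrub code itself: it parses eagerly and queues tokens (a real shim would suspend Expat between tokens), and it takes a string where the message's example uses SetFile:

```python
# A minimal Exrub-style pull shim over xml.parsers.expat.  The names
# (Token, Exrub, GetNextNonWSToken) follow the message; the internals
# are assumptions.  SetString is a simplification of the SetFile call
# shown above.
import xml.parsers.expat

START, END, CDATA, EOF = 'START', 'END', 'CDATA', 'EOF'

class Token:
    def __init__(self, type, name=None, attrs=None, data=None):
        self.type = type          # START, END, CDATA, or EOF
        self.name = name          # element name for START/END tokens
        self.attrs = attrs or {}  # attribute hash for START tags
        self.data = data          # text for CDATA tokens

class Exrub:
    def __init__(self):
        self._tokens = []
        p = xml.parsers.expat.ParserCreate()
        p.StartElementHandler = self._start
        p.EndElementHandler = self._end
        p.CharacterDataHandler = self._chars
        self._parser = p

    def _start(self, name, attrs):
        self._tokens.append(Token(START, name=name, attrs=attrs))

    def _end(self, name):
        self._tokens.append(Token(END, name=name))

    def _chars(self, data):
        self._tokens.append(Token(CDATA, data=data))

    def SetString(self, text):
        # Simplification: parse the whole document up front and queue
        # the tokens.  With suspend/resume support in Expat, the parser
        # could instead stop after each token.
        self._parser.Parse(text, True)
        self._tokens.append(Token(EOF))

    def GetNextToken(self):
        return self._tokens.pop(0)

    def GetNextNonWSToken(self):
        # Skip character-data tokens that are pure whitespace.
        while True:
            tok = self.GetNextToken()
            if tok.type != CDATA or tok.data.strip():
                return tok
```

The consumer loop from the example above then works against this unchanged, apart from the input call.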

It's MUCH easier to parse an XML file using this style of pull than
it is to try to implement a FSM to reassemble data that is pushed.
Compare read-exrub.py and read-fsm.py.  The biggest thing to notice
is that the structure of the code in read-exrub is pretty similar to
the XML file.  The structure of read-fsm, though, is totally different.
Good luck understanding it...
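For contrast, here is what a minimal push-style handler for the same section document looks like, again over xml.parsers.expat. This is only a sketch of the FSM flavor (read-fsm.py in the attachment is surely more involved); note the explicit depth counter that has to be threaded between callbacks:

```python
# Push-style version of the section parser: the handler object must
# carry explicit state (a depth counter) across callbacks, because
# Expat calls *us* -- we can't just ask for the next token.
import xml.parsers.expat

class SectionHandler:
    def __init__(self):
        self.sections = []   # collected section names
        self.depth = 0       # 0 = outside root, 1 = root level, ...

    def start(self, name, attrs):
        if self.depth == 1:  # directly inside the root element
            if name != 'section':
                raise ValueError("This element can only contain sections.")
            if 'name' not in attrs:
                raise ValueError("All sections must have a name attribute")
            self.sections.append(attrs['name'])
        self.depth += 1

    def end(self, name):
        self.depth -= 1

def parse(text):
    h = SectionHandler()
    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = h.start
    p.EndElementHandler = h.end
    p.Parse(text, True)
    return h.sections
```

Even this toy needs state the pull version gets for free from its call stack; add mixed content or deeper nesting and the state machine grows quickly.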

So, is Exrub (minus the name) similar to what you were thinking?
If not, then why not?  :)

    - Scott


A non-text attachment was scrubbed...
Name: parse-vs-fsm.tar.gz
Type: application/x-gzip
Size: 7806 bytes
Desc: not available
Url : http://mail.libexpat.org/pipermail-21/expat-discuss/attachments/20020809/509f8389/parse-vs-fsm.tar.bin
