[XML-SIG] parsers and XML

travish travish@realtime.net
Thu, 10 Aug 2000 14:40:08 -0500 (CDT)


> | a) most of the XML "parsers" actually appear to be lexers
> 
> You mean, since they don't build complete document trees?

I mean since they appear to be lexers:

http://nightflight.com/cgi-bin/foldoc.cgi?query=lexer
lexer -->
lexical analyser
<language> (Or "scanner") The initial input stage of a language
processor (e.g. a compiler), the part that performs lexical analysis.

http://nightflight.com/cgi-bin/foldoc.cgi?lexical+analysis
lexical analysis
<programming> (Or "linear analysis", "scanning") The first stage
of processing a language. The stream of characters making up the
source program or other input is read one at a time and grouped
into lexemes (or "tokens") - word-like pieces such as keywords,
identifiers, literals and punctuation. The lexemes are then passed
to the parser.

["Compilers - Principles, Techniques and Tools", by Alfred V. Aho,
Ravi Sethi and Jeffrey D. Ullman, pp. 4-5]
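
To make the analogy concrete, an XML lexer amounts to little more than
this toy regular-expression tokenizer (my own illustration, not any
particular library's code):

    import re

    # a toy lexer: split markup into tag and text tokens, nothing more
    TOKEN = re.compile(r"<[^>]+>|[^<]+")

    def lex(xml_text):
        return TOKEN.findall(xml_text)

    print(lex("<a><b>hi</b></a>"))  # ['<a>', '<b>', 'hi', '</b>', '</a>']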

> This is so
> because XML has a much simpler structure (and potentially much greater
> sizes) than what parsers traditionally have parsed.

I'm not so sure; I've compiled very large C files before.

> This makes an event-based API very useful.

The "event-based API" bears a striking resemblance to a lexer, and is
usually only useful if you do a certain amount of state-tracking yourself.
(e.g. how many levels of tags deep am I, and which tags are they?)
That is the traditional role of a parser, and the "event-driven API" apparently
does none of it.
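
For concreteness, here is a rough sketch (using xml.sax; the document
and tag names are invented) of the bookkeeping the application is left
to do for itself:

    import xml.sax

    class StackTracker(xml.sax.ContentHandler):
        """Track how many tags deep we are, and which tags they are."""
        def __init__(self):
            self.stack = []                  # open tags, outermost first

        def startElement(self, name, attrs):
            self.stack.append(name)
            print("depth %d: %s" % (len(self.stack), "/".join(self.stack)))

        def endElement(self, name):
            self.stack.pop()

    xml.sax.parseString(b"<a><b><c/></b></a>", StackTracker())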

> In Python we have so far chosen to make tree building separate utilities.

And reasonably so.

> If you want a document tree, look at 4DOM or qp_xml.

Actually, I want something between the two APIs that appear to be present
(lexing and generating a full AST).  For example, in the reduce phase
of a shift-reduce parser like yacc (which corresponds to a close-tag
event from an "event-driven API"), one is given the ability to
'condense' all of the subtrees of this particular node, requiring
neither a full AST nor manually tracking the stack of nested tags you
are currently inside.  This would be extremely handy for (e.g.)
converting XML to nested data structures.
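
Here is a hand-rolled sketch of that reduce step, again on top of
xml.sax; it is exactly the stack bookkeeping the library could be doing
for me, and the (name, children) tuple representation is just one
arbitrary choice:

    import xml.sax

    class Condenser(xml.sax.ContentHandler):
        """Reduce each element to (name, children) as soon as it closes."""
        def __init__(self):
            self.stack = [("#root", [])]   # partial results, one per open tag

        def startElement(self, name, attrs):
            self.stack.append((name, []))  # "shift": open a new partial node

        def characters(self, content):
            if content.strip():
                self.stack[-1][1].append(content)

        def endElement(self, name):
            node = self.stack.pop()        # "reduce": condense the subtree
            self.stack[-1][1].append(node)

    h = Condenser()
    xml.sax.parseString(
        b"<addr><name>Guido</name><city>Reston</city></addr>", h)
    print(h.stack[0][1])
    # [('addr', [('name', ['Guido']), ('city', ['Reston'])])]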

> | b) none of the examples are of sufficient/substantial complexity
> |    (e.g. recursive nesting, deep/complex hierarchy)
> | 
> |    If anyone has suggestions on what kind of parser to use as a back
> |    end (yapps?  kjParsing?  etc.) I'd be interested to hear it.
> 
> I don't understand this question.

Meaning, how does one utilize the existing "real" parsers to quickly and
robustly do the work which seems to be required by the "event-driven
API", namely keeping track of which tags one is in and correlating those
tags to actions to take.  This is a solved problem, and has been for
decades.

All of the examples I've seen have a fixed, shallow tag hierarchy and so
are toy problems which don't encounter these complexities.
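
For what it's worth, here is the sort of thing I mean, sketched by hand
on top of xml.sax (the tag paths and actions are invented): a table
correlating full tag paths with actions, which copes with arbitrary
nesting instead of a fixed, shallow hierarchy:

    import xml.sax

    class Dispatcher(xml.sax.ContentHandler):
        """Call an action when the element at a given tag path closes."""
        def __init__(self, actions):
            self.actions = actions     # maps "a/b/c" -> callable(text)
            self.path = []
            self.text = []

        def startElement(self, name, attrs):
            self.path.append(name)
            self.text = []             # crude: keeps leaf text only

        def characters(self, content):
            self.text.append(content)

        def endElement(self, name):
            action = self.actions.get("/".join(self.path))
            if action:
                action("".join(self.text))
            self.path.pop()

    def print_title(text):
        print("chapter title: " + text)

    actions = {"book/chapter/title": print_title}
    xml.sax.parseString(
        b"<book><chapter><title>Lexing</title></chapter></book>",
        Dispatcher(actions))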

> The diffs seem to be for the pyexpat driver. This has nothing to do
> with sgmlop or xmllib. 

Perhaps you should look a little more carefully before sending back such
a pointed response.

> What is the problem with the description?

For one thing, it appears that the character accumulation callback has
a different signature from the other parsers', passing only one argument
instead of three (charstr, start, len).  If so, that hardly makes sgmlop
an invisible drop-in replacement for the other parsers.
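
A driver could paper over that difference with a trivial adapter,
something along these lines (I have not verified sgmlop's actual
signature; the handler names here are only illustrative):

    def adapt_handle_data(three_arg_consumer):
        """Wrap a consumer expecting (data, start, length) so that a
        parser which calls handle_data(data) can still drive it."""
        def handle_data(data):
            three_arg_consumer(data, 0, len(data))
        return handle_data

    # e.g.: parser.handle_data = adapt_handle_data(builder.handle_data)
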
-- 
Those who will not reason, are bigots, those who cannot,
    are fools, and those who dare not, are slaves.
       - George Gordon Noel Byron (1788-1824)