[Expat-discuss] Document boundaries (was Re: Text data handler)

Karl Waclawek karl at waclawek.net
Fri May 21 23:59:08 EDT 2004


----- Original Message ----- 
From: "Thomás Inskip" <tinskip at widevine.com>
To: "Greg Martin" <Greg.Martin at TELUS.COM>
Cc: <expat-discuss at libexpat.org>
Sent: Friday, May 21, 2004 9:27 PM


> >>
> >
> > Right, you need to call XML_ParserReset and then re-register your
> > handlers before calling XML_Parse again. You can call XML_Parse as
> > many times as you want on a single document for the same parser but
> > must re-initialise the parser before starting a new document.
> >
> >
> The thing is that I am implementing a pretty generic
> transaction-oriented communications protocol; requests go in one
> direction, and responses are sent back.  Those transactions are encoded
> as XML.  The transactions go in each direction in blocks of data, which
> may contain multiple transactions, or portions of a transaction.  I'd
> rather not have to pre-parse the stream to figure out where each
> transaction (document) starts and ends before I pass it on to the
> parser.

I don't know of a parser that can handle that.
You pretty much have to tell the parser where the document ends,
as they are all geared towards processing one document only.

Also, consider this: as the parser is reading past the end tag
it is still legal to have comments, processing instructions
and whitespace. So, unless the parser encounters anything illegal
it will consider everything part of the document until it
sees - let's say - the XML declaration of the next document.
However, nothing told the parser that this is where the next document
starts - so it will evaluate it from the point of view of having
another XML declaration after the end tag, which is illegal.

This means the start and end of a document have to be determined
outside of the data stream.
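
To illustrate the point, here is a minimal sketch using Python's
standard xml.parsers.expat binding to Expat (the C API behaves the
same way as far as I know). The helper name parse_as_one_document and
the sample inputs are made up for this example. Comments and
processing instructions after the end tag are accepted as part of the
first document, while a second XML declaration is reported as an error.

```python
from xml.parsers import expat


def parse_as_one_document(data):
    """Parse data as ONE document; return None on success, else the error text."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True: this is the final chunk
    except expat.ExpatError as exc:
        return str(exc)
    return None


# Comments, PIs and whitespace are still legal after the root end tag.
trailing_misc = b"<a/>\n<!-- still part of document one -->\n<?pi still legal?>\n"

# A second document glued on: the parser only sees illegal content
# after the end of the first root element.
two_documents = b"<a/>\n<?xml version='1.0'?>\n<b/>"

print(parse_as_one_document(trailing_misc))   # no error for this input
print(parse_as_one_document(two_documents))   # an error message
```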

> Is it possible to call XML_ParserReset from within a handler (such as
> an end element handler)?  Probably not a good idea, huh?  If I could
> then I would just call it when I reach the end of the top-level element
> (document).

I think that might give you an access violation.

> What I've done for now is just prime the parser with "<Document>" so
> that all of the transactions are considered to be subelements of
> "Document".  What I worry about is this: if there is some screwy XML in
> the stream, the parser may never recover and I won't be able to parse
> past the error point, rendering any further transactions binary waste.
> How good is Expat at recovering from errors?  I couldn't find any info
> in that regard.

Expat does not recover from well-formedness violations.
These are fatal errors. IMO, your best bet is to have
separators in the data stream (like null characters),
and scan for them to detect the end of document.
Then submit each chunk between separators as a separate
document, resetting the parser in between.
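
A rough sketch of that separator approach, again with Python's
xml.parsers.expat binding. The class name TransactionReader, the
on_document callback and the sample data are all hypothetical. It
assumes a NUL byte never occurs inside a document (true for UTF-8, not
for UTF-16). The Python binding has no equivalent of XML_ParserReset,
so the sketch creates a fresh parser per document; in C you would
reset the parser and re-register the handlers at that point instead.

```python
from xml.parsers import expat


class TransactionReader:
    """Split a byte stream on NUL separators; parse each piece as its own document."""

    def __init__(self, on_document):
        # on_document(root_name_or_None, error_text_or_None), once per document
        self.on_document = on_document
        self._start_document()

    def _start_document(self):
        # Fresh parser and handlers for every document.
        self.parser = expat.ParserCreate()
        self.parser.StartElementHandler = self._on_start
        self.root = None
        self.error = None
        self.seen_data = False

    def _on_start(self, name, attrs):
        if self.root is None:
            self.root = name  # remember the top-level element

    def _parse(self, data, final):
        if self.error is not None:
            return  # document already failed: skip up to the next separator
        try:
            self.parser.Parse(data, final)
        except expat.ExpatError as exc:
            self.error = str(exc)

    def feed(self, block):
        """Accept one block; it may hold several documents or only part of one."""
        pieces = block.split(b"\0")
        for i, piece in enumerate(pieces):
            if piece:
                self.seen_data = True
                self._parse(piece, False)
            if i < len(pieces) - 1:  # a separator followed this piece
                if self.seen_data:
                    self._parse(b"", True)  # tell the parser the document ended
                    self.on_document(self.root, self.error)
                self._start_document()


results = []
reader = TransactionReader(lambda root, err: results.append((root, err)))
reader.feed(b"<req id='1'/>\0<re")             # second document is split
reader.feed(b"sp/>\0<bad></oops>\0<ok/>\0")    # third one is malformed
for root, err in results:
    print(root, "OK" if err is None else "FAILED: " + err)
```

The malformed third document is reported as failed, and the document
after it still parses, because the error is confined to one parser
instance.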

Karl



