[Expat-discuss] Discussion: InternalEntityRefHandler

Mon Jun 17 07:15:02 2002

Rolf Ade wrote:

> On 16 Jun, Karl Waclawek wrote:
> I know you aim for SAX compliance. Although I don't think that SAX
> compliance is a strict 'must' for expat, I could see some value in
> this.
>
> Interestingly enough, you argue with SAX compliance for your proposal,
> although your proposal in fact doesn't resemble the SAX startEntity
> handler.

Correct. I also try to satisfy some feature requests.

> The SAX startEntity handler has as sole argument the entity name
> (parameter entities have '%' prepended to their names) and especially
> no return value (is void). See
> http://www.saxproject.org/apidoc/org/xml/sax/ext/LexicalHandler.html#startEntity(java.lang.String)

The return value has the function of enabling the caller to suppress
expansion of selected entities, a feature request - see patch # 429501.
The entity value as parameter goes beyond SAX, for reasons explained.
My goal really is to add SAX functionality missing in Expat, not to turn
the Expat API into a SAX clone.

> Since you now propose a Start/End InternalEntityHandler pair, why not
> just always expand any entity (PE's and GE's), if possible? With the
> Start-/EndInternalEntityHandler the programmer is able to preserve the
> GE's inside the document - which is the main use case for this
> Start-/EndInternalEntityHandler, that I'm able to see. Without the
> risk of any harm. (But see the exceptions, at the end of the mail).

This would work for GEs, I think, but you still have the disadvantage
of letting Expat do unnecessary work - always parsing the entity value
even when you don't need it. But Expat is all about efficiency...

> > - The StartInternalEntityHandler returns a boolean-type
> >   value that determines if the entity will be expanded or not;
> >   if there is no expansion, then the EndInternalEntityHandler
> >   will not be called.
>
> Well, as already said above, SAX doesn't know anything about such a
> return value of startEntity.

Correct. SAX only allows disabling expansion of external PEs globally.
The proposal goes beyond SAX, to satisfy the request to disable expansion
of specific entities (predefined general entities for the feature request).
I didn't narrow the feature to such entities, but made it more generic.
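
To make the proposed semantics concrete, here is a toy model in Python.
It is not Expat code and none of these names exist in Expat: expand(),
start_handler and end_handler are hypothetical stand-ins for the proposed
handler pair. It only shows the rule "no expansion means no end call",
and it does not handle nested references.

```python
import re

# Matches a general entity reference such as &name;
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")


def expand(text, entities, start_handler, end_handler):
    """Toy model of the proposed semantics; hypothetical, not Expat API."""
    out = []
    pos = 0
    for match in ENTITY_REF.finditer(text):
        out.append(text[pos:match.start()])
        pos = match.end()
        name = match.group(1)
        value = entities.get(name)
        # The start handler sees the name and the entity value and
        # decides whether the reference gets expanded.
        if value is not None and start_handler(name, value):
            out.append(value)
            end_handler(name)
        else:
            # Not expanded: keep the reference, do not call the end handler.
            out.append(match.group(0))
    out.append(text[pos:])
    return "".join(out)
```

A start handler that returns False for one particular entity name leaves
that reference in the output untouched, while all other declared entities
are expanded and bracketed by a start/end pair.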

> > - It may not make much sense to have this "optional expansion"
> >   available for parameter entity references too, but I would
> >   leave it in for the following reasons:
> >   a) I haven't seen a good argument why it will never make sense
> >      (anyone please feel free to come up with one)
>
> Since this "optional expansion" is clearly an enhancement of the SAX
> startEntity handler API by you, why don't you bring up a pro argument
> for doing it?

Because I think I rather need a counter argument.
I am the kind of guy who needs an argument *for* doing the extra work,
and *no* argument for *not* doing it. ;-)

> In a previous mail I've provided a range of examples, for which
> suppressing the PE's expansion is disastrous.
>
> Why do you want to add a feature, for which you have to write in the
> documentation: "We don't know a sensible use case for this feature. Be
> warned: If you use suppressing of PE expansion, there are pretty good
> chances that you break your application."
>
> I really don't buy your argument 'It may eventually be useful for
> something, not known up to now'. If we accept this argument as valid,
> we have to add anything to the API.

I don't quite agree. There are reasons for adding the Start/EndEntityHandler
the way I proposed. The ability to suppress PE expansion would then come
with it as an automatic side effect.
So, I don't need an argument for that ability, since it will be there
automatically. What I need is a good argument against, since my point is:
It will be there - why make extra steps to take it out?

> Do you know about any feature request for 'suppressing PE expansion'?
> No? Oh...

See above.

> Don't fix it, if it isn't broken.

Does that now mean: don't take it out? ;-)

> >   b) It is extra work to disable it, and because of a)
> >      I see no reason to invest more effort to not have it
>
> Again, I don't buy this argument. Even if we came to the
> conclusion, that the StartInternalEntityHandler should return a flag,
> this would be 1 line of code. Come on, Karl.

I just see no reason to do something extra to take a feature
out that comes automatically, if you can't tell me that this
feature makes no sense. It's not about the extra hours (or rather minutes)
I would have to spend.

Now - is there anybody on this list who knows XML well
enough to produce a good argument? I would be the first
to take the "suppress PE expansion" capability out, if
it makes no sense - but unfortunately I am not enough
of an expert to have a definitive opinion.

> >   c) This capability is analogous to an existing one for the
> >      ExternalEntityRefHandler, since there, expansion
> >      can be suppressed within the handler - and I like symmetry <g>
>
> No. This symmetry is only very vague. The ExternalEntityRefHandler has
> to do the expansion of the external entity on its own (resolve the
> external entity, create an externalEntityParser, read the external
> entity and feed it into the externalEntityParser etc.). The
> StartInternalEntityHandler doesn't do the entity expansion, it's done
> by the parser. Where's the symmetry?

In the things you can do - suppress expansion of all types of entities,
report entity boundaries, look at the entity value ...
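
As a rough illustration of the existing capability on the external side,
here is a sketch using Python's xml.parsers.expat bindings (an assumption
on my part; the thread is about the C API). The system identifier
"ext.txt" and the replacement bytes are made up, and nothing is read from
disk. The handler either feeds a sub-parser itself or simply returns
without parsing, which suppresses the expansion.

```python
import xml.parsers.expat

DOC = b"""<!DOCTYPE r [
<!ENTITY ext SYSTEM "ext.txt">
]>
<r>[&ext;]</r>"""


def parse(expand):
    """Parse DOC; expand the external entity only if `expand` is true."""
    text = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = text.append

    def external_entity_ref(context, base, system_id, public_id):
        if expand:
            # The application does the expansion itself: it creates a
            # sub-parser and feeds it the entity's content.
            sub = parser.ExternalEntityParserCreate(context)
            sub.Parse(b"replacement", True)
        # A true return value tells the parser to carry on, whether or
        # not the entity was actually parsed.
        return 1

    parser.ExternalEntityRefHandler = external_entity_ref
    parser.Parse(DOC, True)
    return "".join(text)
```

Calling parse(True) and parse(False) gives the two cases: with and
without the entity content showing up between the brackets.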

> Or look what happens if both ExternalEntityRefHandler and
> StartInternalEntityHandler are not set. External entities are silently
> skipped (not expanded), but internal entities will be expanded (if
> their declaration is known). Really no clear symmetry.

It is true that, because of the fact that the handler doesn't actually
initiate the parsing of the internal entity, the behaviour in the case
of a missing handler would be different, unless we had an extra API capability
to control internal entity expansion. But we have it: SetDefaultHandler(Expand).

You just provided an argument to keep SetDefaultHandler and SetDefaultHandlerExpand,
and not deprecate them.
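
The difference between the two setters can be seen in a small sketch,
again assuming Python's xml.parsers.expat bindings, where they appear as
the DefaultHandler and DefaultHandlerExpand attributes. The document and
the entity "e" are made up for the illustration.

```python
import xml.parsers.expat

DOC = b"""<!DOCTYPE r [
<!ENTITY e "text">
]>
<r>&e;</r>"""


def parse(expand):
    """Return (default handler chunks, character data) for DOC."""
    default_chunks = []  # everything passed to the default handler
    char_data = []       # everything passed to the character data handler
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = char_data.append
    if expand:
        # Default handler that leaves internal entity expansion on.
        parser.DefaultHandlerExpand = default_chunks.append
    else:
        # Default handler that turns internal entity expansion off; the
        # reference itself is passed to the default handler instead.
        parser.DefaultHandler = default_chunks.append
    parser.Parse(DOC, True)
    return default_chunks, "".join(char_data)
```

With expand=False the reference "&e;" shows up among the default handler
chunks and no character data is reported; with expand=True the
replacement text arrives as character data.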

> >   d) I don't like function arguments that depend on other
> >      function arguments - i.e. "ignore return value if PE".
> >      However, I recognize that this can often not be avoided
> >      in realistic implementations.
>
> OK. But if StartInternalEntityHandler returns void, this problem
> vanishes.

*if* it returns void...

> > - The literal entity value (with character references expanded)
> >   will be passed to the StartInternalEntityHandler.
> >   Main reason: the information is available, and is easy to pass.
> >   However, I do not know of a really good use case for it,
> >   but left it in, just in case there is one. Again, feel free
> >   anyone to come up with pros and cons.
>
> What you call the "literal entity value" is exactly the same, that
> someone could get via the entityDeclHandler, right?

I would say so.

> So, since there is already a way, to get this information, if someone
> really needs it, where's the point in providing a function argument,
> which will - as even you confirm - be ignored in almost all cases?

Well, in our private conversations I was thinking about taking it out,
but you thought I might as well leave it in. One advantage I can think of:
It saves you an extra hash table lookup - since Expat does that anyway.
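
For comparison, this is roughly what the application has to do today to
have the entity values at hand: collect them from the entity declaration
handler into its own table and look them up by name later. The sketch
assumes Python's xml.parsers.expat bindings; the entity names and the
"ext.txt" identifier are made up.

```python
import xml.parsers.expat

DOC = b"""<!DOCTYPE r [
<!ENTITY a "f&#33;">
<!ENTITY ext SYSTEM "ext.txt">
]>
<r/>"""

# Application-side table: entity name -> literal entity value.
values = {}


def entity_decl(name, is_parameter_entity, value, base,
                system_id, public_id, notation_name):
    # `value` is None for external entities; only internal ones are kept.
    if value is not None:
        values[name] = value


parser = xml.parsers.expat.ParserCreate()
parser.EntityDeclHandler = entity_decl
parser.Parse(DOC, True)
```

The value stored for "a" already has the character reference expanded,
which matches the "literal entity value" described in the proposal.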

> > - *Unclear Issue*:
> >
> > There is a statement in the SAX specs that pertains
> > to the topic of this thread, but I don't quite understand it:
> >
> > <SAX-Quote>
> > Because of the streaming event model that SAX uses, some entity boundaries
> > cannot be reported under any circumstances:
> >   a.. general entities within attribute values
> >   b.. parameter entities within declarations
> > These will be silently expanded, with no indication of where the original
> > entity boundaries were.
> > </SAX-Quote>
> >
> > This would mean, that this should not be possible in Expat.
> > I agree that this is true when parsing entity declarations
> > or attribute values in declarations, since at this point
> > the replacement text is not constructed. However, whenever
> > the replacement text is built, this should be possible, or
> > is there something obvious I am not understanding?
>
> Yes. As illustration, look at this example:
>
> <?xml version="1.0" ?>
> <!DOCTYPE element [
> <!ENTITY a "f" >
> <!ENTITY b "b" >
> ]>
> <element attr1="foo&a;foo" attr2="boo" attr3="bar&b;bar">baz</element>
>
> Since the element name and all attributes are reported at once via the
> elementStartHandler, at which point will you call Start- and
> EndInternalEntityHandler? You could only call it before or after the
> call of the elementStartHandler. But this doesn't provide really
> useful information. Even if you call the StartInternalEntityHandler
> twice, let's say, before the elementStartHandler, in which of the three
> attributes are the two entities? And at which position inside the
> attribute value? The same argument goes for "b..".

Very good point, thanks!
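
The example can be run as-is to see what the start element handler
actually receives. This is a sketch assuming Python's xml.parsers.expat
bindings rather than the C API: all attributes arrive in one call, with
the entity references already replaced.

```python
import xml.parsers.expat

DOC = b"""<?xml version="1.0" ?>
<!DOCTYPE element [
<!ENTITY a "f" >
<!ENTITY b "b" >
]>
<element attr1="foo&a;foo" attr2="boo" attr3="bar&b;bar">baz</element>
"""

seen = []


def start_element(name, attrs):
    # Called once per start tag, with all attribute values together.
    seen.append((name, dict(attrs)))


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse(DOC, True)
```

After parsing, attr1 is "fooffoo" and attr3 is "barbbar"; nothing in the
handler's arguments says where "&a;" or "&b;" used to be.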

Another way of stating the argument for b) in the quoted SAX spec:
The XML spec seems to indicate, that the replacement text for
an entity is constructed *before* it is parsed. Which would mean
that entity boundaries for PEs in literal entity values get lost.

> This is clearly a drawback. But there isn't much one can do about it,
> even with your proposal.
>
> (Well, I could see a way around, which even brings your
> StartInternalEntityHandler return value back, but I doubt it would be
> accepted. And since it's already high time to keep still, I'll do
> exactly this ;-).)

Now you made me curious... <g>

Karl