[Expat-discuss] Discussion: InternalEntityRefHandler

Sun Jun 16 19:01:02 2002

On 16 Jun, Karl Waclawek wrote:
> Here is what I believe the proposed InternalEntityRefhandler
> should do (I leave Rolf's objections for him to explain):
> 
> - For SAX compliance I need a StartInternalEntityHandler
>   and an EndInternalEntityHandler instead of just an
>   InternalEntityRefhandler.

I know you aim for SAX compliance. Although I don't think that SAX
compliance is a strict 'must' for expat, I can see some value in
this.

Interestingly enough, you justify your proposal with SAX compliance,
although your proposal does not in fact resemble the SAX startEntity
handler.

The SAX startEntity handler takes the entity name as its sole argument
(parameter entities have '%' prepended to their names) and, notably,
has no return value (it is void). See
http://www.saxproject.org/apidoc/org/xml/sax/ext/LexicalHandler.html#startEntity(java.lang.String)

Since you now propose a Start/End InternalEntityHandler pair, why not
just always expand every entity (PEs and GEs), if possible? With the
Start-/EndInternalEntityHandler the programmer is able to preserve the
GEs inside the document, which is the main use case for this
Start-/EndInternalEntityHandler that I can see, and it carries no
risk of harm. (But see the exceptions at the end of this mail.)
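
To make that use case concrete, here is a minimal sketch in plain C of
how an application could keep general entity references intact with a
void Start/End pair while the parser always expands. Everything in it
is an assumption for illustration: the handler names, their signatures
and the Serializer struct are hypothetical and are not part of expat.

```c
#include <string.h>

/* Hypothetical serializer state; none of these names exist in expat. */
typedef struct {
    char out[256];   /* re-serialized document text */
    int depth;       /* > 0 while inside an expanded entity */
} Serializer;

/* Append len bytes to the output, truncating rather than overflowing. */
static void append(Serializer *s, const char *text, size_t len)
{
    size_t used = strlen(s->out);
    size_t room = sizeof s->out - 1 - used;
    if (len > room)
        len = room;
    memcpy(s->out + used, text, len);
    s->out[used + len] = '\0';
}

/* Assumed shape: void return, entity name as the only payload. */
void start_internal_entity(void *user_data, const char *name)
{
    Serializer *s = user_data;
    if (s->depth++ == 0) {   /* write the reference only at the outermost level */
        append(s, "&", 1);
        append(s, name, strlen(name));
        append(s, ";", 1);
    }
}

void end_internal_entity(void *user_data, const char *name)
{
    Serializer *s = user_data;
    (void)name;
    if (s->depth > 0)
        s->depth--;
}

/* Character data callback: drop text that came from an entity expansion. */
void character_data(void *user_data, const char *text, int len)
{
    Serializer *s = user_data;
    if (s->depth == 0)
        append(s, text, (size_t)len);
}
```

Feeding this the events for "x&a;y" (characters, start, expanded
characters, end, characters) reproduces the reference in the output
instead of the replacement text, without the start handler having to
return anything.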

> - The StartInternalEntityHandler returns a boolean-type
>   value that determines if the entity will be expanded or not;
>   if there is no expansion, then the EndInternalEntityHandler
>   will not be called.

Well, as already said above, SAX doesn't know anything about such a
return value of startEntity.

> - It may not make much sense to have this "optional expansion"
>   available for parameter entity references too, but I would
>   leave it in for the following reasons:
>   a) I haven't seen a good argument why it will never make sense
>      (anyone please feel free to come up with one)

Since this "optional expansion" is clearly your own extension of the
SAX startEntity handler API, why don't you give an argument in favor
of it?

In a previous mail I provided a range of examples in which
suppressing PE expansion is disastrous.

Why do you want to add a feature, for which you have to write in the
documentation: "We don't know a sensible use case for this feature. Be
warned: If you use suppressing of PE expansion, there are pretty good
chances that you break your application."

I really don't buy your argument that 'it may eventually be useful
for something not known up to now'. If we accept this argument as
valid, we would have to add anything and everything to the API.

Do you know about any feature request for 'suppressing PE expansion'?
No? Oh...

Don't fix it, if it isn't broken.

>   b) It is extra work to disable it, and because of b)
>      I see no reason to invest more effort to not have  it

Again, I don't buy this argument. Even if we came to the
conclusion that the StartInternalEntityHandler should return a flag,
ignoring it for PEs would be one line of code. Come on, Karl.

>   c) This capability is analog to an existing one for the
>      ExternalEntityRefhandler, since there, expansion
>      can be suppressed within the handler - and I like symmetry <g>

No. This symmetry is only very vague. The ExternalEntityRefhandler has
to do the expansion of the external entity on its own (resolve the
external entity, create an externalEntityParser, read the external
entity and feed it into the externalEntityParser, etc.). The
StartInternalEntityHandler doesn't do the entity expansion; that is
done by the parser. Where's the symmetry?

Or look at what happens if neither the ExternalEntityRefhandler nor
the StartInternalEntityHandler is set. External entities are silently
skipped (not expanded), but internal entities will be expanded (if
their declaration is known). Really no clear symmetry.

>   d) I don't like function arguments that depend on other
>      function arguments - i.e. "ignore return value if PE".
>      However, I recognize that this can often not be avoided 
>      in realistic implementations.

OK. But if the StartInternalEntityHandler returns void, this problem
vanishes.

> - The literal entity value (with character references expanded)
>   will be passed to the StartInternalEntityHandler.
>   Main reason: the information is available, and is easy to pass.
>   However, I do not know of a really good use case for it,
>   but left it in, just in case there is one. Again, feel free
>   anyone to come up with pros and cons.

What you call the "literal entity value" is exactly the same thing
that someone could get via the entityDeclHandler, right?

So, since there is already a way to get this information if someone
really needs it, what's the point of providing a function argument
which will, as even you confirm, be ignored in almost all cases?
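
As a rough sketch of that existing route: an application that wants
the value can record it when the declaration is reported and look it
up by name later. The callback below follows the parameter list of
expat's XML_EntityDeclHandler, with plain char standing in for
XML_Char so that the sketch compiles without expat. The EntityTable
type and the entity_value() helper are made up for illustration.

```c
#include <string.h>

#define MAX_ENTITIES 32

typedef struct {
    char name[64];
    char value[256];
} EntityRecord;

/* Hypothetical application-side table, not an expat structure. */
typedef struct {
    EntityRecord items[MAX_ENTITIES];
    int count;
} EntityTable;

/* Same parameter list as XML_EntityDeclHandler (char instead of XML_Char).
   The value is not NUL-terminated; value_length gives its size. */
void on_entity_decl(void *user_data, const char *entity_name,
                    int is_parameter_entity,
                    const char *value, int value_length,
                    const char *base, const char *system_id,
                    const char *public_id, const char *notation_name)
{
    EntityTable *t = user_data;
    EntityRecord *rec;
    (void)is_parameter_entity; (void)base; (void)system_id;
    (void)public_id; (void)notation_name;

    /* value is NULL for external entities; skip those and overflow cases. */
    if (value == NULL || value_length < 0 || t->count >= MAX_ENTITIES)
        return;
    if (strlen(entity_name) >= sizeof rec->name ||
        (size_t)value_length >= sizeof rec->value)
        return;

    rec = &t->items[t->count++];
    strcpy(rec->name, entity_name);
    memcpy(rec->value, value, (size_t)value_length);
    rec->value[value_length] = '\0';
}

/* Look up a recorded value by entity name; NULL if it was never declared. */
const char *entity_value(const EntityTable *t, const char *name)
{
    int i;
    for (i = 0; i < t->count; i++)
        if (strcmp(t->items[i].name, name) == 0)
            return t->items[i].value;
    return NULL;
}
```

A start handler that receives only the entity name could then call
entity_value() in the rare case where the value is actually wanted.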

> - There is possible conflict with SetDefaulthandler and
>   SetDefaultHandlerExpand, since these *also* try to determine
>   if internal entities will be expanded.
>   For backwards compatibility, it should work like this:
>   a) StartInternalEntityHandler is set:
>      SetDefaulthandler and SetDefaultHandlerExpand have no effect
>      on internal entity expansion, they just set the default handler.
>   b) StartInternalEntityHandler is NOT set:
>      SetDefaulthandler and SetDefaulthandlerExpand behave as
>      they used to - no change.
>   In the future, SetDefaultHandlerExpand should be deprecated,
>   and SetDefaultHandler should loose its function to suppress
>   internal entity expansion. 

Yes.

> - *Unclear Issue*:
> 
> There is a statement in the SAX specs that pertains
> to the topic of this thread, but I don't quite understand it:
> 
> <SAX-Quote>
> Because of the streaming event model that SAX uses, some entity boundaries
> cannot be reported under any circumstances:
>   a.. general entities within attribute values
>   b.. parameter entities within declarations
> These will be silently expanded, with no indication of where the original entity boundaries were.
> </SAX_Quote>
> 
> This would mean, that this should not be possible in Expat.
> I agree that this is true when parsing entity declarations
> or attribute values in declarations, since at this point
> the replacement text is not constructed. However, whenever
> the replacement text is built, this should be possible, or
> is there something obvious I am not understanding?

Yes. As illustration, look at this example:

<?xml version="1.0" ?>
<!DOCTYPE element [
<!ENTITY a "f" >
<!ENTITY b "b" >
]>
<element attr1="foo&a;foo" attr2="boo" attr3="bar&b;bar">baz</element>

Since the element name and all attributes are reported at once via the
elementStartHandler, at which point would you call the Start- and
EndInternalEntityHandler? You could only call them before or after the
call to the elementStartHandler. But this doesn't provide really
useful information. Even if you call the StartInternalEntityHandler
twice, let's say before the elementStartHandler, in which of the three
attributes are the two entities? And at which position inside the
attribute value? The same argument applies to "b..".
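
The loss of boundaries can be shown with a toy model in plain C. This
is not how expat builds attribute values internally; the Entity struct
and expand_attribute() are invented for this sketch, which only
substitutes "&name;" references from a small table.

```c
#include <string.h>

/* Hypothetical name/replacement-text pair for the toy model. */
typedef struct {
    const char *name;
    const char *text;
} Entity;

/* Expand "&name;" references in src into dst. Returns 0 on success,
   -1 on an unknown entity, a missing ';' or a too-small buffer. */
int expand_attribute(const char *src, const Entity *ents, size_t n_ents,
                     char *dst, size_t dst_size)
{
    size_t out = 0;

    if (dst_size == 0)
        return -1;
    while (*src) {
        const char *piece = src;   /* default: copy one literal character */
        size_t piece_len = 1;

        if (*src == '&') {
            const char *semi = strchr(src, ';');
            const char *text = NULL;
            size_t name_len, i;

            if (semi == NULL)
                return -1;
            name_len = (size_t)(semi - src - 1);
            for (i = 0; i < n_ents; i++)
                if (strlen(ents[i].name) == name_len &&
                    strncmp(ents[i].name, src + 1, name_len) == 0)
                    text = ents[i].text;
            if (text == NULL)
                return -1;
            piece = text;
            piece_len = strlen(text);
            src = semi + 1;        /* skip past the whole reference */
        } else {
            src++;
        }
        if (out + piece_len >= dst_size)
            return -1;
        memcpy(dst + out, piece, piece_len);
        out += piece_len;
    }
    dst[out] = '\0';
    return 0;
}
```

With the declarations from the example document, "foo&a;foo" becomes
"fooffoo" and "bar&b;bar" becomes "barbbar". A handler that receives
only the finished strings cannot tell which "f" or "b" came from an
entity, and separate start/end calls made before or after it would not
say which attribute or offset they refer to.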

This is clearly a drawback. But there isn't much one can do about it,
even with your proposal.

(Well, I could see a way around this, which would even bring your
StartInternalEntityHandler return value back, but I doubt it would be
accepted. And since it's already high time to keep quiet, I'll do
exactly that ;-).)

rolf