adding the XML to 2.0 to be a mistake?

Robert Roy rjroy at takingcontrol.com
Fri Jan 19 23:45:16 EST 2001


On Thu, 18 Jan 2001 20:41:21 -0800, Paul Prescod
<paulp at ActiveState.com> wrote:

>As to your complaints about SAX. The one constant in the world of XML
>and SGML, going back over 10 years is that people always complain that
>their parsers do not give them low enough access to the parse stream.
>This is because parsers are optimized for the *average case*. In the
>average case, you don't want to do your own entity lookups.
>
>If what you want is an XML tokenizer, then perhaps xmllib is just right.
>But even then it isn't perfect because it will return your attributes in
>an arbitrary order, not the order they were specified in the document. I
>need an XML tokenizer for a project I am working on and I am going to
>have to wrap Expat's XMLTok API.
>

I guess what I have been doing calls for a hybrid of sorts: I care
about the entities, but I don't really care about the attribute order.

>I'm not disputing the usefulness of XML tokenizers -- I'm disputing their
>relative utility compared to XML parsers.
>
>>  For several tasks (e.g. translation to another DTD/Schema) it is
>> desirable not to resolve any character entities including the
>> standard XML entity defs. ...
>
>What you want isn't an XML parser. XML parsers are required by the
>specification to resolve character entities.
>
>> Undeclared entities are a problem in SAX but can be handled cleanly
>> using the unknown_entityref mechanism in xmllib.
>
>According to the XML specification, *all* entities must be declared. An
>XML parser is required to check that.
>

I may be misinterpreting the spec but if I declare standalone="no"
with a non-validating parser, should it not ignore entities that it
can't find?

Section 4.1:
"Similarly, the declaration of a general entity must precede any
reference to it which appears in a default value in an attribute-list
declaration. Note that if entities are declared in the external subset
or in external parameter entities, a non-validating processor is not
obligated to read and process their declarations; for such documents,
the rule that an entity must be declared is a well-formedness
constraint only if standalone='yes'. "

If I am reading the spec correctly, then expat's behavior (at least in
the version included with Python 2.0) is questionable. It should not
choke on an undeclared entity reference, since it must assume that the
entity is declared elsewhere.
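
A minimal sketch of how one might check this against the expat that a
current Python ships, using xml.parsers.expat from the standard library.
The helper name undeclared_refs and the file name doc.dtd are made up
for illustration, and the behavior it probes is what I would expect from
recent expat releases, not necessarily the one bundled with 2.0:

```python
from xml.parsers import expat

# Hypothetical test documents; "doc.dtd" is never opened, because a plain
# pyexpat parser does not read external subsets unless told to.
DOC_EXTERNAL = '<!DOCTYPE doc SYSTEM "doc.dtd"><doc>&foo;</doc>'
DOC_STANDALONE = ('<?xml version="1.0" standalone="yes"?>'
                  '<!DOCTYPE doc SYSTEM "doc.dtd"><doc>&foo;</doc>')
DOC_NO_DTD = '<doc>&foo;</doc>'


def undeclared_refs(document):
    """Return the entity references the parser passed through unexpanded.

    Raises expat.ExpatError if the parser treats the document as not
    well-formed.
    """
    passed_through = []

    def on_default(data):
        # The default handler receives any markup that has no handler of
        # its own; keep only tokens that look like "&name;".
        if data.startswith("&") and data.endswith(";"):
            passed_through.append(data)

    parser = expat.ParserCreate()
    parser.DefaultHandler = on_default
    parser.Parse(document, True)
    return passed_through


if __name__ == "__main__":
    for label, doc in [("external subset", DOC_EXTERNAL),
                       ("standalone=yes", DOC_STANDALONE),
                       ("no DTD", DOC_NO_DTD)]:
        try:
            print(label, "->", undeclared_refs(doc))
        except expat.ExpatError as err:
            print(label, "-> rejected:", err)
```

If the reading of section 4.1 above is right, only the first document
should get through with the reference reported back; the standalone and
no-DTD documents have nowhere else the entity could be declared, so
rejecting them is consistent with the spec.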


>Here's the example XML document:
>
><!DOCTYPE Element [
><!ELEMENT Element ANY>
><!ENTITY abcdef "<Element/>">
>]>
><Element>&abcdef;</Element>


I understand that this is only an example, but to be picky: this
resolves to <Element> <Element/> </Element>, which gets past the parser
all right but seems to violate the part of the root element definition
that says "no part of which appears in the content of any other
element".

>
>SAX correctly reports two elements. xmllib incorrectly reports one. And
>it isn't fair to come back with: "Who cares about entities" because you
>were complaining about SAX's entity support. The real question is
>whether you would rather have low-level vs. correct entity handling. The
>answer is: a tokenizer should have low-level, a parser should have
>correct.
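
The two-versus-one element count can be checked with xml.sax from the
standard library. A small sketch, in which ElementCounter and
element_names are names invented here, and which assumes the default
expat-backed SAX driver expands internal entities:

```python
import xml.sax

# The example document quoted above, unchanged.
DOCUMENT = b"""<!DOCTYPE Element [
<!ELEMENT Element ANY>
<!ENTITY abcdef "<Element/>">
]>
<Element>&abcdef;</Element>"""


class ElementCounter(xml.sax.ContentHandler):
    """Record the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def element_names(document):
    """Parse the document and return the reported element names in order."""
    handler = ElementCounter()
    xml.sax.parseString(document, handler)
    return handler.names


if __name__ == "__main__":
    print(element_names(DOCUMENT))
```

The outer element comes from the document itself and the inner one from
the replacement text of &abcdef;, which is why a parser that expands
the entity reports it as a second start tag.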

I want it all correct AND flexible <g/>. Now if we can just hook in
here and override there...

Bob



