[Expat-discuss] Bug? Illegal parameter reference error for valid document
Josh Martin
Josh.Martin@abq.sc.philips.com
Tue Nov 6 16:35:02 2001
Karl,
I'm sorry if I was vague, but you misunderstood me. It's not the size of the
buffer per se that matters, it's how much of the file you read in at a time and
send to the parser. Let me see if I can make this clearer.
Let's say we're parsing "test.xml":
<?xml version="1.0" standalone="no"?>
<!DOCTYPE test SYSTEM "test.dtd">
<thing>My name is &bob;.</thing>
, "test.dtd":
<!ENTITY % TESTent SYSTEM "test.ent">
%TESTent;
<!ENTITY bob %myname;>
<!ELEMENT thing (#PCDATA)>
and "test.ent":
<!ENTITY % myname '"Bob"'>
and we have an external entity parser that starts out like this:
int extern_ent(XML_Parser p, const XML_Char *context, const XML_Char *base,
               const XML_Char *systemId, const XML_Char *publicId)
{
    XML_Parser e;
    unsigned char buff[4096];
    FILE *ext;
    int length;
    char fname[1024];
    /* Open the file FNAME and assign it to the file stream EXT. */
    .
    .
    .
    /* Create a new external entity parser and assign it to E. */
    .
    .
    .
Now we might be tempted to read in the file one buffer-sized chunk at a time, in
which case the external entity parsing loop would look like this (ignoring
errors for clarity):
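    // Parse until the file is done
    do
    {
        // Read up to one buff-sized chunk of the file into BUFF and store the
        // number of bytes successfully read into LENGTH. (The element size
        // passed to fread must be 1 for it to return a byte count.)
        length = (int) fread(buff, 1, sizeof(buff), ext);
        // Parse the contents of BUFF; a short read means this is the last
        // chunk.
        XML_Parse(e, (const char *) buff, length, length < (int) sizeof(buff));
    } while (length == (int) sizeof(buff));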
You would think this should work. When the external parser is told to parse the
file "test.dtd", it reads the entire file into the buffer and parses it. It then
sees an external reference to the file "test.ent" and spawns off another
external parser to parse it. However, when that parser returns and parsing of
"test.dtd" continues, it discards the rest of the XML stored in the buffer and
reads in a new buffer (which is empty because the end of the file has been
reached). This causes the entity declaration for 'bob' to go unparsed, and when
the main parser encounters the XML containing '&bob;' it will generate an error.
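One assumption in the chunked loop is worth checking on its own: LENGTH only
holds a byte count if fread(3) is called with an element size of 1. Here is a
minimal sketch of the difference. The helper names read_bytes and
read_one_item are made up for illustration and are not part of expat.

```c
#include <assert.h>
#include <stdio.h>

/* Read up to CAP bytes from FP into BUF and return the byte count.
   With an element size of 1, fread's return value is a byte count. */
static size_t read_bytes(FILE *fp, char *buf, size_t cap)
{
    assert(fp != NULL && buf != NULL);
    return fread(buf, 1, cap, fp);
}

/* The same read expressed as a single CAP-sized item. fread returns the
   number of complete items, so this yields 1 only if all CAP bytes were
   available and 0 otherwise, however many bytes actually arrived. */
static size_t read_one_item(FILE *fp, char *buf, size_t cap)
{
    assert(fp != NULL && buf != NULL);
    return fread(buf, cap, 1, fp);
}
```

For a file shorter than the buffer, read_bytes reports the file's size while
read_one_item reports 0, so only the first form gives a length that can be
passed on to the parser.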
To keep the declaration for 'bob' from being skipped, read the file one line at
a time instead of one buffer-sized chunk at a time. That way, when the external
parser for "test.ent" is done, the next buffer parsed by the external parser
for "test.dtd" will be the line containing the entity declaration for 'bob'.
If we do this the external parsing loop will look something like this (again,
ignoring errors for clarity):
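    // Parse until the file is done. fgets returns NULL at EOF, so test its
    // result before calling strlen on the buffer.
    while (fgets((char *) buff, sizeof(buff), ext) != NULL)
    {
        // BUFF now holds up to sizeof(buff) - 1 characters of the file,
        // ending at a new-line character or at EOF. Store the number of
        // characters successfully read into LENGTH.
        length = (int) strlen((char *) buff);
        // Parse the contents of BUFF.
        XML_Parse(e, (const char *) buff, length, 0);
    }
    // Tell the parser that the input is finished.
    XML_Parse(e, (const char *) buff, 0, 1);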
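For anyone who wants to try the line-at-a-time loop without linking against
expat, here is a minimal sketch. The parser call is replaced by a hypothetical
callback type, feed_fn, that takes the same (data, length, is-final) arguments
as XML_Parse. The names feed_fn, feed_lines, feed_stats and count_feed are
assumptions made for this example only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for XML_Parse(parser, data, len, isFinal):
   returns nonzero on success and zero on error. */
typedef int (*feed_fn)(void *user, const char *data, int len, int is_final);

/* Feed FP to FEED one line at a time, then signal the end of input with
   an empty final call. Returns the number of non-final calls made, or -1
   if FEED reported an error. */
static int feed_lines(FILE *fp, feed_fn feed, void *user)
{
    char buff[4096] = "";
    int calls = 0;

    assert(fp != NULL && feed != NULL);

    /* fgets returns NULL at end of file, so check it before strlen. */
    while (fgets(buff, sizeof(buff), fp) != NULL) {
        if (!feed(user, buff, (int) strlen(buff), 0))
            return -1;
        calls++;
    }
    /* Zero-length final call: tells the consumer the input is complete. */
    if (!feed(user, buff, 0, 1))
        return -1;
    return calls;
}

/* Example consumer: counts bytes and how many final calls it has seen. */
struct feed_stats { long bytes; int finals; };

static int count_feed(void *user, const char *data, int len, int is_final)
{
    struct feed_stats *s = user;
    (void) data;
    s->bytes += len;
    s->finals += is_final;
    return 1;
}
```

Calling feed_lines(ext, count_feed, &stats) on "test.dtd" should hand each
declaration to the callback in its own call, which is the property the
work-around depends on. In a real handler the callback would wrap XML_Parse on
the external entity parser E.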
One could also parse one character of the file at a time (possibly using
fgetc(3S)) and be able to catch all of the external references, but this would
probably be less efficient.
I hope this is a little clearer this time.
- Josh Martin
> > Hi,
> >
> > I was having a lot of problems with nested external entities. I would assume
> > that your problem is the same as the problem that I was having. When you
> > parse your XML document and external entities you need to either read the
> > file in one byte at a time, or you need to read in one line at a time
> > (stopping at the newline character). The reason for this is that if you read
> > past the newline you might pick up more than one entity declaration in your
> > buffer. After you come back from your external entity handler you will start
> > reading after the end of the buffer and lose the extra information. This
> > will most likely cause you to miss some of the declarations. For example
> > take the entity file "004-1.ent":
>
> This does not sound like proper behaviour. Is this a documented bug?
>
> > 1: <!ELEMENT doc EMPTY>
> > 2: <!ENTITY % e1 SYSTEM "004-2.ent">
> > 3: <!ENTITY % e2 "%e1;"> --> Expat does not seem to like "%e1;"
> > 4: %e1;
> >
> > If you use a large buffer and just read in bytes until you fill the buffer
> > and then parse it, you will probably load the whole file into the buffer.
> > When you parse the buffer the external entity handler will be called on the
> > second line (which contains the 'e1' entity declaration). When you return
> > from the external entity parser and begin parsing again the parser will
> > start at the end of the file, skipping lines 3 and 4. If you parse the file
> > either one byte at a time, or one line at a time then all of the entity
> > declarations will be parsed correctly.
>
> I just tried it with a one-byte buffer, but got the same problem (file
> 004.xml).
>
> > Also, make sure that you have something like the following in your main
> > parser loop, otherwise you might never parse all the entities that you need
> > to parse.
> >
> > XML_SetParamEntityParsing(p, XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE);
>
> I use that one: XML_PARAM_ENTITY_PARSING_ALWAYS.
>
> > Personally I think this behavior of expat is flawed, since other handlers
> > seem to be called multiple times just fine if there are multiple instances
> > in the buffer. However, the fact that the external entity handler spawns off
> > an entirely new parser probably throws a monkey wrench into the whole deal.
> > Meanwhile you can just use the work-around that I mentioned and it should
> > all work just fine. Let me know if this fixes your problem, or if you find
> > another solution.
>
> To be honest, so far Expat's behaviour seems consistent and independent of
> buffer size.
> Can you give an example that behaves differently with different buffer sizes?
>
> Karl
>