[Expat-discuss] Bug? Illegal parameter reference error for valid document

Josh Martin Josh.Martin@abq.sc.philips.com
Tue Nov 6 16:35:02 2001


Karl,

I'm sorry if I was vague, but you misunderstood me.  It's not the size of the 
buffer per se that matters, it's how much of the file you read in at a time and 
send to the parser.  Let me see if I can make this clearer.

Let's say we're parsing "test.xml":

   <?xml version="1.0" standalone="no"?>
   <!DOCTYPE test SYSTEM "test.dtd">
   <thing>My name is &bob;.</thing>

with "test.dtd":
   <!ENTITY % TESTent SYSTEM "test.ent">
   %TESTent;
   <!ENTITY bob %myname;>
   <!ELEMENT thing (#PCDATA)>

and "test.ent":
   <!ENTITY % myname "&#x22;Bob&#x22;">
   
and we have an external entity parser that starts out like this:

   int extern_ent(XML_Parser p, const XML_Char *context, const XML_Char *base,
   	       const XML_Char *systemId, const XML_Char *publicId)
   {
     XML_Parser e;
     unsigned char buff[4096];
     FILE *ext;
     int length;
     char fname[1024];

     /* Open the file FNAME and assign it to the file stream EXT. */
     .
     .
     .
     /* Create a new external entity parser and assign it to E. */
     .
     .
     .
     
Now we might be tempted to read in the file one buffer-sized chunk at a time, in 
which case the external entity parsing loop would look like this (ignoring 
errors for clarity):

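   // Parse until the file is done
   do
   {
      // Read up to sizeof(buff) bytes of the file into BUFF and store the
      // number of bytes successfully read into LENGTH.  The item size is 1
      // so that fread returns a byte count rather than an item count.
      length = (int) fread(buff, 1, sizeof(buff), ext);
      // Parse the contents of BUFF, flagging the last chunk once EOF is hit.
      XML_Parse(e, (const char *) buff, length, feof(ext));
   } while (!feof(ext));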

You would think this should work.  When the external parser is told to parse the 
file "test.dtd" it reads the entire file into the buffer and parses it.  It then 
sees an external reference to the file "test.ent" and spawns off another 
external parser to parse it.  However, when that parser returns and parsing of 
"test.dtd" continues it discards the rest of the XML stored in the buffer and 
reads in a new buffer (which is empty because the end of the file has been 
reached).  This causes the entity declaration for 'bob' to go unparsed, and when 
the main parser encounters the XML containing '&bob' it will generate an error.
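
Whatever chunk size is used, the length handed to the parser has to be a byte 
count, and the final call has to be flagged as final.  Here is a minimal sketch 
of such a read loop using only the C standard library.  The names 
chunk_feed_fn, feed_in_chunks, chunk_stats and count_chunk are made up for 
illustration; the callback merely stands in for a call such as XML_Parse and 
is not part of expat.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the parser call: receives one chunk, its length
   in bytes, and a flag that is non-zero for the final chunk.  Returns 0 to
   ask the caller to stop. */
typedef int (*chunk_feed_fn)(void *user, const char *data, int len,
                             int is_final);

/* Example callback (an assumption, for illustration only): tallies bytes
   and counts how many chunks were flagged as final. */
struct chunk_stats { long total; int finals; };

int count_chunk(void *user, const char *data, int len, int is_final)
{
    struct chunk_stats *st = user;
    (void) data;
    st->total += len;
    if (is_final)
        st->finals++;
    return 1;
}

/* Read STREAM in chunks of at most CHUNK bytes and pass each one to FEED.
   Returns the number of calls made to FEED, or -1 on a read error or if
   FEED asked to stop. */
int feed_in_chunks(FILE *stream, size_t chunk, chunk_feed_fn feed, void *user)
{
    char buff[4096];
    int calls = 0;

    assert(chunk > 0 && chunk <= sizeof(buff));
    for (;;) {
        /* Item size 1, so the return value is the number of bytes read. */
        size_t n = fread(buff, 1, chunk, stream);
        if (ferror(stream))
            return -1;
        /* A short or empty read without an error means EOF was reached. */
        int is_final = feof(stream) != 0;
        calls++;
        if (!feed(user, buff, (int) n, is_final))
            return -1;
        if (is_final)
            return calls;
    }
}
```

If the file length is an exact multiple of the chunk size, the last call 
carries zero bytes and only the final flag.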

To get around this problem, read the file one line at a time instead of one 
buffer-sized chunk at a time.  That way, when the external parser for "test.ent" 
is done, the next buffer parsed by the external parser for "test.dtd" will be 
the line containing the entity declaration for 'bob'.

If we do this the external parsing loop will look something like this (again, 
ignoring errors for clarity):

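   // Parse until the file is done
   for (;;)
   {
      // Read characters from the file into BUFF until a new-line character
      // is read, BUFF is full, or EOF is encountered.  fgets returns NULL
      // when nothing more could be read, so check it before calling strlen.
      if (fgets((char *) buff, sizeof(buff), ext) == NULL)
      {
         // Nothing left: tell the parser that the document is complete.
         XML_Parse(e, (const char *) buff, 0, 1);
         break;
      }
      // Store the number of characters read into LENGTH and parse them.
      length = (int) strlen((char *) buff);
      XML_Parse(e, (const char *) buff, length, 0);
   }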
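
For anyone who wants to try the line-at-a-time idea outside of expat, here is a 
rough stand-alone sketch using only the C standard library.  The names 
line_feed_fn, feed_by_line, line_stats and record_line are hypothetical; the 
callback only stands in for the parser call.  It assumes the file contains no 
embedded NUL bytes, since strlen is used to measure each line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the parser call: one line, its length in bytes,
   and a flag that is non-zero on the final (empty) call.  Returns 0 to ask
   the caller to stop. */
typedef int (*line_feed_fn)(void *user, const char *data, int len,
                            int is_final);

/* Example callback (an assumption, for illustration only): counts lines,
   bytes, and final calls. */
struct line_stats { int lines; long bytes; int finals; };

int record_line(void *user, const char *data, int len, int is_final)
{
    struct line_stats *st = user;
    (void) data;
    if (is_final) {
        st->finals++;
    } else {
        st->lines++;
        st->bytes += len;
    }
    return 1;
}

/* Pass STREAM to FEED one line at a time, then make one empty final call.
   Lines longer than the buffer arrive in several pieces.  Returns the
   number of calls made to FEED, or -1 on a read error or if FEED asked to
   stop. */
int feed_by_line(FILE *stream, line_feed_fn feed, void *user)
{
    char line[4096];
    int calls = 0;

    assert(stream != NULL && feed != NULL);
    /* fgets returns NULL at end of file or on error, so its result is
       tested before the buffer is measured. */
    while (fgets(line, sizeof(line), stream) != NULL) {
        calls++;
        if (!feed(user, line, (int) strlen(line), 0))
            return -1;
    }
    if (ferror(stream))
        return -1;
    /* End of file: signal completion with a zero-length final call. */
    calls++;
    if (!feed(user, line, 0, 1))
        return -1;
    return calls;
}
```

With a three-line input, the callback would be invoked four times: once per 
line and once more for the empty final call.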
   
One could also parse one character of the file at a time (possibly using 
fgetc(3S)) and be able to catch all of the external references, but this would 
probably be less efficient.

I hope this is a little clearer this time.

 - Josh Martin

> > Hi,
> > 
> > I was having a lot of problems with nested external entities.  I would
> > assume that your problem is the same as the problem that I was having.
> > When you parse your XML document and external entities you need to either
> > read the file in one byte at a time, or you need to read in one line at a
> > time (stopping at the newline character).  The reason for this is that if
> > you read past the newline you might pick up more than one entity
> > declaration in your buffer.  After you come back from your external entity
> > handler you will start reading after the end of the buffer and lose the
> > extra information.  This will most likely cause you to miss some of the
> > declarations.  For example take the entity file "004-1.ent":
> 
> This does not sound like proper behaviour. Is this a documented bug?
>  
> > 1: <!ELEMENT doc EMPTY>
> > 2: <!ENTITY % e1 SYSTEM "004-2.ent">
> > 3: <!ENTITY % e2 "%e1;">  --> Expat does not seem to like "%e1;"
> > 4: %e1;
> > 
> > If you use a large buffer and just read in bytes until you fill the buffer
> > and then parse it, you will probably load the whole file into the buffer.
> > When you parse the buffer the external entity handler will be called on the
> > second line (which contains the 'e1' entity declaration).  When you return
> > from the external entity parser and begin parsing again the parser will
> > start at the end of the file, skipping lines 3 and 4.  If you parse the
> > file either one byte at a time, or one line at a time then all of the
> > entity declarations will be parsed correctly.
> 
> I just tried it with a one-byte buffer, but got the same problem (file
> 004.xml).
>  
> > Also, make sure that you have something like the following in your main
> > parser loop, otherwise you might never parse all the entities that you
> > need to parse.
> > 
> > XML_SetParamEntityParsing(p, XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE);
> 
> I use that one: XML_PARAM_ENTITY_PARSING_ALWAYS.
>  
> > Personally I think this behavior of expat is flawed, since other handlers
> > seem to be called multiple times just fine if there are multiple instances
> > in the buffer.  However, the fact that the external entity handler spawns
> > off an entirely new parser probably throws a monkey wrench into the whole
> > deal.  Meanwhile you can just use the work-around that I mentioned and it
> > should all work just fine.  Let me know if this fixes your problem, or if
> > you find another solution.
> 
> To be honest, so far Expat's behaviour seems consistent and independent of
> buffer size.
> Can you give an example that behaves differently with different buffer sizes?
> 
> Karl
>