[Expat-discuss] Parsing question from a newbie.

Wed Mar 30 21:00:50 CEST 2011

Lee,

Thank you for your help.  

Indeed, I am seeing the <LF> between XML tags.  That accounts for all the curious cases of len=1;  I had observed that the tab and space characters were counted.  I was ignoring line terminators.  I will take whitespace and non-ASCII characters into account in my handler.

I understand your caution about "split buffers" and the possible need to concatenate/combine calls to value_data_handler().

Thanks again for your help.

John
jct at sharplabs.com

-----Original Message-----
From: Lee Passey [mailto:lee at novomail.net] 
Sent: Wednesday, March 30, 2011 9:04 AM
To: Thomas, John
Subject: Re: [Expat-discuss] Parsing question from a newbie.

On Tue, March 29, 2011 7:37 pm, Thomas, John wrote:

> Suppose that I have an XML snippet of the form:
>
> <myXML>
>   <size_1>
>     <height>8.5</height>
>     <width>11.0</width>
>   </size_1>
>   <size_2>
>     <height>8.0</height>
>     <width>10.0</width>
>   </size_2>
> </myXML>

[snip]

> If my understanding is correct, expat calls the value_data_handler() 
> function when it has located the start and stop tags for an XML element.

Close, but not quite.

> First question: Why does expat give a length value of 1 when there is 
> no text between <tokenA></tokenA>?

My hunch is that there /is/ text between the two tokens, you just don't see it.

One of the first things you need to remember about XML is that /all/ data is significant, even white space. If your input is:

<tokenA>
</tokenA>

there would, in fact, be one (or maybe two) characters between the tags: a newline (and perhaps a carriage return), and the Character Data Handler would be called to handle it. If whitespace is insignificant in your application it is up to you to deal with it.
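As a rough illustration of "dealing with it", a handler could test each chunk and skip those that contain nothing but whitespace. This is only a sketch: is_all_whitespace() is a hypothetical helper name, not part of Expat, and it assumes the default build where the handler receives plain char data that is not NUL-terminated.

```c
#include <ctype.h>

/* Returns 1 if the first len bytes of s are all whitespace, else 0.
   s is not assumed to be NUL-terminated, which matches the (pointer,
   length) pair handed to a character data handler.
   An empty chunk (len == 0) counts as whitespace-only. */
int is_all_whitespace(const char *s, int len)
{
    for (int i = 0; i < len; i++) {
        /* Cast to unsigned char: passing a negative value to isspace()
           is undefined, and UTF-8 bytes above 0x7F are negative where
           char is signed. */
        if (!isspace((unsigned char)s[i]))
            return 0;
    }
    return 1;
}
```

A character data handler could call this first and return early when it yields 1, provided whitespace really is insignificant for the element being read.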

Essentially, /everything/ in your input that doesn't fall into one of the other handler categories will end up being passed to the Character Data Handler. It's up to the programmer to put this data together in a meaningful way.

One important thing to know about the Character Data Handler is that there is no guarantee that when it is called you will have all of the data that appears between two elements; you need to be prepared to concatenate data if need be. Common reasons why data is broken up are:

1. the existence of newline characters in the data stream,
2. encountering the end of the input buffer,
3. encountering named or numeric entities.
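One way to cope with split data is to append every chunk to a buffer kept in the user data and only look at the result once the element ends. The following is a minimal sketch, not a definitive implementation: struct text_acc, on_chars(), reset_text() and the 1024-byte limit are all made up for illustration, and the handler signature assumes XML_Char is plain char.

```c
#include <string.h>

/* Hypothetical accumulator for the text of the current element. */
struct text_acc {
    char   buf[1024];
    size_t len;        /* bytes stored so far, not counting the NUL */
};

/* Same shape as a character data handler:
   (void *userData, const XML_Char *s, int len).
   s is not NUL-terminated, so only len bytes are copied. */
void on_chars(void *userData, const char *s, int len)
{
    struct text_acc *acc = userData;
    if (len <= 0)
        return;

    /* Leave one byte for the terminating NUL; silently truncate
       anything that does not fit (an assumption of this sketch). */
    size_t room = sizeof acc->buf - 1 - acc->len;
    size_t n = (size_t)len < room ? (size_t)len : room;

    memcpy(acc->buf + acc->len, s, n);
    acc->len += n;
    acc->buf[acc->len] = '\0';
}

/* Call when an element starts so each element begins with empty text. */
void reset_text(struct text_acc *acc)
{
    acc->len = 0;
    acc->buf[0] = '\0';
}
```

The idea would be to call reset_text() from the start-element handler and to read acc->buf in the end-element handler, by which point all the pieces have arrived.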

Consider the following XML snippet, where &#233; and &eacute; both represent the acute 'e' (é):

<title>
  My r&#233;sum&eacute;
</title>

Expat would /probably/ parse this as follows:

Call StartElement, name = "title"
Call value_data_handler, len = 1, *buf_ptr = '\n'
Call value_data_handler, len = 6, *buf_ptr = "  My r"
Call value_data_handler, len = 2, *buf_ptr = utf-8 representation of 233
Call value_data_handler, len = 3, *buf_ptr = "sum"
Call value_data_handler, len = 2, *buf_ptr = utf-8 representation of 233 [1]
Call value_data_handler, len = 1, *buf_ptr = '\n'
Call EndElement, name = "title"

> I know that the userData structure COULD hold this information that I 
> seek provided that I, the developer, put it there.  But I would have 
> to know how and when to OBTAIN that contextual data in the first place.
> And I do not.  So ....
>
> Second question: How (and when) do I obtain the context information 
> for a call to value_data_handler()?  Does it exist in a convenient 
> form and is there an expat-provided function call to get it?

The context information is provided to every call through the "userData" pointer, which you provided when you called SetUserData(). In your example, you passed the address of an integer to SetUserData(). Thereafter, in every Expat call a pointer to that integer was passed in as the first parameter.

Now a pointer to an integer is not very useful, but you /can/ pass a pointer to any data structure you want. For example, suppose you are writing an application that will read an XML file and "pretty print" it by removing unwanted spaces and adding other space. You could do something like this:

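FILE *fout = fopen( "Pretty.xml", "w" );
XML_SetUserData( parser, fout );

/* Start-element callback: userData is the FILE * registered above. */
static void XMLCALL
startElement(void *userData, const XML_Char *name, const XML_Char **atts) {
  FILE *fout = (FILE *) userData;
  fprintf( fout, "<%s", name );
  /* etc. */
}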

I have written some routines to build and manipulate a DOM tree (http://sourceforge.net/projects/domcapi/). I sometimes use Expat as the base parser. In these cases I build a data structure to capture my state, something like this:

struct
{
  Node currNode;
  Node parentNode;
  char charData[1024];
} xmlctx;

I then pass a pointer to this structure in the call to SetUserData and consequently this structure is now available to every Expat callback. (The same thing can be accomplished with global variables, but it's not as safe).

The upshot of this is that if you need to know the current element name when in your value_data_handler() callback it's up to /you/ to save that state in some sort of data structure (probably a stack) when StartElement() is called.
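A small stack of element names kept in the user data might look like the sketch below. Everything here is hypothetical (struct parse_ctx, the handler names, the depth and name limits); the handlers merely have the same shape as Expat's start and end element callbacks with XML_Char as plain char. Names are copied rather than stored by pointer, on the assumption that the pointer should not be relied on after the callback returns.

```c
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 32   /* illustrative limits, not Expat constants */
#define MAX_NAME  64

/* Hypothetical parse context passed as userData. */
struct parse_ctx {
    char names[MAX_DEPTH][MAX_NAME];
    int  depth;        /* number of currently open elements */
};

/* Push the element name; over-deep names are counted but not stored. */
void on_start(void *userData, const char *name, const char **atts)
{
    struct parse_ctx *ctx = userData;
    (void)atts;
    if (ctx->depth < MAX_DEPTH)
        snprintf(ctx->names[ctx->depth], MAX_NAME, "%s", name);
    ctx->depth++;
}

/* Pop the innermost element. */
void on_end(void *userData, const char *name)
{
    struct parse_ctx *ctx = userData;
    (void)name;
    if (ctx->depth > 0)
        ctx->depth--;
}

/* Innermost open element, or NULL if none (or nested too deep to store). */
const char *current_element(const struct parse_ctx *ctx)
{
    if (ctx->depth < 1 || ctx->depth > MAX_DEPTH)
        return NULL;
    return ctx->names[ctx->depth - 1];
}

/* Element enclosing the innermost one, or NULL if there is none. */
const char *parent_element(const struct parse_ctx *ctx)
{
    if (ctx->depth < 2 || ctx->depth - 2 >= MAX_DEPTH)
        return NULL;
    return ctx->names[ctx->depth - 2];
}
```

With something like this, value_data_handler() could tell a <height> inside <size_1> from one inside <size_2> by checking parent_element() on the context it receives as userData.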

HTH

Cheers,
Lee

[1] For Expat to parse named entities (e.g. &eacute;), the entity must either be defined in a DTD declared in the input stream, or you will have to implement and register an ExternalEntityRefHandler.