[Expat-discuss] Parsing question from a newbie.

Lee Passey lee at novomail.net
Wed Mar 30 18:05:58 CEST 2011


On Tue, March 29, 2011 7:37 pm, Thomas, John wrote:

> Suppose that I have an XML snippet of the form:
>
> <myXML>
>   <size_1>
>     <height>8.5</height>
>     <width>11.0</width>
>   </size_1>
>   <size_2>
>     <height>8.0</height>
>     <width>10.0</width>
>   </size_2>
> </myXML>

[snip]

> If my understanding is correct, expat calls the value_data_handler()
> function when it has located the start and stop tags for an XML element.

Close, but not quite.

> First question: Why does expat give a length value of 1 when there is no
> text between <tokenA></tokenA>?

My hunch is that there /is/ text between the two tokens, you just don't see it.

One of the first things you need to remember about XML is that /all/ data is
significant, even white space. If your input is:

<tokenA>
</tokenA>

there would, in fact, be one (or maybe two) characters between the tags:
a newline (and perhaps a carriage return), and the Character Data Handler
would be called to handle it. If whitespace is insignificant in your
application it is up to you to deal with it.
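
If you decide whitespace-only chunks don't matter to you, a test along these
lines could be called at the top of your character data handler. This is only
a sketch; is_all_whitespace is a name I made up, not an Expat function. Note
that the buffer Expat hands you is not NUL-terminated, so the length has to
be honored.

```c
#include <ctype.h>

/* Returns 1 if the len bytes starting at s are all whitespace
   (or if len is 0), otherwise 0. The buffer is NOT assumed to be
   NUL-terminated, so only the first len bytes are examined. */
static int is_all_whitespace(const char *s, int len)
{
    int i;
    for (i = 0; i < len; i++) {
        /* Cast to unsigned char: isspace() is undefined for negative values. */
        if (!isspace((unsigned char)s[i]))
            return 0;
    }
    return 1;
}
```

A handler could simply return early when this reports 1, provided that
whitespace really is insignificant in your application.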

Essentially, /everything/ in your input that doesn't fall into one of the
other handler categories will end up being passed to the Character Data
Handler. It's up to the programmer to put this data together in a meaningful
way.

One important thing to know about the Character Data Handler is that there is
no guarantee that when it is called you will have all of the data that appears
between two elements; you need to be prepared to concatenate data if need be.
Common reasons why data is broken up are 1. the existence of newline
characters in the data stream, 2. encountering the end of the input buffer, 3.
encountering named or numeric entities.
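
A minimal sketch of that concatenation, assuming a fixed-size buffer is good
enough for your data. The struct and function names here are hypothetical.
The handler has the same shape as an Expat character data handler, with plain
char standing in for XML_Char so the fragment compiles without expat.h.

```c
#include <string.h>

/* Hypothetical context; a pointer to it would be passed as userData. */
struct text_ctx {
    char   text[1024];  /* accumulated character data, NUL-terminated */
    size_t used;        /* number of bytes currently stored in text */
};

/* Appends one chunk of character data to the buffer. Chunks that do not
   fit are truncated rather than overflowing the buffer. */
static void append_chars(void *userData, const char *s, int len)
{
    struct text_ctx *ctx = (struct text_ctx *)userData;
    size_t room, n;

    if (len <= 0)
        return;
    room = sizeof ctx->text - 1 - ctx->used;   /* keep one byte for '\0' */
    n = (size_t)len < room ? (size_t)len : room;
    memcpy(ctx->text + ctx->used, s, n);
    ctx->used += n;
    ctx->text[ctx->used] = '\0';
}
```

You would typically reset used to 0 when an element starts and look at the
accumulated text when the element ends.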

Consider the following XML snippet, where &#233; and &eacute; both represent
the acute 'e' (é):

<title>
  My r&#233;sum&eacute;
</title>

Expat would /probably/ parse this as follows:

Call StartElement, name = "title"
Call value_data_handler, len = 1, *buf_ptr = '\n'
Call value_data_handler, len = 6, *buf_ptr = "  My r"
Call value_data_handler, len = 2, *buf_ptr = utf-8 representation of 233
Call value_data_handler, len = 3, *buf_ptr = "sum"
Call value_data_handler, len = 2, *buf_ptr = utf-8 representation of 233 [1]
Call value_data_handler, len = 1, *buf_ptr = '\n'
Call EndElement, name = "title"


> I know that the userData structure COULD hold this information that I seek
> provided that I, the developer, put it there.  But I would have to know how
> and when to OBTAIN that contextual data in the first place. And I do not.
> So ....
>
> Second question: How (and when) do I obtain the context information for a
> call to value_data_handler()?  Does it exist in a convenient form and is
> there an expat-provided function call to get it?

The context information is provided to every callback through the "userData"
pointer, which you provided when you called XML_SetUserData(). In your
example, you passed the address of an integer to XML_SetUserData().
Thereafter, every Expat callback received a pointer to that integer as its
first parameter.

Now a pointer to an integer is not very useful, but you /can/ pass a pointer
to any data structure you want. For example, suppose you are writing an
application that will read an XML file and "pretty print" it by removing
unwanted spaces and adding other space. You could do something like this:

FILE *fout = fopen( "Pretty.xml", "w" );
XML_SetUserData( parser, fout );

/* Start-element handler: userData is the FILE * registered above. */
static void XMLCALL
startElement(void *userData, const char *name, const char **atts)
{
  FILE *fout = (FILE *) userData;
  fprintf( fout, "<%s", name );
  /* etc. */
}

I have written some routines to build and manipulate a DOM tree
(http://sourceforge.net/projects/domcapi/). I sometimes use Expat as the base
parser. In these cases I build a data structure to capture my state,
something like this:

struct
{
  Node currNode;
  Node parentNode;
  char charData[1024];
} xmlctx;

I then pass a pointer to this structure in the call to XML_SetUserData(), and
consequently this structure is available to every Expat callback. (The
same thing can be accomplished with global variables, but it's not as safe.)

The upshot of this is that if you need to know the current element name when
in your value_data_handler() callback it's up to /you/ to save that state in
some sort of data structure (probably a stack) when StartElement() is called.
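
One way such a stack might look, as a rough sketch only. All of the names
below (parse_ctx, on_start, on_end, current_element, and the two limits) are
made up for illustration. The handlers have the same shape as Expat's
start/end element handlers, with plain char standing in for XML_Char. I copy
the name instead of keeping the pointer, which is a conservative assumption
on my part.

```c
#include <string.h>

#define MAX_DEPTH 64   /* assumed limit on nesting depth tracked */
#define MAX_NAME  64   /* assumed limit on stored name length */

/* Hypothetical context; a pointer to it would be passed as userData. */
struct parse_ctx {
    char names[MAX_DEPTH][MAX_NAME];  /* stack of open element names */
    int  depth;                       /* number of currently open elements */
};

/* Start handler: push the element name (truncated if too long). */
static void on_start(void *userData, const char *name, const char **atts)
{
    struct parse_ctx *ctx = (struct parse_ctx *)userData;
    (void)atts;
    if (ctx->depth < MAX_DEPTH) {
        strncpy(ctx->names[ctx->depth], name, MAX_NAME - 1);
        ctx->names[ctx->depth][MAX_NAME - 1] = '\0';
    }
    ctx->depth++;   /* counted even when too deep, so pops stay balanced */
}

/* End handler: pop the most recently opened element. */
static void on_end(void *userData, const char *name)
{
    struct parse_ctx *ctx = (struct parse_ctx *)userData;
    (void)name;
    if (ctx->depth > 0)
        ctx->depth--;
}

/* Name of the innermost open element, or NULL if none is tracked. */
static const char *current_element(const struct parse_ctx *ctx)
{
    if (ctx->depth == 0 || ctx->depth > MAX_DEPTH)
        return NULL;
    return ctx->names[ctx->depth - 1];
}
```

Inside value_data_handler() you could then call current_element() on the
userData pointer to find out whether you are looking at, say, <height> or
<width>, and inspect the entry below it to tell size_1 from size_2.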

HTH

Cheers,
Lee

[1] For Expat to parse named entities (e.g. &eacute;) the entity must either
be defined in a DTD declared in the input stream, or you will have to
implement and register an ExternalEntityRefHandler.




