From pd at peterdavis.info Fri Mar 11 02:03:42 2011
From: pd at peterdavis.info (Peter Davis)
Date: Thu, 10 Mar 2011 20:03:42 -0500
Subject: [Expat-discuss] Preserve character references in character data?
Message-ID:

For a project I'm working on, I'm processing files which have been
"pretty" printed to use tabs, etc. to make the XML more readable.
However, most of the tabs and linefeeds can be completely ignored.
Unfortunately, not all of them can. Specifically, I want to be able to
act on tabs and linefeeds which are stored in the XML as character
references ( and ). It appears that Expat is automatically converting
these to the corresponding characters. Is there any way to suppress
that conversion?

From the documentation I've seen, the XML_SetDefaultHandler function
should suppress the conversion, but it's not working that way for me.
I'm getting a character data handler call with a single LF, with no
indication that it came from , and no calls to the default handler.

This seems like a common question, but I haven't been able to find
anything on it.

Thanks,
-pd

From jct at sharplabs.com Wed Mar 30 03:37:17 2011
From: jct at sharplabs.com (Thomas, John)
Date: Tue, 29 Mar 2011 18:37:17 -0700
Subject: [Expat-discuss] Parsing question from a newbie.
Message-ID:

Ladies and gentlemen,

I'm sorry for the dumb*** question, but I am an expat newbie. And, for
that matter, I am an XML newbie. I may make some silly assumptions in
my problem statement below. Please feel free to thrash me with my own
ignorance.

Suppose that I have an XML snippet of the form:

<myXML>
  <size_1>
    <height>8.5</height>
    <width>11.0</width>
  </size_1>
  <size_2>
    <height>8.0</height>
    <width>10.0</width>
  </size_2>
</myXML>

Note that there are redundant leaf nodes for both height and width. I
presume that this snippet is legal XML, despite this redundancy. (The
pair can, in theory, be disambiguated by the context; i.e. size_1 vs
size_2). Perhaps this is NOT legal XML, but I shall proceed with my
questions assuming that it is so.

Now suppose that I have the following application code (adapted
shamelessly from "elements.c").

static void XMLCALL
startElement(void *userData, const char *name, const char **atts)
{
}

void value_data_handler (void *userData, const char *buf_ptr, int len)
{
}

void default_handler (void *userData, const char *buf_ptr, int len)
{
}

static void XMLCALL
endElement(void *userData, const char *name)
{
}

int
main(int argc, char *argv[])
{
  char buf[BUFSIZ];
  XML_Parser parser = XML_ParserCreate(NULL);
  int done;
  int depth = 0;
  XML_SetUserData(parser, &depth);
  XML_SetElementHandler(parser, startElement, endElement);
  XML_SetCharacterDataHandler(parser, value_data_handler);
  XML_SetDefaultHandler(parser, default_handler);
  do {
    int len = (int)fread(buf, 1, sizeof(buf), stdin);
    done = len < sizeof(buf);
    if (XML_Parse(parser, buf, len, done) == XML_STATUS_ERROR) {
      fprintf(stderr,
              "%s at line %" XML_FMT_INT_MOD "u\n",
              XML_ErrorString(XML_GetErrorCode(parser)),
              XML_GetCurrentLineNumber(parser));
      return 1;
    }
  } while (!done);
  XML_ParserFree(parser);
  return 0;
}

If my understanding is correct, expat calls the value_data_handler()
function when it has located the start and stop tags for an XML
element. And the data handler function is written by me, the
developer, to parse element values (height and width in my example)
out of the XML.

In order for my code to disambiguate these redundant element names,
there must be some way to determine the context for each call. In
other words, I would expect one of the parameters to the
value_data_handler() call to be something equivalent to
"myXML:size_1:height" to distinguish this call from the one with the
context equivalent to "myXML:size_2:height".

Unless I am missing something, the only parameters that I get from
expat to this call are:

1) A pointer to my own "userData" data structure.
2) A pointer to the XML data buffer at the point corresponding to the
   value that is to be parsed.
3) The length of the data value text, before the element-ending markup
   token.

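One detail about parameters 2) and 3) in the list above: as far as I know, Expat does not NUL-terminate the text it passes to a character data handler, so the length is the only way to tell where the value ends. A minimal sketch of coping with that follows; chunk_to_cstr() is a hypothetical helper, not an Expat function, and the 256-byte buffer is an arbitrary choice for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Expat): copy a (pointer, length)
   chunk into dst as a NUL-terminated C string, truncating if needed.
   Returns the number of bytes copied, excluding the terminator. */
static size_t chunk_to_cstr(char *dst, size_t dstsz,
                            const char *buf_ptr, int len)
{
    assert(dstsz > 0 && len >= 0);
    size_t n = (size_t)len < dstsz - 1 ? (size_t)len : dstsz - 1;
    memcpy(dst, buf_ptr, n);
    dst[n] = '\0';
    return n;
}

/* Same parameter list as value_data_handler() in the message above.
   buf_ptr points into the parser's buffer, so the bytes after
   buf_ptr[len - 1] are whatever markup happens to follow. */
void value_data_handler(void *userData, const char *buf_ptr, int len)
{
    char text[256];   /* arbitrary size, chosen for this sketch */
    (void)userData;
    chunk_to_cstr(text, sizeof text, buf_ptr, len);
    printf("chardata len=%d text=[%s]\n", len, text);
}
```

Printing with "%.*s" and the length would avoid the copy altogether; the copy is only useful when the text has to be handed to code that expects a C string.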
First question: Why does expat give a length value of 1 when there is
no text between <\tokenA>?

I know that the userData structure COULD hold this information that I
seek provided that I, the developer, put it there. But I would have to
know how and when to OBTAIN that contextual data in the first place.
And I do not. So ....

Second question: How (and when) do I obtain the context information
for a call to value_data_handler()? Does it exist in a convenient form
and is there an expat-provided function call to get it?

And finally,

Third question: If I am looking at this problem "all wrong", what is
the "right way" to look at it?

Thank you for your patience and (in advance) for your help.

John C Thomas
Sharp Laboratories of America
jct at sharplabs.com

From robert.bielik at xponaut.se Wed Mar 30 06:45:15 2011
From: robert.bielik at xponaut.se (Robert Bielik)
Date: Wed, 30 Mar 2011 06:45:15 +0200
Subject: [Expat-discuss] Parsing question from a newbie.
In-Reply-To:
References:
Message-ID: <4D92B55B.9070807@xponaut.se>

Thomas, John wrote 2011-03-30 03:37:
> Suppose that I have an XML snippet of the form:
>
> <myXML>
>   <size_1>
>     <height>8.5</height>
>     <width>11.0</width>
>   </size_1>
>   <size_2>
>     <height>8.0</height>
>     <width>10.0</width>
>   </size_2>
> </myXML>

First off, perfectly valid XML. There are no redundant elements.

> Second question: How (and when) do I obtain the context information for
> a call to value_data_handler()? Does it exist in a convenient form and
> is there an expat-provided function call to get it?

You'll have to keep track of the context yourself by using the call
sequence from Expat. It is after all SAX (Simple API for XML) :)

You'll get called as follows:

startElement: "myXML"
startElement: "size_1"
startElement: "height"
valueData: "8.5"
endElement: "height"
startElement: "size"
valueData: "11.0"
endElement: "size"
endElement: "size_1"
startElement: "size_2"
startElement: "height"
valueData: "8.0"
endElement: "height"
startElement: "size"
valueData: "10.0"
endElement: "size"
endElement: "size_2"
endElement: "myXML"

Hope that helps.
Regards
/Rob

From lee at novomail.net Wed Mar 30 18:05:58 2011
From: lee at novomail.net (Lee Passey)
Date: Wed, 30 Mar 2011 10:05:58 -0600 (MDT)
Subject: [Expat-discuss] Parsing question from a newbie.
Message-ID: <48172.209.20.119.245.1301501158.squirrel@www.dysfunctionals.org>

On Tue, March 29, 2011 7:37 pm, Thomas, John wrote:

> Suppose that I have an XML snippet of the form:
>
> <myXML>
>   <size_1>
>     <height>8.5</height>
>     <width>11.0</width>
>   </size_1>
>   <size_2>
>     <height>8.0</height>
>     <width>10.0</width>
>   </size_2>
> </myXML>

[snip]

> If my understanding is correct, expat calls the value_data_handler()
> function when it has located the start and stop tags for an XML
> element.

Close, but not quite.

> First question: Why does expat give a length value of 1 when there is
> no text between <\tokenA>?

My hunch is that there /is/ text between the two tokens, you just
don't see it. One of the first things you need to remember about XML
is that /all/ data is significant, even white space. If your input is:
there would, in fact, be a single (or maybe two) characters between
the tags: a newline (and perhaps a carriage return), and the Character
Data Handler would be called to handle it. If whitespace is
insignificant in your application it is up to you to deal with it.

Essentially, /everything/ in your input that doesn't fall into one of
the other handler categories will end up being passed to the Character
Data Handler. It's up to the programmer to put this data together in a
meaningful way.

One important thing to know about the Character Data Handler is that
there is no guarantee that when it is called you will have all of the
data that appears between two elements; you need to be prepared to
concatenate data if need be. Common reasons why data is broken up are

1. the existence of newline characters in the data stream,
2. encountering the end of the input buffer,
3. encountering named or numeric entities.

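The point above about concatenating split character data can be sketched roughly as follows. This is only an illustration under stated assumptions: struct text_buf, append_text() and reset_text() are hypothetical names that do not come from Expat, and the handler signature assumes a default build in which XML_Char is plain char.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical accumulator, not part of Expat. */
struct text_buf {
    char  *data;  /* NUL-terminated accumulated text, or NULL */
    size_t len;   /* bytes used, excluding the terminator */
    size_t cap;   /* bytes allocated */
};

/* Same parameter list as a character data handler: append one chunk.
   Expat may deliver the text of a single element in several calls. */
static void append_text(void *userData, const char *s, int len)
{
    struct text_buf *tb = userData;
    assert(len >= 0);
    size_t need = tb->len + (size_t)len + 1;
    if (need > tb->cap) {
        size_t cap = tb->cap ? tb->cap : 64;
        while (cap < need)
            cap *= 2;
        char *p = realloc(tb->data, cap);
        if (p == NULL)
            return;               /* out of memory: drop this chunk */
        tb->data = p;
        tb->cap = cap;
    }
    memcpy(tb->data + tb->len, s, (size_t)len);
    tb->len += (size_t)len;
    tb->data[tb->len] = '\0';
}

/* Call when a new element starts, to begin a fresh run of text. */
static void reset_text(struct text_buf *tb)
{
    tb->len = 0;
    if (tb->data != NULL)
        tb->data[0] = '\0';
}
```

With a real parser the buffer would be registered through XML_SetUserData() and the function through XML_SetCharacterDataHandler(); the accumulated text is then complete only once the end-element handler runs. On builds where XMLCALL expands to a calling convention, the handler would need that macro as well.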
Consider the following XML snippet, where é and é both represent the
acute 'e' (é):

 My résumé

Expat would /probably/ parse this as follows:

Call StartElement, name = "title"
Call value_data_handler, len = 1, *buf_ptr = '\n'
Call value_data_handler, len = 6, *buf_ptr = " My r"
Call value_data_handler, len = 2, *buf_ptr = utf-8 representation of 233
Call value_data_handler, len = 3, *buf_ptr = "sum"
Call value_data_handler, len = 2, *buf_ptr = utf-8 representation of 233 [1]
Call value_data_handler, len = 1, *buf_ptr = '\n'
Call EndElement, name = "title"

> I know that the userData structure COULD hold this information that I
> seek provided that I, the developer, put it there. But I would have
> to know how and when to OBTAIN that contextual data in the first
> place. And I do not. So ....
>
> Second question: How (and when) do I obtain the context information
> for a call to value_data_handler()? Does it exist in a convenient
> form and is there an expat-provided function call to get it?

The context information is provided to every call through the
"userData" pointer, which you provided when you called SetUserData().
In your example, you passed the address of an integer to
SetUserData(). Thereafter, in every Expat call a pointer to that
integer was passed in as the first parameter. Now a pointer to an
integer is not very useful, but you /can/ pass a pointer to any data
structure you want.

For example, suppose you are writing an application that will read an
XML file and "pretty print" it by removing unwanted spaces and adding
other space. You could do something like this:

FILE *fout = fopen( "Pretty.xml", "w" );
SetUserData( parser, fout );

startElement(void *userData, const char *name, const char **atts)
{
    FILE *fout = (FILE *) userData;
    fprintf( fout, "<%s", name );
    etc.
}

I have written some routines to build and manipulate a DOM tree
(http://sourceforge.net/projects/domcapi/). I sometimes use Expat as
the base parser.
In these cases I build a data structure to capture my state, something
like this:

struct {
    Node currNode;
    Node parentNode;
    char charData[1024];
} xmlctx;

I then pass a pointer to this structure in the call to SetUserData and
consequently this structure is now available to every Expat callback.
(The same thing can be accomplished with global variables, but it's
not as safe).

The upshot of this is that if you need to know the current element
name when in your value_data_handler() callback it's up to /you/ to
save that state in some sort of data structure (probably a stack) when
StartElement() is called.

HTH

Cheers,
Lee

[1] For Expat to parse named entities (e.g. é) the entity must either
be defined in a DTD declared in the input stream, or you will have to
implement and register an ExternalEntityRefHandler.

From Guillaume.Drevon at lumension.com Wed Mar 30 18:08:31 2011
From: Guillaume.Drevon at lumension.com (Guillaume Drevon)
Date: Wed, 30 Mar 2011 17:08:31 +0100
Subject: [Expat-discuss] Expat loading the entire file in memory?
Message-ID: <750C20E09BB8BF458D0AC0A0CE22B98D27D709FEE8@IE-ML-01v.patchlink.com>

Greetings,

I am trying to use Expat to parse a 500 MB file. I read the file in
chunks (of 1000 bytes) and send them to the parser. However, the
memory used by my application goes up to 550 MB. The memory is held
entirely by Expat, since calling XmlParserFree releases almost all of
it. I see that the memory is mostly used by the m_buffer member of the
parser.

Am I doing something wrong? Isn't low memory usage the whole point of
a SAX-like parser?

From jct at sharplabs.com Wed Mar 30 20:38:53 2011
From: jct at sharplabs.com (Thomas, John)
Date: Wed, 30 Mar 2011 11:38:53 -0700
Subject: [Expat-discuss] Parsing question from a newbie.
In-Reply-To: <4D92B55B.9070807@xponaut.se>
References: <4D92B55B.9070807@xponaut.se>
Message-ID:

Rob,

Thanks for this info. So, apparently:

1) I must construct and maintain my own context (i.e.
   expat does not provide global context).
2) The "right" place to DO that might be in the startElement() and
   endElement() routines.
3) The "right" place to STORE this context would be in my userData
   structure (which I should expand from the simple pointer to depth
   that is used in the elements.c sample source code.)
4) In your explanation below when you showed:
       startElement: "size"
       valueData: "11.0"
       endElement: "size"
   I think you may have meant:
       startElement: "width"
       valueData: "11.0"
       endElement: "width"
   Would that be correct?

Thanks millions.

John
jct at sharplabs.com

From jct at sharplabs.com Wed Mar 30 21:00:50 2011
From: jct at sharplabs.com (Thomas, John)
Date: Wed, 30 Mar 2011 12:00:50 -0700
Subject: [Expat-discuss] Parsing question from a newbie.
In-Reply-To: <58720.209.20.119.245.1301501068.squirrel@www.dysfunctionals.org>
References: <58720.209.20.119.245.1301501068.squirrel@www.dysfunctionals.org>
Message-ID:

Lee,

Thank you for your help.

Indeed, I am seeing the between XML tags. That accounts for all the
curious cases of len=1; I had observed that the tab and space
characters were counted. I was ignoring line terminators.

I will take whitespace and non-ASCII characters into account in my
handler.

I understand your caution about "split buffers" and the possible need
to concatenate/combine calls to value_data_handler().

Thanks again for your help.

John
jct at sharplabs.com

From robert.bielik at xponaut.se Thu Mar 31 06:30:17 2011
From: robert.bielik at xponaut.se (Robert Bielik)
Date: Thu, 31 Mar 2011 06:30:17 +0200
Subject: [Expat-discuss] Parsing question from a newbie.
In-Reply-To:
References: <4D92B55B.9070807@xponaut.se>
Message-ID: <4D940359.1030802@xponaut.se>

Thomas, John wrote 2011-03-30 20:38:
> Rob,
>
> Thanks for this info.
> So, apparently:
>
> 1) I must construct and maintain my own context (i.e. expat does not
>    provide global context).

Yup.

> 2) The "right" place to DO that might be in the startElement() and
>    endElement() routines.

In startElement I think, since it's the place where you get the
element name first.

> 3) The "right" place to STORE this context would be in my userData
>    structure
>    (which I should expanded from the simple pointer to depth that
>    is used in the elements.c sample source code.)

Yes.

> 4) In your explanation below when you showed:
>        startElement: "size"
>        valueData: "11.0"
>        endElement: "size"
>    I think you may have meant:
>        startElement: "width"
>        valueData: "11.0"
>        endElement: "width"
>    Would that be correct?

Yes :)

Regards
/Rob

From robert.bielik at xponaut.se Thu Mar 31 06:35:36 2011
From: robert.bielik at xponaut.se (Robert Bielik)
Date: Thu, 31 Mar 2011 06:35:36 +0200
Subject: [Expat-discuss] Parsing question from a newbie.
In-Reply-To: <4D940359.1030802@xponaut.se>
References: <4D92B55B.9070807@xponaut.se> <4D940359.1030802@xponaut.se>
Message-ID: <4D940498.6070800@xponaut.se>

Robert Bielik wrote 2011-03-31 06:30:
>> 2) The "right" place to DO that might be in the startElement() and
>>    endElement() routines.
>
> In startElement I think, since its the place where you get the
> element name first.

Also, the best way to do it is to have a stack in which you push the
element name in startElement, and pop the top of the stack in
endElement. Then you'll always have the correct context when dealing
with character data.

/Rob

From aleix at member.fsf.org Thu Mar 31 08:37:36 2011
From: aleix at member.fsf.org (Aleix Conchillo Flaqué)
Date: Thu, 31 Mar 2011 08:37:36 +0200
Subject: [Expat-discuss] Parsing question from a newbie.
In-Reply-To: <4D940498.6070800@xponaut.se>
References: <4D92B55B.9070807@xponaut.se> <4D940359.1030802@xponaut.se> <4D940498.6070800@xponaut.se>
Message-ID:

On 31/03/2011 06:35, "Robert Bielik" wrote:
>
> Also, the best way to do it is to have a stack in which you push the
> element name in startElement, and pop the top of the stack in
> endElement. Then you'll always have the correct context when dealing
> with character data.
>

You can also use one of the available Expat wrappers that will do all
this job for you, and more.

Aleix
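
For readers who would rather not pull in a wrapper, the element-name stack described earlier in the thread might look something like the sketch below. Everything here is an assumption made for illustration: struct ctx, push_element(), pop_element(), current_path() and the two size limits are hypothetical, and the handlers are written with plain char and without XMLCALL, which matches a default Expat build on most platforms.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 32   /* arbitrary limits for this sketch */
#define MAX_NAME  64

/* Hypothetical context structure, to be passed via XML_SetUserData(). */
struct ctx {
    char names[MAX_DEPTH][MAX_NAME];  /* copies of open element names */
    int  depth;                       /* number of open elements */
};

/* Start-element handler shape: push a copy of the element name. */
static void push_element(void *userData, const char *name,
                         const char **atts)
{
    struct ctx *c = userData;
    (void)atts;
    if (c->depth < MAX_DEPTH)
        snprintf(c->names[c->depth], MAX_NAME, "%s", name);
    c->depth++;   /* still counted when too deep, so pops stay balanced */
}

/* End-element handler shape: pop the innermost element. */
static void pop_element(void *userData, const char *name)
{
    struct ctx *c = userData;
    (void)name;
    assert(c->depth > 0);
    c->depth--;
}

/* Build a path such as "myXML:size_1:height" for the current position;
   a character data handler can call this to learn its context. */
static void current_path(const struct ctx *c, char *out, size_t outsz)
{
    size_t used = 0;
    if (outsz == 0)
        return;
    out[0] = '\0';
    for (int i = 0; i < c->depth && i < MAX_DEPTH; i++) {
        int n = snprintf(out + used, outsz - used, "%s%s",
                         i ? ":" : "", c->names[i]);
        if (n < 0 || (size_t)n >= outsz - used)
            break;                     /* output truncated: stop here */
        used += (size_t)n;
    }
}
```

The names are copied rather than stored as pointers because the sketch does not assume the strings Expat passes in remain valid after the handler returns. The pair push_element/pop_element would be registered with XML_SetElementHandler().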