From 11mjazbdg02 at sneakemail.com  Mon Oct  1 10:13:43 2007
From: 11mjazbdg02 at sneakemail.com (Mark)
Date: 1 Oct 2007 08:13:43 -0000
Subject: [Expat-discuss] Re; XPATH for expat?
Message-ID: <20829-42961@sneakemail.com>

Hi Nick,

Thanks for your post. You're right - I just seek a simple solution.
Your stack idea sounds just the thing.

Cheers,
Mark :-)

From ali at mental.com  Mon Oct 22 15:52:36 2007
From: ali at mental.com (Albrecht Fritzsche)
Date: Mon, 22 Oct 2007 15:52:36 +0200
Subject: [Expat-discuss] isFinal parameter in XML_Parser() and
	XML_ParseBuffer()
Message-ID: <471CAB24.1030704@mental.com>

Hi,

I cannot find much documentation about this isFinal parameter in calls
like XML_Parse() and XML_ParseBuffer(). Is it vital that it is passed,
or can I get away w/o using it?

My parsing loop looks like

    const size_t BUFFSIZE = 8192;
    char buffer[BUFFSIZE];

    bool done = false;
    while (!done) {
        int size = m_reader->read(buffer, BUFFSIZE);
        if (size == 0) {
            // (Debug, "End of file reached");    //<--
            break;
        }
        else if (size < BUFFSIZE) {
            done = true;
        }

        if (XML_Parse(parser, buffer, int(size), done) == 0) {
            // (Warning, "Parse error");
            return false;
        }
    }

So, if the file size is exactly equal to BUFFSIZE then XML_Parse() gets
called only once; the second iteration of the while loop terminates
because the reader now returns 0.

Do I have to add in line //<-- a

    XML_Parse(parser, buffer, 0, true);

just for the sake of passing the EndOfFile indication?

Thanks
Ali

From omar.crea at jusan.it  Mon Oct 22 17:39:12 2007
From: omar.crea at jusan.it (omar.crea at jusan.it)
Date: Mon, 22 Oct 2007 17:39:12 +0200
Subject: [Expat-discuss] Fwd: Accented characters
Message-ID: <1193067552.471cc4205c93d@arrakis.kinetikon.it>

----- Forwarded message from omar.crea at jusan.it -----
    Date: Fri, 19 Oct 2007 16:08:17 +0200
    From: omar.crea at jusan.it
Reply-To: omar.crea at jusan.it
 Subject: Accented characters
      To: libexpat ML

Hi everybody.
I have a problem with accented characters: I obtain "realt?" when
parsing the content "realtà", for example.
I also noted that if I try to parse content like this:
"Showroom Devon & Devon", I obtain as output "Devonoom Devon" (the text
after & overrides the previous one).
I use Expat 2.0.1 on a Gentoo distro, but it gives me the same errors on
an embedded ARM system too.
I create the parser and then set the encoding this way:

    p = XML_ParserCreate(NULL);
    XML_SetEncoding(p, (const XML_Char *)"iso-8859-1");

Has anyone a solution? Thanks in advance.

Omar

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.

----- End forwarded message -----

From crazybob at crazybob.org  Mon Oct 22 21:22:02 2007
From: crazybob at crazybob.org (Bob Lee)
Date: Mon, 22 Oct 2007 12:22:02 -0700
Subject: [Expat-discuss] isFinal parameter in XML_Parser() and
	XML_ParseBuffer()
In-Reply-To: <471CAB24.1030704@mental.com>
References: <471CAB24.1030704@mental.com>
Message-ID:

Passing isFinal isn't required, but passing true enables you to get an
error from Expat if the XML is incomplete.

Bob

On 10/22/07, Albrecht Fritzsche wrote:
>
> Hi,
>
> I cannot find much documentation about this isFinal parameter in calls
> like XML_Parse() and XML_ParseBuffer(). Is it vital that it is passed,
> or can I get away w/o using it?
>
> My parsing loop looks like
>
>     const size_t BUFFSIZE = 8192;
>     char buffer[BUFFSIZE];
>
>     bool done = false;
>     while (!done) {
>         int size = m_reader->read(buffer, BUFFSIZE);
>         if (size == 0) {
>             // (Debug, "End of file reached");    //<--
>             break;
>         }
>         else if (size < BUFFSIZE) {
>             done = true;
>         }
>
>         if (XML_Parse(parser, buffer, int(size), done) == 0) {
>             // (Warning, "Parse error");
>             return false;
>         }
>     }
>
> So, if the file size is exactly equal to BUFFSIZE then XML_Parse() gets
> called only once; the second iteration of the while loop terminates
> because the reader now returns 0.
>
> Do I have to add in line //<-- a
>
>     XML_Parse(parser, buffer, 0, true);
>
> just for the sake of passing the EndOfFile indication?
>
> Thanks
> Ali
>
> _______________________________________________
> Expat-discuss mailing list
> Expat-discuss at libexpat.org
> http://mail.libexpat.org/mailman/listinfo/expat-discuss
>

From m_biswas at mailinator.com  Fri Oct 26 18:50:12 2007
From: m_biswas at mailinator.com (Mohun Biswas)
Date: Fri, 26 Oct 2007 12:50:12 -0400
Subject: [Expat-discuss] expat parsing destructively?
Message-ID:

I want to do something with expat which seems elegant to me but I need
to find out if it makes sense, if there's a way to do it, and if anyone
has prior experience or example code.

Imagine I have a complete XML document in a memory buffer and pass the
starting address of the buffer to expat for parsing. Some of the
attributes will need to be added into a hash table or similar in-memory
data structure. In the way I've seen expat used that would mean calling
malloc/strdup on the values passed into the callback, putting them in
the table, then arranging to free them later. But, given that the entire
document is already stored in a malloc-ed buffer for which I have no
further use once parsed, I've been thinking how great it would be if
expat would work "destructively", which in this case means writing a
null byte at the back of each attribute value.
This would be safe as it would always replace the trailing quote.

If it would do that I could just put the data in the table as is and
cleanup would be as simple as freeing the buffer. No debugging of memory
leaks or bad frees, no heap fragmentation, simpler and faster code. For
all I know expat already works like this but I can't find any
documentation which discusses it. Does anyone know or have a pointer to
related documentation?

Thanks,
MB

From nickmacd at gmail.com  Sun Oct 28 00:01:02 2007
From: nickmacd at gmail.com (Nick MacDonald)
Date: Sat, 27 Oct 2007 18:01:02 -0400
Subject: [Expat-discuss] expat parsing destructively?
In-Reply-To:
References:
Message-ID:

Well.. it may not be exactly what you want.. but I think I have a
really easy way to get a similar result for you:

Simply allocate 2 copies of the same data... i.e. read in the data to
one buffer, malloc() another buffer the same size, and do a huge
memcpy() to make a duplicate copy. Now let eXpat scan one copy, and
doing some quick and dirty pointer math, you can now find
corresponding locations in the "non-eXpat" buffer and destroy it to
your heart's content. When you're all done, simply free both buffers.
It's a little more memory intensive.. but sounds fairly elegant to me,
and you don't need to worry about doing anything unsupported by eXpat.

Good luck,
  Nick

On 10/26/07, Mohun Biswas wrote:
> I want to do something with expat which seems elegant to me but I need
> to find out if it makes sense, if there's a way to do it, and if anyone
> has prior experience or example code.
>
> Imagine I have a complete XML document in a memory buffer and pass the
> starting address of the buffer to expat for parsing. Some of the
> attributes will need to be added into a hash table or similar in-memory
> data structure. In the way I've seen expat used that would mean calling
> malloc/strdup on the values passed into the callback, putting them in
> the table, then arranging to free them later.
> But, given that the entire document is already stored in a malloc-ed
> buffer for which I have no further use once parsed, I've been thinking
> how great it would be if expat would work "destructively", which in
> this case means writing a null byte at the back of each attribute
> value. This would be safe as it would always replace the trailing
> quote.
>
> If it would do that I could just put the data in the table as is and
> cleanup would be as simple as freeing the buffer. No debugging of memory
> leaks or bad frees, no heap fragmentation, simpler and faster code. For
> all I know expat already works like this but I can't find any
> documentation which discusses it. Does anyone know or have a pointer to
> related documentation?

From 11mjazbdg02 at sneakemail.com  Mon Oct 29 11:22:46 2007
From: 11mjazbdg02 at sneakemail.com (Mark)
Date: 29 Oct 2007 10:22:46 -0000
Subject: [Expat-discuss] expat parsing destructively?
Message-ID: <13866-07183@sneakemail.com>

I think it is extremely unlikely this will work. You are assuming that
expat will always return a pointer to the data contained within the
original memory buffer and not in any internal buffers.

---- Original Message ----
Simply allocate 2 copies of the same data... i.e. read in the data to
one buffer, malloc() another buffer the same size, and do a huge
memcpy() to make a duplicate copy. Now let eXpat scan one copy, and
doing some quick and dirty pointer math, you can now find
corresponding locations in the "non-eXpat" buffer and destroy it to
your heart's content. When you're all done, simply free both buffers.
It's a little more memory intensive.. but sounds fairly elegant to me,
and you don't need to worry about doing anything unsupported by eXpat.
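
The assumption Mark questions can be tested rather than taken on faith.
What follows is only a minimal sketch: ptr_in_buffer() is a hypothetical
helper, not an Expat function, and it assumes the caller still knows the
base address and length of the buffer it handed to the parser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Expat): returns 1 if the p_len bytes
 * starting at p lie entirely inside buf[0 .. buf_len), 0 otherwise.
 * Addresses are compared as integers, because relational comparison of
 * pointers into different objects is undefined in C. */
int ptr_in_buffer(const char *buf, size_t buf_len,
                  const char *p, size_t p_len)
{
    uintptr_t lo, hi, start;

    assert(buf != NULL && p != NULL);
    lo = (uintptr_t)buf;
    hi = lo + buf_len;
    start = (uintptr_t)p;

    if (start < lo || start > hi)
        return 0;
    return p_len <= (size_t)(hi - start);
}
```

Called from a handler on each string it receives, a result of 0 would
mean the string lives in memory the parser manages itself, so an offset
computed from it says nothing about the second copy of the document.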

--------------------------------------
Protect yourself from spam,
use http://sneakemail.com

From Mukesh.S at mphasis.com  Mon Oct 29 11:27:20 2007
From: Mukesh.S at mphasis.com (Mukesh Kumar)
Date: Mon, 29 Oct 2007 15:57:20 +0530
Subject: [Expat-discuss] expat parsing destructively?
References: <13866-07183@sneakemail.com>
Message-ID: <7C83A8A6B56D3A478333B1DF47E18586072422@MPBABGEX01.corp.mphasis.com>

Hi Mark,

It only takes one more heap allocation... :) Basically it is used as a
temporary buffer, and later on he calls free. Basically, similar logic
works for me too... it is just a small trick... :)

Regards,
-Mukesh Kumar
Sr. Software Engineer,
India, Bangalore.

________________________________

From: expat-discuss-bounces+mukesh.s=mphasis.com at libexpat.org on behalf of Mark
Sent: Mon 10/29/2007 3:52 PM
To: expat-discuss at libexpat.org
Subject: Re: [Expat-discuss] expat parsing destructively?

I think it is extremely unlikely this will work. You are assuming that
expat will always return a pointer to the data contained within the
original memory buffer and not in any internal buffers.

---- Original Message ----
Simply allocate 2 copies of the same data... i.e. read in the data to
one buffer, malloc() another buffer the same size, and do a huge
memcpy() to make a duplicate copy. Now let eXpat scan one copy, and
doing some quick and dirty pointer math, you can now find
corresponding locations in the "non-eXpat" buffer and destroy it to
your heart's content. When you're all done, simply free both buffers.
It's a little more memory intensive.. but sounds fairly elegant to me,
and you don't need to worry about doing anything unsupported by eXpat.

From crackeur at comcast.net  Mon Oct 29 09:21:14 2007
From: crackeur at comcast.net (jimmy Zhang)
Date: Mon, 29 Oct 2007 01:21:14 -0700
Subject: [Expat-discuss] [ANN]VTD-XML 2.2
Message-ID: <00ea01c81a04$ac0a0bb0$0402a8c0@your55e5f9e3d2>

I am proud to announce the release of version 2.2 of VTD-XML, the next
generation XML parsers/indexer/slicer/editor. This release significantly
expands VTD-XML's ability to slice, split, edit and incrementally update
the XML documents. To this end, we introduce the concept of
namespace-compensated element fragment. This release also adds VTD+XML
index writing capability to the VTD Navigator class. Other enhancements
in this release include index size pre-computation, support for HTTP
get, and some bug fixes.

To download the latest release, please go to
http://sourceforge.net/project/showfiles.php?group_id=110612.

From omar.crea at jusan.it  Fri Oct 19 16:08:17 2007
From: omar.crea at jusan.it (omar.crea at jusan.it)
Date: Fri, 19 Oct 2007 16:08:17 +0200
Subject: [Expat-discuss] Accented characters
Message-ID: <1192802897.4718ba516af79@arrakis.kinetikon.it>

Hi everybody.
I have a problem with accented characters: I obtain "realt?" when
parsing the content "realtà", for example.
I also noted that if I try to parse content like this:
"Showroom Devon & Devon", I obtain as output "Devonoom Devon" (the text
after & overrides the previous one).
I use Expat 2.0.1 on a Gentoo distro, but it gives me the same errors on
an embedded ARM system too.
I create the parser and then set the encoding this way:

    p = XML_ParserCreate(NULL);
    XML_SetEncoding(p, (const XML_Char *)"iso-8859-1");

Has anyone a solution? Thanks in advance.
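
One plausible reading of the "Devonoom Devon" output (an assumption,
since the handler code was not posted) is that the character data
handler copies each chunk to the start of the same buffer. Expat may
deliver one run of text in several calls, and an entity reference such
as the ampersand is a typical place for a split. A minimal sketch of a
handler that appends instead, where struct text_buf and append_text()
are hypothetical names and XML_Char is assumed to be plain char (the
default build):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical accumulator for the text of one element. */
struct text_buf {
    char  *data;   /* NUL-terminated text collected so far, or NULL */
    size_t len;    /* bytes used, not counting the terminator */
};

/* Same shape as an Expat character data handler:
 * (void *userData, const XML_Char *s, int len).
 * The chunk s is NOT NUL-terminated, and one run of text may arrive as
 * several chunks, so each chunk is appended to what is already there. */
void append_text(void *user_data, const char *s, int len)
{
    struct text_buf *tb = user_data;
    char *grown;

    assert(tb != NULL && len >= 0);
    grown = realloc(tb->data, tb->len + (size_t)len + 1);
    if (grown == NULL)
        return;                 /* out of memory: keep what we have */
    memcpy(grown + tb->len, s, (size_t)len);
    tb->len += (size_t)len;
    grown[tb->len] = '\0';
    tb->data = grown;
}
```

It would be registered with XML_SetCharacterDataHandler(p, append_text)
after XML_SetUserData(p, &tb), and the buffer would be consumed and
reset in the end-element handler.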

Omar

From nikoudel at gmail.com  Mon Oct 29 11:43:48 2007
From: nikoudel at gmail.com (Nikolai Koudelia)
Date: Mon, 29 Oct 2007 12:43:48 +0200
Subject: [Expat-discuss] parsing error
Message-ID:

Hi!

I am trying to parse an "xml" document with Python and Expat. I need to
scan through the XML and collect the values which match a pattern.

Example:

pattern: GROUP2

With the pattern above I need to fetch "asdf" and "qwerty" from the
material below:
qwerty
The problem is that the material may not be correct. It may look like this:
qwerty
fgh 16
When the expat parser reaches , it throws an exception and stops
parsing. Is there a way to handle a situation like that? Some option
telling expat to skip broken closing tags? Or should I repair the
material before parsing? The last one could be quite tricky, because
expat could not be used for that... Any ideas?

-NK

From 11mjazbdg02 at sneakemail.com  Mon Oct 29 13:16:38 2007
From: 11mjazbdg02 at sneakemail.com (Mark)
Date: 29 Oct 2007 12:16:38 -0000
Subject: [Expat-discuss] Accented characters
Message-ID: <19421-14291@sneakemail.com>

Hi Omar,

AFAIK Expat will always return character data in UTF-8 encoding no
matter the character set of the original document. Look at iconv() to
convert back to ISO-8859-1.

---- Original Message ----
Hi everybody.
I have a problem with accented characters: I obtain "realt?" when
parsing the content "realtà", for example.
I also noted that if I try to parse content like this:
"Showroom Devon & Devon", I obtain as output "Devonoom Devon" (the text
after & overrides the previous one).
I use Expat 2.0.1 on a Gentoo distro, but it gives me the same errors on
an embedded ARM system too.
I create the parser and then set the encoding this way:

    p = XML_ParserCreate(NULL);
    XML_SetEncoding(p, (const XML_Char *)"iso-8859-1");

Has anyone a solution? Thanks in advance.

From fdrake at acm.org  Mon Oct 29 13:19:57 2007
From: fdrake at acm.org (Fred Drake)
Date: Mon, 29 Oct 2007 08:19:57 -0400
Subject: [Expat-discuss] parsing error
In-Reply-To:
References:
Message-ID:

On Oct 29, 2007, at 6:43 AM, Nikolai Koudelia wrote:
> The problem is that the material may not be correct. It may look
> like this:
...
> When expat parser reaches , it throws an exception and
> stops parsing. Is there a way to handle situation like that? Some
> option telling expat to skip broken closing tags? Or should I repair
> the material before parsing? Last one could be quite tricky, because
> expat could not be used for that... Any ideas?

XML parsers aren't forgiving the way HTML parsers should be, and
that's a specific goal. If you're interested in tolerating any ol'
HTML, use an HTML parser. There are a number of those available in
Python as well (htmllib, BeautifulSoup, lxml.html).

  -Fred

--
Fred Drake

From nikoudel at gmail.com  Mon Oct 29 13:42:13 2007
From: nikoudel at gmail.com (Nikolai Koudelia)
Date: Mon, 29 Oct 2007 14:42:13 +0200
Subject: [Expat-discuss] parsing error
In-Reply-To:
References:
Message-ID:

That makes sense :)

Thanks for a quick reply!

2007/10/29, Fred Drake :
> On Oct 29, 2007, at 6:43 AM, Nikolai Koudelia wrote:
> > The problem is that the material may not be correct. It may look
> > like this:
> ...
> > When expat parser reaches , it throws an exception and
> > stops parsing. Is there a way to handle situation like that? Some
> > option telling expat to skip broken closing tags? Or should I repair
> > the material before parsing? Last one could be quite tricky, because
> > expat could not be used for that... Any ideas?
>
> XML parsers aren't forgiving the way HTML parsers should be, and
> that's a specific goal. If you're interested in tolerating any ol'
> HTML, use an HTML parser. There are a number of those available in
> Python as well (htmllib, BeautifulSoup, lxml.html).
>
>
>   -Fred
>
> --
> Fred Drake
>
>
>
>

From nickmacd at gmail.com  Mon Oct 29 14:53:24 2007
From: nickmacd at gmail.com (Nick MacDonald)
Date: Mon, 29 Oct 2007 09:53:24 -0400
Subject: [Expat-discuss] expat parsing destructively?
In-Reply-To: <13866-07183@sneakemail.com>
References: <13866-07183@sneakemail.com>
Message-ID:

Mark:

You are of course absolutely correct in that regard... and I don't
know what kind of internal memory management eXpat performs, so this
may well be a very bad idea. I do know that, in many cases, eXpat
seems to return to you a copy of the same data it is parsing... I just
don't know if it does that in all cases.
You could of course detect when the returned data did not reside within
your supplied buffer; however, it's not quite clear to me what action
you might take in such a case.

  Nick

On 29 Oct 2007 10:22:46 -0000, Mark <11mjazbdg02 at sneakemail.com> wrote:
> I think it is extremely unlikely this will work. You are assuming that
> expat will always return a pointer to the data contained within the
> original memory buffer and not in any internal buffers.
>
> ---- Original Message ----
> Simply allocate 2 copies of the same data... i.e. read in the data to
> one buffer, malloc() another buffer the same size, and do a huge
> memcpy() to make a duplicate copy. Now let eXpat scan one copy, and
> doing some quick and dirty pointer math, you can now find
> corresponding locations in the "non-eXpat" buffer and destroy it to
> your heart's content. When you're all done, simply free both buffers.
> It's a little more memory intensive.. but sounds fairly elegant to me,
> and you don't need to worry about doing anything unsupported by eXpat.

From 11mjazbdg02 at sneakemail.com  Mon Oct 29 14:57:32 2007
From: 11mjazbdg02 at sneakemail.com (Mark)
Date: Mon, 29 Oct 2007 13:57:32 -0000
Subject: [Expat-discuss] expat parsing destructively?
In-Reply-To:
References: <13866-07183@sneakemail.com>
Message-ID: <30048-59304@sneakemail.com>

Hi Nick,

IIRC I found cases where expat does not return copies of the buffer
which is being parsed.
(In the past I tried to develop a parser that did not copy any data to
speed things up, but I could not achieve it for the reasons below).

> -----Original Message-----
> From: Nick MacDonald nickmacd-at-gmail.com |Expat/1.0-Allow|
> [mailto:...]
> Sent: 29 October 2007 13:53
> To: Mark Williams
> Cc: expat-discuss at libexpat.org
> Subject: Re: [Expat-discuss] expat parsing destructively?
>
> Mark:
>
> You are of course absolutely correct in that regard... and I don't
> know what kind of internal memory management eXpat performs, so this
> may well be a very bad idea.
> I do know, that in many cases, eXpat seems to return to you a copy of
> the same data it is parsing... I just don't know if it does that in
> all cases. You could of course detect when the returned data did not
> reside within your supplied buffer, however it's not quite clear to me
> what action you might take in such a case.
>
>   Nick
>
>
> On 29 Oct 2007 10:22:46 -0000, Mark <11mjazbdg02 at sneakemail.com> wrote:
> > I think it is extremely unlikely this will work. You are assuming
> > that expat will always return a pointer to the data contained within
> > the original memory buffer and not in any internal buffers.
> >
> > ---- Original Message ----
> > Simply allocate 2 copies of the same data... i.e. read in the data to
> > one buffer, malloc() another buffer the same size, and do a huge
> > memcpy() to make a duplicate copy. Now let eXpat scan one copy, and
> > doing some quick and dirty pointer math, you can now find
> > corresponding locations in the "non-eXpat" buffer and destroy it to
> > your heart's content. When you're all done, simply free both buffers.
> > Its a little more memory intensive.. but sounds fairly elegant to me,
> > and you don't need to worry about doing anything unsupported by eXpat.
>

From m_biswas at mailinator.com  Mon Oct 29 16:57:20 2007
From: m_biswas at mailinator.com (Mohun Biswas)
Date: Mon, 29 Oct 2007 11:57:20 -0400
Subject: [Expat-discuss] expat parsing destructively?
In-Reply-To:
References: <13866-07183@sneakemail.com>
Message-ID:

Nick MacDonald wrote:
> Mark:
>
> You are of course absolutely correct in that regard... and I don't
> know what kind of internal memory management eXpat performs, so this
> may well be a very bad idea. I do know, that in many cases, eXpat
> seems to return to you a copy of the same data it is parsing... I just
> don't know if it does that in all cases.
> You could of course detect when the returned data did not reside
> within your supplied buffer, however its not quite clear to me what
> action you might take in such a case.

I wrote up the following little test case just to get an idea where the
data returned by expat resides. It prints out the address range of the
parsed document plus the address of each parsed attribute plus guesses
at the approximate area where the stack and heap seem to be residing.
The addresses of the attributes don't seem to be near the heap or the
stack, nor within the document itself. Damned if I know where they do
live or how they get there. Yes, I should UTSL. This also shows that the
document survives the parse undamaged, so clearly null bytes are not
being written to it.

I still say it would be really great if expat had an optional
destructive-parsing mode. For one thing it seems in keeping with the
spirit of SAX where you chew through the data once linearly and never
look back (except with your own state variables of course).

To expand a little bit on my original posting, I'm actually parsing a
document contained in a file. The file is mapped into memory using
mmap() and then passed to XML_Parse as a single buffer. Currently it's
mapped read-only (PROT_READ) but I've been thinking how cool it would be
if I could turn on writable mode (PROT_WRITE plus MAP_PRIVATE) and let
expat munge the mapping in place. Memory would still need to be
allocated, since the kernel would do a copy-on-write for each modified
page, but I think it's a very safe bet that the kernel's memory
management code is faster and more robust than mine would be.
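
The writable private mapping described in the paragraph above could look
roughly like this. It is only a sketch under POSIX assumptions:
map_private() is a hypothetical helper, and nothing here makes Expat
itself write into the mapping; it just shows that in-place writes stay
in private pages and never reach the file.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a whole file copy-on-write. Writes through
 * the returned pointer touch only this process's private pages; the
 * file on disk is left unchanged. Returns NULL on failure or for an
 * empty file. Release with munmap(ptr, *len_out). */
char *map_private(const char *path, size_t *len_out)
{
    struct stat st;
    void *p;
    int fd;

    assert(path != NULL && len_out != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}
```

The returned pointer and length could be passed to XML_Parse() as a
single final buffer, exactly as with the read-only mapping.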

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #include "expat.h"

    static char *document = "\n\n\n";

    static XML_Parser xp;

    /* Start-tag handler: print the address of each attribute name and
       value next to a stack address and a freshly malloc-ed address. */
    static void
    enter_elem(void *data, const char *el, const char **attrs)
    {
        data = data;    /* unused */
        int i;

        fprintf(stderr, "ENTER %s\n", el);
        for (i = 0; attrs[i]; i += 2)
            fprintf(stderr, "%p=%p stack=%p heap=%p\n",
                attrs[i], attrs[i+1], &i, malloc(32));
    }

    /* End-tag handler. */
    static void
    leave_elem(void *data, const char *el)
    {
        data = data;    /* unused */
        fprintf(stderr, "LEAVE %s\n", el);
    }

    int
    main(void)
    {
        /* Address range occupied by the document itself. */
        fprintf(stderr, "%p <--> %p\n", document,
            document + strlen(document));
        xp = XML_ParserCreate(NULL);
        XML_SetElementHandler(xp, enter_elem, leave_elem);
        XML_Parse(xp, document, strlen(document), 1);
        /* Show that the document survived the parse unmodified. */
        fprintf(stderr, "%s", document);
        return 0;
    }

From jzhang at ximpleware.com  Mon Oct 29 19:10:21 2007
From: jzhang at ximpleware.com (jimmy Zhang)
Date: Mon, 29 Oct 2007 11:10:21 -0700
Subject: [Expat-discuss] [ANN]VTD-XML 2.2
Message-ID: <00dd01c81a56$f8ccf370$0402a8c0@your55e5f9e3d2>

I am proud to announce the release of version 2.2 of VTD-XML, the next
generation XML parsers/indexer/slicer/editor. This release significantly
expands VTD-XML's ability to slice, split, edit and incrementally update
the XML documents. To this end, we introduce the concept of
namespace-compensated element fragment. This release also adds VTD+XML
index writing capability to the VTD Navigator class. Other enhancements
in this release include index size pre-computation, support for HTTP
get, and some bug fixes.

To download the latest release, please go to
http://sourceforge.net/project/showfiles.php?group_id=110612.