From jeremyhu at uclink4.berkeley.edu Sat Mar 1 02:57:49 2003
From: jeremyhu at uclink4.berkeley.edu (Jeremy Huddleston)
Date: Mon Mar 3 01:34:28 2003
Subject: [Expat-discuss] expat, XFree86, and soname
Message-ID: <3E60922D.8070707@uclink4.berkeley.edu>

I just installed XFree86 4.3, and to my surprise it includes expat 1.95.2. I was going to delete it so that I could use the 1.95.6 that I have on my system instead, but there is a small problem. XFree86 calls the library libexpat.so.1 while the soname used in the expat tarballs is libexpat.so.0. Clearly this is not a great situation. I can (and have) removed the libexpat* files from /usr/X11R6/lib and run the following in /usr/local/lib:

ld -shared -soname=libexpat.so.1 libexpat.so.0 -o libexpat.so.1

This has the effect of allowing files linked against libexpat.so.1 to gain access to libexpat.so.0 "through" it. Alternatively, you can create a symlink:

ln -s libexpat.so.0 libexpat.so.1

While these workarounds work, they do not solve the problem with the soname. Some of my programs will be looking for libexpat.so.0 and some will be looking for libexpat.so.1. Because of this, will all future libexpat.so.1 and libexpat.so.0 libraries be binary compatible with each other, and will future versions start using the libexpat.so.1 soname?

Thanks,
Jeremy

From v.cimino at resi.it Mon Mar 3 09:12:48 2003
From: v.cimino at resi.it (Vincenzo Cimino)
Date: Mon Mar 3 03:16:37 2003
Subject: [Expat-discuss] I'm confused about CDATA section
Message-ID: <006f01c2e15c$b0965c00$10cc09c0@ciminov>

I read in this site

http://www.w3schools.com/xml/xml_cdata.asp

"Only text inside a CDATA section is ignored by the parser."

but also in another points.

The behaviour of Expat Library is therefore different?

Vincenzo Cimino
Project Manager
Divisione Network IP
RESI Informatica S.r.l.
Tel + 39.6.92710/406
Tel + 39.6.92710/222
Fax + 39.6.92710/218
mailto:v.cimino@resi.it

From karl at waclawek.net Mon Mar 3 09:18:59 2003
From: karl at waclawek.net (Karl Waclawek)
Date: Mon Mar 3 09:19:45 2003
Subject: [Expat-discuss] Re: I'm confused about CDATA section
References: <006f01c2e15c$b0965c00$10cc09c0@ciminov>
Message-ID: <001801c2e18f$d5732020$9e539696@citkwaclaww2k>

> I read in this site
>
> http://www.w3schools.com/xml/xml_cdata.asp
>
> "Only text inside a CDATA section is ignored by the parser."
>
> but also in another points.
>
> The behaviour of Expat Library is therefore different?

Not really. The above is misleading. The text in a CDATA section is not ignored, but it is not treated as XML markup. Check this section of the specs: http://www.w3.org/TR/REC-xml.html#sec-cdata-sect

Karl

From AFish at GoldenGate.com Tue Mar 4 11:23:34 2003
From: AFish at GoldenGate.com (AFish@GoldenGate.com)
Date: Wed Mar 5 11:44:06 2003
Subject: [Expat-discuss] DOM parser, expat vs. lbxml?
Message-ID: <8E7C7D71F727654A8C99F1606231B8A4058EB2@exchange.earth.ggsoftware.com>

We need a DOM parser in C that will compile on any platform. So far, the only C xml parsers I have seen are expat and libxml. The only DOM parser built on top of expat I have seen is 'SCEW', the simple C expat wrapper (http://www.nongnu.org/scew/).

Questions:
1. Are there other expat wrappers or examples which provide DOM-like xml tree traversal?
2. Has anyone done a side-by-side libxml vs. expat comparison? Is there any reason we should roll our own DOM parser on top of expat instead of using libxml?

From karl at waclawek.net Wed Mar 5 11:54:08 2003
From: karl at waclawek.net (Karl Waclawek)
Date: Wed Mar 5 11:54:13 2003
Subject: [Expat-discuss] DOM parser, expat vs. lbxml?
References: <8E7C7D71F727654A8C99F1606231B8A4058EB2@exchange.earth.ggsoftware.com>
Message-ID: <008301c2e337$d6f8ef20$9e539696@citkwaclaww2k>

> We need a DOM parser in C that will compile on any platform.
> So far, the
> only C xml parsers I have seen are expat and libxml. The only DOM parser
> built on top of expat I have seen is 'SCEW' the simple C expat wrapper
> (http://www.nongnu.org/scew/).
> Questions:
> 1. Are there other expat wrappers or examples which provide DOM-like xml
> tree traversal?

Not that I know of.

> 2. Has anyone done a side-by-side libxml vs. expat comparison? Is there any
> reason we should roll our own DOM parser on top of expat instead of using
> libxml?

Why would you not want to use SCEW instead of rolling your own?

About libxml vs. Expat: I have never compared them, but it seems that Expat is pretty good in the areas of speed and memory use, as well as being quite compliant. However, Expat does not validate.

Karl

From AFish at GoldenGate.com Wed Mar 5 10:40:26 2003
From: AFish at GoldenGate.com (AFish@GoldenGate.com)
Date: Wed Mar 5 13:41:00 2003
Subject: [Expat-discuss] DOM parser, expat vs. lbxml?
Message-ID: <8E7C7D71F727654A8C99F1606231B8A4058EB7@exchange.earth.ggsoftware.com>

>> Why would you not want to use SCEW instead of rolling your own?

This is a good point. I tried SCEW yesterday, and it was pretty good. I had a little trouble compiling on Win32, but the issues were minimal.

From rolf at pointsman.de Wed Mar 5 20:26:04 2003
From: rolf at pointsman.de (rolf@pointsman.de)
Date: Wed Mar 5 14:29:31 2003
Subject: [Expat-discuss] DOM parser, expat vs. lbxml?
In-Reply-To: <8E7C7D71F727654A8C99F1606231B8A4058EB2@exchange.earth.ggsoftware.com>
Message-ID: <200303051926.UAA18340@pointsman.pointsman.de>

On 4 Mar, AFish@GoldenGate.com wrote:
> We need a DOM parser in C that will compile on any platform. So far, the
> only C xml parsers I have seen are expat and libxml. The only DOM parser

Don't forget rxp, a good, fast, compliant and optionally validating parser http://www.cogsci.ed.ac.uk/~richard/rxp.html (don't be put off by the artless home page, it's a good product).
If C++ is also OK for you, there's of course also xerces-c++ (http://xml.apache.org). Well, and for completeness' sake, don't forget msxml (the XML parser out of the evil empire). I'm not a fan of MS for various reasons, but their XML parser (and their XSLT engine) isn't bad.

> built on top of expat I have seen is 'SCEW' the simple C expat wrapper
> (http://www.nongnu.org/scew/).
>
> Questions:
> 1. Are there other expat wrappers or examples which provide DOM-like xml
> tree traversal?

Sablotron (http://www.gingerall.com/charlie/ga/xml/p_sab.xml). From the home page: "Sablotron is a fast, compact and portable XML toolkit implementing XSLT 1.0, DOM Level2 and XPath 1.0. [...] Sablotron uses James Clark's expat XML parser." You'd better not buy into their claim that they have a "fast" XSLT processor (this claim is somewhat ridiculous). Note, though, that Sablotron itself is written in C++.

There are certainly more DOM implementations based on expat around than just one. For example, there's a Tcl extension (I'm one of the maintainers) which implements DOM on top of expat (and also XPath and XSLT) (http://www.tdom.org). The DOM building parts are completely in C, so it may be worth a look.

> 2. Has anyone done a side-by-side libxml vs. expat comparison? Is there any
> reason we should roll our own DOM parser on top of expat instead of using
> libxml?

A lot could be said here - your question is a bit vague about your needs.

Expat does not validate (although it does read, on demand, external entities). If a well-formedness parser is OK for you, expat is definitely somewhat faster - but since both parsers are really fast, this may only be of interest if you aim for maximum speed. Additionally, the time needed to build a DOM-like structure in memory (which typically needs a lot of mallocs for the node structures) isn't negligible, so the overall speed depends not only on the raw parser speed, but also on the quality of the DOM building code.
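To make the point about DOM building cost concrete, here is a minimal sketch of a tree builder driven by handlers shaped like Expat's start and end element callbacks. It is an illustration only: `Node`, `Builder`, `on_start`, `on_end` and `demo_allocations` are hypothetical names, the handlers are called by hand rather than by a real parser, and attributes and text are ignored. It simply counts how many heap allocations a naive builder makes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical DOM-like node; not part of Expat or any library named above. */
typedef struct Node {
    char *name;                 /* heap copy of the element name */
    struct Node *parent;
    struct Node *first_child;
    struct Node *last_child;
    struct Node *next_sibling;
} Node;

/* Builder state handed to the handlers as user data. */
typedef struct Builder {
    Node *root;
    Node *current;       /* element that is currently open */
    size_t allocations;  /* heap allocations made so far */
} Builder;

static char *copy_string(Builder *b, const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    assert(p != NULL);
    memcpy(p, s, n);
    b->allocations++;
    return p;
}

/* Same shape as an Expat start-element handler: (userData, name, atts). */
static void on_start(void *user_data, const char *name, const char **atts)
{
    Builder *b = user_data;
    Node *n = calloc(1, sizeof *n);
    (void)atts;                 /* attributes are ignored in this sketch */
    assert(n != NULL);
    b->allocations++;
    n->name = copy_string(b, name);
    n->parent = b->current;
    if (b->current == NULL) {
        b->root = n;
    } else if (b->current->last_child == NULL) {
        b->current->first_child = b->current->last_child = n;
    } else {
        b->current->last_child->next_sibling = n;
        b->current->last_child = n;
    }
    b->current = n;
}

/* Same shape as an Expat end-element handler: (userData, name). */
static void on_end(void *user_data, const char *name)
{
    Builder *b = user_data;
    (void)name;
    if (b->current != NULL)
        b->current = b->current->parent;
}

/* Frees a node, its descendants and its following siblings. */
static void free_tree(Node *n)
{
    while (n != NULL) {
        Node *next = n->next_sibling;
        free_tree(n->first_child);
        free(n->name);
        free(n);
        n = next;
    }
}

/* Replays the events for <a><b/><c/></a> and returns the allocation count. */
size_t demo_allocations(void)
{
    Builder b = {0};
    size_t count;
    on_start(&b, "a", NULL);
    on_start(&b, "b", NULL);
    on_end(&b, "b");
    on_start(&b, "c", NULL);
    on_end(&b, "c");
    on_end(&b, "a");
    count = b.allocations;
    free_tree(b.root);
    return count;
}
```

Even this stripped-down builder makes two allocations per element (one for the node, one for the name), which is the kind of overhead a pooled or arena allocator in the DOM layer would be trying to reduce.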
Another factor which may be important (depending on the size of your XML data) is that DOM trees typically need _a lot_ of memory. This depends of course on how much markup you have in your document (and how much 'indentation' fluff you have in your document), but it's normal that you need 3 to 5 times the file size in memory for the DOM tree. Although the libxml DOM trees need notably less memory than every Java DOM implementation I know, it isn't the slimmest implementation available. For example, the above-mentioned tDOM implementation has a notably lower overhead (which is important for me, because I have to handle really large product data lists in XML).

DOM and DOM are not the same. Do you mean DOM 1, 2 or 3? What about entities? Must you preserve parsed entities?

DOM alone will probably make you somewhat unhappy in a short time. Navigation within the tree can get tedious if you don't have support for at least XPath (libxml provides this).

But I'd better stop now.

rolf

> > _______________________________________________
> Expat-discuss mailing list
> Expat-discuss@libexpat.org
> http://mail.libexpat.org/mailman/listinfo/expat-discuss

From miallen at eskimo.com Wed Mar 5 14:57:12 2003
From: miallen at eskimo.com (Michael B. Allen)
Date: Wed Mar 5 14:56:18 2003
Subject: [Expat-discuss] DOM parser, expat vs. lbxml?
In-Reply-To: <8E7C7D71F727654A8C99F1606231B8A4058EB2@exchange.earth.ggsoftware.com>
References: <8E7C7D71F727654A8C99F1606231B8A4058EB2@exchange.earth.ggsoftware.com>
Message-ID: <20030305145712.5c460818.miallen@eskimo.com>

On Tue, 4 Mar 2003 11:23:34 -0800 AFish@GoldenGate.com wrote:

> We need a DOM parser in C

DOM is not a parser. Expat is a parser. The DOM is a tree of nodes in memory. DOM implementations frequently do come with some kind of module to load and store from and to XML, however, in which case that module would *use* an XML parser.

> that will compile on any platform. So far, the

"any platform"? You really should be a little more specific.
You might say POSIX or ANSI platform, but "any" just isn't possible.

> only C xml parsers I have seen are expat and libxml. The only DOM parser

There are several XML parsers in C and particularly in C++. Use 'xml parser c' on google. Xerces C++, IBM's xml4c, and Oracle XML for C are three that I can think of.

> built on top of expat I have seen is 'SCEW' the simple C expat wrapper
> (http://www.nongnu.org/scew/).

This isn't a real DOM though.

> Questions:
> 1. Are there other expat wrappers or examples which provide DOM-like xml
> tree traversal?

There are lots of these, I suspect. And you could write your own in 20 minutes. Here's one:

http://www.eskimo.com/~miallen/libmba/dl/docs/ref/domnode.html

> 2. Has anyone done a side-by-side libxml vs. expat comparison? Is there any

I have never really used libxml. I believe it requires glib. I would be willing to bet expat would be quite a bit faster and much, much more efficient though.

> reason we should roll our own DOM parser on top of expat instead of using
> libxml?

There are a couple of other DOM implementations. If you need a real DOM rather than a simple DOM-like interface like those mentioned above, there is DOMC (by yours truly):

http://www.eskimo.com/~miallen/domc/

Incidentally, I just finished testing 0.7, but I have to run through the portability tests and create the various packages, so it will take me another week or so. I have already compiled it on Windows NT and have a working Win32 Makefile for MSVC (I'm not a Linux zealot!).

Mike

--
A program should be written to model the concepts of the task it performs rather than the physical world or a process because this maximizes the potential for it to be applied to tasks that are conceptually similar and, more important, to tasks that have not yet been conceived.

From xcross at us.ibm.com Wed Mar 5 15:08:37 2003
From: xcross at us.ibm.com (Chris Cross)
Date: Wed Mar 5 15:09:03 2003
Subject: [Expat-discuss] Can expat interrupt processing?
Message-ID:

Can expat interrupt processing? Say you encounter an error in the start element handler and want to bail?

thanks,
chris

Chris Cross
IBM Boca Raton
xcross@us.ibm.com
voice 561.862.2102 t/l 975.2102
fax 561.862.3922

From karl at waclawek.net Wed Mar 5 15:19:27 2003
From: karl at waclawek.net (Karl Waclawek)
Date: Wed Mar 5 15:19:33 2003
Subject: [Expat-discuss] Can expat interrupt processing?
References:
Message-ID: <010a01c2e354$85aed220$9e539696@citkwaclaww2k>

> Can expat interrupt processing? Say you encounter an error in the start
> element handler and want to bail?

This question comes up from time to time. If you are using it from C++, throw an exception. In C, use setjmp/longjmp if your compiler supports it. From other languages it will depend. With Delphi it works by raising an exception, just like with C++.

Karl

From xcross at us.ibm.com Wed Mar 5 17:52:01 2003
From: xcross at us.ibm.com (Chris Cross)
Date: Wed Mar 5 17:52:28 2003
Subject: [Expat-discuss] Can expat interrupt processing?
Message-ID:

Hi Karl,
I've not used setjmp/longjmp before, so thanks for your patience. If I do a longjmp out of the element handler, what are the memory implications in expat when it hasn't finished the parse? Would the code look something like this:

jmp_buf mark;

main()
{
    ...
    jmpret = setjmp( mark );
    if (jmpret == 0)
    {
        if (!XML_Parse(...))
        {
            // process errors
        }
    }
    else
    {
        // What is the state of the parser here?
        // Would XML_GetCurrentLineNumber work?
        // Should I call XML_ParserFree to clean up?
        // process errors
    }

    ...
}

startElement(...)
{
    // process the element
    ...
    if (some error condition)
        longjmp(mark, -1);
}

Thanks for your help,
chris

Chris Cross
IBM Boca Raton
xcross@us.ibm.com
voice 561.862.2102 t/l 975.2102
fax 561.862.3922

"Karl Waclawek"
03/05/2003 03:19 PM

To: , Chris Cross/West Palm Beach/IBM@IBMUS
cc:
Subject: Re: [Expat-discuss] Can expat interrupt processing?

> Can expat interrupt processing? Say you encounter an error in the start
> element handler and want to bail?

This question comes up from time to time. If you are using it from C++, throw an exception. In C, use setjmp/longjmp if your compiler supports it. From other languages it will depend. With Delphi it works by raising an exception, just like with C++.

Karl

From karl at waclawek.net Wed Mar 5 20:08:46 2003
From: karl at waclawek.net (Karl Waclawek)
Date: Wed Mar 5 20:06:15 2003
Subject: [Expat-discuss] Can expat interrupt processing?
References:
Message-ID: <000b01c2e37c$f051ae90$0207a8c0@karl>

> > Hi Karl,
> I've not used setjmp/longjmp before so thanks for your patience.

Neither have I. Most of the time I am not programming in C.

> If I do a
> longjmp out of the element handler what are the memory implications in
> expat when it hasn't finished the parse? Would the code look something
> like this:
>
> jmp_buf mark;
>
> main()
> {
>     ...
>     jmpret = setjmp( mark );
>     if (jmpret == 0)
>     {
>         if (!XML_Parse(...))
>         {
>             // process errors
>         }
>     }
>     else
>     {
>         // What is the state of the parser here?
>         // Would XML_GetCurrentLineNumber work?
>         // Should I call XML_ParserFree to clean up?
>         // process errors
>     }
>
>     ...
> }
>
> startElement(...)
> {
>     // process the element
>     ...
>     if (some error condition)
>         longjmp(mark, -1);
> }

That code looks right to me. I would not call XML_GetCurrentLineNumber after returning from the error through longjmp; I would rather store all the information that is needed later while still in the handler, before longjmp is called. Most such functions in Expat are only meant to be called from a handler, even though they sometimes might still work after parsing has ended. I suggest you try XML_GetCurrentLineNumber both ways, but "officially" it is better to call it from the handler.

You should call XML_ParserFree or XML_ParserReset just like you would when parsing has ended normally.

AFAIK, you should not use setjmp/longjmp when in C++.

The memory implications are such - to the best of my knowledge - that Expat does not allocate memory in local variables (e.g. for the purpose of a callback), so that jumping out of parsing through longjmp should not cause a memory leak.

Karl

From brox at corena.no Thu Mar 6 08:30:23 2003
From: brox at corena.no (Bjorn Brox)
Date: Thu Mar 6 02:30:38 2003
Subject: [Expat-discuss] Manage unknown entityes?
Message-ID: <3E66F90F.9000308@corena.no>

How can I manage unknown entities?

When parsing the following xml file I get the error: "undefined entity at line 7" and the parser stops.

--------------------
]>
Hello World&iquest;
--------------------

The &iquest; entity is one of the standard SGML character entities defined in ISOPub, but unknown by the expat parser.

I know that I could solve this by adding an entity declaration, but very often users assume that a parser has knowledge of all the ISO defined entities.

Is it possible to set up a callback where I can handle unknown entities, and where I can decide myself whether I want the parser to terminate or simply return the correct UTF-8 code if my callback knows the entity?

--
Bjorn Brox, CORENA Norge AS, http://www.corena.no/, ICQ 17872043
Industritunet, Dyrmyrgt.
35, N-3611 Kongsberg, NORWAY
Phone: +47 32717210, Fax: +47 32717201, Mobile: +47 92638590

From karl at waclawek.net Thu Mar 6 09:14:02 2003
From: karl at waclawek.net (Karl Waclawek)
Date: Thu Mar 6 09:14:12 2003
Subject: [Expat-discuss] Manage unknown entityes?
References: <3E66F90F.9000308@corena.no>
Message-ID: <000b01c2e3ea$a39ea210$9e539696@citkwaclaww2k>

> How can I manage unknown entities?
>
> When parsing the following xml file I get the error: "undefined entity
> at line 7" and the parser stops.
>
> --------------------
>
>
>
> ]>
>
> Hello World&iquest;
>
> --------------------
>
> The &iquest; entity is one of the standard SGML character entities
> defined in ISOPub, but unknown by the expat parser.

This is not a pre-defined entity in XML and requires a declaration. Your document is simply not well-formed and will be rejected by any compliant XML parser.

> I know that I could solve this by adding an entity declaration, but very
> often users assume that a parser has knowledge of all the ISO
> defined entities.

In XML that is an incorrect assumption.

> Is it possible to set up a callback where I can handle unknown entities
> where I can decide myself if I want the parser to terminate or simply
> return the correct UTF-8 code if my callback knows the entity?

We don't have such a callback, and it really is not a good idea since it encourages the creation of non-wellformed (malformed?) documents.

Karl

From dmoore at viefinancial.com Tue Mar 11 07:45:22 2003
From: dmoore at viefinancial.com (Moore, Dave)
Date: Tue Mar 11 07:45:30 2003
Subject: [Expat-discuss] Pull API Status?
Message-ID: <2FE8C75C7A06D4118BB50008C7F7E831DE3BF1@EXCHSRV>

I was browsing the archives and came across a proposal to implement a pull based API on top of expat's XML_Parse+callbacks API. This is something I could use for a current project at work, and I'd be willing to tackle it if no one is actively working on this.
API Option 1 in this message

http://mail.libexpat.org/pipermail/expat-discuss/2002-August/000602.html

meets my needs, but I wanted to make sure this was still considered to be a viable option.

While I agree with a later message in the thread that a Pull API built directly on top of the xmltok tokenizer would be cleaner and more efficient, it's not something I personally feel up to tackling at this point due to time constraints on my work project.

After I hear back on the status, I have a couple of specific questions about how people would like to handle character text nodes in a pull based API.

Thanks,
Dave

-----------------------
David Moore
dmoore@viefinancial.com

From karl at waclawek.net Tue Mar 11 09:40:25 2003
From: karl at waclawek.net (Karl Waclawek)
Date: Tue Mar 11 09:40:31 2003
Subject: [Expat-discuss] Pull API Status?
References: <2FE8C75C7A06D4118BB50008C7F7E831DE3BF1@EXCHSRV>
Message-ID: <007801c2e7dc$2754b780$9e539696@citkwaclaww2k>

> I was browsing the archives and came across a proposal to implement a pull
> based API on top of expat's XML_Parse+callbacks API. This is something I
> could use for a current project at work, and I'd be willing to tackle it if
> no one is actively working on this.

No one is currently working on it, as we plan to release Expat 2.0 first.

> API Option 1 in this message
> http://mail.libexpat.org/pipermail/expat-discuss/2002-August/000602.html
> meets my needs, but I wanted to make sure this was still considered to be a
> viable option.

It is not what I have in mind, but it could serve as the Pull solution until we really go for a proper implementation. If we can hide implementation details and make the API itself robust, then a change in implementation might not even break existing code - in theory, at least ;-).
> While I agree with a later message in the thread that a Pull API built
> directly on top of the xmltok tokenizer would be cleaner and more efficient,
> it's not something I personally feel up to tackling at this point due to
> time constraints on my work project.

Yes, that is always our problem!!!

Btw, here is how I see the "proper" Pull implementation (assuming an API has been established, details may change):

- Expat is already Pull based internally. So a lot of code can be re-used. One does not need to completely re-implement the layer on top of xmltok.
- The main things to change are:
  - instead of "pushing" buffers (with XML_ParseBuffer), have the main parsing loop pull buffers with an XML_GetNextBuffer callback.
  - add return codes to all the callbacks (like XML_SKIP, XML_USE, XML_ERROR, ...)
  - supply internal callbacks which perform the PULL API specific data preparation and also do any required filtering
- The Next() function would simply call the main parsing loop, which returns when an (internal) callback returns XML_USE. The data to be reported would be stored in some fields in the Parser structure.

In addition we would also want to improve the API with regard to complete entity reporting (currently the same restrictions as SAX2) and namespace reporting (it seems better to return names as separate localName, prefix and uri parameters).

And, of course, it should still be possible to use Expat in Push mode.

> After I hear back on the status, I have a couple of specific questions for
> how people would like to handle character text nodes in a pull based API.

I think if we want your API re-usable we should put a lot of thought into it.

Karl

From rolf at pointsman.de Tue Mar 11 15:58:04 2003
From: rolf at pointsman.de (rolf@pointsman.de)
Date: Tue Mar 11 10:01:28 2003
Subject: [Expat-discuss] Pull API Status?
In-Reply-To: <007801c2e7dc$2754b780$9e539696@citkwaclaww2k>
Message-ID: <200303111458.PAA14381@pointsman.pointsman.de>

(...
sorry for putting this a bit out of context)

On 11 Mar, Karl Waclawek wrote:
> [...]
> In addition we would also want to improve the API with regards to
> complete entity reporting (currently the same restrictions as SAX2)
> and namespace reporting (it seems better to return names as separate
> localName, prefix and uri parameters).

I would love to see this. (Especially the second thing (namespace reporting), which should be much easier, as far as I can see, than the first thing, which needs a lot more new API stuff.)

rolf

From dmoore at viefinancial.com Tue Mar 11 11:09:49 2003
From: dmoore at viefinancial.com (Moore, Dave)
Date: Tue Mar 11 11:10:43 2003
Subject: [Expat-discuss] Pull API Status?
Message-ID: <2FE8C75C7A06D4118BB50008C7F7E831DE3BF7@EXCHSRV>

Karl,

Thanks for the reply. Let me make some comments and ask some questions inline to make sure I understand you.

> > I was browsing the archives and came across a proposal to implement a pull
> > based API on top of expat's XML_Parse+callbacks API. This is something I
> > could use for a current project at work, and I'd be willing to tackle it if
> > no one is actively working on this.
>
> No one is currently working on it, as we plan to release Expat 2.0 first.

This is understandable. I didn't read the roadmap closely enough, and was going on the August posts referring to suspension perhaps being part of 1.96.xxx

> - Expat is already Pull based internally. So a lot of code can be re-used.
>   One does not need to completely re-implement the layer on top of xmltok.
> - The main things to change are:
>   - instead of "pushing" buffers (with XML_ParseBuffer), have the main
>     parsing loop pull buffers with an XML_GetNextBuffer callback.

Just to get my terminology straight... By "main parsing loop" you mean any of the prologProcessor, contentProcessor, externalEntityProcessor family of functions, yes?
So, contentProcessor (or doContent) would change to call the XML_GetNextBuffer callback whenever it needed more bytes.

Pull mode would work by calling XML_NextNode() (or whatever we name it) and would -never- call XML_Parse or XML_ParseBuffer.

Push mode would continue to work in that XML_Parse and XML_ParseBuffer would provide pointers and lengths to the main parser structure just like they do today, and the absence of an XML_GetNextBuffer callback would ensure that control returns to XML_Parse callers just like it does now.

> - add return codes to all the callbacks (like XML_SKIP, XML_USE, XML_ERROR, ...)
> - supply internal callbacks which perform the PULL API specific
>   data preparation and also do any required filtering

Return codes seem to be the cleanest way to do this, especially for an Expat 2.1 or 3.0 release. However, should we preserve compatibility with "push-era" callbacks better by providing an XML_SetNodeHandling() function to be used in a callback, where the user would set the XML_SKIP, XML_USE flags appropriately?

The default could always be "XML_SKIP". Callbacks can continue to have void return type, and for push mode users the main loop would continue until the buffer is exhausted, just like today.

> - The Next() function would simply call the main parsing loop which returns
>   when an (internal) callback returns XML_USE. The data to be reported would
>   be stored in some fields in the Parser structure.

Agreed.

> In addition we would also want to improve the API with regards to
> complete entity reporting (currently the same restrictions as SAX2)
> and namespace reporting (it seems better to return names as separate
> localName, prefix and uri parameters).

My project does not have complex namespace reporting needs, but I would certainly want any new pull api functions to provide as much namespace support as the rest of Expat.

> And, of course, it should still be possible to use Expat in Push mode.

Absolutely.
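The XML_SetNodeHandling() idea (void callbacks that flag a node, with skip as the default) can be sketched as follows. This is a toy under stated assumptions, not Expat code: `ToyParser`, `toy_set_node_handling`, `toy_next`, `toy_parse_all` and `report_items` are hypothetical names, and the "input" is just an array of element names standing in for a real tokenizer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; none of this exists in Expat. */
typedef enum { NODE_SKIP, NODE_USE } NodeHandling;

typedef struct ToyParser ToyParser;
typedef void (*StartHandler)(ToyParser *p, const char *name);

struct ToyParser {
    const char **names;     /* pre-tokenized element names (stand-in input) */
    size_t count;
    size_t pos;
    NodeHandling handling;  /* reset to NODE_SKIP before every callback */
    StartHandler on_start;
};

/* Called from inside a handler, in the spirit of the proposed
   XML_SetNodeHandling(). */
void toy_set_node_handling(ToyParser *p, NodeHandling h)
{
    assert(p != NULL);
    p->handling = h;
}

/* Pull entry point: runs callbacks until one flags its node as NODE_USE.
   Returns that node's name, or NULL at end of input. */
const char *toy_next(ToyParser *p)
{
    while (p->pos < p->count) {
        const char *name = p->names[p->pos++];
        p->handling = NODE_SKIP;            /* the default is to skip */
        if (p->on_start != NULL)
            p->on_start(p, name);
        if (p->handling == NODE_USE)
            return name;
    }
    return NULL;
}

/* Push entry point: same callbacks, but the flag is ignored and the loop
   runs until the input is exhausted. Returns the number of nodes visited. */
size_t toy_parse_all(ToyParser *p)
{
    size_t visited = 0;
    while (p->pos < p->count) {
        p->handling = NODE_SKIP;
        if (p->on_start != NULL)
            p->on_start(p, p->names[p->pos]);
        p->pos++;
        visited++;
    }
    return visited;
}

/* Example handler with a void return type: only "item" elements are
   reported to the pull caller. */
void report_items(ToyParser *p, const char *name)
{
    if (strcmp(name, "item") == 0)
        toy_set_node_handling(p, NODE_USE);
}
```

With the names "list", "item", "note", "item" as input, repeated calls to `toy_next` hand back the two "item" nodes and then NULL, while `toy_parse_all` on a fresh parser visits all four nodes without stopping, which is the push/pull coexistence described above.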
> > After I hear back on the status, I have a couple of specific questions for
> > how people would like to handle character text nodes in a pull based API.
>
> I think if we want your API re-usable we should put a lot of thought into it.

I agree, especially with regard to namespace handling. However, I would like to scope out a level of effort based on the following first phase requirements:

1. Provide pull parsing that can handle element, attribute, and text nodes for well formed XML documents. Use a minimal subset of the Java API at www.xmlpull.org as a guideline sample API. A subset of this API is included at the bottom of the email.

2. The real assessment is this - how extensive are the changes to the main parsing loop(s)? They have to:
   A. Handle pulling buffers from users when needed. If this logic can stay "outside" in XML_Next(), it will be helpful, I would imagine.
   B. Handle "XML_USE/XML_SKIP/XML_ERROR" values set during callbacks.
   C. If XML_USE is returned, we have to store information about the current node for retrieval and return.

3. Default callbacks that match the capabilities of the Pull API should be provided. That is, "XML_USE" should be the default for StartElement, EndElement, and Character callbacks.

Seeing the roadmap, I can see why the main expat distribution is holding off on these things... My goal would be to provide a proof of concept implementation that minimizes impact on the main parsing loops, so that any changes to the main parsing loop that occur during the 1.95 -> 2.0 transition are easy to merge with the changes necessary for pull parsing.

Any feedback would be appreciated!

Thanks,
Dave

Minimal API for proof of concept:

typedef enum {
    START_DOCUMENT,
    END_DOCUMENT,
    START_TAG,
    END_TAG,
    TEXT,
} NodeType;

typedef enum {
    XML_USE,
    XML_SKIP,
    XML_ERROR,
} NodeHandling;

/* Main entry point for pull parsing.
   Returns the NodeType of the most recently parsed node where
   XML_USE was set in a callback */
NodeType XML_Next(XML_Parser p);

/* Function used in callbacks to set pull parse handling of the node */
void XML_SetNodeHandling(XML_Parser p,NodeHandling handling);

/* Functions to retrieve information about the current node. */
char *XML_GetName(); /* Returns the name of the next node */
char *XML_GetText(); /* Returns the text content of the current node */
int XML_GetAttributeCount();
char *XML_GetAttributeName(int index);
char *XML_GetAttributeValue(int index);

/* Clearly the above family of functions would grow/change to handle
   namespaces, CDATA, comments, etc. */

/* Pull buffer management */

/* User provided function will provide a pointer to additional buffer
   space and the length of that buffer. If the user sets *nextBufferPtr
   to NULL, this signals the end of input. */
typedef void(*XML_GetNextBufferHandler)(void *userData, char **nextBufferPtr, int *nextBufferLen);

XML_SetGetNextBufferHandler(XML_Parser p,XML_GetNextBufferHandler handler);

From karl at waclawek.net Tue Mar 11 12:14:06 2003
From: karl at waclawek.net (Karl Waclawek)
Date: Tue Mar 11 12:14:14 2003
Subject: [Expat-discuss] Pull API Status?
References: <2FE8C75C7A06D4118BB50008C7F7E831DE3BF7@EXCHSRV>
Message-ID: <00c201c2e7f1$9f504e10$9e539696@citkwaclaww2k>

> > No one is currently working on it, as we plan to release
> > Expat 2.0 first.
>
> This is understandable. I didn't read the roadmap closely enough, and was
> going on the August posts referring to suspension perhaps being part of
> 1.96.xxx

Yes, our first look at this was triggered by the fact that Mozilla uses an older version of Expat with such modifications. But thinking some more about it, we didn't think (or at least I didn't) that this is the way to go. Also, if you look at the code check-ins in the last year or so, you will find that we haven't had a lot of manpower available. It's mostly just Fred and me.
> > - Expat is already Pull based internally. So a lot of code can be re-used.
> >   One does not need to completely re-implement the layer on top of xmltok.
> > - The main things to change are:
> >   - instead of "pushing" buffers (with XML_ParseBuffer), have the main
> >     parsing loop pull buffers with an XML_GetNextBuffer callback.
>
> Just to get my terminology straight... By "main parsing loop" you mean any
> of the prologProcessor, contentProcessor, externalEntityProcessor family of
> functions, yes?

Yes, basically doProlog and doContent. Those "processors" above are basically used to call the parsing loop based on the parser's state. The input can come in chunks of any size, so part of the state is stored in a "processor" function pointer, to avoid too many conditional checks. That is a heavily used approach in Expat.

> So, contentProcessor (or doContent) would change to call the
> XML_GetNextBuffer callback whenever they needed more bytes.

Yes.

> Pull mode would work by calling XML_NextNode() (or whatever we name it) and
> would -never- call XML_Parse or XML_ParseBuffer.

Correct.

> Push mode would continue to work in that XML_Parse and XML_ParseBuffer would
> provide pointers and lengths to the main parser structure just like they do
> today, and the absence of a XML_GetNextBuffer callback would ensure that
> control returns to XML_Parse callers just like it does now.

How we do this in detail is not settled yet. One could simply have different "processors" (without the XML_GetNextBuffer callback) called from XML_Parse(Buffer). It might even be possible to mix push and pull operation - but I'm not sure, I haven't really thought it through.
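The idea of keeping part of the parser state in a "processor" function pointer, so that input arriving in chunks of any size can be resumed without a chain of state checks, can be sketched like this. It is a deliberately tiny toy and not Expat's code: the "document" is a header line terminated by a newline followed by content, and `Toy`, `prolog_processor`, `content_processor`, `toy_init` and `toy_feed` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy, loosely modeled on the "processor" idea described
   above; it is not Expat's actual code. */
typedef struct Toy Toy;
typedef size_t (*Processor)(Toy *t, const char *buf, size_t len);

struct Toy {
    Processor processor;   /* current state, stored as a function pointer */
    size_t prolog_bytes;   /* bytes consumed while in the prolog state */
    size_t content_bytes;  /* bytes consumed while in the content state */
};

static size_t content_processor(Toy *t, const char *buf, size_t len);

/* Prolog state: consume up to and including the first newline, then hand
   over to the content state by swapping the function pointer. */
static size_t prolog_processor(Toy *t, const char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (buf[i] == '\n') {
            t->prolog_bytes += i + 1;
            t->processor = content_processor;   /* state transition */
            return i + 1;
        }
    }
    t->prolog_bytes += len;     /* prolog continues in the next chunk */
    return len;
}

/* Content state: consume everything it is given. */
static size_t content_processor(Toy *t, const char *buf, size_t len)
{
    (void)buf;
    t->content_bytes += len;
    return len;
}

void toy_init(Toy *t)
{
    t->processor = prolog_processor;
    t->prolog_bytes = 0;
    t->content_bytes = 0;
}

/* Feed one chunk of any size. The driver never inspects an explicit state
   variable; it just calls whatever processor is current. */
void toy_feed(Toy *t, const char *buf, size_t len)
{
    while (len > 0) {
        size_t used = t->processor(t, buf, len);
        assert(used > 0 && used <= len);
        buf += used;
        len -= used;
    }
}
```

Feeding "HEA", then "D\nbo", then "dy" leaves five bytes counted as prolog and four as content, even though the state change happens in the middle of the second chunk.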
Or, one could, for instance, imagine that we leave the XML_GetBuffer
callback, and simply call the XML_NextNode function, with defaults
set to skip all nodes, and instead of the internal callbacks,
the programmer supplies his/her own (just like today), with one change:
there will still be return codes, and if the programmer wants to terminate
parsing prematurely he/she just returns XML_STOP or XML_ERROR, or some
other code. This would give us the additional benefit of enabling
C programmers to terminate parsing without having to resort to
setjmp/longjmp.

In simpler words, push mode could be achieved by simply having the
programmer supply his/her own callbacks instead of the internal ones,
and by defaulting to XML_SKIP.

One thing: this push API would not be compatible with the current one.
Therefore this should be part of a branch that builds up to Expat 3.0.

> > - add return codes to all the callbacks (like XML_SKIP,
> >   XML_USE, XML_ERROR, ...)
> > - supply internal callbacks which perform the PULL API specific
> >   data preparation and also do any required filtering
>
> Return codes seem to be the cleanest way to do this, especially for an ExPat
> 2.1 or 3.0 release.
> However, should we preserve compatibility with "push-era" callbacks better
> by providing a XML_SetNodeHandling() function to be used in a callback where
> the user would send the XML_SKIP, XML_USE flags appropriately?
>
> The default could always be "XML_SKIP". Callbacks can continue to have void
> return type and for push mode users, the main loop would continue until the
> buffer is exhausted, just like today.

Well, IMO, if we break from the old API anyway, we should do it properly.
However, if we can keep the old API completely the same, then it would
make sense to have something like XML_SetNodeHandling().
But since I have already had some feedback on the namespace processing
API changes, it does not look like the old API will survive completely
unmodified, and so I would say we should go with return codes.
Depends on what kind of feedback we will have, of course.

> > In addition we would also want to improve the API with regards to
> > complete entity reporting (currently the same restrictions as SAX2)
> > and namespace reporting (it seems better to return names as separate
> > localName, prefix and uri parameters).
>
> My project does not have complex namespace reporting needs, but I would
> certainly want any new pull api functions to provide as much namespace
> support as the rest of Expat.

What I was aiming at was that the way qualified names are reported is
cumbersome to parse and extra work to do within Expat. Also, the SAX2 API
has a separation into localName, QName and uri, which is confusing.
E.g., what is the value of localName if ns-processing is turned on,
but the name has no prefix? etc. We would want to avoid that.

> > I think if we want your API re-usable we should put a lot of
> > thought into it.
>
> I agree, especially with regards to namespace handling. However, I would
> like to scope out a level of effort based on the following first phase
> requirements:
>
> 1. Provide pull parsing that can handle element, attribute, and text nodes
> for well formed XML documents. Use a minimal subset of the Java API at
> www.xmlpull.org as a guideline Sample API. Subset of this API is included
> at the bottom of the email.

Yes, existing efforts can serve as a starting point, why re-invent the wheel?

> 2. The real assessment is this - how extensive are the changes to the main
> parsing loop(s)? They have to :
> A. Handle pulling buffers from users when needed. If this logic
> can stay "outside" in XML_Next(), it will be helpful, I would imagine.
This might be something to do just before doContent is called,
in the various "processors", like for instance:

static enum XML_Error PTRCALL
contentProcessor(XML_Parser parser,
                 const char *start,
                 const char *end,
                 const char **endPtr)
{
  enum XML_Error result =
    doContent(parser, 0, encoding, start, end, endPtr);
  if (result != XML_ERROR_NONE)
    return result;
  if (!storeRawNames(parser))
    return XML_ERROR_NO_MEMORY;
  return result;
}

replacing the doContent call with something like

  while (XML_GetNextBuffer(start, end, endPtr)) {
    enum XML_Error result =
      doContent(parser, 0, encoding, start, end, endPtr);
    if (result != XML_ERROR_NONE)
      return result;
  }

Haven't looked at this closely, so this may not be a good approach,
but you get the idea.

> B. Handle "XML_USE/XML_SKIP/XML_ERROR" values set during callbacks.
> C. If XML_USE is returned, we have to store information about the
> current node for retrieval and return.
> 3. Default callbacks that match the capabilities of the Pull API should be
> provided. That is, "XML_USE" should be the default for StartElement,
> EndElement, and Character callbacks.

Btw, if we use return codes, then one way to default them would be
by passing them by reference (instead of a "real" return value),
and have that value defaulted accordingly. That way the programmer
only needs to set it in the "unusual" case.

> Seeing the roadmap, I can see why the main expat distribution is holding off
> on these things... My goal would be to provide a proof of concept
> implementation that minimizes impact to the main parsing loops so that any
> changes to the main parsing loop that occur during the 1.95 -> 2.0
> transition are easy to merge with changes necessary for pull parsing.

I think if we can get away with something as simple as the
contentProcessor approach from above, coupled with custom callbacks,
it might not be that bad. But the devil is in the detail ...

The API you posted looks good for a starting point.
Signature details can potentially change.
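
[Illustration, not part of the original message.] The buffer-pulling loop
discussed above could be sketched roughly as follows. This is a minimal
stand-alone sketch, not Expat code: the handler type is modelled on the
XML_GetNextBufferHandler proposal from this thread, while
GetNextBufferHandler, pull_and_count, count_tag_starts and ChunkSource are
hypothetical names made up for the example, and the "parsing" step merely
counts '<' characters in place of doContent.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical handler type, modelled on the proposal in this thread:
   setting *nextBufferPtr to NULL signals the end of input. */
typedef void (*GetNextBufferHandler)(void *userData,
                                     char **nextBufferPtr,
                                     int *nextBufferLen);

/* Stand-in for the real parsing step (doContent in the thread): it only
   counts '<' characters so that the sketch stays self-contained. */
static int count_tag_starts(const char *start, const char *end)
{
    int n = 0;
    for (; start < end; ++start)
        if (*start == '<')
            ++n;
    return n;
}

/* Pull loop: keep asking the handler for input until it reports the end. */
static int pull_and_count(GetNextBufferHandler next, void *userData)
{
    int total = 0;
    for (;;) {
        char *buf = NULL;
        int len = 0;
        next(userData, &buf, &len);
        if (buf == NULL)
            break;                      /* end of input */
        total += count_tag_starts(buf, buf + len);
    }
    return total;
}

/* Example input source: hands out a fixed list of chunks, one per call. */
struct ChunkSource {
    char **chunks;
    int count;
    int next;
};

static void chunk_source_next(void *userData,
                              char **nextBufferPtr,
                              int *nextBufferLen)
{
    struct ChunkSource *src = userData;
    if (src->next >= src->count) {
        *nextBufferPtr = NULL;          /* no more data */
        *nextBufferLen = 0;
        return;
    }
    *nextBufferPtr = src->chunks[src->next];
    *nextBufferLen = (int)strlen(src->chunks[src->next]);
    src->next++;
}
```

The sketch deliberately ignores tokens that straddle two buffers; a real
implementation would have to carry leftover bytes over to the next buffer.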
But let's take it one step at a time. I am interested in this myself (a lot),
but lack of time is my enemy.

Karl

From karl at waclawek.net Tue Mar 11 13:13:13 2003
From: karl at waclawek.net (Karl Waclawek)
Date: Tue Mar 11 13:13:22 2003
Subject: [Expat-discuss] Pull API Status?
References: <2FE8C75C7A06D4118BB50008C7F7E831DE3BF7@EXCHSRV>
Message-ID: <00ca01c2e7f9$e1961ea0$9e539696@citkwaclaww2k>

> int XML_GetAttributeCount();
> char *XML_GetAttributeName(int index);
> char *XML_GetAttributeValue(int index);

One comment about attributes - this is a shortcoming of SAX2 as well:
Due to the nature of reporting attribute values as one chunk
of data, entities within the value are silently expanded.
The same applies to parameter entities in entity values.

This is one of the deficiencies we wanted to fix, so that
Rolf and others can build a complete DOM on top of Expat.

Rolf, please fill in if I am missing anything here.

Karl

From dmoore at viefinancial.com Tue Mar 11 13:19:36 2003
From: dmoore at viefinancial.com (Moore, Dave)
Date: Tue Mar 11 13:19:45 2003
Subject: [Expat-discuss] Pull API Status?
Message-ID: <2FE8C75C7A06D4118BB50008C7F7E831DE3BFC@EXCHSRV>

I'm certainly very flexible on this point. The Pull API can certainly
produce a single attribute during each call to Next() - it just means we add
a few new states - START_TAG_OPEN, ATTRIBUTE, START_TAG_CLOSE, something
like that.

Or were you saying that the tokenizer in the "far future" expat will no
longer process a start tag with attributes as one token?

Dave

> -----Original Message-----
> From: Karl Waclawek [mailto:karl@waclawek.net]
> Sent: Tuesday, March 11, 2003 1:13 PM
> To: Moore, Dave; expat-discuss@libexpat.org
> Subject: Re: [Expat-discuss] Pull API Status?
>
>
> > int XML_GetAttributeCount();
> > char *XML_GetAttributeName(int index);
> > char *XML_GetAttributeValue(int index);
>
> One comment about attributes - this is a shortcoming of SAX2 as well:
> Due to the nature of reporting attribute values as one chunk
> of data, entities within the value are silently expanded.
> The same applies to parameter entities in entity values.
>
> This is one of the deficiencies we wanted to fix, so that
> Rolf and others can build a complete DOM on top of Expat.
>
> Rolf, please fill in if I am missing anything here.
>
> Karl
>

From karl at waclawek.net Tue Mar 11 13:39:33 2003
From: karl at waclawek.net (Karl Waclawek)
Date: Tue Mar 11 13:39:42 2003
Subject: [Expat-discuss] Pull API Status?
References: <2FE8C75C7A06D4118BB50008C7F7E831DE3BFC@EXCHSRV>
Message-ID: <00da01c2e7fd$8f543d80$9e539696@citkwaclaww2k>

> I'm certainly very flexible on this point. The Pull API can certainly
> produce a single attribute during each call to Next() - it just means we add
> a few new states - START_TAG_OPEN, ATTRIBUTE, START_TAG_CLOSE, something
> like that.

That would not really help. The simplest would be to have a startAttribute
callback, followed by characters, start/endEntity callbacks and terminated
by an endAttribute callback, just like it is done for elements.
However, that many callbacks can make Expat inefficient.

> Or were you saying that the tokenizer in the "far future" expat will no
> longer process a start tag with attributes as one token?

If I remember correctly - it was a while ago - some of the changes
to support complete entity reporting would have to be made in the tokenizer.

There are various other ways to do this through an API (not just like above).
I imagine we could have *two* XML_GetAttributeValue functions (or supply
an extra argument), where we would only support one of them initially
(the one that always expands entities).
For the second one, there are again options:
A low level, efficient, but maybe not so nice, option would be to insert
pointers to entity values in the character string, prefixed by an invalid
XML character (like U-FFFF) as marker. This can work recursively.
Another option could be to return the attribute as node with children.

I prefer the first option since Expat is most often used through wrappers
anyway, and if you want the speed, then you can still get it by going down
to the Expat API level directly.

Karl

From dmoore at viefinancial.com Tue Mar 11 13:42:04 2003
From: dmoore at viefinancial.com (Moore, Dave)
Date: Tue Mar 11 13:42:14 2003
Subject: [Expat-discuss] Pull API Status?
Message-ID: <2FE8C75C7A06D4118BB50008C7F7E831DE3BFE@EXCHSRV>

After looking further into the way that XML_GetNextBuffer() would work,
there is a "wrinkle" involved when we have a partial token left over from
the previous buffer. Since we have to arrange for the leftover bytes to be
just in front of any new bytes before continuing with our parsing, we have
some choices:

1. Copy leftover bytes out and into a dynamic buffer, ask the user for
their new buffer, and then copy both blocks into a contiguous dynamically
allocated buffer. Simple for the user, slower for us.

2. Ask the user for a buffer of a certain minimum size, and instruct the
user to copy their new data into this buffer starting at a certain offset
into that buffer, leaving us room to shift the old bytes to the beginning of
the buffer, and allowing for "huge" nodes that span several buffers. New
signature:
XML_GetNextBuffer(int min_buffer_len,int offset, char **p_new_buffer,int
*p_new_len);

3. Ask the user to prepend our leftover data for us, effectively shifting the
work that XML_GetBuffer does onto the user. This will minimize unnecessary
allocations and copies, but doesn't feel clean to me. New signature:

XML_GetNextBuffer(const char *p_old_data,int len_old,data, char
**p_new_buffer,int *p_new_len);

4. Some combination of the above - the main thing is to allow for and
protect ourselves against the user re-using the same buffer across calls.

These kinds of issues seem to be philosophical - who pays in complexity for
improved performance - the library or the user? Since I've been here all of
5 hours, I'll defer to others on this list for the choice that's consistent
with expat philosophy.

Thanks,
Dave

From karl at waclawek.net Tue Mar 11 13:45:53 2003
From: karl at waclawek.net (Karl Waclawek)
Date: Tue Mar 11 13:46:00 2003
Subject: [Expat-discuss] Pull API Status?
References: <2FE8C75C7A06D4118BB50008C7F7E831DE3BFE@EXCHSRV>
Message-ID: <00e201c2e7fe$7195fe40$9e539696@citkwaclaww2k>

> After looking further into the way that XML_GetNextBuffer() would work,
> there is a "wrinkle" involved when we have a partial token left over from
> the previous buffer. Since we have to arrange for the leftover bytes to be
> just in front of any new bytes before continuing with our parsing, we have
> some choices:
>
> 1. Copy leftover bytes out and into a dynamic buffer, ask the user for
> their new buffer, and then copy both blocks into a contiguous dynamically
> allocated buffer. Simple for the user, slower for us.
>
> 2. Ask the user for a buffer of a certain minimum size, and instruct the
> user to copy their new data into this buffer starting at a certain offset
> into that buffer, leaving us room to shift the old bytes to the beginning of
> the buffer, and allowing for "huge" nodes that span several buffers. New
> signature:
> XML_GetNextBuffer(int min_buffer_len,int offset, char **p_new_buffer,int
> *p_new_len);
>
> 3. Ask the user to prepend our leftover data for us, effectively shifting the
> work that XML_GetBuffer does onto the user. This will minimize unnecessary
> allocations and copies, but doesn't feel clean to me. New signature:
>
> XML_GetNextBuffer(const char *p_old_data,int len_old,data, char
> **p_new_buffer,int *p_new_len);
>
> 4. Some combination of the above - the main thing is to allow for and
> protect ourselves against the user re-using the same buffer across calls.

XML_Parse and XML_ParseBuffer already do that. We just need to
convert them from being supplied a buffer to getting one.
Of course, the devilish details may make it not quite that simple.

Karl

From karl at waclawek.net Tue Mar 11 14:54:19 2003
From: karl at waclawek.net (Karl Waclawek)
Date: Tue Mar 11 14:54:24 2003
Subject: [Expat-discuss] Pull API Status?
References: <2FE8C75C7A06D4118BB50008C7F7E831DE3BFE@EXCHSRV> <00e201c2e7fe$7195fe40$9e539696@citkwaclaww2k>
Message-ID: <010e01c2e808$014b9c30$9e539696@citkwaclaww2k>

> > 4. Some combination of the above - the main thing is to allow for and
> > protect ourselves against the user re-using the same buffer across calls.
>
> XML_Parse and XML_ParseBuffer already do that. We just need to
> convert them from being supplied a buffer to getting one.
> Of course, the devilish details may make it not quite that simple.

What I mean is: We have a function NextBuffer() - based on XML_Parse(Buffer) -
which is the one used in the parsing loop. This function in turn then gets
more input by calling back through XML_GetNextBuffer().

Karl

From fdrake at acm.org Wed Mar 12 20:09:35 2003
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed Mar 12 20:10:12 2003
Subject: [Expat-discuss] Performance magic in internal.h
Message-ID: <15983.55887.53503.279030@grendel.zope.com>

Expat currently contains some fairly magical compiler-specific
performance hackery in the form of the FASTCALL, PTRCALL, and
PTRFASTCALL macros defined in the internal.h header.

When these were added, I measured a real but very small speed
improvement using these. On many platforms and compilers, these are
essentially disabled, and for others, they should be (for example, Mac
OS X using GCC, the standard C compiler for that platform).
Given the high maintenance cost of these macros (they seem to be wrong
or useless most of the time), I'm inclined to remove them completely.
Is there any reason not to? They don't affect the public API in any
way, only the performance (very slightly), and the maintainability.

I'd *really* like to remove these for Expat 1.95.7.

-Fred

--
Fred L. Drake, Jr.
PythonLabs at Zope Corporation

From karl at waclawek.net Wed Mar 12 21:16:06 2003
From: karl at waclawek.net (Karl Waclawek)
Date: Wed Mar 12 21:13:14 2003
Subject: [Expat-discuss] Performance magic in internal.h
References: <15983.55887.53503.279030@grendel.zope.com>
Message-ID: <00d701c2e906$817095b0$0207a8c0@karl>

> Expat currently contains some fairly magical compiler-specific
> performance hackery in the form of the FASTCALL, PTRCALL, and
> PTRFASTCALL macros defined in the internal.h header.
>
> When these were added, I measured a real but very small speed
> improvement using these. On many platforms and compilers, these are
> essentially disabled, and for others, they should be (for example, Mac
> OS X using GCC, the standard C compiler for that platform).
>
> Given the high maintenance cost of these macros (they seem to be wrong
> or useless most of the time), I'm inclined to remove them completely.
> Is there any reason not to? They don't affect the public API in any
> way, only the performance (very slightly), and the maintainability.
>
> I'd *really* like to remove these for Expat 1.95.7.

What about making all of these macros noop, but leaving
them there in case anyone wants to tune Expat for their
own purposes? It was some effort to put them there in the
first place.
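
[Illustration, not part of the original message.] As a rough sketch of what
"no-op by default, tunable by hand" could look like: the macro names
FASTCALL, PTRCALL and PTRFASTCALL come from this thread, but the
definitions below are not copied from internal.h. The TUNE_CALLS switch
and the helper functions are hypothetical, and regparm(3) is just one
GCC/x86 attribute that someone tuning by hand might choose.

```c
/* Opt-in tuning: only when the builder explicitly asks for it, and only
   for a compiler/target where the attribute is known to be meaningful.
   TUNE_CALLS is a hypothetical switch invented for this sketch. */
#if defined(TUNE_CALLS) && defined(__GNUC__) && defined(__i386__)
#  define FASTCALL    __attribute__((regparm(3)))
#  define PTRFASTCALL __attribute__((regparm(3)))
#endif

/* Everything not defined above falls back to a no-op, so the code
   compiles unchanged with any compiler. */
#ifndef FASTCALL
#  define FASTCALL
#endif
#ifndef PTRCALL
#  define PTRCALL
#endif
#ifndef PTRFASTCALL
#  define PTRFASTCALL
#endif

/* A function that is only ever called directly. */
static int FASTCALL add_fast(int a, int b)
{
    return a + b;
}

/* A function called through a pointer; the pointer type and the function
   carry the same macro so their calling conventions always match. */
static int PTRCALL add_ptr(int a, int b)
{
    return a + b;
}

typedef int (PTRCALL *BinaryOp)(int, int);

static int apply(BinaryOp op, int a, int b)
{
    return op(a, b);
}
```

With the macros left empty this is plain portable C; defining TUNE_CALLS
on a 32-bit x86 GCC build only changes how add_fast receives its arguments.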

Karl

From carlos at pehoe.civil.ist.utl.pt Wed Mar 12 20:35:58 2003
From: carlos at pehoe.civil.ist.utl.pt (Carlos Pereira)
Date: Thu Mar 13 08:39:27 2003
Subject: [Expat-discuss] isFinal parameter
Message-ID: <200303122035.h2CKZwp02328@pehoe.civil.ist.utl.pt>

Both XML_Parse and XML_ParseBuffer have a isFinal parameter
that tells Expat whether this is the last block or not.

Usually I would use something like this to know whether isFinal
is true or false:

size = read (socket_fd, buffer, 1024);
isFinal = size < 1024;

Unfortunately this does not work for HTTP streams, reading from
a socket, where the actual size of the block returned by read
is variable, depending on the connection, so if I receive
a block of 512 this does not really mean that this is
the last block.

To get around this, I consider:

1) using two buffers, reading buffer1, then buffer2, and then parsing
buffer1 (if the returned size for buffer2 is 0 then buffer1 is
the last block), and then swapping buffers, until the end.

2) use one buffer, and then read (socket_fd, char, 1); to read just a
byte ahead, to see if read returns 0 or not. This tells me what
should be the isFinal parameter.

3) run XML_Parse always with isFinal parameter set to FALSE.
This is easier but it really looks completely wrong.

Is there a better solution? Am I forgetting something obvious?
Is there some magic solution for this problem?

Thanks a lot for enlightening my cluelessness!
Carlos

From fdrake at acm.org Thu Mar 13 15:22:16 2003
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu Mar 13 15:22:53 2003
Subject: [Expat-discuss] isFinal parameter
In-Reply-To: <200303122035.h2CKZwp02328@pehoe.civil.ist.utl.pt>
References: <200303122035.h2CKZwp02328@pehoe.civil.ist.utl.pt>
Message-ID: <15984.59512.447219.317945@grendel.zope.com>

Carlos Pereira writes:
 > Both XML_Parse and XML_ParseBuffer have a isFinal parameter
 > that tells Expat whether this is the last block or not.
...
 > 3) run XML_Parse always with isFinal parameter set to FALSE.
 > This is easier but it really looks completely wrong.

This is entirely reasonable, and quite common. Since this is easier
to handle consistently and keeps the code calling into Expat simpler,
that's a good way to do it. I certainly use this in the Python
bindings.

-Fred

--
Fred L. Drake, Jr.
PythonLabs at Zope Corporation

From carlos at pehoe.civil.ist.utl.pt Thu Mar 13 21:12:01 2003
From: carlos at pehoe.civil.ist.utl.pt (Carlos Pereira)
Date: Thu Mar 13 16:11:32 2003
Subject: [Expat-discuss] Re: isFinal parameter
Message-ID: <200303132112.h2DLC1w07989@pehoe.civil.ist.utl.pt>

> Carlos Pereira writes:
> > Both XML_Parse and XML_ParseBuffer have a isFinal parameter
> > that tells Expat whether this is the last block or not.
> ...
> > 3) run XML_Parse always with isFinal parameter set to FALSE.
> > This is easier but it really looks completely wrong.

> This is entirely reasonable, and quite common. Since this is easier
> to handle consistently and keeps the code calling into Expat simpler,
> that's a good way to do it. I certainly use this in the Python
> bindings.

Thanks a lot, this helps me a lot, I was under the impression
that Expat would take special precautions for the last block,
but I can see now that it shouldn't really matter.

Carlos

From rolf at pointsman.de Thu Mar 13 22:34:47 2003
From: rolf at pointsman.de (rolf@pointsman.de)
Date: Thu Mar 13 16:38:38 2003
Subject: [Expat-discuss] Re: isFinal parameter
In-Reply-To: <200303132112.h2DLC1w07989@pehoe.civil.ist.utl.pt>
Message-ID: <200303132134.WAA20958@pointsman.pointsman.de>

On 13 Mar, Carlos Pereira wrote:
>> Carlos Pereira writes:
>> > Both XML_Parse and XML_ParseBuffer have a isFinal parameter
>> > that tells Expat whether this is the last block or not.
>> ...
>> > 3) run XML_Parse always with isFinal parameter set to FALSE.
>> > This is easier but it really looks completely wrong.
>
>> This is entirely reasonable, and quite common. Since this is easier
>> to handle consistently and keeps the code calling into Expat simpler,
>> that's a good way to do it. I certainly use this in the Python
>> bindings.
>
> Thanks a lot, this helps me a lot, I was under the impression
> that Expat would take special precautions for the last block,
> but I can see now that it shouldn't really matter.

Oh, sure there is a difference between isFinal == 0 and isFinal == 1.
With isFinal == 0, the parser doesn't complain if the block to parse
ends in the middle of markup, or without all levels closed etc. It
does, if isFinal == 1.

Therefore, it may be wise to call XML_Parse() with a 0 bytes long
block (this is legal) and isFinal == 1, after you have noticed the end
of the input in the wrapper code around it. Otherwise, you may not
detect that the input ended prematurely.

rolf

From carlos at pehoe.civil.ist.utl.pt Thu Mar 13 22:07:36 2003
From: carlos at pehoe.civil.ist.utl.pt (Carlos Pereira)
Date: Thu Mar 13 17:07:06 2003
Subject: [Expat-discuss] Re: isFinal parameter
Message-ID: <200303132207.h2DM7ah08057@pehoe.civil.ist.utl.pt>

>>> Carlos Pereira writes:
>>> > Both XML_Parse and XML_ParseBuffer have a isFinal parameter
>>> > that tells Expat whether this is the last block or not.
>>> ...
>>> > 3) run XML_Parse always with isFinal parameter set to FALSE.
>>> > This is easier but it really looks completely wrong.
>>
>>> This is entirely reasonable, and quite common. Since this is easier
>>> to handle consistently and keeps the code calling into Expat simpler,
>>> that's a good way to do it. I certainly use this in the Python
>>> bindings.
>>
>> Thanks a lot, this helps me a lot, I was under the impression
>> that Expat would take special precautions for the last block,
>> but I can see now that it shouldn't really matter.

>Oh, sure there is a difference between isFinal == 0 and isFinal == 1.
>With isFinal == 0, the parser doesn't complain if the block to parse
>ends in the middle of markup, or without all levels closed etc. It
>does, if isFinal == 1.

>Therefore, it may be wise to call XML_Parse() with a 0 bytes long
>block (this is legal) and isFinal == 1, after you have noticed the end
>of the input in the wrapper code around it. Otherwise, you may not
>detect that the input ended prematurely.

That is very clever, many, many thanks for clarifying
this point. I just implemented your expert advice.
The picture is now clear.

Thank you both, Fred and Rolf, for helping me.

Carlos

From laurent.carcone at w3.org Fri Mar 14 14:21:10 2003
From: laurent.carcone at w3.org (Laurent Carcone)
Date: Fri Mar 14 08:21:14 2003
Subject: [Expat-discuss]
Message-ID: <20030314132110.293CD170F5@tux.inrialpes.fr>

Hello,

Would it be possible to include a simple patch in Expat to report an
unresolved external general entity in attribute value at its original
position inside the attribute value rather than the default handler.

In Function 'appendAttributeValue'
in case XML_TOK_ENTITY_REF:

I replace
  if ((pool == &tempPool) && defaultHandler)
    reportDefault(parser, enc, ptr, next);
by
  if ((pool == &tempPool) && defaultHandler)
    {
      const char *ent;
      for (ent = ptr; ent < next; ent++) {
        if (!poolAppendChar(pool, ent[0]))
          return XML_ERROR_NO_MEMORY;
      }
    }

We use Expat for our open-source browser/editor Amaya and in this case,
we need to receive the whole entity value in the elementStartHandler
and let the application decide how to manage external entities.

Thanks,
Laurent Carcone
---------------
W3C - ERCIM
INRIA Rhône-Alpes
655 avenue de l'Europe
ZIRST Montbonnot
38334 Saint Ismier Cedex
email: laurent.carcone@w3.org / Laurent.Carcone@inrialpes.fr
Phone: +33 4 76 61 52 67
Fax: +33 4 76 61 52 07

From karl at waclawek.net Fri Mar 14 09:39:20 2003
From: karl at waclawek.net (Karl Waclawek)
Date: Fri Mar 14 09:39:29 2003
Subject: [Expat-discuss]
References: <20030314132110.293CD170F5@tux.inrialpes.fr>
Message-ID: <002101c2ea37$7fcc9510$9e539696@citkwaclaww2k>

> Hello,
>
> Would it be possible to include a simple patch in Expat to report an
> unresolved external general entity in attribute value at its original
> position inside the attribute value rather than the default handler.

AFAIK, *external* general entity references are not allowed in attributes.
(section 4.4.4 in the XML 1.0 specs http://www.w3.org/TR/REC-xml.html#forbidden)

But assuming you mean internal general entities, then it seems you are
running into a limitation that exists for all SAX-like parsers: they do not
report entity boundaries for general entities in attribute values and
parameter entities in declarations.

> In Function 'appendAttributeValue'
> in case XML_TOK_ENTITY_REF:
>
> I replace
>   if ((pool == &tempPool) && defaultHandler)
>     reportDefault(parser, enc, ptr, next);
> by
>   if ((pool == &tempPool) && defaultHandler)
>     {
>       const char *ent;
>       for (ent = ptr; ent < next; ent++) {
>         if (!poolAppendChar(pool, ent[0]))
>           return XML_ERROR_NO_MEMORY;
>
>       }
>     }

If I understand correctly, you want "&myEntity;" to show up verbatim
in the reported attribute value?

That would be a problem, because the string "&myEntity;" could be
generated without an actual entity reference, like here:
where the attribute value would
be "ABC&myEntity;DEF".

> We use Expat for our open-source browser/editor Amaya and in this case,
> we need to receive the whole entity value in the elementStartHandler
> and let the application decide how to manage external entities.

I hate to say that, but as it currently stands, a fully DOM compliant
parser would be better for you.

However, we have been discussing how to improve Expat in this regard,
but it would be implemented as part of a new Expat API, after
version 2.0 has been released.

There are several options, the most efficient being the insertion
of an illegal XML character (like Unicode 0xFFFF) followed by
the entity name or even a pointer. An extra API call could be used
to pass that name or pointer and retrieve the entity value.

Another one would be to turn attribute reporting into a streaming style,
with a startAttribute, endAttribute callback, and between them we would
have character and start/endEntity callbacks. However that looks like
a lot of calls just for reporting an attribute.

We haven't discussed this thoroughly though, and anybody who comes
up with a better idea is welcome to share it with us.

Karl

From Laurent.Carcone at inrialpes.fr Fri Mar 14 16:33:07 2003
From: Laurent.Carcone at inrialpes.fr (Laurent Carcone)
Date: Fri Mar 14 10:33:12 2003
Subject: [Expat-discuss]
In-Reply-To: Your message of Fri, 14 Mar 2003 09:39:20 -0500." <002101c2ea37$7fcc9510$9e539696@citkwaclaww2k>
Message-ID: <20030314153307.7DFD4170F5@tux.inrialpes.fr>

>
> > Hello,
> >
> > Would it be possible to include a simple patch in Expat to report an
> > unresolved external general entity in attribute value at its original
> > position inside the attribute value rather than the default handler.
>
> AFAIK, *external* general entity references are not allowed in attributes.
> (section 4.4.4 in the XML 1.0 specs http://www.w3.org/TR/REC-xml.html#forbidden)

You made the right assumption :-)

> But assuming you mean internal general entities, then it seems you are
> running into a limitation that exists for all SAX-like parsers: they do not
> report entity boundaries for general entities in attribute values and
> parameter entities in declarations.

I read some mails in the archives and it is what I was afraid of.

>
> > In Function 'appendAttributeValue'
> > in case XML_TOK_ENTITY_REF:
> >
> > I replace
> >   if ((pool == &tempPool) && defaultHandler)
> >     reportDefault(parser, enc, ptr, next);
> > by
> >   if ((pool == &tempPool) && defaultHandler)
> >     {
> >       const char *ent;
> >       for (ent = ptr; ent < next; ent++) {
> >         if (!poolAppendChar(pool, ent[0]))
> >           return XML_ERROR_NO_MEMORY;
> >
> >       }
> >     }
>
> If I understand correctly, you want "&myEntity;" to show up verbatim
> in the reported attribute value?
>
> That would be a problem, because the string "&myEntity;" could be
> generated without an actual entity reference, like here:
> where the attribute value would
> be "ABC&myEntity;DEF".

It's precisely the problem we encountered. To fix it, we use a special
character followed by the entity name (like in your first option).
In fact, my real patch is:

  if ((pool == &tempPool) && defaultHandler)
    {
      const char *ent;
      if (!poolAppendChar(pool, START_ENTITY))
        return XML_ERROR_NO_MEMORY;
      for (ent = ptr+1; ent < next; ent++) {
        if (!poolAppendChar(pool, ent[0]))
          return XML_ERROR_NO_MEMORY;
      }
    }

where START_ENTITY is a special character shared by Expat and the application.

>
> > We use Expat for our open-source browser/editor Amaya and in this case,
> > we need to receive the whole entity value in the elementStartHandler
> > and let the application decide how to manage external entities.
>
> I hate to say that, but as it currently stands, a fully DOM compliant
> parser would be better for you.
>
> However, we have been discussing how to improve Expat in this regard,
> but it would be implemented as part of a new Expat API, after
> version 2.0 has been released.
>
> There are several options, the most efficient being the insertion
> of an illegal XML character (like Unicode 0xFFFF) followed by
> the entity name or even a pointer. An extra API call could be used
> to pass that name or pointer and retrieve the entity value.
>
> Another one would be to turn attribute reporting into a streaming style,
> with a startAttribute, endAttribute callback, and between them we would
> have character and start/endEntity callbacks. However that looks like
> a lot of calls just for reporting an attribute.
>
> We haven't discussed this thoroughly though, and anybody who comes
> up with a better idea is welcome to share it with us.
>
> Karl
>

I'm looking forward to this discussion.

Thank you for your response.

Laurent

>
> _______________________________________________
> Expat-discuss mailing list
> Expat-discuss@libexpat.org
> http://mail.libexpat.org/mailman/listinfo/expat-discuss
>

From fdrake at acm.org Fri Mar 14 12:02:14 2003
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri Mar 14 12:03:04 2003
Subject: [Expat-discuss] Performance magic in internal.h
In-Reply-To: <00d701c2e906$817095b0$0207a8c0@karl>
References: <15983.55887.53503.279030@grendel.zope.com> <00d701c2e906$817095b0$0207a8c0@karl>
Message-ID: <15986.2838.608875.575402@grendel.zope.com>

Karl Waclawek writes:
 > What about making all of these macros noop, but leaving
 > them there in case anyone wants to tune Expat for their
 > own purposes? It was some effort to put them there in the
 > first place.

Given the incredible response to this discussion, I'm going to assume
that no one is using Expat any more and re-write the whole thing to be
a graphical email client that only supports obscure protocols.

I'll check in the new code tonight.

-Fred

--
Fred L. Drake, Jr.
PythonLabs at Zope Corporation

From fdrake at acm.org Fri Mar 14 12:06:05 2003
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri Mar 14 12:06:40 2003
Subject: [Expat-discuss] Performance magic in internal.h
In-Reply-To: <00d701c2e906$817095b0$0207a8c0@karl>
References: <15983.55887.53503.279030@grendel.zope.com> <00d701c2e906$817095b0$0207a8c0@karl>
Message-ID: <15986.3069.126323.544840@grendel.zope.com>

Karl Waclawek writes:
 > What about making all of these macros noop, but leaving
 > them there in case anyone wants to tune Expat for their
 > own purposes? It was some effort to put them there in the
 > first place.

Given the incredible response to this suggestion, I'm going to keep
internal.h, but disable the three *CALL macros for everything except
GCC on Linux, since the current definitions there are known
to work and actually help. Anyone else who wants to set them should
edit internal.h themselves.

I'll commit the changes shortly, then close SF bug #692878.

-Fred

--
Fred L. Drake, Jr.
PythonLabs at Zope Corporation

From karl at waclawek.net Fri Mar 14 12:48:42 2003
From: karl at waclawek.net (Karl Waclawek)
Date: Fri Mar 14 12:48:48 2003
Subject: [Expat-discuss] Performance magic in internal.h
References: <15983.55887.53503.279030@grendel.zope.com><00d701c2e906$817095b0$0207a8c0@karl> <15986.2838.608875.575402@grendel.zope.com>
Message-ID: <008d01c2ea51$f3be49e0$9e539696@citkwaclaww2k>

> Karl Waclawek writes:
>  > What about making all of these macros noop, but leaving
>  > them there in case anyone wants to tune Expat for their
>  > own purposes? It was some effort to put them there in the
>  > first place.
>
> Given the incredible response to this discussion, I'm going to assume
> that no one is using Expat any more and re-write the whole thing to be
> a graphical email client that only supports obscure protocols.
>
> I'll check in the new code tonight.

Good stuff, Fred.
I also got a request to add support for IIOP.
Should we delay this until after release 2.0?

Karl

From fdrake at acm.org Fri Mar 14 13:04:10 2003
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri Mar 14 13:04:48 2003 Subject: [Expat-discuss] Performance magic in internal.h In-Reply-To: <008d01c2ea51$f3be49e0$9e539696@citkwaclaww2k> References: <15983.55887.53503.279030@grendel.zope.com> <00d701c2e906$817095b0$0207a8c0@karl> <15986.2838.608875.575402@grendel.zope.com> <008d01c2ea51$f3be49e0$9e539696@citkwaclaww2k> Message-ID: <15986.6554.512051.137536@grendel.zope.com> Karl Waclawek writes: > Good stuff, Fred. > I also got a request to add support for IIOP. > Should we delay this until after release 2.0? Only the IIOP support; that'll require we write at least one unit test. The fancy GUI and mail protocol support defy testing, so it can go in now. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From karl at waclawek.net Fri Mar 14 13:17:12 2003 From: karl at waclawek.net (Karl Waclawek) Date: Fri Mar 14 13:18:34 2003 Subject: [Expat-discuss] Performance magic in internal.h References: <15983.55887.53503.279030@grendel.zope.com><00d701c2e906$817095b0$0207a8c0@karl><15986.2838.608875.575402@grendel.zope.com><008d01c2ea51$f3be49e0$9e539696@citkwaclaww2k> <15986.6554.512051.137536@grendel.zope.com> Message-ID: <000b01c2ea55$ef572260$9e539696@citkwaclaww2k> > Karl Waclawek writes: > > Good stuff, Fred. > > I also got a request to add support for IIOP. > > Should we delay this until after release 2.0? > > Only the IIOP support; that'll require we write at least one unit > test. The fancy GUI and mail protocol support defy testing, so it can > go in now. Fine with me. Btw, we might be able to tunnel IIOP through SMTP. Karl From rolf at pointsman.de Fri Mar 14 19:19:28 2003 From: rolf at pointsman.de (rolf@pointsman.de) Date: Fri Mar 14 13:22:56 2003 Subject: [Expat-discuss] Performance magic in internal.h In-Reply-To: <15986.6554.512051.137536@grendel.zope.com> Message-ID: <200303141819.TAA06918@pointsman.pointsman.de> On 14 Mar, Fred L. Drake, Jr. wrote: > > Karl Waclawek writes: > > Good stuff, Fred. 
> > I also got a request to add support for IIOP. > > Should we delay this until after release 2.0? > > Only the IIOP support; that'll require we write at least one unit > test. The fancy GUI and mail protocol support defy testing, so it can > go in now. Im sorry, but, what the heck, does IIOP mean in this context (for sure not CORBA's Internet Inter-ORB Protocol) ? And about what fancy GUI and mail protocol support does you guys talk here? rolf From karl at waclawek.net Fri Mar 14 13:30:42 2003 From: karl at waclawek.net (Karl Waclawek) Date: Fri Mar 14 13:30:54 2003 Subject: [Expat-discuss] Performance magic in internal.h References: <200303141819.TAA06918@pointsman.pointsman.de> Message-ID: <001101c2ea57$d1f93620$9e539696@citkwaclaww2k> > On 14 Mar, Fred L. Drake, Jr. wrote: > > > > Karl Waclawek writes: > > > Good stuff, Fred. > > > I also got a request to add support for IIOP. > > > Should we delay this until after release 2.0? > > > > Only the IIOP support; that'll require we write at least one unit > > test. The fancy GUI and mail protocol support defy testing, so it can > > go in now. > > Im sorry, but, what the heck, does IIOP mean in this context (for sure > not CORBA's Internet Inter-ORB Protocol) ? And about what fancy GUI and > mail protocol support does you guys talk here? Well, since Expat is so fast it could be used for SOAP over IIOP (yes, the CORBA protocol). The e-mail support is for asynchronous messaging, especially when the parties are disconnected. Karl From fdrake at acm.org Fri Mar 14 13:54:34 2003 From: fdrake at acm.org (Fred L. Drake, Jr.) 
Date: Fri Mar 14 13:55:07 2003 Subject: [Expat-discuss] Performance magic in internal.h In-Reply-To: <000b01c2ea55$ef572260$9e539696@citkwaclaww2k> References: <200303141819.TAA06918@pointsman.pointsman.de> <001101c2ea57$d1f93620$9e539696@citkwaclaww2k> <15986.6554.512051.137536@grendel.zope.com> <15983.55887.53503.279030@grendel.zope.com> <00d701c2e906$817095b0$0207a8c0@karl> <15986.2838.608875.575402@grendel.zope.com> <008d01c2ea51$f3be49e0$9e539696@citkwaclaww2k> <000b01c2ea55$ef572260$9e539696@citkwaclaww2k> Message-ID: <15986.9578.860736.412547@grendel.zope.com> Karl Waclawek writes: > Fine with me. Btw, we might be able to tunnel IIOP through SMTP. Most definately. rolf@pointsman.de writes: > Im sorry, but, what the heck, does IIOP mean in this context (for sure > not CORBA's Internet Inter-ORB Protocol) ? And about what fancy GUI and > mail protocol support does you guys talk here? I think you missed an announcement: http://mail.libexpat.org/pipermail/expat-discuss/2003-March/000945.html Karl Waclawek writes: > Well, since Expat is so fast it could be used for SOAP over IIOP > (yes, the CORBA protocol). The e-mail support is for asynchronous > messaging, especially when the parties are disconnected. Oh, you wanted to keep the XML parser? -Fred -- Fred L. Drake, Jr. 
PythonLabs at Zope Corporation From karl at waclawek.net Fri Mar 14 13:59:12 2003 From: karl at waclawek.net (Karl Waclawek) Date: Fri Mar 14 13:59:18 2003 Subject: [Expat-discuss] Performance magic in internal.h References: <200303141819.TAA06918@pointsman.pointsman.de><001101c2ea57$d1f93620$9e539696@citkwaclaww2k><15986.6554.512051.137536@grendel.zope.com><15983.55887.53503.279030@grendel.zope.com><00d701c2e906$817095b0$0207a8c0@karl><15986.2838.608875.575402@grendel.zope.com><008d01c2ea51$f3be49e0$9e539696@citkwaclaww2k><000b01c2ea55$ef572260$9e539696@citkwaclaww2k> <15986.9578.860736.412547@grendel.zope.com> Message-ID: <002b01c2ea5b$ccf61cc0$9e539696@citkwaclaww2k> > Karl Waclawek writes: > > Well, since Expat is so fast it could be used for SOAP over IIOP > > (yes, the CORBA protocol). The e-mail support is for asynchronous > > messaging, especially when the parties are disconnected. > > Oh, you wanted to keep the XML parser? Yes, I have a hard time letting go... Karl From tim.fornoville at web.de Mon Mar 17 17:13:00 2003 From: tim.fornoville at web.de (tim fornoville) Date: Mon Mar 17 11:13:34 2003 Subject: [Expat-discuss] & § ü ö ä characters how to handle Message-ID: <200303171613.h2HGD0316753@mailgate5.cinetic.de> Hi, I've got a little problem here, when I parse a file with " & § ü ö ä " the parser shuts down.. , is there a work-around... 
greetings tim From karl at waclawek.net Mon Mar 17 11:18:51 2003 From: karl at waclawek.net (Karl Waclawek) Date: Mon Mar 17 11:20:05 2003 Subject: Re: [Expat-discuss] & § ü ö ä characters how to handle References: <200303171613.h2HGD0316753@mailgate5.cinetic.de> Message-ID: <004801c2eca0$e61b5a70$9e539696@citkwaclaww2k> > I've got a little problem here, when I parse a file with " & § ü ö ä " > the parser shuts down.. , > is there a work-around... Do you mean it crashes? Or do you get an error returned? If the latter, maybe the XML file has the wrong encoding specified in the XML declaration? ISO-8859-1 might work. Karl From brox at corena.no Mon Mar 17 17:26:34 2003 From: brox at corena.no (Bjorn Brox) Date: Mon Mar 17 11:26:54 2003 Subject: [Expat-discuss] & § ü ö ä characters how to handle References: <200303171613.h2HGD0316753@mailgate5.cinetic.de> Message-ID: <3E75F73A.1070301@corena.no> tim fornoville wrote: > Hi, > > I've got a little problem here, when I parse a file with " & § ü ö ä > " the parser shuts down.. 
, is there a work-around... > The default encoding is UTF-8, and these characters are not valid UTF-8 characters (each of them marks the start of a UTF-8 sequence, but the next character is probably not a valid one). Put the correct encoding in your header if you insist on using Latin-1 (ISO-8859-1). -- Bjorn Brox, CORENA Norge AS, http://www.corena.no/, ICQ 17872043 Industritunet, Dyrmyrgt. 35, N-3611 Kongsberg, NORWAY Phone: +47 32717210, Fax: +47 32717201, Mobile: +47 92638590 From mba2000 at ioplex.com Mon Mar 17 16:29:01 2003 From: mba2000 at ioplex.com (Michael B.Allen) Date: Mon Mar 17 16:29:29 2003 Subject: [Expat-discuss] & § ü ö ä characters how to handle In-Reply-To: <3E75F73A.1070301@corena.no> References: <200303171613.h2HGD0316753@mailgate5.cinetic.de> <3E75F73A.1070301@corena.no> Message-ID: <20030317162901.329e88fc.mba2000@ioplex.com> On Mon, 17 Mar 2003 17:26:34 +0100 Bjorn Brox wrote: > tim fornoville wrote: > > Hi, > > > > I've got a little problem here, when I parse a file with " & § ü ö ä > > " the parser shuts down.. , is there a work-around... > > > The default encoding is UTF-8, and these characters are not valid UTF-8 > characters (each of them marks the start of a UTF-8 sequence, but the > next character is probably not a valid one). Except for the ampersand character, which needs to be replaced with the builtin entity reference &amp;. Mike -- A program should be written to model the concepts of the task it performs rather than the physical world or a process because this maximizes the potential for it to be applied to tasks that are conceptually similar and, more important, to tasks that have not yet been conceived. 
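As a rough illustration of the encoding point raised in this thread, the following C sketch checks whether a byte string is structurally plausible UTF-8. The function name looks_like_utf8 is hypothetical and is not part of Expat; the check is deliberately simplified (it does not reject overlong three- and four-byte forms or surrogates), so it should be read as a sketch under those assumptions rather than as a full validator. The ISO-8859-1 bytes for § ü ö ä (0xA7 0xFC 0xF6 0xE4) fail the check, which is consistent with a parser refusing them when no encoding is declared.

```c
#include <stddef.h>

/* Hypothetical helper, not an Expat function.
   Returns 1 if buf[0..len) has the byte structure of UTF-8, 0 otherwise.
   Simplification: overlong 3/4-byte forms and surrogates are not rejected. */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;   /* number of continuation bytes expected */
        size_t k;

        if (c < 0x80)
            extra = 0;                      /* plain ASCII */
        else if (c >= 0xC2 && c <= 0xDF)
            extra = 1;                      /* lead byte of a 2-byte sequence */
        else if (c >= 0xE0 && c <= 0xEF)
            extra = 2;                      /* lead byte of a 3-byte sequence */
        else if (c >= 0xF0 && c <= 0xF4)
            extra = 3;                      /* lead byte of a 4-byte sequence */
        else
            return 0;                       /* stray continuation or invalid lead byte */

        if (extra > len - i - 1)
            return 0;                       /* sequence is cut off at end of input */

        for (k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;                   /* not a 10xxxxxx continuation byte */
        }
        i += extra + 1;
    }
    return 1;
}
```

For example, looks_like_utf8((const unsigned char *)"\xC3\xBC", 2) accepts the UTF-8 form of ü, while the single Latin-1 byte 0xFC is rejected. On the document side, the two fixes suggested in the thread are to start the file with an XML declaration such as <?xml version="1.0" encoding="ISO-8859-1"?> and to write a literal ampersand as &amp;.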
From fabio at soi.city.ac.uk Tue Mar 18 15:58:58 2003 From: fabio at soi.city.ac.uk (Fabio Venuti) Date: Tue Mar 18 10:59:03 2003 Subject: [Expat-discuss] getting offset of attributes Message-ID: <005601c2ed67$4950c640$4f5b288a@cadfael> Hi all, by using XML_GetCurrentByteIndex I can get the offset of the beginning of an element, but how can I get the offset of, say, an attribute? Thanks, Fabio From fdrake at acm.org Tue Mar 18 11:03:10 2003 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Tue Mar 18 11:03:46 2003 Subject: [Expat-discuss] getting offset of attributes In-Reply-To: <005601c2ed67$4950c640$4f5b288a@cadfael> References: <005601c2ed67$4950c640$4f5b288a@cadfael> Message-ID: <15991.17214.5247.475433@grendel.zope.com> Fabio Venuti writes: > by using XML_GetCurrentByteIndex I can get the offset of the > beginning of an element, but how can I get the offset of, say, an > attribute? That's not possible with the current API. It will probably become possible in Expat 3.0; feel free to review the roadmap document: http://www.libexpat.org/dev/roadmap.html (I should probably remove the "PROPOSAL" from the top of that document since the developers seem to all agree that's the way to go.) -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From karl at waclawek.net Tue Mar 18 11:29:05 2003 From: karl at waclawek.net (Karl Waclawek) Date: Tue Mar 18 11:29:17 2003 Subject: [Expat-discuss] getting offset of attributes References: <005601c2ed67$4950c640$4f5b288a@cadfael> <15991.17214.5247.475433@grendel.zope.com> Message-ID: <009c01c2ed6b$7e82a550$9e539696@citkwaclaww2k> > Fabio Venuti writes: > > by using XML_GetCurrentByteIndex I can get the offset of the > > beginning of an element, but how can I get the offset of, say, an > > attribute? > > That's not possible with the current API. It will probably become > possible in Expat 3.0; Only if we switch to reporting of attributes in a streaming fashion. 
This would require changes to the tokenizer, since it does not report attributes as separate events. We have also looked at solving the issues of reporting entity boundaries in attribute values and declarations by inserting marker characters followed by the entity name or a pointer to an entity structure. Reporting the attributes all in one event together with the element start tag seems almost necessary for some situations, like declaring namespace prefixes that are used *before* the associated attribute is encountered. With a streaming model the problem is that the element's qualified name cannot be associated with a URI until the element's attribute that declares the prefix is read. Karl From Michal.Roskanuk at merlin.cz Thu Mar 20 21:43:14 2003 From: Michal.Roskanuk at merlin.cz (Roskanuk Michal) Date: Thu Mar 20 15:43:30 2003 Subject: [Expat-discuss] Terminating expat - idea Message-ID: Hi, I'd just like to share an idea of how to terminate Expat processing for any reason. This question appears periodically here (most asked...); the logical answer is 'use exceptions in C++/Delphi or setjmp/longjmp in C', which is - of course - ok. However, I guess for those who are looking for a platform/compiler/etc. independent and clear way (or are afraid of those mysterious jmps ;-) the following way could be acceptable (in the scope of the C language): Create a struct, for example:

    struct _XData
    {
        int err;                    // my own error number
        int errln;                  // on which line
        char errtxt[MAX_ERR_SIZE];  // optional additional info
        ...
    };  // there is some missing, i know ;-)
    typedef struct _XData XData;

and use it as userData. Main parsing loop:

    ...
    XData x_data;
    x_data.err = 0;
    ...
    XML_SetUserData(parser, &x_data);
    ...
    do
    {
        // get data
        if(XML_Parse(parser, Buff, len, done) == XML_STATUS_ERROR)
        {
            /* handle parser error & bye */
        }
        else if(x_data.err > 0)
        {
            /* handle user error & bye or whatever */
        }
    } while(are_there_some_data);

Then check x_data.err at the beginning of every handler, for example:

    static void start(void *data, const char *el, const char **attr)
    {
        XData *px_data = (XData *) data;
        if(px_data->err > 0) return;  // error occurred somewhere, get out
        ...
    }

and, of course, fill in the error info whenever you need to, followed by 'return' or 'goto CleanExit' or so. This way makes the code longer and a little bit slower, but you're sure that everybody understands it and it is totally portable (to every mind ;-). Such a workaround was posted already; however, somebody probably needs it roasted and with a soup, so I hope this helps. Mike - Get Started With Your Compaq, page 44: - If your computer still doesn't work, ensure that it is - plugged into an electrical outlet and try to turn it on. From karl at waclawek.net Thu Mar 20 15:58:08 2003 From: karl at waclawek.net (Karl Waclawek) Date: Thu Mar 20 15:58:16 2003 Subject: [Expat-discuss] Terminating expat - idea References: Message-ID: <016601c2ef23$694c5820$9e539696@citkwaclaww2k>
> Hi,
> I'd just like to share an idea of how to terminate Expat processing
> for any reason. This question appears periodically here (most
> asked...); the logical answer is 'use exceptions in C++/Delphi
> or setjmp/longjmp in C', which is - of course - ok. However,
> I guess for those who are looking for a platform/compiler/etc.
> independent and clear way (or are afraid of those mysterious
> jmps ;-) the following way could be acceptable
> (in the scope of the C language):
>
> Create a struct, for example:
>
> struct _XData
> {
>     int err;                    // my own error number
>     int errln;                  // on which line
>     char errtxt[MAX_ERR_SIZE];  // optional additional info
>     ...
> };  // there is some missing, i know ;-)
> typedef struct _XData XData;
>
> and use it as userData. Main parsing loop:
>
> ...
> XData x_data;
> x_data.err = 0;
> ...
> XML_SetUserData(parser, &x_data);
> ...
> do
> {
>     // get data
>     if(XML_Parse(parser, Buff, len, done) == XML_STATUS_ERROR)
>     {
>         /* handle parser error & bye */
>     }
>     else if(x_data.err > 0)
>     {
>         /* handle user error & bye or whatever */
>     }
> } while(are_there_some_data);

If I understand this correctly you can only stop parsing on return from XML_Parse (before the next buffer is processed). That means, for large buffers, and especially for documents smaller than the buffer size, this won't really be satisfying. We are actually working on a combined Push/Pull API for Expat versions > 2.0, which will have return codes for the handlers, as it stands right now. Certain return codes will cause Expat to stop parsing. Karl From Stuart.Fulcher at abbeylife.co.uk Fri Mar 21 11:57:41 2003 From: Stuart.Fulcher at abbeylife.co.uk (Fulcher, Stuart) Date: Fri Mar 21 08:15:20 2003 Subject: [Expat-discuss] Newbie Building Expat on Win32 Message-ID: <5436C68F121FD94596EA505E3638570B07870B48@al1rs_s1.abbeylife.co.uk> Hi, I have successfully used expat on a unix development, and now am trying to port my code to a Windows platform. However, I'm having major problems getting expat to build. On Unix I just included the source files with my own to build a static library. Attempting to do the same thing on Windows (having downloaded the win32 expat 1.95.6 source) doesn't work. The actual Visual C++ expat project/workspace all builds fine, but when I pick up the source and add it to my own project, I'm getting problems where the code attempts to reference 'expat_config.h', which I can't find amongst the downloaded files. I can't find any reference to it in the documentation supplied either. What am I doing wrong?! Please help! Stuart
From karl at waclawek.net Fri Mar 21 09:24:54 2003 From: karl at waclawek.net (Karl Waclawek) Date: Fri Mar 21 09:25:04 2003 Subject: [Expat-discuss] Newbie Building Expat on Win32 References: <5436C68F121FD94596EA505E3638570B07870B48@al1rs_s1.abbeylife.co.uk> Message-ID: <001a01c2efb5$a4b04650$9e539696@citkwaclaww2k> > I have successfully used expat on a unix development, and now am trying to > port my code to a Windows platform. However, I'm having major problems > getting expat to build. On Unix I just included the source files with my own > to build a static library. Attempting to do the same thing on Windows > (having downloaded the win32 expat 1.95.6 source) doesn't work. > > The actual Visual C++ expat project/workspace all builds fine, but when I > pick up the source and add it to my own project, I'm getting problems where > the code attempts to reference 'expat_config.h', which I can't find amongst > the downloaded files. I can't find any reference to it in the documentation > supplied either. > > What am I doing wrong?! Please help! Try to define COMPILED_FROM_DSP. If it still doesn't work, check the sample projects (DSP files for MS VC++) for their settings. 
Karl From vapor at arizonagt.org Sat Mar 22 18:37:31 2003 From: vapor at arizonagt.org (Vapor) Date: Sat Mar 22 19:37:40 2003 Subject: [Expat-discuss] using expat in lcc-win32 References: Message-ID: <02d601c2f0d4$6479c4b0$6501a8c0@tnnashjcluff> I have been trying to get lcc-win32 and expat to play nice together, but I fear my inexperience is keeping that from happening. I have copied the lib directory into my projects director (testxml). I have added #include "lib/expat.h". I have also copied libexpat.lib to both the lib directory and to the root project directory. I havent attempted to do anything more than this and just compile the program which previously compiled. I recieved the error d:\.....\testxml.c: d:\........\lib\expat.h: 657 unknown enumeration 'XML_Status' Any insight would be nice as I am quite new to C and Lcc-win32 and am getting very frustrated with it. From karl at waclawek.net Sat Mar 22 22:46:12 2003 From: karl at waclawek.net (Karl Waclawek) Date: Sat Mar 22 22:42:45 2003 Subject: [Expat-discuss] using expat in lcc-win32 References: <02d601c2f0d4$6479c4b0$6501a8c0@tnnashjcluff> Message-ID: <000901c2f0ee$bfd168d0$0207a8c0@karl> > I have been trying to get lcc-win32 and expat to play nice together, but I > fear my inexperience is keeping that from happening. > > I have copied the lib directory into my projects director (testxml). I have > added #include "lib/expat.h". I have also copied libexpat.lib to both the > lib directory and to the root project directory. > > I havent attempted to do anything more than this and just compile the > program which previously compiled. I recieved the error d:\.....\testxml.c: > d:\........\lib\expat.h: 657 unknown enumeration 'XML_Status' > > Any insight would be nice as I am quite new to C and Lcc-win32 and am > getting very frustrated with it. Check out expat.h from CVS. Some compilers have problems with the version used in Expat 1.95.6. 
Karl From klersun at yahoo.com Mon Mar 24 05:17:16 2003 From: klersun at yahoo.com (Levent Ersun) Date: Mon Mar 24 08:17:18 2003 Subject: [Expat-discuss] example please... Message-ID: <20030324131716.60192.qmail@web11806.mail.yahoo.com> can anyone send me an example of parsing an XML file in C...I want to parse the file and store its content in the arrays...please help From dmoore at viefinancial.com Mon Mar 24 09:16:31 2003 From: dmoore at viefinancial.com (Moore, Dave) Date: Mon Mar 24 09:16:36 2003 Subject: [Expat-discuss] Summary of Pull API thoughts Message-ID: <2FE8C75C7A06D4118BB50008C7F7E831DE3D05@EXCHSRV> I've tried to collect the many good ideas provided by Karl Waclawek for the design of a Pull API on top of Expat. Any feedback (especially on omissions) is appreciated! Thanks, Dave
-------------- next part --------------
0.0

This document attempts to capture design decisions and issues with designing a Pull API for the Expat parser.

1.0 Pull API Overview

The Expat Pull API defines a new set of functions which allow a client application to process nodes of an XML document one at a time by "pulling" them in order via calls to a "Next()" function.

The pull API will be built "on top" of the existing API in the sense that its implementation will include callbacks for startElementHandler, endElementHandler, etc. These built in handlers will manage the state necessary to return nodes to the caller of Next(). Next() will also ask the client application for more raw data as necessary via a "GetNextBuffer" callback.

A side effect of the pull API implementation is that it provides a graceful means of interrupting XML parsing from inside a callback, even when operating in "push" mode.

2.0 Pull API Specifics

/*
 * XML_GetBufferHandler
 *
 * Callback prototype used by PULL api to ask the user for more data.
 *
 * XML_SetGetBufferHandler
 *
 */
typedef void(*XML_GetBufferHandler)(XML_Parser parser,
                                    char **bufferPtr, int *bufferLen, int *isFinal);

XMLPARSEAPI(void)
XML_SetGetBufferHandler(XML_Parser p, XML_GetBufferHandler handler);

New functions:

/* XML_PullParserCreate
 *
 * Creates a parser based on encoding, installs internal callbacks for
 * implementing Pull based API
 */
XML_Parser XML_PullParserCreate(const XML_Char *encoding);

/* XML_Next
 *
 * Does enough parsing to process the next node of the document.
 * Pulls in new data if current buffer is exhausted.
 *
 */
enum XML_Status XML_Next(XML_Parser p);

Higher level XML_Next() functions are possible which allow the parser to move ahead to specific nodes in the document. For example, move to the next element, or move to the next element named "Foo".

enum XML_Status XML_Next_______(XML_Parser p,...);

/* XML_GetNodeInfo
 *
 * Returns a pointer to a XML_NodeInfo structure containing name, character data,
 * etc. information about the current node
 *
 */
XML_NodeInfo *XML_GetNodeInfo(XML_Parser p);

struct XML_NodeInfo {
    XML_NodeType type;
    char *name;
    char *data;
    int data_len;
    /* More Here... */
};

3.0 Built-in callbacks and internal callbacks

Callbacks will return "handling instructions" to the main processing loop. The handling instructions are one of:

XML_SKIP - Do not store node information in XML_NodeInfo, but continue parsing. This is the default, and is compatible with existing parsing code.

XML_USE - Store node information in XML_NodeInfo, and return to the caller of the parsing function.

XML_STOP - Do not store node information in XML_NodeInfo, and return to the caller of the parsing function. Parsing is resumable.

XML_ERR - Do not store node information in XML_NodeInfo, and return a status of "XML_STATUS_ERROR" to the caller. XML_GetErrorCode will return "XML_USER_ERROR". Parsing is resumable.

Internal callbacks will be automatically installed for some (most? all?) of expat's callbacks.
StartElementHandler and EndElementHandler will certainly be among the set of internal callbacks. These callbacks will capture node information and will return XML_SKIP or XML_USE depending on the current call to the XML_Next() family of functions.

For simple pull parsing, the user will not have to provide any callbacks except the 'GetNextBuffer' callback.

User callbacks can still be installed in conjunction with the internal callbacks above for advanced processing. The internal callbacks will perform their own processing as appropriate, and then forward the call to the user callback. The user callback could then apply additional processing. For instance, the user callback could convert a "XML_USE" state to "XML_SKIP" for a node not of interest to an application.

4.0 Existing API Issues/Changes

Some other future changes for expat may impact the design & implementation of pull parsing.

4.1 Namespace reporting

Expat may return names as separate localName, prefix and uri parameters. If callback signatures are changing already to pass namespace information, we can have more options in extending callbacks to return our node handling codes.

4.2 Entity expansion

Due to the nature of reporting attribute values as one chunk of data, entities within the value are silently expanded. The same applies to parameter entities in entity values. If entities are expanded differently, we may have to add new node types to allow a pull based parser to move through attribute values.

4.3 Additional internal changes

The XML_NodeInfo structure will probably be added as an internal data structure that is filled as each node is reached.

Additional callback function pointers are needed for forwarding from internal pull callbacks to client callbacks.

5.0 Buffer Management

When the parser calls the user's XML_GetBufferHandler, it expects the user to return a pointer to buffer space containing new data.
The user may accomplish this by allocating a new buffer via XML_GetBuffer and then copying/reading data into that buffer, or by referring to one of its own buffers. If it refers to its own buffer, a client application must ensure that this buffer remains valid until the next call to XML_GetBufferHandler or until parsing is complete.

The parser must be able to determine when the user provided buffer was created via XML_GetBuffer. If the buffer was -not- created by XML_GetBuffer, then the parser will be responsible for preserving any partial tokens from the last buffer and prepending them to a user buffer when parsing continues.

The main processing loops request more data by returning a new error code, XML_ERROR_BUFFER, to their caller (i.e. XML_Next()).

"Push mode" buffer management could be changed to work more like the pull model (using the XML_GetBufferHandler() callback instead of the

6.0 Internal Complications

When XML_USE is returned and parsing is interrupted, some cleanup may be deferred until the next call to XML_Next(). It looks as if in most cases the cleanup after a callback is quite simple. E.g. for startElementHandler (non-empty element) it is poolClear(&tempPool); for the endElementHandler it is a while loop dealing with namespace bindings. All we need to do is to have a switch statement at the entry of XML_Next() that performs the cleanup depending on how we returned on the last call (assuming the cleanup was skipped when XML_USE was returned).

For an empty element, XML_Next is tricky, since Expat assumes both callbacks (start and end) to happen without interruption by an end of buffer.

Options:

- We add an EMPTY_TAG node type. Internal callbacks for StartElementHandler and EndElementHandler would somehow detect the empty tag and set up the XML_NodeInfo appropriately.

- We set up an emptyTagProcessor, so that the next call to XML_Next() will call this as the processor, which will return with the empty element's tag name.
That's a common way to handle parser state in Expat. From karl at waclawek.net Mon Mar 24 09:56:09 2003 From: karl at waclawek.net (Karl Waclawek) Date: Mon Mar 24 09:57:39 2003 Subject: [Expat-discuss] Re: Summary of Pull API thoughts References: <2FE8C75C7A06D4118BB50008C7F7E831DE3D05@EXCHSRV> Message-ID: <001501c2f215$8166ace0$9e539696@citkwaclaww2k> > Any feedback (especially on omissions) is appreciated! Thanks, Dave for an excellent summary of our discussions! I'd like to comment some more on a few things of the topics mentioned: > User callbacks can still be installed in conjunction with the internal > callbacks above for advanced processing. The internal callbacks will > perform their own processing as appropriate, and then forward the call > to the user callback. The user callback could then apply additional > processing. For instance, the user callback could convert a "XML_USE" > state to "XML_SKIP" for a node not of interest to an application. I'd say the user callback could be called first, before the internal callback performs its own work, since it is more efficient to know if a node is skipped *before* the internal call-back does any work on it. Also, the internal call-back does not generate any new info that the user call-back requires. > For an empty element, XML_Next is tricky, since Expat assumes > both callbacks (start and end) to happen without interruption by and end of buffer. > > Options: > > - We add an EMPTY_TAG node type. Internal callbacks for > StartElementHandler and EndElementHandler would somehow detect the empty > tag and set up the XML_NodeInfo appropriately. Actually, the Expat tokenizer already provides that information, so we just need to store it somewhere for the call-backs to access. 
There is one more advantage to an EMPTY_TAG node type: for an application that would like to know this information, it is quite an effort to re-construct it (set a flag in startElementHandler, and install every possible handler that might get called, so that it can reset that flag before the endElementhandler is executed). Providing this info from Expat comes essentially for free, since the tokenizer provides it anyway. - One more internal complication: So far we have detected one internal data structure that is stack based, the linked list openInternalEntities. Since it will never exist across calls to XML_Parse(Buffer) this is currently not a problem. With a Pull API however, the parser may stop, that is, return from XML_Next(), in the middle of an internal entity, thus unwinding the stack and destroying openInternalEntities prematurely. Therefore we need to switch this linked list to being heap allocated. Karl From karl at waclawek.net Mon Mar 24 10:05:25 2003 From: karl at waclawek.net (Karl Waclawek) Date: Mon Mar 24 10:07:00 2003 Subject: [Expat-discuss] example please... References: <20030324131716.60192.qmail@web11806.mail.yahoo.com> Message-ID: <004701c2f216$cc6b94c0$9e539696@citkwaclaww2k> > can anyone send me an example of parsing an XML file in C...I want to parse the file and store its content in the arrays...please help The Expat package contains the elements, outline and xmlwf sample applications. Karl From vapor at arizonagt.org Mon Mar 24 16:29:52 2003 From: vapor at arizonagt.org (Vapor) Date: Mon Mar 24 17:30:01 2003 Subject: [Expat-discuss] using expat in lcc-win32 References: <02d601c2f0d4$6479c4b0$6501a8c0@tnnashjcluff> <000901c2f0ee$bfd168d0$0207a8c0@karl> Message-ID: <003601c2f254$e46d2cd0$a50c2e0a@tnnashjcluff> Thx for the reply. I have updated it but I think my problem is related to my inexperience with lcc-win32. Does anybody have a demo project use with lcc-win32. I cant seem to get it to link with the static libs at all. 
Again, I am fairly inexperienced with C and many of the compilers available out there. I have tried the comp.compilers.lcc newsgroup for help with this and can't seem to get anywhere with that as well. Basically what I am hoping for is a simple, easy-to-understand example with lcc-win32 where I can see all of the options in the project and see a program compile successfully using the expat libs. Vapor ----- Original Message ----- From: "Karl Waclawek" To: "Vapor" ; Sent: Saturday, March 22, 2003 9:46 PM Subject: Re: [Expat-discuss] using expat in lcc-win32 > > > > I have been trying to get lcc-win32 and expat to play nice together, but I > > fear my inexperience is keeping that from happening. > > > > I have copied the lib directory into my project's directory (testxml). I have > > added #include "lib/expat.h". I have also copied libexpat.lib to both the > > lib directory and to the root project directory. > > > > I haven't attempted to do anything more than this and just compile the > > program which previously compiled. I received the error d:\.....\testxml.c: > > d:\........\lib\expat.h: 657 unknown enumeration 'XML_Status' > > > > Any insight would be nice as I am quite new to C and Lcc-win32 and am > > getting very frustrated with it. > > Check out expat.h from CVS. Some compilers have problems with the > version used in Expat 1.95.6. > > Karl From karl at waclawek.net Mon Mar 24 20:51:25 2003 From: karl at waclawek.net (Karl Waclawek) Date: Mon Mar 24 20:47:55 2003 Subject: [Expat-discuss] using expat in lcc-win32 References: <02d601c2f0d4$6479c4b0$6501a8c0@tnnashjcluff> <000901c2f0ee$bfd168d0$0207a8c0@karl> <003601c2f254$e46d2cd0$a50c2e0a@tnnashjcluff> Message-ID: <000a01c2f271$0d728370$0207a8c0@karl> > Thx for the reply. I have updated it but I think my problem is related to > my inexperience with lcc-win32. Does anybody have a demo project for use with > lcc-win32? I can't seem to get it to link with the static libs at all. For MS VC++ you define XML_STATIC.
Is lcc different? Check the top of expat.h. Karl From xcross at us.ibm.com Tue Mar 25 09:55:17 2003 From: xcross at us.ibm.com (Chris Cross) Date: Tue Mar 25 09:56:52 2003 Subject: [Expat-discuss] Re: Summary of Pull API thoughts Message-ID: I know this is off-topic, but while you have your design juices flowing I thought I'd ask a question that we're struggling with in my group. My group is using expat to implement a VoiceXML processor and a multimodal browser using the XHTML + Voice proposal in the W3C (http://www.w3.org/TR/xhtml+voice/). We're also considering its use by the grammar compiler in our speech engines in order to support the Speech Recognition Grammar Specification (http://www.w3.org/TR/speech-grammar/). However, we have a sticky requirement for our language support to be able to switch dynamically between 1-byte ASCII and 2-byte Unicode. By doing this, our European languages require half the space of the Asian languages, which is very important for our embedded customers who squeal at every kilobyte we consume. How hard would it be to change the character size from a build-time to a run-time decision in expat? thanks, chris Chris Cross IBM Boca Raton xcross@us.ibm.com voice 561.862.2102 t/l 975.2102 fax 561.862.3922 From karl at waclawek.net Tue Mar 25 11:35:34 2003 From: karl at waclawek.net (Karl Waclawek) Date: Tue Mar 25 11:35:43 2003 Subject: [Expat-discuss] Re: Summary of Pull API thoughts References: Message-ID: <002701c2f2ec$8f3df080$9e539696@citkwaclaww2k> > I know this is off-topic, but while you have your design juices flowing I > thought I'd ask a question that we're struggling with in my group. > > My group is using expat to implement a VoiceXML processor and a multimodal > browser using the XHTML + Voice proposal in the W3C > (http://www.w3.org/TR/xhtml+voice/).
We're also considering its use by the > grammar compiler in our speech engines in order to support the Speech > Recognition Grammar Specification (http://www.w3.org/TR/speech-grammar/) Cool. Which version of Expat? > However, we have a sticky requirement for our language support to be able > to switch dynamically between 1 byte ascii and 2 byte Unicode. By doing > this, our European languages require half the space of the Asian languages, > which is very important for our embedded customers who sqeal every kilobyte > we consume. > > How hard would it be to change the character size from a build-time to a > run-time decision in expat? It looks hard, since even the API itself is statically tied to the definition of XML_Char. However, you should be able to compile two libraries (XML_Char defined as char or wchar_t), and dynamically load whichever you need at runtime, and even switch between them. Why would that not work for you? Karl From fdrake at acm.org Tue Mar 25 11:54:53 2003 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Tue Mar 25 11:55:38 2003 Subject: [Expat-discuss] Re: Summary of Pull API thoughts In-Reply-To: References: <002701c2f2ec$8f3df080$9e539696@citkwaclaww2k> Message-ID: <16000.35293.37806.326224@grendel.zope.com> Chris Cross writes: > However, we have a sticky requirement for our language support to be able > to switch dynamically between 1 byte ascii and 2 byte Unicode. By doing > this, our European languages require half the space of the Asian languages, > which is very important for our embedded customers who sqeal every kilobyte > we consume. > > How hard would it be to change the character size from a build-time to a > run-time decision in expat? Karl Waclawek writes: > It looks hard, since even the API itself is statically tied to the > definition of XML_Char. 
> > However, you should be able to compile two libraries (XML_Char defined > as char or wchar_t), and dynamically load whichever you need at runtime, > and even switch between them. Why would that not work for you? That sounds fairly tedious to me. Recall that Expat tends to report data in fairly small chunks for typical applications. Even in plain text (PCDATA), Expat breaks data at line boundaries. If you want further control over the amount of data reported in the character data callback, limit the amount of data passed into Expat for any XML_Parse() or XML_ParseBuffer() call. This can be used to an application's advantage, especially if there's concern for the amount of memory being consumed. Compile Expat with the appropriate output encoding for your primary audience, and then re-encode if necessary in the application logic. This should be easy to implement and allows support for output encodings other than UTF-8 or UTF-16. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From karl at waclawek.net Tue Mar 25 12:08:47 2003 From: karl at waclawek.net (Karl Waclawek) Date: Tue Mar 25 12:08:56 2003 Subject: [Expat-discuss] Re: Summary of Pull API thoughts References: <002701c2f2ec$8f3df080$9e539696@citkwaclaww2k> <16000.35293.37806.326224@grendel.zope.com> Message-ID: <003801c2f2f1$32c5a910$9e539696@citkwaclaww2k> > Karl Waclawek writes: > > It looks hard, since even the API itself is statically tied to the > > definition of XML_Char. > > > > However, you should be able to compile two libraries (XML_Char defined > > as char or wchar_t), and dynamically load whichever you need at runtime, > > and even switch between them. Why would that not work for you? > > That sounds fairly tedious to me. But the effort to load a library and set up the pointers to the exported functions (that's at least how it works in Windows) is a one-time effort that can be hidden in a re-usable function or class. 
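Fred's alternative above (one Expat build, re-encoding in the application) can be sketched with Python's standard `xml.parsers.expat` binding, chosen here only because it runs without a C build. This is a rough illustration under assumptions: `collect_text` is a hypothetical helper, and the Python binding hands the handler `str` objects, whereas a C handler receives `XML_Char *` data and a length.

```python
import xml.parsers.expat


def collect_text(xml_bytes, target_encoding):
    """Parse with Expat and re-encode each reported text chunk at run time."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = False  # keep the chunks as Expat reports them

    def on_text(data):
        # The application, not the parser, decides the output character size.
        chunks.append(data.encode(target_encoding))

    parser.CharacterDataHandler = on_text
    parser.Parse(xml_bytes, True)
    return chunks


doc = b"<p>line one\nline two</p>"
narrow = collect_text(doc, "latin-1")    # 1 byte per character
wide = collect_text(doc, "utf-16-le")    # 2 bytes per character for this text
```

Each list may hold several small chunks rather than one string, so the re-encoding step only ever touches a little data at a time. Whether the extra CPU cost is acceptable on an embedded target is a separate question this sketch does not answer.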
> Recall that Expat tends to report data in fairly small chunks for > typical applications. Even in plain text (PCDATA), Expat breaks data > at line boundaries. If you want further control over the amount of > data reported in the character data callback, limit the amount of data > passed into Expat for any XML_Parse() or XML_ParseBuffer() call. > > This can be used to an application's advantage, especially if there's > concern for the amount of memory being consumed. Compile Expat with > the appropriate output encoding for your primary audience, and then > re-encode if necessary in the application logic. This should be easy > to implement and allows support for output encodings other than UTF-8 > or UTF-16. This would work, of course. But maybe on embedded devices CPU cycles are at a premium. Depends on Chris' circumstances. Karl From xcross at us.ibm.com Tue Mar 25 12:58:44 2003 From: xcross at us.ibm.com (Chris Cross) Date: Tue Mar 25 12:58:59 2003 Subject: [Expat-discuss] Re: Summary of Pull API thoughts Message-ID: Well, the answer is maybe not so cool. We're using 1.2 with Opera, but planning on jumping to your new version if/when we start using it elsewhere in the product. I'd thought of the brute-force approach of two DLLs but that looks bad in an embedded stack.
On the other hand, if you're switching between Japanese and English at runtime, 88KB for the XML parser is the least of your worries ;-) thanks, chris Chris Cross IBM Boca Raton xcross@us.ibm.com voice 561.862.2102 t/l 975.2102 fax 561.862.3922
From karl at waclawek.net Tue Mar 25 13:28:11 2003 From: karl at waclawek.net (Karl Waclawek) Date: Tue Mar 25 13:28:20 2003 Subject: [Expat-discuss] Re: Summary of Pull API thoughts References: Message-ID: <005001c2f2fc$4a4288f0$9e539696@citkwaclaww2k> > Well the answer is maybe not so cool. We're using 1.2 with Opera, but > planning on jumping to your new version if/when we start using it elsewhere > in the product. Are you saying that UTF-16 output worked with version 1.2? > I'd thought of the brute force appoach of two dll's but that looks bad in > an embedded stack. I have to admit that I am not experienced with the constraints of "embedded" programming. Could you enlighten me why this would be bad? You wouldn't need to keep them both loaded at the same time, unless you need to use both capabilities simultaneously, in which case Fred's approach would make a lot more sense. > On the other hand, if you're switching between Japanese > and English at runtime, 88KB for the XML parser is the least of your > worries ;-) Well, Expat 1.95.6 is bigger than 1.2, hopefully not too big. Karl From vapor at arizonagt.org Tue Mar 25 16:33:03 2003 From: vapor at arizonagt.org (Vapor) Date: Tue Mar 25 17:33:13 2003 Subject: [Expat-discuss] using expat in lcc-win32 References: <02d601c2f0d4$6479c4b0$6501a8c0@tnnashjcluff> <000901c2f0ee$bfd168d0$0207a8c0@karl> <003601c2f254$e46d2cd0$a50c2e0a@tnnashjcluff> <000a01c2f271$0d728370$0207a8c0@karl> Message-ID: <006501c2f31e$80ab17a0$a50c2e0a@tnnashjcluff> Thanks a lot karl, the define helped. I found the spot in lcc that allows for defines in the project and added XML_STATIC. I was able to once again compile my program. The problem I get now is it seems unable to find a reference to XML_ParserCreate. Next I decided to add a few simple lines of code and I broke it again. Is there a reason why just adding XML_Parser parser = XML_ParserCreate(NULL); to my code would cause this.
I have a suspicion that the libexpat.lib isnt properly added to the project or maybe I need some other files from expat officially added to the project but I dont really know. I am not familiar at all with working outside of a single soure file. Vapor From vapor at arizonagt.org Tue Mar 25 18:52:15 2003 From: vapor at arizonagt.org (Vapor) Date: Wed Mar 26 08:59:42 2003 Subject: [Expat-discuss] using expat in lcc-win32 References: <02d601c2f0d4$6479c4b0$6501a8c0@tnnashjcluff><000901c2f0ee$bfd168d0$0207a8c0@karl><003601c2f254$e46d2cd0$a50c2e0a@tnnashjcluff><000a01c2f271$0d728370$0207a8c0@karl> <006501c2f31e$80ab17a0$a50c2e0a@tnnashjcluff> Message-ID: <004201c2f34e$db05bc20$6501a8c0@tnnashjcluff> um. nm, I got it figured out. Again thanks for the help with that XML_STATIC, I am not all that familiar with preprocessor commands and that helped me out alot. Vapor
From xcross at us.ibm.com Wed Mar 26 11:11:41 2003 From: xcross at us.ibm.com (Chris Cross) Date: Wed Mar 26 11:12:33 2003 Subject: [Expat-discuss] Re: Summary of Pull API thoughts Message-ID: Actually, we're having to do a flavor of this anyway on Linux, where the wchar.h Unicode interface is 4 bytes while Opera is still 2. Oh by the way, so is almost everyone else in the world but we wrote our VoiceXML processor to wchar.h. We'll probably be moving to an internal Unicode implementation that will free us from wchar definition... thanks, chris Chris Cross IBM Boca Raton xcross@us.ibm.com voice 561.862.2102 t/l 975.2102 fax 561.862.3922 From karl at waclawek.net Wed Mar 26 11:18:59 2003 From: karl at waclawek.net (Karl Waclawek) Date: Wed Mar 26 11:19:39 2003 Subject: [Expat-discuss] Re: Summary of Pull API thoughts References: Message-ID: <004d01c2f3b3$687c5730$9e539696@citkwaclaww2k> > Actually, we're having to do a flavor of this anyway on Linux, where the > wchar.h Unicode interface is 4 bytes while Opera is still 2. What compiler are you using? gcc on Linux allows a switch to make wchar_t 2 bytes instead of 4. That is how Expat is compiled for UTF-16 output. Karl From xcross at us.ibm.com Wed Mar 26 12:22:28 2003 From: xcross at us.ibm.com (Chris Cross) Date: Wed Mar 26 12:48:59 2003 Subject: [Expat-discuss] Re: Summary of Pull API thoughts Message-ID: We're using gcc 2.95.3 (a windows cross-compiler build) for our Linux targets. I'm a little shaky on this subject but I think that the switch you're referring to is supported in gcc 3.x, right? chris Chris Cross IBM Boca Raton xcross@us.ibm.com voice 561.862.2102 t/l 975.2102 fax 561.862.3922
From karl at waclawek.net Wed Mar 26 12:56:46 2003 From: karl at waclawek.net (Karl Waclawek) Date: Wed Mar 26 12:56:58 2003 Subject: [Expat-discuss] Re: Summary of Pull API thoughts References: Message-ID: <005f01c2f3c1$118c4bc0$9e539696@citkwaclaww2k> > We're using gcc 2.95.3 (a windows cross-compiler build) for our Linux > targets. I'm a little shaky on this subject but I think that the switch > you're referring to is supported in gcc 3.x, right? No, it should work on earlier versions too. We have been using it for a while. The switch is -fshort-wchar . Check the README in the current distribution. Karl From dr at netscape.com Wed Mar 26 15:34:13 2003 From: dr at netscape.com (Dan Rosen) Date: Wed Mar 26 18:34:05 2003 Subject: [Expat-discuss] XML_Parse const char *s param? Message-ID: <3E8238F5.8070609@netscape.com> I have a question about the XML_Parse function, which I unfortunately couldn't find an answer to in the documentation: XML_Parse() takes a |const char *s| parameter, which is a buffer containing some or all of a document. This is notably different from the callback functions which take |XML_Char*|. That sort of indicates to me that you can only pass in a buffer of single-byte characters (whether those are UTF-8, ISO-8859-1 or whatever, they're all |char*|). So my question is, what if I have my data in UCS-2? Do I have to convert it all down to UTF-8 before passing it into XML_Parse? I tried simply casting the wchar_t* buffer I have down to a char* buffer, but that obviously didn't work... Can somebody please explain to me exactly what XML_Parse() expects of a buffer that's passed to it? Thanks, Dan From karl at waclawek.net Wed Mar 26 19:13:22 2003 From: karl at waclawek.net (Karl Waclawek) Date: Wed Mar 26 19:09:42 2003 Subject: [Expat-discuss] XML_Parse const char *s param?
References: <3E8238F5.8070609@netscape.com> Message-ID: <000701c2f3f5$addf60f0$0207a8c0@karl> > I have a question about the XML_Parse function, which I unfortunately > couldn't find an answer to in the documentation: > > XML_Parse() takes a |const char *s| parameter, which is a buffer > containing some or all of a document. This is notably different from the > callback functions which take |XML_Char*|. That sort of indicates to me > that you can only pass in a buffer of single-byte characters (whether > those are UTF-8, ISO-8859-1 or whatever, they're all |char*|). > > So my question is, what if I have my data in UCS-2? Do I have to convert > it all down to UTF-8 before passing it into XML_Parse? I tried simply > casting the wchar_t* buffer I have down to a char* buffer, but that > obviously didn't work... Expat expects a buffer of bytes, that is all. Is there a better way to declare such a buffer? > Can somebody please explain to me exactly what XML_Parse() expects of a > buffer that's passed to it? Just a buffer of bytes, regardless of encoding. Expat handles buffer boundaries properly, so even if a buffer ends in the middle of a multi-byte character it is not a problem. Karl
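Karl's point, that the input is just bytes and that buffer boundaries may fall anywhere, can be tried out with Python's standard `xml.parsers.expat` binding. The sketch below is illustrative only: `parse_in_chunks` is a hypothetical helper, the chunk sizes are deliberately awkward so that multi-byte characters get split across calls, and the UTF-16 document carries a byte order mark on the assumption that Expat uses it to detect the encoding.

```python
import xml.parsers.expat


def parse_in_chunks(data, chunk_size):
    """Feed raw bytes to Expat chunk_size bytes at a time; return the text seen."""
    pieces = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = pieces.append
    for start in range(0, len(data), chunk_size):
        # A chunk may end in the middle of a multi-byte character.
        parser.Parse(data[start:start + chunk_size], False)
    parser.Parse(b"", True)  # signal the end of the document
    return "".join(pieces)


text = "caf\u00e9 \u65e5\u672c"
utf8_doc = ("<doc>" + text + "</doc>").encode("utf-8")
# Byte order mark first, then UTF-16 little-endian code units.
utf16_doc = ("\ufeff<doc>" + text + "</doc>").encode("utf-16-le")

from_utf8 = parse_in_chunks(utf8_doc, 1)    # one byte per call
from_utf16 = parse_in_chunks(utf16_doc, 3)  # odd size splits 2-byte units
```

In C the same idea applies to `XML_Parse()`: the `len` argument counts bytes, not characters, so a wide-character buffer has to be passed with its size in bytes.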