From Paul.Madden at six-group.com Thu Mar 4 13:13:26 2010
From: Paul.Madden at six-group.com (Madden, Paul)
Date: Thu, 4 Mar 2010 13:13:26 +0100
Subject: [Expat-discuss] Heh guys and gals ...
Message-ID:

I am processing an XHTML document with expat. All goes fine til I hit an "&nbsp;" entity. The expat terminates the parse with error "undefined entity".

Code starts something like:

    parser = XML_ParserCreate ((const XML_Char *) "UTF-8");
    XML_SetElementHandler (parser, FFP_NewsStartElementHandler, FFP_NewsEndElementHandler);
    XML_SetCharacterDataHandler (parser, FFP_NewsCharacterDataHandler);
    XML_SetDefaultHandler (parser, FFP_DefaultHandler);

I was hoping XML_SetDefaultHandler would (somehow) trap this unhandled token, but alas it never enters there (for &nbsp;) before XML_Parse terminates. If I can handle nbsp, I can handle any other non-standard XML entities.

Can anyone help with this one? If so, beer or similar would gladly be bought if and whenever the fixer happens to come to Zürich.

Thanks everyone (ever the optimist).

Paul

From fdrake at acm.org Fri Mar 5 05:12:37 2010
From: fdrake at acm.org (Fred Drake)
Date: Thu, 4 Mar 2010 23:12:37 -0500
Subject: [Expat-discuss] Heh guys and gals ...
In-Reply-To:
References:
Message-ID: <9cee7ab81003042012i7fe171f6t7af0edf2d64375ed@mail.gmail.com>

On Thu, Mar 4, 2010 at 7:13 AM, Madden, Paul wrote:
> I am processing an XHTML document with expat. All goes fine til I hit an "&nbsp;" entity. The expat terminates the parse with error "undefined entity".

This is expected. The nbsp entity is defined in the XHTML document type, and is not defined by the XML specification. If you're not parsing the XHTML document type, this can't be parsed.
If you control the input data, you could use a reference to the Unicode character itself instead of the HTML-centric entity: "&#160;" would be appropriate markup.

Alternatively, you could register a handler to parse external entities (including the XHTML DTD) if references are provided from the document. The XML_UseForeignDTD API can be used to load a DTD if the document doesn't include an explicit reference to a DTD.

  -Fred

--
Fred L. Drake, Jr.
"Chaos is the score upon which reality is written." --Henry Miller

From fdrake at acm.org Wed Mar 24 17:09:29 2010
From: fdrake at acm.org (Fred Drake)
Date: Wed, 24 Mar 2010 12:09:29 -0400
Subject: [Expat-discuss] Expat on 64 bit Linux
In-Reply-To: <201002050728.04544.jeremy.kloth@gmail.com>
References: <2D9F6906-A811-4567-B0D6-1C0CA53B2C35@gmail.com> <4B6C1A27.9070906@waclawek.net> <201002050728.04544.jeremy.kloth@gmail.com>
Message-ID: <9cee7ab81003240909i6b5e272drce88cb9f8d75bbf8@mail.gmail.com>

On Fri, Feb 5, 2010 at 10:28 AM, Jeremy Kloth wrote:
> It was done to allow Expat output to be mapped directly to Python's unicode
> objects (which can be either UCS-2 or UCS-4).
>
> If desired, I can produce the patches required to add that support to the
> Expat mainline.

Hey Jeremy! Would the output type be controlled at compile time or at run time?

This definitely is interesting to me. Do you also have a patched pyexpat that consumes the new output, or are you using a new Python extension to use this?

  -Fred

From jeremy at omsys.com Sun Mar 28 23:37:37 2010
From: jeremy at omsys.com (Jeremy H. Griffith)
Date: Sun, 28 Mar 2010 14:37:37 -0700
Subject: [Expat-discuss] Catalog resolution?
Message-ID: <31ivq5h9o5dgh86sdh9be84gvke2vv9uth@4ax.com>

I'm currently using expat to parse DITA documents by using XML_SetExternalEntityRefHandler to look at the system identifier, create a reference to a local dir, and parse it using XML_ExternalEntityParserCreate.
(Shorthand version; I can post the code if necessary.)

Now we need to support specializations, some of which use XML Catalogs. I don't see a catalog resolver in expat; am I missing something? Any ideas on how to add catalog resolution without using Java or any GPL software?

Thanks!

-- Jeremy H. Griffith, at Omni Systems Inc.
   http://www.omsys.com/

From karl at waclawek.net Mon Mar 29 16:32:07 2010
From: karl at waclawek.net (Karl Waclawek)
Date: Mon, 29 Mar 2010 10:32:07 -0400
Subject: [Expat-discuss] Catalog resolution?
In-Reply-To: <31ivq5h9o5dgh86sdh9be84gvke2vv9uth@4ax.com>
References: <31ivq5h9o5dgh86sdh9be84gvke2vv9uth@4ax.com>
Message-ID: <4BB0B9E7.1030601@waclawek.net>

On 28/03/2010 5:37 PM, Jeremy H. Griffith wrote:
>
> I'm currently using expat to parse DITA documents by
> using XML_SetExternalEntityRefHandler to look at the
> system identifier, create a reference to a local dir,
> and parse it using XML_ExternalEntityParserCreate.
> (Shorthand version; I can post the code if necessary.)
>
> Now we need to support specializations, some of which
> use XML Catalogs. I don't see a catalog resolver in
> expat; am I missing something? Any ideas on how to
> add catalog resolution without using Java or any GPL
> software?

My understanding is that a catalog simply maps public or system ids to locally reachable URIs. So, you can still use the external entity resolver (as you described above), and within that resolver you perform the mapping.

Sometime before you start parsing, you need to load the catalog into a hash table, for instance, by parsing it like any XML document. The code should be very simple and boilerplate - and that is probably why nobody has created a public library for it. You should be able to create a re-usable module for it yourself.
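
The mapping Karl describes could be sketched roughly as below. This is only an illustration under stated assumptions: the names Catalog, catalog_add and catalog_resolve are hypothetical and not part of Expat, the lookup order (public id first, then system id) is an assumption rather than something the thread specifies, and loading the entries from an actual catalog file is left out.

```c
#include <stdlib.h>
#include <string.h>

#define CATALOG_BUCKETS 64

/* One mapping from a public or system identifier to a local URI.
   All names in this sketch are hypothetical, not Expat API. */
typedef struct CatalogEntry {
    char *id;
    char *uri;
    struct CatalogEntry *next;
} CatalogEntry;

typedef struct {
    CatalogEntry *buckets[CATALOG_BUCKETS];
} Catalog;

/* Simple string hash (djb2 style), reduced to a bucket index. */
static unsigned long catalog_hash(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % CATALOG_BUCKETS;
}

static char *copy_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p)
        memcpy(p, s, n);
    return p;
}

/* Adds a mapping; returns 0 on success, -1 if allocation fails.
   A later entry for the same id shadows an earlier one. */
int catalog_add(Catalog *cat, const char *id, const char *uri)
{
    CatalogEntry *e = malloc(sizeof *e);
    if (!e)
        return -1;
    e->id = copy_string(id);
    e->uri = copy_string(uri);
    if (!e->id || !e->uri) {
        free(e->id);
        free(e->uri);
        free(e);
        return -1;
    }
    unsigned long h = catalog_hash(id);
    e->next = cat->buckets[h];
    cat->buckets[h] = e;
    return 0;
}

/* Looks up the public id first, then the system id (assumed order).
   Either argument may be NULL. Returns NULL if neither is mapped. */
const char *catalog_resolve(const Catalog *cat, const char *publicId,
                            const char *systemId)
{
    const char *ids[2] = { publicId, systemId };
    for (int i = 0; i < 2; i++) {
        if (!ids[i])
            continue;
        const CatalogEntry *e = cat->buckets[catalog_hash(ids[i])];
        for (; e; e = e->next)
            if (strcmp(e->id, ids[i]) == 0)
                return e->uri;
    }
    return NULL;
}

/* Releases every entry and leaves the catalog empty. */
void catalog_free(Catalog *cat)
{
    for (int i = 0; i < CATALOG_BUCKETS; i++) {
        CatalogEntry *e = cat->buckets[i];
        while (e) {
            CatalogEntry *next = e->next;
            free(e->id);
            free(e->uri);
            free(e);
            e = next;
        }
        cat->buckets[i] = NULL;
    }
}
```

Inside an external entity handler, the publicId and systemId arguments that Expat passes in would go to catalog_resolve, and the returned URI (if any) would be opened in place of the original system identifier.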
Karl

From mjkhokhar at gmail.com Wed Mar 31 08:29:48 2010
From: mjkhokhar at gmail.com (Junaid Khokhar)
Date: Wed, 31 Mar 2010 11:29:48 +0500
Subject: [Expat-discuss] Special Characters part of the word
Message-ID:

Hi,

I am trying to parse an XML file with expat. I have different equations and keywords containing special characters ( < , > , & ) e.g. AE. Expat returns a word whenever it finds a word delimiter. It also considers the aforementioned special characters as word delimiters. I need to get the whole equation to match. Is there any way I can specify word delimiters in Expat?

Any help in this regard is highly appreciated. Thanks in advance.

Junaid

From nickmacd at gmail.com Wed Mar 31 23:43:43 2010
From: nickmacd at gmail.com (Nick MacDonald)
Date: Wed, 31 Mar 2010 17:43:43 -0400
Subject: [Expat-discuss] Special Characters part of the word
In-Reply-To:
References:
Message-ID:

Junaid:

eXpat is performing precisely as it should in this respect, and I would advise you that you are attempting to parse invalid (malformed) XML files. There are a number of characters that are reserved in XML, most particularly the '<' and the '&', that you are no doubt trying to use incorrectly. These characters need to be escaped in XML, such as &lt; and &amp;. Please check the XML spec at w3c.org for more details.

Additionally, eXpat does NOT guarantee how much data will be passed in each call via one of its callbacks, and if the string is being broken up by eXpat, you will need to write code to put it into your own buffer as the pieces arrive. There are completely logical reasons why this would be necessary, such as when a body of text is broken up with tags in the middle:

    some body text and yet more body text

As you can see, if you expected to have in a buffer the contents "some body text and yet more body text" then you would need to do the concatenation all on your own... or else potentially use a DOM parser rather than a SAX parser like eXpat.
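
The do-it-yourself concatenation Nick mentions might look something like the sketch below. It is a minimal illustration, not code from the thread: TextBuffer and collect_text are hypothetical names, and the function only mirrors the (userData, s, len) shape of an Expat character data handler. In a real program it would presumably be declared with XML_Char and XMLCALL from expat.h, which are omitted here so the sketch compiles on its own.

```c
#include <stdlib.h>
#include <string.h>

/* Growable buffer that collects the pieces delivered by the parser.
   Hypothetical helper type, not part of Expat. */
typedef struct {
    char  *data;    /* NUL-terminated by collect_text once non-NULL */
    size_t len;     /* number of bytes collected, excluding the NUL */
    size_t cap;     /* allocated size of data */
    int    failed;  /* set to 1 if an allocation failed */
} TextBuffer;

/* Appends exactly len bytes from s; s is NOT assumed to be
   NUL-terminated, so only the given length is ever read. */
void collect_text(void *userData, const char *s, int len)
{
    TextBuffer *buf = userData;
    if (buf->failed || len <= 0)
        return;

    size_t need = buf->len + (size_t)len + 1;   /* +1 for our own NUL */
    if (need > buf->cap) {
        size_t cap = buf->cap ? buf->cap : 64;
        while (cap < need)
            cap *= 2;
        char *p = realloc(buf->data, cap);
        if (!p) {
            buf->failed = 1;
            return;
        }
        buf->data = p;
        buf->cap = cap;
    }
    memcpy(buf->data + buf->len, s, (size_t)len);
    buf->len += (size_t)len;
    buf->data[buf->len] = '\0';
}

/* Empties the buffer but keeps its memory, e.g. at a start tag. */
void text_buffer_reset(TextBuffer *buf)
{
    buf->len = 0;
    if (buf->data)
        buf->data[0] = '\0';
}

/* Releases the memory and returns the buffer to its initial state. */
void text_buffer_free(TextBuffer *buf)
{
    free(buf->data);
    memset(buf, 0, sizeof *buf);
}
```

One plausible way to use it is to reset the buffer when the element of interest starts, let the character data handler append every piece, and read buf.data in the end element handler, by which point all the pieces have arrived.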
Also, please note that in the past it has been a common error for people to assume that eXpat returns buffers that are zero terminated... it does NOT do this... it passes the length, and you are not allowed to use anything outside of the specified length.

Good luck with your project,
  Nick

On Wed, Mar 31, 2010 at 2:29 AM, Junaid Khokhar wrote:
> I am trying to parse xml file with expat. I have different equations and
> keywords containing special characters ( < , > , & ) e.g. AE. Expat
> returns a word whenever it finds a word delimiter. It also considers
> aforementioned special characters as word delimiters. I need to get the
> whole equation to match. Is there any way i can specify word delimiters in
> Expat ?
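
For equations like the ones in the quoted question, the escaping Nick recommends has to happen when the XML file is written, before Expat ever sees it. A small sketch of such a step follows; xml_escape is a hypothetical helper, not an Expat function, and it only covers the three characters named in the question as they appear in element content (attribute values would also need their quote characters escaped).

```c
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated, NUL-terminated copy of the first len
   bytes of text with '&', '<' and '>' replaced by entity references.
   The caller frees the result. Returns NULL if allocation fails.
   Hypothetical helper for illustration, not part of Expat. */
char *xml_escape(const char *text, size_t len)
{
    /* Worst case: every byte becomes "&amp;", which is 5 bytes. */
    char *out = malloc(len * 5 + 1);
    if (!out)
        return NULL;

    char *w = out;
    for (size_t i = 0; i < len; i++) {
        switch (text[i]) {
        case '&': memcpy(w, "&amp;", 5); w += 5; break;
        case '<': memcpy(w, "&lt;", 4);  w += 4; break;
        case '>': memcpy(w, "&gt;", 4);  w += 4; break;
        default:  *w++ = text[i];        break;
        }
    }
    *w = '\0';
    return out;
}
```

When the escaped text is parsed again, Expat hands the character data handler the original characters, so the equation can be matched in its unescaped form once the pieces have been collected.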