From martin at v.loewis.de Fri Nov 2 09:08:38 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Fri, 02 Nov 2007 09:08:38 +0100
Subject: [XML-SIG] PyXML for py 2.5
In-Reply-To: <8B473CE55AB5B34FAAE37EF3BE19EE949E09E6@esebe113.NOE.Nokia.com>
References: <8B473CE55AB5B34FAAE37EF3BE19EE949E09E6@esebe113.NOE.Nokia.com>
Message-ID: <472ADB06.3090907@v.loewis.de>

> Do you have a version of PyXML that works with python version 2.5? If
> not when do you expect it to be available?

PyXML is currently unmaintained. So likely, there won't be any file
releases of it anymore.

Regards,
Martin

From stefan_ml at behnel.de Fri Nov 2 10:01:34 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 02 Nov 2007 10:01:34 +0100
Subject: [XML-SIG] PyXML for py 2.5
In-Reply-To: <472ADB06.3090907@v.loewis.de>
References: <8B473CE55AB5B34FAAE37EF3BE19EE949E09E6@esebe113.NOE.Nokia.com>
 <472ADB06.3090907@v.loewis.de>
Message-ID: <472AE76E.8060305@behnel.de>

Martin v. Löwis wrote:
>> Do you have a version of PyXML that works with python version 2.5? If
>> not when do you expect it to be available?
>
> PyXML is currently unmaintained. So likely, there won't be any file
> releases of it anymore.

BTW, who's responsible for updating the XML-SIG page that the Python homepage
links to behind its prominent "XML" link? I would like to have it updated to
reflect the 'recent' developments regarding ElementTree and lxml, and also
tools like Amara and others. What that site describes is pretty far from what
XML looks like in Python today, and it doesn't help anyone (especially not
newbies) if we keep up appearances here.

Recent posts on this list and on c.l.py show that people who want to solve XML
problems in Python bump into minidom and SAX and then report on the list about
their problems with it. Then people (especially I) tell them to try
ElementTree or lxml, and they come back happily reporting their success and
how much easier it became.
I think there is loads of space for optimisation here.

Stefan

From martin at v.loewis.de Fri Nov 2 10:14:18 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Fri, 02 Nov 2007 10:14:18 +0100
Subject: [XML-SIG] PyXML for py 2.5
In-Reply-To: <472AE76E.8060305@behnel.de>
References: <8B473CE55AB5B34FAAE37EF3BE19EE949E09E6@esebe113.NOE.Nokia.com>
 <472ADB06.3090907@v.loewis.de> <472AE76E.8060305@behnel.de>
Message-ID: <472AEA6A.9040102@v.loewis.de>

> BTW, who's responsible for updating the XML-SIG page that the Python homepage
> links to behind it's prominent "XML" link?

In short: anybody who volunteers.

Regards,
Martin

From martin at v.loewis.de Tue Nov 6 22:06:57 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Tue, 06 Nov 2007 22:06:57 +0100
Subject: [XML-SIG] PyXML setup.py install - MSVC Compile Errors (error
 LNK2019: unresolved external symbol __imp__)
In-Reply-To: <47218DB8.8020402@behnel.de>
References: <17632efd0710251310n5a4a12d4ufcb3258448aaf5bc@mail.gmail.com>
 <47218DB8.8020402@behnel.de>
Message-ID: <4730D771.40001@v.loewis.de>

> If you are not tied to PyXML by some external constraint, you might want to
> use ElementTree or lxml instead, which are easy to use and actively maintained
> (as opposed to PyXML).

As another alternative, you might try the XML libraries that come
with Python itself, which already share a lot of code with PyXML
(in fact, most of the code you would need a compiler for).

Regards,
Martin

From alexander.girman at gmail.com Thu Nov 8 23:07:36 2007
From: alexander.girman at gmail.com (Alexander Girman)
Date: Thu, 8 Nov 2007 17:07:36 -0500
Subject: [XML-SIG] ZSI Namespace problems
Message-ID: <11efd4830711081407w3f6d4d74w3fee9f9a00553155@mail.gmail.com>

Dear List,

I'm trying to build a client to consume a SOAP webservice, and am having the
damnedest time getting ZSI to pass along namespace information to its
generated SOAP message.
The code is illustrated below, along with the tracefile output for the SOAP
request. Please note that the and tags are unqualified, despite being
explicitly qualified in the generating code. The code otherwise works (I can
cut and paste the below query, add the namespaces by hand, and send it to the
webservice via curl and get the right response).

So the question is: why won't it augment the appropriate tags with the
declared namespaces? Any insight would be appreciated, as I've been toiling
over this for what seems like forever T_T...

def call_web_service(payload):
    from ZSI.client import Binding
    from ZSI import TC
    import sys

    url = 'https://www.example.com/'
    n = 'aeg:do-process'
    b = Binding(url = url, ns = n, tracefile = sys.stdout, nsdict={'ns': n})

    class process:
        def __init__(self, query):
            self.input = payload

    process.typecode=TC.Struct(process,[TC.String("ns:input")], "ns:process")
    return b.RPC(url, 'ns:process', process(query), nsdict={'ns': n})

Produces:

_________________________________ Thu Nov 8 16:48:20 2007 REQUEST:
__PAYLOAD__
_________________________________ Thu Nov 8 16:48:21 2007 RESPONSE:
500
Internal Server Error

From jza at openoffice.org Sun Nov 11 10:08:28 2007
From: jza at openoffice.org (Alexandro Colorado)
Date: Sun, 11 Nov 2007 03:08:28 -0600
Subject: [XML-SIG] How to parse an XML in SAX
Message-ID:

Hi I want to parse an XML using sax but my big issue are the WhiteSpaces
when they get reported. I want to know how to efficiently ignore them. I
know there are some DocumentHandlers and one specific for ignore
Whitespace but I still come up with a bunch of invisible nodes like \t or
\n.

Anyone have a tutorial on how to handle SAX for this kind of parsing?
--
Alexandro Colorado
Help the Tabasco Relief efforts:
http://rootcoffee.blogspot.com/2007/11/race-to-save-mexico-flood-victims.html

From stefan_ml at behnel.de Sun Nov 11 16:16:13 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 11 Nov 2007 16:16:13 +0100
Subject: [XML-SIG] How to parse an XML in SAX
In-Reply-To:
References:
Message-ID: <47371CBD.4010003@behnel.de>

Alexandro Colorado wrote:
> Hi I want to parse an XML using sax

Any reason why you would want to do that?

> but my big issue are the WhiteSpaces
> when they get reported. I want to know how to efficiently ignore them. I
> know there are some DocumentHandlers and one specific for ignore
> Whitespace but I still come up with a bunch of invisible nodes like \t or
> \n.
>
> Anyone have a tutorial on how to handle SAX for this kind of parsing?

Consider using cElementTree's iterparse() instead.

http://effbot.org/zone/element-iterparse.htm

It's also available in lxml.etree.

Stefan

From stefan_ml at behnel.de Mon Nov 12 12:06:56 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 12 Nov 2007 12:06:56 +0100
Subject: [XML-SIG] How to parse an XML in SAX
In-Reply-To:
References: <47371CBD.4010003@behnel.de> <4737CAAD.7000005@behnel.de>
Message-ID: <473833D0.6020102@behnel.de>

[going back to the list]

Alexandro Colorado wrote:
> On Sun, 11 Nov 2007 21:38:21 -0600, Stefan Behnel
> wrote:
>
>> The tool I actually mentioned, cElementTree, should also work just
>> fine on
>> 2.3. Note also that ElementTree (without the 'c') is pure Python, so it
>> doesn't require you to compile anything.
>
> Thanks for selling me into ElementTree however I cant because the
> version of the Python distribution that is being shipped doesn't has
> element tree so this make this a particular situation that I can only
> used the standard libraries.

I'm not sure I understand this. You are writing Python code, right? Why can't
you just add another Python source file?
(such as ElementTree.py)

Stefan

> Now going back to SAX, is there a way I can escape the non-printable
> characters and how exactly they get into it on the first place. SAX is a
> very quick parser from what I've read. I have found this tutorial
> between python and SAX:
>
> http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/
>
> I have move on to read other tutorials to see if they can address this
> current issue. I am interested on this parsing specifically to see a way
> of escaping or 'passing' the print out of special characaters:
>
> def endElement(self,name):
>     if (name == "img") :
>         print "%8s %s" % (self.name, self.title)
>         self.name = self.title = "" # just for safety
>     if (name == "title") :
>         pass
>
> Not sure what %8s and %s compared to escaping the /t or /n.

From jza at openoffice.org Mon Nov 12 17:06:06 2007
From: jza at openoffice.org (Alexandro Colorado)
Date: Mon, 12 Nov 2007 10:06:06 -0600
Subject: [XML-SIG] How to parse an XML in SAX
In-Reply-To: <473833D0.6020102@behnel.de>
References: <47371CBD.4010003@behnel.de> <4737CAAD.7000005@behnel.de>
 <473833D0.6020102@behnel.de>
Message-ID:

On Mon, 12 Nov 2007 05:06:56 -0600, Stefan Behnel wrote:

> [going back to the list]
>
> Alexandro Colorado wrote:
>> On Sun, 11 Nov 2007 21:38:21 -0600, Stefan Behnel
>> wrote:
>>
>>> The tool I actually mentioned, cElementTree, should also work just
>>> fine on
>>> 2.3. Note also that ElementTree (without the 'c') is pure Python, so it
>>> doesn't require you to compile anything.
>>
>> Thanks for selling me into ElementTree however I cant because the
>> version of the Python distribution that is being shipped doesn't has
>> element tree so this make this a particular situation that I can only
>> used the standard libraries.
>
> I'm not sure I understand this. You are writing Python code, right? Why
> can't
> you just add another Python source file? (such as ElementTree.py)
>
> Stefan

Well first of, will it be backward compatible with 2.3?
How can I include it on the fly without modifying the base install?

What's wrong with SAX, aside from this whitespace issue I already have the
parsing I want. Plus SAX is not just a python thing I might need SAX for
other languages.

--
Alexandro Colorado
Help the Tabasco Relief efforts:
http://rootcoffee.blogspot.com/2007/11/race-to-save-mexico-flood-victims.html

From stefan_ml at behnel.de Mon Nov 12 17:43:08 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 12 Nov 2007 17:43:08 +0100
Subject: [XML-SIG] How to parse an XML in SAX
In-Reply-To:
References: <47371CBD.4010003@behnel.de> <4737CAAD.7000005@behnel.de>
 <473833D0.6020102@behnel.de>
Message-ID: <4738829C.3000007@behnel.de>

Alexandro Colorado wrote:
> On Mon, 12 Nov 2007 05:06:56 -0600, Stefan Behnel
> wrote:
>> Alexandro Colorado wrote:
>>> Thanks for selling me into ElementTree however I cant because the
>>> version of the Python distribution that is being shipped doesn't has
>>> element tree so this make this a particular situation that I can only
>>> used the standard libraries.
>> I'm not sure I understand this. You are writing Python code, right? Why
>> can't
>> you just add another Python source file? (such as ElementTree.py)
>
> Well first of, will it be backward compatible with 2.3?

AFAIR, it works on Python 1.5.2 and later.

> How can I include it on the fly without modifying the base install?

By copying the file next to your own code?

> What's wrong with SAX, aside from this whitespace issue I already have the
> parsing I want. Plus SAX is not just a python thing I might need SAX for
> other languages.

No problem, go ahead. Since you already have an implementation, you probably
have solved enough problems already, you'll solve the remaining ones also.

> SAX is a very quick parser from what I've read.

SAX is not a parser. It uses a parser in the background to generate SAX parse
events (which IMHO are pretty ugly to work with, but that's what you wanted).
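
To illustrate the event model described above and the whitespace question that started this thread, here is a minimal sketch that is not part of the original messages. It assumes a hypothetical handler class named TextCollector, uses only the standard xml.sax module, and is written in current Python syntax rather than the 2.x syntax used elsewhere in the thread. Because the parser may deliver character data in several chunks, the handler buffers the chunks and drops any run that consists only of whitespace.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler: keeps element text, drops whitespace-only runs."""

    def __init__(self):
        super().__init__()
        self._buffer = []   # chunks received since the last tag boundary
        self.texts = []     # non-blank text runs, in document order

    def _flush(self):
        # Join the buffered chunks and keep them only if something
        # other than whitespace (spaces, \t, \n) is left after stripping.
        text = "".join(self._buffer).strip()
        self._buffer = []
        if text:
            self.texts.append(text)

    def startElement(self, name, attrs):
        self._flush()

    def endElement(self, name):
        self._flush()

    def characters(self, content):
        # May be called several times for one stretch of text.
        self._buffer.append(content)


document = b"<root>\n\t<img>\n\t\t<title>Sunset</title>\n\t</img>\n</root>"
handler = TextCollector()
xml.sax.parseString(document, handler)
print(handler.texts)
```

The indentation between the tags still arrives as characters() events; the sketch simply discards it at each tag boundary instead of trying to stop the parser from reporting it.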
> is there a way I can escape the non-printable characters

Tried repr() ? (or "%r" for what it's worth...)

Stefan

From chris at simplistix.co.uk Tue Nov 27 11:48:44 2007
From: chris at simplistix.co.uk (Chris Withers)
Date: Tue, 27 Nov 2007 10:48:44 +0000
Subject: [XML-SIG] problem with elementtree 1.2.6
In-Reply-To: <368a5cd50711262255l25a828cax8bcb546ed390467a@mail.gmail.com>
References: <474B6D14.6010404@simplistix.co.uk>
 <368a5cd50711262255l25a828cax8bcb546ed390467a@mail.gmail.com>
Message-ID: <474BF60C.1050801@simplistix.co.uk>

Fredrik Lundh wrote:
>> Sorry if this should go to a list, I couldn't find one...
>> (please send me that way if there is one...)
>
> python-list/comp.lang.python or xml-sig are good choices.

OK, let's go with xml-sig :)

>> I've bumped into an annoying problem, which I actually think is a
>> problem with expat:
>>
>> >>> from xml.parsers import expat
>> >>> parser = expat.ParserCreate()
>> >>> def handle(data): print repr(data)
>> ...
>> >>> parser.CharacterDataHandler = handle
>> >>> parser.Parse('&lt;node/&gt;',0)
>> u'<'
>> u'node/'
>> u'>'
>> 1
>>
>> Now, why is expat unquoting those two entities?
>
> in an XML file, the characters < and & *must* be escaped (either as
> entity references or character references) when appearing in normal
> text:

Yes indeed.

> the following entities are predefined: &amp; (&) &lt; (<) &gt; (>)
> &quot; (") &apos; (').

Okay, so in the above, if I really mean &lt;, the xml should be:
'&amp;lt;/&amp;gt;'

Seems a little clunky, but okay...
I guess this was causing me problems as I'm working on a bug in Twiddler
(http://www.simplistix.co.uk/software/python/twiddler)
where quoted html was ending up unquoted after processing:

 >>> from twiddler import Twiddler
 >>> t = Twiddler('<b>')
 >>> t.render()
u''

Now, I see how you fixed this in ElementTree by re-escaping all the
predefined entities (out of interest, why is the function called
_escape_cdata rather than _escape_data?) but I can't do that because I
want users to be able to insert chunks of html and choose whether or not
they are escaped:

 >>> t = Twiddler('')

escaping:

 >>> t['something'].replace('')
 >>> t.render()
u'<b>'

no escaping:

 >>> t['something'].replace('',filters=())
 >>> t.render()
u''

I guess in my use of ElementTree, I need to make sure character data is
re-escaped at the tree building stage?

> other names give an error unless they've been
> explicitly defined.

So I see:

 >>> from xml.parsers import expat
 >>> parser = expat.ParserCreate()
 >>> parser.Parse('&foo;',0)
Traceback (most recent call last):
  File "", line 1, in ?
xml.parsers.expat.ExpatError: undefined entity: line 1, column 5

But why does calling UseForeignDTD suddenly make everything ok?

 >>> parser = expat.ParserCreate()
 >>> parser.UseForeignDTD()
 >>> parser.Parse('&foo;',0)
1

What extra hooks get called as a result of calling UseForeignDTD?

cheers,

Chris

--
Simplistix - Content Management, Zope & Python Consulting
 - http://www.simplistix.co.uk

From stefan_ml at behnel.de Tue Nov 27 14:59:31 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 27 Nov 2007 14:59:31 +0100
Subject: [XML-SIG] problem with elementtree 1.2.6
In-Reply-To: <474BF60C.1050801@simplistix.co.uk>
References: <474B6D14.6010404@simplistix.co.uk>
 <368a5cd50711262255l25a828cax8bcb546ed390467a@mail.gmail.com>
 <474BF60C.1050801@simplistix.co.uk>
Message-ID: <474C22C3.8020101@behnel.de>

Chris Withers wrote:
>> the following entities are predefined: &amp; (&) &lt; (<) &gt; (>)
>> &quot; (") &apos; (').
>
> Okay, so in the above, if I really mean &lt;, the xml should be:
> '&amp;lt;/&amp;gt;'
>
> Seems a little clunky, but okay...

That's how escaping works, be it in XML, encodings, compression, whatever.

> I guess this was causing me problems as I'm working on a bug in Twiddler
> (http://www.simplistix.co.uk/software/python/twiddler)
> where quoted html was ending up unquoted after processing:
>
> >>> from twiddler import Twiddler
> >>> t = Twiddler('<b>')
> >>> t.render()
> u''

If render() is supposed to serialise a correct HTML or XML tag structure then
this is a bug.

> Now, I see how you fixed this in ElementTree by re-escaping all the
> predefined entities (out of interest, why is the function called
> _escape_cdata rather than _escape_data?)

You can read the SGML spec regarding CDATA.

> but I can't do that because I
> want users to be able to insert chunks of html and choose whether or not
> they are escaped:
>
> >>> t = Twiddler('')
>
> escaping:
>
> >>> t['something'].replace('')

What an odd API.

> >>> t.render()
> u'<b>'

I guess that's the expected behaviour.

> no escaping:
>
> >>> t['something'].replace('',filters=())
> >>> t.render()
> u''

I consider it bad practice to write serialised HTML into an HTML template. It
prevents the templating system from seeing the complete tag structure, which
allows you to output broken HTML without noticing. And there's enough broken
HTML out there already.

Doesn't Twiddler provide a way to insert a tag tree fragment rather than a
serialised tag string?

> What extra hooks get called as a result of calling UseForeignDTD?

Have you tried reading the docs or the source?
Stefan

From chris at simplistix.co.uk Wed Nov 28 22:46:04 2007
From: chris at simplistix.co.uk (Chris Withers)
Date: Wed, 28 Nov 2007 21:46:04 +0000
Subject: [XML-SIG] problem with elementtree 1.2.6
In-Reply-To: <474C22C3.8020101@behnel.de>
References: <474B6D14.6010404@simplistix.co.uk>
 <368a5cd50711262255l25a828cax8bcb546ed390467a@mail.gmail.com>
 <474BF60C.1050801@simplistix.co.uk> <474C22C3.8020101@behnel.de>
Message-ID: <474DE19C.5080002@simplistix.co.uk>

Stefan Behnel wrote:
> Chris Withers wrote:
>>> the following entities are predefined: &amp; (&) &lt; (<) &gt; (>)
>>> &quot; (") &apos; (').
>> Okay, so in the above, if I really mean &lt;, the xml should be:
>> '&amp;lt;/&amp;gt;'
>>
>> Seems a little clunky, but okay...
>
> That's how escaping works, be it in XML, encodings, compression, whatever.

Well yes and no. I'd expect escaping to work such that whatever we're
dealing with can be round tripped, ie: parsed, serialized, parsed
again, etc.

>> I guess this was causing me problems as I'm working on a bug in Twiddler
>> (http://www.simplistix.co.uk/software/python/twiddler)
>> where quoted html was ending up unquoted after processing:
>>
>> >>> from twiddler import Twiddler
>> >>> t = Twiddler('<b>')
>> >>> t.render()
>> u''
>
> If render() is supposed to serialise a correct HTML or XML tag structure then
> this is a bug.

Indeed, although the bug turned out to be in the tree builder used as
part of the parsing process.

>> Now, I see how you fixed this in ElementTree by re-escaping all the
>> predefined entities (out of interest, why is the function called
>> _escape_cdata rather than _escape_data?)
>
> You can read the SGML spec regarding CDATA.

Not sure what that's supposed to mean. CDATA for me means stuff inside a
section. _escape_cdata is used for everything inside any
tag that isn't another tag.
>> but I can't do that because I
>> want users to be able to insert chunks of html and choose whether or not
>> they are escaped:
>>
>> >>> t = Twiddler('')
>>
>> escaping:
>>
>> >>> t['something'].replace('')
>
> What an odd API.

It actually works pretty well and might make more sense in context, have
a look at the presentation on it:

http://www.simplistix.co.uk/presentations/templating_06/templating_06.pdf

>> no escaping:
>>
>> >>> t['something'].replace('',filters=())
>> >>> t.render()
>> u''
>
> I consider it bad practice to write serialised HTML into an HTML template.

I and many others do not ;-) When writing content into an html template,
that content often comes from other sources that spit out lumps of html.
Being able to insert them without escaping is a common use case.

> It
> prevents the templating system from seeing the complete tag structure, which
> allows you to output broken HTML without noticing.

That's true, sometimes. That inserted lump may have come from a process
which can only spit out perfect html fragments, in which case you're
fine, or it may come from user input, in which case you're doomed but
will likely have happy customers ;-)

> Doesn't Twiddler provide a way to insert a tag tree fragment rather than a
> serialised tag string?

Yep, sure, that's what the clone method is for...

>> What extra hooks get called as a result of calling UseForeignDTD?
>
> Have you tried reading the docs or the source?

Docs yes, source no. I don't read C anymore :-( Little help?
cheers,

Chris

--
Simplistix - Content Management, Zope & Python Consulting
 - http://www.simplistix.co.uk

From fredrik at pythonware.com Thu Nov 29 00:33:08 2007
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 29 Nov 2007 00:33:08 +0100
Subject: [XML-SIG] problem with elementtree 1.2.6
In-Reply-To: <474DE19C.5080002@simplistix.co.uk>
References: <474B6D14.6010404@simplistix.co.uk>
 <368a5cd50711262255l25a828cax8bcb546ed390467a@mail.gmail.com>
 <474BF60C.1050801@simplistix.co.uk> <474C22C3.8020101@behnel.de>
 <474DE19C.5080002@simplistix.co.uk>
Message-ID:

Chris Withers wrote:

>> That's how escaping works, be it in XML, encodings, compression, whatever.
>
> Well yes and no. I'd expect escaping to work such that whatever we're
> dealing with can be round tripped, ie: parsed, serialized, parsed
> again, etc.

that's exactly how it works in ET, of course. you put Python strings in
the tree, the ET parsers and serializers take care of the rest.

    elem = ET.Element("tag")
    elem.text = value # ASCII or Unicode string
    ... write to disk ...
    ... read it back ...
    assert elem.text == value

>> You can read the SGML spec regarding CDATA.
>
> Not sure what that's supposed to mean. CDATA for me means stuff inside a
> section. _escape_cdata is used for everything inside any
> tag that isn't another tag.

cdata is character data; see

    http://www.w3.org/TR/html401/types.html#h-6.2

that's not the same thing as a "CDATA section" (which is just one of
several ways to store character data in an XML file). how things are
stored doesn't matter; that's just a serialization detail:

    http://www.w3.org/TR/xml-infoset/#omitted

    What is not in the Information Set

    6. Whether characters are represented by character references.
    19. The boundaries of CDATA marked sections.
    ...

> I and many others do not ;-) When writing content into an html template,
> that content often comes from other sources that spit out lumps of html.
> Being able to insert them without escaping is a common use case.
HTML might be similar to XML, but an XML parser cannot parse HTML, so
you cannot insert HTML fragments into an XML document without either
escaping it, or pre-processing it to make sure it's well-formed.

if you want to insert literal XML fragments in an ET tree, use the XML
factory function:

    fragment = "..."
    elem.append(ET.XML(fragment))

if you want to embed HTML fragments in an ET tree, use ElementTidy or
ElementSoup (or equivalent) to turn the fragment into properly nested
and properly namespaced XHTML.

if you want to do unstructured string handling, use a template library
or Python strings. don't use an XML library if you don't want to work
with XML.

> That's true, sometimes. That inserted lump may have come from a process
> which can only spit out perfect html fragments, in which case you're
> fine, or it may come from user input, in which case you're doomed but
> will likely have happy customers ;-)

the hackers will be happy, at least:

    http://en.wikipedia.org/wiki/Cross_site_scripting

From pzs at dcs.gla.ac.uk Thu Nov 29 15:56:49 2007
From: pzs at dcs.gla.ac.uk (Peter Saffrey)
Date: Thu, 29 Nov 2007 14:56:49 -0000
Subject: [XML-SIG] Problems with PyXML Mac OS 10.5 install
Message-ID:

I'm attempting to install PyXML on a Mac OS Leopard laptop so that I can use
the xpath libraries. I've downloaded 0.8.4, run "python setup.py build" and
"python setup.py install". If I do import xml.xpath, I get "no module xpath".
It can load the xml module OK, but I presume that this is simply the old
version.

It seems to me that it's simply not being installed - if I use Spotlight to
find xpath, it's only found in the local directory, not where it should be
installed. I've also tried (the hacky approach) of copying the built xml
directory into the place where the old one is, but then I get the "cannot
import name boolean" error.

Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/xml-sig/attachments/20071129/81e438e8/attachment.htm

From chris at simplistix.co.uk Fri Nov 30 00:30:59 2007
From: chris at simplistix.co.uk (Chris Withers)
Date: Thu, 29 Nov 2007 23:30:59 +0000
Subject: [XML-SIG] problem with elementtree 1.2.6
In-Reply-To:
References: <474B6D14.6010404@simplistix.co.uk>
 <368a5cd50711262255l25a828cax8bcb546ed390467a@mail.gmail.com>
 <474BF60C.1050801@simplistix.co.uk> <474C22C3.8020101@behnel.de>
 <474DE19C.5080002@simplistix.co.uk>
Message-ID: <474F4BB3.3020009@simplistix.co.uk>

Fredrik Lundh wrote:
> Chris Withers wrote:
>
>>> That's how escaping works, be it in XML, encodings, compression, whatever.
>> Well yes and no. I'd expect escaping to work such that whatever we're
>> dealing with can be round tripped, ie: parsed, serialized, parsed
>> again, etc.
>
> that's exactly how it works in ET, of course.

I didn't say it didn't ;-)

> cdata is character data; see
>
> http://www.w3.org/TR/html401/types.html#h-6.2
>
> that's not the same thing as a "CDATA section" (which is just one of
> several ways to store character data in an XML file).

Ug. How confusing :-(

> how things are
> stored doesn't matter; that's just a serialization detail:
>
> http://www.w3.org/TR/xml-infoset/#omitted
>
> What is not in the Information Set
>
> 6. Whether characters are represented by character references.
> 19. The boundaries of CDATA marked sections.
> ...

I'm not sure I follow what you're trying to say...

>> I and many others do not ;-) When writing content into an html template,
>> that content often comes from other sources that spit out lumps of html.
>> Being able to insert them without escaping is a common use case.
>
> HTML might be similar to XML, but an XML parser cannot parse HTML, so
> you cannot insert HTML fragments into an XML document without either
> escaping it, or pre-processing it to make sure it's well-formed.

What about xhtml?
> if you want to embed HTML fragments in an ET tree, use ElementTidy or
> ElementSoup (or equivalent) to turn the fragment into properly nested
> and properly namespaced XHTML.

Fair enough...

> if you want to do unstructured string handling, use a template library

I'm using/building a templating library, it just happens that ET is an
implementation detail of that template library ;-)

>> That's true, sometimes. That inserted lump may have come from a process
>> which can only spit out perfect html fragments, in which case you're
>> fine, or it may come from user input, in which case you're doomed but
>> will likely have happy customers ;-)
>
> the hackers will be happy, at least:
>
> http://en.wikipedia.org/wiki/Cross_site_scripting

user -> content author in this case. Since they usually own and run the
system to which they're adding content, a much more effective attack
would just be to turn the box off :-P

cheers,

Chris

--
Simplistix - Content Management, Zope & Python Consulting
 - http://www.simplistix.co.uk

From martin at v.loewis.de Fri Nov 30 08:49:00 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Fri, 30 Nov 2007 08:49:00 +0100
Subject: [XML-SIG] problem with elementtree 1.2.6
In-Reply-To: <474BF60C.1050801@simplistix.co.uk>
References: <474B6D14.6010404@simplistix.co.uk>
 <368a5cd50711262255l25a828cax8bcb546ed390467a@mail.gmail.com>
 <474BF60C.1050801@simplistix.co.uk>
Message-ID: <474FC06C.5070401@v.loewis.de>

> What extra hooks get called as a result of calling UseForeignDTD?

Expat will invoke the ExternalEntityRefHandler with both pubid and
sysid set to None, if there is no DOCTYPE declaration in the document.
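
The behaviour described above can be observed with a short sketch, which is not part of the original message. It uses the standard xml.parsers.expat module in current Python syntax; the helper name record_foreign_dtd_calls is hypothetical, and the call to SetParamEntityParsing is an assumption on my part that parameter entity parsing has to be switched on before Expat will invoke the handler.

```python
from xml.parsers import expat


def record_foreign_dtd_calls(document):
    """Parse `document` and return the (pubid, sysid) pairs the handler saw."""
    calls = []

    def resolve_entity(context, base, system_id, public_id):
        # Record what Expat passes in; a real handler would load a DTD here.
        calls.append((public_id, system_id))
        return 1  # non-zero: tell Expat the entity was dealt with

    parser = expat.ParserCreate()
    parser.UseForeignDTD(True)
    # Assumption: parameter entity parsing must be enabled for the
    # handler to be called at all.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
    parser.ExternalEntityRefHandler = resolve_entity
    parser.Parse(document, True)
    return calls


# A document with no DOCTYPE declaration.
calls = record_foreign_dtd_calls(b"<?xml version='1.0'?><element/>")
print(calls)
```

For a document without a DOCTYPE declaration, the recorded list should hold a single pair in which both identifiers are None, which is the hook call mentioned above.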
Regards,
Martin

From martin at v.loewis.de Fri Nov 30 09:05:23 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Fri, 30 Nov 2007 09:05:23 +0100
Subject: [XML-SIG] problem with elementtree 1.2.6
In-Reply-To: <474F4BB3.3020009@simplistix.co.uk>
References: <474B6D14.6010404@simplistix.co.uk>
 <368a5cd50711262255l25a828cax8bcb546ed390467a@mail.gmail.com>
 <474BF60C.1050801@simplistix.co.uk> <474C22C3.8020101@behnel.de>
 <474DE19C.5080002@simplistix.co.uk>
 <474F4BB3.3020009@simplistix.co.uk>
Message-ID: <474FC443.6030302@v.loewis.de>

>> What is not in the Information Set
>>
>> 6. Whether characters are represented by character references.
>> 19. The boundaries of CDATA marked sections.
>> ...
>
> I'm not sure I follow what you're trying to say...

That it is irrelevant in XML whether the less-than character
is represented as < or < or

So if some XML library chooses to represent < as < you
should not be surprised.

It's not clear to me (perhaps because I lack the starting of
this discussion) what the actual problem *is* that you are trying
to resolve.

>>> I and many others do not ;-) When writing content into an html template,
>>> that content often comes from other sources that spit out lumps of html.
>>> Being able to insert them without escaping is a common use case.
>> HTML might be similar to XML, but an XML parser cannot parse HTML, so
>> you cannot insert HTML fragments into an XML document without either
>> escaping it, or pre-processing it to make sure it's well-formed.
>
> What about xhtml?

It should be possible to insert XHTML fragments into XHTML documents,
in selected positions, assuming an appropriate definition of "to insert".

For ET (and any other tree-oriented XML implementation), replacing
text with serialized XHTML in the tree is not an appropriate
implementation of "to insert", as that will just insert less-than
characters, not markup. To insert markup (in particular, tags, i.e.
elements), you need to insert Element objects into the tree.

Regards,
Martin
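
The distinction drawn in the last message, between putting serialized markup into the tree as text and inserting Element objects, can be sketched as follows. This is an illustration added to the archive, not code from the thread: it uses the standard xml.etree.ElementTree module in current Python syntax, and the variable names and the sample fragment are made up for the example.

```python
import xml.etree.ElementTree as ET

fragment = "<b>bold</b>"  # hypothetical fragment, assumed to be well-formed XML

# Variant 1: store the fragment as text. The serializer escapes it,
# so the output contains less-than characters as data, not a <b> element.
as_text = ET.Element("p")
as_text.text = fragment
escaped = ET.tostring(as_text, encoding="unicode")

# Variant 2: parse the fragment into an Element with the XML factory
# function and append that Element, so real markup ends up in the tree.
as_tree = ET.Element("p")
as_tree.append(ET.XML(fragment))
markup = ET.tostring(as_tree, encoding="unicode")

print(escaped)
print(markup)
```

Only the second variant lets the library see the inserted tag structure, and ET.XML raises a parse error if the fragment is not well-formed, so broken markup is caught before it reaches the output.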