From uche.ogbuji at fourthought.com Mon Aug 2 21:17:03 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Mon Aug 2 21:17:05 2004
Subject: [XML-SIG] favicon in XBEL
In-Reply-To: <410AC45B.4070504@comcast.net>
References: <200407301527.14592.fdrake@acm.org> <410AC45B.4070504@comcast.net>
Message-ID: <1091474222.3479.220.camel@borgia>

On Fri, 2004-07-30 at 15:57, Thomas B. Passin wrote:
> Fred L. Drake, Jr. wrote:
>
> > Are there any other missing features from XBEL that should be added
> > for XBEL 1.2? Two things I found when checking my archives were:
> >
> > 1. Specify how URLs should be encoded in XBEL.
> > 2. Some sort of merge/include feature.
> -Fred
>
> Currently I merge bookmarks from a number of browsers. I do it with
> xslt, which also handles de-duplicating to some degree. Good merging
> and sorting in an xbel utility would be nice.

At least a start:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/135131

> My biggest problem when working with bookmarks, and even more from sets
> of them, was the encoding of the bookmark titles. The web pages the
> titles come from can have different encodings, and depending on the
> browser, those encodings may end up in the titles, resulting in
> inconsistent encoding.

This is clearly a bug in the browsers. If browsers don't generate XML
in a sane manner, there really is no way to solve the resulting
problems. I'm sure you know that, but I did have to mention this fact,
and just how big a shame it is. Maybe we should add a para or two on
the XBEL pages exhorting implementors not to be careless with their
character model.

> Well, maybe that doesn't happen so often anymore (better browsers?), but
> I had to do some hacking on the current xbel code to get it to use
> unicode and stop halting with encoding errors on titles. I haven't had
> time to post my changes yet, but maybe in a couple of weeks ...

Well, not halting can be bad if you don't know what the encodings
actually are. Maybe the utilities would have to take some sort of
default encoding param from the user? But I really hate to make
crutches for such insidious problems.

--
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Perspective on XML: Steady steps spell success with Google - http://www.adtmag.com/article.asp?id=9663
Use XML namespaces with care - http://www-106.ibm.com/developerworks/xml/library/x-namcar.html
Managing XML libraries - http://www.adtmag.com/article.asp?id=9160
Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090
Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/

From and-xml at doxdesk.com Tue Aug 3 07:53:23 2004
From: and-xml at doxdesk.com (Andrew Clover)
Date: Tue Aug 3 07:53:24 2004
Subject: [XML-SIG] value error when parsing XML
In-Reply-To: <410B7277.3000609@mail.usyd.edu.au>
References: <410B7277.3000609@mail.usyd.edu.au>
Message-ID: <40EE32F9.1080809@doxdesk.com>

Ajay Brar wrote:

> i get a value error when parsing an xml file.

With what are you parsing the XML file?

> can someone please tell me how i can workaround this problem.

Do you really need the .dtd? If you don't need default attribute
values or entities from the DTD external subset, you are best off
using a simple non-validating, non-external-entity-reading parser.

Otherwise, depending on what you are using to parse the XML file, you
may have to give it an absolute URI to tell it where the resource is
supposed to be located, so that it can work out where, exactly,
relative URLs are relative to - relative URIs should be relative to
the XML file that used them, *not* your OS's current working directory.
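
The point about relative URIs can be sketched with the standard library. This is only a minimal illustration in modern Python 3 (an assumption, since the thread predates it); the helper name resolve_system_id and the example paths are hypothetical:

```python
from urllib.parse import urljoin


def resolve_system_id(document_uri, system_id):
    """Resolve a (possibly relative) system ID against the URI of the
    document that referenced it, never against the working directory."""
    return urljoin(document_uri, system_id)


# A DTD referenced as "um.dtd" from a document in /home/user/um_xml/
print(resolve_system_id("file:///home/user/um_xml/um_ajay.xml", "um.dtd"))

# The same rule applies to documents fetched over HTTP
print(resolve_system_id("http://www.example.com/xml/foo.xml", "foo.ent"))
```

Because the base is the document's own absolute URI, the result does not depend on where the script happens to be started from.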

If the relative URI given in the is actually wrong (ie. it points to a
non-existent path), you'd have to use an entity resolver to redirect
it elsewhere. (With SAX you'd use resolveEntity; with DOM3LS you'd use
an LSResourceResolver.)

--
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/

From abra9823 at mail.usyd.edu.au Tue Aug 3 12:53:12 2004
From: abra9823 at mail.usyd.edu.au (Ajay Brar)
Date: Tue Aug 3 12:53:21 2004
Subject: [XML-SIG] value error when parsing XML
In-Reply-To: <40EE32F9.1080809@doxdesk.com>
References: <410B7277.3000609@mail.usyd.edu.au> <40EE32F9.1080809@doxdesk.com>
Message-ID: <410F6E98.4080803@mail.usyd.edu.au>

Andrew Clover wrote:

> Ajay Brar wrote:
>
>> i get a value error when parsing an xml file.
>
>
> With what are you parsing the XML file?

i am using a SAX parser. i use the make_parser in xml.sax to make the
parser. i have my own content handler

    parser = make_parser()
    parser.setFeature(feature_namespaces, 0)
    umXML = umXMLHandler.UM_XML_Handler()

>> can someone please tell me how i can workaround this problem.
>
>
> Do you really need the .dtd? If you don't need default attribute
> values or entities from the DTD external subset, you are best off
> using a simple non-validating, non-external-entity-reading parser.

while i don't need the DTD immediately, in the long term i would like
to validate the XML against the DTD.

>
> Otherwise, depending on what you are using to parse the XML file, you
> may have to give it an absolute URI to tell it where the resource is
> supposed to be located, so that it can work out where, exactly,
> relative URLs are relative to - relative URIs should be relative to
> the XML file that used them, *not* your OS's current working directory.

the script actually works if the dtd is in the same directory as the
script. if i put it with the xml, that's when i get the error.

>
> If the relative URI given in the is actually wrong (ie. it
> points to a non-existent path), you'd have to use an entity resolver
> to redirect it elsewhere. (With SAX you'd use resolveEntity; with
> DOM3LS you'd use an LSResourceResolver.)

would this be the correct way to specify the uri, if it is in the same
directory as the xml file

i think its something to do with the way i call the parser

    parser.parse("../um_xml/um_ajay.xml")

and it seems to me that for some reason, when parsing, it resolves the
name to ../um_xml/, which in this case is um.dtd

Is that why?
i am a newbie to python, XML and XML in Python, so its hard to figure
out what i am doing wrong.

thanks
cheers

--
Ajay Brar
CS Honours 2004
Smart Internet Technology Research Group
http://www.it.usyd.edu.au/~abrar1

From aconrad.tlv at magic.fr Tue Aug 3 18:46:13 2004
From: aconrad.tlv at magic.fr (Alexandre CONRAD)
Date: Tue Aug 3 18:46:14 2004
Subject: [XML-SIG] get the abolute path for a node
Message-ID: <410FC155.2000802@magic.fr>

Hello,

in xpath, is there a way I can get the absolute path for a node ?

I would need some function that would be able to return a string
looking like this :

function(sub_node5)

would return :

"/rootnode/node/sub_node5/"

I've been looking around, and apparently, there is a function that
returns all the ancestors of a node. But I need this as a string path.

Any ideas ?

Best regards,
--
Alexandre CONRAD - TLV
Research & Development
tel : +33 1 30 80 55 05
fax : +33 1 30 56 55 06
6, rue de la plaine
78860 - SAINT NOM LA BRETECHE
FRANCE

From and-xml at doxdesk.com Tue Aug 3 20:13:15 2004
From: and-xml at doxdesk.com (Andrew Clover)
Date: Tue Aug 3 20:12:38 2004
Subject: [XML-SIG] Re: value error when parsing XML
In-Reply-To: <410F6E98.4080803@mail.usyd.edu.au>
References: <410B7277.3000609@mail.usyd.edu.au> <40EE32F9.1080809@doxdesk.com> <410F6E98.4080803@mail.usyd.edu.au>
Message-ID: <410FD5BB.1080306@doxdesk.com>

Ajay Brar wrote:

> i am using a SAX parser.

I don't do a lot of SAX, but it looks to me like there's a bug in the
xml.sax.saxutils InputSource which is likely to be the cause of your
trouble. (Details to follow.)

> i think its something to do with the way i call the parser
> parser.parse("../um_xml/um_ajay.xml")

Yes. I would suggest passing in a URI instead:

    import os, urllib

    filename= '../um_xml/um_ajay.xml'
    # make the path absolute, then turn it into an unambiguous file: URI
    uri= 'file:'+urllib.pathname2url(os.path.abspath(filename))
    parser.parse(uri)

--
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/

From and-xml at doxdesk.com Tue Aug 3 20:53:37 2004
From: and-xml at doxdesk.com (Andrew Clover)
Date: Tue Aug 3 20:53:00 2004
Subject: [XML-SIG] saxutils bug (was: value error when parsing XML)
Message-ID: <410FDF31.1070809@doxdesk.com>

From Ajay's report I've been looking at problems in the saxutils
function prepare_input_source:

    def prepare_input_source(source, base = ""):
        [...]
        sysid = source.getSystemId()
        if os.path.isfile(sysid):
            basehead = os.path.split(os.path.normpath(base))[0]
            source.setSystemId(os.path.join(basehead, sysid))
            f = open(sysid, "rb")

This allows a systemId to be either a filename or a URI, and tries to
guess when it's a filename by sniffing to see if a file with the given
name exists.

However the filename-sniffing is done *before* the source's systemId
is resolved relative to its baseURI, and the non-resolved systemId is
used to open the file, thus ignoring the baseURI passed in completely
and calculating any relative URIs relative to the current working
directory instead of the enclosing baseURI.

For this reason, a document in a different directory to the CWD may
have trouble using external entities and the external DTD subset. If
the systemId is relative and does not exist relative to the CWD
instead of the baseURI, the function will assume it is a URI and
attempt to urlopen it, resulting in the ValueError reported by Ajay.
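
The difference between the two lookups can be reproduced outside the parser. What follows is a rough sketch in modern Python 3, not the saxutils code itself; the two helper functions and the temporary directory layout (um_xml/ beside scripts/) are assumptions chosen to mirror Ajay's setup:

```python
import os
import tempfile


def found_relative_to_cwd(system_id):
    """The check the sniffing effectively performs: is there a file
    with this name relative to the current working directory?"""
    return os.path.isfile(system_id)


def found_relative_to_base(base, system_id):
    """The lookup the document author intended: resolve the system ID
    against the directory of the file that referenced it."""
    return os.path.isfile(os.path.join(os.path.dirname(base), system_id))


with tempfile.TemporaryDirectory() as root:
    xml_dir = os.path.join(root, "um_xml")
    script_dir = os.path.join(root, "scripts")
    os.mkdir(xml_dir)
    os.mkdir(script_dir)

    # The document and its DTD live together in um_xml/
    base = os.path.join(xml_dir, "um_ajay.xml")
    for path in (base, os.path.join(xml_dir, "um.dtd")):
        with open(path, "w") as handle:
            handle.write("")

    # The script runs from a different directory
    previous_cwd = os.getcwd()
    os.chdir(script_dir)
    try:
        cwd_hit = found_relative_to_cwd("um.dtd")
        base_hit = found_relative_to_base(base, "um.dtd")
    finally:
        os.chdir(previous_cwd)

print("relative to CWD: ", cwd_hit)
print("relative to base:", base_hit)
```

With the DTD beside the XML file and the script started elsewhere, only the base-relative lookup finds it, which matches the behaviour Ajay describes.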

This is the case when a filename is passed in to prepare_input_source
(and hence, to the original parse() call), but it's also the case for
file streams due to this line earlier in the function:

    if hasattr(f, "name"):
        source.setSystemId(f.name)

f.name is the filename the stream was opened with, which can also be
relative. I believe it would be more appropriate to abspath the
filename (not normpath as, I believe erroneously, used above) and
convert it to an unambiguous file: URI.

However, I believe the approach of detecting the difference between
URI and filename by file-sniffing on every entity access to be broken
in general. For example a document at
http://www.example.com/xml/foo.xml that referenced the system ID
'foo.ent' would get the wrong external entity if there just happened
to be a 'foo.ent' in the current working directory.

I would prefer to keep all InputSource systemIds as URIs; even when a
filename was originally passed in it should be converted to a URI.
Otherwise we cannot reliably deal with relative systemIds.

However as I have not played much with SAX I'm hesitant to drop
patches to sourceforge just yet. Discussion of any potential problems
with this approach, and any better ways of detecting the difference
between a filename and a URI, would be appreciated.

cheers,
--
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/

From fdrake at acm.org Wed Aug 4 17:42:34 2004
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed Aug 4 17:42:41 2004
Subject: [XML-SIG] Expat 1.95.8 has been released
Message-ID: <200408041142.34122.fdrake@acm.org>

For those who don't read the expat-discuss list, this is the
announcement for Expat 1.95.8; it went to that list on July 24. I've
updated the Expat included in Python 2.4, but haven't updated PyXML
yet. The upcoming Python 2.4a2 release will include the new Expat.

Expat is a fast XML parser written in C based on code written by XML
and SGML guru James Clark. A new version, Expat 1.95.8, has been
released by the current maintainers of the package, fixing still more
minor problems caught by picky compilers and improving the package's
cross-platform support. One rather nice new feature has been
introduced as well.

Changes include:

1. Major new feature: suspend/resume. Handlers can now request that
   a parse be suspended for later resumption or aborted altogether.
   See "Temporarily Stopping Parsing" in the documentation for more
   details.

2. Some mostly minor bug fixes, but compilation should no longer
   generate warnings on most platforms. SF issues include: 827319,
   840173, 846309, 888329, 896188, 923913, 928113, 961698, 985192.

See the Expat home page, http://www.libexpat.org/, for more
information on the changes in this release and on Expat in general.

  -Fred

--
Fred L. Drake, Jr.

From n.youngman at ntlworld.com Thu Aug 5 11:45:11 2004
From: n.youngman at ntlworld.com (n.youngman@ntlworld.com)
Date: Thu Aug 5 11:47:28 2004
Subject: [XML-SIG] XML Unicode and UTF-8
Message-ID: <20040805094651.UGJK7107.mta01-svc.ntlworld.com@[10.137.100.68]>

I'm trying to create an XML document, containing mostly ASCII, but
potentially containing some unicode characters. I want to convert this
all to UTF-8, but no matter what I try, I get an ASCII codec error.

I have tried using codec.open( filename, "w", "utf-8" )

I have tried converting the unicode inline with
string.encode( "utf-8").

I have tried various combination of the above.

I have tried UTF-7

I always get an ASCII codec error.

My environment is Python 2.3.4 built on redHat 7.3

What's the correct approach to this problem? Has anyone done this
successfully?

-----------------------------------------
Email provided by http://www.ntlhome.com/

From martin at v.loewis.de Thu Aug 5 12:41:59 2004
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Thu Aug 5 12:41:53 2004
Subject: [XML-SIG] XML Unicode and UTF-8
In-Reply-To: <20040805094651.UGJK7107.mta01-svc.ntlworld.com@[10.137.100.68]>
References: <20040805094651.UGJK7107.mta01-svc.ntlworld.com@[10.137.100.68]>
Message-ID: <41120EF7.8000804@v.loewis.de>

n.youngman@ntlworld.com wrote:
> I'm trying to create an XML document, containing mostly ASCII, but
> potentially containing some unicode characters. I want to convert
> this all to UTF-8, but no matter what I try, I get an ASCII codec
> error.

It would be good if you had shown what precisely you have tried.

> I have tried using codec.open( filename, "w", "utf-8" )

This works fine for me.

> I have tried converting the unicode inline with string.encode(
> "utf-8").

This also.

> I have tried various combination of the above.

This is not a good idea.

> I have tried UTF-7

This is worse.

> What's the correct approach to this problem?

State all the information that you have, preferably in the form:
1. this is what I did
2. this is what happened
3. this is what I expected to happen instead.

Regards,
Martin

From martin at v.loewis.de Thu Aug 5 12:44:00 2004
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Thu Aug 5 12:43:57 2004
Subject: [XML-SIG] Expat 1.95.8 has been released
In-Reply-To: <200408041142.34122.fdrake@acm.org>
References: <200408041142.34122.fdrake@acm.org>
Message-ID: <41120F70.9090204@v.loewis.de>

Fred L. Drake, Jr. wrote:
> For those who don't read the expat-discuss list, this is the announcement for
> Expat 1.95.8; it went to that list on July 24. I've updated the Expat
> included in Python 2.4, but haven't updated PyXML yet. The upcoming Python
> 2.4a2 release will include the new Expat.

I'd like to release PyXML at the end of next week. I'd be happy
to synchronize PyXML with Python - unless you do it faster.

Regards,
Martin

From martin at v.loewis.de Thu Aug 5 12:49:51 2004
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Thu Aug 5 12:49:46 2004
Subject: [XML-SIG] get the abolute path for a node
In-Reply-To: <410FC155.2000802@magic.fr>
References: <410FC155.2000802@magic.fr>
Message-ID: <411210CF.5090300@v.loewis.de>

Alexandre CONRAD wrote:
> I've been looking around, and apparently, there is a function that
> returns all the ancestors of a node. But I need this as a string path.
>
> Any ideas ?

There is no such function, and it would be difficult to define one.
For example, /rootnode/node/sub_node5 might refer to a different node,
if node has multiple children with a name of sub_node5. So one could
try to find a better-matching string, such as /rootnode/node/sub_node5[3].

Or, such a function might generate something like
/following::node()[1564], which is probably not what you want, but
would match what you have requested.
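
The ambiguity is easy to see on a small document. The following is a minimal sketch using xml.etree.ElementTree from modern Python 3 (an assumption; it is not one of the XPath implementations discussed on this list, and it only supports a subset of XPath evaluated relative to an element). The sample document is made up:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <node> has three children all named sub_node5
doc = ET.fromstring(
    "<rootnode><node>"
    "<sub_node5>a</sub_node5>"
    "<sub_node5>b</sub_node5>"
    "<sub_node5>c</sub_node5>"
    "</node></rootnode>"
)

# Without a positional predicate the path matches every sibling
matches = doc.findall("./node/sub_node5")

# With one, it identifies a single node (positions start at 1)
third = doc.find("./node/sub_node5[3]")

print(len(matches), third.text)
```

A path built only from element names therefore identifies a set of nodes; the position is what makes it point at one of them.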

Regards,
Martin

From n.youngman at ntlworld.com Thu Aug 5 13:03:17 2004
From: n.youngman at ntlworld.com (n.youngman@ntlworld.com)
Date: Thu Aug 5 13:05:32 2004
Subject: [XML-SIG] XML Unicode and UTF-8
Message-ID: <20040805110455.ZZZP7107.mta01-svc.ntlworld.com@[10.137.100.68]>

>
> From: "Martin v. Löwis"
> Date: 2004/08/05 Thu AM 10:41:59 GMT
> To: n.youngman@ntlworld.com
> CC: xml-sig@python.org
> Subject: Re: [XML-SIG] XML Unicode and UTF-8

> State all the information that you have, preferably in the form:
> 1. this is what I did
> 2. this is what happened
> 3. this is what I expected to happen instead.

Well, I was trying to state the problem and not impose my own
preconceptions of how it should be done, but if you want to go
straight into debugging that's fine with me.

First Pass:

    segment_tag.appendChild( charset_tag )
    unicode_tag = doc.createElement( 'unicode' )
    unicode_tag.appendChild( doc.createTextNode( segment[0] ) )
    segment_tag.appendChild( unicode_tag )

Inserts binary data into the segment/unicode tag

Saving with

    XMLFILE = open( filename, "w" )
    xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
    XMLFILE.close()

Leaves binary data in the document. I have assumed that this was raw
Unicode, may be that's a flawed assumption?

Second Pass:

Save with

    XMLFILE = open( filename, "w" )
    XMLFILE.write( xml.documentElement.toxml( "utf-8" ) )
    XMLFILE.close()

results in:

Traceback (most recent call last):
  File "./storemail.py", line 347, in ?
    save_message( message, raw_message, savedir + "/" + filename + ".xml" )
  File "./storemail.py", line 135, in save_message
    XMLFILE.write( xml.documentElement.toxml( "utf-8" ) )
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 48, in toxml
    return self.toprettyxml("", "", encoding)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 62, in toprettyxml
    self.writexml(writer, "", indent, newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 1039, in writexml
    _write_data(writer, "%s%s%s"%(indent, self.data, newl))
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 304, in _write_data
    writer.write(data)
  File "/usr/local/lib/python2.3/codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 0: ordinal not in range(128)

I hoped this would convert everything to UTF-8 and save it. The
appearance of an ASCII codec was a complete surprise to me.
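
One plausible reading of the ASCII codec surprise, assuming segment[0] is a byte string rather than a Unicode object: Python 2 has to decode a byte string before it can encode it, and it does so silently with the default 'ascii' codec. The sketch below is modern Python 3, which never does this implicitly, so the hidden step is written out by hand; the function name and the KOI8-R sample bytes are hypothetical:

```python
# Hypothetical input: four bytes of KOI8-R text, starting with 0xee
raw = b"\xee\xc5\xca\xcc"


def encode_like_python2(byte_string, target="utf-8", default="ascii"):
    """Spell out what Python 2 did when asked to encode a byte string:
    decode with the default codec first, then encode to the target."""
    return byte_string.decode(default).encode(target)


error = None
try:
    encode_like_python2(raw)
except UnicodeDecodeError as exc:
    # The failure comes from the hidden decode step, not from UTF-8
    error = exc

print(error.encoding, error.start, hex(raw[error.start]))

# Decoding with the charset the bytes really use avoids the problem
text = raw.decode("koi8-r")
fixed = text.encode("utf-8")
```

The error names the codec used for the implicit decode, which is why 'ascii' shows up even though only UTF-8 was requested.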

3rd pass:

    XMLFILE = codecs.open( filename, "w", "utf-8" )
    xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
    XMLFILE.close()

produces

Traceback (most recent call last):
  File "./storemail.py", line 347, in ?
    save_message( message, raw_message, savedir + "/" + filename + ".xml" )
  File "./storemail.py", line 137, in save_message
    xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 1039, in writexml
    _write_data(writer, "%s%s%s"%(indent, self.data, newl))
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 304, in _write_data
    writer.write(data)
  File "/usr/local/lib/python2.3/codecs.py", line 400, in write
    return self.writer.write(data)
  File "/usr/local/lib/python2.3/codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 0: ordinal not in range(128)

I hoped this would convert everything to UTF-8 and save it. The
appearance of an ASCII codec was a complete surprise to me.

I won't bore you with other combinations, which I didn't expect to
work. They didn't.

Neil Youngman

From n.youngman at ntlworld.com Thu Aug 5 13:14:57 2004
From: n.youngman at ntlworld.com (n.youngman@ntlworld.com)
Date: Thu Aug 5 13:17:12 2004
Subject: [XML-SIG] XML Unicode and UTF-8
Message-ID: <20040805111635.XRQ7107.mta01-svc.ntlworld.com@[10.137.100.68]>

>
> From: "Martin v. Löwis"
> Date: 2004/08/05 Thu AM 10:41:59 GMT
> To: n.youngman@ntlworld.com
> CC: xml-sig@python.org
> Subject: Re: [XML-SIG] XML Unicode and UTF-8

> State all the information that you have, preferably in the form:
> 1. this is what I did
> 2. this is what happened
> 3. this is what I expected to happen instead.
>
> Regards,
> Martin

I missed out pass 4:

Create the node with

    unicode_tag.appendChild( doc.createTextNode( segment[0].encode( "utf-8") ) )

and print with

    XMLFILE = open( filename, "w" )
    xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
    XMLFILE.close()

Produces the error

Traceback (most recent call last):
  File "./storemail.py", line 347, in ?
    save_message( message, raw_message, savedir + "/" + filename + ".xml" )
  File "./storemail.py", line 130, in save_message
    xml = message_to_xml( message, raw_message )
  File "./storemail.py", line 179, in message_to_xml
    entity_tag = entity_to_xml( entity, doc )
  File "./storemail.py", line 215, in entity_to_xml
    unicode_tag.appendChild( doc.createTextNode( segment[0].encode( "utf-8") ) )
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 0: ordinal not in range(128)

From martin at v.loewis.de Thu Aug 5 13:35:18 2004
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Thu Aug 5 13:35:13 2004
Subject: [XML-SIG] XML Unicode and UTF-8
In-Reply-To: <20040805110455.ZZZP7107.mta01-svc.ntlworld.com@[10.137.100.68]>
References: <20040805110455.ZZZP7107.mta01-svc.ntlworld.com@[10.137.100.68]>
Message-ID: <41121B76.6090603@v.loewis.de>

n.youngman@ntlworld.com wrote:
> First Pass:
>
> segment_tag.appendChild( charset_tag ) unicode_tag =
> doc.createElement( 'unicode' ) unicode_tag.appendChild(
> doc.createTextNode( segment[0] ) ) segment_tag.appendChild(
> unicode_tag )
>
> Inserts binary data into the segment/unicode tag

What is segment[0] here? In XML, there is no notion of "binary data".

> Leaves binary data in the document. I have assumed that this was raw
> Unicode, may be that's a flawed assumption?

There is nothing that could be called "raw Unicode", either. Again,
XML does not support binary data.

> consumed = self.encode(object, self.errors) UnicodeDecodeError:
> 'ascii' codec can't decode byte 0xee in position 0: ordinal not in
> range(128)
>
> I hoped this would convert everything to UTF-8 and save it . The
> appearance of an ASCII codec was a complete surprise to me.

You can only encode Unicode objects.
Since you apparently have put a byte string object () into the DOM tree, it needs to convert the byte string into a Unicode string first, before it can encode the Unicode string as UTF-8. For that, it uses the system default encoding, which is us-ascii. Now, the byte string contains the byte '\xee', which is not supported in ASCII. > 3rd pass: > > XMLFILE = codecs.open( filename, "w", "utf-8" ) > xml.documentElement.writexml( XMLFILE, indent="", addindent="", > newl="") XMLFILE.close() > > produces > > Traceback (most recent call last): File "./storemail.py", line 347, The problem is that your DOM tree is already ill-formed. You should not put binary data into a DOM tree. > I missed out pass 4: > > Create the node with > > unicode_tag.appendChild( doc.createTextNode( > segment[0].encode( "utf-8") ) ) Same issue: Apparently, segment[0] is a byte string, but you can only encode Unicode strings. *If* segment[0] is an UTF-8 encoded byte string, you should write segment[0].decode( "utf-8") Regards, Martin From n.youngman at ntlworld.com Thu Aug 5 14:22:43 2004 From: n.youngman at ntlworld.com (n.youngman@ntlworld.com) Date: Thu Aug 5 14:24:58 2004 Subject: [XML-SIG] XML Unicode and UTF-8 Message-ID: <20040805122422.FVRU7107.mta01-svc.ntlworld.com@[10.137.100.68]> > From: "Martin v. L?wis" > Date: 2004/08/05 Thu AM 11:35:18 GMT > To: n.youngman@ntlworld.com > CC: xml-sig@python.org > Subject: Re: [XML-SIG] XML Unicode and UTF-8 > > n.youngman@ntlworld.com wrote: > > First Pass: > > > > segment_tag.appendChild( charset_tag ) unicode_tag = > > doc.createElement( 'unicode' ) unicode_tag.appendChild( > > doc.createTextNode( segment[0] ) ) segment_tag.appendChild( > > unicode_tag ) > > > > Inserts binary data into the segment/unicode tag > > What is segment[0] here? In XML, there is no notion of "binary data". Sorry, I missed a key point out. Segment[0] is the decoded part of the output from email.Header.decode_header(). 
I believed this was a unicode string, but checking back in the documentation it doesn't actually say that, so I guess at least part of the problem is I'm getting some sort of binary data, which I thought was Unicode, but isn't. > > Leaves binary data in the document. I have assumed that this was raw > > Unicode, may be that's a flawed assumption? > > There is nothing that could be called "raw Unicode", either. Again, > XML does not support binary data. XML doesn't, Python does. If I ask it to print without encoding it, I don't know whether it's passed through unchanged. Raw Unicode seems to me like a reasonable term for the data in a unicode string. > > consumed = self.encode(object, self.errors) UnicodeDecodeError: > > 'ascii' codec can't decode byte 0xee in position 0: ordinal not in > > range(128) > > > > I hoped this would convert everything to UTF-8 and save it . The > > appearance of an ASCII codec was a complete surprise to me. > > You can only encode Unicode objects. Since you apparently have put > a byte string object () into the DOM tree, it needs to > convert the byte string into a Unicode string first, before it > can encode the Unicode string as UTF-8. For that, it uses the system > default encoding, which is us-ascii. > > Now, the byte string contains the byte '\xee', which is not supported > in ASCII. OK. That kind of makes sense, but I now have to figure out what is in the byte string and how to transform it to UTF-8. I guess that it's actually raw data in the character set given by the other part of the pair. Assuming it's a string in koi8-r, I have to get a codec that witll transform koi8-r to UTF-8, probably via unicode. OK. I read the opaque documentation^W^W fine manual for a while, then googled for a while, and finally decided to just hack about with what I had. 
I now have

charset_tag.appendChild( doc.createTextNode( segment[1] ) )
unicode = segment[0].decode( segment[1] ).encode( "utf-8")
unicode_tag = doc.createElement( 'unicode' )
unicode_tag.appendChild( doc.createTextNode( unicode ) )

This appears to be working, or at least it doesn't generate any errors.

Martin

You have neatly pinpointed where I was confused. Your assistance is much
appreciated.

Many Thanks

Neil Youngman

-----------------------------------------
Email provided by http://www.ntlhome.com/

From xmlsig at codeweld.com Thu Aug 5 14:51:09 2004
From: xmlsig at codeweld.com (xmlsig@codeweld.com)
Date: Thu Aug 5 14:51:12 2004
Subject: [XML-SIG] get the abolute path for a node
In-Reply-To: <411210CF.5090300@v.loewis.de>
References: <410FC155.2000802@magic.fr> <411210CF.5090300@v.loewis.de>
Message-ID: <1091710269.41122d3db3cec@webmail.codeweld.com>

> Alexandre CONRAD wrote:
> > I've been looking around, and apparently, there is a function that
> > returns all the ascensor of a node. But I need this as a string path.
> >
> > Any ideas ?
>
> There is no such function, and it would be difficult to define one.
> For example, /rootnode/node/sub_node5 might refer to a different node,
> if node has multiple children with a name of sub_node5. So one could
> try to find a better-matching string, such as /rootnode/node/sub_node5[3].
>
> Or, such a function might generate something like
> /following::node()[1564], which is probably not what you want, but
> would match what you have requested.
>
> Regards,
> Martin

Does this help?

def abs_path( node ):
    successors = 1
    parent = node.previousSibling
    while parent:
        if parent.nodeName == node.nodeName: successors += 1
        parent = parent.previousSibling
    name = node.nodeName == '#text' and 'text()' or node.nodeName
    path = successors>1 and '/%s[%s]'%(name,successors) or '/%s'%name
    if node.parentNode and node.parentNode.nodeName != '#document':
        return abs_path( node.parentNode )+path
    return path

Kind Regards
Florian

From paul.boddie at ementor.no Thu Aug 5 15:26:34 2004
From: paul.boddie at ementor.no (Paul Boddie)
Date: Thu Aug 5 15:26:42 2004
Subject: [XML-SIG] XML Unicode and UTF-8
Message-ID:

n.youngman@ntlworld.com wrote:
>
> OK. That kind of makes sense, but I now have to figure out what is in the
> byte string and how to transform it to UTF-8. I guess that it's actually
> raw data in the character set given by the other part of the pair.
> Assuming it's a string in koi8-r, I have to get a codec that will
> transform koi8-r to UTF-8, probably via unicode.

I've only been following this thread in a vague way, but the easiest way
to approach this problem and many others that you might have with
character encodings is to convert input data to Unicode objects as soon
as possible. Note that there's a distinction between Unicode (which you
can think of as a scheme where any character value can be stored and
addressed) and UTF-8 (which is a way of serialising most of those
character values in a byte stream).

When you're converting to Unicode you aren't converting to UTF-8 or any
other such representation - you're actually putting the data in Python
Unicode objects. Meanwhile, UTF-8 is a side issue which you only need to
think about when you're producing textual output for other systems to
process - you should be able to keep UTF-8 data out of your program.

> OK. I read the opaque documentation^W^W fine manual for a while, then
> googled for a while, and finally decided to just hack about with what I
> had.
>
> I now have
>
> charset_tag.appendChild( doc.createTextNode( segment[1] ) )
> unicode = segment[0].decode( segment[1] ).encode( "utf-8")

This actually produces a byte (normal Python) string containing a UTF-8
representation of the text. This is not the same as having that text in a
Unicode object, which is the most useful form to have it in. Consider
checking the length of the text - you won't necessarily get the true
number of characters. (Moreover, you're trampling on the unicode function
here.)

Do this instead:

utext = segment[0].decode( segment[1] )

> unicode_tag = doc.createElement( 'unicode' )
> unicode_tag.appendChild( doc.createTextNode( unicode ) )

And this:

unicode_tag.appendChild( doc.createTextNode( utext ) )

When you need to serialise this, the serialiser should then be able to
choose a suitable character encoding (eg. UTF-8) without running into the
problems you were experiencing.

Paul

From martin at v.loewis.de Thu Aug 5 15:30:48 2004
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Thu Aug 5 15:30:43 2004
Subject: [XML-SIG] XML Unicode and UTF-8
In-Reply-To: <20040805122422.FVRU7107.mta01-svc.ntlworld.com@[10.137.100.68]>
References: <20040805122422.FVRU7107.mta01-svc.ntlworld.com@[10.137.100.68]>
Message-ID: <41123688.5000600@v.loewis.de>

n.youngman@ntlworld.com wrote:
> Sorry, I missed a key point out. Segment[0] is the decoded part of
> the output from email.Header.decode_header(). I believed this was a
> unicode string, but checking back in the documentation it doesn't
> actually say that, so I guess at least part of the problem is I'm
> getting some sort of binary data, which I thought was Unicode, but
> isn't.

Indeed. decode_header gives you a list of (byte, encoding) pairs
precisely because it does not attempt to decode them. In turn, it
does not try to decode them because Python might not have a codec
for some of the encodings.
Generally, you would do

def u_decode_header(header):
    result = []
    for h, enc in Header.decode_header(header):
        result.append(h.decode(enc))
    return u"".join(result)

which will raise a LookupError if there is an unsupported encoding.
As you are going to put the header into an XML document, you really
have little choice what to do in that case - if giving up is not
acceptable,

        try:
            result.append(h.decode(enc))
        except LookupError:
            result.append(h.decode('us-ascii', 'replace'))

might be your next best choice: this will assume that any encoding
is an ASCII superset, and replace all non-ASCII bytes with question
marks.

All that decode_header does is decode the transfer encoding (i.e.
Q or B).

>>> Leaves binary data in the document. I have assumed that this was
>>> raw Unicode, may be that's a flawed assumption?
[...]
> XML doesn't, Python does. If I ask it to print without encoding it, I
> don't know whether it's passed through unchanged. Raw Unicode seems
> to me like a reasonable term for the data in a unicode string.

Ah, that. Don't worry about the internal representation of a Unicode
string. It may have 2 or 4 bytes, and be big or little endian. You
are never going to see that directly, as there is *always* an encoding
going on to convert the Unicode object into a byte string.

Of course, you could create a buffer object to really find out, but
that should not be done.

> You have neatly pinpointed where I was confused. Your assistance is
> much appreciated.

You are welcome!

Martin

From fdrake at acm.org Thu Aug 5 15:52:31 2004
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu Aug 5 15:52:41 2004
Subject: [XML-SIG] Expat 1.95.8 has been released
In-Reply-To: <41120F70.9090204@v.loewis.de>
References: <200408041142.34122.fdrake@acm.org> <41120F70.9090204@v.loewis.de>
Message-ID: <200408050952.31595.fdrake@acm.org>

On Thursday 05 August 2004 06:44 am, Martin v. Löwis wrote:
> I'd like to release PyXML at the end of next week.
I'd be happy to > synchronize PyXML with Python - unless you do it faster. Sounds like a good plan. It's not ready to sync yet; some of the changes to Expat will allow more efficient exiting of the parse when exceptions occur, but I've not yet made the changes to pyexpat to make that happen. I'd also like to expose the suspend/resume capability we've added to the parser. -Fred -- Fred L. Drake, Jr. From aconrad.tlv at magic.fr Thu Aug 5 16:22:16 2004 From: aconrad.tlv at magic.fr (Alexandre CONRAD) Date: Thu Aug 5 16:22:18 2004 Subject: [XML-SIG] get the abolute path for a node In-Reply-To: <1091710269.41122d3db3cec@webmail.codeweld.com> References: <410FC155.2000802@magic.fr> <411210CF.5090300@v.loewis.de> <1091710269.41122d3db3cec@webmail.codeweld.com> Message-ID: <41124298.6090705@magic.fr> > Does this help? > > def abs_path( node ): > successors = 1 > parent = node.previousSibling > while parent: > if parent.nodeName == node.nodeName: successors += 1 > parent = parent.previousSibling > name = node.nodeName == '#text' and 'text()' or node.nodeName > path = successors>1 and '/%s[%s]'%(name,successors) or '/%s'%name > if node.parentNode and node.parentNode.nodeName != '#document': > return abs_path( node.parentNode )+path > return path Because I always strip out spaces in XML documents, and because I want to show the 1st node with node[1], I changed your code so: - name = node.nodeName == '#text' and 'text()' or node.nodeName - path = successors>1 and '/%s[%s]'%(name,successors) or '/%s'%name + path = '/%s[%s]' % (node.nodeName, successors) This is function is pretty neat. But still, there is 1 more little thing that I'm having a hard time figuring out how to fix. I keep getting "/playlist[2]" as the root node. I can't have 2 root nodes anyway... <-- shows: /playlist[2]/group[1] <-- shows: /playlist[2]/group[1]/video[1] <-- shows: /playlist[2]/group[1]/video[2] And this looping code inside the function everytime makes me loose track of what's doing on. 
Well done though. Best regards, -- Alexandre CONRAD - TLV Research & Development tel : +33 1 30 80 55 05 fax : +33 1 30 56 55 06 6, rue de la plaine 78860 - SAINT NOM LA BRETECHE FRANCE From martin at v.loewis.de Thu Aug 5 16:30:49 2004 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Thu Aug 5 16:30:46 2004 Subject: [XML-SIG] Expat 1.95.8 has been released In-Reply-To: <200408050952.31595.fdrake@acm.org> References: <200408041142.34122.fdrake@acm.org> <41120F70.9090204@v.loewis.de> <200408050952.31595.fdrake@acm.org> Message-ID: <41124499.3080200@v.loewis.de> Fred L. Drake, Jr. wrote: > Sounds like a good plan. It's not ready to sync yet; some of the changes to > Expat will allow more efficient exiting of the parse when exceptions occur, > but I've not yet made the changes to pyexpat to make that happen. I'm very much in favour of many small sync steps, instead of a single large one - the time needed to synchronise them grows with the number of changes (atleast the way I do it normally, change by change). So I'll see what I can do. Regards, Martin From fdrake at acm.org Thu Aug 5 16:57:32 2004 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Thu Aug 5 16:57:42 2004 Subject: [XML-SIG] Expat 1.95.8 has been released In-Reply-To: <41124499.3080200@v.loewis.de> References: <200408041142.34122.fdrake@acm.org> <200408050952.31595.fdrake@acm.org> <41124499.3080200@v.loewis.de> Message-ID: <200408051057.32432.fdrake@acm.org> On Thursday 05 August 2004 10:30 am, Martin v. L?wis wrote: > I'm very much in favour of many small sync steps, instead of a single > large one - the time needed to synchronise them grows with the number > of changes (atleast the way I do it normally, change by change). So Ok, if you want to use small steps, then go ahead and pick up my last two changes: - Update the Expat sources to from Expat 1.95.8 - Expose additional error constants in pyexpat -Fred -- Fred L. Drake, Jr. 
From xmlsig at codeweld.com Thu Aug 5 17:29:38 2004 From: xmlsig at codeweld.com (xmlsig@codeweld.com) Date: Thu Aug 5 17:29:40 2004 Subject: [XML-SIG] get the abolute path for a node In-Reply-To: <41124298.6090705@magic.fr> References: <410FC155.2000802@magic.fr> <411210CF.5090300@v.loewis.de> <1091710269.41122d3db3cec@webmail.codeweld.com> <41124298.6090705@magic.fr> Message-ID: <1091719778.41125262b3d17@webmail.codeweld.com> Quoting Alexandre CONRAD : > > Does this help? > > > > def abs_path( node ): > > successors = 1 > > parent = node.previousSibling > > while parent: > > if parent.nodeName == node.nodeName: successors += 1 > > parent = parent.previousSibling > > name = node.nodeName == '#text' and 'text()' or node.nodeName > > path = successors>1 and '/%s[%s]'%(name,successors) or '/%s'%name > > if node.parentNode and node.parentNode.nodeName != '#document': > > return abs_path( node.parentNode )+path > > return path > > > Because I always strip out spaces in XML documents, and because I want > to show the 1st node with node[1], I changed your code so: > > - name = node.nodeName == '#text' and 'text()' or node.nodeName > - path = successors>1 and '/%s[%s]'%(name,successors) or '/%s'%name > + path = '/%s[%s]' % (node.nodeName, successors) > > > This is function is pretty neat. But still, there is 1 more little thing > that I'm having a hard time figuring out how to fix. I keep getting > "/playlist[2]" as the root node. I can't have 2 root nodes anyway... > > > > <-- shows: /playlist[2]/group[1] > <-- shows: /playlist[2]/group[1]/video[1] > <-- shows: /playlist[2]/group[1]/video[2] > > > > And this looping code inside the function everytime makes me loose track > of what's doing on. Well done though. 
>
> Best regards,
> --
> Alexandre CONRAD - TLV
> Research & Development
> tel : +33 1 30 80 55 05
> fax : +33 1 30 56 55 06
> 6, rue de la plaine
> 78860 - SAINT NOM LA BRETECHE
> FRANCE

The line that translates '#text' to 'text()' has the advantage that it
turns the path into a valid XPath. The other line, which eliminates [1],
still preserves this valid XPath, and I thought it's nicer to look at :).

I found the source and the cure of the problem. The source is (as you can
easily verify with http://www.codeweld.com/files/dom_view.pyw, just use
'file://yourfile.xml') that the Sax2 reader for some reason puts a second
node with the same nodeName in. The cure is to take the localName for the
comparison, as this name seems to be different for those. Additionally,
it's also different for some other nodes, which might otherwise have made
trouble in border situations. This is the new function. (I also gave one
variable a more reasonable name; it was confusing otherwise.)

def abs_path( node ):
    successors = 1
    previous = node.previousSibling
    while previous:
        if previous.localName == node.localName: successors += 1
        previous = previous.previousSibling
    path = '/%s[%s]' % (node.nodeName, successors)
    if node.parentNode.nodeName != '#document':
        return abs_path( node.parentNode )+path
    return path

Kind Regards
Florian

From aconrad.tlv at magic.fr Thu Aug 5 18:21:41 2004
From: aconrad.tlv at magic.fr (Alexandre CONRAD)
Date: Thu Aug 5 18:21:43 2004
Subject: [XML-SIG] get the abolute path for a node
In-Reply-To: <1091719778.41125262b3d17@webmail.codeweld.com>
References: <410FC155.2000802@magic.fr> <411210CF.5090300@v.loewis.de>
    <1091710269.41122d3db3cec@webmail.codeweld.com>
    <41124298.6090705@magic.fr>
    <1091719778.41125262b3d17@webmail.codeweld.com>
Message-ID: <41125E95.5070204@magic.fr>

xmlsig@codeweld.com wrote:
> Quoting Alexandre CONRAD :
>
>
>>>Does this help?
>>> >>>def abs_path( node ): >>> successors = 1 >>> parent = node.previousSibling >>> while parent: >>> if parent.nodeName == node.nodeName: successors += 1 >>> parent = parent.previousSibling >>> name = node.nodeName == '#text' and 'text()' or node.nodeName >>> path = successors>1 and '/%s[%s]'%(name,successors) or '/%s'%name >>> if node.parentNode and node.parentNode.nodeName != '#document': >>> return abs_path( node.parentNode )+path >>> return path >> >> >>Because I always strip out spaces in XML documents, and because I want >>to show the 1st node with node[1], I changed your code so: >> >>- name = node.nodeName == '#text' and 'text()' or node.nodeName >>- path = successors>1 and '/%s[%s]'%(name,successors) or '/%s'%name >>+ path = '/%s[%s]' % (node.nodeName, successors) >> >> >>This is function is pretty neat. But still, there is 1 more little thing >>that I'm having a hard time figuring out how to fix. I keep getting >>"/playlist[2]" as the root node. I can't have 2 root nodes anyway... >> >> >> >> <-- shows: /playlist[2]/group[1] >> <-- shows: /playlist[2]/group[1]/video[1] >> <-- shows: /playlist[2]/group[1]/video[2] >> >> >> >>And this looping code inside the function everytime makes me loose track >>of what's doing on. Well done though. >> >>Best regards, >>-- >>Alexandre CONRAD - TLV >>Research & Development >>tel : +33 1 30 80 55 05 >>fax : +33 1 30 56 55 06 >>6, rue de la plaine >>78860 - SAINT NOM LA BRETECHE >>FRANCE > > > The line that ranslates '#text' to 'text()' has the advantage that it translates > the path to a valid xpath the other line that eliminates [1] still preserves > this valid xpath, and I thought it's nicer to look at :). > I found the source and the cure of the problem. The source is ( as you can > easely verify with http://www.codeweld.com/files/dom_view.pyw, just use > 'file://yourfile.xml' ) that the Sax2 reader for some reason puts a second node > with the same nodeName in. 
The cure is to take for comparison the localName, as
> this name seems to be different for those. Additionally he's also different for
> some other nodes which might otherwise in border situations made trouble. This
> is the new function. ( I also gave one variable a more reasonable name, was
> confusing otherwise )
>
> def abs_path( node ):
>     successors = 1
>     previous = node.previousSibling
>     while previous:
>         if previous.localName == node.localName: successors += 1
>         previous = previous.previousSibling
>     path = '/%s[%s]' % (node.nodeName, successors)
>     if node.parentNode.nodeName != '#document':
>         return abs_path( node.parentNode )+path
>     return path
>
> Kind Regards
> Florian

Ur da man !! :D

I fixed the prob on my side but was doing a dirty trick :

    if parent.nodeName == node.nodeName and parent.nodeName != node.ownerDocument.firstChild.nodeName: successors += 1

Uuugh ! I don't like that. I feel better that you have found the
solution. Don't like to know there's dirty code in my application. ;)

Thank you so much for your help. That's a great function to be able to
build the xpath of a given node.

Very best regards,
--
Alexandre CONRAD - TLV
Research & Development
tel : +33 1 30 80 55 05
fax : +33 1 30 56 55 06
6, rue de la plaine
78860 - SAINT NOM LA BRETECHE
FRANCE

From rsalz at datapower.com Thu Aug 5 18:37:39 2004
From: rsalz at datapower.com (Rich Salz)
Date: Thu Aug 5 18:37:09 2004
Subject: [XML-SIG] get the abolute path for a node
In-Reply-To: <41125E95.5070204@magic.fr>
References: <410FC155.2000802@magic.fr> <411210CF.5090300@v.loewis.de>
    <1091710269.41122d3db3cec@webmail.codeweld.com>
    <41124298.6090705@magic.fr>
    <1091719778.41125262b3d17@webmail.codeweld.com>
    <41125E95.5070204@magic.fr>
Message-ID: <41126253.8050107@datapower.com>

FYI, here is how ZSI does it; walking *up* from an element to a provided
root:

def _backtrace(elt, dom):
    '''Return a "backtrace" from the given element to the DOM root,
    in XPath syntax.
    '''
    s = ''
    while elt != dom:
        name, parent = elt.nodeName, elt.parentNode
        if parent is None: break
        matches = [ c for c in _child_elements(parent)
                        if c.nodeName == name ]
        if len(matches) == 1:
            s = '/' + name + s
        else:
            i = matches.index(elt) + 1
            s = ('/%s[%d]' % (name, i)) + s
        elt = parent
    return s

--
Rich Salz, Chief Security Architect
DataPower Technology http://www.datapower.com
XS40 XML Security Gateway http://www.datapower.com/products/xs40.html
XML Security Overview http://www.datapower.com/xmldev/xmlsecurity.html

From webworldl at yahoo.com Thu Aug 5 22:21:56 2004
From: webworldl at yahoo.com (Luke Bradley)
Date: Thu Aug 5 22:21:58 2004
Subject: [XML-SIG] need help: Sax can't read w3 dtds?
Message-ID: <20040805202156.62158.qmail@web53504.mail.yahoo.com>

Hi, I am looking for help with processing XHTML documents in Python with
SAX or DOM. If this is not the right place to ask, could you please refer
me to a good place?

My problem is that when I try to parse XHTML1.1 documents with Python's
SAX implementation, it throws an error claiming that there are errors in
the W3C's DTD's.

given an XHTML page generated by the W3's TIDY generator called
hello.html:

Hello World

Hello World!

and the python code:

import xml.sax.handler
xml.sax.parse("hello.html", xml.sax.handler.ContentHandler() )

a fatal error occurs with the following stacktrace:

Traceback (most recent call last):
  File "D:/projects/pyper/saxtest.py", line 4, in -toplevel-
    xml.sax.handler.ContentHandler()
  File "D:\PYTHON23\Lib\site-packages\_xmlplus\sax\__init__.py", line 31, in parse
    parser.parse(filename_or_stream)
  File "D:\PYTHON23\Lib\site-packages\_xmlplus\sax\expatreader.py", line 109, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "D:\PYTHON23\Lib\site-packages\_xmlplus\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "D:\PYTHON23\Lib\site-packages\_xmlplus\sax\expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "D:\PYTHON23\Lib\site-packages\_xmlplus\sax\handler.py", line 38, in fatalError
    raise exception
SAXParseException: http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-framework-1.mod:89:0: error in processing external entity reference

any ideas? am I missing something basic? thanks.

__________________________________
Do you Yahoo!?
Yahoo! Mail Address AutoComplete - You start. We finish.
http://promotions.yahoo.com/new_mail

From mike at skew.org Thu Aug 5 22:27:29 2004
From: mike at skew.org (Mike Brown)
Date: Thu Aug 5 22:27:26 2004
Subject: [XML-SIG] XML Unicode and UTF-8
In-Reply-To: "from Paul Boddie at Aug 5, 2004 03:26:34 pm"
Message-ID: <200408052027.i75KRT01076110@chilled.skew.org>

Paul Boddie wrote:
> Do this instead:
>
> utext = segment[0].decode( segment[1] )

The resulting Unicode object may contain characters which are not allowed
in XML, and thus the text may not be serializable (at least not in a way
that would produce well-formed XML).

To embed arbitrary bytes in XML, the usual advice is to first convert the
bytes into a character sequence that is permitted in XML. Base64 is a
popular and easily implemented option, albeit inefficient.
The article at
http://www.javaworld.com/javaworld/javatips/jw-javatip117-p2.html suggests
that a custom Huffman implementation is nearly 1:1. I've mapped bytes into
the Private Use Area of Unicode before, too, although that's definitely
not efficient.

From chekhan at gepros.com.tn Thu Aug 5 23:40:01 2004
From: chekhan at gepros.com.tn (Gepros)
Date: Fri Aug 6 00:34:29 2004
Subject: [XML-SIG] Making contact - Gepros Tunisie - partnership project
Message-ID: <20040805223933.298E83790B@smtp.gnet.tn>

Hello,

We are contacting you with a view to developing a business relationship
with you.

Field of activity: Our company, "Gépro's", is an industrial company
specialising in the production of cereal-based food products (wheat,
maize, rice and multigrain): breakfast cereals and savoury snacks. Our
products are also intended for manufacturers of ice cream, yoghurt and
chocolate.

Production unit: Gépro's is ISO 9001 and HACCP certified and has new,
first-rate equipment.

Location: Tunis - Tunisia - North Africa

Our markets: Our distribution network currently covers the Maghreb market
(Tunisia, Algeria and Libya) and the Middle East. We are achieving
double-digit annual growth and wish to grow further. We invite you to
visit our website www.gepros.com.tn for more information about our
company.

Objectives:

1. We wish to develop distribution partnerships in your markets. Two
   cases are possible:
   a. Distribution of our products under our brand name
   b. Distribution of our products under your brand name, if you have a
      brand to promote
2. Development of an industrial partnership. This partnership can take
   several forms:
   a. development of subcontracting relationships on your behalf
   b. production of your products under your brand name, with a view to
      marketing them in the Tunisian, Maghreb, African and Middle Eastern
      markets. Advantages:
      i. development of your markets
      ii. proximity to your target markets
      iii. reduced storage costs and adaptation of production to demand
           in the respective target markets
      iv. exemption from customs duties in the Maghreb (bilateral
          agreements) and Middle Eastern markets
      v. investment incentives in Tunisia
         http://www.tunisieindustrie.nat.tn

From pb3 at bizbuzz.pbf.gatech.edu Fri Aug 6 01:06:05 2004
From: pb3 at bizbuzz.pbf.gatech.edu (Paula_Britton)
Date: Fri Aug 6 01:06:08 2004
Subject: [XML-SIG] Away from the Office until 8/16/04
Message-ID: <200408052306.i75N65910704@bizbuzz.pbf.gatech.edu>

I will be out of the office starting August 5th and returning August
16th. Please contact Judy Whitfield with any issues at 404-894-9054 or
judy.whitfield@business.gatech.edu.

Thank You.
Paula Britton

From and-xml at doxdesk.com Fri Aug 6 10:06:05 2004
From: and-xml at doxdesk.com (Andrew Clover)
Date: Fri Aug 6 10:05:28 2004
Subject: [XML-SIG] need help: Sax can't read w3 dtds?
In-Reply-To: <20040805202156.62158.qmail@web53504.mail.yahoo.com>
References: <20040805202156.62158.qmail@web53504.mail.yahoo.com>
Message-ID: <41133BED.7010108@doxdesk.com>

Luke Bradley wrote:
> My problem is that when I try to parse XHTML1.1
> documents with Python's SAX implementation, it throws
> an error claiming that there are errors in the W3C's
> DTD's.

It's right - there are. Many other parsers won't accept them either. The
(first) error is at line 37 char 20 of
http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-special.ent:

Since character references are decoded once at entity-definition time
this actually defines the entity lt as containing '&<', which is grossly
ill-formed as well as being incompatible with <'s canonical content.
Exactly how much of an error this is in XML is a arguable point, given that this entity is not actually used after its declaration. However parsers that need to report the declared entity content independently of their references (such as DOM implementations) cannot possibly allow it. This is a bug in XHTML Modularization that makes handling today's XHTML 1.1 with validation a bit of a non-starter (along with all the other problems connected with XHTML 1.1). Unfortunately W3C process has prevented the error from being fixed before the forthcoming XHTML Modularization Second Edition. If you need to handle XHTML 1.1 at the moment, do it without validation/external entities. -- Andrew Clover mailto:and@doxdesk.com http://www.doxdesk.com/ From mike at skew.org Fri Aug 6 10:14:26 2004 From: mike at skew.org (Mike Brown) Date: Fri Aug 6 10:14:26 2004 Subject: [XML-SIG] need help: Sax can't read w3 dtds? In-Reply-To: <41133BED.7010108@doxdesk.com> "from Andrew Clover at Aug 6, 2004 05:06:05 pm" Message-ID: <200408060814.i768EQkR078907@chilled.skew.org> Andrew Clover wrote: > It's right - there are. Many other parsers won't accept them either. The > (first) error is at line 37 char 20 of > http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-special.ent: > > > That's not an error. Read the spec carefully. From and-xml at doxdesk.com Fri Aug 6 15:44:21 2004 From: and-xml at doxdesk.com (Andrew Clover) Date: Fri Aug 6 15:43:48 2004 Subject: [XML-SIG] need help: Sax can't read w3 dtds? In-Reply-To: <200408060814.i768EQkR078907@chilled.skew.org> References: <200408060814.i768EQkR078907@chilled.skew.org> Message-ID: <41138B35.3050007@doxdesk.com> Mike Brown wrote: >> > That's not an error. It *is* an error, regardless of your opinion of whether XML technically allows "&<" as a literal entity value(*). 
XML 1.0 SE 4.6 says: If the entities lt or amp are declared, they must be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped The entity value "&<" yields replacement text "&<" which clearly is not a character reference to the less-than sign. This is acknowledged and fixed in m12n SE: http://www.w3.org/TR/2004/WD-xhtml-modularization-20040218/ dtd_module_defs.html#a_module_XHTML_Special_Characters * - IMO such a replacement text technically allowable by implication of XML 1.0 SE 2.3: Although the EntityValue production allows the definition of an entity consisting of a single explicit < in the literal (e.g., ), it is strongly advised to avoid this practice since any reference to that entity will cause a well-formedness error. but it's incompatible with tools like DOM which require the replacement text to be parsed as-is without an explicit entity reference, to form the content of the Entity node. -- Andrew Clover mailto:and@doxdesk.com http://www.doxdesk.com/ From mike at skew.org Fri Aug 6 17:56:40 2004 From: mike at skew.org (Mike Brown) Date: Fri Aug 6 17:56:38 2004 Subject: [XML-SIG] need help: Sax can't read w3 dtds? In-Reply-To: <41138B35.3050007@doxdesk.com> "from Andrew Clover at Aug 6, 2004 10:44:21 pm" Message-ID: <200408061556.i76FueSO081407@chilled.skew.org> Andrew Clover wrote: > >> > > > That's not an error. > > It *is* an error Sorry, I am used to correcting people on that one. I thought the issue was the leading "&". You're right, though; I overlooked the extra "&". I apologize for firing off that terse email 5 minutes before going to bed :) -Mike From and-xml at doxdesk.com Fri Aug 6 20:28:59 2004 From: and-xml at doxdesk.com (Andrew Clover) Date: Fri Aug 6 20:28:24 2004 Subject: [XML-SIG] need help: Sax can't read w3 dtds? 
In-Reply-To: <200408061556.i76FueSO081407@chilled.skew.org> References: <200408061556.i76FueSO081407@chilled.skew.org> Message-ID: <4113CDEB.1050707@doxdesk.com> Mike Brown wrote: > I thought the issue was the leading "&". > You're right, though; I overlooked the extra "&". You can be forgiven - so did W3C! -- Andrew Clover mailto:and@doxdesk.com http://www.doxdesk.com/ From n.youngman at ntlworld.com Sat Aug 7 08:48:18 2004 From: n.youngman at ntlworld.com (Neil Youngman) Date: Sat Aug 7 08:48:20 2004 Subject: [XML-SIG] XML Unicode and UTF-8 In-Reply-To: <200408052027.i75KRT01076110@chilled.skew.org> References: <200408052027.i75KRT01076110@chilled.skew.org> Message-ID: <200408070748.18432.n.youngman@ntlworld.com> On Thursday 05 Aug 2004 9:27 pm, Mike Brown wrote: > Paul Boddie wrote: > > Do this instead: > > > > utext = segment[0].decode( segment[1] ) > > The resulting Unicode object may contain characters which are not allowed > in XML, and thus the text may not be serializable (at least not in a way > that would produce well-formed XML). Yes, but it's being written out through a UTF-8 codec to a file which specifies 'charset="utf-8"'. AIUI the python UTF-8 codec can detect that it's got a unicode string and convert it to utf-8 with no futher programmer intervention. Of course a week ago, Python was just another buzzword to me, so I could be wrong. > To embed arbitrary bytes in XML, the usual advice is to first convert the > bytes into a character sequence that is permitted in XML. Base64 is a > popular and easily implemented option, albeit inefficient. The article at > http://www.javaworld.com/javaworld/javatips/jw-javatip117-p2.html suggests > that a custom Huffman implementation is nearly 1:1. I've mapped bytes into > the Private Use Area of Unicode before, too, although that's definitely not > efficient. All neat ideas, but as I want UTF-8 encoding, they would just add an unnecessary layer of complexity. 
Thanks for trying to help, but I think I've got what I need. Neil Youngman From thedoenerking at gmx.de Sat Aug 7 09:32:38 2004 From: thedoenerking at gmx.de (thedoenerking@gmx.de) Date: Sat Aug 7 09:32:56 2004 Subject: [XML-SIG] Returned mail: see transcript for details Message-ID: <20040807073253.2DC2C1E4003@bag.python.org> -------------- next part -------------- A non-text attachment was scrubbed... Name: attachment.zip Type: application/octet-stream Size: 29402 bytes Desc: not available Url : http://mail.python.org/pipermail/xml-sig/attachments/20040807/310193e4/attachment-0001.obj From fredrik at pythonware.com Sat Aug 7 16:42:56 2004 From: fredrik at pythonware.com (Fredrik Lundh) Date: Sat Aug 7 16:41:20 2004 Subject: [XML-SIG] Re: XML Unicode and UTF-8 References: <200408052027.i75KRT01076110@chilled.skew.org> <200408070748.18432.n.youngman@ntlworld.com> Message-ID: Neil Youngman wrote: > Yes, but it's being written out through a UTF-8 codec to a file which > specifies 'charset="utf-8"'. AIUI the python UTF-8 codec can detect that it's > got a unicode string and convert it to utf-8 with no futher programmer > intervention. Python's UTF-8 codec takes a Unicode object, and generates an 8-bit string object. If you attempt to "encode" an 8-bit string object, it is converted to a Unicode object first. This conversion only works if the 8-bit string contains ASCII characters only. There's no such thing as an 8-bit Unicode string. 
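
The rule Fredrik describes can be sketched as follows. This is only an
illustration, and it assumes a current Python 3 interpreter rather than
the Python 2.3 discussed in the thread: Python 3 no longer converts byte
strings implicitly, so the hypothetical helper encode_utf8 (not part of
any library, and not code anyone posted) spells out the ASCII decoding
step that the 2004 codec machinery performed behind the scenes.

```python
# Illustrative sketch only (Python 3). encode_utf8 is a hypothetical helper,
# not part of the standard library or of any code posted in this thread.

def encode_utf8(value):
    """Imitate what Python 2 did for value.encode('utf-8')."""
    if isinstance(value, bytes):
        # A byte string has to become text first. Python 2 did this
        # silently with the default encoding, us-ascii.
        value = value.decode("ascii")
    # Text in, UTF-8 encoded bytes out.
    return value.encode("utf-8")


# Text (a Unicode object in the thread's terms) encodes without trouble.
assert encode_utf8("\u00ee") == b"\xc3\xae"

# A byte string containing only ASCII survives the round trip.
assert encode_utf8(b"abc") == b"abc"

# A byte string with a non-ASCII byte fails in the decode step, which is
# why an 'ascii' codec can show up in a traceback about UTF-8 output.
try:
    encode_utf8(b"\xee")
except UnicodeDecodeError as exc:
    failed_codec = exc.encoding
assert failed_codec == "ascii"
```

The failing case uses the same byte, 0xee, that appeared in the traceback
earlier in the thread; decoding it with the charset reported alongside it
(for example koi8-r) before it reaches the codec avoids the error.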
From mike at skew.org Sat Aug 7 19:59:43 2004 From: mike at skew.org (Mike Brown) Date: Sat Aug 7 19:59:50 2004 Subject: [XML-SIG] XML Unicode and UTF-8 In-Reply-To: <200408070748.18432.n.youngman@ntlworld.com> "from Neil Youngman at Aug 7, 2004 07:48:18 am" Message-ID: <200408071759.i77HxhXG087217@chilled.skew.org> Neil Youngman wrote: > On Thursday 05 Aug 2004 9:27 pm, Mike Brown wrote: > > The resulting Unicode object may contain characters which are not allowed > > in XML, and thus the text may not be serializable (at least not in a way > > that would produce well-formed XML). > > Yes, but it's being written out through a UTF-8 codec Perhaps I wasn't being clear. It doesn't matter what encoding you use. XML places restrictions on what characters can be in the *decoded* (Unicode) version of the document. The encoded version of the document is just an alternative representation of the Unicode one. In Python's notation, each character in the document must be one of: \t (tab) \n (linefeed) \r (carriage return) \u0020-\ud7ff \ue000-\ufffd \U00010000-\U0010ffff You are not allowed to have any other characters in your document, not even by reference (e.g., you can't write &#0; to represent \u0000). So let's say you have 256 bytes of binary data, just byte values 0-255: >>> bytestring = ''.join(map(chr,range(256))) How do you put this into your document? You have to make it be Unicode, so you could try >>> ustring = unicode(bytestring) but that would give you an error because by default it's going to assume bytestring is ascii (actually, what is returned by sys.getdefaultencoding(), I think), whereas you've got bytes higher than \x7f. You could try >>> ustring = unicode(bytestring, 'utf-8') but you will get errors because the bytes aren't valid UTF-8 sequences. They're valid iso-8859-1, though, (iso-8859-1 allows any byte value) so you could do >>> ustring = unicode(bytestring, 'iso-8859-1') and now you've got u'\u0000\u0001\u0002...\u00fe\u00ff'.
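The character ranges listed above can be turned into a small check. This is a minimal sketch in Python 3, assuming the ranges are those of the XML 1.0 Char production; the function names `is_xml_char` and `illegal_xml_chars` are hypothetical and not part of any XML library.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if the single character ch is allowed in XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x09, 0x0A, 0x0D)          # tab, linefeed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def illegal_xml_chars(text: str) -> list:
    """Return (index, character) pairs for each character XML 1.0 forbids."""
    return [(i, ch) for i, ch in enumerate(text) if not is_xml_char(ch)]


# All 256 byte values decoded as ISO-8859-1, mirroring the example above.
ustring = bytes(range(256)).decode("iso-8859-1")
bad = illegal_xml_chars(ustring)
```

Applied to the ISO-8859-1 example, the check flags the C0 control characters other than tab, linefeed and carriage return; running it before building a DOM text node is one way to catch the problem before serialization.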
Note that some of those characters are not allowed in XML. The DOM implementations will accept them, because they don't check for illegal characters. >>> from xml.dom.minidom import parseString >>> doc = parseString('') >>> t = doc.createTextNode(ustring) >>> doc.childNodes[0].appendChild(t) They'll even blindly serialize them for you. >>> xmlstring = doc.toxml('utf-8') >>> xmlustring = doc.toxml() In all 3 cases (doc, xmlstring, xmlustring), illegal characters are in the XML. Want proof? >>> doc2 = parseString(xmlstring) Traceback (most recent call last): File "<stdin>", line 1, in ? File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 1925, in parseString return expatbuilder.parseString(string) File "/usr/local/lib/python2.3/xml/dom/expatbuilder.py", line 940, in parseString return builder.parseString(string) File "/usr/local/lib/python2.3/xml/dom/expatbuilder.py", line 223, in parseString parser.Parse(string, True) xml.parsers.expat.ExpatError: not well-formed (invalid token): line 2, column 6 If you try these examples yourself and go looking at the variables created, take note that Python's representation of Unicode strings uses '\x00'-'\xff' for '\u0000-\u00ff'. It's just a cosmetic thing; if the string is Unicode, everything in it is Unicode characters, not bytes. -Mike From neil.youngman at youngman.org.uk Sat Aug 7 21:11:08 2004 From: neil.youngman at youngman.org.uk (Neil Youngman) Date: Sat Aug 7 21:11:11 2004 Subject: [XML-SIG] Re: XML Unicode and UTF-8 In-Reply-To: References: <200408052027.i75KRT01076110@chilled.skew.org> <200408070748.18432.n.youngman@ntlworld.com> Message-ID: <200408072011.09008.neil.youngman@youngman.org.uk> On Saturday 07 Aug 2004 3:42 pm, Fredrik Lundh wrote: > Neil Youngman wrote: > > Yes, but it's being written out through a UTF-8 codec to a file which > > specifies 'charset="utf-8"'. AIUI the python UTF-8 codec can detect that > > it's got a unicode string and convert it to utf-8 with no futher > > programmer intervention.
> > Python's UTF-8 codec takes a Unicode object, and generates an 8-bit string > object. If you attempt to "encode" an 8-bit string object, it is converted > to a Unicode object first. This conversion only works if the 8-bit string > contains ASCII characters only. > > There's no such thing as an 8-bit Unicode string. I never said there was. The string comes from decode, which I believe returns a Unicode string. AIUI the Python type system preserves that information until it reaches the codec, which therefore treats it correctly. My use of the phrase "the python UTF-8 codec can detect that it's got a unicode string" might have been a poor choice, but I don't think I'm disagreeing with you. Neil Youngman From n.youngman at ntlworld.com Sat Aug 7 21:36:58 2004 From: n.youngman at ntlworld.com (Neil Youngman) Date: Sat Aug 7 21:37:01 2004 Subject: [XML-SIG] XML Unicode and UTF-8 In-Reply-To: <200408071759.i77HxhXG087217@chilled.skew.org> References: <200408071759.i77HxhXG087217@chilled.skew.org> Message-ID: <200408072036.58754.n.youngman@ntlworld.com> On Saturday 07 Aug 2004 6:59 pm, Mike Brown wrote: > Neil Youngman wrote: > > On Thursday 05 Aug 2004 9:27 pm, Mike Brown wrote: > > > The resulting Unicode object may contain characters which are not > > > allowed in XML, and thus the text may not be serializable (at least not > > > in a way that would produce well-formed XML). > > > > Yes, but it's being written out through a UTF-8 codec > > Perhaps I wasn't being clear. It doesn't matter what encoding you use. XML > places restrictions on what characters can be in the *decoded* (Unicode) > version of the document. The encoded version of the document is just an > alternative representation of the Unicode one. 
> > In Python's notation, each character in the document must be one of: > \t (tab) > \n (linefeed) > \r (carriage return) > \u0020-\ud7ff > \ue000-\ufffd > \u10000-\u10ffff > > You are not allowed to have any other characters in your document, not even > by reference (e.g., you can't write &#0; to represent \u0000). > > So let's say you have 256 bytes of binary data, just byte values 0-255: > >>> bytestring = ''.join(map(chr,range(256))) OK. I think we're starting from different assumptions here. The data comes from decoding an RFC1522 header. It is therefore assumed to be text, albeit in a non-ASCII character set. It should not be an arbitrary chunk of binary data. I'm assuming, possibly incorrectly, that the standards are set up in such a way that if it's valid text, it should be possible to insert the UTF-8 equivalent in XML. While I theoretically could get something that's not valid text, encoded in an RFC1522 header, it's only going to cause me real concern if it's a security flaw. If we can't adequately process invalid data, that's not a major concern for me. If you are saying that there may be text in character sets supported in Python (with CJK codecs) that I can't insert as plain UTF-8 into a UTF-8 XML document, that would be a concern. Neil Youngman From martin at v.loewis.de Sun Aug 8 09:54:22 2004 From: martin at v.loewis.de ("Martin v. Löwis") Date: Sun Aug 8 09:54:22 2004 Subject: [XML-SIG] XML Unicode and UTF-8 In-Reply-To: <200408072036.58754.n.youngman@ntlworld.com> References: <200408071759.i77HxhXG087217@chilled.skew.org> <200408072036.58754.n.youngman@ntlworld.com> Message-ID: <4115DC2E.8050004@v.loewis.de> Neil Youngman wrote: > I'm assuming, possibly incorrectly, that the standards are set up in such a > way that if it's valid text, it should be possible to insert the equivalent > the UTF-8 equivalent in XML. That's, strictly speaking, incorrect - the notion of "valid text" is really flawed. Valid text, e.g.
in iso-8859-5, might contain control characters which are not allowed in XML. Regards, Martin From n.youngman at ntlworld.com Sun Aug 8 10:36:57 2004 From: n.youngman at ntlworld.com (Neil Youngman) Date: Sun Aug 8 10:37:00 2004 Subject: [XML-SIG] XML Unicode and UTF-8 In-Reply-To: <4115DC2E.8050004@v.loewis.de> References: <200408071759.i77HxhXG087217@chilled.skew.org> <200408072036.58754.n.youngman@ntlworld.com> <4115DC2E.8050004@v.loewis.de> Message-ID: <200408080936.57431.n.youngman@ntlworld.com> On Sunday 08 Aug 2004 8:54 am, Martin v. Löwis wrote: > Neil Youngman wrote: > > I'm assuming, possibly incorrectly, that the standards are set up in such > > a way that if it's valid text, it should be possible to insert the > > equivalent the UTF-8 equivalent in XML. > > That's, strictly speaking, incorrect - the notion of "valid text" is > really flawed. Valid text, e.g. in iso-8859-5, might contain control > characters which are not allowed in XML. OK. At the moment I'm just prototyping. I can see that it's a messy area and there are some tricky issues I'll have to study before I can produce any real software. Thanks Neil From tpassin at comcast.net Sun Aug 8 18:52:08 2004 From: tpassin at comcast.net (Thomas B. Passin) Date: Sun Aug 8 18:51:00 2004 Subject: [XML-SIG] favicon in XBEL In-Reply-To: <1091474222.3479.220.camel@borgia> References: <200407301527.14592.fdrake@acm.org> <410AC45B.4070504@comcast.net> <1091474222.3479.220.camel@borgia> Message-ID: <41165A38.7060009@comcast.net> Uche Ogbuji wrote: > On Fri, 2004-07-30 at 15:57, Thomas B. Passin wrote: >>Well, maybe that doesn't happen so often anymore (better browsers?), but >>I had to do some hacking on the current xbel code to get it to use >>unicode and stop halting with encoding errors on titles. I haven't had >>time to post my changes yet, but maybe in a couple of weeks ... > > > Well, not halting can be bad if you don't know what the encodings > actually are.
Maybe the utilities would have to take some sort of > default encoding param from the user? But I really hate to make > crutches for such insidious problems. > One of the problems was that I would get a non-ascii error for xbel python code when titles contained certain iso-8859-1 characters. Not surprising, of course, but it had to be dealt with. For maybe the last year, since I hacked my xbel code to include encodings, I have had reliable results using iso-8859-1 for IE and utf-8 for my Mozilla-based browsers. Of course, that would be specific to my personal browser settings. I just wanted to bring out that one has to pay attention to these issues when contemplating merging bookmarks from various sources. Since it was very annoying for me until I got it handled, we want to make sure that any update to the xbel code gets it right. Cheers, Tom P -- Thomas B. Passin Explorer's Guide to the Semantic Web (Manning Books) http://www.manning.com/catalog/view.php?book=passin From AntiVir at yalta.us Mon Aug 9 02:00:26 2004 From: AntiVir at yalta.us (AntiVir@yalta.us) Date: Sun Aug 8 22:59:44 2004 Subject: [XML-SIG] AntiVir ALERT [mail from: "Returned mail" ] Message-ID: <200408090000.i7900QiS009679@yalta.us> * * * * * * * * * * * * * * * AntiVir ALERT * * * * * * * * * * * * * * * The antivirus detected a virus in correspondence that passed through the server! Sender: "Returned mail" Virus name: Worm/Mydoom.l The mail was not delivered to the recipient. Regards, YaltaInfo company tel.: +38(0654)271828 fax: +38(0654)231094 web: www.yaltainfo.com email: support@yalta.us Mail-Info: --8<-- From: "Returned mail" To: xml-sig@python.org Date: Sun, 8 Aug 2004 23:59:15 +0300 Subject: Returned mail: Data format error --8<-- This version of AntiVir is licensed for private and non-commercial use. -- AntiVir for UNIX Copyright (C) 1994-2003 by H+BEDV Datentechnik GmbH. All rights reserved.
For more information see http://www.antivir.de/ or http://www.hbedv.com/ From noreply at sourceforge.net Mon Aug 9 00:07:19 2004 From: noreply at sourceforge.net (SourceForge.net) Date: Mon Aug 9 00:07:22 2004 Subject: [XML-SIG] [ pyxml-Patches-1005669 ] prepare_input_source for bugs 616431, 788931 Message-ID: Patches item #1005669, was opened at 2004-08-08 22:07 Message generated for change (Tracker Item Submitted) made by Item Submitter You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=306473&aid=1005669&group_id=6473 Category: SAX Group: None Status: Open Resolution: None Priority: 5 Submitted By: Andrew Clover (bobince) Assigned to: Nobody/Anonymous (nobody) Summary: prepare_input_source for bugs 616431, 788931 Initial Comment: First version of replacement prepare_input_source function as described in bug 616431. Seems to work with existing code I've tried whilst solving this problem, but wider testing appreciated. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=306473&aid=1005669&group_id=6473 From paul.boddie at ementor.no Mon Aug 9 12:07:28 2004 From: paul.boddie at ementor.no (Paul Boddie) Date: Mon Aug 9 12:07:32 2004 Subject: [XML-SIG] XML Unicode and UTF-8 Message-ID: Neil Youngman [mailto:n.youngman@ntlworld.com] wrote: > > OK. I think we're starting from different assumptions here. The data > comes from decoding an RFC1522 header. It is therefore assumed to be > text, albeit in a non-ASCII character set. It should not be an > arbitrary chunk of binary data. That's why I was slightly puzzled by the remark about invalid Unicode values. But then I wasn't following the discussion that closely. > I'm assuming, possibly incorrectly, that the standards are set up in > such a way that if it's valid text, it should be possible to insert > the equivalent the UTF-8 equivalent in XML. 
I think it's best to think of the problem with the following terminology: * The original text is a normal Python string with a known encoding. We refer to that as a byte string. * You want to convert that string to a Unicode object and insert it into a DOM representation of an XML document. We refer to this as Unicode in the DOM. * You want to serialise the document using a UTF-8 encoding. We can refer to the content as UTF-8 in XML. As has been mentioned already, you might well be able to put UTF-8 encoded byte strings into the DOM, but then you'll experience problems with serialisation. If you put Unicode objects into the DOM, serialisation should proceed successfully. And as far as opening a file and serialising to it is concerned, I've had most success with the following sequence of operations: * Open a file using Python's "open" built-in function - this exposes an output stream which should be considered as accepting byte values (as opposed to streams exposed by "codecs.open" which accept Unicode values). * Serialise to the stream using the various XML toolkit functions or methods. These functions or methods are able to produce an encoding declaration in the serialised document consistent with the actual encoding employed. They will also convert the Unicode values to the appropriate byte sequences for the output stream. * Close the file. ;-) There may be a better way of doing this, but that's the most sane way I've discovered so far. Paul From tom.dalglish at verizon.net Mon Aug 9 16:15:31 2004 From: tom.dalglish at verizon.net (tom.dalglish@verizon.net) Date: Mon Aug 9 16:16:17 2004 Subject: [XML-SIG] Installing PyXML in PYTHONPATH instead of site-packages... Message-ID: <20040809141531.XKKH22270.out012.verizon.net@outgoing.verizon.net> Hi, We have a strong lock-down policy on Windows directories and I am not permitted to install in the traditional $PYTHON/Lib/site-packages. 
The Installshield app does not allow you to override the setting, which it reads from the Registry (ack!). How can I install it in a directory that I own? Thanks, From matt.price at utoronto.ca Mon Aug 9 19:45:43 2004 From: matt.price at utoronto.ca (Matt Price) Date: Mon Aug 9 19:45:45 2004 Subject: [XML-SIG] unicode and xml/xsl Message-ID: <20040809174543.GA9033@utoronto.ca> (cross-posted to python-list) Hello, I'm a python (& xml, & unicode!) newbie working on an interface to a bibliographic reference server (refdb); I'm running into some encoding problems & am finding the plethora of tools a little confusing. Here is the basic situation: I connect to the server and receive an xml document whose content is a bibliographic dataset. The document can be encoded in two ways: ISO-8859-1 or unicode. My program simply takes the document and passes it to an xsl stylesheet using libxslt & libxml2. Here's the relevant code:

# this is how I get the results & generate either a string or a
# unicode string
def getref (self, query = ':ID:>0', cmd = 'getref ', reftype = default_reftype):
    cmd += ' ' + query
    self.send(cmd + self.CS_TERM)
    results = self.tread()
    if self.encoding == 'UNICODE':
        print ' decoding unicode string: '
        results = results.decode('utf-8', 'replace')
    return results

# this is where I generate the html:
def risx_to_html (self, risxSet, xsl = xsl_ss, css=css_url, use_css = 1):
    styledoc = libxml2.parseFile(xsl)
    style = libxslt.parseStylesheetDoc(styledoc)
    doc = libxml2.parseDoc(risxSet)
    result = style.applyStylesheet(doc, None)
    # style.saveResultToFilename("results.html", result, 0)
    htmlString = style.saveResultToString(result)
    style.freeStylesheet()
    doc.freeDoc()
    result.freeDoc()
    return htmlString

The server's default encoding is iso-8859-1, and since I mostly use English-language references, this usually works fine; but occasionally the server will pass me an entity like '&mu;' (for Greek letter mu).
This generates an error like this: Entity: line 57: parser error : Entity 'mu' not defined This is not so bad, because the parsing continues nonetheless. With unicode it's worse. In this case there are several errors depending on how I set the system up: with iso-8859-1 set as default encoding in sitecustomize.py: File "/home/matt/bibpython/refdbclient.py", line 268, in risx_to_html doc = libxml2.parseDoc(risxSet) File "/usr/lib/python2.3/site-packages/libxml2.py", line 1149, in parseDoc ret = libxml2mod.xmlParseDoc(cur) UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256) with utf-8 set as default encoding: File "/home/matt/bibpython/refdbclient.py", line 268, in risx_to_html doc = libxml2.parseDoc(risxSet) File "/usr/lib/python2.3/site-packages/libxml2.py", line 1149, in parseDoc ret = libxml2mod.xmlParseDoc(cur) TypeError: xmlParseDoc() argument 1 must be string without null bytes or None, not unicode So I guess I have two questions: (1) am I using the right python tools for this job? My excellent python books unfortunately don't cover either unicode or xml in much depth, so I'm a little uncertain as to whether I'm doing the right thing. (2) Is there a way to make libxml2 parse unicode documents? Do I need to pass it more information alerting it that it's getting unicode? Anyway, thanks very much for your help. Much appreciated, Matt ------------------------------------------- Matt Price matt.price@utoronto.ca History Department, University of Toronto (416) 978-2094 -------------------------------------------- From uche.ogbuji at fourthought.com Mon Aug 9 20:42:37 2004 From: uche.ogbuji at fourthought.com (Uche Ogbuji) Date: Mon Aug 9 20:42:52 2004 Subject: [XML-SIG] favicon in XBEL In-Reply-To: <200407301527.14592.fdrake@acm.org> References: <200407301527.14592.fdrake@acm.org> Message-ID: <1092076957.810.7.camel@borgia> On Fri, 2004-07-30 at 13:27, Fred L. Drake, Jr.
wrote: > On Friday 30 July 2004 09:15 am, Ahmad Gharbeia wrote: > > Storing and handling book marks in a cross platform/browser format has > > been a long time interest for me. Only when I started thinking of > > undertaking the task myself in XML that I found your work, which I greatly > > admire. > > Thanks! > > > Allow me to bring one suggestion to your attention: > > Why not add the ability to store an encoded 'favicon', or a URI to it in a > > element? > > This has been discussed before, and is of interest to the Konqueror crew as > well. I'll have to dig back in my archives to see what was said. To me, this should be something users handle through extensibility. I don't think favicon is important enough for the XBEL core. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Decomposition, Process, Recomposition - http://www.xml.com/pub/a/2004/07/28/py-xml.html Perspective on XML: Steady steps spell success with Google - http://www.adtmag.com/article.asp?id=9663 Managing XML libraries - http://www.adtmag.com/article.asp?id=9160 Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090 Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/ From uche.ogbuji at fourthought.com Mon Aug 9 20:44:49 2004 From: uche.ogbuji at fourthought.com (Uche Ogbuji) Date: Mon Aug 9 20:45:17 2004 Subject: [XML-SIG] Re: value error when parsing XML In-Reply-To: <410FD5BB.1080306@doxdesk.com> References: <410B7277.3000609@mail.usyd.edu.au> <40EE32F9.1080809@doxdesk.com> <410F6E98.4080803@mail.usyd.edu.au> <410FD5BB.1080306@doxdesk.com> Message-ID: <1092077088.810.10.camel@borgia> On Tue, 2004-08-03 at 12:13, Andrew Clover wrote: > Ajay Brar wrote: > > > i am using a SAX parser. 
> > I don't do a lot of SAX, but it looks to me like there's a bug in the > xml.sax.saxutils InputSource which is likely to be the cause of your > trouble. (Details to follow.) > > > i think its something to do with the way i call the parser > > parser.parse("../um_xml/um_ajay.xml") > > Yes. I would suggest passing in a URI instead: Precisely. People too often mix up file names with URIs, and it causes no end of trouble. > filename= '../um_xml/um__ajay.xml' > uri= 'file:'+urllib.pathname2url(os.path.abspath(filename)) > parser.parse(uri) I think filename should be absolutized before it gets to your "uri=" line. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Decomposition, Process, Recomposition - http://www.xml.com/pub/a/2004/07/28/py-xml.html Perspective on XML: Steady steps spell success with Google - http://www.adtmag.com/article.asp?id=9663 Managing XML libraries - http://www.adtmag.com/article.asp?id=9160 Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090 Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/ From martin at v.loewis.de Mon Aug 9 22:56:34 2004 From: martin at v.loewis.de ("Martin v. Löwis") Date: Mon Aug 9 22:56:32 2004 Subject: [XML-SIG] Installing PyXML in PYTHONPATH instead of site-packages... In-Reply-To: <20040809141531.XKKH22270.out012.verizon.net@outgoing.verizon.net> References: <20040809141531.XKKH22270.out012.verizon.net@outgoing.verizon.net> Message-ID: <4117E502.1040305@v.loewis.de> tom.dalglish@verizon.net wrote: > The Installshield app does not allow you to override the setting, > which is reads from the Registry (ack!). How can I install it in a directory > that I own? It's not Installshield, but bdist_wininst. To install elsewhere, run "python setup.py install" on the source distribution.
Regards, Martin From tpassin at comcast.net Mon Aug 9 23:30:05 2004 From: tpassin at comcast.net (Thomas B. Passin) Date: Mon Aug 9 23:28:53 2004 Subject: [XML-SIG] Installing PyXML in PYTHONPATH instead of site-packages... In-Reply-To: <4117E502.1040305@v.loewis.de> References: <20040809141531.XKKH22270.out012.verizon.net@outgoing.verizon.net> <4117E502.1040305@v.loewis.de> Message-ID: <4117ECDD.7020402@comcast.net> Martin v. Löwis wrote: > tom.dalglish@verizon.net wrote: > >> The Installshield app does not allow you to override the setting, >> which is reads from the Registry (ack!). How can I install it in a >> directory that I own? > > > It's not Installshield, but bdist_wininst. > > To install elsewhere, run "python setup.py install" on the source > distribution. Except for Windows users ... I have actually temporarily changed the address in the registry to persuade pyxml to install in the distribution I want (e.g., Python2.3, Zope 2.7, Plone, etc.). Just export the original settings to a file, and you can restore them afterwards. I wish that the Python installer would provide for multiple installations of the same version on Windows, but it doesn't. Cheers, Tom P -- Thomas B. Passin Explorer's Guide to the Semantic Web (Manning Books) http://www.manning.com/catalog/view.php?book=passin From fdrake at acm.org Mon Aug 9 23:53:53 2004 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Mon Aug 9 23:54:04 2004 Subject: [XML-SIG] Installing PyXML in PYTHONPATH instead of site-packages... In-Reply-To: <4117ECDD.7020402@comcast.net> References: <20040809141531.XKKH22270.out012.verizon.net@outgoing.verizon.net> <4117E502.1040305@v.loewis.de> <4117ECDD.7020402@comcast.net> Message-ID: <200408091753.53150.fdrake@acm.org> On Monday 09 August 2004 05:30 pm, Thomas B. Passin wrote: > I wish that the Python installer would provide for multiple > installations of the same version on Windows, but it doesn't.
This gets a little better in Python 2.4, which supports --home for all platforms. -Fred -- Fred L. Drake, Jr. From uche.ogbuji at fourthought.com Tue Aug 10 02:30:06 2004 From: uche.ogbuji at fourthought.com (Uche Ogbuji) Date: Tue Aug 10 02:30:22 2004 Subject: [XML-SIG] get the abolute path for a node In-Reply-To: <1091719778.41125262b3d17@webmail.codeweld.com> References: <410FC155.2000802@magic.fr> <411210CF.5090300@v.loewis.de> <1091710269.41122d3db3cec@webmail.codeweld.com> <41124298.6090705@magic.fr> <1091719778.41125262b3d17@webmail.codeweld.com> Message-ID: <1092097806.810.116.camel@borgia> On Thu, 2004-08-05 at 09:29, xmlsig@codeweld.com wrote: > The line that ranslates '#text' to 'text()' has the advantage that it translates > the path to a valid xpath the other line that eliminates [1] still preserves > this valid xpath, and I thought it's nicer to look at :). > I found the source and the cure of the problem. The source is ( as you can > easely verify with http://www.codeweld.com/files/dom_view.pyw, just use > 'file://yourfile.xml' ) Niiiiiice. I'll have to highlight this code in one of my columns, if that's OK with you. Of course I think import xml.dom.ext.reader.Sax2 as Sax2 is probably a bad idea, though I'm not sure what the best alternatives are to import xml.dom.ext.reader.HtmlLib as HtmlLib Do you have any discussion or docs on this code? > that the Sax2 reader for some reason puts a second node > with the same nodeName in. The cure is to take for comparision the localName, as > this name seems to be different for those. Additionaly he's also different for > some other nodes which might otherwise in border situations made trouble. This > is the new function. 
( I also gave one variable a more reasonable name, was
> confusing otherwise )
>
> def abs_path( node ):
>     successors = 1
>     previous = node.previousSibling
>     while previous:
>         if previous.localName == node.localName: successors += 1
>         previous = previous.previousSibling
>     path = '/%s[%s]' % (node.nodeName, successors)
>     if node.parentNode.nodeName != '#document':
>         return abs_path( node.parentNode )+path
>     return path

Cool. I took this as a starting point to add such a function to my domtools.py http://cvs.4suite.org/cgi-bin/viewcvs.cgi/Anobind/domtools.py For convenience, here's my version:

from xml.dom import Node

#The abs_path is based on code developed by "Florian" on XML-SIG
#http://mail.python.org/pipermail/xml-sig/2004-August/010423.html
def abs_path( node ):
    """
    Return an XPath expression that provides a unique path to
    the given node (only supports elements, attributes and root nodes)
    within a document
    """
    #is_domlette = hasattr(node, 'rootNode')
    if node.nodeType == Node.ELEMENT_NODE:
        successors = 1
        #Determine how many previous siblings there are with the same node name
        previous = node.previousSibling
        while previous:
            if previous.localName == node.localName: successors += 1
            previous = previous.previousSibling
        step = u'%s[%i]' % (node.nodeName, successors)
        ancestor = node.parentNode
    elif node.nodeType == Node.ATTRIBUTE_NODE:
        step = u'@%s' % (node.nodeName)
        ancestor = node.ownerElement
    elif not node.parentNode:
        step = u''
        ancestor = node
    else:
        raise TypeError('Unsupported node type for abs_path')
    if ancestor.parentNode:
        return abs_path(ancestor) + u'/' + step
    else:
        return u'/' + step

-- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Decomposition, Process, Recomposition - http://www.xml.com/pub/a/2004/07/28/py-xml.html Perspective on XML: Steady steps spell success with Google - http://www.adtmag.com/article.asp?id=9663 Managing XML libraries - http://www.adtmag.com/article.asp?id=9160 Commentary on "Objects.
Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090 Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/ From uche.ogbuji at fourthought.com Tue Aug 10 02:48:31 2004 From: uche.ogbuji at fourthought.com (Uche Ogbuji) Date: Tue Aug 10 02:48:41 2004 Subject: [XML-SIG] saxutils bug (was: value error when parsing XML) In-Reply-To: <410FDF31.1070809@doxdesk.com> References: <410FDF31.1070809@doxdesk.com> Message-ID: <1092098911.810.120.camel@borgia> On Tue, 2004-08-03 at 12:53, Andrew Clover wrote: > I would prefer to keep all InputSource systemIds as URIs; even when a > filename was originally passed in it should be converted to a URI. > Otherwise we cannot reliably deal with relative systemIds. +1. This is the hard line we took in 4Suite, and I think it really makes everything much more sane. > However as I have not played much with SAX I'm hesitant to drop patches > to sourceforge just yet. I think it's a good idea and worth an attempted patch, if you have the cycles to work on one. We can work out any kinks here. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Decomposition, Process, Recomposition - http://www.xml.com/pub/a/2004/07/28/py-xml.html Perspective on XML: Steady steps spell success with Google - http://www.adtmag.com/article.asp?id=9663 Managing XML libraries - http://www.adtmag.com/article.asp?id=9160 Commentary on "Objects. Encapsulation. XML?" 
- http://www.adtmag.com/article.asp?id=9090 Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/ From uche.ogbuji at fourthought.com Tue Aug 10 03:02:25 2004 From: uche.ogbuji at fourthought.com (Uche Ogbuji) Date: Tue Aug 10 03:02:29 2004 Subject: [XML-SIG] XML Unicode and UTF-8 In-Reply-To: <20040805110455.ZZZP7107.mta01-svc.ntlworld.com@[10.137.100.68]> References: <20040805110455.ZZZP7107.mta01-svc.ntlworld.com@[10.137.100.68]> Message-ID: <1092099745.810.128.camel@borgia> On Thu, 2004-08-05 at 05:03, n.youngman@ntlworld.com wrote: > > > > From: "Martin v. Löwis" > > Date: 2004/08/05 Thu AM 10:41:59 GMT > > To: n.youngman@ntlworld.com > > CC: xml-sig@python.org > > Subject: Re: [XML-SIG] XML Unicode and UTF-8 > > > > > State all the information that you have, preferably in the form: > > 1. this is what I did > > 2. this is what happened > > 3. this is what I expected to happen instead. > > Well, I was trying to state the problem and not impose my own preconceptions of how it should be done, but if you want to go straight into debugging that's fine with me. The information in your first message was essentially useless for anyone trying to understand your problem. I couldn't make heads or tails of it either. Martin told you exactly what data we need in order to help you. Please take note and heed his advice when you post for help here (and probably any other forum). > First Pass: > > segment_tag.appendChild( charset_tag ) > unicode_tag = doc.createElement( 'unicode' ) You should use Unicode objects in DOM update operations (u'unicode'). > unicode_tag.appendChild( doc.createTextNode( segment[0] ) ) > segment_tag.appendChild( unicode_tag ) > > Inserts binary data into the segment/unicode tag Binary data?!?
> Saving with
>
> XMLFILE = open( filename, "w" )
> xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
> XMLFILE.close()
>
> Leaves binary data in the document. I have assumed that this was raw
> Unicode, maybe that's a flawed assumption?

You still haven't provided enough information. What is this "binary
data"? What exactly are the values of the variables in the above code
snippets?

--
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Decomposition, Process, Recomposition - http://www.xml.com/pub/a/2004/07/28/py-xml.html
Perspective on XML: Steady steps spell success with Google - http://www.adtmag.com/article.asp?id=9663
Managing XML libraries - http://www.adtmag.com/article.asp?id=9160
Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090
Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/

From uche.ogbuji at fourthought.com  Tue Aug 10 03:11:11 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Tue Aug 10 03:11:26 2004
Subject: [XML-SIG] XML Unicode and UTF-8
In-Reply-To: <20040805122422.FVRU7107.mta01-svc.ntlworld.com@[10.137.100.68]>
References: <20040805122422.FVRU7107.mta01-svc.ntlworld.com@[10.137.100.68]>
Message-ID: <1092100271.810.135.camel@borgia>

It looks as if I should have read the whole thread before posting.
Martin's been a great help, but I still have a couple of observations.

On Thu, 2004-08-05 at 06:22, n.youngman@ntlworld.com wrote:
> OK. I read the opaque documentation^W^W fine manual for a while, then
> googled for a while, and finally decided to just hack about with what
> I had.

I personally think the Python/Unicode docs are pretty good, but Unicode
is *hard*. No getting around that.
> I now have
>
> charset_tag.appendChild( doc.createTextNode( segment[1] ) )
> unicode = segment[0].decode( segment[1] ).encode( "utf-8")
> unicode_tag = doc.createElement( 'unicode' )
> unicode_tag.appendChild( doc.createTextNode( unicode ) )

I wouldn't use "unicode" as a variable name if I were you, since it's a
built-in in Python 2.2 and up. I suggest

    unicode_tag = doc.createElement( u'unicode' )

rather than

    unicode_tag = doc.createElement( 'unicode' )

Remember that XML element and attribute names are also (a subset of)
Unicode, even though they're a smaller subset than that of character
data.

--
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Decomposition, Process, Recomposition - http://www.xml.com/pub/a/2004/07/28/py-xml.html
Perspective on XML: Steady steps spell success with Google - http://www.adtmag.com/article.asp?id=9663
Managing XML libraries - http://www.adtmag.com/article.asp?id=9160
Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090
Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/

From markus.jostock at softwareag.com  Tue Aug 10 11:59:16 2004
From: markus.jostock at softwareag.com (Markus Jostock)
Date: Tue Aug 10 11:58:18 2004
Subject: [XML-SIG] DOM seems incomplete
Message-ID: <41189C74.8050902@softwareag.com>

Hi

I am parsing a string into a DOM. That works without problems. But when
I want to access children of the first element, there seem to be none.
But pretty printing shows them.

Maybe you have an idea what might be going wrong?

Thanks in advance for some clues.

Kind regards
Markus


The string I parse:
string = ''

Parsing works without errors:
from xml.dom.ext.reader import Sax2
reader = Sax2.Reader()
doc = reader.fromString(string)

When I pretty print it, it looks ok:
from xml.dom.ext import PrettyPrint
PrettyPrint(doc)
prints:

Accessing doc.firstChild is ok:
print doc.firstChild.nodeName prints MYXML

But if I want to access further children of , there are none:
print doc.firstChild.nodeList prints or
print doc.firstChild.firstChild prints None

Where are my children gone?

From aconrad.tlv at magic.fr  Tue Aug 10 13:45:51 2004
From: aconrad.tlv at magic.fr (Alexandre CONRAD)
Date: Tue Aug 10 13:45:54 2004
Subject: [Fwd: Re: [XML-SIG] DOM seems incomplete]
Message-ID: <4118B56F.30505@magic.fr>

Forgot to send to the list...

-------- Original Message --------
Subject: Re: [XML-SIG] DOM seems incomplete
Date: Tue, 10 Aug 2004 12:40:22 +0200
From: Alexandre CONRAD
To: Markus Jostock
References: <41189C74.8050902@softwareag.com>

Markus Jostock wrote:

> Hi
>
> I am parsing a string into a DOM. That works without problems. But when
> I want to access children of the first element, there seem to be none.
> But pretty printing shows them.
>
> Maybe you have an idea what might be going wrong?
>
> Thanks in advance for some clues.
>
> Kind regards
> Markus
>
>
> The string I parse:
> string = ' STATUS="PRV"> /> />'
>
> Parsing works without errors:
> from xml.dom.ext.reader import Sax2
> reader = Sax2.Reader()
> doc = reader.fromString(string)
>
> When I pretty print it, it looks ok:
> from xml.dom.ext import PrettyPrint
> PrettyPrint(doc)
> prints:
>
> Accessing doc.firstChild is ok:
> print doc.firstChild.nodeName prints MYXML
>
> But if I want to access further children of , there are none:
> print doc.firstChild.nodeList prints or
> print doc.firstChild.firstChild prints None
>
> Where are my children gone?

Because you are PrettyPrint'ing, it parses newlines and whitespaces
(indentation) as text nodes. Try

'print doc.firstChild.firstChild.firstChild'. You should find your node
(I think, maybe you'll have to add 1 more firstChild).

In my case, I want to keep the xml file PrettyPrint'ed. So what I do is
that I parse the PrettyPrint'ed file and strip out new lines and
whitespaces before I do anything to it :

def openDoc(self, xml_file):
    # Create Reader object
    reader = Sax2.Reader()
    # Parse the document
    doc = reader.fromStream(xml_file)
    # Strip out white spaces from doc
    xml.dom.ext.StripXml(doc)
    return doc

Now, I can play around with my 'doc' without worrying about whitespaces.
When I write it back on disk, I pretty print it again :

def write_xml(self, doc, xml_file):
    # Open XML file in write mode
    f = open(xml_file, "w")
    # Write doc pretty printed to file; PrettyPrint writes to the
    # stream it is given and returns None, so pass the file object
    xml.dom.ext.PrettyPrint(doc, f)
    # Close file
    f.close()

Regards,
--
Alexandre CONRAD - TLV
Research & Development
tel : +33 1 30 80 55 05
fax : +33 1 30 56 55 06
6, rue de la plaine
78860 - SAINT NOM LA BRETECHE
FRANCE

From markus.jostock at softwareag.com  Tue Aug 10 14:17:55 2004
From: markus.jostock at softwareag.com (Markus Jostock)
Date: Tue Aug 10 14:16:58 2004
Subject: [XML-SIG] DOM seems incomplete
In-Reply-To: <4118B56F.30505@magic.fr>
References: <4118B56F.30505@magic.fr>
Message-ID: <4118BCF3.9030602@softwareag.com>

Hi

Thanks for the hint, but stripping whitespaces does not seem to help:
Trying to access a child node results in an exception since the child
does not exist (i.e. it is of type 'None').

print doc.firstChild.firstChild.nodeName

causes an exception:

Traceback (most recent call last):
  File "TUsecaseCreateEmptyDoc.py", line 54, in test01
    print structure.firstChild.firstChild.nodeName
AttributeError: 'NoneType' object has no attribute 'nodeName'

Kind regards
Markus

Alexandre CONRAD wrote:

> Markus Jostock wrote:
>
>> Hi
>>
>> I am parsing a string into a DOM. That works without problems. But
>> when I want to access children of the first element, there seem to be
>> none. But pretty printing shows them.
>>
>> Maybe you have an idea what might be going wrong?
>>
>> Thanks in advance for some clues.
>>
>> Kind regards
>> Markus
>>
>>
>> The string I parse:
>> string = '> STATUS="PRV">> />> />'
>>
>> Parsing works without errors:
>> from xml.dom.ext.reader import Sax2
>> reader = Sax2.Reader()
>> doc = reader.fromString(string)
>>
>> When I pretty print it, it looks ok:
>> from xml.dom.ext import PrettyPrint
>> PrettyPrint(doc)
>> prints:
>>
>> Accessing doc.firstChild is ok:
>> print doc.firstChild.nodeName prints MYXML
>>
>> But if I want to access further children of , there are none:
>> print doc.firstChild.nodeList prints or
>> print doc.firstChild.firstChild prints None
>>
>> Where are my children gone?
>
>
> Because you are PrettyPrint'ing, it parses newlines and whitespaces
> (indentation) as text nodes. Try
>
> 'print doc.firstChild.firstChild.firstChild'. You should find your node
> (I think, maybe you'll have to add 1 more firstChild).
>
> In my case, I want to keep the xml file PrettyPrint'ed. So what I do is
> that I parse the PrettyPrint'ed file and strip out new lines and
> whitespaces before I do anything to it :
>
> def openDoc(self, xml_file):
>     # Create Reader object
>     reader = Sax2.Reader()
>     # Parse the document
>     doc = reader.fromStream(xml_file)
>     # Strip out white spaces from doc
>     xml.dom.ext.StripXml(doc)
>     return doc
>
> Now, I can play around with my 'doc' without worrying about whitespaces.
> When I write it back on disk, I pretty print it again :
>
> def write_xml(self, doc, xml_file):
>     # Open XML file in write mode
>     f = open(xml_file, "w")
>     # Write doc pretty printed to file; PrettyPrint writes to the
>     # stream it is given and returns None, so pass the file object
>     xml.dom.ext.PrettyPrint(doc, f)
>     # Close file
>     f.close()
>
> Regards,

From and-xml at doxdesk.com  Tue Aug 10 14:23:43 2004
From: and-xml at doxdesk.com (Andrew Clover)
Date: Tue Aug 10 14:23:08 2004
Subject: [XML-SIG] DOM seems incomplete
In-Reply-To: <41189C74.8050902@softwareag.com>
References: <41189C74.8050902@softwareag.com>
Message-ID: <4118BE4F.5020504@doxdesk.com>

Markus Jostock wrote:

> Accessing doc.firstChild is ok:
> print doc.firstChild.nodeName prints MYXML

doc.firstChild is not what you might expect:

print doc.firstChild

A DocumentType node happens to have the same nodeName as the root
element, because when you say <!DOCTYPE blah>, 'blah' must match the
root element.

(It's a minor wart that the 4DOM parsers always create a DocumentType
node even when no <!DOCTYPE> was declared in the source.)

> Where are my children gone?

In doc.documentElement.childNodes.

--
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/

From and at doxdesk.com  Tue Aug 10 14:26:24 2004
From: and at doxdesk.com (Andrew Clover)
Date: Tue Aug 10 14:25:49 2004
Subject: [XML-SIG] saxutils bug (was: value error when parsing XML)
In-Reply-To: <1092098911.810.120.camel@borgia>
References: <410FDF31.1070809@doxdesk.com> <1092098911.810.120.camel@borgia>
Message-ID: <4118BEF0.2040006@doxdesk.com>

Uche Ogbuji wrote:

> I think it's a good idea and worth an attempted patch, if you have the
> cycles to work on one.

Okay. SF Patch 1005669 is a first bash, works for me.
cheers,

--
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/

From markus.jostock at softwareag.com  Tue Aug 10 14:48:53 2004
From: markus.jostock at softwareag.com (Markus Jostock)
Date: Tue Aug 10 14:47:55 2004
Subject: [XML-SIG] DOM seems incomplete
In-Reply-To: <4118BE4F.5020504@doxdesk.com>
References: <41189C74.8050902@softwareag.com> <4118BE4F.5020504@doxdesk.com>
Message-ID: <4118C435.7030400@softwareag.com>

Andrew Clover wrote:

> doc.firstChild is not what you might expect:
>
> print doc.firstChild
>

Now that's interesting! And exactly what I see too.

>> Where are my children gone?
>
>
> In doc.documentElement.childNodes.

You are right! I found them exactly there :-D
I would never have found this myself.

Thanks a lot!

Markus

From mike at skew.org  Tue Aug 10 18:44:03 2004
From: mike at skew.org (Mike Brown)
Date: Tue Aug 10 18:44:03 2004
Subject: [XML-SIG] DOM seems incomplete
In-Reply-To: <4118BE4F.5020504@doxdesk.com> "from Andrew Clover at Aug 10, 2004 09:23:43 pm"
Message-ID: <200408101644.i7AGi38f003913@chilled.skew.org>

Andrew Clover wrote:
> A DocumentType node happens to have the same nodeName as the root
> element, because when you say <!DOCTYPE blah>, 'blah' must match the
> root element.

That's not always true; the name in the DOCTYPE only has to match the
name of the root element if you are validating. (It's a Validity
Constraint, not a matter of well-formedness.)
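
[Editor's sketch for the "DOM seems incomplete" thread above.] The answer given in that thread is that, when a document has a DOCTYPE, the Document's first child is the DocumentType node rather than the root element, so element children are reached through documentElement. The thread used PyXML's 4DOM; the sketch below assumes the standard library's xml.dom.minidom on a current Python instead, and the sample document (MYXML with children A and B) is hypothetical, since the original string did not survive in the archive.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Hypothetical sample document. The DOCTYPE is written out explicitly
# because minidom, unlike the 4DOM parsers discussed in the thread, only
# creates a DocumentType node when the source declares one.
SOURCE = '<!DOCTYPE MYXML><MYXML STATUS="PRV"><A/><B/></MYXML>'

doc = parseString(SOURCE)

first = doc.firstChild        # the DocumentType node, not the root element
root = doc.documentElement    # the root element itself

# Collect only element children; text nodes (if any) are skipped.
child_names = [n.nodeName for n in root.childNodes
               if n.nodeType == Node.ELEMENT_NODE]

print(first.nodeType == Node.DOCUMENT_TYPE_NODE)
print(first.nodeName, root.nodeName)
print(child_names)
```

Going through documentElement avoids depending on whether a DocumentType node happens to come first, which differs between DOM implementations.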
From darabi at m-creations.com  Thu Aug 12 11:42:02 2004
From: darabi at m-creations.com (Kambiz Darabi)
Date: Thu Aug 12 11:42:06 2004
Subject: [XML-SIG] Update link on web page
Message-ID:

Hello,

on

http://pyxml.sourceforge.net/topics/docs.html

the link "Writing an application for a SAX-compliant XML parser" points to

http://www.hobby.nl/~scaprea/XML/

which redirects to

http://www.leverkruid.nl/XML/index.html

and from this page, there is a link to the target article.

Maybe you would like to update the link.

... or maybe not

Greetings

Kambiz

From mehdi.hashemian at spirentcom.com  Thu Aug 12 18:34:30 2004
From: mehdi.hashemian at spirentcom.com (Hashemian, Mehdi)
Date: Thu Aug 12 18:34:38 2004
Subject: [XML-SIG] Missing encoding attribute
Message-ID: <629E717C12A8694A88FAA6BEF9FFCD44034BD296@brigadoon.spirentcom.com>

My problem: when creating a new XML document, my output document is
missing "encoding" attribute:

instead of

Linux RedHat 9.0
Python 2.2.2

import xml.dom.minidom
impl = xml.dom.minidom.getDOMImplementation()
newDoc = impl.createDocument(None, u'mytag', None)

Questions:
Is there support for encoding argument for toxml and toprettyxml in minidom?
(Does not look like it is supported in 2.2.2)

Is there any other way (other than creating a wrapper around these
functions) to solve this problem?

Thanks!
Mehdi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/xml-sig/attachments/20040812/d3bbab26/attachment.html

From and-xml at doxdesk.com  Fri Aug 13 11:02:21 2004
From: and-xml at doxdesk.com (Andrew Clover)
Date: Fri Aug 13 11:01:44 2004
Subject: [XML-SIG] Missing encoding attribute
In-Reply-To: <629E717C12A8694A88FAA6BEF9FFCD44034BD296@brigadoon.spirentcom.com>
References: <629E717C12A8694A88FAA6BEF9FFCD44034BD296@brigadoon.spirentcom.com>
Message-ID: <411C839D.3000803@doxdesk.com>

Mehdi Hashemian wrote:

> Is there support for encoding argument for toxml and toprettyxml in minidom?
> (Does not look like it is supported in 2.2.2)

It is in 2.3 onwards, and reasonably recent PyXML versions.
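
[Editor's sketch for the "Missing encoding attribute" thread.] The following illustrates the encoding argument to toxml() mentioned in the reply. It assumes the standard library's xml.dom.minidom on a current Python 3 rather than the Python 2.2/2.3 versions under discussion, so details such as the return type (bytes when an encoding is given) reflect that assumption; the element name mytag comes from the question.

```python
import xml.dom.minidom

impl = xml.dom.minidom.getDOMImplementation()
new_doc = impl.createDocument(None, "mytag", None)

# Without an encoding argument, toxml() returns text and the XML
# declaration carries no encoding attribute.
as_text = new_doc.toxml()

# With an encoding argument, toxml() returns bytes in that encoding and
# the XML declaration names it.
as_bytes = new_doc.toxml(encoding="utf-8")

print(as_text)
print(as_bytes)
```

toprettyxml() accepts the same encoding argument, so no wrapper around these functions should be needed on versions that support it.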
Earlier versions don't do character encoding, you always get Unicode
strings out.

(Note, you still can't encode to a character set which doesn't include
all characters used in content; minidom will currently produce an error
rather than trying to escape unencodable characters with character
references.)

--
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/

From matt.price at utoronto.ca  Sat Aug 14 03:49:20 2004
From: matt.price at utoronto.ca (Matt Price)
Date: Sat Aug 14 03:49:26 2004
Subject: [XML-SIG] xslt/parameters
Message-ID: <20040814014920.GA10691@utoronto.ca>

Can someone out there tell me how I pass a parameter value to an xsl
stylesheet in python? Right now I have the following couple lines of
code, more or less stolen from somewhere since I'm still pretty much at
sea with xml:

    styledoc = libxml2.parseFile(xsl)
    style = libxslt.parseStylesheetDoc(styledoc)
    doc = libxml2.parseDoc(risxSet)
    result = style.applyStylesheet(doc, None)
    htmlString = style.saveResultToString(result)

xsl is of course a variable which references a stylesheet. The
stylesheet has a parameter setting like this:

http://localhost/refdb-client/index.py

I'd like to pass the parameter to the stylesheet in the above code.
Can this be done in a straightforward way? I get the impression I
should use the class libxslt.xpathParserContext(), but I really don't
understand how it's supposed to work! I much appreciate any pointers.
thanks,

matt

-------------------------------------------
Matt Price               matt.price@utoronto.ca
History Department, University of Toronto
(416) 978-2094
--------------------------------------------

From veillard at redhat.com  Sat Aug 14 11:17:12 2004
From: veillard at redhat.com (Daniel Veillard)
Date: Sat Aug 14 11:18:05 2004
Subject: [XML-SIG] xslt/parameters
In-Reply-To: <20040814014920.GA10691@utoronto.ca>
References: <20040814014920.GA10691@utoronto.ca>
Message-ID: <20040814091712.GN5127@redhat.com>

On Fri, Aug 13, 2004 at 09:49:20PM -0400, Matt Price wrote:
> Can someone out there tell me how I pass a parameter value to an xsl
> stylesheet in python?
> Right now I have the following couple lines of
> code, more or less stolen from somewhere since I'm still pretty much at
> sea with xml:
>
>     styledoc = libxml2.parseFile(xsl)
>     style = libxslt.parseStylesheetDoc(styledoc)
>     doc = libxml2.parseDoc(risxSet)
>     result = style.applyStylesheet(doc, None)
>     htmlString = style.saveResultToString(result)
>
> xsl is of course a variable which references a stylesheet. The
> stylesheet has a parameter setting like this:
>
> http://localhost/refdb-client/index.py
>
> I'd like to pass the parameter to the stylesheet in the above code.
> Can this be done in a straightforward way? I get the impression I
> should use the class libxslt.xpathParserContext(), but I really don't
> understand how it's supposed to work! I much appreciate any pointers.
> thanks,

  You're using libxml2/libxslt in that context, so it is better to ask
for help in the right channel:
   http://xmlsoft.org/XSLT/bugs.html

  The parameters to the transformation are passed as a dictionary
to applyStylesheet(); instead of passing None, pass the dictionary
containing the (name, value) pairs for all parameters.

Daniel

--
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard@redhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

From abra9823 at mail.usyd.edu.au  Sun Aug 15 05:06:38 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Sun Aug 15 05:06:43 2004
Subject: [XML-SIG] import node into document
Message-ID: <1092539198.411ed33e6a3c2@www-mail.usyd.edu.au>

hi!

I have two documents 'policy' and 'dataschema'.
how can i add a node (say, noded) from 'dataschema' as a child to a
particular node in 'policy' (say nodep)

java has importNode, is there an equivalent function in Python.
if not, how do i go about doing it?
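
[Editor's sketch for the "import node into document" question.] DOM Level 2 defines Document.importNode for exactly this case. The sketch below assumes the standard library's xml.dom.minidom, which provides it; whether the DOM implementation used by the poster behaves identically is not confirmed here. The two documents and their element names are hypothetical stand-ins; nodep and noded follow the naming in the question.

```python
from xml.dom.minidom import parseString

# Hypothetical stand-ins for the two documents in the question.
policy = parseString("<policy><statement/></policy>")
dataschema = parseString('<dataschema><data ref="user.name"/></dataschema>')

nodep = policy.documentElement.firstChild       # target parent in 'policy'
noded = dataschema.documentElement.firstChild   # node to copy from 'dataschema'

# importNode returns a copy owned by 'policy'; the source node stays in
# 'dataschema' untouched. The second argument requests a deep copy.
imported = policy.importNode(noded, True)
nodep.appendChild(imported)

print(policy.documentElement.toxml())
```

The copy is appended instead of the original node, so ownerDocument never has to be assigned by hand.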
just doing
    nodep.appendChild(noded)
throws an error saying they are of different documents

doing
    noded.ownerDocument = nodep.ownerDocument
also throws an error saying ownerDocument is a read-only object.

how do i then do the import?
thanks

cheers


----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.

From abra9823 at mail.usyd.edu.au  Sat Aug 14 13:55:13 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Sun Aug 15 22:46:10 2004
Subject: [XML-SIG] python and XML resources
Message-ID: <1092484513.411dfda120873@www-mail.usyd.edu.au>

hi!

does anyone know of good online resources on XML processing in Python.
I am using the PyXML package and have read the introductory XML HOWTO.
what i am looking for is a more detailed and comprehensive coverage of the
entire package - all the classes and functions etc

cheers
ajay

From abra9823 at mail.usyd.edu.au  Mon Aug 16 08:54:50 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Mon Aug 16 08:54:57 2004
Subject: [XML-SIG] namespace error - how to ignore
Message-ID: <1092639290.41205a3a30528@www-mail.usyd.edu.au>

hi!

i have the following code to create a document

    ssock = StringIO.StringIO(inputString)
    reader = Sax2.Reader()
    doc = reader.fromStream(ssock)

input string simply contains

when i run it, it throws a namespace error. i can understand where the
error is coming from (i haven't defined the namespace), but is there a way
to get past it? to get it to ignore the namespace?
the same thing in Java works fine (without worrying about the namespace).

thanks

cheers

From abra9823 at mail.usyd.edu.au  Mon Aug 16 17:45:50 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Mon Aug 16 17:45:56 2004
Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML
Message-ID: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au>

hi!

for the XML

if i get up to the "ACCESS" element and print its attribute name and value
using

    if attribs != None and len(attribs) > 0:
        index = 0
        while index < attribs.length:
            print "attribute ", index, ": ", attribs.item(index).nodeName, " has value: ", attribs.item(index).nodeValue
            index += 1

it prints ACCESS having the attribute "appel:connective" with the value
"non-and"
the statement attribs.getNamedItem("appel:connective") however returns
None.
now i think it's substituting the namespace for appel but then how would you
access the attribute, just 'connective' doesn't work, 'appel:connective'
doesn't either and http://www.w3.org/2001/02/appelv1:connective doesn't
work either.
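
[Editor's sketch for the namespaced-attribute question.] In a namespace-aware DOM, a prefixed attribute is looked up by its namespace URI plus local name through getAttributeNS, not by concatenating the URI and the name. The sketch below assumes the standard library's xml.dom.minidom rather than the PyXML reader used in the thread, so it does not claim to reproduce the behaviour reported there. The document is a hypothetical reconstruction: only the two namespace URIs, the ACCESS element name and the appel:connective="non-and" attribute come from the thread; RULE, OTHER and the choice of the p3p namespace for ACCESS are invented for illustration.

```python
from xml.dom.minidom import parseString

APPEL_NS = "http://www.w3.org/2001/02/appelv1"
P3P_NS = "http://www.w3.org/2000/12/p3pv1"

# Hypothetical document; see the note above for which parts are assumed.
SOURCE = (
    '<appel:RULE xmlns:appel="%s" xmlns:p3p="%s">'
    '<p3p:ACCESS appel:connective="non-and"/>'
    '<p3p:OTHER/>'
    '</appel:RULE>' % (APPEL_NS, P3P_NS)
)

doc = parseString(SOURCE)
access = doc.getElementsByTagNameNS(P3P_NS, "ACCESS")[0]
other = doc.getElementsByTagNameNS(P3P_NS, "OTHER")[0]

# Look the attribute up by (namespace URI, local name).
by_ns = access.getAttributeNS(APPEL_NS, "connective")

# For a missing attribute the DOM returns an empty string, not None, so a
# "!= None" test always passes; hasAttributeNS is the reliable check.
missing = other.getAttributeNS(APPEL_NS, "connective")
present = other.hasAttributeNS(APPEL_NS, "connective")

print(repr(by_ns), repr(missing), present)
```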
thanks

cheers

--
Ajay Brar,
CS Honours 2004
Smart Internet Technology Research Group

From nhs at llnl.gov  Mon Aug 16 17:48:05 2004
From: nhs at llnl.gov (Norm Samuelson)
Date: Mon Aug 16 17:48:10 2004
Subject: [XML-SIG] Re: XML-SIG Digest, Vol 16, Issue 18
In-Reply-To: <20040814100006.1ABC71E4002@bag.python.org>
References: <20040814100006.1ABC71E4002@bag.python.org>
Message-ID: <6.0.0.22.2.20040816083955.031bc068@mail.llnl.gov>

At 03:00 AM 8/14/2004, you wrote:
>Date: Fri, 13 Aug 2004 21:49:20 -0400
>From: Matt Price
>Subject: [XML-SIG] xslt/parameters
>To: python xml SIG
>Message-ID: <20040814014920.GA10691@utoronto.ca>
>Content-Type: text/plain; charset=us-ascii
>
>Can someone out there tell me how I pass a parameter value to an xsl
>stylesheet in python? Right now I have the following couple lines of
>code, more or less stolen from somewhere since I'm still pretty much at
>sea with xml:
>
>    styledoc = libxml2.parseFile(xsl)
>    style = libxslt.parseStylesheetDoc(styledoc)
>    doc = libxml2.parseDoc(risxSet)
>    result = style.applyStylesheet(doc, None)
>    htmlString = style.saveResultToString(result)
>
>xsl is of course a variable which references a stylesheet. The
>stylesheet has a parameter setting like this:
>
>name="mainTarget">http://localhost/refdb-client/index.py
>
>I'd like to pass the parameter to the stylesheet in the above code.
>Can this be done in a straightforward way? I get the impression I
>should use the class libxslt.xpathParserContext(), but I really don't
>understand how it's supposed to work! I much appreciate any pointers.
>thanks,
>
>matt

I have one xsl stylesheet that uses a param. I use the stand-alone xalan
xslt processor.
On the command line that starts xalan, I pass a number of arguments (input file name, output file name, stylesheet name, etc) also including the following three tokens: -param targetCode ale3d The first of those signals that I'm setting a param, the second is the name of the param (as in the tag, and the third is the value to replace the default value given in the text under that tag. Of course, if you are not using a stand-alone version you will need to find a way to pass the params, but if you follow the logic of the stand-alone version it should become obvious how to do it. - Norm - Norman H. Samuelson nhs@llnl.gov Lawrence Livermore National Lab 925-422-0661 P.O. Box 808, L-98 Livermore, CA 94551 From abra9823 at mail.usyd.edu.au Mon Aug 16 18:44:10 2004 From: abra9823 at mail.usyd.edu.au (Ajay) Date: Mon Aug 16 18:44:13 2004 Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML In-Reply-To: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au> References: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au> Message-ID: <1092674650.4120e45a4555c@www-mail.usyd.edu.au> also getAttribute("appel:connective") returns " ", ie it is not None but when i print it out thats what i get funnily getAttribute("appel:connective") for an element thats doesn't have the attribute "appel:connective" still passes the test if element.getAttribute("appel:connective") != None so how can i retrieve an attribute of type "appel:connective", ie, prefixed by the uri appel and getAttributeNS doesn't work either. same as for getAttribute Quoting Ajay : > hi! 
> > for the XML > xmlns:p3p="http://www.w3.org/2000/12/p3pv1"> > > > > > > > > > if i getupto the "ACCESS" element and print its attribute name and value > using > if attribs != None and len(attribs) > 0: > index = 0 > while index < attribs.length: > print "attribute ", index, ": ", attribs.item(index).nodeName, " > has > value: ", attribs.item(index).nodeValue > index += 1 > > it prints ACCESS having the attribute "appel:connective" with the value > "non-and" > the statement attribs.getNamedItem("appel:connective") however returns > None. > now i think its substituting the namespace for appel but then how would > you > access the attribute, just 'connective' doesn't work, 'appel:connective' > doesn't either and http://www.w3.org/2001/02/appelv1:connective doesn't > work either. > > thanks > > cheers > > -- > Ajay Brar, > CS Honours 2004 > Smart Internet Technology Research Group > > > > > > ---------------------------------------------------------------- > This message was sent using IMP, the Internet Messaging Program. > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://mail.python.org/mailman/listinfo/xml-sig > ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program. From mike at skew.org Mon Aug 16 20:08:26 2004 From: mike at skew.org (Mike Brown) Date: Mon Aug 16 20:08:29 2004 Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML In-Reply-To: <1092674650.4120e45a4555c@www-mail.usyd.edu.au> "from Ajay at Aug 17, 2004 02:44:10 am" Message-ID: <200408161808.i7GI8QTJ064187@chilled.skew.org> Ajay wrote: > also getAttribute("appel:connective") returns " ", ie it is not None but > when i print it out thats what i get I'm not very experienced with using minidom but that's surprising to me. 
    >>> from xml.dom.minidom import parseString
    >>> inputString = ''
    >>> doc = parseString(inputString)
    >>> doc.childNodes[0].getAttribute('appel:connective')
    ''

You're right: an empty byte string is returned in that case. I would've expected None, too. Given that an existing attribute results in a unicode object being returned, e.g.

    >>> doc.childNodes[0].getAttribute('appel:empty')
    u''
    >>> doc.childNodes[0].getAttribute('appel:full')
    u'hi'

it seems weird that '' and u'' mean different things, but I am guessing the intent was DOM conformance, and DOM demands that a string be returned (DOM is a poorly designed API, by the way), and minidom's implementation is probably supposed to return u'' in both cases.

Therefore you should not be using getAttribute()/getAttributeNS() to test for existence of an attribute. What you should be doing is using hasAttribute or hasAttributeNS. The fact that these methods are not documented at http://www.python.org/doc/2.3.4/lib/dom-element-objects.html is a documentation bug.

> funnily getAttribute("appel:connective") for an element thats doesn't have
> the attribute "appel:connective" still passes the test
> if element.getAttribute("appel:connective") != None

Per PEP 8 (coding style guide on python.org) always use "is None" or "is not None" rather than "== None" or "!= None". Again, a simple test shows why:

    >>> '' != None
    True
    >>> '' == None
    False

> so how can i retrieve an attribute of type "appel:connective", ie, prefixed
> by the uri appel
> and getAttributeNS doesn't work either. same as for getAttribute

I think you realize this, but appel is not a URI, it is a prefix. http://www.w3.org/2001/02/appelv1 is a URI. (Well, technically, I think folks are now saying that if it's being used as a namespace name, then it's not a URI, it's just a string that is required to match the URI syntax)

Anyway, again, you're right, and I'd offer the same explanation as for getAttribute().
    >>> doc.childNodes[0].getAttributeNS('http://www.w3.org/2001/02/appelv1', 'connective')
    ''

From and at doxdesk.com Tue Aug 17 03:44:53 2004
From: and at doxdesk.com (Andrew Clover)
Date: Tue Aug 17 03:44:17 2004
Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML
In-Reply-To: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au>
References: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au>
Message-ID: <41216315.5080801@doxdesk.com>

Ajay wrote:

> the statement attribs.getNamedItem("appel:connective") however returns
> None.

Oh dear me. This is issue 20 from:

http://pyxml.sourceforge.net/topics/compliance.html

Which I believed had been fixed in PyXML 0.7, but apparently not; certainly I can see the problem again in 0.8.3.
Using namespace-unaware methods to access attributes which have namespaces just doesn't seem to work in 4DOM. That's quite bad really.

> now i think its substituting the namespace for appel but then how would you
> access the attribute, just 'connective' doesn't work, 'appel:connective'
> doesn't either and http://www.w3.org/2001/02/appelv1:connective doesn't
> work either.

You'd need one of the DOM Level 2 namespace-aware methods for this:

    attrs.getNamedItemNS('http://www.w3.org/2001/02/appelv1', 'connective')
    element.getAttributeNS('http://www.w3.org/2001/02/appelv1', 'connective')

Alternatively both minidom and pxdom do a bit better with namespaces in general and allow access to DOM Level 1 and 2 methods at the same time. Is there a particular feature of 4DOM you need?

--
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/

From and at doxdesk.com Tue Aug 17 03:56:11 2004
From: and at doxdesk.com (Andrew Clover)
Date: Tue Aug 17 03:55:36 2004
Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML
In-Reply-To: <200408161808.i7GI8QTJ064187@chilled.skew.org>
References: <200408161808.i7GI8QTJ064187@chilled.skew.org>
Message-ID: <412165BB.6010002@doxdesk.com>

Mike Brown wrote:

> I'm not very experienced with using minidom but that's surprising to me.

Probably because Ajay isn't using minidom :-)

> from xml.dom.minidom import parseString
> inputString = ''
> doc = parseString(inputString)
> doc.childNodes[0].getAttribute('appel:connective')
> ''
> I would've expected None

'' is correct in this case. getAttribute returns an empty string if no attribute is found as per DOM Level 1 spec. It is getAttributeNode that returns None (null) when the attribute is not found.

> it seems weird that '' and u'' mean different things

They don't. Python binds the DOMString type to strings in general, so both unicode and narrow strings can be used.
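The distinction drawn above can be sketched with the standard-library minidom. This is only an illustration under a few assumptions: it runs on a current Python (where every string is unicode, so the '' versus u'' question does not arise), and the sample element and the names APPEL_NS and SAMPLE are made up here, because the XML from the original post did not survive in the archive.

```python
from xml.dom.minidom import parseString

# Namespace URI quoted in the thread; the constant name is hypothetical.
APPEL_NS = "http://www.w3.org/2001/02/appelv1"

# Hypothetical stand-in for the lost sample document.
SAMPLE = '<ACCESS xmlns:appel="%s" appel:connective="non-and"/>' % APPEL_NS

elem = parseString(SAMPLE).documentElement

# Existence test: use hasAttributeNS rather than comparing a value to None.
present = elem.hasAttributeNS(APPEL_NS, "connective")

# Value lookup by (namespace URI, local name); the prefix is not used.
value = elem.getAttributeNS(APPEL_NS, "connective")

# A missing attribute yields an empty string from getAttributeNS ...
missing_value = elem.getAttributeNS(APPEL_NS, "nosuch")

# ... and None from getAttributeNodeNS.
missing_node = elem.getAttributeNodeNS(APPEL_NS, "nosuch")
```

Whether 4DOM behaves the same way is exactly what the thread is questioning, so the sketch says nothing about PyXML's own DOM.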
(Though it is usually best to use unicode, and definitely a bad idea to be putting non-ASCII characters in narrow binary strings.) It just happens that minidom returns a narrow empty string for attribute-not-found; it could just as easily be u''. > Therefore you should not be using > getAttribute()/getAttributeNS() to test for existence of an attribute. Indeed. This can be useful when an attribute value should act as if defaulting to the empty string. > What you should be doing is using hasAttribute or hasAttributeNS. Yep. Alternatively getAttributeNode can also do the job. -- Andrew Clover mailto:and@doxdesk.com http://www.doxdesk.com/ From abra9823 at mail.usyd.edu.au Tue Aug 17 04:38:00 2004 From: abra9823 at mail.usyd.edu.au (Ajay) Date: Tue Aug 17 04:38:07 2004 Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML In-Reply-To: <41216315.5080801@doxdesk.com> References: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au> <41216315.5080801@doxdesk.com> Message-ID: <1092710280.41216f88ab8b8@www-mail.usyd.edu.au> no, there isn't any particular feature of 4DOM that i need. 
the problem though seems that i can't use xpath in PyXML with a document parsed using xml.dom.minidom
the following piece of code

    dataNodes = xpath.Evaluate(".//*[local-name()='DATA']",document.documentElement)

works perfectly fine when i pass in a document parsed using

    document = reader.fromStream(open("test.xml", 'r'))

however when i pass a document parsed using minidom i get the following exception

      File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\__init__.py", line 70, in Evaluate
        retval = parser.new().parse(expr).evaluate(con)
      File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\ParsedAbbreviatedRelativeLocationPath.py", line 52, in evaluate
        res = Set.Union(res,subRt)
      File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\Set.py", line 25, in Union
        return compare + filter(lambda x,compare = compare:x not in compare,loop)
    TypeError: can only concatenate list (not "tuple") to list

i would actually prefer using just minidom and not even have xpath. the application may be ported to a PDA and the pythonce distribution does not include the PyXML package.
since i use xpath to just locate node subsets, i would have to rewrite functions to do that by just looping through the different nodes (i don't know how hard that will be) --- is there someone who has already done that?

on the PyXML documentation page under the section on compliance issues, it says
"Never gets the attribute - always returns false for hasAttribute, empty string for getAttribute, or null for getAttributeNode."
funny. i should have read that before trying hours on why my calls weren't working
efficiency and a future port to a PDA are the reasons why i didn't use pxdom. that and being a newbie meant i knew very little about the different packages.

thanks

cheers

> Alternatively both minidom and pxdom do a bit better with namespaces in
> general and allow access to DOM Level 1 and 2 methods at the same time.
> Is there a particular feature of 4DOM you need?
> > -- > Andrew Clover > mailto:and@doxdesk.com > http://www.doxdesk.com/ > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://mail.python.org/mailman/listinfo/xml-sig > ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program. From xmlsig at codeweld.com Tue Aug 17 13:59:51 2004 From: xmlsig at codeweld.com (xmlsig@codeweld.com) Date: Tue Aug 17 13:59:53 2004 Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak? In-Reply-To: <1091095679.4108cc7f0bf70@webmail.codeweld.com> References: <1091095679.4108cc7f0bf70@webmail.codeweld.com> Message-ID: <1092743991.4121f33704f17@webmail.codeweld.com> > I've python 2.3.4 on windows xp with PyXML-0.8.3.win32-py2.3 > > This code leaks substancialy > > from xml.dom.ext.reader.HtmlLib import FromHtml > import urllib > from xml.dom import ext > s = urllib.urlopen( 'http://www.google.com' ).read() > while True: > root = FromHtml( s ) > ext.ReleaseNode( root ) > > However, this does not ( or only very minor ) > > from xml.dom.ext.reader.Sax2 import Reader > import urllib > from xml.dom import ext > s = urllib.urlopen( 'http://www.infoworld.com/rss/reviews.xml' ).read() > while True: > reader = Reader() > root = reader.fromString( s ) > ext.ReleaseNode( root ) > > Any suggestions? Could anybody reproduce the leak? Any suggestions what I do wrong? From fredrik at pythonware.com Wed Aug 18 10:08:48 2004 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed Aug 18 10:07:07 2004 Subject: [XML-SIG] Re: help - attributes namespace - is this a bug in PyXML References: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au><41216315.5080801@doxdesk.com> <1092710280.41216f88ab8b8@www-mail.usyd.edu.au> Message-ID: "Ajay" wrote: > i would actually prefer using just minidom and not even have xpath. the > application may be ported to a PDA and the pythonce distribution does not > include the PyXML package. 
> since i use xpath to just locate node subsets, i would have to rewrite
> funtions to do that by just looping through the different nodes (i don't
> know how hard that will be) --- is there someone who has already done
> that?

plug: people who work on "small platforms" are known to like the elementtree package:

http://effbot.org/zone/element-index.htm

elementtree has limited support for XPath:

http://effbot.org/zone/element-xpath.htm

From mike at seligrealtors.com Thu Aug 19 20:19:10 2004
From: mike at seligrealtors.com (Mike Selig)
Date: Thu Aug 19 20:19:18 2004
Subject: [XML-SIG] RE: Delivery reports about your e-mail
In-Reply-To: <200408191727.CFC32779@ms7.netsolmail.com>
Message-ID: <000001c48619$063c0e40$0201a8c0@mycomputer>

I received this from you unsolicited.
I'm not going to open the attachment until I can verify the source. Please provide me more info on who you are and how you are able to fix this problem on my computer. A web address might also be helpful. -----Original Message----- From: xml-sig@python.org [mailto:xml-sig@python.org] Sent: Thursday, August 19, 2004 1:27 PM To: mike@seligrealtors.com Subject: Delivery reports about your e-mail Dear user mike@seligrealtors.com, Your e-mail account was used to send a large amount of unsolicited email messages during this week. Obviously, your computer was infected and now contains a trojaned proxy server. We recommend that you follow our instruction in order to keep your computer safe. Have a nice day, seligrealtors.com user support team. --- Incoming mail is certified Virus Free. Checked by AVG anti-virus system (http://www.grisoft.com). Version: 6.0.721 / Virus Database: 477 - Release Date: 7/16/2004 --- Outgoing mail is certified Virus Free. Checked by AVG anti-virus system (http://www.grisoft.com). Version: 6.0.721 / Virus Database: 477 - Release Date: 7/16/2004 From uche.ogbuji at fourthought.com Thu Aug 19 21:34:25 2004 From: uche.ogbuji at fourthought.com (Uche Ogbuji) Date: Thu Aug 19 21:34:36 2004 Subject: [XML-SIG] namespace error - how to ignore In-Reply-To: <1092639290.41205a3a30528@www-mail.usyd.edu.au> References: <1092639290.41205a3a30528@www-mail.usyd.edu.au> Message-ID: <1092944065.810.1351.camel@borgia> On Mon, 2004-08-16 at 00:54, Ajay wrote: > hi! > > i have the following code to create a a document > ssock = StringIO.StringIO(inputString) > reader = Sax2.Reader() > doc = reader.fromStream(ssock) > > input string simply contains > when i run it, it throws a namespace error. i can understand where the > error is coming from (i haven't defined the namespace), but is there a way > to get past it? to get it to ignore the namespace? > the same thing in Java works fine (without worrying about the namespace). 
Sax.Reader is not namespace aware, so it should accept this.

However, you're on the wrong track:

1) Why are you trying to parse a document that is not XML namespace compliant? You'll have nothing but trouble.

2) I suggest not using 4DOM (i.e. xml.dom.ext.reader)

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Decomposition, Process, Recomposition - http://www.xml.com/pub/a/2004/07/28/py-xml.html
Perspective on XML: Steady steps spell success with Google - http://www.adtmag.com/article.asp?id=9663
Managing XML libraries - http://www.adtmag.com/article.asp?id=9160
Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090
Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/

From uche.ogbuji at fourthought.com Thu Aug 19 21:38:16 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Thu Aug 19 21:38:20 2004
Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML
In-Reply-To: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au>
References: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au>
Message-ID: <1092944296.810.1356.camel@borgia>

On Mon, 2004-08-16 at 09:45, Ajay wrote:
> hi!
>
> for the XML
> xmlns:p3p="http://www.w3.org/2000/12/p3pv1">
>
>
>
>
>
>
>
>
> if i getupto the "ACCESS" element and print its attribute name and value
> using
> if attribs != None and len(attribs) > 0:
> index = 0
> while index < attribs.length:
> print "attribute ", index, ": ", attribs.item(index).nodeName, " has
> value: ", attribs.item(index).nodeValue
> index += 1
>
> it prints ACCESS having the attribute "appel:connective" with the value
> "non-and"
> the statement attribs.getNamedItem("appel:connective") however returns
> None.
> now i think its substituting the namespace for appel but then how would you > access the attribute, just 'connective' doesn't work, 'appel:connective' > doesn't either and http://www.w3.org/2001/02/appelv1:connective doesn't > work either. If you're accessing nodes in namespaces, you have to use the namespace-aware APIs. These have "NS" at the ends of their names. Then forget the QName. You need getNamedItemNS("http://www.w3.org/2001/02/appelv1", "connective") -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Decomposition, Process, Recomposition - http://www.xml.com/pub/a/2004/07/28/py-xml.html Perspective on XML: Steady steps spell success with Google - http://www.adtmag.com/article.asp?id=9663 Managing XML libraries - http://www.adtmag.com/article.asp?id=9160 Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090 Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/ From uche.ogbuji at fourthought.com Thu Aug 19 21:41:42 2004 From: uche.ogbuji at fourthought.com (Uche Ogbuji) Date: Thu Aug 19 21:41:49 2004 Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML In-Reply-To: <1092710280.41216f88ab8b8@www-mail.usyd.edu.au> References: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au> <41216315.5080801@doxdesk.com> <1092710280.41216f88ab8b8@www-mail.usyd.edu.au> Message-ID: <1092944502.810.1359.camel@borgia> On Mon, 2004-08-16 at 20:38, Ajay wrote: > no, there isn't any particular feature of 4DOM that i need. 
> the problem though seems that i can't use xpath in PyXML with a document > parsed using xml.dom.minidom > the following piece of code > > dataNodes = xpath.Evaluate(".//*[local-name()='DATA']",document.documentEle > ment) > > works perfectly fine when i pass in a document parsed using > > document = reader.fromStream(open("test.xml", 'r')) > > however when i pass a document parsed using minidom i get the following > exception > > File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\__init__.py", line 70, > in E > valuate > retval = parser.new().parse(expr).evaluate(con) > File > "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\ParsedAbbreviatedRelativeLo > cationPath.py", line 52, in evaluate > res = Set.Union(res,subRt) > File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\Set.py", line 25, in > Union > return compare + filter(lambda x,compare = compare:x not in > compare,loop) > TypeError: can only concatenate list (not "tuple") to list > > > i would actually prefer using just minidom and not even have xpath. the > application may be ported to a PDA and the pythonce distribution does not > include the PyXML package. > since i use xpath to just locate node subsets, i would have to rewrite > funtions to do that by just looping through the different nodes (i don't > know how hard that will be) --- is there someone who has already done > that? > > on the PyXML documentation page under the section on compliance issues, it > says > "Never gets the attribute - always returns false for hasAttribute, empty > string for getAttribute, or null for getAttributeNode." > funny. i should have read that before trying hours on why my calls weren't > working > efficiency and a future port to a PDA are the reasons why i didn't use > pxdom. that and being a newbie meant i knew very little about the > different packages. I suggest 4Suite. It has a very fast DOM (Domlette), and a very good XPath impl (the one in PyXML is a much older version of 4Suite's XPath). 
It does use some C code (so does PyXML, though), so bear that in mind for future porting thoughts. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Decomposition, Process, Recomposition - http://www.xml.com/pub/a/2004/07/28/py-xml.html Perspective on XML: Steady steps spell success with Google - http://www.adtmag.com/article.asp?id=9663 Managing XML libraries - http://www.adtmag.com/article.asp?id=9160 Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090 Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/ From uche.ogbuji at fourthought.com Thu Aug 19 21:45:20 2004 From: uche.ogbuji at fourthought.com (Uche Ogbuji) Date: Thu Aug 19 21:45:25 2004 Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak? In-Reply-To: <1092743991.4121f33704f17@webmail.codeweld.com> References: <1091095679.4108cc7f0bf70@webmail.codeweld.com> <1092743991.4121f33704f17@webmail.codeweld.com> Message-ID: <1092944720.810.1363.camel@borgia> On Tue, 2004-08-17 at 05:59, xmlsig@codeweld.com wrote: > > I've python 2.3.4 on windows xp with PyXML-0.8.3.win32-py2.3 > > > > This code leaks substancialy > > > > from xml.dom.ext.reader.HtmlLib import FromHtml > > import urllib > > from xml.dom import ext > > s = urllib.urlopen( 'http://www.google.com' ).read() > > while True: > > root = FromHtml( s ) > > ext.ReleaseNode( root ) > > > > However, this does not ( or only very minor ) > > > > from xml.dom.ext.reader.Sax2 import Reader > > import urllib > > from xml.dom import ext > > s = urllib.urlopen( 'http://www.infoworld.com/rss/reviews.xml' ).read() > > while True: > > reader = Reader() > > root = reader.fromString( s ) > > ext.ReleaseNode( root ) > > > > Any suggestions? > > Could anybody reproduce the leak? > Any suggestions what I do wrong? 
I haven't done much work in HtmlLib since it was rewritten to use sgmlop. It will take some heavy digging to find the precise memory leak. What's your overall problem? Could you use Python 2.3's HTMLParser library instead? -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Decomposition, Process, Recomposition - http://www.xml.com/pub/a/2004/07/28/py-xml.html Perspective on XML: Steady steps spell success with Google - http://www.adtmag.com/article.asp?id=9663 Managing XML libraries - http://www.adtmag.com/article.asp?id=9160 Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090 Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/ From and-xml at doxdesk.com Fri Aug 20 06:08:51 2004 From: and-xml at doxdesk.com (Andrew Clover) Date: Fri Aug 20 06:08:18 2004 Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML In-Reply-To: <1092710280.41216f88ab8b8@www-mail.usyd.edu.au> References: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au> <41216315.5080801@doxdesk.com> <1092710280.41216f88ab8b8@www-mail.usyd.edu.au> Message-ID: <41257953.4020701@doxdesk.com> Ajay wrote: > the problem though seems that i can't use xpath in PyXML with a document > parsed using xml.dom.minidom > dataNodes = xpath.Evaluate(".//*[local-name()='DATA']", doc.documentElement) > TypeError: can only concatenate list (not "tuple") to list Weird, works for me (0.8.3, even back to 0.6.6), and I can't see any reason why the Union method might be getting a tuple instead of a list with minidom. > since i use xpath to just locate node subsets, i would have to rewrite > funtions to do that by just looping through the different nodes (i don't > know how hard that will be) --- is there someone who has already done > that? 
Sounds pretty easy to me; your example could be implemented as documentElement.getElementsByTagNameNS('*', 'DATA'). List comprehensions can also simplify looking through childNodes; anything doing a depth search will need a few trivial recursive functions. > "Never gets the attribute - always returns false for hasAttribute, empty > string for getAttribute, or null for getAttributeNode." > funny. i should have read that before trying hours on why my calls weren't > working Well quite, similar frustrations led me to compile it! That one's a bug from old versions of cDomlette though, shouldn't affect 4DOM. The calls fail in 4DOM under a more limited set of circumstances; I've updated the table to add bug 20 to the latest 4DOM too as per your previous bug. > efficiency and a future port to a PDA are the reasons why i didn't use > pxdom. Well, a PDA port shouldn't be a problem - pxdom is pure-Python (compatible back to 1.5.2). Of course for efficiency as you say it's pretty poor. cDomlette is the best option for efficiency, but has C parts so would need suitable recompiling. It has a decent XPath too. Support for DOM features is deliberately very limited so don't expect to be able to move an arbitrary DOM application to it without change. -- Andrew Clover mailto:and@doxdesk.com http://www.doxdesk.com/ From xmlsig at codeweld.com Fri Aug 20 08:52:47 2004 From: xmlsig at codeweld.com (xmlsig@codeweld.com) Date: Fri Aug 20 08:52:50 2004 Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak? 
In-Reply-To: <1092944720.810.1363.camel@borgia> References: <1091095679.4108cc7f0bf70@webmail.codeweld.com> <1092743991.4121f33704f17@webmail.codeweld.com> <1092944720.810.1363.camel@borgia> Message-ID: <1092984767.41259fbf40266@webmail.codeweld.com> Quoting Uche Ogbuji : > On Tue, 2004-08-17 at 05:59, xmlsig@codeweld.com wrote: > > > I've python 2.3.4 on windows xp with PyXML-0.8.3.win32-py2.3 > > > > > > This code leaks substancialy > > > > > > from xml.dom.ext.reader.HtmlLib import FromHtml > > > import urllib > > > from xml.dom import ext > > > s = urllib.urlopen( 'http://www.google.com' ).read() > > > while True: > > > root = FromHtml( s ) > > > ext.ReleaseNode( root ) > > > > > > However, this does not ( or only very minor ) > > > > > > from xml.dom.ext.reader.Sax2 import Reader > > > import urllib > > > from xml.dom import ext > > > s = urllib.urlopen( 'http://www.infoworld.com/rss/reviews.xml' ).read() > > > while True: > > > reader = Reader() > > > root = reader.fromString( s ) > > > ext.ReleaseNode( root ) > > > > > > Any suggestions? > > > > Could anybody reproduce the leak? > > Any suggestions what I do wrong? > > I haven't done much work in HtmlLib since it was rewritten to use > sgmlop. It will take some heavy digging to find the precise memory > leak. What's your overall problem? Could you use Python 2.3's > HTMLParser library instead? The overall problem is that the FromHtml call ( in this example )allocates some 100-200 k per loop that are not freed for the runtime of the process. The leak's bigger when no ReleaseNode call is made. I could of course use other means of extracting information from html, but I thought it would not be needed to reinvent the wheel if somebody has already written a html parser that spits out dom. 
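For reference, the standard-library route suggested earlier in the thread can be sketched as follows. This is a hedged illustration, not a drop-in replacement for FromHtml: it builds no DOM tree, it uses the current module name html.parser (the module was called HTMLParser in the Python 2.3 discussed here), and the task it performs (collecting link targets) is a hypothetical one, since the poster's actual goal is not stated. LinkCollector, extract_links and SAMPLE are names invented for the example.

```python
from html.parser import HTMLParser  # called HTMLParser in Python 2.3


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags without building a DOM tree."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; attrs is a list
        # of (name, value) pairs, with value None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Feed a complete HTML string to the parser and return the hrefs found."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical input; the thread fetched a live page with urllib instead.
SAMPLE = (
    '<html><body><A HREF="http://www.python.org/">Python</A> '
    '<a href="/sigs/xml-sig/">XML-SIG</a><a name="x">no href</a></body></html>'
)
links = extract_links(SAMPLE)
```

Because each parser instance holds only the list it collects, nothing comparable to ext.ReleaseNode is needed; whether this avoids the leak reported above has not been measured.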
From fredrik at pythonware.com Fri Aug 20 09:00:11 2004
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri Aug 20 08:58:29 2004
Subject: [XML-SIG] Re: help - attributes namespace - is this a bug in PyXML
References: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au> <41216315.5080801@doxdesk.com><1092710280.41216f88ab8b8@www-mail.usyd.edu.au> <41257953.4020701@doxdesk.com>
Message-ID:

Andrew Clover wrote:

> Well, a PDA port shouldn't be a problem - pxdom is pure-Python (compatible back to 1.5.2). Of
> course for efficiency as you say it's pretty poor.

I'd say "pretty poor" is an understatement:

Parsing the ot.xml file from jon bosak's collection (3.5 MB):

    minidom:                  1.4 seconds, 53 megabytes
    elementtree:              1.6 seconds, 14 megabytes
    same, w. sgmlop:          0.76 seconds
    same, w. Python parser:   2.9 seconds
    same, w. C element type:  0.38 seconds
    pxdom:                    800 seconds, 79 megabytes

That's 500 times slower than other portable implementations, and 2100 times slower than the fastest XML object implementation I have here. Put another way, pxdom parses 4350 bytes per second on a 3 GHz PC.

(the factor drops somewhat with smaller files, but it's still in the "a few kilobytes per second" range)

From mail_container at documentmailer.com Sun Aug 22 17:28:20 2004
From: mail_container at documentmailer.com (mail_container@documentmailer.com)
Date: Sun Aug 22 17:28:33 2004
Subject: [XML-SIG] Returned mail: Data format error
Message-ID: <20040822152830.AAB171E4003@bag.python.org>

The message was not delivered due to the following reason: Your message could not be delivered because the destination server was unreachable within the allowed queue period. The amount of time a message is queued before it is returned depends on local configuration parameters. Most likely there is a network problem that prevented delivery, but it is also possible that the computer is turned off, or does not have a mail system running right now.
Your message could not be delivered within 6 days: Mail server 56.55.39.111 is not responding. The following recipients did not receive this message: Please reply to postmaster@documentmailer.com if you feel this message to be in error. -------------- next part -------------- A non-text attachment was scrubbed... Name: mail.zip Type: application/octet-stream Size: 29060 bytes Desc: not available Url : http://mail.python.org/pipermail/xml-sig/attachments/20040822/dc2bce58/mail-0001.obj From jbam1113 at yahoo.com Sun Aug 22 20:51:20 2004 From: jbam1113 at yahoo.com (Jeremy Chesson) Date: Sun Aug 22 20:50:49 2004 Subject: [XML-SIG] Buy Vicodin online today, overnight shipping xyiz kccg v Message-ID: <20040822185120.13401.qmail@web13722.mail.yahoo.com> how do I go about buying this? __________________________________ Do you Yahoo!? New and Improved Yahoo! Mail - Send 10MB messages! http://promotions.yahoo.com/new_mail From uche.ogbuji at fourthought.com Mon Aug 23 18:31:11 2004 From: uche.ogbuji at fourthought.com (Uche Ogbuji) Date: Mon Aug 23 18:31:23 2004 Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak? 
In-Reply-To: <1092984767.41259fbf40266@webmail.codeweld.com> References: <1091095679.4108cc7f0bf70@webmail.codeweld.com> <1092743991.4121f33704f17@webmail.codeweld.com> <1092944720.810.1363.camel@borgia> <1092984767.41259fbf40266@webmail.codeweld.com> Message-ID: <1093278671.3314.4.camel@borgia> On Fri, 2004-08-20 at 00:52, xmlsig@codeweld.com wrote: > Quoting Uche Ogbuji : > > On Tue, 2004-08-17 at 05:59, xmlsig@codeweld.com wrote: > > > > I've python 2.3.4 on windows xp with PyXML-0.8.3.win32-py2.3 > > > > > > > > This code leaks substancialy > > > > > > > > from xml.dom.ext.reader.HtmlLib import FromHtml > > > > import urllib > > > > from xml.dom import ext > > > > s = urllib.urlopen( 'http://www.google.com' ).read() > > > > while True: > > > > root = FromHtml( s ) > > > > ext.ReleaseNode( root ) > > > > > > > > However, this does not ( or only very minor ) > > > > > > > > from xml.dom.ext.reader.Sax2 import Reader > > > > import urllib > > > > from xml.dom import ext > > > > s = urllib.urlopen( 'http://www.infoworld.com/rss/reviews.xml' ).read() > > > > while True: > > > > reader = Reader() > > > > root = reader.fromString( s ) > > > > ext.ReleaseNode( root ) > > > > > > > > Any suggestions? > > > > > > Could anybody reproduce the leak? > > > Any suggestions what I do wrong? > > > > I haven't done much work in HtmlLib since it was rewritten to use > > sgmlop. It will take some heavy digging to find the precise memory > > leak. What's your overall problem? Could you use Python 2.3's > > HTMLParser library instead? > > The overall problem is that the FromHtml call ( in this example )allocates some > 100-200 k per loop that are not freed for the runtime of the process. The > leak's bigger when no ReleaseNode call is made. By "overall problem" I mean what are you actually trying to do/achieve. Since no one has been able to step up to diagnose the memory leak, I'm looking to see whether there is another solution that would work for you. 
> I could of course use other means of extracting information from html, but I
> thought it would not be needed to reinvent the wheel if somebody has already
> written a html parser that spits out dom.

Honestly, I don't think DOM is the way I would personally go about
processing HTML, which is why I was trying to get at whether there was
another way for you to meet your needs.

I'm sorry that my workload is so heavy that there is no chance I could
work on figuring out a 4DOM memory leak right now.

Best of luck.

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Practical (Python) SAX Notes - http://www.xml.com/pub/a/2004/08/11/py-xml.html
XML circles the globe - http://www.javareport.com/article.asp?id=9797
Element structures for names and addresses - http://www.ibm.com/developerworks/xml/library/x-elemdes.html
Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090
Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/

From hostetlerm at gmail.com Wed Aug 25 20:54:28 2004
From: hostetlerm at gmail.com (Mike Hostetler)
Date: Wed Aug 25 20:54:38 2004
Subject: [XML-SIG] ANN: XMLBuilder 1.0
Message-ID:

I read a good blog entry about a Builder object in Ruby [1] and I
thought Python needed one.

Introducing XMLBuilder. It's nothing special, but it works quite
well. You create an XMLBuilder object, send it some dictionary data,
and it will generate the XML for you.
My version also allows nesting another XMLBuilder object inside, as well as adding them together (though that may not work like you want it to). It's easier to show than to describe. Here are some examples: >>> from xmlbuilder import XMLBuilder >>> b2 = XMLBuilder() >>> b2.name = {"last":"flintstone", 'attr':{"type":"friend"}, "first":"fred"} >>> print b2 flintstonefred >>> b1.contacts = {"owner":"thehaas@binary.net", ... "contact":b2} >>> print b1 thehaas@binary.netf\ lintstonefred >>> b = b1+b2 >>> print b thehaas@binary.netflintstonefredflintstonefred Note that "attr" isn't required to start an attribute dictionary -- any dictionary value inside a dictionary will trigger it. The good news -- it only used Python 2.3. The internal XML rendering is done with minidom. Py23 is required because it uses importNode when an object is nested. Grab it at: http://users.binary.net/thehaas/lab/files/xmlbuilder.py [1]http://onestepback.org/index.cgi/Tech/Ruby/BuilderObjects.rdoc -- Mike Hostetler thehaas@binary.net http://www.binary.net/thehaas From xmlsig at codeweld.com Wed Aug 25 22:32:31 2004 From: xmlsig at codeweld.com (xmlsig@codeweld.com) Date: Wed Aug 25 22:32:34 2004 Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak? 
In-Reply-To: <1093278671.3314.4.camel@borgia> References: <1091095679.4108cc7f0bf70@webmail.codeweld.com> <1092743991.4121f33704f17@webmail.codeweld.com> <1092944720.810.1363.camel@borgia> <1092984767.41259fbf40266@webmail.codeweld.com> <1093278671.3314.4.camel@borgia> Message-ID: <1093465951.412cf75f9f9b1@webmail.codeweld.com> Quoting Uche Ogbuji : > On Fri, 2004-08-20 at 00:52, xmlsig@codeweld.com wrote: > > Quoting Uche Ogbuji : > > > On Tue, 2004-08-17 at 05:59, xmlsig@codeweld.com wrote: > > > > > I've python 2.3.4 on windows xp with PyXML-0.8.3.win32-py2.3 > > > > > > > > > > This code leaks substancialy > > > > > > > > > > from xml.dom.ext.reader.HtmlLib import FromHtml > > > > > import urllib > > > > > from xml.dom import ext > > > > > s = urllib.urlopen( 'http://www.google.com' ).read() > > > > > while True: > > > > > root = FromHtml( s ) > > > > > ext.ReleaseNode( root ) > > > > > > > > > > However, this does not ( or only very minor ) > > > > > > > > > > from xml.dom.ext.reader.Sax2 import Reader > > > > > import urllib > > > > > from xml.dom import ext > > > > > s = urllib.urlopen( 'http://www.infoworld.com/rss/reviews.xml' > ).read() > > > > > while True: > > > > > reader = Reader() > > > > > root = reader.fromString( s ) > > > > > ext.ReleaseNode( root ) > > > > > > > > > > Any suggestions? > > > > > > > > Could anybody reproduce the leak? > > > > Any suggestions what I do wrong? > > > > > > I haven't done much work in HtmlLib since it was rewritten to use > > > sgmlop. It will take some heavy digging to find the precise memory > > > leak. What's your overall problem? Could you use Python 2.3's > > > HTMLParser library instead? > > > > The overall problem is that the FromHtml call ( in this example )allocates > some > > 100-200 k per loop that are not freed for the runtime of the process. The > > leak's bigger when no ReleaseNode call is made. > > By "overall problem" I mean what are you actually trying to do/achieve. 
> Since no one has been able to step up to diagnose the memory leak, I'm
> looking to see whether there is another solution that would work for
> you.
>
> > I could of course use other means of extracting information from html, but
> I
> > thought it would not be needed to reinvent the wheel if somebody has
> already
> > written a html parser that spits out dom.
>
> Honestly, I don't think DOM is the way I would personally go about
> processing HTML, which is why I was trying to get at whether there was
> another way for you to meet your needs.
>
> I'm sorry that my workload is so heavy that there is no chance I could
> work on figuring out a 4DOM memory leak right now.
>
> Best of luck.

Thanks.

Hm. The general task that got me started on this is to continually extract
some information from a website. Specifying the location of this information
with XPath is a very nice convenience. Can I use XPath expressions with
other parsing techniques too?

Apart from that, I just think a "dom" is invaluable when there is a need to
process rather complex markup with all its leaves, for example when you
implement a browser of sorts. Dom-view springs to mind. Use it on a few big
websites for a while and the process starts to lag your computer because it
grows into the hundreds of megabytes.

From cbearden at hal-pc.org Wed Aug 25 22:56:39 2004
From: cbearden at hal-pc.org (Chuck Bearden)
Date: Wed Aug 25 22:56:43 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <1093278671.3314.4.camel@borgia> References: <1091095679.4108cc7f0bf70@webmail.codeweld.com> <1092743991.4121f33704f17@webmail.codeweld.com> <1092944720.810.1363.camel@borgia> <1092984767.41259fbf40266@webmail.codeweld.com> <1093278671.3314.4.camel@borgia> Message-ID: <20040825205639.GA5274@hal-pc.org> On Mon, Aug 23, 2004 at 10:31:11AM -0600, Uche Ogbuji wrote: > > Honestly, I don't think DOM is the way I would personally go about > processing HTML, which is why I was trying to get at whether there was > another way for you to meet your needs. I think I understand what you are getting at, but personally I have found twisted.web.microdom with 'beExtremelyLenient=True', with perhaps an mx.Tidying stage beforehand, to be invaluable in mining data from database-generated webpages built with crappy HTML. Consider the pages displaying individual patent records at the USPTO, e.g. [1]. If you need to treat such pages as if they were XML records to be parsed and loaded into a database, something like twisted.web.microdom is a big help. Chuck Bearden [1] http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6295859.WKU.&OS=PN/6295859&RS=PN/6295859 From fredrik at pythonware.com Thu Aug 26 17:01:35 2004 From: fredrik at pythonware.com (Fredrik Lundh) Date: Thu Aug 26 17:01:41 2004 Subject: [XML-SIG] Re: xml.dom.ext.reader.HtmlLib memory leak? References: <1091095679.4108cc7f0bf70@webmail.codeweld.com><1092743991.4121f33704f17@webmail.codeweld.com><1092944720.810.1363.camel@borgia><1092984767.41259fbf40266@webmail.codeweld.com><1093278671.3314.4.camel@borgia> <1093465951.412cf75f9f9b1@webmail.codeweld.com> Message-ID: wrote: > Apart from that, I just think a "dom" is invaluable when there is a need to > process a rather complex markup with all leaves, say for example when you > implement a browser of sorts. Dom-view springs to mind. 
Use it on a few big
> websites for a while and the process starts to lag your computer because it
> grows in the hundreds of megabytes.

Does the leak have any relation to the size of the page you're parsing?

The sgmlop parser in pyxml is a fork of the pythonware/effbot.org version, and
I don't think it supports garbage collection (version 1.1 of the
pythonware/effbot.org version does). This means that code using it *must* make
sure to explicitly kill the parser object when parsing is done.

I don't have PyXML on this machine, but Google found this page:

    http://aspn.activestate.com/ASPN/Mail/Message/xml-checkins/678664

which contains this initialization code:

    def initParser(self, parser):
        self._parser = parser
        self._parser.register(self)
        return

which creates a cycle: self contains a reference to the parser, which contains
references to bound methods, which contain references back to self.

To break the cycle, you must arrange for the code to do e.g.

    self._parser = None

when you're done parsing.

Alternatively, you could probably switch to the effbot.org version of sgmlop:

    http://effbot.org/downloads#sgmlop

(I haven't tested this with PyXML, but it might work. Or not.)

From hostetlerm at gmail.com Thu Aug 26 18:16:50 2004
From: hostetlerm at gmail.com (Mike Hostetler)
Date: Thu Aug 26 18:16:53 2004
Subject: [XML-SIG] ANN: XMLBuilder 1.1
Message-ID:

Thanks to a few comments, I'm introducing XMLBuilder 1.1.

I expected that changing addition to behave the way everyone (including
me) would expect would be harder than it was -- the old behaviour was
mostly a mistake on my part.

Now you can also put in XML by nesting dictionaries. Also, because of
this, you have to use "attr", "attrs", or "attributes" for creating
attributes -- a fair trade-off.
The latest example run: b1 = XMLBuilder() b1.contacts = {"owner":"thehaas@binary.net"} print b1 thehaas@binary.net b2 = XMLBuilder() b2.name = {"person": {"attr": {"type":"friend"},"last":"flintstone", "first":"fred"}} print b2 flintstonefred\ b1.contacts = {"owner":"thehaas@binary.net", "contact":b2} print b1 thehaas@binary.netflintstonefred # adding example b1.contacts = {"owner":"thehaas@binary.net"} print b1+b2 thehaas@binary.netfl\ intstonefred The latest version is here: http://users.binary.net/thehaas/lab/files/xmlbuilder.py Any comments are appreciated! -- Mike Hostetler thehaas@binary.net http://www.binary.net/thehaas From uche.ogbuji at fourthought.com Thu Aug 26 20:35:50 2004 From: uche.ogbuji at fourthought.com (Uche Ogbuji) Date: Thu Aug 26 20:35:53 2004 Subject: [XML-SIG] ANN: XMLBuilder 1.0 In-Reply-To: References: Message-ID: <1093545350.3314.1672.camel@borgia> On Wed, 2004-08-25 at 12:54, Mike Hostetler wrote: > I read a good blog entry about a Builder object in Ruby [1] and I > thought Python needed one. > > Introducing XMLBuilder. It's nothing special, but it works quite > well. You create an XMLBuilder object, send it some dictionary data, > and it will generate the XML for you. My version also allows nesting > another XMLBuilder object inside, as well as adding them together > (though that may not work like you want it to). > > It's easier to show than to describe. Here are some examples: > > >>> from xmlbuilder import XMLBuilder > >>> b2 = XMLBuilder() > >>> b2.name = {"last":"flintstone", 'attr':{"type":"friend"}, "first":"fred"} > >>> print b2 > > flintstonefred > >>> b1.contacts = {"owner":"thehaas@binary.net", > ... 
"contact":b2} > >>> print b1 > > thehaas@binary.netf\ > lintstonefred > >>> b = b1+b2 > >>> print b > > thehaas@binary.netflintstonefred type="friend">flintstonefred So out of curiousity, do people really prefer this sort of thing to the (IMHO more straightforward) foo.createElement() type APIs available in many other Python packages? Side note, folks looking to generate XML may want to glance at http://www.xml.com/pub/a/2002/11/13/py-xml.html http://www.xml.com/pub/a/2003/10/15/py-xml.html http://www.xml.com/pub/a/2003/04/09/py-xml.html http://www.xml.com/pub/a/2003/11/12/py-xml.html I shall give XMLBuilder the customary plug in my next column. Thanks for the effort. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Practical (Python) SAX Notes - http://www.xml.com/pub/a/2004/08/11/py-xml.html XML circles the globe - http://www.javareport.com/article.asp?id=9797 Element structures for names and addresses - http://www.ibm.com/developerworks/xml/library/x-elemdes.html Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090 Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/ From uche.ogbuji at fourthought.com Thu Aug 26 20:38:09 2004 From: uche.ogbuji at fourthought.com (Uche Ogbuji) Date: Thu Aug 26 20:38:26 2004 Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak? 
In-Reply-To: <20040825205639.GA5274@hal-pc.org> References: <1091095679.4108cc7f0bf70@webmail.codeweld.com> <1092743991.4121f33704f17@webmail.codeweld.com> <1092944720.810.1363.camel@borgia> <1092984767.41259fbf40266@webmail.codeweld.com> <1093278671.3314.4.camel@borgia> <20040825205639.GA5274@hal-pc.org> Message-ID: <1093545489.3314.1676.camel@borgia> On Wed, 2004-08-25 at 14:56, Chuck Bearden wrote: > On Mon, Aug 23, 2004 at 10:31:11AM -0600, Uche Ogbuji wrote: > > > > Honestly, I don't think DOM is the way I would personally go about > > processing HTML, which is why I was trying to get at whether there was > > another way for you to meet your needs. > > I think I understand what you are getting at, but personally I have > found twisted.web.microdom with 'beExtremelyLenient=True', with perhaps > an mx.Tidying stage beforehand, to be invaluable in mining data from > database-generated webpages built with crappy HTML. Consider the pages > displaying individual patent records at the USPTO, e.g. [1]. If you > need to treat such pages as if they were XML records to be parsed and > loaded into a database, something like twisted.web.microdom is a big > help. Is this available without installing all of Twisted? -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Practical (Python) SAX Notes - http://www.xml.com/pub/a/2004/08/11/py-xml.html XML circles the globe - http://www.javareport.com/article.asp?id=9797 Element structures for names and addresses - http://www.ibm.com/developerworks/xml/library/x-elemdes.html Commentary on "Objects. Encapsulation. XML?" 
- http://www.adtmag.com/article.asp?id=9090 Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/ From cbearden at hal-pc.org Thu Aug 26 22:00:30 2004 From: cbearden at hal-pc.org (Chuck Bearden) Date: Thu Aug 26 22:00:35 2004 Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak? In-Reply-To: <1093545489.3314.1676.camel@borgia> References: <1091095679.4108cc7f0bf70@webmail.codeweld.com> <1092743991.4121f33704f17@webmail.codeweld.com> <1092944720.810.1363.camel@borgia> <1092984767.41259fbf40266@webmail.codeweld.com> <1093278671.3314.4.camel@borgia> <20040825205639.GA5274@hal-pc.org> <1093545489.3314.1676.camel@borgia> Message-ID: <20040826200030.GA6209@hal-pc.org> On Thu, Aug 26, 2004 at 12:38:09PM -0600, Uche Ogbuji wrote: > On Wed, 2004-08-25 at 14:56, Chuck Bearden wrote: > > On Mon, Aug 23, 2004 at 10:31:11AM -0600, Uche Ogbuji wrote: > > > > > > Honestly, I don't think DOM is the way I would personally go about > > > processing HTML, which is why I was trying to get at whether there was > > > another way for you to meet your needs. > > > > I think I understand what you are getting at, but personally I have > > found twisted.web.microdom with 'beExtremelyLenient=True', with perhaps > > an mx.Tidying stage beforehand, to be invaluable in mining data from > > database-generated webpages built with crappy HTML. Consider the pages > > displaying individual patent records at the USPTO, e.g. [1]. If you > > need to treat such pages as if they were XML records to be parsed and > > loaded into a database, something like twisted.web.microdom is a big > > help. > > Is this available without installing all of Twisted? 
I confess I just took the easy way out and installed all of Twisted (as I've done with 4Suite mostly thus far in order to use the nifty Domlette :-) I haven't browsed through the dependencies to see what of the other Twisted pieces the microdom requires, so I can't say if it is extricable from the wider framework. One possibility I didn't try was to use tidy to generate real XHTML from the crappy HTML. It might then be posssible to use something more common like the minidom implementation to navigate the HTML. For me, extracting data from malformed but consistent HTML is a necessary task, so I do sometimes have to make some compromises in my selection and use of tools. Chuck From walter at livinglogic.de Thu Aug 26 22:24:38 2004 From: walter at livinglogic.de (=?ISO-8859-1?Q?Walter_D=F6rwald?=) Date: Thu Aug 26 22:24:43 2004 Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak? In-Reply-To: <20040826200030.GA6209@hal-pc.org> References: <1091095679.4108cc7f0bf70@webmail.codeweld.com> <1092743991.4121f33704f17@webmail.codeweld.com> <1092944720.810.1363.camel@borgia> <1092984767.41259fbf40266@webmail.codeweld.com> <1093278671.3314.4.camel@borgia> <20040825205639.GA5274@hal-pc.org> <1093545489.3314.1676.camel@borgia> <20040826200030.GA6209@hal-pc.org> Message-ID: <412E4706.9010101@livinglogic.de> Chuck Bearden wrote: > [...] > I haven't browsed through the dependencies to see what of the other > Twisted pieces the microdom requires, so I can't say if it is extricable > from the wider framework. > > One possibility I didn't try was to use tidy to generate real XHTML from > the crappy HTML. It might then be posssible to use something more > common like the minidom implementation to navigate the HTML. > > For me, extracting data from malformed but consistent HTML is a > necessary task, so I do sometimes have to make some compromises > in my selection and use of tools. There are already tools that make sense of broken HTML: browsers. 
Is there any way to reuse that functionality from Python? I.e.
something like:

>>> import mozilla
>>> x = mozilla.parse("http://www.python.org")

I don't care whether I get a DOM or a string parsable by an
XML parser.

Bye,
   Walter Dörwald

From veillard at redhat.com Thu Aug 26 23:19:00 2004
From: veillard at redhat.com (Daniel Veillard)
Date: Thu Aug 26 23:19:19 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <412E4706.9010101@livinglogic.de>
References: <1091095679.4108cc7f0bf70@webmail.codeweld.com> <1092743991.4121f33704f17@webmail.codeweld.com> <1092944720.810.1363.camel@borgia> <1092984767.41259fbf40266@webmail.codeweld.com> <1093278671.3314.4.camel@borgia> <20040825205639.GA5274@hal-pc.org> <1093545489.3314.1676.camel@borgia> <20040826200030.GA6209@hal-pc.org> <412E4706.9010101@livinglogic.de>
Message-ID: <20040826211900.GX16238@redhat.com>

On Thu, Aug 26, 2004 at 10:24:38PM +0200, Walter Dörwald wrote:
> Chuck Bearden wrote:
>
> >[...]
> >I haven't browsed through the dependencies to see what of the other
> >Twisted pieces the microdom requires, so I can't say if it is extricable
> >from the wider framework.
> >
> >One possibility I didn't try was to use tidy to generate real XHTML from
> >the crappy HTML. It might then be posssible to use something more
> >common like the minidom implementation to navigate the HTML.
> >
> >For me, extracting data from malformed but consistent HTML is a
> >necessary task, so I do sometimes have to make some compromises
> >in my selection and use of tools.
>
> There are already tools that make sense of broken HTML: browsers.
>
> Is there any way to reuse that functionality from Python? I.e.
> something like:
>
> >>> import mozilla
> >>> x = mozilla.parse("http://www.python.org")
>
> I don't care whether I get a DOM or a string parsable by an
> XML parser.

libxml2 HTML parser is part of libxml2 Python bindings.
  import libxml2

  doc = libxml2.htmlParseFile(URI, None)

at that point doc is a DOM tree, like you would have if you had
parsed XML, you can use XPath, navigate, extract and reserialize.
You may have got a bunch of errors and warnings, but you will get a
tree even if the HTML is really bizarre.

  ctxt = doc.xpathNewContext()
  try:
      res = ctxt.xpathEval("//head/title")
      title = res[0].content
  except:
      title = "Page %s" % (resource)

is the kind of code I use to index HTML pages and feed an
SQL database for searches on xmlsoft.org. I also do

  #
  # We are not interested in parsing errors here
  #
  def callback(ctx, str):
      return
  libxml2.registerErrorHandler(callback, None)

to ignore all errors and warnings since I run it as cron batches.

Daniel

--
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard@redhat.com  | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

From uche.ogbuji at fourthought.com Fri Aug 27 01:30:21 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Fri Aug 27 01:30:24 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <20040826211900.GX16238@redhat.com>
References: <1091095679.4108cc7f0bf70@webmail.codeweld.com> <1092743991.4121f33704f17@webmail.codeweld.com> <1092944720.810.1363.camel@borgia> <1092984767.41259fbf40266@webmail.codeweld.com> <1093278671.3314.4.camel@borgia> <20040825205639.GA5274@hal-pc.org> <1093545489.3314.1676.camel@borgia> <20040826200030.GA6209@hal-pc.org> <412E4706.9010101@livinglogic.de> <20040826211900.GX16238@redhat.com>
Message-ID: <1093563020.3314.2016.camel@borgia>

On Thu, 2004-08-26 at 15:19, Daniel Veillard wrote:
> On Thu, Aug 26, 2004 at 10:24:38PM +0200, Walter Dörwald wrote:
> > Chuck Bearden wrote:
> >
> > >[...]
> > >I haven't browsed through the dependencies to see what of the other
> > >Twisted pieces the microdom requires, so I can't say if it is extricable
> > >from the wider framework.
> > > > > >One possibility I didn't try was to use tidy to generate real XHTML from > > >the crappy HTML. It might then be posssible to use something more > > >common like the minidom implementation to navigate the HTML. > > > > > >For me, extracting data from malformed but consistent HTML is a > > >necessary task, so I do sometimes have to make some compromises > > >in my selection and use of tools. > > > > There are already tools that make sense of broken HTML: browsers. > > > > Is there any way to reuse that functionality from Python? I.e. > > something like: > > > > >>> import mozilla > > >>> x = mozilla.parse("http://www.python.org") > > > > I don't care whether I get a DOM or a string parsable by an > > XML parser. > > libxml2 HTML parser is part of libxml2 Python bindings. > > import libxml2 > > doc = libxml2.htmlParseFile(URI, None) > > at that point doc is a DOM tree, like you would have if you had > parsed XML, you can use XPath, navigate, extract and reserialize. > You may have got a bunch of errors and warning, but you will get a > tree even if the HTML is really bizarre. > > ctxt = doc.xpathNewContext() > try: > res = ctxt.xpathEval("//head/title") > title = res[0].content > except: > title = "Page %s" % (resource) > > is the kind of code I use to index HTML pages and feed an > SQL database for searches on xmlsoft.org. I also do > > # > # We are not interested in parsing errors here > # > def callback(ctx, str): > return > libxml2.registerErrorHandler(callback, None) > > to ignore all error and warning since I run it as cron batches. Cool, but since memory leaks are the genesis of this thread (see the subject line), are you sure your example above takes all necessary memory management into account? I've had a few surprises using examples from libxml2/Python as is, and finding out that they leaked significantly. It turns out that there are required memory management steps omitted from the docs. 
And more importantly: are you planning to fix it so that manual memory management is unnecessary when using libxml2/Python? I know Martijn Faasen is working on something along those lines in lxml, but his work isn't really ready for "prime time" yet. Thanks. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Meet me at XMLOpen Sept 21-23 2004, Cambridge, UK. http://xmlopen.org Practical (Python) SAX Notes - http://www.xml.com/pub/a/2004/08/11/py-xml.html XML circles the globe - http://www.javareport.com/article.asp?id=9797 Element structures for names and addresses - http://www.ibm.com/developerworks/xml/library/x-elemdes.html Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090 Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/ From hostetlerm at gmail.com Fri Aug 27 03:16:59 2004 From: hostetlerm at gmail.com (Mike Hostetler) Date: Fri Aug 27 03:17:05 2004 Subject: [XML-SIG] ANN: XMLBuilder 1.0 In-Reply-To: <1093545350.3314.1672.camel@borgia> References: <1093545350.3314.1672.camel@borgia> Message-ID: On Thu, 26 Aug 2004 12:35:50 -0600, Uche Ogbuji wrote: > So out of curiousity, do people really prefer this sort of thing to the > (IMHO more straightforward) foo.createElement() type APIs available in > many other Python packages? > Let's not argue what's more straightforward or not -- I don't mind a DOM-type API if I'm parsing XML, but when I'm creating it from scratch, it's kind-of a pain. That said, XMLBuilder hasn't been used in the real-world, though I have a couple of products that I might plug it into and see how it holds up. It was mostly an experiment on my part -- seeing a cool idea in one language and taking that concept into Python. 
> Side note, folks looking to generate XML may want to glance at > > http://www.xml.com/pub/a/2002/11/13/py-xml.html > http://www.xml.com/pub/a/2003/10/15/py-xml.html > http://www.xml.com/pub/a/2003/04/09/py-xml.html > http://www.xml.com/pub/a/2003/11/12/py-xml.html > All good stuff. > I shall give XMLBuilder the customary plug in my next column. Thanks > for the effort. Thanks! -- Mike Hostetler thehaas@binary.net http://www.binary.net/thehaas From uche.ogbuji at fourthought.com Fri Aug 27 07:40:23 2004 From: uche.ogbuji at fourthought.com (Uche Ogbuji) Date: Fri Aug 27 07:40:27 2004 Subject: [XML-SIG] ANN: Scimitar 0.6.0 Message-ID: <1093585223.3314.2414.camel@borgia> http://uche.ogbuji.net/tech/4Suite/scimitar Scimitar is an implementation of ISO Schematron that compiles a Schematron schema into a Python validator script, making it a faster and somewhat more flexible approach than the usual XSLT implementations. http://www.ascc.net/xml/resource/schematron/schematron.html Schematron is an XML schema language in which you express a set of rules that the document must meet, rather than expressing a full grammar for the XML vocabulary (which is the more common approach to XML schemata). It is by far the most flexible XML schema language available. Scimitar supports all of Schematron except for abstract patterns. See the TODO file for gaps in Scimitar functionality and convenience, which are being worked on. Scimitar is open source, provided under the 4Suite variant of the Apache license. The compiler program runs standalone on Python 2.2 or more recent, although if you are using an earlier version than 2,3, you must also install Optik 1.4.1 or more recent. In addition to the above requirements the generated validators require 4Suite 1.0a3 or more recent (really only tested with latest 4Suite CVS). -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Meet me at XMLOpen Sept 21-23 2004, Cambridge, UK. 
http://xmlopen.org
Practical (Python) SAX Notes - http://www.xml.com/pub/a/2004/08/11/py-xml.html
XML circles the globe - http://www.javareport.com/article.asp?id=9797
Element structures for names and addresses - http://www.ibm.com/developerworks/xml/library/x-elemdes.html
Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090
Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/

From veillard at redhat.com Fri Aug 27 09:03:53 2004
From: veillard at redhat.com (Daniel Veillard)
Date: Fri Aug 27 09:04:06 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <1093563020.3314.2016.camel@borgia>
References: <1092743991.4121f33704f17@webmail.codeweld.com> <1092944720.810.1363.camel@borgia> <1092984767.41259fbf40266@webmail.codeweld.com> <1093278671.3314.4.camel@borgia> <20040825205639.GA5274@hal-pc.org> <1093545489.3314.1676.camel@borgia> <20040826200030.GA6209@hal-pc.org> <412E4706.9010101@livinglogic.de> <20040826211900.GX16238@redhat.com> <1093563020.3314.2016.camel@borgia>
Message-ID: <20040827070353.GZ16238@redhat.com>

On Thu, Aug 26, 2004 at 05:30:21PM -0600, Uche Ogbuji wrote:
> On Thu, 2004-08-26 at 15:19, Daniel Veillard wrote:
> > > I don't care whether I get a DOM or a string parsable by an
> > > XML parser.
> >
> > libxml2 HTML parser is part of libxml2 Python bindings.
> >
> > import libxml2
> >
> > doc = libxml2.htmlParseFile(URI, None)
> >
> > at that point doc is a DOM tree, like you would have if you had
> > parsed XML, you can use XPath, navigate, extract and reserialize.
> > You may have got a bunch of errors and warning, but you will get a
> > tree even if the HTML is really bizarre.
> >
> > ctxt = doc.xpathNewContext()
> > try:
> >     res = ctxt.xpathEval("//head/title")
> >     title = res[0].content
> > except:
> >     title = "Page %s" % (resource)
> >
> > is the kind of code I use to index HTML pages and feed an
> > SQL database for searches on xmlsoft.org. I also do
> >
> > #
> > # We are not interested in parsing errors here
> > #
> > def callback(ctx, str):
> >     return
> > libxml2.registerErrorHandler(callback, None)
> >
> > to ignore all error and warning since I run it as cron batches.
>
> Cool, but since memory leaks are the genesis of this thread (see the
> subject line), are you sure your example above takes all necessary
> memory management into account?

In libxml2, memory management is at the document level. Once done with
a document, free it with doc.freeDoc(). All the examples in the
libxml2-python package do so; they also do

    import libxml2

    # Memory debug specific
    libxml2.debugMemory(1)

at startup and

    # Memory debug specific
    libxml2.cleanupParser()
    if libxml2.debugMemory(1) == 0:
        print "OK"
    else:
        print "Memory leak %d bytes" % (libxml2.debugMemory(1))
        libxml2.dumpMemory()

at the end, to 1/ show that the example does not leak and 2/ show how to
debug leaks.

> I've had a few surprises using examples from libxml2/Python as is, and
> finding out that they leaked significantly. It turns out that there are
> required memory management steps omitted from the docs.

Usually this just means calling doc.freeDoc() when you are done with the
document. We take documentation patches. The fact that allocation is
done at the document level, and that all documents need to be freed,
either at the C or Python level, has been written on the list, in the
docs and in the examples over and over again. Are you subscribed to the
mailing-list ?

> And more importantly: are you planning to fix it so that manual memory
> management is unnecessary when using libxml2/Python? I know Martijn

Me ? No.
Doing reference counting over a document each time you expose a node,
through an XPath query result for example, is just the best way to
*have* memory leaks. I trust a clear general principle ("allocation is
done at the document level", so you have to keep track of the lifetime
of your document) far more than relying on ref counts kept for every
possible interface accessing a document, each of which may or may not
keep a link to one of its structures.

> Faasen is working on something along those lines in lxml, but his work
> isn't really ready for "prime time" yet.

That requires a lot of work on top of libxml2 itself. My goal is to
provide Python APIs for the library, not to transmute the library calls
into something they aren't. The library does not refcount, so my Python
binding won't refcount (at least for the C internal objects); the
library uses UTF-8 for all document content, so my Python binding will
also use UTF-8 for all document content. If Martijn wants to write a
layer on top, fine by me, but he will also have to maintain it.

Daniel

--
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard@redhat.com  | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

From uche.ogbuji at fourthought.com Fri Aug 27 16:05:24 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Fri Aug 27 16:05:28 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <20040827070353.GZ16238@redhat.com>
References: <1092743991.4121f33704f17@webmail.codeweld.com> <1092944720.810.1363.camel@borgia> <1092984767.41259fbf40266@webmail.codeweld.com> <1093278671.3314.4.camel@borgia> <20040825205639.GA5274@hal-pc.org> <1093545489.3314.1676.camel@borgia> <20040826200030.GA6209@hal-pc.org> <412E4706.9010101@livinglogic.de> <20040826211900.GX16238@redhat.com> <1093563020.3314.2016.camel@borgia> <20040827070353.GZ16238@redhat.com>
Message-ID: <1093615524.3314.2942.camel@borgia>

It's a very unPythonic binding that requires manual ref counting and
memory management. That's why this need has surprised me and others.

As to sending doc patches and joining more mailing lists, that's not
likely to happen. I have my own large Python/C/XML library to maintain,
and scarcely enough time for that. I do cover others' libraries in my
Python/XML column for XML.com, though, which is where, for example, I
ran into the problems I hint at with libxml2. I simply report to my
readers what I encounter wearing a user's hat. I put a lot of work into
reading existing docs, searching archives and general googling. If I
can't figure out how to effectively use a library that way, I say so.

But I'm not interested right now in a debate on the merits and demerits
of libxml2's Python binding. I just wanted to be sure that people were
aware of the need for memory management in addition to the code you
posted here (since I've been bitten myself). I think you've covered the
subject adequately.

Thanks.

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Meet me at XMLOpen Sept 21-23 2004, Cambridge, UK. http://xmlopen.org
Practical (Python) SAX Notes - http://www.xml.com/pub/a/2004/08/11/py-xml.html
XML circles the globe - http://www.javareport.com/article.asp?id=9797
Element structures for names and addresses - http://www.ibm.com/developerworks/xml/library/x-elemdes.html
Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090
Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/

From walter at livinglogic.de Fri Aug 27 19:52:16 2004
From: walter at livinglogic.de (=?ISO-8859-1?Q?Walter_D=F6rwald?=)
Date: Fri Aug 27 19:52:30 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <20040826211900.GX16238@redhat.com>
References: <1091095679.4108cc7f0bf70@webmail.codeweld.com> <1092743991.4121f33704f17@webmail.codeweld.com> <1092944720.810.1363.camel@borgia> <1092984767.41259fbf40266@webmail.codeweld.com> <1093278671.3314.4.camel@borgia> <20040825205639.GA5274@hal-pc.org> <1093545489.3314.1676.camel@borgia> <20040826200030.GA6209@hal-pc.org> <412E4706.9010101@livinglogic.de> <20040826211900.GX16238@redhat.com>
Message-ID: <412F74D0.5010904@livinglogic.de>

Daniel Veillard wrote:

> On Thu, Aug 26, 2004 at 10:24:38PM +0200, Walter Dörwald wrote:
>
>> [...]
>>There are already tools that make sense of broken HTML: browsers.
>>
>>Is there any way to reuse that functionality from Python? I.e.
>>something like:
>>
>>
>>>>>import mozilla
>>>>>x = mozilla.parse("http://www.python.org")
>>
>>I don't care whether I get a DOM or a string parsable by an
>>XML parser.
>
> libxml2 HTML parser is part of libxml2 Python bindings.
>
> import libxml2
>
> doc = libxml2.htmlParseFile(URI, None)

This looks great. When I dump the DOM again, the resulting files look
much better than those generated by HTMLParser from the standard
library or my own HTML parser.

BTW, I wonder why libxml2 complains about the following:

>>> doc = libxml2.htmlParseFile("http://www.python.org", None)
http://www.python.org:3: HTML parser error : htmlParseStartTag: invalid element name

I think the next version of XIST will use libxml2 instead of uTidyLib
for parsing HTML.
Bye,
   Walter Dörwald

From ken.beesley at xrce.xerox.com Sat Aug 28 14:44:21 2004
From: ken.beesley at xrce.xerox.com (Ken Beesley)
Date: Sat Aug 28 14:44:26 2004
Subject: [XML-SIG] pulldom with XML 1.1 problem
Message-ID: <41307E25.2000009@xrce.xerox.com>

Newbie problem: pulldom with XML 1.1

The Question:
How can I make pulldom parse according to XML 1.1 conventions?
Or: Is there an upgrade of pulldom that handles XML 1.1?
Or: Is there some other XML 1.1 parsing solution in Python?

Background: I'm running
Python 2.3.3 (#1, Feb 17 2004, 11:48:35)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-5)] on linux2

Illustration of my problem:

I start with the following simple xml file, call it test.xml

first line of text second line of text third line of text abc

and the following Relax NG schema (compact syntax), call it test.rng

grammar {
    start = element foo { element bar {text}+ }
}

Validation of test.xml succeeds using the Jing validating parser:

java -jar jing.jar -c test.rng test.xml

So far so good.

******

Now for XML 1.0 vs. XML 1.1 ...

In XML 1.0, all characters below x20 are invalid as characters in an
XML file except for x9, xA and xD.
So if I change test.xml to the following (call it test1.0.xml), adding
a control-character reference

first line of text second line of text third line of text abc

then Jing rightly complains that the file is not XML 1.0 valid, because
of the illegal character. However, it _is_ valid in XML 1.1, so the
following file (call it test1.1.xml)

first line of text second line of text third line of text abc

is (correctly) accepted by Jing as valid XML 1.1.

************************

Problem: pulldom handles test.xml (which lacks the offending character)
but chokes on both test1.0.xml (which contains an invalid character
reference) and test1.1.xml (which contains a valid one). It should fail
for test1.0.xml and succeed for test1.1.xml (just like Jing does).

Here's a little test script (call it test.py) using pulldom to print the
text in each element:

#!/usr/bin/env python

import sys
from xml.dom import pulldom

infile = sys.argv[1]

events = pulldom.parse(infile)

def getText(nodelist):
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc += node.data
    return rc

for (event, node) in events:
    if event == pulldom.START_ELEMENT and node.tagName == "bar":
        events.expandNode(node)
        print getText(node.childNodes)

# end of script

Invoking from the command line

test.py test.xml

succeeds and outputs

first line of text second line of text third line of text abc

But invoking

test.py test1.0.xml

or

test.py test1.1.xml

fails and gives the following traceback:

Traceback (most recent call last):
  File "test.py", line 17, in ?
    for (event, node) in events:
  File "/opt/STools/lib/python2.3/site-packages/_xmlplus/dom/pulldom.py", line 232, in next
    rc = self.getEvent()
  File "/opt/STools/lib/python2.3/site-packages/_xmlplus/dom/pulldom.py", line 265, in getEvent
    self.parser.feed(buf)
  File "/opt/STools/lib/python2.3/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "/opt/STools/lib/python2.3/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: :7:31: reference to invalid character number

# end of Traceback

Again, this behavior, raising an exception for "invalid character
number", is appropriate for the XML 1.0 file but not for the XML 1.1
file.

******************

I have an application that needs XML 1.1, including such control
characters. How can I parse such files in Python (preferably with
pulldom, but I'm open to all suggestions)?

Thanks,

Ken

From dave at allen-williams.com Sat Aug 28 20:35:34 2004
From: dave at allen-williams.com (Dave Allen-Williams)
Date: Sat Aug 28 20:32:56 2004
Subject: [XML-SIG] XSLT stylesheet for XBEL
Message-ID:

Hi,

I noticed that your XBEL page http://pyxml.sourceforge.net/topics/xbel/
has the following link:

    Joris Graaumans (joris@cs.uu.nl) has developed a couple of XSLT
    stylesheets for XBEL

which appears to be out of date.

In case you might be interested in updating your page to include a
current XSLT stylesheet for XBEL, I've also developed one which uses
DHTML to navigate folders (tested with IE).
http://www.allen-williams.com/dave/links.xml shows
http://www.allen-williams.com/dave/links.xslt in use.

Cheers,
Dave.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/xml-sig/attachments/20040828/83ca1739/attachment.html

From abra9823 at mail.usyd.edu.au Tue Aug 31 03:19:14 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Tue Aug 31 03:19:23 2004
Subject: [XML-SIG] xpath error
Message-ID: <1093915154.4133d21263cc6@www-mail.usyd.edu.au>

hi!

I parsed an XML document using minidom and then executed the following
statement:

dataNodes = xpath.Evaluate(".//*[local-name()='DATA']", document.documentElement)

This gives an error:

Traceback (most recent call last):
  File "", line 1, in ?
  File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\__init__.py", line 70, in Evaluate
    retval = parser.new().parse(expr).evaluate(con)
  File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\ParsedAbbreviatedRelativeLocationPath.py", line 52, in evaluate
    res = Set.Union(res,subRt)
  File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\Set.py", line 25, in Union
    return compare + filter(lambda x,compare = compare:x not in compare,loop)
TypeError: can only concatenate list (not "tuple") to list

Any idea why?

thanks
cheers

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.

From abra9823 at mail.usyd.edu.au Tue Aug 31 05:56:51 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Tue Aug 31 05:56:55 2004
Subject: [XML-SIG] fast xml processing
Message-ID: <1093924611.4133f7037b4c9@www-mail.usyd.edu.au>

hi!

I am looking for tools that allow fast processing of XML documents. I
will only be using DOM and xpath, so a lightweight package would be
nice.

From what I have read so far, 4Suite appears to be quite fast, but it
requires a license. Are there any other fast packages? I am not overly
impressed by the speed of PyXML.

Since I will be using the package on a PDA, it would be nice if you
could also tell me how I can go about porting some of the underlying C
code to a Pocket PC. I have got the SDK, emulator etc. and will be using
Microsoft embedded Visual C++. Would it just involve recompiling the C
code in the new environment and copying it over?

thanks
cheers

From fredrik at pythonware.com Tue Aug 31 08:01:17 2004
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue Aug 31 07:59:36 2004
Subject: [XML-SIG] Re: fast xml processing
References: <1093924611.4133f7037b4c9@www-mail.usyd.edu.au>
Message-ID:

Ajay wrote:

> I am looking for tools that allow fast processing of XML documents. i will
> only be using DOM and xpath so a lightweight package would be nice.
>
> from what i have read so far 4Suite appears to be quite fast, but it
> requires a license.

From what I can tell, it *has* a license, which you are supposed to
read and adhere to:

http://4suite.org/COPYRIGHT.doc

Same applies to all other software libraries, of course. Very few
libraries are in the public domain.

As for other lightweight tools, people have already pointed you to
alternatives to PyDOM. It's always a good idea to read followups to
your posts.

From abra9823 at mail.usyd.edu.au Tue Aug 31 11:22:34 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Tue Aug 31 11:22:40 2004
Subject: [XML-SIG] xml parser
Message-ID: <1093944154.4134435a9c3f0@www-mail.usyd.edu.au>

hi!

Is there a pure Python XML parser - one that doesn't use any C code?
I am willing to sacrifice speed. The Python CE release I am using does
not include pyexpat and I am not having much luck in porting code to
it.

cheers
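On the "xpath error" traceback earlier in this thread: the failing line in Set.py is `compare + filter(...)`, and in Python 2 `filter()` returns a tuple when it is handed a tuple, so the `+` ends up mixing a list with a tuple. Presumably `loop` was a tuple at that point. What follows is only a minimal sketch of that failure mode and one way around it, not a patch to PyXML; the function name `union` and the sample values are made up for illustration.

```python
def union(left, right):
    """Order-preserving union that accepts lists or tuples for either side."""
    left = list(left)  # normalise first so the concatenation never mixes types
    # Same shape as the line in the traceback: keep items of `right`
    # that are not already present in `left`.
    extra = [item for item in right if item not in left]
    return left + extra


# A list combined with a tuple no longer raises TypeError.
print(union(["a", "b"], ("b", "c")))
# Two empty node-sets of different types combine as well.
print(union([], ()))
```

Converting both operands to one sequence type before concatenating is the whole point; whether PyXML should return lists or tuples for node-sets is a separate question this sketch does not answer.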
From Alexandre.Fayolle at logilab.fr Tue Aug 31 11:57:45 2004
From: Alexandre.Fayolle at logilab.fr (Alexandre)
Date: Tue Aug 31 11:57:48 2004
Subject: [XML-SIG] xml parser
In-Reply-To: <1093944154.4134435a9c3f0@www-mail.usyd.edu.au>
References: <1093944154.4134435a9c3f0@www-mail.usyd.edu.au>
Message-ID: <20040831095745.GJ3093@crater.logilab.fr>

On Tue, Aug 31, 2004 at 07:22:34PM +1000, Ajay wrote:
> hi!
>
> Is there a pure Python XML parser - one that doesn't use any C code?
> i am willing to sacrifice speed.

xmlproc in pyxml is such a parser.

--
Alexandre Fayolle LOGILAB, Paris (France).
http://www.logilab.com http://www.logilab.fr http://www.logilab.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://mail.python.org/pipermail/xml-sig/attachments/20040831/9ebcdd2d/attachment.pgp

From abra9823 at mail.usyd.edu.au Tue Aug 31 16:31:10 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Tue Aug 31 16:31:16 2004
Subject: [XML-SIG] xpath
Message-ID: <1093962670.41348baedc0e7@www-mail.usyd.edu.au>

hi!

Is there a Python implementation of xpath that doesn't use any C code
and is purely in Python? Is there one as a standalone package?

thanks
cheers

From brian at sweetapp.com Tue Aug 31 18:49:50 2004
From: brian at sweetapp.com (Brian Quinlan)
Date: Tue Aug 31 18:45:30 2004
Subject: [XML-SIG] Removing insignificant whitespace
In-Reply-To:
References: <1093924611.4133f7037b4c9@www-mail.usyd.edu.au>
Message-ID: <4134AC2E.2060404@sweetapp.com>

I'm trying to remove the whitespace-only text nodes in my XML DOM. I've
tried two approaches:

1. StripXml - generates an exception:

  File "mac.py", line 25, in __init__
    StripXml(self.document)
  File "/usr/lib/python2.3/site-packages/_xmlplus/dom/ext/__init__.py", line 153, in StripXml
    snit = owner_doc.createNodeIterator(startNode, NodeFilter.SHOW_TEXT,
AttributeError: Document instance has no attribute 'createNodeIterator'

2. setFeature('whitespace_in_element_content', False) seems to do
   nothing

My code is here:

from xml import xpath, dom
from xml.dom.ext import StripXml
from xml.dom.xmlbuilder import DOMInputSource, DOMBuilder
from optparse import OptionParser
from pprint import pprint
import os

b = DOMBuilder()
b.setFeature('whitespace_in_element_content', False)
self.document = b.parse(...)
StripXml(self.document)

My XML does not include a DTD or any declarations regarding whitespace.
Can anyone offer any advice?

Cheers,
Brian

From brian at sweetapp.com Tue Aug 31 18:53:49 2004
From: brian at sweetapp.com (Brian Quinlan)
Date: Tue Aug 31 18:49:22 2004
Subject: [XML-SIG] PyXML XPath limitation
In-Reply-To:
References: <1093924611.4133f7037b4c9@www-mail.usyd.edu.au>
Message-ID: <4134AD1D.1030905@sweetapp.com>

In the unlikely event that this isn't a known problem, or in the more
likely event that I am doing something wrong, the following code
generates an exception for me:

nodes = xpath.Evaluate(
    '//dict[key=%r]/key' % key, self.document)

Traceback (most recent call last):
  File "mac.py", line 87, in ?
    pprint(info[options.field])
  File "mac.py", line 69, in __getitem__
    nodes = xpath.Evaluate(
  File "/usr/lib/python2.3/site-packages/_xmlplus/xpath/__init__.py", line 70, in Evaluate
    retval = parser.new().parse(expr).evaluate(con)
  File "/usr/lib/python2.3/site-packages/_xmlplus/xpath/ParsedAbbreviatedAbsoluteLocationPath.py", line 44, in evaluate
    sub_rt.extend(self._rel.select(context))
  File "/usr/lib/python2.3/site-packages/_xmlplus/xpath/ParsedRelativeLocationPath.py", line 23, in evaluate
    raise Exception("Expected node set from relative expression. Got %s"%str(rt))
Exception: Expected node set from relative expression. Got ()

Cheers,
Brian

From tpassin at comcast.net Tue Aug 31 22:33:51 2004
From: tpassin at comcast.net (Thomas B. Passin)
Date: Tue Aug 31 22:31:30 2004
Subject: [XML-SIG] Removing insignificant whitespace
In-Reply-To: <4134AC2E.2060404@sweetapp.com>
References: <1093924611.4133f7037b4c9@www-mail.usyd.edu.au> <4134AC2E.2060404@sweetapp.com>
Message-ID: <4134E0AF.5040209@comcast.net>

Brian Quinlan wrote:
> I'm trying to remove the whitespace-only text nodes in my XML DOM. I've
> tried two approaches:
>
> 1. StripXml - generates an exception:
>
> File "mac.py", line 25, in __init__
> StripXml(self.document)
> File "/usr/lib/python2.3/site-packages/_xmlplus/dom/ext/__init__.py",
> line 153, in StripXml
> snit = owner_doc.createNodeIterator(startNode, NodeFilter.SHOW_TEXT,
> AttributeError: Document instance has no attribute 'createNodeIterator'
>
> 2. setFeature('whitespace_in_element_content', False) seems to do
> nothing
>
> My code is here:
>
> from xml import xpath, dom
> from xml.dom.ext import StripXml
> from xml.dom.xmlbuilder import DOMInputSource, DOMBuilder
> from optparse import OptionParser
> from pprint import pprint
> import os
>
> b = DOMBuilder()
> b.setFeature('whitespace_in_element_content', False)
> self.document = b.parse(...)
> StripXml(self.document)
>
> My XML does not include a DTD or any declarations regarding whitespace.
> Can anyone offer any advice?

What's wrong with normalize()?

Cheers,

Tom P

--
Thomas B. Passin
Explorer's Guide to the Semantic Web (Manning Books)
http://www.manning.com/catalog/view.php?book=passin

From fdrake at acm.org Tue Aug 31 23:57:54 2004
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue Aug 31 23:58:06 2004
Subject: [XML-SIG] Removing insignificant whitespace
In-Reply-To: <4134E0AF.5040209@comcast.net>
References: <1093924611.4133f7037b4c9@www-mail.usyd.edu.au> <4134AC2E.2060404@sweetapp.com> <4134E0AF.5040209@comcast.net>
Message-ID: <200408311757.54733.fdrake@acm.org>

On Tuesday 31 August 2004 04:33 pm, Thomas B. Passin wrote:
> What's wrong with normalize()?

What does normalize do about whitespace in content? If anything, that's
a bug. normalize() only deals with how adjacent nodes containing
character data are combined.

-Fred

--
Fred L. Drake, Jr.
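
Since normalize() only merges adjacent text nodes, whitespace-only nodes have to be removed by walking the tree. The following is a minimal sketch of that idea using the standard library's xml.dom.minidom rather than the DOMBuilder/StripXml code from the thread. The helper name `strip_whitespace_nodes` and the sample document are made up for illustration, and the sketch assumes that every whitespace-only text node really is insignificant, which is a judgment call when no DTD is present.

```python
from xml.dom import Node, minidom


def strip_whitespace_nodes(node):
    """Recursively remove text nodes that contain only whitespace."""
    # Iterate over a copy, because removeChild() mutates childNodes.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        elif child.hasChildNodes():
            strip_whitespace_nodes(child)


# Hypothetical sample input, loosely modelled on the dict/key query above.
doc = minidom.parseString(
    "<dict>\n  <key>name</key>\n  <string> a b </string>\n</dict>"
)
strip_whitespace_nodes(doc)
print(doc.documentElement.toxml())
```

Text that contains any non-whitespace character, such as " a b " above, is left untouched, including its leading and trailing spaces. CDATA sections have their own node type, so they are not affected either.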