From noreply@sourceforge.net Mon Sep 2 03:54:11 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Sun, 01 Sep 2002 19:54:11 -0700
Subject: [XML-SIG] [ pyxml-Bugs-603322 ] The SAX (1) driver does not report &amp;
Message-ID:

Bugs item #603322, was opened at 2002-09-01 19:54
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=106473&aid=603322&group_id=6473

Category: SAX
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Nobody/Anonymous (nobody)
Summary: The SAX (1) driver does not report &amp;

Initial Comment:
Trying to make ns_parse from XBEL work, I found that the SAX driver for sgmlop is not reporting entity and character references (specifically &amp; and &#42;, which are the ones I tried). The driver for sgmllib does report them. I am using pyxml 0.7 with python 2.1.3 on Win2000.

Here is a small file that demonstrates the behavior:

from xml.sax import saxexts,saxlib
from StringIO import StringIO

class test_handler(saxlib.HandlerBase):
    def __init__(self):
        pass
    def startElement(self,name,attrs):
        print 'Start Element: %s ' % name
    def characters(self,data,start,length):
        print '--->'
        print data[start:start+length]
        print '<---'
    def endElement(self,name):
        print 'End Element: ',name

html='''

first&amp;second

'''
thefile=StringIO(html)

if __name__ == '__main__':
    print '============'
    h=test_handler()
    p=saxexts.SGMLParserFactory.make_parser()
    p.setDocumentHandler(h)
    p.parseFile( thefile )

----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=106473&aid=603322&group_id=6473

From noreply@sourceforge.net Mon Sep 2 04:14:00 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Sun, 01 Sep 2002 20:14:00 -0700
Subject: [XML-SIG] [ pyxml-Bugs-603325 ] ns_parse creates extra nested folders
Message-ID:

Bugs item #603325, was opened at 2002-09-01 20:14
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=106473&aid=603325&group_id=6473

Category: XBEL
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Nobody/Anonymous (nobody)
Summary: ns_parse creates extra nested folders

Initial Comment:
ns_parse creates undesired additional nested folders when a folder title contains &amp; (or any other character or entity reference). The problem is that characters() does not take into account that the character data may arrive in several chunks, and creates a new folder for each chunk. This problem is best seen using the sgmllib driver, because the current sgmlop driver (the usual default) does not report the entity references, although it does create multiple chunks when they occur.

Here is a short NS bookmarks file on which ns_parse demonstrates the problem. You will see that a folder that should be titled "B&B" becomes three nested folders, titled "B", "&", and "B" - Bookmarks

Bookmarks

Travel

B&amp;B

Peters Creek Inn The Bed and Breakfast of Distinction
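
The chunking described in this report is something SAX parsers are allowed to do, so a handler generally has to buffer text until the element closes. A minimal sketch of that pattern, assuming the Python 3 standard-library xml.sax API rather than the PyXML SAX 1 drivers or the real ns_parse code; the class name FolderTitles and the folder/title sample input are made up for illustration:

```python
import xml.sax
from xml.sax.handler import ContentHandler


class FolderTitles(ContentHandler):
    """Collect one title per <title> element, however the text is chunked."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._chunks = None  # None means "not inside a <title>"

    def startElement(self, name, attrs):
        if name == "title":
            self._chunks = []

    def characters(self, content):
        # May be called several times for one run of text, for example
        # once on each side of an entity reference.
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "title":
            # Only now is the title complete: join the buffered pieces.
            self.titles.append("".join(self._chunks))
            self._chunks = None


handler = FolderTitles()
xml.sax.parseString(b"<folder><title>B&amp;B</title></folder>", handler)
print(handler.titles)
```

The point of the design is that the folder is created in endElement(), not in characters(), so the number of chunks the driver happens to produce no longer matters.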

---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=106473&aid=603325&group_id=6473 From martin@v.loewis.de Mon Sep 2 05:28:11 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 02 Sep 2002 06:28:11 +0200 Subject: [XML-SIG] Element.localName, Attr.localName In-Reply-To: <15713.15269.98947.380831@grendel.zope.com> References: <200208191829.g7JITi5c069807@chilled.skew.org> <15713.15269.98947.380831@grendel.zope.com> Message-ID: "Fred L. Drake, Jr." writes: > Ah, but here I disagree. minidom should support namespace-unaware > processing, primarily because it is *the* DOM that is shipped as part > of the Python standard library, and most simple applications of XML > are namespace unaware (which is more reasonable than expecting them to > become namespace aware). I consider this a substantial requirement. Yes, and one that Python meets quite well. Just restrict yourself to DOM L1 functions, and voila, you have namespace-unaware processing. I don't think you've demonstrated that the L1 functions are misbehaving. Regards, Martin From martin@v.loewis.de Mon Sep 2 05:31:15 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 02 Sep 2002 06:31:15 +0200 Subject: [XML-SIG] Re: Memory leak in xmlrpclib.py on Windows? In-Reply-To: <3D64F90C.2CB57C84@fluent.com> References: <3D616815.5ACBBA5@fluent.com> <3D64F90C.2CB57C84@fluent.com> Message-ID: Mark Moales writes: > The problem appears to be in httplib.HTTPConnection. I replaced > xmlrpclib.Transport with my own socket-based Transport and the memory > leak goes away. So, I took the HTTPConnection sample out of the Python > doc and stuck a loop around it. Sure enough, I see a 4K increase every > 10 seconds or so over the life of the process. I'm uncertain about the conclusion of this message. Do you mean to say that this solves your problem, or that there still is a problem that you want to solve, or get solved? 
Regards, Martin From martin@v.loewis.de Mon Sep 2 05:37:33 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 02 Sep 2002 06:37:33 +0200 Subject: [XML-SIG] Problem Installing PyXML In-Reply-To: <1030491345.3d6c0cd15559f@webmail.got.net> References: <1030486210.3d6bf8c29261f@webmail.got.net> <7CDD7B94357FD5119E800002A537C46E1C1AB6@s5-ccr-r1.ccrs.nrcan.gc.ca> <20020827223040.GB1809@swordfish.havenrock.com> <15724.2550.803900.706414@grendel.zope.com> <1030491345.3d6c0cd15559f@webmail.got.net> Message-ID: landauer@got.net writes: > "The only requirements for installing the package are Python > 2.0 or later, and a C compiler. Note that the Python must > actually be an INSTALLed python, rather than one that is being > used directly from Python's build area. This release has been > tested with Python 2.x" Thanks, added. > perhaps with a more precise rendering of "2.x" at the end there... How more precise do you want that? It is usually tested for any possible value of x. Regards, Martin From martin@v.loewis.de Mon Sep 2 05:41:28 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 02 Sep 2002 06:41:28 +0200 Subject: [XML-SIG] minidom with xmlproc (pure-Python DOM) In-Reply-To: <20020830124234.39248.qmail@web13307.mail.yahoo.com> References: <20020830124234.39248.qmail@web13307.mail.yahoo.com> Message-ID: Pat Notz writes: > Does anyone have an example of using the xmlproc parser with minidom? Untested, but... parser = xml.sax.make_parser("xml.sax.drivers2.drv_xmlproc") # alternatively: # parser = xml.sax.sax2exts.XMLValParserFactory.make_parser() doc = xml.dom.minidom.parse(resource, parser = parser) HTH, Martin From martin@v.loewis.de Mon Sep 2 05:45:09 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 02 Sep 2002 06:45:09 +0200 Subject: [XML-SIG] SAX: distinguising empty and non-empty elements? 
In-Reply-To: References: <20020821100418.5433eddb.bud@sistema.it> Message-ID: Lars Marius Garshol writes: > | Is there a possibility to distinguish between non-empty elements w/o > | content and empty elements in (Python) Sax? > > No. The distinction is considered to be a purely lexical distinction > with no more importance than the difference between E; and e;. Of course, if you process the DTD, you can tell the difference, right? So an application that wants to make the distinction can, no? Regards, Martin From mal@lemburg.com Mon Sep 2 09:11:05 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 02 Sep 2002 10:11:05 +0200 Subject: [XML-SIG] Re: Memory leak in xmlrpclib.py on Windows? References: <3D616815.5ACBBA5@fluent.com> <3D64F90C.2CB57C84@fluent.com> Message-ID: <3D731D19.5020609@lemburg.com> Martin v. Loewis wrote: > Mark Moales writes: > > >>The problem appears to be in httplib.HTTPConnection. I replaced >>xmlrpclib.Transport with my own socket-based Transport and the memory >>leak goes away. So, I took the HTTPConnection sample out of the Python >>doc and stuck a loop around it. Sure enough, I see a 4K increase every >>10 seconds or so over the life of the process. > > I'm uncertain about the conclusion of this message. Do you mean to say > that this solves your problem, or that there still is a problem that > you want to solve, or get solved? I think Mark is saying that there seems to be a problem in the httplib (rather than xmlrpclib). Mark, please post a SourceForge bug report about this including your sample code. Thanks, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From Juergen Hermann" Message-ID: On 02 Sep 2002 06:45:09 +0200, Martin v. 
Loewis wrote:

>Of course, if you process the DTD, you can tell the difference, right?
>So an application that wants to make the distinction can, no?

You can detect whether an element has an EMPTY content model, which makes it a little easier to _generate_ the short form.

That still makes and equivalent and undetectable on the standard processing levels.

Ciao, Jürgen

--
Jürgen Hermann, Developer
WEB.DE AG, http://webde-ag.de/

From martin@v.loewis.de Mon Sep 2 21:37:20 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 02 Sep 2002 22:37:20 +0200
Subject: [XML-SIG] SAX: distinguising empty and non-empty elements?
In-Reply-To:
References:
Message-ID:

"Juergen Hermann" writes:

> >Of course, if you process the DTD, you can tell the difference, right?
> >So an application that wants to make the distinction can, no?
>
> You can detect whether an element has an EMPTY content model, which
> makes it a little easier to _generate_ the short form.
>
> That still makes and equivalent and undetectable on the
> standard processing levels.

Ah, I didn't understand that this was about that distinction: I thought that the OP had and needed to know whether this was EMPTY, or just had no content (which can be answered by looking at the DTD). I agree that the distinction between and is purely lexical, and not part of the "true" content.

Regards, Martin

From mmoales@fluent.com Tue Sep 3 13:59:22 2002
From: mmoales@fluent.com (Mark Moales)
Date: Tue, 03 Sep 2002 08:59:22 -0400
Subject: [XML-SIG] Re: Memory leak in xmlrpclib.py on Windows?
References: <3D616815.5ACBBA5@fluent.com> <3D64F90C.2CB57C84@fluent.com> <3D731D19.5020609@lemburg.com>
Message-ID: <3D74B22A.A34AC85F@fluent.com>

The problem still exists and it does appear to be in httplib, not xmlrpclib. I posted a bug on SourceForge (598797) on 8/22.

Thanks, Mark

"M.-A. Lemburg" wrote:
>
> Martin v. Loewis wrote:
> > Mark Moales writes:
> >
> >
> >>The problem appears to be in httplib.HTTPConnection. I replaced
> >>xmlrpclib.Transport with my own socket-based Transport and the memory
> >>leak goes away. So, I took the HTTPConnection sample out of the Python
> >>doc and stuck a loop around it. Sure enough, I see a 4K increase every
> >>10 seconds or so over the life of the process.
> >
> > I'm uncertain about the conclusion of this message. Do you mean to say
> > that this solves your problem, or that there still is a problem that
> > you want to solve, or get solved?
>
> I think Mark is saying that there seems to be a problem in the
> httplib (rather than xmlrpclib).
>
> Mark, please post a SourceForge bug report about this including
> your sample code.
>
> Thanks,
> --
> Marc-Andre Lemburg
> CEO eGenix.com Software GmbH
> _______________________________________________________________________
> eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
> Python Consulting: http://www.egenix.com/
> Python Software: http://www.egenix.com/files/python/

From Matthews@heyanita.com Tue Sep 3 17:25:00 2002
From: Matthews@heyanita.com (Matthew Shomphe)
Date: Tue, 3 Sep 2002 09:25:00 -0700
Subject: [XML-SIG] Parser not preserving DTD?
Message-ID: <8C50918F08A109479D62BAE0F1AB95465F85F3@lionking.HANA>

I've noticed that two different methods of parsing an XML grammar have both yielded outputs with DTDs different from the input DTD. For example, given the input:

test

And the following code:

#! "D:\Python22\python.exe"
import sys
from xml.dom.ext.reader import Sax2
from xml.dom.ext import PrettyPrint, Print

if __name__ == "__main__":
    usage = "Usage: " + sys.argv[0] + " [output_XML]\nDefault output to STDOUT\n"
    try:
        sInFile = open(sys.argv[1], "r")
    except IndexError:
        sys.stderr.write(usage)
        sys.exit()
    try:
        sOutFile = open(sys.argv[2], "w")
    except IndexError:
        sOutFile = sys.stdout
    reader = Sax2.Reader()
    doc = reader.fromStream(sInFile)
    PrettyPrint(doc, sOutFile)

The following will be output:

test

Is this a bug? If not, how can I preserve DTDs when reading in and manipulating a document?

Thanks in advance,
Matt Shomphe
--------------
Matt Shomphe
MatthewS@HeyAnita.com

From martin@v.loewis.de Tue Sep 3 20:13:50 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 03 Sep 2002 21:13:50 +0200
Subject: [XML-SIG] Parser not preserving DTD?
In-Reply-To: <8C50918F08A109479D62BAE0F1AB95465F85F3@lionking.HANA>
References: <8C50918F08A109479D62BAE0F1AB95465F85F3@lionking.HANA>
Message-ID:

"Matthew Shomphe" writes:

> Is this a bug? If not, how can I preserve DTDs when reading in and
> manipulating a document?

It's not clear whether this is a bug. There is no requirement in the DOM spec, or elsewhere, that the DOCTYPE must be roundtrip - yet it is an obvious user requirement. So this could be a bug, but fixing it is very difficult - in particular since the lower-level parsers fail to provide the information.

If you really need this, I encourage you to investigate and contribute a fix.
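
As a starting point for such an investigation: the DOCTYPE name and identifiers are reported through the optional SAX2 lexical handler, where the driver supports it. A minimal sketch, assuming the Python 3 standard-library expat driver rather than the PyXML readers discussed in this thread; DoctypeRecorder and the sample document are hypothetical, and the internal subset itself is not captured here:

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, property_lexical_handler


class DoctypeRecorder:
    """Lexical handler that remembers what the DOCTYPE declaration said."""

    def __init__(self):
        self.doctypes = []

    def startDTD(self, name, public_id, system_id):
        self.doctypes.append((name, public_id, system_id))

    # The remaining lexical events are not needed here, but the driver
    # looks these methods up, so they have to exist.
    def endDTD(self):
        pass

    def comment(self, text):
        pass

    def startCDATA(self):
        pass

    def endCDATA(self):
        pass


SOURCE = b"""<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
]>
<note>test</note>
"""

recorder = DoctypeRecorder()
parser = xml.sax.make_parser()
parser.setContentHandler(ContentHandler())
parser.setProperty(property_lexical_handler, recorder)
parser.parse(io.BytesIO(SOURCE))

doctype_names = [name for name, _public_id, _system_id in recorder.doctypes]
print(doctype_names)
```

A DOM builder that wanted to round-trip the declaration would still have to store these values on the document and write them back out when serializing; the sketch only shows that the information is reachable at the SAX level.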
Regards, Martin

From xlinv05@vse.cz Wed Sep 4 09:54:48 2002
From: xlinv05@vse.cz (Vaclav Lin)
Date: Wed, 4 Sep 2002 10:54:48 +0200 (CEST)
Subject: [XML-SIG] Python/XML
Message-ID:

Hello,

I have just read the tutorial "Python/XML HOWTO" by A.M. Kuchling which is available at http://pyxml.sourceforge.net/topics/howto/xml-howto.html. I'd like to know, is it possible to download the *whole* document from somewhere?

Thank you very much
best regards
Vaclav Lin

From olc@ninti.com Wed Sep 4 13:30:33 2002
From: olc@ninti.com (Michael Hall)
Date: Wed, 4 Sep 2002 22:00:33 +0930 (CST)
Subject: [XML-SIG] general xml queries?
Message-ID:

I'm new to this list and haven't yet figured out if it is a place to seek answers to general XML questions (mine relate specifically to XSLT stylesheets) or not. Can someone clear this up, and maybe recommend a good list for general XML stuff if this isn't the place?

Also, being new to PyXML, I'm not up to speed with where it is heading. Is there any work being done to integrate XSL-FO support? Is Apache's Java-based FOP the only game in town at the moment for Linux?

TIA
Mick

From tpassin@comcast.net Wed Sep 4 14:00:20 2002
From: tpassin@comcast.net (Thomas B. Passin)
Date: Wed, 04 Sep 2002 09:00:20 -0400
Subject: [XML-SIG] general xml queries?
References:
Message-ID: <000601c25413$06460880$fe193044@tbp1>

[Michael Hall]
>
> I'm new to this list and haven't yet figured out if it is a place to seek
> answers to general XML questions (mine relate specifically to XSLT
> stylesheets) or not. Can someone clear this up, and maybe recommend a good
> list for general XML stuff if this isn't the place?
>

Welcome, glad to have you with us. No, this is not a general list.
For xslt questions, THE list is the Mulberry xslt list:

http://www.mulberrytech.com/xsl/xsl-list/

There are also some good FAQ sites - especially see Dave Pawson's and Jeni Tennison's sites:

http://www.dpawson.co.uk/xsl/xslfaq.html
http://www.jenitennison.com/xslt/

For xml questions there are a number of xml lists, including xml-dev, though that is not really a general-purpose list.

> Also, being new to PyXML, I'm not up to speed with where it is heading. Is
> there any work being done to integrate XSL-FO support? Is Apache's
> Java-based FOP the only game in town at the moment for Linux?
>

Pretty much, if it has to be low-cost. If you can pay $5000, there are a few commercial products. I believe I just saw a FO plugin that licenses RenderX (I think it was) for under $150, so using FO may be becoming easier and more affordable (sorry, I do not have the reference).

Cheers,
Tom P

From martin@v.loewis.de Wed Sep 4 21:13:59 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 04 Sep 2002 22:13:59 +0200
Subject: [XML-SIG] Python/XML
In-Reply-To:
References:
Message-ID:

Vaclav Lin writes:

> I have just read the tutorial "Python/XML HOWTO" by A.M. Kuchling which is
> available at http://pyxml.sourceforge.net/topics/howto/xml-howto.html. I'd
> like to know, is it possible to download the *whole* document from
> somewhere?

Sure: Just download the PyXML sources; it's in the doc/ directory. You can also download individual files through SF viewcvs.

Regards, Martin

From Matthews@heyanita.com Thu Sep 5 00:42:29 2002
From: Matthews@heyanita.com (Matthew Shomphe)
Date: Wed, 4 Sep 2002 16:42:29 -0700
Subject: [XML-SIG] Parser not preserving DTD?
Message-ID: <8C50918F08A109479D62BAE0F1AB9546392F5B@lionking.HANA>

I've done a few tests to see where the issue in getting mangled DTDs is coming from. I can't report much success beyond the following:

1. The problem is not with pyexpat or Expat. I was able to run some tests and the full DTD is passed to pyexpat.
I added the following code to test_pyexpat.py:

    def StartDoctypeDeclHandler(self, *args):
        doctypeName, systemId, publicId, has_internal_subset = args
        print 'DTD declared:', args

The full DTD was printed to stdout

2. The SAX implementation does not natively support declarations. From their website (http://www.saxproject.org/?selected=faq):

----
Does SAX support comments/CDATA sections/DOCTYPE declarations, etc.?

Not in the core API. These kinds of things are pure lexical details, and are not relevant to most kinds of XML processing, so it doesn't make sense to put them in the core and force all implementors to support them. However, SAX2 is designed to be extensible, and the LexicalHandler interface is supported by most SAX parsers. SAX2 parsers are not required to support this handler, but they are required to report an error if you try to use handlers they don't support.
----

& unparsed entities are supported.

3. The above-mentioned LexicalHandler does seem to support DTDs, but I have no idea how to implement this.

In short, there is some place along the processing route where data are being lost. I'm not well-versed in the APIs for this set of applications, so I'm a bit dazed trying to track down the methods and attributes needed to get the DTD passed all the way through. It seems to be an issue with SAX2, which has an extension, but it's just not been implemented yet.

Is there any other type of reader out there that will not truncate DTDs & returns a full DOM?

Thanks,
Matt

From martin@v.loewis.de Thu Sep 5 08:34:30 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 05 Sep 2002 09:34:30 +0200
Subject: [XML-SIG] Parser not preserving DTD?
In-Reply-To: <8C50918F08A109479D62BAE0F1AB9546392F5B@lionking.HANA>
References: <8C50918F08A109479D62BAE0F1AB9546392F5B@lionking.HANA>
Message-ID:

"Matthew Shomphe" writes:

> In short, there is some place along the processing route where data
> are being lost.
I'm not well-versed in the APIs for this set of > applications, so I'm a bit dazed trying to track down the methods > and attributes needed to get the DTD passed all the way through. I think this is the situation that pretty much everybody else is in: A number of people on this list could answer your questions - but only after studying the relevant specifications, and the code of PyXML; I think nobody really oversees all details of the entire processing chain in her memory. Regards, Martin From noreply@sourceforge.net Thu Sep 5 11:02:07 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Thu, 05 Sep 2002 03:02:07 -0700 Subject: [XML-SIG] [ pyxml-Bugs-604973 ] memory leak in xml.parsers.sgmllib Message-ID: Bugs item #604973, was opened at 2002-09-05 12:02 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=106473&aid=604973&group_id=6473 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Jean-Claude Rimbault (jcrimbault) Assigned to: Nobody/Anonymous (nobody) Summary: memory leak in xml.parsers.sgmllib Initial Comment: The following short test program is leaking memory: from xml.parsers.sgmllib import SGMLParser while 1: p = SGMLParser() There is a cross reference cycle between sgmllib.FastSGMLParser and sgmlop.SGMLParser which is not detected by the garbage collector. Breaking the cycle with the following workaround stops the memory leakage: while 1: p = SGMLParser() p.feed = None p.parser = None (environment: PyXML 0.8, Python 2.2.1, Linux 2.4.3) ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=106473&aid=604973&group_id=6473 From uche.ogbuji@fourthought.com Thu Sep 5 16:29:06 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Thu, 05 Sep 2002 09:29:06 -0600 Subject: [XML-SIG] ANN: SLiP and SLIDE - a quick XML shorthand syntax and tool for editing In-Reply-To: Message from "Bud P. 
Bruegger" of "Mon, 19 Aug 2002 19:28:19 +0200." <20020819192819.630dce5b.bud@sistema.it> Message-ID: > hello, > > (A little late,) I have noted the announcement of SLiP and the followup > discussion on XML shorthand on this list. Have you guys followed up on > the > topic and are working on a joint specification/implementation? I would > be > interested to join in. In the following I describe some ideas. [SNIP] Your post sounds interesting, and apparently a lot of work has gone into your ideas. Some brief examples would be helpful as I'm trying to get a sense of your ideas quickly. I must note that I have recently started just using straight Wiki text in XML content, and this works very well for me: I have an XSLT extension element for 4Suite that takes such text and emits HTML to the processor output. I plan to post this soon to the Python Cookbook. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra ry/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A 1EA5A2CF4621C386256BBB006F4CEC From BudP.Bruegger Thu Sep 5 17:09:00 2002 From: BudP.Bruegger (BudP.Bruegger) Date: Thu, 5 Sep 2002 18:09:00 +0200 Subject: [XML-SIG] ANN: SLiP and SLIDE - a quick XML shorthand syntax and tool for editing In-Reply-To: References: <20020819192819.630dce5b.bud@sistema.it> Message-ID: <20020905180900.5e1da308.bud@sistema.it> Hi Uche: Thanks for the interest. Give me some time to work out some points better in my mind and I'm also implementing s.th. that is close to releasing as first minimal approach. That way I can actually produce the examples with running code. 
I'm currently contemplating that the requirements are quite different for "document style" applications with mixed content where whitespace (leading and trailing etc.) matters and "data-style" applications where whitespace is irrelevant. I'm thinking of two modes in the shorthand to take care of this. In data mode, one can then use a lot of indenting for clarity--in document mode one has full control over whitespace...

Also, Karl's feedback made me realize (apart from an ugly formatting problem that I have hopefully fixed now) that the main idea of my approach is that apart from a base XML syntax, a shorthand needs to be extensible to accommodate custom syntaxes (syntices??), for example for lists and tables, where the use of tagging is rather cumbersome compared to Wiki style or structured text. Other examples would be to include CVS files, customization files, or e-mail messages that have well defined, parsable formats...

Will come up with some more soon.

best cheers
--bud

PS. Have you looked at reStructuredText (http://docutils.sourceforge.net/#restructuredtext) that seems a good generalized Wiki-style text formatting approach. They seem to create some xml-like output from the current parser...

--b

On Thu, 05 Sep 2002 09:29:06 -0600 Uche Ogbuji wrote:

> Your post sounds interesting, and apparently a lot of work has gone into your
> ideas. Some brief examples would be helpful as I'm trying to get a sense of
> your ideas quickly.
>
> I must note that I have recently started just using straight Wiki text in XML
> content, and this works very well for me: I have an XSLT extension element for
> 4Suite that takes such text and emits HTML to the processor output. I plan to
> post this soon to the Python Cookbook.
>

From fdrake@acm.org Thu Sep 5 19:04:19 2002
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Thu, 5 Sep 2002 14:04:19 -0400 Subject: [XML-SIG] Roadmap for the development of Expat Message-ID: <15735.40099.481729.43202@grendel.zope.com> The Expat team has published a proposed roadmap that describes our intended directions for future development of the parser. The roadmap is available on the Expat website at: http://www.libexpat.org/dev/roadmap.html We welcome comments on the proposal; please send feedback on the roadmap to expat-discuss mailing list: http://mail.libexpat.org/mailman-21/listinfo/ Please do not "Reply to All" to this message to avoid further cross posting. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From fdrake@acm.org Thu Sep 5 03:11:04 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Wed, 4 Sep 2002 22:11:04 -0400 Subject: [XML-SIG] Expat 1.95.5 forthcoming Message-ID: <15734.48440.589529.242180@grendel.zope.com> I'm planning to release Expat 1.95.5 on Friday. This release fixes some bugs that have been exposed by the Zope tests as well as some segfault bugs exposed elsewhere. Some minor API enhancements have been made as well. Once the Expat release is out, I'm going to plan on checking the new version into PyXML, and probably expose some additional information in pyexpat. I'd like to see a new release of PyXML to follow sometime next week if there are no objections. This certainly seems like the easiest way for Zope developers to upgrade to the latest Expat version, and will probably be so for others as well. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From martin@v.loewis.de Thu Sep 5 23:52:23 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 06 Sep 2002 00:52:23 +0200 Subject: [XML-SIG] Expat 1.95.5 forthcoming In-Reply-To: <15734.48440.589529.242180@grendel.zope.com> References: <15734.48440.589529.242180@grendel.zope.com> Message-ID: "Fred L. Drake, Jr." writes: > I'd like to see a new release of PyXML to follow sometime next week if > there are no objections. 
> This certainly seems like the easiest way
> for Zope developers to upgrade to the latest Expat version, and will
> probably be so for others as well.

That won't be a problem.

Regards, Martin

From noreply@sourceforge.net Fri Sep 6 00:11:47 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Thu, 05 Sep 2002 16:11:47 -0700
Subject: [XML-SIG] [ pyxml-Bugs-605323 ] setup.py build failure on MacOS X
Message-ID:

Bugs item #605323, was opened at 2002-09-05 16:11
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=106473&aid=605323&group_id=6473

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Nobody/Anonymous (nobody)
Summary: setup.py build failure on MacOS X

Initial Comment:
On MacOS X 10.1.5, Python 2.2, PyXML 0.8. Build apparently fails for some users with

[localhost:~/PyXML-0.8] % python setup.py build
running build
running build_py
creating build
creating build/lib.darwin-5.5-Power Macintosh-2.2
creating build/lib.darwin-5.5-Power Macintosh-2.2/_xmlplus
copying xml/__init__.py -> build/lib.darwin-5.5-Power Macintosh-2.2/_xmlplus
[ tons of stuff deleted]
cc -g -O3 -Wall -Wstrict-prototypes -no-cpp-precomp -DHAVE_EXPAT_H -DXML_BYTE_ORDER=21 -Iextensions/expat/lib -I/usr/local/include/python2.2 -c extensions/expat/lib/xmlrole.c -o build/temp.darwin-5.5-Power Macintosh-2.2/xmlrole.o
cc -g -O3 -Wall -Wstrict-prototypes -no-cpp-precomp -DHAVE_EXPAT_H -DXML_BYTE_ORDER=21 -Iextensions/expat/lib -I/usr/local/include/python2.2 -c extensions/expat/lib/xmltok.c -o build/temp.darwin-5.5-Power Macintosh-2.2/xmltok.o
cc -bundle -flat_namespace -undefined suppress build/temp.darwin-5.5-Power Macintosh-2.2/pyexpat.o build/temp.darwin-5.5-Power Macintosh-2.2/xmlparse.o build/temp.darwin-5.5-Power Macintosh-2.2/xmlrole.o build/temp.darwin-5.5-Power Macintosh-2.2/xmltok.o -o build/lib.darwin-5.5-Power Macintosh-2.2/_xmlplus/parsers/pyexpat.so -flat_namespace
/usr/bin/ld: -undefined: unknown argument: -lbundle1.o
error: command 'cc' failed with exit status 1

Doesn't seem to happen to everyone, but apparently due to -flat_namespace appearing twice in last cc. Apparently fixed when I alter the line in setup.py

if sys.platform[:6] == "darwin": # Mac OS X
    LDFLAGS.append('-flat_namespace')

to not append '-flat_namespace'. Might need to check if this is already in LDFLAGS?

----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=106473&aid=605323&group_id=6473

From elloyd@lancaster.lib.pa.us Fri Sep 6 17:36:12 2002
From: elloyd@lancaster.lib.pa.us (Eron Lloyd)
Date: 06 Sep 2002 12:36:12 -0400
Subject: [XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python?
In-Reply-To: <20020906.013922.15248657.glyph@twistedmatrix.com>
References: <20020902.040745.78701595.glyph@twistedmatrix.com> <1031262790.1111.3.camel@phobos> <20020906.013922.15248657.glyph@twistedmatrix.com>
Message-ID: <1031330172.1094.9.camel@phobos>

Hmm, I know that minidom has had some problems recently, but it has also seen some good improvements. It sounds like you need more robust DOM support--have you tried 4DOM? It's not as fast, but it does adhere to the spec the best. Maybe (when you have time) if you let us know what you expect to accomplish we can help out--the people in XML-SIG are some of the sharpest in the community. Perhaps TREX or RELAX-NG would be more suitable. I guess the only comforting thing I can say is that every development community is experiencing growing pains when it comes to an XML strategy.

Good luck,
Eron

On Fri, 2002-09-06 at 02:39, Glyph Lefkowitz wrote:
>
> On 05 Sep 2002 17:53:09 -0400, Eron Lloyd wrote:
> > Are you referring to PyXML? I know xml.* in the Standard Library is
> > pretty weak by far (but getting better!).
>
> Yes. In fact, PyXML is a big part of the problem.
> Its "minidom" module, for
> example, is *far* buggier than the one found in the standard library. (As an
> example of that, try to figure out how to make cloneNode work on a Document
> object.)
>
> I could deal with one set of potential problems and pitfalls using XML in
> Python and work around them, but I have to work around every combination of
> versions to make a useful app that doesn't have very stringent installation
> requirements: in practice this means 4 environments: python2.1 with pyxml,
> python2.1 standalone, python2.2 with pyxml, python2.2 standalone.
>
> I don't want a plethora of XML parsers with rich features, all of which are
> broken. I want *one* XML parser that can *reliably* transform a stream of
> bytes into a stream of nodes, and a text file into a tree of nodes. You
> mentioned validation in your post and I explicitly said that validation is
> worse than useless to me; in most cases I want to parse XHTML, which means
> dealing with lots of potentially DTD-violating stuff which is still "valid" as
> far as I'm concerned.
>
> Eventually I'll clean up the problem cases I'm having and submit them as bug
> reports, but right now it's not worth my time, because I really don't want to
> deal with the fragility of the PyXML or python-standard-library xml.* stuff.
>
> --
> | <`'> | Glyph Lefkowitz: Traveling Sorcerer |
> | < _/ > | Lead Developer, the Twisted project |
> | < ___/ > | http://www.twistedmatrix.com |

--
Eron Lloyd
Technology Coordinator
Lancaster County Library
elloyd@lancaster.lib.pa.us
Phone: 717-239-2116
Fax: 717-394-3083

From Matt Gushee Fri Sep 6 18:41:33 2002
From: Matt Gushee (Matt Gushee)
Date: Fri, 6 Sep 2002 11:41:33 -0600
Subject: [mgushee@havenrock.com: Re: [XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python?]
Message-ID: <20020906174133.GD2999@swordfish>

Oops, accidentally sent this to Eron instead of the list. Here it is again.
----- Forwarded message from Matt Gushee ----- Date: Fri, 6 Sep 2002 11:12:45 -0600 From: Matt Gushee To: Eron Lloyd Subject: Re: [XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python? Reply-To: Matt Gushee Since this thread was apparently imported from another list, I'm missing some of the context, but here goes ... On Fri, Sep 06, 2002 at 12:36:12PM -0400, Eron Lloyd wrote: > Hmm, I know that minidom has had some problems recently, but it has also > seen some good improvements. It sounds like you need more robust DOM > support--have you tried 4DOM? It's not as fast, That's an understatement. > but it does adhere to > the spec the best. Maybe (when you have time) if you let us know what > you expect to accomplish we can help out--the people in XML-SIG are some > of the sharpest in the community. Perhaps TREX or RELAX-NG would be more > suitable. I don't follow that at all. First of all, he says he doesn't want validation. But even if the greater flexibility of RELAX NG made validation useful to him, RELAX NG hasn't been implemented in Python. As for TREX, it has been merged into RELAX NG, so it is de facto, if not formally, deprecated. So you want him to implement RELAX NG in Python, *and* rewrite the XHTML schema in RELAX NG? I don't think so. Unfortunately I don't have much good news to contribute. 4DOM might work better, but you should be aware that it is essentially a legacy product. Fourthought, Inc., which created it, is no longer developing it, because its performance was horrible and there was simply not a huge demand for a full DOM implementation. In fact, I worked for Fourthought for a year and never once touched 4DOM. cDomlette, also from Fourthought, is the fastest Python DOM parser (because it's a C extension), provides the most commonly needed features, and will continue to be maintained for the foreseeable future. 
Unfortunately, it's not quite ready for production use, but depending on your timeline you might want to give it a try (it's part of 4Suite, available at http://4Suite.org).

--
Matt Gushee
Englewood, Colorado, USA
mgushee@havenrock.com
http://www.havenrock.com/

----- End forwarded message -----

--
Matt Gushee
Englewood, Colorado, USA
mgushee@havenrock.com
http://www.havenrock.com/

From Matt Gushee Fri Sep 6 19:59:54 2002
From: Matt Gushee (Matt Gushee)
Date: Fri, 6 Sep 2002 12:59:54 -0600
Subject: [XML-SIG] PyXML documentation (was [mgushee@havenrock.com: Re: [FRPythoneers] xml woes])
Message-ID: <20020906185953.GG2999@swordfish>

> ----- Forwarded message from Matt Gushee -----
>
> On Fri, Sep 06, 2002 at 10:23:51AM -0600, Karl Putland wrote:
>
> > Problem is, that I can't get the tag to print out.
>
> Karl, instead of toprettyxml(), try this (you'll need PyXML installed):
>
> from xml.dom.ext.Printer import PrintWalker, PrintVisitor
>
> f = open('foo.xml','w')
> v = PrintVisitor(f, 'iso8859-1', indent=' ')
> w = PrintWalker(v, document)
> w.run()
> f.close()
>
> Works for me.

You know, it occurs to me that this isn't documented *anywhere* (at least I couldn't find any docs last time I forgot how to do it). Shouldn't it be? What's the proper way to submit an addition to the PyXML documentation?
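For readers without PyXML installed, here is a rough standard-library counterpart to the snippet quoted above. It is only a sketch, assuming Python 3 and xml.dom.minidom; it does not use the xml.dom.ext.Printer classes, and write_pretty is a hypothetical helper name. Document.writexml() emits the XML declaration (labelled with the given encoding) followed by the indented tree:

```python
import io
from xml.dom.minidom import parseString


def write_pretty(document, stream, encoding="iso-8859-1"):
    """Write `document` to a text stream with two-space indentation.

    Note: `encoding` only sets the label in the XML declaration; the
    stream itself is responsible for encoding the characters it receives.
    """
    document.writexml(stream, addindent="  ", newl="\n", encoding=encoding)


# Illustrative input; any minidom Document would do.
doc = parseString("<root><child>text</child></root>")
buffer = io.StringIO()
write_pretty(doc, buffer)
output = buffer.getvalue()
```

To write to disk instead, pass a file opened in text mode with a matching encoding, for example open('foo.xml', 'w', encoding='iso-8859-1').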
--
Matt Gushee
Englewood, Colorado, USA
mgushee@havenrock.com
http://www.havenrock.com/

From Mike.Olson@fourthought.com Fri Sep 6 20:14:11 2002
From: Mike.Olson@fourthought.com (Mike Olson)
Date: 06 Sep 2002 13:14:11 -0600
Subject: [XML-SIG] PyXML documentation (was [mgushee@havenrock.com: Re: [FRPythoneers] xml woes])
In-Reply-To: <20020906185953.GG2999@swordfish>
References: <20020906185953.GG2999@swordfish>
Message-ID: <1031339653.26343.29.camel@penny>

> >
> > from xml.dom.ext.Printer import PrintWalker, PrintVisitor
> >
> > f = open('foo.xml','w')
> > v = PrintVisitor(f, 'iso8859-1', indent=' ')
> > w = PrintWalker(v, document)
> > w.run()
> > f.close()
> >
> > Works for me.

As a note, you don't need to do that much work. Just

xml.dom.ext.Print(node)

will get it to print to a string, or

xml.dom.ext.Print(node,open('foo.xml','w'))

will print to the file. If you want Pretty, use the same call signatures with xml.dom.ext.PrettyPrint.

> You know, it occurs to me that this isn't documented *anywhere* (at
> least I couldn't find any docs last time I forgot how to do it).
> Shouldn't it be? What's the proper way to submit an addition to the
> PyXML documentation?

But now you can always reference the FrPythoneers mailing list, we just need to tell the rest of the world :)

Actually, contact Fred Drake, as he is in charge of the PyXML docs.

Mike

>
> --
> Matt Gushee
> Englewood, Colorado, USA
> mgushee@havenrock.com
> http://www.havenrock.com/
>
> _______________________________________________
> XML-SIG maillist - XML-SIG@python.org
> http://mail.python.org/mailman/listinfo/xml-sig

--
Mike Olson Principal Consultant
mike.olson@fourthought.com +1 303 583 9900 x 102
Fourthought, Inc. http://Fourthought.com
PO Box 270590, http://4Suite.org
Louisville, CO 80027-5009, USA
XML strategy, XML tools, knowledge management

From fdrake@acm.org Fri Sep 6 20:39:51 2002
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Fri, 6 Sep 2002 15:39:51 -0400 Subject: [XML-SIG] PyXML documentation (was [mgushee@havenrock.com: Re: [FRPythoneers] xml woes]) In-Reply-To: <1031339653.26343.29.camel@penny> References: <20020906185953.GG2999@swordfish> <1031339653.26343.29.camel@penny> Message-ID: <15737.1159.609009.339799@grendel.zope.com> Mike Olson writes: > Actully, contact Fred Drake as he is incharge of the PyXML docs. Actually, I don't think I am. ;-) I'm certainly glad to help out as time allows. The best thing to do when documentation is missing is (doing as many of these as possible), but at least the first item): - File a bug report, telling exactly what you were looking for - Explain (as part of the bug report, or a followup comment) what the documentation should say about the topic (what would have answered your question) - Write any required new material for the documentation, in Python-style LaTeX or plain text, and attach it to the bug report. If you provide a patch or additional material for the documentation, feel free to assign it to me. If new material still needs to be written, it'll get done on an as-time-and-knowledge-allow basis. In any case, it's unlikely to ever get done if there isn't a bug report. An email to the list may be handy to get an answer to a question, but typically won't get documentation written; it's too easy to lose the request in an over-full inbox. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From Mike.Olson@fourthought.com Fri Sep 6 20:48:06 2002 From: Mike.Olson@fourthought.com (Mike Olson) Date: 06 Sep 2002 13:48:06 -0600 Subject: [XML-SIG] PyXML documentation (was [mgushee@havenrock.com: Re: [FRPythoneers] xml woes]) In-Reply-To: <15737.1159.609009.339799@grendel.zope.com> References: <20020906185953.GG2999@swordfish> <1031339653.26343.29.camel@penny> <15737.1159.609009.339799@grendel.zope.com> Message-ID: <1031341692.3138.41.camel@penny> On Fri, 2002-09-06 at 13:39, Fred L. Drake, Jr. 
wrote: > > Mike Olson writes: > > Actully, contact Fred Drake as he is incharge of the PyXML docs. > > Actually, I don't think I am. ;-) I'm certainly glad to help out as > time allows. The best thing to do when documentation is missing is > (doing as many of these as possible), but at least the first item): My bad, thought you were. Mike > > - File a bug report, telling exactly what you were looking for > - Explain (as part of the bug report, or a followup comment) what the > documentation should say about the topic (what would have answered > your question) > - Write any required new material for the documentation, in > Python-style LaTeX or plain text, and attach it to the bug report. > > If you provide a patch or additional material for the documentation, > feel free to assign it to me. If new material still needs to be > written, it'll get done on an as-time-and-knowledge-allow basis. In > any case, it's unlikely to ever get done if there isn't a bug report. > An email to the list may be handy to get an answer to a question, but > typically won't get documentation written; it's too easy to lose the > request in an over-full inbox. > > > -Fred > > -- > Fred L. Drake, Jr. > PythonLabs at Zope Corporation -- Mike Olson Principal Consultant mike.olson@fourthought.com +1 303 583 9900 x 102 Fourthought, Inc. http://Fourthought.com PO Box 270590, http://4Suite.org Louisville, CO 80027-5009, USA XML strategy, XML tools, knowledge management From Matt Gushee Fri Sep 6 20:53:50 2002 From: Matt Gushee (Matt Gushee) Date: Fri, 6 Sep 2002 13:53:50 -0600 Subject: [XML-SIG] PyXML documentation (was [mgushee@havenrock.com: Re: [FRPythoneers] xml woes]) In-Reply-To: <15737.1159.609009.339799@grendel.zope.com> References: <20020906185953.GG2999@swordfish> <1031339653.26343.29.camel@penny> <15737.1159.609009.339799@grendel.zope.com> Message-ID: <20020906195350.GI2999@swordfish> On Fri, Sep 06, 2002 at 03:39:51PM -0400, Fred L. Drake, Jr. 
wrote: > > Mike Olson writes: > > Actully, contact Fred Drake as he is incharge of the PyXML docs. > > Actually, I don't think I am. ;-) I'm certainly glad to help out as > time allows. The best thing to do when documentation is missing is > (doing as many of these as possible), but at least the first item): > > - File a bug report, telling exactly what you were looking for > - Explain (as part of the bug report, or a followup comment) what the > documentation should say about the topic (what would have answered > your question) > - Write any required new material for the documentation, in > Python-style LaTeX or plain text, and attach it to the bug report. Okay, I'll do that. By 'Python-style LaTeX,' I presume you mean the format documented in the 'Documenting Python' document? -- Matt Gushee Englewood, Colorado, USA mgushee@havenrock.com http://www.havenrock.com/ From fdrake@acm.org Fri Sep 6 21:04:40 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Fri, 6 Sep 2002 16:04:40 -0400 Subject: [XML-SIG] PyXML documentation (was [mgushee@havenrock.com: Re: [FRPythoneers] xml woes]) In-Reply-To: <1031341692.3138.41.camel@penny> References: <20020906185953.GG2999@swordfish> <1031339653.26343.29.camel@penny> <15737.1159.609009.339799@grendel.zope.com> <1031341692.3138.41.camel@penny> Message-ID: <15737.2648.197597.710271@grendel.zope.com> Mike Olson writes: > My bad, thought you were. Not a problem; I'm certainly glad to work on it when I have time. I guess doing the standard Python documentation is kind of an ever-expanding task. ;-) -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From fdrake@acm.org Fri Sep 6 21:05:41 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) 
Date: Fri, 6 Sep 2002 16:05:41 -0400
Subject: [XML-SIG] PyXML documentation (was [mgushee@havenrock.com: Re: [FRPythoneers] xml woes])
In-Reply-To: <20020906195350.GI2999@swordfish>
References: <20020906185953.GG2999@swordfish> <1031339653.26343.29.camel@penny> <15737.1159.609009.339799@grendel.zope.com> <20020906195350.GI2999@swordfish>
Message-ID: <15737.2709.928720.110300@grendel.zope.com>

Matt Gushee writes:
> Okay, I'll do that. By 'Python-style LaTeX,' I presume you mean the
> format documented in the 'Documenting Python' document?

Yes, that's right. We use the same tools for the PyXML documentation and the Python/XML How-to.

-Fred

--
Fred L. Drake, Jr.
PythonLabs at Zope Corporation

From shane@irosoft.com Fri Sep 6 20:59:16 2002
From: shane@irosoft.com (Daniel Shane)
Date: Fri, 06 Sep 2002 15:59:16 -0400
Subject: [XML-SIG] Reading blocks smaller than 16384 in xmlproc?
Message-ID: <5.1.0.14.0.20020906155303.0264a3d8@mail.irosoft.com>

Hi,

I would like to change the value of bufsize in xmlproc so that it reads the input file in smaller chunks. This is because I use xmlproc to generate data that is passed to another xmlproc process, which in turn tells the first process how to send more data, and so on.

Unfortunately, this system can deadlock: the first process waits for the second process to signal something, but the second process is blocked on the read because process 1 has sent fewer than 16384 characters.

I tried to change this value to something very small like 1 (!) or 10 (by changing bufsize in 3 places in the code) but ended up with a validating parser that no longer functions correctly.

Does anyone know if it would be easy to modify xmlproc so that its input chunks could be as small as 1 character? How small a bufsize can be used while still having a functioning parser?
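For comparison, a minimal sketch of chunk-at-a-time parsing, assuming Python 3 and the standard library's expat-based SAX reader rather than xmlproc (xmlproc's bufsize handling is not shown here, and this reader does not validate). The incremental feed()/close() interface accepts input in pieces as small as one byte; ElementLogger and parse_in_chunks are hypothetical names used only for illustration:

```python
import xml.sax


class ElementLogger(xml.sax.ContentHandler):
    """Record start/end element events in the order they are reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def parse_in_chunks(data, chunk_size=1):
    """Feed `data` (bytes) to an incremental SAX parser chunk by chunk."""
    handler = ElementLogger()
    parser = xml.sax.make_parser()  # expat-based reader by default
    parser.setContentHandler(handler)
    for offset in range(0, len(data), chunk_size):
        # Each call hands the parser only the bytes available so far.
        parser.feed(data[offset:offset + chunk_size])
    parser.close()  # signals end of input
    return handler.events


events = parse_in_chunks(b"<a><b/></a>", chunk_size=1)
```

Because feed() only consumes what it is given, the reading side can pass along whatever a pipe or socket delivers instead of blocking until a full 16384-character buffer arrives.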
Regards, Daniel Shane From uche.ogbuji@fourthought.com Sat Sep 7 00:31:51 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Fri, 06 Sep 2002 17:31:51 -0600 Subject: [XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python? In-Reply-To: Message from Eron Lloyd of "06 Sep 2002 12:36:12 EDT." <1031330172.1094.9.camel@phobos> Message-ID: > Hmm, I know that minidom has had some problems recently, but it has also > seen some good improvements. It sounds like you need more robust DOM > support--have you tried 4DOM? It's not as fast, but it does adhere to > the spec the best. cDomlette's cloneNode does work. If minidom's doesn't, a bug report would be nice. > Maybe (when you have time) if you let us know what > you expect to accomplish we can help out--the people in XML-SIG are some > of the sharpest in the community. Perhaps TREX or RELAX-NG would be more > suitable. I guess the only comforting thing I can say is that every > development community is experiencing growing pains when it comes to an > XML strategy. > > Good luck, > > Eron > > On Fri, 2002-09-06 at 02:39, Glyph Lefkowitz wrote: > > > > On 05 Sep 2002 17:53:09 -0400, Eron Lloyd wrote: > > > Are you referring to PyXML? I know xml.* in the Standard Library is > > > pretty weak by far (but getting better!). > > > > Yes. In fact, PyXML is a big part of the problem. Its "minidom" module, for > > example, is *far* buggier than the one found in the standard library. (As an > > example of that, try to figure out how to make cloneNode work on a Document > > object.) What version of PyXML? > > I could deal with one set of potential problems and pitfalls using XML in > > Python and work around then, but I have to work around every combination of > > versions to make a useful app that doesn't have very stringent installation > > requirements: in pracitice this means 4 environments: python2.1 with pyxml, > > python2.1 standalone, python2.2 with pyxml, python2.2 standalone. 
> > > > I don't want a plethora of XML parsers with rich features, all of which are > > broken. I want *one* XML parser that can *reliably* transform a stream of > > bytes into a stream of nodes, and a text file into a tree of nodes. You haven't given any evidence to the effect that PyXML does not have this. A bug in cloneNode has nothing to do with parsing. > > You > > mentioned validatation in your post and I explicitly said that validation is > > worse than useless to me; in most cases I want to parse XHTML, which means > > dealing with lots of potentially DTD-violating stuff which is still "valid" as > > far as I'm concerned. Doesn't HtmlParser do the trick? If not, you could try dom.ext.readers.HtmlReader with a minidom implementation used to override the default. BTW, from what you're describing above, you are *not* parsing XHTML. If it violates the DTD, it is not XHTML. Period. Just say you're parsing "HTML" and don't mention a version. That's the only way to say it correctly ;-) > > Eventually I'll clean up the problem cases I'm having and submit them as bug > > reports, but right now it's not worth my time, because I really don't want to > > deal with the fragility of the PyXML or python-standard-library xml.* stuff. Well, no one can tell you what to do with your time, but such general comments are not very useful. It's not as if you posted 10 bug reports, then threw up your hands and said "I'm blowing this joint". You made one vague mention of a cloneNode bug, without even a bare test case. No one gets paid to develop PyXML, but if you come our way a bit, we're quite willing to help. -- Uche Ogbuji Fourthought, Inc. 
http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra ry/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A 1EA5A2CF4621C386256BBB006F4CEC From uche.ogbuji@fourthought.com Sat Sep 7 00:36:34 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Fri, 06 Sep 2002 17:36:34 -0600 Subject: [mgushee@havenrock.com: Re: [XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python?] In-Reply-To: Message from Matt Gushee of "Fri, 06 Sep 2002 11:41:33 MDT." <20020906174133.GD2999@swordfish> Message-ID: > Oops, accidentally sent this to Eron instead of the list. Here it is > again. > > ----- Forwarded message from Matt Gushee ----- > > Date: Fri, 6 Sep 2002 11:12:45 -0600 > From: Matt Gushee > To: Eron Lloyd > Subject: Re: [XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python? > Reply-To: Matt Gushee > > Since this thread was apparently imported from another list, I'm missing > some of the context, but here goes ... > > On Fri, Sep 06, 2002 at 12:36:12PM -0400, Eron Lloyd wrote: > > Hmm, I know that minidom has had some problems recently, but it has also > > seen some good improvements. It sounds like you need more robust DOM > > support--have you tried 4DOM? It's not as fast, > > That's an understatement. > > > but it does adhere to > > the spec the best. Maybe (when you have time) if you let us know what > > you expect to accomplish we can help out--the people in XML-SIG are some > > of the sharpest in the community. Perhaps TREX or RELAX-NG would be more > > suitable. > > I don't follow that at all. 
First of all, he says he doesn't want
> validation. But even if the greater flexibility of RELAX NG made
> validation useful to him, RELAX NG hasn't been implemented in Python.

Happily, this is not true. Eric van der Vlist announced XVIF stand-alone here a month or so ago, and I announced to the 4Suite list that 4Suite/CVS incorporates XVIF for RELAX NG support.

http://lists.fourthought.com/pipermail/4suite/2002-August/004141.html

> As for TREX, it has been merged into RELAX NG, so it is de facto, if not
> formally, deprecated. So you want him to implement RELAX NG in Python,
> *and* rewrite the XHTML schema in RELAX NG? I don't think so.

I'm pretty sure there's already an XHTML schema in RELAX NG. I don't have a moment to look right now. Of course, it's a moot point because the guy says he is not processing XHTML, but some broken markup with some resemblance to XHTML.

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Track chair, XML/Web Services One Boston: http://www.xmlconference.com/
Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html
Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html
Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A1EA5A2CF4621C386256BBB006F4CEC

From fdrake@acm.org Sat Sep 7 00:49:45 2002
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Fri, 6 Sep 2002 19:49:45 -0400
Subject: [XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python?
In-Reply-To: References: <1031330172.1094.9.camel@phobos>
Message-ID: <15737.16153.431188.210149@grendel.zope.com>

Uche Ogbuji writes:
> cDomlette's cloneNode does work. If minidom's doesn't, a bug
> report would be nice.
There were some bugs checked into the minidom implementation at the last minute before the PyXML 0.8 release; all the ones that I know of are fixed in CVS. If there are still bugs in that version of the code, I'd really like to see a bug report filed on SourceForge. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From glyph@twistedmatrix.com Sat Sep 7 04:44:42 2002 From: glyph@twistedmatrix.com (Glyph Lefkowitz) Date: Fri, 06 Sep 2002 22:44:42 -0500 (CDT) Subject: [XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python? In-Reply-To: References: Message-ID: <20020906.224442.74757833.glyph@twistedmatrix.com> ----Security_Multipart(Fri_Sep__6_22:44:42_2002_475)-- Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit On Fri, 06 Sep 2002 17:31:51 -0600, Uche Ogbuji wrote: > BTW, from what you're describing above, you are *not* parsing XHTML. If it > violates the DTD, it is not XHTML. Period. > Just say you're parsing "HTML" and don't mention a version. That's the only > way to say it correctly ;-) OK, "XML which browsers will render". I am not parsing HTML, in that I won't accept XML that is not well-formed. I suppose I could try to wrap HtmlParser with minidom... yuck. Gross, but probably a good idea, come to think of it :) > Well, no one can tell you what to do with your time, but such general comments > are not very useful. It's not as if you posted 10 bug reports, then threw up > your hands and said "I'm blowing this joint". You made one vague mention of a > cloneNode bug, without even a bare test case. The reason I mentioned the cloneNode bug is because it is the most reliable and the most trivial to demonstrate. Like I said; at some point, I will clean up my complaints and submit some bug reports. 
Here's a "bare test case" of that particular spurious accusation: glyph@zelda:~% python Python 2.2.1 (#1, Aug 30 2002, 09:36:47) [GCC 2.95.4 20011002 (Debian prerelease)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from xml.dom.minidom import parseString >>> parseString("").cloneNode(1) Traceback (most recent call last): File "", line 1, in ? File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 186, in cloneNode clone = _clone_node(self, deep, self.ownerDocument or self) File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 1248, in _clone_node elif node.nodeType == PROCESSING_INSTRUCTION_NODE: NameError: global name 'PROCESSING_INSTRUCTION_NODE' is not defined In order to do the work I want to do, though, those bug reports aren't going to help. Even if you resolved every bug report that I submitted within a week, I would be stuck in the same place I am now: I have to work around the bugs in a bunch of old versions of PyXML or produce what amounts to my own `implementation' of an XML parser. Granted, if I packaged a newer, fixed-up version of PyXML with Twisted, I wouldn't have to be mucking about with bits and bytes -- but I *would* have to understand the entire ontology of confusion associated with cross-language XML APIs. My main frustration is with packaging. If all the world were running Debian unstable, I'd be fine: I'd just say Depends: python2.2-xml >= 0.9. However, with lots of users in Windows, and many more on other linux platforms with less pleasant package management, every new package that Twisted requires is another fifteen minutes that the software takes to get running. It's already confusing enough to understand it when it *works*; I want the process of getting it running to be as seamless as possible :). For the applications that I'm intending to write, just doing my own parser and API is both more appealing and more rewarding. 
Neither DOM nor SAX will present an API which allows me to get network XML events in quite the way I want, so I'm going to have to do some wrapping. (I do wish pyRXP were event-based... it's very close, in spirit, to what I want.) If the general quality of XML parsers in Python were really high, I would regard this impulse as contrary and counterproductive -- why write my own library for doing this when perfectly good ones already exist and and are deployed all over the place? So maybe I'm just rationalizing what I would have done anyway. Nevertheless, it is easier to write my own XML parser than to even properly report the bugs that I have thus far discovered. > No one gets paid to develop PyXML, but if you come our way a bit, we're quite > willing to help. I appreciate that. At some point I hope to have the time to run down every last bug I've found and help PyXML to become very robust. (I know that my requirements are at least a little esoteric; I don't plan for Twisted to be a general-purpose XML processing toolkit!) Despite my various problems with it, PyXML *is* what got me to see why XML might be worthwhile and kind of cool in some circumstances. For more information my perception of XML, and why my requirements are as stripped-down as they are, look at the presentation here: http://xmlsucks.org/but_you_have_to_use_it_anyway/ (Yes, it's a real URL, and it's not mine.) -- | <`'> | Glyph Lefkowitz: Traveling Sorcerer | | < _/ > | Lead Developer, the Twisted project | | < ___/ > | http://www.twistedmatrix.com | ----Security_Multipart(Fri_Sep__6_22:44:42_2002_475)-- Content-Type: application/pgp-signature Content-Transfer-Encoding: 7bit -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.7 (GNU/Linux) iD8DBQA9eXYwvVGR4uSOE2wRAlm6AJ9wx1ca8rTQ7sHHXeAAM36O5s2PgwCeOy1a DidtNC/SRvQm/3pYWA0CAOI= =jET9 -----END PGP SIGNATURE----- ----Security_Multipart(Fri_Sep__6_22:44:42_2002_475)---- From fdrake@acm.org Sat Sep 7 05:07:28 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) 
Date: Sat, 7 Sep 2002 00:07:28 -0400 Subject: [XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python? In-Reply-To: <20020906.224442.74757833.glyph@twistedmatrix.com> References: <20020906.224442.74757833.glyph@twistedmatrix.com> Message-ID: <15737.31616.908090.505123@pcp745479pcs.reston01.va.comcast.net> Glyph Lefkowitz writes: > The reason I mentioned the cloneNode bug is because it is the most > reliable and the most trivial to demonstrate. Like I said; at some > point, I will clean up my complaints and submit some bug reports. > Here's a "bare test case" of that particular spurious accusation: This particular bug has already been fixed in CVS. > In order to do the work I want to do, though, those bug reports > aren't going to help. Even if you resolved every bug report that I > submitted within a week, I would be stuck in the same place I am > now: I have to work around the bugs in a bunch of old versions of > PyXML or produce what amounts to my own `implementation' of an XML If you're shipping commercial applications, ship the versions of relevant libraries and Python needed for the application. Eating up disk space may be annoying, but it's cheap enough not to be a real problem. Bugs are a real problem, no matter how unfortunate, even if they're not your own. If the issue is that you're shipping a framework that needs to work with as many other packages as possible, then document which versions it's known to work with, which versions its known not to work with, and keep moving. Please understand, I'm really sorry PyXML 0.8 had bugs, but we're not getting paid for this, so I don't feel it's my job to double-check every checkin that every PyXML develop makes before a release; I try to make my checkins work as well as I can, and I do test with 4 different major versions of Python. If you need PyXML to become increasingly bug free over time, I'd like to suggest two things: 1. 
Keep track of the CVS version regularly, and test it out with your components. Sometimes this can be tedious, but good automated tests can make this substantially easier. Report bugs quickly using the SourceForge tracker. 2. Contribute regression tests to the project. We know our tests are not complete, and are improving them with each release, but some assistance with this, especially when you report bugs, can make more of a difference even than contributing fixes (which are also welcome). > parser. Granted, if I packaged a newer, fixed-up version of PyXML > with Twisted, I wouldn't have to be mucking about with bits and > bytes -- but I *would* have to understand the entire ontology of > confusion associated with cross-language XML APIs. I must be missing something. Doesn't it just mean that you need to provide a sufficiently updated PyXML distribution? > My main frustration is with packaging. If all the world were > running Debian unstable, I'd be fine: I'd just say Depends: > python2.2-xml >= 0.9. However, with lots of users in Windows, and Yeah, the packaging sucks. It's not any worse than for any other bit of library code though, as far as I can tell. (I'll admit the horizon for my sight is substantially limited to open source software, however.) > For the applications that I'm intending to write, just doing my own > parser and API is both more appealing and more rewarding. Neither > DOM nor SAX will present an API which allows me to get network XML > events in quite the way I want, so I'm going to have to do some If you don't think the interfaces match you application space very well, please describe your requirements and explain how the current APIs don't meet your requirements, and what sort of APIs you're looking for. > If the general quality of XML parsers in Python were really high, I > would regard this impulse as contrary and counterproductive -- why You talk about parser, but I don't think that's what you mean. 
The bug you referred to in minidom had nothing to do with the underlying parser; it would have manifested itself with any parser you picked that reported processing instructions. ("All of them.")

> So maybe I'm just rationalizing what I would have done anyway.
> Nevertheless, it is easier to write my own XML parser than to even
> properly report the bugs that I have thus far discovered.

As an Expat maintainer, I wish you luck. ;-)

> I appreciate that. At some point I hope to have the time to run
> down every last bug I've found and help PyXML to become very
> robust.

Yes, bug reports are definitely necessary to develop a solid piece of software. I do hope we can encourage you to produce a few.

-Fred

--
Fred L. Drake, Jr.
PythonLabs at Zope Corporation

From uche.ogbuji@fourthought.com Sat Sep 7 07:10:51 2002
From: uche.ogbuji@fourthought.com (Uche Ogbuji)
Date: Sat, 07 Sep 2002 00:10:51 -0600
Subject: [XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python?
In-Reply-To: Message from Glyph Lefkowitz of "Fri, 06 Sep 2002 22:44:42 CDT." <20020906.224442.74757833.glyph@twistedmatrix.com>
Message-ID:

> ----Security_Multipart(Fri_Sep__6_22:44:42_2002_475)--
> Content-Type: Text/Plain; charset=us-ascii
> Content-Transfer-Encoding: 7bit
>
>
> On Fri, 06 Sep 2002 17:31:51 -0600, Uche Ogbuji wrote:
>
> > BTW, from what you're describing above, you are *not* parsing XHTML. If it
> > violates the DTD, it is not XHTML. Period.
>
> > Just say you're parsing "HTML" and don't mention a version. That's the only
> > way to say it correctly ;-)
>
> OK, "XML which browsers will render". I am not parsing HTML, in that I won't
> accept XML that is not well-formed. I suppose I could try to wrap HtmlParser
> with minidom... yuck. Gross, but probably a good idea, come to think of it :)

I can't imagine why this would be gross. IMO, it illustrates a very admirable technique, and one of the strengths of Python/XML.
The parsing mechanism and the generated representation are independent of each other, so you can mix them and match them in order to take advantage of the most needed features of either. We put a lot of work into making this possible, and I find it very elegant. C++ folks took ages before they cottoned on to such an approach (in the STL), and now it has them in raptures (generic programming is all the rage). Of course, old strait-jacket Java can't touch this. Too bad for them.

> The reason I mentioned the cloneNode bug is because it is the most reliable and
> the most trivial to demonstrate. Like I said; at some point, I will clean up
> my complaints and submit some bug reports. Here's a "bare test case" of that
> particular spurious accusation:
>
> glyph@zelda:~% python
> Python 2.2.1 (#1, Aug 30 2002, 09:36:47)
> [GCC 2.95.4 20011002 (Debian prerelease)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from xml.dom.minidom import parseString
> >>> parseString("").cloneNode(1)
> Traceback (most recent call last):
>   File "", line 1, in ?
>   File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 186, in cloneNode
>     clone = _clone_node(self, deep, self.ownerDocument or self)
>   File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 1248, in _clone_node
>     elif node.nodeType == PROCESSING_INSTRUCTION_NODE:
> NameError: global name 'PROCESSING_INSTRUCTION_NODE' is not defined

You see, this is why reporting such "bugs" early is helpful. I could have told you ages ago that it is a *bad* idea to call cloneNode on a Document object. According to the DOM Level 2 spec:

"And, cloning Document, DocumentType, Entity, and Notation nodes is implementation dependent."

IOW, yer gets what yer gets and can't really complain :-)

Can you expand a bit more on the actual use case that makes you think you want to clone a document node?

I do agree that the confused error message is a glitch.
Current PyXML CVS gives a more straightforward "sod off" :-) >>> from xml.dom.minidom import parseString >>> parseString("").cloneNode(1) Traceback (most recent call last): File "", line 1, in ? File "/home/uogbuji/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 198, in cloneNode clone = _clone_node(self, deep, self.ownerDocument or self) File "/home/uogbuji/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 1454, in _clone_node raise Exception("Cannot clone node %s" % repr(node)) Exception: Cannot clone node We choose not to allow it. Perfectly legal, and I think this is the right choice. > In order to do the work I want to do, though, those bug reports aren't going to > help. Even if you resolved every bug report that I submitted within a week, I > would be stuck in the same place I am now: I have to work around the bugs in a > bunch of old versions of PyXML You mean you can't require, say PyXML 0.8.1? Tough crowd you develop for? :-) > or produce what amounts to my own `implementation' of an XML parser. If you try going this route, I guarantee you'll still be trying to get the most basic things right six months from now. > Granted, if I packaged a newer, fixed-up > version of PyXML with Twisted, I wouldn't have to be mucking about with bits > and bytes -- but I *would* have to understand the entire ontology of confusion > associated with cross-language XML APIs. > > My main frustration is with packaging. If all the world were running Debian > unstable, I'd be fine: I'd just say Depends: python2.2-xml >= 0.9. However, > with lots of users in Windows, and many more on other linux platforms with less > pleasant package management, every new package that Twisted requires is another > fifteen minutes that the software takes to get running. It's already confusing > enough to understand it when it *works*; I want the process of getting it > running to be as seamless as possible :). Here you have a point. 
Python, PyXML, and a lot of the related packages move very quickly, so quickly that they cause all manner of packaging problems. There is no easy solution to this. Python is much more of a volunteer community than, say, Java. People work on Python and PyXML mostly to scratch their itches, which means they have less incentive to worry about the packaging mess they leave behind. This is the impetus for the Python-in-a-tie effort for Python proper. I do think we'd make a lot more friends if there were a matching PyXML-in-a-tie. It would mean companies would have to commit scarce resources to freezing interfaces and then testing and packaging to oblivion. I have had it in mind to suggest a PyXML-in-a-tie type effort in the Python Business Forum once the effort on Python itself starts to gain legs. I guess I can count on you to at least help cheerlead? :-) > For the applications that I'm intending to write, just doing my own parser and > API is both more appealing and more rewarding. Really? Color me deep skeptical. I have not seen an application on earth where implementing one's own parser is a good idea, and precious few where implementing one's own API is a good idea. I have a lot of colleagues who have tried. By all means, if you'd like to try, go ahead. > Neither DOM nor SAX will > present an API which allows me to get network XML events in quite the way I > want, so I'm going to have to do some wrapping. I have learned through my own bitter experience that you do not want network interfaces to have *anything* to do with the lexical XML layer (or even Infoset). It is best to design network interactions around *application* level semantics. Basically sending around chunks of XML text is far less hazardous than what I think you mean. > (I do wish pyRXP were > event-based... it's very close, in spirit, to what I want.)
If the general > quality of XML parsers in Python were really high, I would regard this impulse > as contrary and counterproductive -- why write my own library for doing this > when perfectly good ones already exist and and are deployed all over the place? Well, as I said, I don't see any evidence that the quality of XML parsers in Python is not high. You pointed out one problem in cloneNode which, from what I gather, was mostly because you're abusing DOM. This had nothing to do with parsing. Are you speaking generically? > So maybe I'm just rationalizing what I would have done anyway. Nevertheless, > it is easier to write my own XML parser than to even properly report the bugs > that I have thus far discovered. I find this claim ludicrous on its face. Writing an XML parser with the compliance level and quality of any of the ones in PyXML takes years. Yes. Years. Feel free to re-learn this fact the hard way, if you wish. > For more information my perception of XML, and why my requirements are as > stripped-down as they are, look at the presentation here: > > http://xmlsucks.org/but_you_have_to_use_it_anyway/ > > (Yes, it's a real URL, and it's not mine.) Yes. I'd guess we've all seen that link. So what useful technology doesn't suck? XML works for me. Your mileage may vary. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra ry/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A 1EA5A2CF4621C386256BBB006F4CEC From fdrake@acm.org Sat Sep 7 08:14:40 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) 
Date: Sat, 7 Sep 2002 03:14:40 -0400 Subject: [XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python? In-Reply-To: References: <20020906.224442.74757833.glyph@twistedmatrix.com> Message-ID: <15737.42848.170464.688796@pcp745479pcs.reston01.va.comcast.net> Uche Ogbuji writes: > You see, this is why reporting such "bugs" early is helpful. I could have > told you ages ago that it is a *bad* idea to call cloneNode on a Document > object. > > According to the DOM Level 2 spec: > > "And, cloning Document, DocumentType, Entity, and Notation nodes is > implementation dependent." That's no reason to think it's a bad idea to implement it or need it, just that you can't rely on it being supported by an arbitrary DOM implementation. > I do agree that the confused error message is a glitch. Current PyXML CVS > gives a more straightforward "sod off" :-) Not quite; the previous message would have been raised calling cloneNode() on a processing instruction as well. Or calling it with deep=1 on a portion of the tree that contained a processing instruction. That was a real bug, and not an arbitrary limitation. > We choose not to allow it. Perfectly legal, and I think this is the right > choice. Honestly, I think we should implement cloneNode() for Document, simply because not doing so seems an unnecessary limitation. It is not for the library to decide what is right for the application. I agree that not supporting it is legal. The exception that is raised is wrong: it should be xml.dom.NotSupportedErr. > If you try going this route, I guarantee you'll still be trying to get the > most basic things right six months from now. Heck, we're still trying to get Expat right, and it isn't exactly the freshest software around! > This is the impetus for the Python-in-a-tie effort for Python > proper. I do think we'd make a lot more friends if there were a > matching PyXML-in-a-tie. It would mean companies would have to That would be nice to have.
First task: improve & integrate all the random piles of tests out there! They should all be run when I type "make check" at the top level, not just a handful. > You pointed out one problem in cloneNode which, from what I gather, > was mostly because you're abusing DOM. This had nothing to do with It is not at all clear that this is an abuse of the DOM, as I explained above. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From martin@v.loewis.de Sat Sep 7 08:38:25 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 07 Sep 2002 09:38:25 +0200 Subject: [XML-SIG] PyXML documentation (was [mgushee@havenrock.com: Re: [FRPythoneers] xml woes]) In-Reply-To: <15737.1159.609009.339799@grendel.zope.com> References: <20020906185953.GG2999@swordfish> <1031339653.26343.29.camel@penny> <15737.1159.609009.339799@grendel.zope.com> Message-ID: "Fred L. Drake, Jr." writes: > Actually, I don't think I am. ;-) I'm certainly glad to help out as > time allows. The best thing to do when documentation is missing is > (doing as many of these as possible), but at least the first item): > > - File a bug report, telling exactly what you were looking for > - Explain (as part of the bug report, or a followup comment) what the > documentation should say about the topic (what would have answered > your question) > - Write any required new material for the documentation, in > Python-style LaTeX or plain text, and attach it to the bug report. I think the best thing would be if you provided a patch to doc/xml-ref.tex. Regards, Martin From uche.ogbuji@fourthought.com Sat Sep 7 19:54:46 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Sat, 07 Sep 2002 12:54:46 -0600 Subject: [XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python? In-Reply-To: Message from "Fred L. Drake, Jr." of "Sat, 07 Sep 2002 03:14:40 EDT." 
<15737.42848.170464.688796@pcp745479pcs.reston01.va.comcast.net> Message-ID: > > Uche Ogbuji writes: > > You see, this is why reporting such "bugs" early is helpful. I could have > > told you ages ago that it is a *bad* idea to call cloneNode on a Document > > object. > > > > According to the DOM Level 2 spec: > > > > "And, cloning Document, DocumentType, Entity, and Notation nodes is > > implementation dependent." > > That's no reason to think its a bad idea to implement it or need it, > just that you can't rely on it being supported by an arbitrary DOM > implementation. OK. So what should it mean to clone any of these node types? I can hardly imagine anything that doesn't run into circular madness. I think the DOM WG refused to specify this for good reason. > > I do agree that the confused error message is a glitch. Current PyXML CVS > > gives a more straightforward "sod off" :-) > > Not quite; the previous message would have been raised calling > cloneNode() on a processing instruction as well. Or calling it with > deep=1 on a portion of the tree that contained a processing > instruction. That was a real bug, and not an arbitrary limitation. OK. Glad it's fixed, then. > > We choose not to allow it. Perfectly legal, and I think this is the right > > choice. > > Honestly, I think we should implement cloneNode() for Document, simply > because not doing so seems an unnecessary limitation. It is not for > the library to decide what is right for the application. It's not arbitrary at all. cloneNode is not supposed to alter the ownerDocument: that is for importNode to do. So if you clone a document node, what happens? Do you create a new document (and thus docType) node and then effectively call importNode on the childNodes? That's the only approach that makes sense to me. Yet it's quite arbitrary and magical.
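The approach just described (create a new document, then import the original's children into it) can be sketched as below. This is only an illustration under assumptions: it uses Python 3's standard-library xml.dom.minidom rather than the PyXML of this thread, duplicate_document is a hypothetical name, and the doctype is simply dropped because minidom's importNode does not accept DocumentType nodes:

```python
from xml.dom import Node
from xml.dom.minidom import getDOMImplementation, parseString


def duplicate_document(source):
    """Return a new Document holding deep copies of source's top-level nodes."""
    # createDocument(None, None, None) yields an empty Document in minidom.
    duplicate = getDOMImplementation().createDocument(None, None, None)
    for child in source.childNodes:
        if child.nodeType == Node.DOCUMENT_TYPE_NODE:
            # importNode raises NotSupportedErr for doctype nodes; skip them here.
            continue
        # importNode copies the subtree and sets ownerDocument to the new document.
        duplicate.appendChild(duplicate.importNode(child, True))
    return duplicate


original = parseString('<?note hi?><root a="1"><c>t</c></root>')
copy_of_original = duplicate_document(original)
```

Because every copied node is owned by the new document, changes to the copy leave the original untouched.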
I would rather force people to be clear about what they're doing by manually creating another document and then calling importNode on all the childNodes of the original. So I do not agree that we should support cloneNode for the 4 unspecified node types. > I agree that not supporting it is legal. The exception that is raised > is wrong: it should be xml.dom.NotSupportedErr. There is no stipulation that such an exception should be thrown. The behavior is impl dependent, and I don't see why that doesn't mean the implementation can choose to throw whatever exception it wishes. However, I certainly do not object to a change to throwing xml.dom.NotSupportedErr. I just don't think it's a bug that right now it doesn't. > > This is the impetus for the Python-in-a-tie effort for Python > > proper. I do think we'd make a lot more friends if there were a > > matching PyXML-in-a-tie. It would mean companies would have to > > That would be nice to have. First task: improve & integrate all the > random piles of tests out there! They should all be run when I type > "make check" at the top level, not just a handful. Yes. Build and test farms would be the main engine of such an effort. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra ry/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A 1EA5A2CF4621C386256BBB006F4CEC From glyph@twistedmatrix.com Mon Sep 9 02:06:53 2002 From: glyph@twistedmatrix.com (Glyph Lefkowitz) Date: Sun, 08 Sep 2002 20:06:53 -0500 (CDT) Subject: [XML-SIG] Re: Can anyone recommend a sensible XML parser for Python? 
In-Reply-To: References: Message-ID: <20020908.200653.27439528.glyph@twistedmatrix.com> On Sat, 07 Sep 2002 00:10:51 -0600, Uche Ogbuji wrote: > > On Fri, 06 Sep 2002 17:31:51 -0600, Uche Ogbuji wrote: > > I suppose I could try to wrap HtmlParser with minidom... yuck. Gross, but > > probably a good idea, come to think of it :) > I can't imagine why this would be gross. Sorry, I was saying that making sense of non-XHTML HTML is kind of gross. I did say that it was a good idea, and it's definitely a neat trick. > According to the DOM Level 2 spec: "And, cloning Document, DocumentType, > Entity, and Notation nodes is implementation dependent." This is why standards compliance is not terribly important to me. I would rather have a useful XML API than a standardized one. > Can you expand a bit more on the actual use case that makes you think you want > to clone a document node? I have a template "frame" document. I want to clone the document, populate it with information lifted from other XML files, and then write the resultant (cloned) document out. This is the very first use-case I ever had working with XML and it is still the most common. > We choose not to allow it. Perfectly legal, and I think this is the right > choice. Yes, but the point remains that this *used* to work, and now it *doesn't*. This is functionality I found useful. While I can't comment on the intrinsic sense or nonsense of cloning document nodes in DOM, I do know that it's difficult to keep track of when features like this appear and disappear in the various different XML solutions for Python. Maybe this is the only feature that has done this; I don't know. It just happens that it's a very commonly-used one for me.
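The template-frame workflow described above (copy a frame document, fill the copy in, write the copy out) can be sketched as follows. This is an illustration under assumptions, not the code used in the thread: it relies on Python 3's standard-library xml.etree.ElementTree and copy.deepcopy instead of the PyXML DOM, and fill_template and the element names are hypothetical:

```python
import copy
import io
import xml.etree.ElementTree as ET


def fill_template(template_tree, values):
    """Deep-copy a template tree and set the text of elements named in values."""
    tree = copy.deepcopy(template_tree)  # the template itself is never modified
    for element in tree.getroot().iter():
        if element.tag in values:
            element.text = values[element.tag]
    return tree


# A tiny "frame" document with empty slots to populate.
template = ET.ElementTree(ET.fromstring("<page><title/><body/></page>"))
page = fill_template(template, {"title": "Hello", "body": "World"})

# Serialise only the filled-in copy; the frame can be reused for the next page.
stream = io.BytesIO()
page.write(stream, encoding="utf-8")
```

Copying the whole tree sidesteps the question of what cloning a Document node should mean, since no DOM ownership rules are involved.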
This is just another instance of my general complaint that tracking versioning dependencies is not worth the effort for my degenerately simple use-cases for XML. > You mean you can't require, say PyXML 0.8.1? Tough crowd you develop for? > :-) There are still some parties interested in Twisted who are upset that it requires Python 2.1; in fact, I felt guilty doing 2.1 support because I am likely going to have to backport portions of it to 1.5.2 for some people. We can all thank Red Hat for this inane persistence of ancient python versions, but it is sadly the world I live in. > > My main frustration is with packaging. > Here you have a point. Python, PyXML, and a lot of the related packages move > very quickly,. and so quickly that they cause all manner of packaging > problems. This is my main point, and this is the one that the PyXML community can do the least to address. Buggy and idiosyncratic implementations are already in the wild, and some apps will depend on those particular bugs and idiosyncrasies. If twisted depends on a new or different set of bugs and quirks, I make it incompatible with whatever other XML-using applications are out there today. Given that XML is an integration technology this is certainly less than desirable. > There is no easy solution to this. Having a project that is precipitously approaching 1.0 myself, I can sympathize. As much as this sort of dependency and compatibility problem has bothered me, I *know* there will be people that write apps for Twisted and will curse my name when I enhance some functionality later on :-). > I have had it in mind to suggest a PyXML-in-a-tie type effort in the Python > Business Forum once the effort on Python itself starts to gain legs. I guess > I can count on you to at least help cheerlead? :-) Cheerleading, certainly :-). Although I'm less interested in seeing PyXML prepared for "business" clients and more interested in just seeing the level of QA on the volunteer work go up. 
If I *had* any spare "scarce resources" to commit beyond my own projects, I would certainly help getting the unit tests unified and automated. > > or produce what amounts to my own `implementation' of an XML parser. > > If you try going this route, I guarantee you'll still be trying to get the > most basic things right six months from now. ... > > For the applications that I'm intending to write, just doing my own parser and > > API is both more appealing and more rewarding. > > Really? Color me deep skeptical. I have not seen an application on earth > where implementing one's own parser is a good idea, and precious few where > implementing one's own API is a good idea. I have a lot of colleagues who > have tried. While it is *possible* that I'm smarter than you think I am, it is certain that I'm more stubborn. My sophomoric attempt at an XML parser is now in Twisted CVS. I've had this objection raised over writing yet another web server, yet another remote procedure call protocol, yet another asynchronous socket server and yet another database interface. It seems like at least some of these ideas were good ones, so I went ahead and wrote an XML parser and representation anyway :-). A fellow I know from IRC once said "it's easier to write an s-expression parser for a particular platform by hand than to learn to use any of the XML tools for that platform". I think that if you're interested in keeping your focus narrow in terms of what you do with XML, the same is true of writing an XML parser. As a data point for this hypothesis, writing the parser and the node tree took me less than half as much time as writing these posts to various mailing lists about XML tools (not counting this post, which has been the most time-consuming): it took less than a quarter as much time as attempting (and failing) to track down bugs in PyXML, not counting the time I spent trying to figure out how to turn off undesired features in a way that would work on more than one version.
My two main existing PyXML-using applications are already ported to this, changing barely any of their code. Even so, this is almost not a fair comparison because I have several months of experience with those tools on Python 2.1, and I've read a few books on XML already. > > Neither DOM nor SAX will present an API which allows me to get network XML > > events in quite the way I want, so I'm going to have to do some wrapping. > I have learned through my own bitter experience that you do not want network > interfaces to have *anything* to do with the lexical XML layer (or even > Infoset). It is best to design network interactions around *application* > level semantics. Basically sending around chunks of XML text is far less > hazardous than what I think you mean. I'm not sure what you think I mean, really, but specifically, I'm thinking particularly of parsing and routing Jabber XML streams. If they are designed in a "hazardous" way then it's not my issue... I don't think much of their protocol design as it is, especially with regard to routing. (As you might guess, I think the whole idea of using XML as a network protocol is rather strange; but Jabber in particular could have been much better done. BEEP, for example, I consider odd, but not broken.) > > (I do wish pyRXP were event-based... it's very close, in spirit, to what I > > want.) If the general quality of XML parsers in Python were really high, I > > would regard this impulse as contrary and counterproductive -- why write my > > own library for doing this when perfectly good ones already exist and and > > are deployed all over the place? > Well, as I said, I don't see any evidence that the quality of XML parsers in > Python is not high. You pointed out one problem in cloneNode which, from what > I gather, was mostly because you're abusing DOM. This had nothing to do with > parsing. Are you speaking generically? 
When I run my particular XML-munging tool, sometimes I get: NameError: global name 'PROCESSING_INSTRUCTION_NODE' is not defined which we have discussed the reasons for here. Slightly less often, but still with a significant frequency (same python, same PyXML, same input), I get: zsh: segmentation fault ] (doc/howto/basics) I can't present hard evidence for this, I'm sorry, because I'm not familiar with the internals of PyXML or expat and I can't get the bug to happen reliably. If I can ever boil it down to something predictable (i.e. less than 1500 lines of code and half a meg of XML to trigger it) be assured I will make the most complete bug report I can. > > Nevertheless, it is easier to write my own XML parser than to even properly > > report the bugs that I have thus far discovered. > I find this claim ludicrous on its face. Writing an XML parser with the > compliance level and quality of any of the ones in PyXML takes years. Yes. > Years. I never claimed to need a parser with PyXML's level of compliance; in fact, I've said several times that compliance at that level is annoying to me because it's too strict. I think we're going to have to agree to disagree on "quality", but at least for my use cases I don't get occasional coredumps from my parser. I cannot substantiate this with real bug reports, so please feel free to dismiss this as FUD if you disagree. From my discussions with other developers near my interest area, however, QA on the PyXML project is notoriously poor, and the quality is wildly variant from release to release. As you yourself have said, this is likely to remain so until someone funds improvements. I do not feel as though I am owed anything in particular by the PyXML project or by any subscriber to any of these lists. In fact, I'm quite grateful for it having provided a nice, simple introduction to the world of XML; I probably would not be using XML today at all if it weren't for the PyXML project. 
Unfortunately, due to my larger-than-average concerns about dependencies and ease of automating testing for my own project, I don't think that PyXML is a good solution. I need a *very* small XML library, with no strings attached. PyXML is huge, and featureful, and I'm sure in the most recent incarnations it's very robust. It does come with a lot of strings attached though. I have decided it's not worth my time at this point to invest a lot of effort in helping out, until a few versions go by and the general impressions I get from XML developers I work with are becoming more positive. This doesn't mean I won't lend a helping hand when I can, but the communication overhead to working in the PyXML community is not currently worth the gain I would get from it. I wish you the best of luck in making me look foolish for saying that :-). -- | <`'> | Glyph Lefkowitz: Traveling Sorcerer | | < _/ > | Lead Developer, the Twisted project | | < ___/ > | http://www.twistedmatrix.com | From fredrik@pythonware.com Mon Sep 9 12:12:45 2002 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 9 Sep 2002 13:12:45 +0200 Subject: [XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python? References: Message-ID: <00fa01c257f1$d3f48ef0$0900a8c0@spiff> uche wrote: > > For the applications that I'm intending to write, just doing my own parser and > > API is both more appealing and more rewarding. > > Really? Color me deep skeptical.
I have not seen an application on earth > where implementing one's own parser is a good idea, and precious few where > implementing one's own API is a good idea. on the other hand, virtually every commercial XML python user I know of use their own non-pydom parser/sax-style api/dom- style api (with 4thought being the obvious exception, of course). if I couldn't use ElementTree-like apis, I'd probably give up XML programming... (using element trees, Glyph's use case would look something like: tree = deepcopy.deepcopy(template_tree) for node in tree.find(pattern): expand(context, node) tree.write(stream) ) From uche.ogbuji@fourthought.com Mon Sep 9 20:39:01 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Mon, 09 Sep 2002 13:39:01 -0600 Subject: [XML-SIG] Re: Can anyone recommend a sensible XML parser for Python? In-Reply-To: Message from Glyph Lefkowitz of "Sun, 08 Sep 2002 20:06:53 CDT." <20020908.200653.27439528.glyph@twistedmatrix.com> Message-ID: > > According to the DOM Level 2 spec: "And, cloning Document, DocumentType, > > Entity, and Notation nodes is implementation dependent." > > This is why standards compliance is not terribly important to me. I would > rather have a useful XML API than a standardized one. Well, what do you think is the most useful behavior of cloning a document? Is it the one I posted in response to thread? If so, don't you think the element of surprise is too great (I'd be surprised myself at that behavior)? Wouldn't it be better for Python/XML to offer a *separate*, specialized function for cloning nodes, rather than doing weird things with cloneNode? > > Can you expand a bit more on the actual use case that makes you think you want > > to clone a document node? > > I have a template "frame" document. I want to clone the document, populate it > with information lifted from other XML files, and then write the resultant > (cloned) document out.
This is the very first use-case I ever had working with > XML and it is still the most common. I see. It sounds as if a general document duplication function would be of use to you. I agree that this would be useful. I'm willing to write one and add it to xml.dom.ext. But I don't think this is a use case for node.cloneNode. > > We choose not to allow it. Perfectly legal, and I think this is the right > > choice. > > Yes, but the point remains that this *used* to work, and now it *doesn't*. I don't remember. What did it do when it "worked"? > This is functionality I found useful. While I can't comment on the intrinsic > sense or nonsense of cloning document nodes in DOM, I do know that it's > difficult to keep track of when features like this appear and disappear in the > various different XML solutions for Python. Was it ever documented? Every software module has undocumented "features" that you use at your peril. I don't think it's fair to complain when these appear and disappear. Then again, the poor state of PyXML documentation in general weakens that point of mine, doesn't it? Ah well. > Maybe this is the only feature that has done this; I don't know. It just > happens that it's a very commonly-used one for me. > > This is just another instance of my general complaint that tracking versioning > dependencies is not worth the effort for my degenerately simple use-cases for > XML. > > > You mean you can't require, say PyXML 0.8.1? Tough crowd you develop for? > > :-) > > There are still some parties interested in Twisted who are upset that it > requires Python 2.1; in fact, I felt guilty doing 2.1 support because I am > likely going to have to backport portions of it to 1.5.2 for some people. We > can all thank Red Hat for this inane persistence of ancient python versions, > but it is sadly the world I live in. I sympathize. It's largely because of Red Hat that it took us so long to drop 1.5 support in 4Suite.
But a couple of months ago we decided it is not worth the development and support overhead and ditched support for all versions before 2.1. I sleep better since then :-) > > > My main frustration is with packaging. > > > Here you have a point. Python, PyXML, and a lot of the related packages move > > very quickly, so quickly that they cause all manner of packaging > > problems. > > This is my main point, and this is the one that the PyXML community can do the > least to address. Buggy and idiosyncratic implementations are already in the > wild, and some apps will depend on those particular bugs and idiosyncrasies. > If twisted depends on a new or different set of bugs and quirks, I make it > incompatible with whatever other XML-using applications are out there today. > > Given that XML is an integration technology this is certainly less than > desirable. > > > There is no easy solution to this. > > Having a project that is precipitously approaching 1.0 myself, I can > sympathize. As much as this sort of dependency and compatibility problem has > bothered me, I *know* there will be people that write apps for Twisted and will > curse my name when I enhance some functionality later on :-). > > > I have had it in mind to suggest a PyXML-in-a-tie type effort in the Python > > Business Forum once the effort on Python itself starts to gain legs. I guess > > I can count on you to at least help cheerlead? :-) > > Cheerleading, certainly :-). Although I'm less interested in seeing PyXML > prepared for "business" clients and more interested in just seeing the level of > QA on the volunteer work go up. If I *had* any spare "scarce resources" to > commit beyond my own projects, I would certainly help getting the unit tests > unified and automated. > > > > or produce what amounts to my own `implementation' of an XML parser. > > > > If you try going this route, I guarantee you'll still be trying to get the > > most basic things right six months from now. > > ...
> > > > For the applications that I'm intending to write, just doing my own parser and > > > API is both more appealing and more rewarding. > > > > Really? Color me deep skeptical. I have not seen an application on earth > > where implementing one's own parser is a good idea, and precious few where > > implementing one's own API is a good idea. I have a lot of colleagues who > > have tried. > > While it is *possible* that I'm smarter than you think I am, it is certain that > I'm more stubborn. I think you take the wrong gloss on my words. I think Linus Torvalds himself would take years to write a complete and correct XML parser. It's the nature of the beast (XML), not the programmer. I certainly do not consider myself smart enough to take on that dragon. I'm just glad to lean on folk like Clark (and Drake, Evans and co), Garshol and Viellard. > My sophomoric attempt at an XML parser is now in Twisted > CVS. Interesting. So how did you test it? > I've had this objection raised over writing yet another a web server, yet > another remote procedure call protocol, yet another asynchronous socket server > and yet another database interface. It seems like at least some of these ideas > were good ones, so I went ahead and wrote an XML parser and representation > anyway :-). I would rather write a Web server, another RPC, another async socket server *and* another DBMS interface all in a row than just take on the single task of writing an XML parser. And I think I can speak authoritatively, because I *have* implemented all four of those things. 
> As a data point for this hypothesis, writing the parser and the node tree took > me less than half as much time as writing these posts to various mailing lists > about XML tools (not counting this post, which has been the most > time-consuming): it took less than a quarter as much time as attempting (and > failing) to track down bugs in PyXML, not counting the time I spent trying to > figure out how to turn off undesired features in a way that would work on more > than one version. My two main existing PyXML-using applications are already > ported to this, changing barely any of their code. As I said, I am very skeptical of the result. I'll be impressed when you tell me your home-brew XML parser passes the OASIS conformance suite. Anyway, this is all moot argument. It looks as if you've satisfied yourself for now. Good luck. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra ry/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A 1EA5A2CF4621C386256BBB006F4CEC From uche.ogbuji@fourthought.com Mon Sep 9 20:42:44 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Mon, 09 Sep 2002 13:42:44 -0600 Subject: [XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python? In-Reply-To: Message from "Fredrik Lundh" of "Mon, 09 Sep 2002 13:12:45 +0200." <00fa01c257f1$d3f48ef0$0900a8c0@spiff> Message-ID: > uche wrote: > > > > For the applications that I'm intending to write, just doing my own parser and > > > API is both more appealing and more rewarding. > > > > Really? Color me deep skeptical. 
I have not seen an application on earth > > where implementing one's own parser is a good idea, and precious few where > > implementing one's own API is a good idea. > > on the other hand, virtually every commercial XML python user > I know of use their own non-pydom parser/sax-style api/dom- > style api (with 4thought being the obvious exception, of course). Really? I am surprised. I suspect the reasons for this would not be as straightforward as truly unique requirements. > if I couldn't use ElementTree-like apis, I'd probably give up XML > programming... > > (using element trees, Glyph's use case would look something like: > > tree = deepcopy.deepcopy(template_tree) > for node in tree.find(pattern): > expand(context, node) > tree.write(stream) > > ) I'm not familiar with ElementTrees. At any rate, I don't see this use case as very daunting, especially if you have DOM and generators. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra ry/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A 1EA5A2CF4621C386256BBB006F4CEC From wkhenning@softhome.net Tue Sep 10 02:26:48 2002 From: wkhenning@softhome.net (warren henning) Date: Mon, 9 Sep 2002 18:26:48 -0700 Subject: [XML-SIG] an example of generating XML? Message-ID: <001001c25869$25d3c0c0$0400a8c0@STAMPY> Hi, I have found the PyXML documentation to be somewhat incomplete. The example code, in many cases, doesn't work. 
Since the W3C documents are more geared towards programmers implementing it rather than lowly programmers like myself trying to use a particular implementation, you might understand why I'm a little bit frustrated. Could someone give a simple example of generating a valid, well-formed XML file using PyXML? Just show me the code to create something simple like: This is a test node. So is this. I want something that works. I even tried printing the __doc__ of various functions. No luck, most have none. And I would certainly go to 4Suite.com and look at their documentation, but their domain got stolen. I mean, if you go there, you get a ton of popup windows and all this stuff about domain registration. Obviously not 4Suite, Inc. The documentation I've found just provides code fragments. I can't figure out how to make a whole, entire example that can be pasted to a .py file and run. I'm just having a lot of trouble, I've tried everything I can think of. Any help would be much appreciated. -Warren From tpassin@comcast.net Tue Sep 10 05:06:19 2002 From: tpassin@comcast.net (Thomas B. Passin) Date: Tue, 10 Sep 2002 00:06:19 -0400 Subject: [XML-SIG] an example of generating XML? References: <001001c25869$25d3c0c0$0400a8c0@STAMPY> Message-ID: <000601c2587f$6b280170$fe193044@tbp1> [warren henning] > And I would certainly go to 4Suite.com and look at their documentation, but > their domain got stolen. I mean, if you go there, you get a ton of popup > windows and all this stuff about domain registration. Obviously not 4Suite, > Inc. > http://4suite.org Tom P From fdrake@acm.org Tue Sep 10 05:40:55 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 10 Sep 2002 00:40:55 -0400 Subject: [XML-SIG] Can anyone recommend a sensible XML parser for Python? 
In-Reply-To: References: <20020908.200653.27439528.glyph@twistedmatrix.com> Message-ID: <15741.30679.845057.285100@grendel.zope.com> Uche Ogbuji writes: > Wouldn't it be better for Python/XML to offer a *separate*, specialized > function for cloning nodes, rather than doing weird things with cloneNode? Why? I'd rather make cloneNode() do the right thing, and it seems rather clear what that should be. Certainly more clear than for DocumentType nodes. ;-) > I see. It sounds as if a general document duplication function > would be of use to you. I agree that this would be useful. I'm > willing to write one and add it to xml.dom.ext. > > But I don't think this is a use case for node.cloneNode. I think it's a perfectly valid use case for Document.cloneNode(). > Then again, the poor state of PyXML documentation in general weakens that > point of mine, doesn't it? Ah well. There is that. ;-) Perhaps before making something stop working (for some definition of "work"), the documentation should be checked for contracts and updated if some under-specified behavior should be considered beyond the contract. Removing features tends to be frowned upon in the Python world, especially if the documentation for what something should do is just plain missing -- it becomes really hard to say what isn't in the contract, because nobody said what *is* in the contract. Glyph: > There are still some parties interested in Twisted who are upset > that it requires Python 2.1; in fact, I felt guilty doing 2.1 > support because I am likely going to have to backport portions of > it to 1.5.2 for some people. Hey, at least PyXML makes that part easy, since Python 2.0 support is still in its contract! ;-) On the other hand, it's painful because we end up with cruft like xml.dom.minicompat to make things work reasonably with newer Pythons and still work for older versions. I'm waiting for the day we can assume there are new-style objects, and everything works.
> I think you take the wrong gloss on my words. I think Linus > Torvalds himself would take years to write a complete and correct > XML parser. It's the nature of the beast (XML), not the > programmer. Hear ye, hear ye! > As I said, I am very skeptical of the result. I'll be impressed > when you tell me your home-brew XML parser passes the OASIS > conformance suite. Heck, even Expat doesn't pass that yet! (We are making progress, though.) -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From fdrake@acm.org Tue Sep 10 05:47:37 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 10 Sep 2002 00:47:37 -0400 Subject: [XML-SIG] Mass assignment of 4Suite bug reports Message-ID: <15741.31081.119000.971214@grendel.zope.com> I just assigned all the unassigned bug reports categorized as "4Suite" to Uche. Mostly so they'd have a better chance at getting noticed. Uche, feel free to re-assign them to someone more appropriate if there is someone better for the report, or knowledgable with the code and having (relatively) free cycles. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From uche.ogbuji@fourthought.com Tue Sep 10 06:02:46 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Mon, 09 Sep 2002 23:02:46 -0600 Subject: [XML-SIG] an example of generating XML? In-Reply-To: Message from "warren henning" of "Mon, 09 Sep 2002 18:26:48 PDT." <001001c25869$25d3c0c0$0400a8c0@STAMPY> Message-ID: > Hi, > I have found the PyXML documentation to be somewhat incomplete. The example > code, in many cases, doesn't work. Since the W3C documents are more geared > towards programmers implementing it rather than lowly programmers like > myself trying to use a particular implementation, you might understand why > I'm a little bit frustrated. > > Could someone give a simple example of generating a valid, well-formed XML > file using PyXML? > > Just show me the code to create something simple like: > > > > This is a test node. > So is this. 
> It depends on what you're generating it *from*. I don't mean to be flip, but without any further data, the way I would generate the document above is: print """ This is a test node. So is this. """ And I know PyXML pretty well. > I want something that works. > > I even tried printing the __doc__ of various functions. No luck, most have > none. > > And I would certainly go to 4Suite.com and look at their documentation, but > their domain got stolen. I mean, if you go there, you get a ton of popup > windows and all this stuff about domain registration. Obviously not 4Suite, > Inc. There is no 4Suite Inc., and thus no 4Suite.com :-) The company is Fourthought, Inc. ergo http://www.fourthought.com The project is 4Suite ergo http://4Suite.org > The documentation I've found just provides code fragments. I can't figure > out how to make a whole, entire example that can be pasted to a .py file and > run. > > I'm just having a lot of trouble, I've tried everything I can think of. Any > help would be much appreciated. We'll be happy to help. We just need some specifics. Can you start at the beginning and tell us what you're trying to do? -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A1EA5A2CF4621C386256BBB006F4CEC From tpassin@comcast.net Tue Sep 10 06:02:26 2002 From: tpassin@comcast.net (Thomas B. Passin) Date: Tue, 10 Sep 2002 01:02:26 -0400 Subject: [XML-SIG] an example of generating XML?
References: <001001c25869$25d3c0c0$0400a8c0@STAMPY> Message-ID: <000401c25887$42055470$fe193044@tbp1> [warren henning] > Could someone give a simple example of generating a valid, well-formed XML > file using PyXML? > > Just show me the code to create something simple like: > > > > This is a test node. > So is this. > > > I want something that works. > You could be more complete and accurate about what you are asking for. Without a DTD, your document cannot be tested for validity. And you do not need pyXML to create your document - you can just create a string - so I assume that you really mean to create it using DOM. Here is a minimal example that works, though without error handling. It uses pyXML 0.8 plus the corresponding version of 4Suite, on Windows2000. This code comes mainly from test_document.py in the xmldoc\test\dom directory, which is worth reading parts of. It creates enough of your requested document so you can see how to complete it. Cheers, Tom P

from xml.dom import Document
from xml.dom.ext.Printer import PrintWalker,PrintVisitor

EMPTY_NAMESPACE=None

def build_doc():
    dt = implementation.createDocumentType('','','')
    doc = implementation.createDocument(EMPTY_NAMESPACE,None,dt);

    e = doc.createElement('data')
    doc.appendChild(e)

    e2 = doc.createElement('node')
    e2.setAttribute('id','1')
    e.appendChild(e2)

    return doc

if __name__=='__main__':
    doc=build_doc()

    import sys
    visitor=PrintVisitor(sys.stdout,'iso-8859-1',' ')
    printer=PrintWalker(visitor,doc)
    printer.run()

From uche.ogbuji@fourthought.com Tue Sep 10 06:08:13 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Mon, 09 Sep 2002 23:08:13 -0600 Subject: [XML-SIG] Can anyone recommend a sensible XML parser for Python? In-Reply-To: Message from "Fred L. Drake, Jr." of "Tue, 10 Sep 2002 00:40:55 EDT."
<15741.30679.845057.285100@grendel.zope.com> Message-ID: > > Uche Ogbuji writes: > > Wouldn't it be better for Python/XML to offer a *separate*, specialized > > function for cloning nodes, rather than doing weird things with cloneNode? > > Why? I'd rather make cloneNode() do the right thing, and it seems > rather clear what that should be. Certainly more clear than for > DocumentType nodes. ;-) So you think it should do what I mentioned before? 1) Create a new documenType and document node 2) clone all child nodes 3) set the ownerDocument of each of the new nodes to the new document? If we have it do that, then let us please 1) Document it properly 2) Point out that it is not standard DOM behavior I am not at all clear that this is the "right thing". I still think the right thing is to throw an exception. I know the above behavior would throw me as I expect the ownerDocument of cloned nodes to be the same as the ones from which they were cloned. But I certainly don't care enough about it to oppose such an addition. I'd just like to make sure we call it out properly. Least surprise and all that. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra ry/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A 1EA5A2CF4621C386256BBB006F4CEC From Matt Gushee Tue Sep 10 06:01:37 2002 From: Matt Gushee (Matt Gushee) Date: Mon, 9 Sep 2002 23:01:37 -0600 Subject: [XML-SIG] an example of generating XML? 
In-Reply-To: <000401c25887$42055470$fe193044@tbp1> References: <001001c25869$25d3c0c0$0400a8c0@STAMPY> <000401c25887$42055470$fe193044@tbp1> Message-ID: <20020910050137.GE622@swordfish> On Tue, Sep 10, 2002 at 01:02:26AM -0400, Thomas B. Passin wrote: > Here is a minimal example that works, though without error handling. Almost works, you mean? > from xml.dom import Document > from xml.dom.ext.Printer import PrintWalker,PrintVisitor > > EMPTY_NAMESPACE=None > > def build_doc(): > dt = implementation.createDocumentType('','','') implementation? -- Matt Gushee Englewood, Colorado, USA mgushee@havenrock.com http://www.havenrock.com/ From uche.ogbuji@fourthought.com Tue Sep 10 06:19:57 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Mon, 09 Sep 2002 23:19:57 -0600 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: Message from "Fred L. Drake, Jr." of "Tue, 10 Sep 2002 00:47:37 EDT." <15741.31081.119000.971214@grendel.zope.com> Message-ID: > I just assigned all the unassigned bug reports categorized as "4Suite" > to Uche. Mostly so they'd have a better chance at getting noticed. > > Uche, feel free to re-assign them to someone more appropriate if there > is someone better for the report, or knowledgable with the code and > having (relatively) free cycles. I think we can remove the 4Suite category from the PyXML bug roster. Now 4Suite has an SF project and bug tracker of its own. But anyway, looking at those bugs, some of them are actually bugs about the 4XPath and 4XSLT in PyXML. Which, since we have an impending release, is probably a good topic for discussion. These modules have been broken a good long while, and they lag 4XPath and 4XSLT in 4Suite woefully. There are aspects of the 4Suite code base that makes it unsuitable for PyXML. For one thing, we only support Python 2.1 and up now. For another, I think we used some C modules that Martin felt were too much to dump into PyXML. 
So it looks like time for a clean fork of the code bases. This means that we need a maintainer for the fork in PyXML. I am happy to help, but I can't do it on my own. So if there is anyone who could work as co-maintainer with me, great. We could maybe even back-port some of the *many* improvements in 4Suite (especially in performance) little by little. Our plans are still to offer up the 4Suite code base once it goes 1.0, but I've given up projecting when that will be. On the flip side, if we're stuck without reliable maintenance, maybe it's better to drop the packages. Anyway, I also think that because of the growing difference between the two code bases, we should rename the set in PyXML. I know I wanted to keep the "4XPath" and "4XSLT" names, but given the increasing likelihood of confusion, I think it's enough to record their provenience in the docs. If this seems like a good idea, how about "pyxpath" and "pyxslt"? Anyway, I think we should decide on these matters before the next release. Things have been up in the air way too long. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A1EA5A2CF4621C386256BBB006F4CEC From martin@v.loewis.de Tue Sep 10 07:36:44 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 10 Sep 2002 08:36:44 +0200 Subject: [XML-SIG] an example of generating XML?
In-Reply-To: <001001c25869$25d3c0c0$0400a8c0@STAMPY> References: <001001c25869$25d3c0c0$0400a8c0@STAMPY> Message-ID: "warren henning" writes: > Just show me the code to create something simple like: > > > > This is a test node. > So is this. > I recommend

print """ This is a test node. So is this. """

If you meant to indicate that "bla" and "This is a test node." come from some data structure, say,

class Node:
    def __init__(self, text, ack = None):
        self.text = text
        self.ack = ack

nodes = [Node("This is a test node.", "bla"), Node("So is this.")]

then I recommend adding a method to class Node

    def as_xml(self, id):
        if self.ack:
            ack = ' ack="%s"' % self.ack
        else:
            ack = ''
        return '%s' % (id, ack, self.text)

print """ """
for i in range(len(nodes)):
    print nodes[i].as_xml(i)
print ""

HTH, Martin From martin@v.loewis.de Tue Sep 10 07:47:29 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 10 Sep 2002 08:47:29 +0200 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: References: Message-ID: Uche Ogbuji writes: > These modules have been broken a good long while, Correct. They have been waiting for the next 4Suite release all the time. > and they lag 4XPath and 4XSLT in 4Suite woefully. They may lag the current implementation, but they don't lag the latest released version that much. > There are aspects of the 4Suite code base that makes it unsuitable > for PyXML. For one thing, we only support Python 2.1 and up now. > For another, I think we used some C modules that Martin felt were > too much to dump into PyXML. Yes, in particular the bison parser modules. > So if there is anyone who could work as co-maintainer with me, great. We > could maybe even back-port some of the *many* improvements in 4Suite > (especially in performance) little by little. I've been planning to move to the 4Suite code base as-is once 0.12 is released, wholesale. I don't think the backporting-to-2.0 issues will be significant, and can be done little by little.
Performing the merging little by little seems to be a waste of time to me. > On the flip side, if we're stuck without reliable maintenance, maybe it's > better to drop the packages. I'm willing to do the merging after 0.12 is released. After that point, another volunteer would be welcome. > If this seems like a good idea, how about "pyxpath" and "pyxslt"? Sounds good to me. > Anyway, I think we should decide on these matters before the next > release. I don't think so. If you mean the next 4Suite release - I wanted to act only after it. If you mean the next PyXML release - it will happen by the end of this week, so I don't think we can do much until then. > Things have been up in the air way too long. Primarily because I've been waiting for 4Suite all the time :-) Notice that inclusion of the current code base in releases was completely on user request. I proposed, once, that the code should merely live in the CVS, and not be released. Because of user protests, we are now releasing this known-to-be-broken code. It appears that it still makes some users happy - admittedly at the expense of causing worries to other users. Regards, Martin From tpassin@comcast.net Tue Sep 10 13:10:18 2002 From: tpassin@comcast.net (Thomas B. Passin) Date: Tue, 10 Sep 2002 08:10:18 -0400 Subject: [XML-SIG] an example of generating XML? References: <001001c25869$25d3c0c0$0400a8c0@STAMPY> <000401c25887$42055470$fe193044@tbp1> Message-ID: <000601c258c3$07656df0$fe193044@tbp1> Sorry, I accidentally cut one essential import line from my example. Add this line: from xml.dom import implementation Tom P > [warren henning] > > > Could someone give a simple example of generating a valid, well-formed XML > > file using PyXML? > > > > Just show me the code to create something simple like: > > > > > > > > This is a test node. > > So is this. > > > > > > I want something that works. > > > > You could be more complete and accurate about what you are asking for.
> Without a DTD, your document cannot be tested for validity. And you do not > need pyXML to create your document - you can just create a string - so I > assume that you really mean to create it using DOM. > > Here is a minimal example that works, though without error handling. It > uses pyXML 0.8 plus the corresponding version of 4Suite, on Windows2000. > This code comes mainly from test_document.py in the xmldoc\test\dom > directory, which is worth reading parts of. It creates enough of your > requested document so you can see how to complete it. > > Cheers, > > Tom P > > from xml.dom import Document > from xml.dom.ext.Printer import PrintWalker,PrintVisitor > > EMPTY_NAMESPACE=None > > def build_doc(): > dt = implementation.createDocumentType('','','') > doc = implementation.createDocument(EMPTY_NAMESPACE,None,dt); > > e = doc.createElement('data') > doc.appendChild(e) > > e2 = doc.createElement('node') > e2.setAttribute('id','1') > e.appendChild(e2) > > return doc > > if __name__=='__main__': > doc=build_doc() > > import sys > visitor=PrintVisitor(sys.stdout,'iso-8859-1',' ') > printer=PrintWalker(visitor,doc) > printer.run() > > > > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://mail.python.org/mailman/listinfo/xml-sig > From tpassin@comcast.net Tue Sep 10 13:19:08 2002 From: tpassin@comcast.net (Thomas B. Passin) Date: Tue, 10 Sep 2002 08:19:08 -0400 Subject: [XML-SIG] an example of generating XML? References: <001001c25869$25d3c0c0$0400a8c0@STAMPY> <000401c25887$42055470$fe193044@tbp1> <20020910050137.GE622@swordfish> Message-ID: <000d01c258c4$435e4100$fe193044@tbp1> [Matt Gushee] > On Tue, Sep 10, 2002 at 01:02:26AM -0400, Thomas B. Passin wrote: > > > Here is a minimal example that works, though without error handling. > > Almost works, you mean? > Yes, I accidentally left off an import statement. 
> > from xml.dom import Document > > from xml.dom.ext.Printer import PrintWalker,PrintVisitor > > > > EMPTY_NAMESPACE=None > > > > def build_doc(): > > dt = implementation.createDocumentType('','','') > > implementation? > It is defined in dom\__init__.py (as illustrated in test_document.py). And the code runs as expected on my system, when the import statement is not cut out. Although, as I actually look at that part of __init__.py, I see that it really likes to set implementation to HTMLDOMImplementation.HTMLDOMImplementation() if it can, otherwise to DOMImplementation.DOMImplementation() Cheers, Tom P From fredrik@pythonware.com Tue Sep 10 13:36:16 2002 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 10 Sep 2002 14:36:16 +0200 Subject: [XML-SIG] an example of generating XML? References: <001001c25869$25d3c0c0$0400a8c0@STAMPY> Message-ID: <038e01c258c6$a97485b0$0900a8c0@spiff> martin wrote:
> then I recommend to add a method to class Node
>
> def as_xml(self, id):
>     if self.ack:
>         ack = ' ack="%s"' % self.ack
>     else:
>         ack = ''
>     return '%s' % (id, ack, self.text)
...and pray that nobody will ever pass in a string containing reserved xml characters, or non-ascii data, or a unicode string... From fdrake@acm.org Tue Sep 10 14:29:47 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 10 Sep 2002 09:29:47 -0400 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: References: Message-ID: <15741.62411.786430.704756@grendel.zope.com> Martin v. Loewis writes: > Notice that inclusion of the current code base in releases was > completely on user request. I proposed, once, that the code should > merely live in the CVS, and not be released. Because of user protests, > we are now releasing this known-to-be-broken code. It appears that it > still makes some users happy - admittedly at the expense of causing > worries to other users. Perhaps these should only be installed on user request?
The current setup.py includes them by default; perhaps we should change that to deal with the known-broken situation. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From fdrake@acm.org Tue Sep 10 14:45:36 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 10 Sep 2002 09:45:36 -0400 Subject: [XML-SIG] PyXML XPath and XSLT In-Reply-To: References: <15741.31081.119000.971214@grendel.zope.com> Message-ID: <15741.63360.727334.811891@grendel.zope.com> Uche Ogbuji writes: > I think we can remove the 4Suite category from the PyXML bug roster. Now > 4Suite has an SF project and bug tracker of its own. We can't actually remove a category, but I've renamed it from "4Suite" to "4Suite (inactive)", and I've added notes to the "Submit New" pages for the Bugs and Patches trackers directing submitters to the 4Suite project on SourceForge for matters relating to that project. > But anyway, looking at those bugs, some of them are actually bugs about the > 4XPath and 4XSLT in PyXML. Please feel free to re-categorize as appropriate. > Anyway, I also think that because of the growing difference between > the two code bases, that we should rename the set in PyXML. I know > I wanted to keep the "4XPath" and "4XSLT" names, but given the > increasing likelihood of confusion, I think it's enough to record > their provenience int he docs. Do you think the package names should be changed, or just the human-readable name? I'm happy to see the human name change if that's what you want. I'd be less happy about changes that affect import statements. > If this seems like a good idea, how about "pyxpath" and "pyxslt"? If we're talking about names-for-humans, how about PyXPath and PyXSLT? > Anyway, I think we should decide on these matters before the next release. > Things have been up in the air way too long. Too long, yes, but that's not a good reason to rush a decision in the next couple of days. 
When 4Suite 0.12 comes out and Martin updates the code in PyXML, we can have another release with any needed documentation updates we need to make. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From Matt Gushee Tue Sep 10 16:28:21 2002 From: Matt Gushee (Matt Gushee) Date: Tue, 10 Sep 2002 09:28:21 -0600 Subject: [XML-SIG] an example of generating XML? In-Reply-To: <000d01c258c4$435e4100$fe193044@tbp1> References: <001001c25869$25d3c0c0$0400a8c0@STAMPY> <000401c25887$42055470$fe193044@tbp1> <20020910050137.GE622@swordfish> <000d01c258c4$435e4100$fe193044@tbp1> Message-ID: <20020910152821.GA846@swordfish> On Tue, Sep 10, 2002 at 08:19:08AM -0400, Thomas B. Passin wrote: > > > > > Here is a minimal example that works, though without error handling. > > > > Almost works, you mean? > > > > Yes, I accidentally left off an import statement. Leaving me to provide it? Alright, if you insist: from xml.dom import implementation > > implementation? > > > > It is defined in dom\__init__.py (as illustrated in test_document.py). And > the code runs as expected on my system, when the import statement isn not > cut out. Although, as I actually look at that part of __init__.py, I see > that it really likes to set implementation to > > HTMLDOMImplementation.HTMLDOMImplementation() Yes, I've noticed that ... and I suppose there must be a good reason, though I can't imagine what. That's why I generally use from xml.dom import getDOMImplementation impl = getDOMImplementation() which (at least on my system) returns an XML DOM parser, usually minidom. 
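A minimal sketch of that approach, written for Python 3 and assuming the standard library's minidom is what getDOMImplementation() hands back (as it usually does when nothing else is registered). The element names 'data' and 'node' follow Tom's example; the helper name build_doc and its texts parameter are hypothetical, chosen only for illustration.

```python
from xml.dom import getDOMImplementation


def build_doc(texts):
    """Build a <data> document with one <node> child per string in texts."""
    # With no arguments this returns whichever DOM implementation is
    # registered; in a plain standard-library install that is minidom.
    impl = getDOMImplementation()
    # Namespace None, root element 'data', no DOCTYPE.
    doc = impl.createDocument(None, "data", None)
    root = doc.documentElement
    for number, text in enumerate(texts, 1):
        node = doc.createElement("node")
        node.setAttribute("id", str(number))
        # Text goes in through the DOM, so reserved characters such as
        # '&' and '<' are escaped when the document is serialized.
        node.appendChild(doc.createTextNode(text))
        root.appendChild(node)
    return doc


if __name__ == "__main__":
    doc = build_doc(["This is a test node.", "So is this."])
    print(doc.toprettyxml(indent="  "))
```

Because the text is added with createTextNode() rather than pasted into a string template, a title like "B&B" is escaped on output instead of producing a broken document.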
-- Matt Gushee Englewood, Colorado, USA mgushee@havenrock.com http://www.havenrock.com/ From BudP.Bruegger Tue Sep 10 16:46:36 2002 From: BudP.Bruegger (BudP.Bruegger) Date: Tue, 10 Sep 2002 17:46:36 +0200 Subject: ANN: ezex 0.1--an xml shorthand in python; was: [XML-SIG] ANN: SLiP and SLIDE - a quick XML shorthand syntax and tool for editing In-Reply-To: References: <20020819192819.630dce5b.bud@sistema.it> Message-ID: <20020910174636.7f1847e7.bud@sistema.it> [*** Fasttrack ***: look at the example at the end of the message] On Thu, 05 Sep 2002 09:29:06 -0600 Uche Ogbuji wrote: [SNIP] > Your post sounds interesting, and apparently a lot of work has gone into your > ideas. Some brief examples would be helpful as I'm trying to get a sense of > your ideas quickly. Uche and all: I have since implemented a prototype of my xml shorthand ideas and created some illustrative examples. You can find it all at http://www.sistema.it/ezex/ The name "ezex" tries to convey that it is an easy (ez) syntax for xml (ex). Suggestions for better names are very welcome. As I mentioned before, the approach has some similarities with SOX (particularly in data mode) and with PYX (particularly in document mode). On the server: ezex.0.1.alpha.py is the code. In the examples e1 through e4, the file without extension (e1) is the ezex source and the one with the .xml extension is the output produced with the prototype. The examples illustrate the following: * e1 gives an example for using ezex in "data mode" where leading and trailing whitespace does not matter and whitespace is extensively used for a layout that optimizes readability. (Data mode is selected by setting useIndent to 1). The similarities to SOX are obvious, even if the syntax is slightly different and SOX allows more "shortcuts" * e2 gives an example for using ezex in "document mode" where whitespace matters and mixed content is common. 
While a little more cumbersome to write than data mode, note that the author has complete control over whitespace. This is very similar to PYX. [Note: a subset of ezex that avoids the use of multi-line {text, comments}, as well as elements with text content on the same line, can be as easily analyzed with grep as can PYX. It would be easy to implement this option in an xml to ezex converter.] * e3 is a very minimal example for the extensibility of ezex. Here, a simplistic custom parser for lists was implemented. More complex examples could include: - tables - nested lists - structured text (reStructuredText, structuredTextNG, some wiki stuff) - raw (for example for including xml syntax) - include (to add and parse the content of an external file) - csv (to parse a csv table into an xml table) - e-mail (that parses e-mail format into xml) - sh (that includes the result of some sh command such as ls) As the example suggests, it is quite straight forward to write simple custom parsers or to incorporate existing parsers (structured text). Since it is basically a call to a python function, it is also easy to define "pipelines" of processing. For example, it should be easy to import a csv file and then parse it to create an xml table representation. * e4 illustrates the syntax used in more detail giving examples for all options. * e5 illustrates the use of namespaces. While ezex is not actually aware of ns, it makes it quite straight forward to declare them and to use prefixed in element and attribute names. I'm looking forward to your comments and suggestions! kind regards --bud -------------------- simple example, ezex input ------------------------------- ?xml version="1.0" encoding="UTF-8" !mydoctype "some elaborate stuff here"

123 Sesame Street Wonderland CA 90012 Please leave packages with Grouch in garbage can next door.
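[Editorial sketch of the csv pipeline idea mentioned in the message above: import a csv file, then turn it into an xml table. This does not use ezex or its custom-parser hook, whose API is not shown in the message; it is plain standard-library Python, and the element names "table", "row" and "cell" are assumptions made only for illustration.]

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_table(text):
    """Build a <table> element with one <row> per csv record."""
    table = ET.Element("table")  # element names are illustrative assumptions
    for record in csv.reader(io.StringIO(text)):
        row = ET.SubElement(table, "row")
        for value in record:
            # ElementTree escapes reserved characters such as "&" on output.
            ET.SubElement(row, "cell").text = value
    return table


sample = "name,qty\nB&B,2\n"
xml_text = ET.tostring(csv_to_table(sample), encoding="unicode")
print(xml_text)
```

[Because each stage is an ordinary function call, a second stage (for example one that rewrites the table into another vocabulary) could be chained after csv_to_table in the same way.]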
From Mike.Olson@fourthought.com Tue Sep 10 16:48:12 2002 From: Mike.Olson@fourthought.com (Mike Olson) Date: 10 Sep 2002 09:48:12 -0600 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: References: Message-ID: <1031672903.31754.4.camel@penny> On Mon, 2002-09-09 at 23:19, Uche Ogbuji wrote: > > Anyway, I also think that because of the growing difference between the two > code bases, that we should rename the set in PyXML. I know I wanted to keep > the "4XPath" and "4XSLT" names, but given the increasing likelihood of > confusion, I think it's enough to record their provenience int he docs. > > If this seems like a good idea, how about "pyxpath" and "pyxslt"? I'm all for this. What I would like to see then in xml.xpath and xml.xslt is smart import logic similar that in the default xml. So, if a user tries to import xml.xslt.processor it will first look to see if Ft.Xml.Xslt.Processor is available, and if not, then try xml.pyxslt.Processor. Or, any other of the xslt processors that people are proposing. This means a common (or close to common) set of interfaces on Xslt parsers but that shouldn't be too hard to come up with. Mike > > Anyway, I think we should decide on these matters before the next release. > Things have been up in the air way too long. > > > -- > Uche Ogbuji Fourthought, Inc. 
> http://uche.ogbuji.net http://4Suite.org http://fourthought.com > Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ > Basic XML and RDF techniques for knowledge management, Part 7 - > http://www-106.ibm.com/developerworks/xml/library/x-think12.html > Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra > ry/x-jclark.html > Python and XML development using 4Suite, Part 3: 4RDF - > http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A > 1EA5A2CF4621C386256BBB006F4CEC > > > > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://mail.python.org/mailman/listinfo/xml-sig -- Mike Olson Principal Consultant mike.olson@fourthought.com +1 303 583 9900 x 102 Fourthought, Inc. http://Fourthought.com PO Box 270590, http://4Suite.org Louisville, CO 80027-5009, USA XML strategy, XML tools, knowledge management From uche.ogbuji@fourthought.com Tue Sep 10 17:42:54 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 10 Sep 2002 10:42:54 -0600 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: Message from martin@v.loewis.de (Martin v. Loewis) of "10 Sep 2002 08:47:29 +0200." Message-ID: > Uche Ogbuji writes: > > > These modules have been broken a good long while, > > Correct. They have been waiting for the next 4Suite release all the > time. > > > and they lag 4XPath and 4XSLT in 4Suite woefully. > > They may lag the current implementation, but they don't lag the latest > released version that much. > > > There are aspects of the 4Suite code base that makes it unsuitable > > for PyXML. For one thing, we only support Python 2.1 and up now. > > For another, I think we used some C modules that Martin felt were > > too much to dump into PyXML. > > Yes, in particular the bison parser modules. 
There is even more that has moved to C recently in 4Suite, so we'll certainly want to keep in mind general principles about what we want to keep in Python in the PyXPath/PyXSLT versions. I do like the idea of keeping them mostly Python for max cross-platform support. > > So if there is anyone who could work as co-maintainer with me, great. We > > could maybe even back-port some of the *many* improvements in 4Suite > > (especially in performance) little by little. > > I've been planning to move to the 4Suite code base as-is once 0.12 is > released, wholesale. I don't think the backporting-to-2.0 issues will > be significant, and can be done little by little. Performing the > merging little by little seems to be a waste of time to me. OK, then. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A1EA5A2CF4621C386256BBB006F4CEC From uche.ogbuji@fourthought.com Tue Sep 10 17:46:51 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 10 Sep 2002 10:46:51 -0600 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: Message from "Fred L. Drake, Jr." of "Tue, 10 Sep 2002 09:29:47 EDT." <15741.62411.786430.704756@grendel.zope.com> Message-ID: > > Martin v. Loewis writes: > > Notice that inclusion of the current code base in releases was > > completely on user request. I proposed, once, that the code should > > merely live in the CVS, and not be released. Because of user protests, > > we are now releasing this known-to-be-broken code.
It appears that it > > still makes some users happy - admittedly at the expense of causing > > worries to other users. > > Perhaps these should only be installed on user request? The current > setup.py includes them by default; perhaps we should change that to > deal with the known-broken situation. I would agree with this while they remain broken. Once they're working, we'd re-enable them by default. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A1EA5A2CF4621C386256BBB006F4CEC From uche.ogbuji@fourthought.com Tue Sep 10 17:50:58 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 10 Sep 2002 10:50:58 -0600 Subject: [XML-SIG] PyXML XPath and XSLT In-Reply-To: Message from "Fred L. Drake, Jr." of "Tue, 10 Sep 2002 09:45:36 EDT." <15741.63360.727334.811891@grendel.zope.com> Message-ID: > > Anyway, I also think that because of the growing difference between > > the two code bases, that we should rename the set in PyXML. I know > > I wanted to keep the "4XPath" and "4XSLT" names, but given the > > increasing likelihood of confusion, I think it's enough to record > > their provenience int he docs. > > Do you think the package names should be changed, or just the > human-readable name? I'm happy to see the human name change if that's > what you want. I'd be less happy about changes that affect import > statements. I was meaning the human readable names. But now that you mention it :-) Maybe it would be less confusing to use xml.pyxpath and xml.pyxslt. 
I like it because the idea behind these modules as distinct from other XPath/XSLT for Python is that they are implemented all in Python, except the boolean extension, IIRC. Of course, all we have to do is require Python 2.3 and we can remove the boolean extension . Aside: Python 2.3 will also bring us sets. Yes! Yes! Hallelujah! Finally! :-) > > If this seems like a good idea, how about "pyxpath" and "pyxslt"? > > If we're talking about names-for-humans, how about PyXPath and PyXSLT? Sure. > > Anyway, I think we should decide on these matters before the next release. > > Things have been up in the air way too long. > > Too long, yes, but that's not a good reason to rush a decision in the > next couple of days. When 4Suite 0.12 comes out and Martin updates > the code in PyXML, we can have another release with any needed > documentation updates we need to make. OK. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra ry/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A 1EA5A2CF4621C386256BBB006F4CEC From uche.ogbuji@fourthought.com Tue Sep 10 17:57:07 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 10 Sep 2002 10:57:07 -0600 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: Message from Mike Olson of "10 Sep 2002 09:48:12 MDT." <1031672903.31754.4.camel@penny> Message-ID: > On Mon, 2002-09-09 at 23:19, Uche Ogbuji wrote: > > > > Anyway, I also think that because of the growing difference between the two > > code bases, that we should rename the set in PyXML. 
I know I wanted to keep > > the "4XPath" and "4XSLT" names, but given the increasing likelihood of > > confusion, I think it's enough to record their provenience int he docs. > > > > If this seems like a good idea, how about "pyxpath" and "pyxslt"? > > > I'm all for this. What I would like to see then in xml.xpath and > xml.xslt is smart import logic similar that in the default xml. I think this is a good idea. After all, if they can import the 4XPath/4XSLT versions, they probably want to use those (faster, more features, etc.) > So, if a user tries to import xml.xslt.processor it will first look to > see if Ft.Xml.Xslt.Processor is available, and if not, then try > xml.pyxslt.Processor. Or, any other of the xslt processors that people > are proposing. > > This means a common (or close to common) set of interfaces on Xslt > parsers but that shouldn't be too hard to come up with. Luckily, the only interfaces to the outside for the general programmer are Processor InputSource(Factory) Context, XsltContext Compile, Match, Evaluate These should be easy to standardize. I think the bigger roadblock might be the extension APIs, since those tend to be close to internals. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra ry/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A 1EA5A2CF4621C386256BBB006F4CEC From martin@v.loewis.de Tue Sep 10 17:55:47 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 10 Sep 2002 18:55:47 +0200 Subject: [XML-SIG] an example of generating XML? 
In-Reply-To: <038e01c258c6$a97485b0$0900a8c0@spiff> References: <001001c25869$25d3c0c0$0400a8c0@STAMPY> <038e01c258c6$a97485b0$0900a8c0@spiff> Message-ID: "Fredrik Lundh" writes: > ...and pray that nobody will ever pass in a string containing reserved > xml characters, or non-ascii data, or a unicode string... There might be no need to pray, depending on your application. Regards, Martin From mclay@nist.gov Tue Sep 10 18:37:28 2002 From: mclay@nist.gov (Michael McLay) Date: Tue, 10 Sep 2002 13:37:28 -0400 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: References: Message-ID: <200209101337.28036.mclay@nist.gov> On Tuesday 10 September 2002 12:42 pm, Uche Ogbuji wrote: > > Yes, in particular the bison parser modules. > > There is even more that has moved to C recently in 4Suite, so we'll > certainly want to keep in mind general principles about what we want to > keep in Python in the PyXPath/PyXSLT versions. I do like the idea of > keeping them mostly Python for max cross-platform support. Isn't it just the portability of the specific C code that is at issue? The bison dependency is troublesome for porting to the Mac and Windows. Most everyone would prefer the speed of C if it is available and it compiles without a hitch. From fdrake@acm.org Tue Sep 10 19:11:40 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 10 Sep 2002 14:11:40 -0400 Subject: [XML-SIG] Can anyone recommend a sensible XML parser for Python? In-Reply-To: References: <15741.30679.845057.285100@grendel.zope.com> Message-ID: <15742.13788.298950.751589@grendel.zope.com> Uche Ogbuji writes: > So you think it should do what I mentioned before? > > 1) Create a new documentType and document node > 2) clone all child nodes > 3) set the ownerDocument of each of the new nodes to the new document? If deep==True, yes. See table below for deep==False.
> If we have it do that, then let us please > > 1) Document it properly > 2) Point out that it is not standard DOM behavior I'm glad to document it carefully; that's entirely reasonable. It certainly falls within the space of "implementation dependent", which the DOM spec says this is. I think this is the right set of behaviors:

        \  cloneNode(0) | cloneNode(1) | importNode(n,0) | importNode(n,1)
nodeType \              |              |                 |
         +--------------|--------------|-----------------|-----------------
document | return None  | new document | NotSupportedErr | NotSupportedErr
         +--------------|--------------|-----------------|-----------------
doctype  | new doctype, | new doctype, | new doctype if  | new doctype if
         | no entities  | w/ entities  | new parent has  | new parent has
         | or notations | and notations| doctype==None,  | doctype==None,
         |              |              | else NotSuppErr,| w/ entities
         |              |              | no entities or  | and notations
         |              |              | notations       |

Document.cloneNode(0) returns None since it's not allowed to raise an exception according to the DOM spec. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From martin@v.loewis.de Tue Sep 10 19:31:12 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 10 Sep 2002 20:31:12 +0200 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: References: Message-ID: Uche Ogbuji writes: > I would agree with this while they remain broken. Once they're > working, we'd re-enable them by default. Good idea, done! Martin From martin@v.loewis.de Tue Sep 10 19:33:54 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 10 Sep 2002 20:33:54 +0200 Subject: [XML-SIG] PyXML XPath and XSLT In-Reply-To: References: Message-ID: Uche Ogbuji writes: > Maybe it would be less confusing to use xml.pyxpath and xml.pyxslt. I would not like that. There is enough code and documentation that uses it this way, and it is the most natural way.
Also, people might be requesting XPath support in Python core at some day - at which time I'd like to propose integration of PyXPath (apparently, pressure to use XSLT programmatically is not that high). Regards, Martin From martin@v.loewis.de Tue Sep 10 19:36:14 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 10 Sep 2002 20:36:14 +0200 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: <200209101337.28036.mclay@nist.gov> References: <200209101337.28036.mclay@nist.gov> Message-ID: Michael McLay writes: > Isn't it just the portablity of the specific C code that is at issue? Correct. I'll look at the code when 0.12 is released, and I'm certainly open to building more C modules - we built the boolean module already to satisfy XPath/XSLT. Regards, Martin From fdrake@acm.org Tue Sep 10 19:47:30 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 10 Sep 2002 14:47:30 -0400 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: References: <200209101337.28036.mclay@nist.gov> Message-ID: <15742.15938.727755.807823@grendel.zope.com> Martin v. Loewis writes: > Correct. I'll look at the code when 0.12 is released, and I'm > certainly open to building more C modules - we built the boolean > module already to satisfy XPath/XSLT. Speaking of the boolean module... is there any reason not to change it to use the built-in bool type under Python 2.3 and newer? Or just replacing it with a Python module for 2.3? -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From brian@sweetapp.com Tue Sep 10 19:54:10 2002 From: brian@sweetapp.com (Brian Quinlan) Date: Tue, 10 Sep 2002 11:54:10 -0700 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suitebug reports In-Reply-To: <1031672903.31754.4.camel@penny> Message-ID: <000f01c258fb$72bd2830$df7e4e18@brianspiv1700> > I'm all for this. 
What I would like to see then in xml.xpath and > xml.xslt is smart import logic similar that in the default xml. > > So, if a user tries to import xml.xslt.processor it will first look to > see if Ft.Xml.Xslt.Processor is available, and if not, then try > xml.pyxslt.Processor. Or, any other of the xslt processors that people > are proposing. Please don't do this. The smart import used by PyXML has caused me serious problems in the past. Unless the two libraries are going to be bug for bug compatible, then I'd like to be able choose myself. Cheers, Brian From fdrake@acm.org Tue Sep 10 20:10:09 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 10 Sep 2002 15:10:09 -0400 Subject: [XML-SIG] [Q] xml.utils.boolean C API Message-ID: <15742.17297.374531.327438@grendel.zope.com> The C API to the boolean module, expressed in extensions/boolean.h, does not appear to be used. It certainly can't be used as-is when the module is built to be dynamically loaded. The boolean_new() function it references is not present in boolean.c at all. Is there any reason this header is used, or can the needed information simply be merged into the implementation? The implementation also uses DL_EXPORT for a few things that should be static, and almost nobody outside the Python core needs to use the tp_print slot; the use made by this code is certainly avoidable in efficient ways. Uche, Mike, et al.: any objection to a little cleanup here? Thanks! -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From uche.ogbuji@fourthought.com Tue Sep 10 20:30:35 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 10 Sep 2002 13:30:35 -0600 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: Message from Michael McLay of "Tue, 10 Sep 2002 13:37:28 EDT." <200209101337.28036.mclay@nist.gov> Message-ID: > On Tuesday 10 September 2002 12:42 pm, Uche Ogbuji wrote: > > > > Yes, in particular the bison parser modules. 
> > > > There is even more that has moved to C recently in 4Suite, so we'll > > certainly want to keep in mind general principles about what we want to > > keep in Python in the PyXPath/PyXSLT versions. I do like the idea of > > keeping them mostly Python for max cross-platform support. > > Isn't it just the portability of the specific C code that is at issue? The > bison dependency is troublesome for porting to the Mac and Windows. Most > everyone would prefer the speed of C if it is available and it compiles > without a hitch. Bison is part of it. Another problem is that not everyone on every platform has a C compiler in order to build such extensions. Isn't it enough to say "if you want maximum velocity, install 4Suite or libxslt or Pyana"? -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A1EA5A2CF4621C386256BBB006F4CEC From uche.ogbuji@fourthought.com Tue Sep 10 20:33:25 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 10 Sep 2002 13:33:25 -0600 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: Message from "Fred L. Drake, Jr." of "Tue, 10 Sep 2002 14:47:30 EDT." <15742.15938.727755.807823@grendel.zope.com> Message-ID: > > Martin v. Loewis writes: > > Correct. I'll look at the code when 0.12 is released, and I'm > > certainly open to building more C modules - we built the boolean > > module already to satisfy XPath/XSLT. > > Speaking of the boolean module...
is there any reason not to change it > to use the built-in bool type under Python 2.3 and newer? Or just > replacing it with a Python module for 2.3? I like this idea. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra ry/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A 1EA5A2CF4621C386256BBB006F4CEC From uche.ogbuji@fourthought.com Tue Sep 10 20:36:50 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 10 Sep 2002 13:36:50 -0600 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suitebug reports In-Reply-To: Message from Brian Quinlan of "Tue, 10 Sep 2002 11:54:10 PDT." <000f01c258fb$72bd2830$df7e4e18@brianspiv1700> Message-ID: > > I'm all for this. What I would like to see then in xml.xpath and > > xml.xslt is smart import logic similar that in the default xml. > > > > So, if a user tries to import xml.xslt.processor it will first look to > > see if Ft.Xml.Xslt.Processor is available, and if not, then try > > xml.pyxslt.Processor. Or, any other of the xslt processors that > people > > are proposing. > > Please don't do this. The smart import used by PyXML has caused me > serious problems in the past. You mean _xmlplus? I think this is a very different matter. > Unless the two libraries are going to be bug for bug compatible, then > I'd like to be able choose myself. But you can always do this, right? You can just directly import what you want. This is different from _xmlplus which *masks* the overriden modules. Under Mike's proposal, all the various modules would still be available for explicit import. 
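[Editorial sketch of the fallback import discussed above: try the preferred implementation first, fall back to the next one, and leave every module available for explicit import. The helper name load_first_available is hypothetical, not part of PyXML or 4Suite, and the candidate module paths are simply the ones named in the thread; the last entry is a standard-library stand-in so the sketch runs where neither package is installed.]

```python
import importlib


def load_first_available(candidates):
    """Return the first module in `candidates` that imports cleanly."""
    failures = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember why this candidate was skipped, then try the next one.
            failures.append("%s (%s)" % (name, exc))
    raise ImportError("no candidate could be imported: " + "; ".join(failures))


# Preference order taken from the thread, plus a stand-in that always exists.
CANDIDATES = ["Ft.Xml.Xslt.Processor", "xml.pyxslt.Processor", "xml.dom.minidom"]

chosen = load_first_available(CANDIDATES)
print("using", chosen.__name__)
```

[Unlike the _xmlplus mechanism, nothing is masked here: a caller who wants one specific implementation can still import it by its own name and skip the helper.]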
-- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A1EA5A2CF4621C386256BBB006F4CEC From uche.ogbuji@fourthought.com Tue Sep 10 20:38:42 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 10 Sep 2002 13:38:42 -0600 Subject: [XML-SIG] [Q] xml.utils.boolean C API In-Reply-To: Message from "Fred L. Drake, Jr." of "Tue, 10 Sep 2002 15:10:09 EDT." <15742.17297.374531.327438@grendel.zope.com> Message-ID: > > The C API to the boolean module, expressed in extensions/boolean.h, > does not appear to be used. It certainly can't be used as-is when the > module is built to be dynamically loaded. The boolean_new() function > it references is not present in boolean.c at all. > > Is there any reason this header is used, or can the needed information > simply be merged into the implementation? > > The implementation also uses DL_EXPORT for a few things that should be > static, and almost nobody outside the Python core needs to use the > tp_print slot; the use made by this code is certainly avoidable in > efficient ways. > > Uche, Mike, et al.: any objection to a little cleanup here? Not from me. I'd actually probably study a patch for possible back-porting to 4Suite. -- Uche Ogbuji Fourthought, Inc. 
http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A1EA5A2CF4621C386256BBB006F4CEC From fdrake@acm.org Tue Sep 10 20:34:03 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 10 Sep 2002 15:34:03 -0400 Subject: [XML-SIG] PyXML XPath and XSLT In-Reply-To: References: <15741.63360.727334.811891@grendel.zope.com> Message-ID: <15742.18731.32838.456294@grendel.zope.com> Uche Ogbuji writes: > Maybe it would be less confusing to use xml.pyxpath and xml.pyxslt. > I like it because the idea behind these modules as distinct from > other XPath/XSLT for Python is that they are implemented all in > Python, except the boolean extension, IIRC. Not a good reason to change the import names. > Of course, all we have to do is require Python 2.3 and we can > remove the boolean extension . And do you realize just how enticing that is? ;-) -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From brian@sweetapp.com Tue Sep 10 21:03:45 2002 From: brian@sweetapp.com (Brian Quinlan) Date: Tue, 10 Sep 2002 13:03:45 -0700 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suitebug reports In-Reply-To: Message-ID: <001601c25905$2b29bf60$df7e4e18@brianspiv1700> > You mean _xmlplus? I think this is a very different matter. OK. > But you can always do this, right? You can just directly import what you > want. This is different from _xmlplus which *masks* the overriden > modules. Under Mike's proposal, all the various modules would still be > available for explicit import. 
So xml.pyxslt.Processor will remain the same if 4Suite is installed? That's fine then. Cheers, Brian From uche.ogbuji@fourthought.com Tue Sep 10 21:12:17 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 10 Sep 2002 14:12:17 -0600 Subject: [XML-SIG] PyXML XPath and XSLT In-Reply-To: Message from "Fred L. Drake, Jr." of "Tue, 10 Sep 2002 15:34:03 EDT." <15742.18731.32838.456294@grendel.zope.com> Message-ID: > > Uche Ogbuji writes: > > Maybe it would be less confusing to use xml.pyxpath and xml.pyxslt. > > I like it because the idea behind these modules as distinct from > > other XPath/XSLT for Python is that they are implemented all in > > Python, except the boolean extension, IIRC. > > Not a good reason to change the import names. OK. xml.xpath and xml.xslt it is, then. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A1EA5A2CF4621C386256BBB006F4CEC From uche.ogbuji@fourthought.com Tue Sep 10 21:28:17 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 10 Sep 2002 14:28:17 -0600 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suitebug reports In-Reply-To: Message from Brian Quinlan of "Tue, 10 Sep 2002 13:03:45 PDT." <001601c25905$2b29bf60$df7e4e18@brianspiv1700> Message-ID: > > You mean _xmlplus? I think this is a very different matter. > > OK. > > > But you can always do this, right? You can just directly import what you > > want. This is different from _xmlplus which *masks* the overridden > > modules.
Under Mike's proposal, all the various modules would still be > > available for explicit import. > > So xml.pyxslt.Processor will remain the same if 4suite is installed? Yes. Except the decision is now xml.xslt.Processor. > That's fine then. Cool. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra ry/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A 1EA5A2CF4621C386256BBB006F4CEC From martin@v.loewis.de Tue Sep 10 21:28:35 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 10 Sep 2002 22:28:35 +0200 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: <15742.15938.727755.807823@grendel.zope.com> References: <200209101337.28036.mclay@nist.gov> <15742.15938.727755.807823@grendel.zope.com> Message-ID: "Fred L. Drake, Jr." writes: > Speaking of the boolean module... is there any reason not to change it > to use the built-in bool type under Python 2.3 and newer? Or just > replacing it with a Python module for 2.3? Sure - but can't that wait until Python 2.3 is released :-? Also, make sure not to use the Python 2.2 fallback... Regards, Martin From fdrake@acm.org Tue Sep 10 21:31:29 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 10 Sep 2002 16:31:29 -0400 Subject: [XML-SIG] PyXML XPath and XSLT In-Reply-To: References: <200209101337.28036.mclay@nist.gov> <15742.15938.727755.807823@grendel.zope.com> Message-ID: <15742.22177.671657.899225@grendel.zope.com> Martin v. Loewis writes: > Sure - but can't that wait until Python 2.3 is released :-? Not clear that makes much difference. 
I won't get to it today, at any rate. > Also, make sure not to use the Python 2.2 fallback... Yeah, I think you're right on this. ;-( -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From uche.ogbuji@fourthought.com Tue Sep 10 23:03:56 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 10 Sep 2002 16:03:56 -0600 Subject: [XML-SIG] Can anyone recommend a sensible XML parser for Python? In-Reply-To: Message from "Fred L. Drake, Jr." of "Tue, 10 Sep 2002 14:11:40 EDT." <15742.13788.298950.751589@grendel.zope.com> Message-ID: > > I think this is the right set of behaviors:

>         \  cloneNode(0) | cloneNode(1) | importNode(n,0) | importNode(n,1)
> nodeType \              |              |                 |
>          +--------------|--------------|-----------------|-----------------
> document | return None  | new document | NotSupportedErr | NotSupportedErr
>          +--------------|--------------|-----------------|-----------------
> doctype  | new doctype, | new doctype, | new doctype if  | new doctype if
>          | no entities  | w/ entities  | new parent has  | new parent has
>          | or notations | and notations| doctype==None,  | doctype==None,
>          |              |              | else NotSuppErr,| w/ entities
>          |              |              | no entities or  | and notations
>          |              |              | notations       |

> > Document.cloneNode(0) returns None since it's not allowed to raise an > exception according to the DOM spec. I take it Document.cloneNode(0) returns None because we don't want to clone the documentElement? Reinforces my whole sketchy feeling about this. But I have no more specific complaint about the table. I think the "new document" row should be more detailed, especially since that's where I think a lot of the surprising details lie. Basically what I wrote above, it seems. -- Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra ry/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A 1EA5A2CF4621C386256BBB006F4CEC From fdrake@acm.org Tue Sep 10 23:04:35 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 10 Sep 2002 18:04:35 -0400 Subject: [XML-SIG] Can anyone recommend a sensible XML parser for Python? In-Reply-To: References: <15742.13788.298950.751589@grendel.zope.com> Message-ID: <15742.27763.348099.275603@grendel.zope.com> Uche Ogbuji writes: > I take it Document.cloneNode(0) returns None because we don't want to clone > the documentElement? I really can't figure out whether it should include the doctype and document element or not. The right answer depends on whether the application is living in a DOM 1/2 world or a DOM 3 world, but... not necessarily just that. > Reinforces my whole sketchy feeling about this. > > But I have no more specific complaint about the table. I think the > "new document" row should be more detailed, especally since that's > where I think a lot of the surprising details lie. Basically what > I wrote above, it seems. What information do you think is missing from the "new document" slots? -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From uche.ogbuji@fourthought.com Tue Sep 10 23:28:02 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 10 Sep 2002 16:28:02 -0600 Subject: [XML-SIG] Can anyone recommend a sensible XML parser for Python? In-Reply-To: Message from "Fred L. Drake, Jr." of "Tue, 10 Sep 2002 18:04:35 EDT." 
<15742.27763.348099.275603@grendel.zope.com> Message-ID: > > Uche Ogbuji writes: > > I take it Document.cloneNode(0) returns None because we don't want to clone > > the documentElement? > > I really can't figure out whether it should include the doctype and > document element or not. The right answer depends on whether the > application is living in a DOM 1/2 world or a DOM 3 world, but... not > necessarily just that. > > > Reinforces my whole sketchy feeling about this. > > > > But I have no more specific complaint about the table. I think the > > "new document" row should be more detailed, especally since that's > > where I think a lot of the surprising details lie. Basically what > > I wrote above, it seems. > > What information do you think is missing from the "new document" > slots? The fact that the new document has all the same nodes as the original, but with a modified ownerDocument value. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra ry/x-jclark.html Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A 1EA5A2CF4621C386256BBB006F4CEC From fdrake@acm.org Tue Sep 10 23:32:06 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 10 Sep 2002 18:32:06 -0400 Subject: [XML-SIG] Can anyone recommend a sensible XML parser for Python? In-Reply-To: References: <15742.27763.348099.275603@grendel.zope.com> Message-ID: <15742.29414.953681.688127@grendel.zope.com> Uche Ogbuji writes: > The fact that the new document has all the same nodes as the original, but > with a modified ownerDocument value. I'll make sure that's explicit in the documentation. 
I'll try to wrap this up sometime tonight. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From tpassin@comcast.net Tue Sep 10 23:45:26 2002 From: tpassin@comcast.net (Thomas B. Passin) Date: Tue, 10 Sep 2002 18:45:26 -0400 Subject: [XML-SIG] PyXML XPath and XSLT References: Message-ID: <002401c2591b$c1c1d550$fe193044@tbp1> [Martin v. Loewis] > Also, people might be requesting XPath support in Python core at some > day - at which time I'd like to propose integration of PyXPath > (apparently, pressure to use XSLT programmatically is not that high). I use 4xslt programatically. I want to keep on doing so. Cheers, Tom P From fdrake@acm.org Wed Sep 11 05:08:02 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Wed, 11 Sep 2002 00:08:02 -0400 Subject: [XML-SIG] test_c14n & default configuration Message-ID: <15742.49570.838626.753296@grendel.zope.com> Now that the default configuration does not include xml.xpath, the test/test_c14n.py test script fails in the default configuration. There doesn't seem to be a test of the XPath stuff, however. Does anyone know if xml.xpath is actually broken in the current CVS, or is it just xml.xslt? I don't use that enough to really know the current state of the package. If xml.xpath is really broken, should the c14n tests be modified to avoid using the xml.xpath package? I hate to have a misleading test result. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From martin@v.loewis.de Wed Sep 11 07:25:01 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 11 Sep 2002 08:25:01 +0200 Subject: [XML-SIG] test_c14n & default configuration In-Reply-To: <15742.49570.838626.753296@grendel.zope.com> References: <15742.49570.838626.753296@grendel.zope.com> Message-ID: "Fred L. Drake, Jr." writes: > There doesn't seem to be a test of the XPath stuff, however. Does > anyone know if xml.xpath is actually broken in the current CVS, or is > it just xml.xslt? Last I tried, xml.xpath would not pass all of the 4Suite test suite. 
But it is largely working, much better so than xml.xslt.

Regards,
Martin

From Alexandre.Fayolle@logilab.fr Wed Sep 11 08:21:01 2002
From: Alexandre.Fayolle@logilab.fr (Alexandre)
Date: Wed, 11 Sep 2002 09:21:01 +0200
Subject: [XML-SIG] PyXML XPath and XSLT
In-Reply-To: <002401c2591b$c1c1d550$fe193044@tbp1>
References: <002401c2591b$c1c1d550$fe193044@tbp1>
Message-ID: <20020911072101.GC18571@orion.logilab.fr>

On Tue, Sep 10, 2002 at 06:45:26PM -0400, Thomas B. Passin wrote:
> [Martin v. Loewis]
>
> > Also, people might be requesting XPath support in Python core at some
> > day - at which time I'd like to propose integration of PyXPath
> > (apparently, pressure to use XSLT programmatically is not that high).
>
> I use 4xslt programatically. I want to keep on doing so.

Same for us at Logilab ! We depend on that.

Alexandre Fayolle
--
LOGILAB, Paris (France).
http://www.logilab.com http://www.logilab.fr http://www.logilab.org
Narval, the first software agent available as free software (GPL).

From vdv@dyomedea.com Wed Sep 11 08:25:49 2002
From: vdv@dyomedea.com (Eric van der Vlist)
Date: 11 Sep 2002 09:25:49 +0200
Subject: [XML-SIG] PyXML XPath and XSLT
In-Reply-To: <20020911072101.GC18571@orion.logilab.fr>
References: <002401c2591b$c1c1d550$fe193044@tbp1> <20020911072101.GC18571@orion.logilab.fr>
Message-ID: <1031729149.32421.132.camel@ibook>

On Wed, 2002-09-11 at 09:21, Alexandre wrote:
> On Tue, Sep 10, 2002 at 06:45:26PM -0400, Thomas B. Passin wrote:
> > [Martin v. Loewis]
> >
> > > Also, people might be requesting XPath support in Python core at some
> > > day - at which time I'd like to propose integration of PyXPath
> > > (apparently, pressure to use XSLT programmatically is not that high).
> >
> > I use 4xslt programatically. I want to keep on doing so.
>
> Same for us at Logilab ! We depend on that.

Same for xvif (http://downloads.xmlschemata.org/python/xvif/)

Eric
--
Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist http://xmlfr.org http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

From rsalz@datapower.com Wed Sep 11 14:52:07 2002
From: rsalz@datapower.com (Rich Salz)
Date: Wed, 11 Sep 2002 09:52:07 -0400
Subject: [XML-SIG] test_c14n & default configuration
References: <15742.49570.838626.753296@grendel.zope.com>
Message-ID: <3D7F4A87.6000205@datapower.com>

The test_c14n file uses xpath to pick a nodeset. Now that xpath isn't there by default (boo hoo), it should probably say "can't run tests." or some such. (I missed the discussion: why isn't xpath there any more?)

/r$

From fdrake@acm.org Wed Sep 11 14:57:59 2002
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Wed, 11 Sep 2002 09:57:59 -0400
Subject: [XML-SIG] test_c14n & default configuration
In-Reply-To: <3D7F4A87.6000205@datapower.com>
References: <15742.49570.838626.753296@grendel.zope.com> <3D7F4A87.6000205@datapower.com>
Message-ID: <15743.19431.910257.636788@grendel.zope.com>

Rich Salz writes:
> The test_c14n file uses xpath to pick a nodeset. Now that xpath isn't
> there by default (boo hoo), it should probably say "can't run tests." or
> some such. (I missed the discussion: why isn't xpath there any more?)

It apparently doesn't pass the 4Suite regression test for that module, based on reports from others. I'm starting to think that xml.xpath should be re-enabled by default since it apparently mostly works, and only omit the xml.xslt package, since that supposedly is quite broken. I don't expect to have time to look into any specific failures there before the release, unfortunately.
I'm desparately trying to get the documentation (there's a fair bit to do there still) for minidom up-to-date, and then I'll check in that and the implementation and tests for the cloneNode()/importNode() handling discussed on the list yesterday. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From Mike.Olson@fourthought.com Wed Sep 11 15:51:53 2002 From: Mike.Olson@fourthought.com (Mike Olson) Date: 11 Sep 2002 08:51:53 -0600 Subject: [XML-SIG] PyXML XPath and XSLT In-Reply-To: References: Message-ID: <1031755914.3310.0.camel@penny> On Tue, 2002-09-10 at 10:50, Uche Ogbuji wrote: > I was meaning the human readable names. But now that you mention it :-) > > Maybe it would be less confusing to use xml.pyxpath and xml.pyxslt. I like > it because the idea behind these modules as distinct from other XPath/XSLT for > Python is that they are implemented all in Python, except the boolean > extension, IIRC. Of course, all we have to do is require Python 2.3 and we > can remove the boolean extension . number.c is also required. There is no way to do what you need to do in Python so it needs to be in C. Mike > > Aside: Python 2.3 will also bring us sets. Yes! Yes! Hallelujah! Finally! > > :-) > > > > > If this seems like a good idea, how about "pyxpath" and "pyxslt"? > > > > If we're talking about names-for-humans, how about PyXPath and PyXSLT? > > Sure. > > > > > Anyway, I think we should decide on these matters before the next release. > > > Things have been up in the air way too long. > > > > Too long, yes, but that's not a good reason to rush a decision in the > > next couple of days. When 4Suite 0.12 comes out and Martin updates > > the code in PyXML, we can have another release with any needed > > documentation updates we need to make. > > OK. > > > -- > Uche Ogbuji Fourthought, Inc. 
> http://uche.ogbuji.net http://4Suite.org http://fourthought.com > Track chair, XML/Web Services One Boston: http://www.xmlconference.com/ > Basic XML and RDF techniques for knowledge management, Part 7 - > http://www-106.ibm.com/developerworks/xml/library/x-think12.html > Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra > ry/x-jclark.html > Python and XML development using 4Suite, Part 3: 4RDF - > http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A > 1EA5A2CF4621C386256BBB006F4CEC > > > > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://mail.python.org/mailman/listinfo/xml-sig -- Mike Olson Principal Consultant mike.olson@fourthought.com +1 303 583 9900 x 102 Fourthought, Inc. http://Fourthought.com PO Box 270590, http://4Suite.org Louisville, CO 80027-5009, USA XML strategy, XML tools, knowledge management From Mike.Olson@fourthought.com Wed Sep 11 15:55:33 2002 From: Mike.Olson@fourthought.com (Mike Olson) Date: 11 Sep 2002 08:55:33 -0600 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: References: Message-ID: <1031756134.3310.2.camel@penny> On Tue, 2002-09-10 at 10:57, Uche Ogbuji wrote: > > > So, if a user tries to import xml.xslt.processor it will first look to > > see if Ft.Xml.Xslt.Processor is available, and if not, then try > > xml.pyxslt.Processor. Or, any other of the xslt processors that people > > are proposing. > > > > This means a common (or close to common) set of interfaces on Xslt > > parsers but that shouldn't be too hard to come up with. > > Luckily, the only interfaces to the outside for the general programmer are > > Processor In fact, for the public interface, we wouldn't even need the next two. 
> InputSource(Factory)
> Context, XsltContext
> Compile, Match, Evaluate

I had in mind something like:

    parser = xml.xpath.get_default_parser()
    parser.compile
    parser.evaluate
    parser.match

and

    processor = xml.xslt.get_default_processor()
    processor.apply("src string",["stys"],params)

I think if you want anything above and beyond that then you need to use a specific processor.

Mike

>
> These should be easy to standardize.
>
> I think the bigger roadblock might be the extension APIs, since those tend to
> be close to internals.
>
>
> --
> Uche Ogbuji Fourthought, Inc.
> http://uche.ogbuji.net http://4Suite.org http://fourthought.com
> Track chair, XML/Web Services One Boston: http://www.xmlconference.com/
> Basic XML and RDF techniques for knowledge management, Part 7 -
> http://www-106.ibm.com/developerworks/xml/library/x-think12.html
> Keeping pace with James Clark -
> http://www-106.ibm.com/developerworks/xml/library/x-jclark.html
> Python and XML development using 4Suite, Part 3: 4RDF -
> http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A1EA5A2CF4621C386256BBB006F4CEC
>

--
Mike Olson Principal Consultant
mike.olson@fourthought.com +1 303 583 9900 x 102
Fourthought, Inc. http://Fourthought.com
PO Box 270590, http://4Suite.org
Louisville, CO 80027-5009, USA
XML strategy, XML tools, knowledge management

From Mike.Olson@fourthought.com Wed Sep 11 15:57:08 2002
From: Mike.Olson@fourthought.com (Mike Olson)
Date: 11 Sep 2002 08:57:08 -0600
Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports
In-Reply-To: <200209101337.28036.mclay@nist.gov>
References: <200209101337.28036.mclay@nist.gov>
Message-ID: <1031756234.3341.4.camel@penny>

On Tue, 2002-09-10 at 11:37, Michael McLay wrote:
> On Tuesday 10 September 2002 12:42 pm, Uche Ogbuji wrote:
> >
> > Yes, in particular the bison parser modules.
> > > > There is even more that has moved to C recently in 4Suite, so we'll > > certainly want to keep in mind general principles about what we want to > > keep in Python int he PyXPath/PyXSLT versions. I do like the idea of > > keeping them mostly Python for max cross-platform support. > > Isn't it just the portablity of the specific C code that is at issue? The > bison dependancy is troublesome for porting to the Mac and Windows. Most > everyone would prefer the speed of C if it is available and it compiles > without a hitch. Note, Jeremy rewrote Bison (and LExx) in Python so now the code that was Bison dependent no longer requires bison installed. We do the complete conversion from .y (or our xml version of a .y) to a Py and C parser without bison (or flexx) installed. Mike > > > > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://mail.python.org/mailman/listinfo/xml-sig -- Mike Olson Principal Consultant mike.olson@fourthought.com +1 303 583 9900 x 102 Fourthought, Inc. http://Fourthought.com PO Box 270590, http://4Suite.org Louisville, CO 80027-5009, USA XML strategy, XML tools, knowledge management From fdrake@acm.org Wed Sep 11 16:11:02 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Wed, 11 Sep 2002 11:11:02 -0400 Subject: [XML-SIG] PyXML XPath and XSLT In-Reply-To: <1031755914.3310.0.camel@penny> References: <1031755914.3310.0.camel@penny> Message-ID: <15743.23814.790841.443578@grendel.zope.com> Mike Olson writes: > number.c is also required. There is no way to do what you need to > do in Python so it needs to be in C. This isn't in PyXML currently. Are we missing functionality because of that? -Fred -- Fred L. Drake, Jr. 
PythonLabs at Zope Corporation From Mike.Olson@fourthought.com Wed Sep 11 15:59:48 2002 From: Mike.Olson@fourthought.com (Mike Olson) Date: 11 Sep 2002 08:59:48 -0600 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: <15742.15938.727755.807823@grendel.zope.com> References: <200209101337.28036.mclay@nist.gov> <15742.15938.727755.807823@grendel.zope.com> Message-ID: <1031756389.3341.6.camel@penny> On Tue, 2002-09-10 at 12:47, Fred L. Drake, Jr. wrote: > > Martin v. Loewis writes: > > Correct. I'll look at the code when 0.12 is released, and I'm > > certainly open to building more C modules - we built the boolean > > module already to satisfy XPath/XSLT. > > Speaking of the boolean module... is there any reason not to change it > to use the built-in bool type under Python 2.3 and newer? Or just > replacing it with a Python module for 2.3? I don't think there is any reason, except, that I don't think any of the FT developers are running 2.3 yet Mike > > > -Fred > > -- > Fred L. Drake, Jr. > PythonLabs at Zope Corporation > > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://mail.python.org/mailman/listinfo/xml-sig -- Mike Olson Principal Consultant mike.olson@fourthought.com +1 303 583 9900 x 102 Fourthought, Inc. http://Fourthought.com PO Box 270590, http://4Suite.org Louisville, CO 80027-5009, USA XML strategy, XML tools, knowledge management From Mike.Olson@fourthought.com Wed Sep 11 16:00:48 2002 From: Mike.Olson@fourthought.com (Mike Olson) Date: 11 Sep 2002 09:00:48 -0600 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suitebug reports In-Reply-To: <000f01c258fb$72bd2830$df7e4e18@brianspiv1700> References: <000f01c258fb$72bd2830$df7e4e18@brianspiv1700> Message-ID: <1031756452.3341.8.camel@penny> On Tue, 2002-09-10 at 12:54, Brian Quinlan wrote: > > I'm all for this. 
What I would like to see then in xml.xpath and > > xml.xslt is smart import logic similar that in the default xml. > > > > So, if a user tries to import xml.xslt.processor it will first look to > > see if Ft.Xml.Xslt.Processor is available, and if not, then try > > xml.pyxslt.Processor. Or, any other of the xslt processors that > people > > are proposing. > > Please don't do this. The smart import used by PyXML has caused me > serious problems in the past. > > Unless the two libraries are going to be bug for bug compatible, then > I'd like to be able choose myself. But you will be able to. from Ft.Xml.Xslt import Processor or from xml.pyxslt import Processor Mike > > Cheers, > Brian > > > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://mail.python.org/mailman/listinfo/xml-sig -- Mike Olson Principal Consultant mike.olson@fourthought.com +1 303 583 9900 x 102 Fourthought, Inc. http://Fourthought.com PO Box 270590, http://4Suite.org Louisville, CO 80027-5009, USA XML strategy, XML tools, knowledge management From Mike.Olson@fourthought.com Wed Sep 11 16:03:33 2002 From: Mike.Olson@fourthought.com (Mike Olson) Date: 11 Sep 2002 09:03:33 -0600 Subject: [XML-SIG] [Q] xml.utils.boolean C API In-Reply-To: <15742.17297.374531.327438@grendel.zope.com> References: <15742.17297.374531.327438@grendel.zope.com> Message-ID: <1031756621.3310.10.camel@penny> On Tue, 2002-09-10 at 13:10, Fred L. Drake, Jr. wrote: > > The C API to the boolean module, expressed in extensions/boolean.h, > does not appear to be used. It certainly can't be used as-is when the > module is built to be dynamically loaded. The boolean_new() function > it references is not present in boolean.c at all. It confused me as well. But it is actually used. In the init of the module, we expose a ref to false and true. The are the only two instances of boolean ever created. 
> > Is there any reason this header is used, or can the needed information > simply be merged into the implementation? > > The implementation also uses DL_EXPORT for a few things that should be > static, and almost nobody outside the Python core needs to use the > tp_print slot; the use made by this code is certainly avoidable in > efficient ways. > > Uche, Mike, et al.: any objection to a little cleanup here? Not at all. Though, you might need to here from Jeremy before you do much as he wrote it. Mike > > Thanks! > > > -Fred > > -- > Fred L. Drake, Jr. > PythonLabs at Zope Corporation > > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://mail.python.org/mailman/listinfo/xml-sig -- Mike Olson Principal Consultant mike.olson@fourthought.com +1 303 583 9900 x 102 Fourthought, Inc. http://Fourthought.com PO Box 270590, http://4Suite.org Louisville, CO 80027-5009, USA XML strategy, XML tools, knowledge management From fdrake@acm.org Wed Sep 11 16:25:20 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Wed, 11 Sep 2002 11:25:20 -0400 Subject: [XML-SIG] [Q] xml.utils.boolean C API In-Reply-To: <1031756621.3310.10.camel@penny> References: <15742.17297.374531.327438@grendel.zope.com> <1031756621.3310.10.camel@penny> Message-ID: <15743.24672.594525.801594@grendel.zope.com> Mike Olson writes: > It confused me as well. But it is actually used. In the init of the > module, we expose a ref to false and true. The are the only two > instances of boolean ever created. I've maintained that in the changes I committed yesterday, but I moved the declarations for the (now static) globals into boolean.c. We did something similar for the bool type in Python 2.3, but re-used the Py_True and Py_False globals already provided. > Not at all. Though, you might need to here from Jeremy before you do > much as he wrote it. Oops, too late! Jeremy, feel free to scream if I botched things for you! -Fred -- Fred L. Drake, Jr. 
PythonLabs at Zope Corporation From Mike.Olson@fourthought.com Wed Sep 11 17:08:56 2002 From: Mike.Olson@fourthought.com (Mike Olson) Date: 11 Sep 2002 10:08:56 -0600 Subject: [XML-SIG] PyXML XPath and XSLT In-Reply-To: <20020911072101.GC18571@orion.logilab.fr> References: <002401c2591b$c1c1d550$fe193044@tbp1> <20020911072101.GC18571@orion.logilab.fr> Message-ID: <1031760539.9251.4.camel@penny> On Wed, 2002-09-11 at 01:21, Alexandre wrote: > On Tue, Sep 10, 2002 at 06:45:26PM -0400, Thomas B. Passin wrote: > > [Martin v. Loewis] > > > > > Also, people might be requesting XPath support in Python core at some > > > day - at which time I'd like to propose integration of PyXPath > > > (apparently, pressure to use XSLT programmatically is not that high). > > > > I use 4xslt programatically. I want to keep on doing so. > > Same for us at Logilab ! We depend on that. Probably doesn't need saying, but we use these as well :) Mike > > Alexandre Fayolle > -- > LOGILAB, Paris (France). > http://www.logilab.com http://www.logilab.fr http://www.logilab.org > Narval, the first software agent available as free software (GPL). > > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://mail.python.org/mailman/listinfo/xml-sig -- Mike Olson Principal Consultant mike.olson@fourthought.com +1 303 583 9900 x 102 Fourthought, Inc. http://Fourthought.com PO Box 270590, http://4Suite.org Louisville, CO 80027-5009, USA XML strategy, XML tools, knowledge management From Mike.Olson@fourthought.com Wed Sep 11 17:22:57 2002 From: Mike.Olson@fourthought.com (Mike Olson) Date: 11 Sep 2002 10:22:57 -0600 Subject: [XML-SIG] PyXML XPath and XSLT In-Reply-To: <15743.23814.790841.443578@grendel.zope.com> References: <1031755914.3310.0.camel@penny> <15743.23814.790841.443578@grendel.zope.com> Message-ID: <1031761378.9251.6.camel@penny> On Wed, 2002-09-11 at 09:11, Fred L. Drake, Jr. wrote: > > Mike Olson writes: > > number.c is also required. 
There is no way to do what you need to > > do in Python so it needs to be in C. > > This isn't in PyXML currently. Are we missing functionality because > of that? Yes, though pretty arcane stuff. -INF and -0 and that -INF/0 = -1 while -INF/-0 = 1 etc. I've never (conciously) used them. Mike > > > -Fred > > -- > Fred L. Drake, Jr. > PythonLabs at Zope Corporation -- Mike Olson Principal Consultant mike.olson@fourthought.com +1 303 583 9900 x 102 Fourthought, Inc. http://Fourthought.com PO Box 270590, http://4Suite.org Louisville, CO 80027-5009, USA XML strategy, XML tools, knowledge management From fdrake@acm.org Wed Sep 11 17:41:29 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Wed, 11 Sep 2002 12:41:29 -0400 Subject: [XML-SIG] PyXML XPath and XSLT In-Reply-To: <1031761378.9251.6.camel@penny> References: <1031755914.3310.0.camel@penny> <15743.23814.790841.443578@grendel.zope.com> <1031761378.9251.6.camel@penny> Message-ID: <15743.29241.834997.331169@grendel.zope.com> Mike Olson writes: > Yes, though pretty arcane stuff. -INF and -0 and that -INF/0 = -1 while > -INF/-0 = 1 etc. > > I've never (conciously) used them. I think I can live without those as well. ;-) -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From uche.ogbuji@fourthought.com Wed Sep 11 18:00:20 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Wed, 11 Sep 2002 11:00:20 -0600 Subject: [XML-SIG] test_c14n & default configuration In-Reply-To: Message from "Fred L. Drake, Jr." of "Wed, 11 Sep 2002 09:57:59 EDT." <15743.19431.910257.636788@grendel.zope.com> Message-ID: > > Rich Salz writes: > > The test_c14n file uses xpath to pick a nodeset. Now that xpath isn't > > there by default (boo hoo), it should probably say "can't run tests." or > > some such. (I missed the discussion: why isn't xpath there any more?) > > It appearantly doesn't pass the 4Suite regression test for that > module, based on reports from others. 
I'm starting to think that > xml.xpath should be re-enabled by default since it appearantly mostly > works, and only omit xml.xslt package, since that supposedly is quite > broken. Yes. This sounds like a more accurate ajustment. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html From fdrake@acm.org Wed Sep 11 18:14:47 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Wed, 11 Sep 2002 13:14:47 -0400 Subject: [XML-SIG] test_c14n & default configuration In-Reply-To: References: <15743.19431.910257.636788@grendel.zope.com> Message-ID: <15743.31239.244979.608086@grendel.zope.com> Uche Ogbuji writes: > Yes. This sounds like a more accurate ajustment. Ok, I've re-enabled xml.xpath in the default installation. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From uche.ogbuji@fourthought.com Wed Sep 11 18:22:21 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Wed, 11 Sep 2002 11:22:21 -0600 Subject: PyXML XPath and XSLT Re: [XML-SIG] Mass assignment of 4Suite bug reports In-Reply-To: Message from Mike Olson of "11 Sep 2002 08:59:48 MDT." <1031756389.3341.6.camel@penny> Message-ID: > On Tue, 2002-09-10 at 12:47, Fred L. Drake, Jr. wrote: > > > > Martin v. Loewis writes: > > > Correct. I'll look at the code when 0.12 is released, and I'm > > > certainly open to building more C modules - we built the boolean > > > module already to satisfy XPath/XSLT. > > > > Speaking of the boolean module... is there any reason not to change it > > to use the built-in bool type under Python 2.3 and newer? Or just > > replacing it with a Python module for 2.3? 
> > I don't think there is any reason, except, that I don't think any of the > FT developers are running 2.3 yet I run it on again, off again from Python CVS. It's rolling into release cycle, so I want to be sure we're ready. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html From uche.ogbuji@fourthought.com Wed Sep 11 20:38:34 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Wed, 11 Sep 2002 13:38:34 -0600 Subject: [XML-SIG] [Q] xml.utils.boolean C API In-Reply-To: Message from Mike Olson of "11 Sep 2002 09:03:33 MDT." <1031756621.3310.10.camel@penny> Message-ID: > On Tue, 2002-09-10 at 13:10, Fred L. Drake, Jr. wrote: > > > > The C API to the boolean module, expressed in extensions/boolean.h, > > does not appear to be used. It certainly can't be used as-is when the > > module is built to be dynamically loaded. The boolean_new() function > > it references is not present in boolean.c at all. > > It confused me as well. But it is actually used. In the init of the > module, we expose a ref to false and true. The are the only two > instances of boolean ever created. > > > > > Is there any reason this header is used, or can the needed information > > simply be merged into the implementation? > > > > The implementation also uses DL_EXPORT for a few things that should be > > static, and almost nobody outside the Python core needs to use the > > tp_print slot; the use made by this code is certainly avoidable in > > efficient ways. > > > > Uche, Mike, et al.: any objection to a little cleanup here? > > Not at all. Though, you might need to here from Jeremy before you do > much as he wrote it. No. I wrote it. 
There is for sure nothing there that Fred can't handle easily. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html From fdrake@acm.org Wed Sep 11 20:51:38 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Wed, 11 Sep 2002 15:51:38 -0400 Subject: [XML-SIG] PyXML 0.8.1 preparations Message-ID: <15743.40650.724560.132349@grendel.zope.com> I think I'm done with everything I'm planning to do for PyXML 0.8.1. Is there anything else I promised to get done that I've forgotten about? I've been testing all along with Python 2.0.1, 2.1.3, 2.2.1, and 2.3a0 on Linux (RedHat 7.2), and tested CVS as of this morning with Python 2.2.1 on Windows 2000. I won't be able to test on Windows again until I get home tonight. Are there any showstoppers I've missed? -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From martin@v.loewis.de Wed Sep 11 23:40:33 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 12 Sep 2002 00:40:33 +0200 Subject: [XML-SIG] PyXML 0.8.1 preparations In-Reply-To: <15743.40650.724560.132349@grendel.zope.com> References: <15743.40650.724560.132349@grendel.zope.com> Message-ID: "Fred L. Drake, Jr." writes: > I think I'm done with everything I'm planning to do for PyXML 0.8.1. > Is there anything else I promised to get done that I've forgotten > about? Please review the ANNOUNCE changes. Apart from that, I'd be ready to produce the release. 
Regards, Martin From remi@cherrypy.org Thu Sep 12 14:22:11 2002 From: remi@cherrypy.org (Remi Delon) Date: Thu, 12 Sep 2002 15:22:11 +0200 Subject: [XML-SIG] Online XML/XSL transformation tool available (based on 4Suite and CherryPy) Message-ID: Hi everyone, I just made available a very simple online XML/XSL transformation tool. You just input your XML document and your XSL stylesheet (in textareas), and it runs the transformation and displays the result. This can sometimes come in handy to debug XSL stylesheets. This tool is part of the CherryPy online demo and it is used to demonstrate the integration of the 4Suite module into CherryPy. The direct URL is http://www.cherrypy.org/demo/xmlXslOnline Cheers. Remi. PS: Let me know if you have any comments or if you would like more features. PS2: Thanks to the Fourthought guys for their good work on 4Suite. From Juergen Hermann" Message-ID: On Wed, 11 Sep 2002 15:51:38 -0400, Fred L. Drake, Jr. wrote: >Are there any showstoppers I've missed? Yes, we just found a bug in xmlproc. :) Without the patch below, xmlproc won't read external parameter entities larger than 16K (actually, it will only read the first 16K, resulting in non-wellformedness errors and the like because parts of the entity are missing). Also added to the bug tracker...
RCS file: /cvsroot/pyxml/xml/xml/parsers/xmlproc/xmlutils.py,v
retrieving revision 1.33
diff -u -r1.33 xmlutils.py
--- xml/parsers/xmlproc/xmlutils.py 22 Aug 2002 17:03:14 -0000 1.33
+++ xml/parsers/xmlproc/xmlutils.py 12 Sep 2002 16:04:16 -0000
@@ -162,7 +162,7 @@
  tmp = self.seen_xmldecl
  self.seen_xmldecl = 0 # Avoid complaints
- self.read_from(inf)
+ self.read_from(inf, -1)
  self.seen_xmldecl = tmp
  self.flush()
Ciao, Jürgen -- Jürgen Hermann, Developer WEB.DE AG, http://webde-ag.de/ From noreply@sourceforge.net Thu Sep 12 17:14:37 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Thu, 12 Sep 2002 09:14:37 -0700 Subject: [XML-SIG] [ pyxml-Bugs-608453 ] xmlproc: fails on >16K entities Message-ID: Bugs item #608453, was opened at 2002-09-12 18:14 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=106473&aid=608453&group_id=6473 Category: xmlproc Group: None Status: Open Resolution: None Priority: 7 Submitted By: Jürgen Hermann (jhermann) Assigned to: Lars Marius Garshol (larsga) Summary: xmlproc: fails on >16K entities Initial Comment: Without the patch below, xmlproc won't read external parameter entities larger than 16K (actually, it will only read the first 16K, resulting in non-wellformedness errors and the like because parts of the entity are missing).
RCS file: /cvsroot/pyxml/xml/xml/parsers/xmlproc/xmlutils.py,v
retrieving revision 1.33
diff -u -r1.33 xmlutils.py
--- xml/parsers/xmlproc/xmlutils.py 22 Aug 2002 17:03:14 -0000 1.33
+++ xml/parsers/xmlproc/xmlutils.py 12 Sep 2002 16:04:16 -0000
@@ -162,7 +162,7 @@
  tmp = self.seen_xmldecl
  self.seen_xmldecl = 0 # Avoid complaints
- self.read_from(inf)
+ self.read_from(inf, -1)
  self.seen_xmldecl = tmp
  self.flush()
Testcase: %big-ent; ]> and read it with "xmlproc_val test.xml".
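For readers unfamiliar with this failure mode, here is a minimal sketch, not taken from xmlproc itself, of what a size-limited read does to a large entity. It assumes (the report does not show this) that the second argument of read_from ends up as the size passed to a file-like read(); the 20000-byte payload and the variable names are made up for illustration.

```python
import io

# Hypothetical stand-in for an external parameter entity larger than 16K.
entity = b"x" * 20000

# A single read with a 16K size limit returns only the first 16384 bytes,
# so anything after that point never reaches the parser.
truncated = io.BytesIO(entity).read(16384)

# A size of -1 asks the file object for everything up to EOF.
complete = io.BytesIO(entity).read(-1)

print(len(truncated), len(complete))
```

If that assumption holds, the one-argument call in the patch behaves like the first read and the patched call like the second.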
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=106473&aid=608453&group_id=6473 From fdrake@acm.org Thu Sep 12 17:40:10 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Thu, 12 Sep 2002 12:40:10 -0400 Subject: [XML-SIG] PyXML 0.8.1 preparations In-Reply-To: References: <15743.40650.724560.132349@grendel.zope.com> Message-ID: <15744.50026.235237.610213@grendel.zope.com> Juergen Hermann writes: > Yes, we just found a bug in xmlproc. :) Sheesh, the answer was supposed to be "no"! I don't know how active Lars Marius is these days; we haven't seen him a lot on the list. Hopefully he can look at the patch quickly and determine whether it should go in or not; someone else can check it in if he just marks the patch approved on SF. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From Juergen Hermann" Message-ID: On Thu, 12 Sep 2002 12:40:10 -0400, Fred L. Drake, Jr. wrote: >Sheesh, the answer was supposed to be "no"! OK, next time I will not report it. ;) It's no problem for *US* if it doesn't get into 0.8.1, because it's easily reapplied locally. But I guess it could cause problems with DocBook for example, since it uses large DTD entity includes. Ciao, Jürgen From fdrake@acm.org Thu Sep 12 19:06:11 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Thu, 12 Sep 2002 14:06:11 -0400 Subject: [XML-SIG] PyXML 0.8.1 preparations In-Reply-To: References: <15744.50026.235237.610213@grendel.zope.com> Message-ID: <15744.55187.713080.761379@grendel.zope.com> Juergen Hermann writes: > OK, next time I will not report it. ;) Oh, it's fine to report it, just don't call it a showstopper! ;-) > It's no problem for *US* if it doesn't get into 0.8.1, because it's > easily reapplied locally. But I guess it could cause problems with > DocBook for example, since it uses large DTD entity includes.
I've certainly no objection to applying the patch, but since I'm no xmlproc expert, am hesitant to do so on my own judgement without spending some time reading the source. I do hope someone who knows xmlproc better than I can look at it quickly; with such a simple patch it would be a shame not to include it. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From thal@kahala.net Thu Sep 12 00:47:31 2002 From: thal@kahala.net (thal) Date: Wed, 11 Sep 2002 13:47:31 -1000 Subject: [XML-SIG] XUL and Python Message-ID: <3D7FD613.CB8A68C1@kahala.net> Hello, having read this excellent article on Mozilla and XUL: http://salon.com/tech/feature/2002/09/10/browser_wars/print.html I found this page on Python and XUL: http://www.mozilla.org/docs/xul/xulnotes/xulnote_oven.html which starts with this example: >>> from XUL import * >>> myXWin = XWindow('XUL Pie', 100, 200) >>> myXT1 = XText('Hello World!', bclass='marquee') >>> myXB1 = XButton('Quit', oncommand='window.close()') >>> myXWin.Bake(myXT1, myXB1) >>> myXWin.Serve() I'm using Python 2.2, and when I try the first line I get >>> ImportError: No module named XUL I've tried looking for this XUL module, but can't find it. Can anyone point me in the right direction? Thanks. From landauer@got.net Fri Sep 13 01:36:58 2002 From: landauer@got.net (landauer@got.net) Date: Thu, 12 Sep 2002 17:36:58 -0700 Subject: [XML-SIG] XUL and Python In-Reply-To: <3D7FD613.CB8A68C1@kahala.net> References: <3D7FD613.CB8A68C1@kahala.net> Message-ID: <1031877418.3d81332ab5630@webmail.got.net> > I found this page on Python and XUL: > http://www.mozilla.org/docs/xul/xulnotes/xulnote_oven.html > which starts with this example: > > >>> from XUL import * [...] > I've tried looking for this XUL module, but can't find it. Near the end of the article to which you referred, it says: > In the meantime, the source code is available __here__. 
To use > the module, copy the source code into a file called XUL.py and > put it in your PYTHONPATH, start the interpreter, and import > the module's classes. "__here__" is a hyperlink that points to the following URL: http://www.mozilla.org/docs/xul/xulnotes/XUL.py.txt But do note that the XUL article was written at least 2 and a half years ago, and PyXML didn't really exist 2.5 years ago. I don't know enough of either XUL or the current PyXML stuff to know how easy it would be to do this, but I suspect that PyXML could help someone to write a much better XUL.py today. -=-=-=- food. shelter. clothing. net. Got.net - The Internet Connection, Inc From vdv@dyomedea.com Fri Sep 13 10:33:50 2002 From: vdv@dyomedea.com (Eric van der Vlist) Date: 13 Sep 2002 11:33:50 +0200 Subject: [XML-SIG] WXS type library: guidance needed Message-ID: <1031909630.11241.219.camel@ibook> Hi, I'd like to add, sooner or later, a support for the W3C XML Schema datatypes in my Relax NG xvif implementation [1] and was wondering if there is anything anywhere that I could borrow. This first track I could follow is XSV, the W3C XML Schema implementation, but its status page [2] shows that the implementation of a type library isn't done yet. The second path is a thread from February on this list [3] but this doesn't go beyond suggestions for an API and the implementation [4] seems to be halted. Is this a subject on which someone is already working on and/or is there anything already existing? Many thanks, Eric [1] http://downloads.xmlschemata.org/python/xvif/ [2] http://www.ltg.ed.ac.uk/~ht/xsv-status.html [3] http://mail.python.org/pipermail/xml-sig/2002-February/007200.html [4] http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/pyxml/sandbox/datatypes/ -- Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html ------------------------------------------------------------------------ Eric van der Vlist http://xmlfr.org http://dyomedea.com (W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema ------------------------------------------------------------------------ From nhs@llnl.gov Fri Sep 13 16:35:27 2002 From: nhs@llnl.gov (Norman Samuelson) Date: Fri, 13 Sep 2002 08:35:27 -0700 Subject: [XML-SIG] expat missing in Sun Python? Message-ID: <5.1.0.14.2.20020913082446.02c40bd0@popeye.llnl.gov> I have been working on an application that deals with XML files, converting text data to XML. It uses the minidom, which I thought was part of standard Python distributions. I developed the code on Windows, and I have had no trouble running it on Compaq or IBM flavors of Unix. When I try to run it on our Sun machines, it fails. The following two lines of Python code illustrate the problem: import xml.dom.minidom as mdom doc = mdom.parse('name of xml file goes here') When parse is called on the Sun, I get the following retrace: >Python 2.1.1 (#2, Nov 8 2001, 17:03:47) >[GCC 2.95.2 19991024 (release)] on sunos5 >Type "copyright", "credits" or "license" for more information. > >>> import xml.dom.minidom as mdom > >>> doc = mdom.parse('/usr/gapps/cyclops/data/ale3d.ParamDescriptions.xml') >Traceback (most recent call last): > File "", line 1, in ? 
> File > "/usr/local/apps/python/python-2.1.1/lib/python2.1/xml/dom/minidom.py", > line 910, in parse > return _doparse(pulldom.parse, args, kwargs) > File > "/usr/local/apps/python/python-2.1.1/lib/python2.1/xml/dom/minidom.py", > line 901, in _doparse > events = apply(func, args, kwargs) > File > "/usr/local/apps/python/python-2.1.1/lib/python2.1/xml/dom/pulldom.py", > line 289, in parse > parser = xml.sax.make_parser() > File > "/usr/local/apps/python/python-2.1.1/lib/python2.1/xml/sax/__init__.py", > line 88, in make_parser > raise SAXReaderNotAvailable("No parsers found", None) >xml.sax._exceptions.SAXReaderNotAvailable: No parsers found > >>> It appears that expat (or is it pyexpat) is missing. I would appreciate it if someone could tell me where to find the missing piece, or a newer version that includes that missing piece. - Norm - Norman H. Samuelson nhs@llnl.gov Lawrence Livermore National Lab 925-422-0661 P.O. Box 808, L-98 Livermore, CA 94551 From fdrake@acm.org Fri Sep 13 16:52:14 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Fri, 13 Sep 2002 11:52:14 -0400 Subject: [XML-SIG] expat missing in Sun Python? In-Reply-To: <5.1.0.14.2.20020913082446.02c40bd0@popeye.llnl.gov> References: <5.1.0.14.2.20020913082446.02c40bd0@popeye.llnl.gov> Message-ID: <15746.2478.468124.116750@grendel.zope.com> Norman Samuelson writes: > It appears that expat (or is it pyexpat) is missing. It does look that way. Here's what's probably happening; you'll be able to check by looking for some files on your system. Python includes an extension that allows it to use Expat. On Windows, the Expat DLL is bundled as well, so Expat will always be available on that platform. On Unix, the "pyexpat" extension will be built only if the setup.py script bundled with the Python sources can locate Expat when Python is built. 
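A quick way to tell which case applies on a given machine, offered as a minimal sketch rather than anything shipped with Python: try importing pyexpat directly, then ask xml.sax for a parser. The helper names expat_available and sax_parser_available are invented here for illustration.

```python
import xml.sax


def expat_available():
    """Return True if the pyexpat C extension was built for this Python."""
    try:
        import pyexpat  # noqa: F401 (only the import attempt matters)
    except ImportError:
        return False
    return True


def sax_parser_available():
    """Return True if xml.sax can locate at least one parser driver."""
    try:
        xml.sax.make_parser()
    except xml.sax.SAXReaderNotAvailable:
        return False
    return True


print("pyexpat importable: %s" % expat_available())
print("SAX parser found: %s" % sax_parser_available())
```

If the first check fails, the "No parsers found" error in the traceback above is the expected consequence.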
If the system has Expat installed, you will be able to locate either expat.h or xmlparse.h (depending on the version of Expat) in the system include directory, and either libexpat.{a,so} or libxmlparse.a and libxmltok.a in one of the system libraries directories. If Expat is located somewhere else on the system, you will need to modify the setup.py script to allow Python to build the "pyexpat" extension using that version of Expat. Recent versions of PyXML bundle the complete Expat source code and do not require a separate installation of Expat. This approach is being adopted for Python as well, and will be incorporated in Python 2.3. > I would appreciate it if someone could tell me where to find the missing > piece, or a newer version that includes that missing piece. Perhaps the simplest approach to getting XML support at this point is to install the latest version of PyXML. Version 0.8.1 is expected in the next few days, and includes many improvements in xml.dom.minidom and the latest version of Expat (1.95.5). -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From uche.ogbuji@fourthought.com Fri Sep 13 17:28:19 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Fri, 13 Sep 2002 10:28:19 -0600 Subject: [XML-SIG] XUL and Python In-Reply-To: Message from landauer@got.net of "Thu, 12 Sep 2002 17:36:58 PDT." <1031877418.3d81332ab5630@webmail.got.net> Message-ID: > > I found this page on Python and XUL: > > http://www.mozilla.org/docs/xul/xulnotes/xulnote_oven.html > > which starts with this example: > > > > >>> from XUL import * > [...] > > > I've tried looking for this XUL module, but can't find it. > > Near the end of the article to which you referred, it says: > > > In the meantime, the source code is available __here__. To use > > the module, copy the source code into a file called XUL.py and > > put it in your PYTHONPATH, start the interpreter, and import > > the module's classes. 
> > "__here__" is a hyperlink that points to the following URL: > > http://www.mozilla.org/docs/xul/xulnotes/XUL.py.txt > > But do note that the XUL article was written at least 2 and a > half years ago, and PyXML didn't really exist 2.5 years ago. I > don't know enough of either XUL or the current PyXML stuff > to know how easy it would be to do this, but I suspect that PyXML > could help someone to write a much better XUL.py today. You can program XUL in Python through PyXPCOM. See my articles: http://www-106.ibm.com/developerworks/components/library/co-pyxp1.html http://www-106.ibm.com/developerworks/webservices/library/co-pyxp2.html http://www-106.ibm.com/developerworks/webservices/library/co-pyxp3/ Some things have changed since I wrote that. For instance, PyXPCOM is now built in with Mozilla. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html From uche.ogbuji@fourthought.com Fri Sep 13 17:36:01 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Fri, 13 Sep 2002 10:36:01 -0600 Subject: [XML-SIG] WXS type library: guidance needed In-Reply-To: Message from Eric van der Vlist of "13 Sep 2002 11:33:50 +0200." <1031909630.11241.219.camel@ibook> Message-ID: > Hi, > > I'd like to add, sooner or later, a support for the W3C XML Schema > datatypes in my Relax NG xvif implementation [1] and was wondering if > there is anything anywhere that I could borrow. > > This first track I could follow is XSV, the W3C XML Schema > implementation, but its status page [2] shows that the implementation of > a type library isn't done yet.
> > The second path is a thread from February on this list [3] but this > doesn't go beyond suggestions for an API and the implementation [4] > seems to be halted. > > Is this a subject on which someone is already working on and/or is there > anything already existing? This is something where I think an XML-SIG project would be excellent. Like em or not, data types are everywhere in XML processing these days, and why should we even allow false reasons for Java envy? :-) I think we should take on a project to develop a generic XML type library implementation for Python. I would envision it as a set of classes. Each would have a regex or whatever that sets the lexical space, and also a set of methods and class attributes that reify the value space. This way, it would be really easy to plug into schema, query or Web services projects, or even Python data binding. If we all came up with an interface here, we could take volunteers for filling out the library. For one thing, we should probably support W3C XML Schema (WXS) types, as much as I dislike them, because of their ubiquity. One person could tackle the numeric types, another the dates (poor sod :-) ), another text types, another XML structural types etc. Yet more people could add in useful non-WXS types (geospatial comes to mind, as does a saner date type implementation). Any thoughts? -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html From fdrake@acm.org Fri Sep 13 17:43:08 2002 From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Fri, 13 Sep 2002 12:43:08 -0400 Subject: [XML-SIG] XUL and Python In-Reply-To: References: <"Message from landauer"@got.net> <1031877418.3d81332ab5630@webmail.got.net> Message-ID: <15746.5532.790651.253270@grendel.zope.com> Uche Ogbuji writes: > Some things have changed since I wrote that. For instance, PyXPCOM is now > built in with Mozilla. Now that's cool! Do you know what version that started? -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From vdv@dyomedea.com Fri Sep 13 17:46:44 2002 From: vdv@dyomedea.com (Eric van der Vlist) Date: 13 Sep 2002 18:46:44 +0200 Subject: [XML-SIG] WXS type library: guidance needed In-Reply-To: References: Message-ID: <1031935605.11241.755.camel@ibook> Hi Uche, On Fri, 2002-09-13 at 18:36, Uche Ogbuji wrote: > > > > Is this a subject on which someone is already working on and/or is there > > anything already existing? > > This is something where I think an XML-SIG project would be excellent. Like > em or not, data types are everywhere in XML processing these days, and why > should we even allow false reasons for Java envy? :-) I would be really happy if this could be a community project since W3C XML Schema datatypes can be a real nightmare if we want a full implementation. > I think we should take on a project to develop a generic XML type library > implementation for Python. > > I would envision it as a set of classes. Each would have a regex or whatever > that sets the lexical space, and also a set of methods and class attributes > that reify the value space. > > This way, it would be really easy to plug into schema, query or Web services > projects, or even Python data binding. Exactly, such libraries could be used directly from DOM or SAX applications and through Relax NG (a great thing with Relax NG is its ability to use such libraries). > If we all came up with an interface here, we could take volunteers for filling > out the library.
For one thing, we should probably support W3C XML Schema > (WXS) types, as much as I dislike them, because of their ubiquity. One person > could tackle the numeric types, another the dates (poor sod :-) ), another > text types, another XML structural types etc. Yet more people could add in > useful non-WXS types (geospatial comes to mind, as does a saner date type > implementation). There is also a DTD compatibility type library defined in RNG. Although this is much less work than WXS, I would expect that it would be quite useful. > Any thoughts? Sounds good, when/how do we start :-) Eric -- Rendez-vous à Paris. http://www.technoforum.fr/integ2002/index.html ------------------------------------------------------------------------ Eric van der Vlist http://xmlfr.org http://dyomedea.com (W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema ------------------------------------------------------------------------ From rsalz@datapower.com Fri Sep 13 17:51:02 2002 From: rsalz@datapower.com (Rich Salz) Date: Fri, 13 Sep 2002 12:51:02 -0400 Subject: [XML-SIG] WXS type library: guidance needed References: Message-ID: <3D821776.9060001@datapower.com> Sorry Eric, I didn't pick up on your original note. > I think we should take on a project to develop a generic XML type library > implementation for Python. ZSI's typecode system might be a good starting point. It includes all the primitive XSD schema types, the ability to parse and serialize (according to SOAP RPC encoding rules), etc. /r$ From fdrake@acm.org Fri Sep 13 17:52:13 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Fri, 13 Sep 2002 12:52:13 -0400 Subject: [XML-SIG] WXS type library: guidance needed In-Reply-To: References: <1031909630.11241.219.camel@ibook> Message-ID: <15746.6077.170385.423292@grendel.zope.com> Uche Ogbuji writes: > This is something where I think an XML-SIG project would be > excellent.
Like em or not, data types are everywhere in XML > processing these days, and why should we even allow false reasons > for Java envy? :-) > > I think we should take on a project to develop a generic XML type > library implementation for Python. Andrew Kuchling started some work, currently checked into the PyXML CVS repository in the "datatypes" directory of the sandbox module (not the xml module). I don't really know what the state of it is. I'd really like to see Andrew's RELAX-NG package be finished and made part of the standard PyXML distribution as well (it's also hidden in the sandbox). I offered to help work on it at one point, but then never found the time to do so. ;-( Eric, perhaps you'd like to take a look at that code and see if it works for what you're trying to do with XVIF, and help out with general RELAX-NG support. ;-) -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From vdv@dyomedea.com Fri Sep 13 18:07:21 2002 From: vdv@dyomedea.com (Eric van der Vlist) Date: 13 Sep 2002 19:07:21 +0200 Subject: [XML-SIG] WXS type library: guidance needed In-Reply-To: <15746.6077.170385.423292@grendel.zope.com> References: <1031909630.11241.219.camel@ibook> <15746.6077.170385.423292@grendel.zope.com> Message-ID: <1031936841.11185.790.camel@ibook> Hi Fred, On Fri, 2002-09-13 at 18:52, Fred L. Drake, Jr. wrote: > > Uche Ogbuji writes: > > This is something where I think an XML-SIG project would be > > excellent. Like em or not, data types are everywhere in XML > > processing these days, and why should we even allow false reasons > > for Java envy? :-) > > > > I think we should take on a project to develop a generic XML type > > library implementation for Python. > > Andrew Kuchling started some work, currently checked into the PyXML > CVS repository in the "datatypes" directory of the sandbox module (not > the xml module). I don't really know what the state of it is.
> > I'd really like to see Andrew's RELAX-NG package be finished and made > part of the standard PyXML distribution as well (it's also hidden in > the sandbox). I offered to help work on it at one point, but then > never found the time to do so. ;-( > > Eric, perhaps you'd like to take a look at that code and see if it > works for what you're trying to do with XVIF, and help out with > general RELAX-NG support. ;-) I exchanged a couple of emails with Andrew when I started my XVIF implementation. I knew that there was some redundancy between what we were doing but I have a set of different objectives (one is to learn the RNG "internals" for a book [1] which I am writing on the subject, the second is to extend it for the work I am doing at the ISO DSDL [2] working group) and I have decided to go by myself. The book will be available under FDL, is open for public review (the first 6 chapters are already available) and comments from the XML Python community are welcome. Also, XVIF [3] is published under an MPL license and thus available for public use if anyone is interested. As for the library, I took a look at what has been done hidden in the sandbox and think that we could consider it as a starting point since it seems generic and well thought out. Thanks Eric [1] http://books.xmlschemata.org/relaxng/ [2] http://dsdl.org [3] http://downloads.xmlschemata.org/python/xvif/ > > > -Fred > > -- > Fred L. Drake, Jr. > PythonLabs at Zope Corporation > > -- Rendez-vous à Paris. http://www.technoforum.fr/integ2002/index.html ------------------------------------------------------------------------ Eric van der Vlist http://xmlfr.org http://dyomedea.com (W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema ------------------------------------------------------------------------ From ht@cogsci.ed.ac.uk Fri Sep 13 19:50:12 2002 From: ht@cogsci.ed.ac.uk (Henry S.
Thompson) Date: 13 Sep 2002 19:50:12 +0100 Subject: [XML-SIG] WXS type library: guidance needed In-Reply-To: <1031936841.11185.790.camel@ibook> References: <1031909630.11241.219.camel@ibook> <15746.6077.170385.423292@grendel.zope.com> <1031936841.11185.790.camel@ibook> Message-ID: I think this is a great idea, and will contribute what little there is in XSV if it looks like being useful. The whitespace handling is what comes immediately to mind, as that's slightly tricky to get right. ht -- Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2002, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ [mail really from me _always_ has this .sig -- mail without it is forged spam] From will.rutherdale@utoronto.ca Sat Sep 14 15:27:13 2002 From: will.rutherdale@utoronto.ca (Will Rutherdale) Date: Sat, 14 Sep 2002 10:27:13 -0400 Subject: [XML-SIG] XML HOWTO Message-ID: <3D834741.975F20@interlog.com> Where can I download a copy of the excellent Python XML HOWTO? I'd rather get the whole thing and browse it locally rather than navigate it a page at a time from http://pyxml.sourceforge.net/topics/docs.html. -Will From english@spiritone.com Sat Sep 14 19:42:31 2002 From: english@spiritone.com (Josh English) Date: 14 Sep 2002 11:42:31 -0700 Subject: [XML-SIG] Broken Link on http://pyxml.sourceforge.net/topics/dtds/index.html Message-ID: <3D838317.6040106@spiritone.com> The link to www.schema.net seems to be broken. I keep getting a page stating that the domain name is for sale. Josh English english@spiritone.com From martin@v.loewis.de Sat Sep 14 21:38:08 2002 From: martin@v.loewis.de (Martin v.
Loewis) Date: 14 Sep 2002 22:38:08 +0200 Subject: [XML-SIG] XML HOWTO In-Reply-To: <3D834741.975F20@interlog.com> References: <3D834741.975F20@interlog.com> Message-ID: Will Rutherdale writes: > Where can I download a copy of the excellent Python XML HOWTO? > > I'd rather get the whole thing and browse it locally rather than > navigate it a page at a time from > http://pyxml.sourceforge.net/topics/docs.html. It's part of the PyXML distribution, in the doc subdirectory. Regards, Martin From uche.ogbuji@fourthought.com Sun Sep 15 14:17:10 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Sun, 15 Sep 2002 07:17:10 -0600 Subject: [XML-SIG] Broken Link on http://pyxml.sourceforge.net/topics/dtds/index.html In-Reply-To: Message from "Josh English" of "14 Sep 2002 11:42:31 PDT." <3D838317.6040106@spiritone.com> Message-ID: > The link to www.schema.net seems to be broken. I keeping getting a page > stating that the domain name is for sale. This domain has been on again, off again. You might find the relevant content at http://www.xmlsoftware.com/ I don't know whether we should change that link until James Tauber says there are no plans to revive the domain. After all, xmlsoftware.com also disappeared for a while. -- Uche Ogbuji Fourthought, Inc. 
http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html From noreply@sourceforge.net Sun Sep 15 16:35:40 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Sun, 15 Sep 2002 08:35:40 -0700 Subject: [XML-SIG] [ pyxml-Bugs-609590 ] XmlprocDriver does not call EntityResolv Message-ID: Bugs item #609590, was opened at 2002-09-15 15:35 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=106473&aid=609590&group_id=6473 Category: SAX Group: None Status: Open Resolution: None Priority: 5 Submitted By: James Kew (jkew) Assigned to: Nobody/Anonymous (nobody) Summary: XmlprocDriver does not call EntityResolv Initial Comment: The xmlproc SAX2 parser does not call the EntityResolver set with setEntityResolver when processing the DOCTYPE declaration:
>>> import xml.sax.sax2exts
>>> import xml.sax.handler
>>>
>>> class MyResolver (xml.sax.handler.EntityResolver):
...     def resolveEntity(self, publicId, systemId):
...         print "resolving:", systemId
...         return systemId
...
>>> parser = xml.sax.sax2exts.XMLParserFactory.make_parser()
>>> parser.setEntityResolver(MyResolver())
>>> doc = parser.parse("module_script.xml")
resolving: script.dtd
>>> parser = xml.sax.sax2exts.XMLValParserFactory.make_parser ()
>>> parser.setEntityResolver(MyResolver())
>>> doc = parser.parse("module_script.xml")
I want to override the default resolver to search for DTDs first in the current directory and then in the directory the script is working in -- and I _usually_ want to validate as for my input the speed difference is not appreciable but developer mistakes are...
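For reference, the kind of resolver described in this report could look roughly like the sketch below. It is only an illustration: the class name SearchPathResolver and its search_dirs argument are invented here, and it can only help with a driver that actually calls the resolver, which according to this report the validating xmlproc driver did not.

```python
import os
import xml.sax.handler


class SearchPathResolver(xml.sax.handler.EntityResolver):
    """Hypothetical resolver that looks for a DTD in several directories."""

    def __init__(self, search_dirs):
        # Directories are tried in the order given.
        self.search_dirs = list(search_dirs)

    def resolveEntity(self, publicId, systemId):
        # Only the file name part of the system identifier is used.
        name = os.path.basename(systemId)
        for directory in self.search_dirs:
            candidate = os.path.join(directory, name)
            if os.path.exists(candidate):
                return candidate
        # Nothing found: fall back to the identifier as given.
        return systemId
```

It would be installed with parser.setEntityResolver(SearchPathResolver([os.curdir, script_dir])), where script_dir stands for whatever directory the script works in.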
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=106473&aid=609590&group_id=6473 From Mike.Olson@fourthought.com Sun Sep 15 17:25:10 2002 From: Mike.Olson@fourthought.com (Mike Olson) Date: 15 Sep 2002 10:25:10 -0600 Subject: [XML-SIG] WXS type library: guidance needed In-Reply-To: References: Message-ID: <1032107113.9922.0.camel@penny> > > If we all came up with an interface here, we could take volunteers for filling > out the library. For one thing, we should probably support W3C XML Schema > (WXS) types, as much as I dislike them, because of their ubiquity. One person > could tackle the numeric types, another the dates (poor sod :-) ), another > text types, another XML structural types etc. Yet more people could add in > useful non-WXS types (geospatial comes to mind, as does a saner date type > implementation). > > Any thoughts? I've wanted to do this for a while now so I'm game. Though, I'm not sure how much time I would have. Do you envision this living in xml-sig or another python package? Mike > > > -- > Uche Ogbuji Fourthought, Inc. > http://uche.ogbuji.net http://4Suite.org http://fourthought.com > Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ > Basic XML and RDF techniques for knowledge management, Part 7 - > http://www-106.ibm.com/developerworks/xml/library/x-think12.html > Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra > ry/x-jclark.html > > > > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://mail.python.org/mailman/listinfo/xml-sig -- Mike Olson Principal Consultant mike.olson@fourthought.com +1 303 583 9900 x 102 Fourthought, Inc. 
http://Fourthought.com PO Box 270590, http://4Suite.org Louisville, CO 80027-5009, USA XML strategy, XML tools, knowledge management From noreply@sourceforge.net Sun Sep 15 20:01:48 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Sun, 15 Sep 2002 12:01:48 -0700 Subject: [XML-SIG] [ pyxml-Bugs-609641 ] minidom nodes not pickleable Message-ID: Bugs item #609641, was opened at 2002-09-15 19:01 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=106473&aid=609641&group_id=6473 Category: DOM Group: None Status: Open Resolution: None Priority: 5 Submitted By: James Kew (jkew) Assigned to: Nobody/Anonymous (nobody) Summary: minidom nodes not pickleable Initial Comment: >>> import xml.dom.minidom >>> dom = xml.dom.minidom.parseString("") >>> import cPickle >>> f = open("tmp.pkl", "wb") >>> cPickle.dump(dom, f, 1) Traceback (most recent call last): File "", line 1, in ? File "C:\PYTHON22\lib\copy_reg.py", line 68, in _reduce dict = getstate() TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled Similar (longer, but ends the same) traceback from pickle.dump. I suspect this is arising from the NodeList/EmptyNodeList class definitions in minicompat.py, but don't know enough about the pickling process to diagnose further. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=106473&aid=609641&group_id=6473 From uche.ogbuji@fourthought.com Mon Sep 16 00:32:18 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Sun, 15 Sep 2002 17:32:18 -0600 Subject: [XML-SIG] XUL and Python In-Reply-To: Message from "Fred L. Drake, Jr." of "Fri, 13 Sep 2002 12:43:08 EDT." 
<15746.5532.790651.253270@grendel.zope.com> Message-ID: I can't find the announcement, and PyXPCOM's development has gone very quiet since Mark Hammond left ActiveState, but I'm pretty sure PyXPCOM was incorporated into the Mozilla build just prior to Mozilla 1.0 release. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra ry/x-jclark.html From loewis@informatik.hu-berlin.de Mon Sep 16 09:48:01 2002 From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=) Date: Mon, 16 Sep 2002 10:48:01 +0200 (CEST) Subject: [XML-SIG] PyXML 0.8.1 is released Message-ID: <200209160848.g8G8m1wJ011376@paros.informatik.hu-berlin.de> Version 0.8.1 of the Python/XML distribution is now available. It should be considered a beta release, and can be downloaded from the following URLs: http://prdownloads.sourceforge.net/pyxml/PyXML-0.8.1.tar.gz http://prdownloads.sourceforge.net/pyxml/PyXML-0.8.1.win32-py2.1.exe http://prdownloads.sourceforge.net/pyxml/PyXML-0.8.1.win32-py2.2.exe http://prdownloads.sourceforge.net/pyxml/PyXML-0.8.1-2.2.i386.rpm Changes in this version, compared to 0.8: * Various bug fixes: - tracing works now with pyexpat - fix registry key for MSIE XBEL support - correct ill-formedness of xmlproc.dtd2schema - avoid adding comments and PIs in the internal subset as Document children when building minidom trees - properly close files in xml.dom.ext.reader.* * XSLT is not installed anymore by default, specify --with-xslt if desired * Update Expat to 1.95.5 * Add features to xml.parsers.expat: - new method "UseForeignDTD" - new attribute "features" * Update to 25 July 2002 LS spec for xml.dom.xmlbuilder. 
  Use expatbuilder if no parser is given to xml.dom.minidom.parse[String].
* Fix many obscure DOM bugs
* Define and document the implementation-defined behaviors of cloneNode() for xml.dom.minidom.
* Use urllib2 instead of urllib throughout.

The Python/XML distribution contains the basic tools required for processing XML data using the Python programming language, assembled into one easy-to-install package. The distribution includes parsers and standard interfaces such as SAX and DOM, along with various other useful modules.

The package currently contains:

* XML parsers: Pyexpat (Jack Jansen), xmlproc (Lars Marius Garshol), sgmlop (Fredrik Lundh).
* SAX interface (Lars Marius Garshol)
* minidom DOM implementation (Paul Prescod, others)
* 4DOM and 4XPath from Fourthought (Uche Ogbuji, Mike Olson)
* Schema implementations: TREX (James Tauber)
* Various utility modules and functions (various people)
* Documentation and example programs (various people)

The code is being developed bazaar-style by contributors from the Python XML Special Interest Group, so please send comments and questions to .

Bug reports may be filed on SourceForge:
http://sourceforge.net/tracker/index.php?group_id=6473&atid=106473

For more information about Python and XML, see:
http://www.python.org/topics/xml/

--
Martin v.
Löwis http://www.informatik.hu-berlin.de/~loewis

From noreply@sourceforge.net Mon Sep 16 09:49:01 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Mon, 16 Sep 2002 01:49:01 -0700
Subject: [XML-SIG] [ pyxml-Bugs-609826 ] nextNode() refuses to step down 2 levels
Message-ID:

Bugs item #609826, was opened at 2002-09-16 01:49
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=106473&aid=609826&group_id=6473

Category: SAX
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Nobody/Anonymous (nobody)
Summary: nextNode() refuses to step down 2 levels

Initial Comment:
This is a truly bizarre bug. I have a simple program which walks the tree, thus:

#!/usr/bin/python2.2
import xml.dom
from xml.dom.ext.reader import Sax2
# Added: NodeFilter was used below but never imported in the posted snippet.
from xml.dom.NodeFilter import NodeFilter

# Added: inReader was used below but never created in the posted snippet.
inReader = Sax2.Reader()
inxml = open("in.xml")
inDoc = inReader.fromStream(inxml)
inWalker = inDoc.createTreeWalker(inDoc, NodeFilter.SHOW_ELEMENT, None, 0)
while 1:
    print "quux - '%s' '%s'" % (inWalker.currentNode.namespaceURI,
                                inWalker.currentNode.localName)
    next = inWalker.nextNode()
    if next == None:
        break

This is in.xml mark 1:

<?xml version="1.0" encoding="utf-8" ?>

Running this file through the walker produces the following debug:

quux - 'None' 'None'
quux - 'DAV:' 'propertyupdate'
quux - 'DAV:' 'remove'
quux - 'None' 'intact'
quux - 'None' 'removeme'

Bizarre! It stops at removeme, and doesn't traverse the rest of the tree. However, if you nuke removeme, leaving "", you get:

quux - 'None' 'None'
quux - 'DAV:' 'propertyupdate'
quux - 'DAV:' 'remove'
quux - 'None' 'intact'
quux - 'DAV:' 'set'
quux - 'None' 'mjuuk0wz'
quux - 'None' 'mook0wz'
quux - 'None' 'nest1'
quux - 'None' 'nest2'
quux - 'None' 'nest3'

Odd ... NOW it traverses the whole tree. I cannot for the life of me figure out what this is - an issue with dropping down more than x levels? FWIW, it also happens when you reverse "set" and "remove". This occurs with both 0.11.x, and 0.12.x (Python 2.1 and 2.2).
Is there a better way to walk the tree?

Thanks in advance! :)

d

----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=106473&aid=609826&group_id=6473

From vdv@dyomedea.com Mon Sep 16 14:23:26 2002
From: vdv@dyomedea.com (Eric van der Vlist)
Date: 16 Sep 2002 15:23:26 +0200
Subject: [XML-SIG] WXS type library: guidance needed
In-Reply-To: <1032107113.9922.0.camel@penny>
References: <1032107113.9922.0.camel@penny>
Message-ID: <1032182606.21316.335.camel@ibook>

On Sun, 2002-09-15 at 18:25, Mike Olson wrote:
> >
> > If we all came up with an interface here, we could take volunteers for filling
> > out the library. For one thing, we should probably support W3C XML Schema
> > (WXS) types, as much as I dislike them, because of their ubiquity. One person
> > could tackle the numeric types, another the dates (poor sod :-) ), another
> > text types, another XML structural types etc. Yet more people could add in
> > useful non-WXS types (geospatial comes to mind, as does a saner date type
> > implementation).
> >
> > Any thoughts?
>
> I've wanted to do this for a while now so I'm game. Though, I'm not
> sure how much time I would have.

Great!

> Do you envision this living in xml-sig or another python package?

If xml-sig is interested in the project, I think that, yes, it would make sense to develop a type library there -- even though it could be considered more generic than XML...

Eric

> Mike
>
> >
> >
> > --
> > Uche Ogbuji Fourthought, Inc.
> > http://uche.ogbuji.net http://4Suite.org http://fourthought.com
> > Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
> > Basic XML and RDF techniques for knowledge management, Part 7 -
> > http://www-106.ibm.com/developerworks/xml/library/x-think12.html
> > Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html
> >
> >
> >
> > _______________________________________________
> > XML-SIG maillist - XML-SIG@python.org
> > http://mail.python.org/mailman/listinfo/xml-sig
> --
> Mike Olson Principal Consultant
> mike.olson@fourthought.com +1 303 583 9900 x 102
> Fourthought, Inc. http://Fourthought.com
> PO Box 270590, http://4Suite.org
> Louisville, CO 80027-5009, USA
> XML strategy, XML tools, knowledge management
>
>
--
Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist http://xmlfr.org http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

From vdv@dyomedea.com Mon Sep 16 15:10:12 2002
From: vdv@dyomedea.com (Eric van der Vlist)
Date: 16 Sep 2002 16:10:12 +0200
Subject: [XML-SIG] Some thoughts about types libraries
Message-ID: <1032185412.21316.415.camel@ibook>

As candidates for type library interfaces, we have right now:

1) The work started by Andrew with feedback from Thomas Passin:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/pyxml/sandbox/datatypes/

2) An "extremely simplistic" version which I am currently using in XVIF:
http://downloads.xmlschemata.org/python/xvif/rngCoreTypeLib.py
in which, right now, a type has only a constructor (which performs any canonicalization and syntactical checks) and an isEqual() method.

Both are going more or less in the same direction and I am wondering if this is the right way to go...
One objection to these two proposals is that they model XML datatypes as if they were something very different from other classes (including built-in Python types).

It is true that some datatypes defined by W3C XML Schema have no direct equivalent in Python, but this is not always the case, and I am wondering whether:

a) when there is a mapping (for instance, xs:integer is a signed integer of arbitrary length, i.e. the same thing as a Python "long"), we shouldn't use this fact and implement the corresponding datatype as a subclass of the native Python type;

b) when there is no such mapping, we shouldn't consider adding methods to make the type usable outside the scope of XML validation.

I am not sure I am being very clear, and it's rather abstract and fuzzy to explain, but I think that there is a difference between designing a type library which could only be used in the context of XML validation and a type library which could eventually be useful "standalone" to applications needing to manipulate the abstract objects represented by the datatype library.

Eric
--
Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist http://xmlfr.org http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

From rsalz@datapower.com Mon Sep 16 15:21:17 2002
From: rsalz@datapower.com (Rich Salz)
Date: Mon, 16 Sep 2002 10:21:17 -0400
Subject: [XML-SIG] Some thoughts about types libraries
References: <1032185412.21316.415.camel@ibook>
Message-ID: <3D85E8DD.3030200@datapower.com>

> As candidates for type libraries interfaces, we have right now:

...

Has nobody looked at the ZSI typecode stuff? I'm not sure what folks want in a "type library," but certainly for XML-centric I think it's a good start.
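
[Editorial sketch] Option (a) raised earlier in this thread, implementing xs:integer as a subclass of the native Python integer, could be sketched as below. It is written for current Python, where int is already arbitrary-precision (the role "long" played in 2002). The class name XsInteger is hypothetical, and the lexical check covers only an optional sign followed by ASCII digits after trimming XML whitespace. It is not taken from any of the libraries discussed here.

```python
import re


class XsInteger(int):
    """An xs:integer value that is also an ordinary Python int."""

    # Lexical form: optional sign, then one or more ASCII digits.
    _LEXICAL = re.compile(r"[+-]?[0-9]+\Z")

    def __new__(cls, lexical):
        # Trim XML whitespace (space, tab, CR, LF) before checking the syntax.
        text = lexical.strip(" \t\r\n")
        if not cls._LEXICAL.match(text):
            raise ValueError("not a valid xs:integer: %r" % (lexical,))
        # Canonicalization comes for free: "+042" and "42" give the same value.
        return super().__new__(cls, text)
```

Because the instance is a real int, comparison, arithmetic and hashing need no extra methods, so a separate isEqual() becomes unnecessary for this type. The result of arithmetic is a plain int rather than an XsInteger, which may or may not be what a type library wants.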
/r$

From vdv@dyomedea.com Mon Sep 16 15:25:49 2002
From: vdv@dyomedea.com (Eric van der Vlist)
Date: 16 Sep 2002 16:25:49 +0200
Subject: [XML-SIG] Some thoughts about types libraries
In-Reply-To: <3D85E8DD.3030200@datapower.com>
References: <1032185412.21316.415.camel@ibook> <3D85E8DD.3030200@datapower.com>
Message-ID: <1032186350.21316.433.camel@ibook>

On Mon, 2002-09-16 at 16:21, Rich Salz wrote:
>
> > As candidates for type libraries interfaces, we have right now:
>
> ...
>
> Has nobody looked at the ZSI typecode stuff? I'm not sure what folks
> want in a "type library," but certainly for XML-centric I think it's a
> good start.

Ooops, it's the second time you mention it and I am taking a look right now!

Thanks

Eric

> /r$
>
>
>
--
Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist http://xmlfr.org http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

From vdv@dyomedea.com Mon Sep 16 16:04:15 2002
From: vdv@dyomedea.com (Eric van der Vlist)
Date: 16 Sep 2002 17:04:15 +0200
Subject: [XML-SIG] Some thoughts about types libraries
In-Reply-To: <1032186350.21316.433.camel@ibook>
References: <1032185412.21316.415.camel@ibook> <3D85E8DD.3030200@datapower.com> <1032186350.21316.433.camel@ibook>
Message-ID: <1032188656.21316.512.camel@ibook>

On Mon, 2002-09-16 at 16:25, Eric van der Vlist wrote:
> On Mon, 2002-09-16 at 16:21, Rich Salz wrote:
> >
> > > As candidates for type libraries interfaces, we have right now:
> >
> > ...
> >
> > Has nobody looked at the ZSI typecode stuff? I'm not sure what folks
> > want in a "type library," but certainly for XML-centric I think it's a
> > good start.
>
> Ooops, it's the second time you mention it and I am taking a look right
> now!
Can you give me some hints about the status of the projects under the Python Web Services [1] umbrella?

SOAPy [2] seems pretty ambitious with full support for W3C XML Schema but hasn't been updated since April 2001, i.e. before W3C XML Schema went rec.

Your own ZSI [3] seems very interesting too, although I note that some exotic types needed to be fully compliant (such as xs:NOTATION or xs:ENTITY) appear to be missing. Also, a type library should support facets and I don't think yours does.

Do you have plans to support all the W3C XML Schema datatypes and eventually their facets?

Thanks

Eric

[1] http://pywebsvcs.sourceforge.net/
[2] http://sourceforge.net/projects/soapy
[3] http://pywebsvcs.sourceforge.net/zsi.html

>
> Thanks
>
> Eric
>
> > /r$
--
Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist http://xmlfr.org http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

From rsalz@datapower.com Mon Sep 16 16:24:24 2002
From: rsalz@datapower.com (Rich Salz)
Date: Mon, 16 Sep 2002 11:24:24 -0400
Subject: [XML-SIG] Some thoughts about types libraries
References: <1032185412.21316.415.camel@ibook> <3D85E8DD.3030200@datapower.com> <1032186350.21316.433.camel@ibook> <1032188656.21316.512.camel@ibook>
Message-ID: <3D85F7A8.7050104@datapower.com>

> Can you give me some hints about the status of the projects under the
> Python Web Services [1] umbrella?

I'll try...

> SOAPy seems pretty ambitious with full support for W3C XML Schema but
> hasn't been updated since April 2001, ie before W3C XML Schema went rec.

I think it does more types (as you point out, NOTATION and ENTITY), but while it's had a couple of bug-fixes, it's pretty inactive.
> Your own ZSI [3] seems very interesting too, although I note that some
> exotic types needed to be fully compliant (such as xs:NOTATION or
> xs:ENTITY) appear to be missing. Also, a type library should support
> facets and I don't think yours does.

Right now it supports only the types that can be serialized as a SOAP RPC Encoding item. I don't have a lot of time to spend on ZSI, unfortunately. Certainly if it became the basis of a type library, I'd have an opinion on how to do facets, add other types, etc. My initial reaction would be to add a "check_constraints" method to each TC class ...

> Do you have plans to support all the W3C XML Schema datatypes and
> eventually their facets?

I'll do what I can, and hopefully others would find it useful to move forward.

/r$

From uche.ogbuji@fourthought.com Tue Sep 17 17:13:45 2002
From: uche.ogbuji@fourthought.com (Uche Ogbuji)
Date: 17 Sep 2002 10:13:45 -0600
Subject: [XML-SIG] core dumps in stylesheetreader/pyexpat?
Message-ID: <1032279228.693.4285.camel@malatesta>

It turns out that the mysterious core dumps I've been having with 4XSLT are in stylesheetReader.
I took a closer look at the trace: #0 0x400cb80d in free () from /lib/libc.so.6 (gdb) (gdb) bt #0 0x400cb80d in free () from /lib/libc.so.6 #1 0x400cb6d3 in free () from /lib/libc.so.6 #2 0x4033116f in XML_ParserFree (parser=0x84d1308) at extensions/expat/lib/xmlparse.c:1003 #3 0x4032ecef in xmlparse_dealloc (self=0x84c8a14) at extensions/pyexpat.c:1294 #4 0x080974a6 in collect (young=0x80e826c, old=0x80e8278) at Modules/gcmodule.c:343 #5 0x08097770 in collect_generations () at Modules/gcmodule.c:478 #6 0x08097c6e in _PyObject_GC_New (tp=0x80f1960) at Modules/gcmodule.c:847 #7 0x080ba17d in PyList_New (size=2) at Objects/listobject.c:64 #8 0x0807570c in eval_frame (f=0x84654d4) at Python/ceval.c:1760 #9 0x08076d5d in PyEval_EvalCodeEx (co=0x83a9de0, globals=0x839d18c, locals=0x0, args=0x8465174, argcount=3, kws=0x8465180, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2585 #10 0x08078db8 in fast_function (func=0x839fd34, pp_stack=0xbfffdf44, n=3, na=3, nk=0) at Python/ceval.c:3161 #11 0x08075e21 in eval_frame (f=0x8465014) at Python/ceval.c:2024 #12 0x08076d5d in PyEval_EvalCodeEx (co=0x839bcd8, globals=0x839d18c, locals=0x0, args=0x84c7a40, argcount=3, kws=0x84c7a4c, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2585 #13 0x08078db8 in fast_function (func=0x83ab28c, pp_stack=0xbfffe094, n=3, na=3, nk=0) at Python/ceval.c:3161 ---Type to continue, or q to quit--- #14 0x08075e21 in eval_frame (f=0x84c786c) at Python/ceval.c:2024 #15 0x08076d5d in PyEval_EvalCodeEx (co=0x83944c8, globals=0x839748c, locals=0x0, args=0x85440d0, argcount=3, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2585 #16 0x080b7b49 in function_call (func=0x83db884, arg=0x85440c4, kw=0x0) at Objects/funcobject.c:374 #17 0x080a6710 in PyObject_Call (func=0x83db884, arg=0x85440c4, kw=0x0) at Objects/abstract.c:1684 #18 0x080ad02f in instancemethod_call (func=0x83db884, arg=0x84672ac, kw=0x0) at Objects/classobject.c:2276 #19 0x080a6710 in 
PyObject_Call (func=0x84ba91c, arg=0x84672ac, kw=0x0) at Objects/abstract.c:1684 #20 0x0807753f in PyEval_CallObjectWithKeywords (func=0x84ba91c, arg=0x84672ac, kw=0x0) at Python/ceval.c:3049 #21 0x4032baa0 in call_with_frame (c=0x84b84c8, func=0x84ba91c, args=0x84672ac) at extensions/pyexpat.c:335 #22 0x4032c028 in my_StartElementHandler (userData=0x84cad14, name=0x85313a8 "http://www.w3.org/1999/XSL/Transform param", atts=0x852f508) at extensions/pyexpat.c:526 #23 0x40331b9e in doContent (parser=0x852f270, startTagLevel=0, enc=0x40352e20, s=0x854232b "\n\n \n \n to continue, or q to quit--- tring($length,1,1) = '0'", ' ' , "or substring($length,1,"..., end=0x854258e "", nextPtr=0x852f288) at extensions/expat/lib/xmlparse.c:2058 #24 0x40337af4 in contentProcessor (parser=0x852f270, start=0x84d4a28 "ribute-set", end=0x732d6574
, endPtr=0x852f288) at extensions/expat/lib/xmlparse.c:1691 #25 0x40337787 in XML_ParseBuffer (parser=0x852f270, len=2048, isFinal=0) at extensions/expat/lib/xmlparse.c:1394 #26 0x4032e899 in xmlparse_ParseFile (self=0x84cad14, args=0x84b7b24) at extensions/pyexpat.c:931 #27 0x080c35b4 in PyCFunction_Call (func=0x83f3400, arg=0x84b7b24, kw=0x0) at Objects/methodobject.c:80 #28 0x08075d83 in eval_frame (f=0x84b17d4) at Python/ceval.c:2004 #29 0x08076d5d in PyEval_EvalCodeEx (co=0x83f9698, globals=0x837ec14, locals=0x0, args=0x817864c, argcount=2, kws=0x8178654, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2585 #30 0x08078db8 in fast_function (func=0x83ff554, pp_stack=0xbfffe624, n=2, na=2, nk=0) at Python/ceval.c:3161 #31 0x08075e21 in eval_frame (f=0x81784e4) at Python/ceval.c:2024 #32 0x08076d5d in PyEval_EvalCodeEx (co=0x83f9780, globals=0x837ec14, locals=0x0, args=0x819e340, argcount=2, kws=0x819e348, kwcount=0, defs=0x83fb338, defcount=1, closure=0x0) at Python/ceval.c:2585 #33 0x08078db8 in fast_function (func=0x84007cc, pp_stack=0xbfffe774, n=2, ---Type to continue, or q to quit--- na=2, nk=0) at Python/ceval.c:3161 #34 0x08075e21 in eval_frame (f=0x819e1c4) at Python/ceval.c:2024 #35 0x08076d5d in PyEval_EvalCodeEx (co=0x8395a48, globals=0x839748c, locals=0x0, args=0x84b86ec, argcount=3, kws=0x84b86f8, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2585 #36 0x08078db8 in fast_function (func=0x83dba7c, pp_stack=0xbfffe8c4, n=3, na=3, nk=0) at Python/ceval.c:3161 #37 0x08075e21 in eval_frame (f=0x84b851c) at Python/ceval.c:2024 #38 0x08076d5d in PyEval_EvalCodeEx (co=0x83944c8, globals=0x839748c, locals=0x0, args=0x84b0ad8, argcount=3, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2585 #39 0x080b7b49 in function_call (func=0x83db884, arg=0x84b0acc, kw=0x0) at Objects/funcobject.c:374 #40 0x080a6710 in PyObject_Call (func=0x83db884, arg=0x84b0acc, kw=0x0) at Objects/abstract.c:1684 #41 0x080ad02f in 
instancemethod_call (func=0x83db884, arg=0x8468cf4, kw=0x0) at Objects/classobject.c:2276 #42 0x080a6710 in PyObject_Call (func=0x836dfd4, arg=0x8468cf4, kw=0x0) at Objects/abstract.c:1684 #43 0x0807753f in PyEval_CallObjectWithKeywords (func=0x836dfd4, arg=0x8468cf4, kw=0x0) at Python/ceval.c:3049 #44 0x4032baa0 in call_with_frame (c=0x84b84c8, func=0x836dfd4, args=0x8468cf4) at extensions/pyexpat.c:335 ---Type to continue, or q to quit--- #45 0x4032c028 in my_StartElementHandler (userData=0x84216fc, name=0x84b7de8 "http://www.w3.org/1999/XSL/Transform include", atts=0x84b5e38) at extensions/pyexpat.c:526 #46 0x40331b9e in doContent (parser=0x84b5ba0, startTagLevel=0, enc=0x40352e20, s=0x84b76a9 "\n\n\n\n, "xmlns:doc=\"http://nwalsh.com/xsl/documentation/1.0\"\n", ' ' , "exclude-result-prefixes=\"doc\"\n", ' ' , "versi"..., end=0x84b7b10 "", tok=29, next=0x84b7326 ", "xmlns:doc=\"http://nwalsh.com/xsl/documentation/1.0\"\n", ' ' , "exclude-result-prefixes=\"doc\"\n", ' ' , "versi"..., nextPtr=0x84b5bb8) at extensions/expat/lib/xmlparse.c:1691 #48 0x4033903f in prologProcessor (parser=0x84b5ba0, s=0x84b7310 "\n, "xmlns:doc=\"http://nwals---Type to continue, or q to quit--- h.com/xsl/documentation/1.0\"\n", ' ' , "exclude-result-prefixes=\"doc\""..., end=0x84b7b10 "", nextPtr=0x84b5bb8) at extensions/expat/lib/xmlparse.c:3096 #49 0x40337ab3 in prologInitProcessor (parser=0x84b5ba0, s=0x84b7310 "\n, "xmlns:doc=\"http://nwalsh.com/xsl/documentation/1.0\"\n", ' ' , "exclude-result-prefixes=\"doc\""..., end=0x84b7b10 "", nextPtr=0x84b5bb8) at extensions/expat/lib/xmlparse.c:2927 #50 0x40337787 in XML_ParseBuffer (parser=0x84b5ba0, len=2048, isFinal=0) at extensions/expat/lib/xmlparse.c:1394 #51 0x4032e899 in xmlparse_ParseFile (self=0x84216fc, args=0x84b1184) at extensions/pyexpat.c:931 #52 0x080c35b4 in PyCFunction_Call (func=0x840f9e8, arg=0x84b1184, kw=0x0) at Objects/methodobject.c:80 #53 0x08075d83 in eval_frame (f=0x83f7ce4) at Python/ceval.c:2004 #54 0x08076d5d 
in PyEval_EvalCodeEx (co=0x83f9698, globals=0x837ec14, locals=0x0, args=0x8426aec, argcount=2, kws=0x8426af4, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2585 #55 0x08078db8 in fast_function (func=0x83ff554, pp_stack=0xbfffef14, n=2, na=2, nk=0) at Python/ceval.c:3161 #56 0x08075e21 in eval_frame (f=0x8426984) at Python/ceval.c:2024 #57 0x08076d5d in PyEval_EvalCodeEx (co=0x83f9780, globals=0x837ec14, ---Type to continue, or q to quit--- locals=0x0, args=0x81ce7e8, argcount=2, kws=0x81ce7f0, kwcount=1, defs=0x83fb338, defcount=1, closure=0x0) at Python/ceval.c:2585 #58 0x08078db8 in fast_function (func=0x84007cc, pp_stack=0xbffff064, n=4, na=2, nk=1) at Python/ceval.c:3161 #59 0x08075e21 in eval_frame (f=0x81ce68c) at Python/ceval.c:2024 #60 0x08076d5d in PyEval_EvalCodeEx (co=0x83851c0, globals=0x8378b2c, locals=0x0, args=0x84207c4, argcount=2, kws=0x84207cc, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2585 #61 0x08078db8 in fast_function (func=0x83e5a14, pp_stack=0xbffff1b4, n=2, na=2, nk=0) at Python/ceval.c:3161 #62 0x08075e21 in eval_frame (f=0x84205fc) at Python/ceval.c:2024 #63 0x08076d5d in PyEval_EvalCodeEx (co=0x836da40, globals=0x8378604, locals=0x0, args=0x818ecb4, argcount=2, kws=0x818ecbc, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2585 #64 0x08078db8 in fast_function (func=0x83f2054, pp_stack=0xbffff304, n=2, na=2, nk=0) at Python/ceval.c:3161 #65 0x08075e21 in eval_frame (f=0x818eb5c) at Python/ceval.c:2024 #66 0x08076d5d in PyEval_EvalCodeEx (co=0x841a788, globals=0x841bc8c, locals=0x0, args=0x8378864, argcount=2, kws=0x837886c, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2585 #67 0x08078db8 in fast_function (func=0x841cc44, pp_stack=0xbffff454, n=2, na=2, nk=0) at Python/ceval.c:3161 #68 0x08075e21 in eval_frame (f=0x8378704) at Python/ceval.c:2024 ---Type to continue, or q to quit--- #69 0x08076d5d in PyEval_EvalCodeEx (co=0x83ff2b0, globals=0x8419564, locals=0x0, 
args=0x81126e8, argcount=1, kws=0x81126ec, kwcount=0, defs=0x83fee40, defcount=2, closure=0x0) at Python/ceval.c:2585 #70 0x08078db8 in fast_function (func=0x841c0ac, pp_stack=0xbffff5a4, n=1, na=1, nk=0) at Python/ceval.c:3161 #71 0x08075e21 in eval_frame (f=0x811259c) at Python/ceval.c:2024 #72 0x08076d5d in PyEval_EvalCodeEx (co=0x8119450, globals=0x810b5e4, locals=0x810b5e4, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2585 #73 0x08078d59 in PyEval_EvalCode (co=0x8119450, globals=0x810b5e4, locals=0x810b5e4) at Python/ceval.c:483 #74 0x08092ee3 in run_node (n=0x811e008, filename=0xbffff969 "/home/uogbuji/lib/python2.2/site-packages/Ft/Share/Bin/4xslt", globals=0x810b5e4, locals=0x810b5e4, flags=0xbffff7a8) at Python/pythonrun.c:1079 #75 0x08092e9e in run_err_node (n=0x811e008, filename=0xbffff969 "/home/uogbuji/lib/python2.2/site-packages/Ft/Share/Bin/4xslt", globals=0x810b5e4, locals=0x810b5e4, flags=0xbffff7a8) at Python/pythonrun.c:1066 #76 0x08092b2a in PyRun_FileExFlags (fp=0x80fbd28, filename=0xbffff969 "/home/uogbuji/lib/python2.2/site-packages/Ft/Share/Bin/4xslt", start=257, globals=0x810b5e4, locals=0x810b5e4, closeit=1, flags=0xbffff7a8) at Python/pythonrun.c:1057 ---Type to continue, or q to quit--- #77 0x080917f1 in PyRun_SimpleFileExFlags (fp=0x80fbd28, filename=0xbffff969 "/home/uogbuji/lib/python2.2/site-packages/Ft/Share/Bin/4xslt", closeit=1, flags=0xbffff7a8) at Python/pythonrun.c:685 #78 0x080926ec in PyRun_AnyFileExFlags (fp=0x80fbd28, filename=0xbffff969 "/home/uogbuji/lib/python2.2/site-packages/Ft/Share/Bin/4xslt", closeit=1, flags=0xbffff7a8) at Python/pythonrun.c:495 #79 0x08053494 in Py_Main (argc=4, argv=0xbffff834) at Modules/main.c:364 #80 0x08052d78 in main (argc=4, argv=0xbffff834) at Modules/python.c:10 --- I am using latest Python 2.2.1 + PyXML from CVS, not Python+pyexpat. Any other developers with this config? 
It seems it is certain features in the stylesheet that trigger the dump. I'm not yet sure which, but I have found that though a run sometimes passes without a core dump, when a dump does occur it is always certain files that cause it. Unfortunately, I have not been lucky at all with the huge (nwalsh) docbook stylesheets. I get a core dump every time.

Ideas?

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html
Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html

From Will.Rutherdale@sciatl.com Tue Sep 17 17:09:16 2002
From: Will.Rutherdale@sciatl.com (Rutherdale, Will)
Date: Tue, 17 Sep 2002 12:09:16 -0400
Subject: [XML-SIG] pyXML for Python 1.5.2
Message-ID: <87D9C0FBA428D31195370008C791143A01C94864@mxtor01.sciatl.com>

Hi.

I need to get the pyXML working on some systems which have Python 1.5.2 running. I'm not in a position to change their configuration to a more recent version of Python such as 2.2.

When I go to the download page from http://pyxml.sourceforge.net/topics, I see that pyxml is only available down to version 0.8 and for Python 2.1 or 2.2.

Is there a place I can get a download of an earlier version of pyXML, so that it will work with Python 1.5.2?

-Will

- - - - - - - Appended by Scientific-Atlanta, Inc. - - - - - - -
This e-mail and any attachments may contain information which is confidential, proprietary, privileged or otherwise protected by law.
The information is solely intended for the named addressee (or a person responsible for delivering it to the addressee). If you are not the intended recipient of this message, you are not authorized to read, print, retain, copy or disseminate this message or any part of it. If you have received this e-mail in error, please notify the sender immediately by return e-mail and delete it from your computer.

From fdrake@acm.org Tue Sep 17 17:17:50 2002
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Tue, 17 Sep 2002 12:17:50 -0400
Subject: [XML-SIG] pyXML for Python 1.5.2
In-Reply-To: <87D9C0FBA428D31195370008C791143A01C94864@mxtor01.sciatl.com>
References: <87D9C0FBA428D31195370008C791143A01C94864@mxtor01.sciatl.com>
Message-ID: <15751.21934.390626.404731@grendel.zope.com>

Rutherdale, Will writes:
> I need to get the pyXML working on some systems which have Python 1.5.2
> running. I'm not in a position to change their configuration to a more
> recent version of Python such as 2.2.

Ouch!

> When I go to the download page from http://pyxml.sourceforge.net/topics, I
> see that pyxml is only available down to version 0.8 and for Python 2.1 or
> 2.2.
>
> Is there a place I can get a download of an earlier version of pyXML, so
> that it will work with Python 1.5.2?

It's been a while since PyXML has supported Python versions earlier than 2.0, but the current code should work just fine with Python 2.0. I'll see if I can get an installer for that posted on SourceForge.

Martin, is there any reason the older versions aren't listed on the SourceForge downloads page? At least the most recent version that supports Python 1.5.2 should probably still be listed.

-Fred

--
Fred L. Drake, Jr.
PythonLabs at Zope Corporation

From fredrik@pythonware.com Tue Sep 17 17:46:29 2002
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Tue, 17 Sep 2002 18:46:29 +0200
Subject: [XML-SIG] pyXML for Python 1.5.2
References: <87D9C0FBA428D31195370008C791143A01C94864@mxtor01.sciatl.com>
Message-ID: <045701c25e69$c70732f0$0900a8c0@spiff>

Will wrote:

X-Spam-Status: tests=MIME_NULL_BLOCK,EXCUSE_16,SUPERLONG_LINE,BIG_FONT
X-Spam-Level: ***

> I need to get the pyXML working on some systems which have Python 1.5.2
> running. I'm not in a position to change their configuration to a more
> recent version of Python such as 2.2.
>
> When I go to the download page from http://pyxml.sourceforge.net/topics, I
> see that pyxml is only available down to version 0.8 and for Python 2.1 or
> 2.2.
>
> Is there a place I can get a download of an earlier version of pyXML, so
> that it will work with Python 1.5.2?

older versions appear to be available here:

http://download.sourceforge.net/pyxml/

From martin@v.loewis.de Tue Sep 17 23:03:07 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 18 Sep 2002 00:03:07 +0200
Subject: [XML-SIG] pyXML for Python 1.5.2
In-Reply-To: <15751.21934.390626.404731@grendel.zope.com>
References: <87D9C0FBA428D31195370008C791143A01C94864@mxtor01.sciatl.com> <15751.21934.390626.404731@grendel.zope.com>
Message-ID:

"Fred L. Drake, Jr." writes:
> Martin, is there any reason the older versions aren't listed on the
> SourceForge downloads page? At least the most recent version that
> supports Python 1.5.2 should probably still be listed.

I've restored 0.7.1 now; if people want specific other versions listed
as well, please let me know.

Regards,
Martin

From uche.ogbuji@fourthought.com Wed Sep 18 21:45:43 2002
From: uche.ogbuji@fourthought.com (Uche Ogbuji)
Date: 18 Sep 2002 14:45:43 -0600
Subject: [XML-SIG] Eureka! (Core dumps in 4Suite and PyXML)
Message-ID: <1032381945.2853.1151.camel@malatesta>

Well, I found the source of the core dumps, I think.

First of all, I used valgrind.
After ignoring the well-known spurious error reports for Python, the
only reports were for illegal memory overwrites in expat/lib/xmlparse.c

For example:

==6975== Invalid write of size 2
==6975==    at 0x43A98077: doContent (Ft/Xml/src/expat/lib/xmlparse.c:2110)
==6975==    by 0x43A9E59E: contentProcessor (Ft/Xml/src/expat/lib/xmlparse.c:1691)
==6975==    by 0x43A9E1E6: XML_ParseBuffer (Ft/Xml/src/expat/lib/xmlparse.c:1394)
==6975==    by 0x43A9E18C: XML_Parse (Ft/Xml/src/expat/lib/xmlparse.c:1382)
==6975== Address 0x40B114DE is 10 bytes before a block of size 28 free'd
==6975==    at 0x40044946: free (vg_clientfuncs.c:180)
==6975==    by 0x80579CD: _PyObject_Del (Objects/object.c:143)
==6975==    by 0x805C9EC: string_dealloc (Objects/stringobject.c:504)
==6975==    by 0x805C31B: PyString_InternInPlace (Objects/stringobject.c:3628)
==6975==

So, since I recently upgraded 4Suite to use 1.95.5, I backed that out
and restored the 1.95.4 files. The core dumps went away in 4Suite.

I tried the PyXML versions before the move to the newest expat: both
0.7.1 and 0.8.0. No core dumps in either case.

So it seems to me it's something in expat 1.95.5. Following the pointer
from valgrind gives the following block of lines as the suspect:

2103    if (ns && localPart) {
2104      /* localPart and prefix may have been overwritten in
2105         tag->name.str, since this points to the binding->uri
2106         buffer which gets re-used; so we have to add them again
2107      */
2108      uri = (XML_Char *)tag->name.str + tag->name.uriLen;
2109      /* don't need to check for space - already done in storeAtts() */
2110      while (*localPart) *uri++ = *localPart++;
2111      prefix = (XML_Char *)tag->name.prefix;
2112      if (ns_triplets && prefix) {
2113        *uri++ = namespaceSeparator;
2114        while (*prefix) *uri++ = *prefix++;
2115      }
2116      *uri = XML_T('\0');
2117    }

With 2110 being the line singled out. It says that memory was already
freed. Doesn't say whether uri or localPart.

I've backed out of the latest expat on my own machine so I can continue
working.
Since Jeremy is also seeing core dumps, I'll probably check in that
reversion. But I'd like any assistance making sure I'm not off my skull.
Anyone have any other ideas for verification?

--
Uche Ogbuji
Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html
Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html

From Juergen Hermann"

Hi!

Since installing 0.8.1, our SOAPpy Unittest fails. Before investigating
further, did anyone else have such problems?

Ciao, Jürgen

--
Jürgen Hermann, Developer
WEB.DE AG, http://webde-ag.de/

From scjuonline@web.de Thu Sep 19 12:26:50 2002
From: scjuonline@web.de (Jürgen Schmidt)
Date: Thu, 19 Sep 2002 13:26:50 +0200
Subject: [XML-SIG] producing large XML files
Message-ID: <002d01c25fcf$736d09f0$0a00a8c0@pamela2000>

Hi,

I'm quite new to Python and XML. I have to parse files of XML, which
could get pretty large. I also need to change things within those files
from time to time. I decided to use SAX for the "reading" part. It works
well. But now I'm stuck, because I can't find an equivalent for writing
large XML files.

In the Project source code, I found a class called XmlWriter (within
writer.py). But it isn't mentioned somewhere in the docs and I couldn't
find any examples.
Is this class maintained and will be developed further? Or is there a
better solution (besides writing my own classes)?

Thx for the help.

Juergen

From martin@v.loewis.de Thu Sep 19 12:49:16 2002
From: martin@v.loewis.de (Martin v.
Loewis)
Date: 19 Sep 2002 13:49:16 +0200
Subject: [XML-SIG] producing large XML files
In-Reply-To: <002d01c25fcf$736d09f0$0a00a8c0@pamela2000>
References: <002d01c25fcf$736d09f0$0a00a8c0@pamela2000>
Message-ID:

Jürgen Schmidt writes:
> In the Project source code, I found a class called XmlWriter (within
> writer.py). But it isn't mentioned somewhere in the docs and I couldn't
> find any examples.
> Is this class maintained and will be developed further?

The class is complete, so no further development is necessary. It is
also quite trivial.

> Or is there a better solution (besides writing my own classes)?

I usually recommend to generate XML with print statements. Works like a
charm. People will respond that one has to consider charsets and
special characters when doing so, which is true.

Regards,
Martin

From scjuonline@web.de Thu Sep 19 14:23:25 2002
From: scjuonline@web.de (Jürgen Schmidt)
Date: Thu, 19 Sep 2002 15:23:25 +0200
Subject: [XML-SIG] producing large XML files
Message-ID: <200209191523.25376.scjuonline@web.de>

Ok.
Here is what I've tried:

#!/usr/local/bin/python

import sys
from xml.sax.writer import XmlWriter

class layoutwriter(XmlWriter):
    def __init__(self):
        XmlWriter.__init__(self,sys.stdout)

lw = layoutwriter()
lw.startDocument()
lw.startElement("doc")
lw.characters("Hallo",0,len("Hallo"))

# lw.handle_cdata("das ist html")
# if you uncomment this, one will get:
#
# Hallo
#
# but shouldn't the data appear without escaping?
#

# lw.comment("great",0,len("great"))
# if you uncomment this, the comment won't show up (in the xml ;-)

# lw.processingInstruction("target",'data="wert\n"')
# if you uncomment this, one will get:
#
#Hallo
#Traceback (most recent call last):
#  File "/home/scju/tmp/xmlwriter/write.py", line 16, in ?
#    lw.processingInstruction("target",'data="wert"')
#  File "/usr/local/lib/python2.2/site-packages/_xmlplus/sax/writer.py", line 392, in processingInstruction
#    self._offset = len(s) - (p + 1)
#NameError: global name 'p' is not defined

# Is my function call wrong?

lw.endElement("doc")
lw.endDocument()

thx
Juergen

From martin@v.loewis.de Thu Sep 19 19:51:31 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 19 Sep 2002 20:51:31 +0200
Subject: [XML-SIG] producing large XML files
In-Reply-To: <200209191523.25376.scjuonline@web.de>
References: <200209191523.25376.scjuonline@web.de>
Message-ID:

Jürgen Schmidt writes:
> Ok.
> Here is what I've tried:

I see. Let me take back my earlier statement: XmlWriter is *not*
current, anymore, as it is a SAX1 application; we encourage users to
use SAX2 these days. The mostly-equivalent SAX2 class is
xml.sax.saxlib.XMLGenerator or xml.sax.saxlib.LexicalXMLGenerator (in
your case, you need the LexicalXMLGenerator).

That said, XmlWriter ought to work. Feel free to submit bug reports at
sf.net/projects/pyxml. If you can, patches would be even better.

> lw.characters("Hallo",0,len("Hallo"))

In SAX2, this becomes

  lw.characters("Hallo")

> # lw.handle_cdata("das ist html")
[...]
> # but shouldn't the data appear without escaping?

Right, that's a bug. In SAX2, it becomes

  lw.startCDATA()
  lw.characters("das ist html")
  lw.endCDATA()

> # lw.comment("great",0,len("great"))
> # if you uncomment this, the comment won't show up (in the xml ;-)

That's a bug; it works in the LexicalXMLGenerator.

> # Is my function call wrong?

Your call is right. It is a bug, and works in the LexicalXMLGenerator.

Do you really need to output PIs and comments in your application, or
was that just for testing purposes?
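
For readers who want to try the SAX2 route described above, here is a
minimal sketch. It assumes a current Python 3 interpreter and the
standard library's xml.sax.saxutils.XMLGenerator rather than PyXML; the
LexicalXMLGenerator mentioned in this message (comments, CDATA sections)
is a PyXML class and is not shown. The helper name write_doc is
hypothetical, chosen only for this illustration.

```python
import io
from xml.sax.saxutils import XMLGenerator


def write_doc():
    """Write a tiny document with the SAX2 XMLGenerator and return it as text."""
    out = io.StringIO()
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    # SAX2 startElement takes an attribute mapping; an empty dict means no attributes.
    gen.startElement("doc", {})
    # SAX2 characters() takes only the text, with no offset/length arguments.
    gen.characters("Hallo")
    gen.processingInstruction("target", 'data="wert"')
    gen.endElement("doc")
    gen.endDocument()
    return out.getvalue()


result = write_doc()
print(result)
```

Because the generator writes to any file-like object as events arrive,
the same calls can stream a large document to disk without holding it
in memory.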
Regards, Martin From uche.ogbuji@fourthought.com Fri Sep 20 07:53:11 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: 20 Sep 2002 00:53:11 -0600 Subject: [XML-SIG] New xml.com column: Python & XML Message-ID: <1032504793.2853.7374.camel@malatesta> I've started a column, "Python & XML" on xml.com. The first installment is out, and offers a big tour of the world of Python/XML. http://www.xml.com/pub/a/2002/09/18/py.html -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html From scjuonline@web.de Fri Sep 20 08:16:42 2002 From: scjuonline@web.de (=?iso-8859-1?Q?J=FCrgen_Schmidt?=) Date: Fri, 20 Sep 2002 09:16:42 +0200 Subject: [XML-SIG] producing large XML files References: <200209191523.25376.scjuonline@web.de> Message-ID: <004a01c26075$abde54f0$0a00a8c0@pamela2000> >The mostly-equivalent SAX2 class is >xml.sax.saxlib.XMLGenerator or xml.sax.saxlib.LexicalXMLGenerator >(in your case, you need the LexicalXMLGenerator). Thanks, I will use this class. (btw: xml.sax.saxutils.LexicalXMLGenerator) >Do you really need to output PIs and comments in your application, or >was that just for testing purposes? 
I intend to use PIs and comments would be nice, but don't know if I have the time to put comments in the file ;-) Regards, Juergen From veillard@redhat.com Fri Sep 20 10:39:51 2002 From: veillard@redhat.com (Daniel Veillard) Date: Fri, 20 Sep 2002 05:39:51 -0400 Subject: [XML-SIG] New xml.com column: Python & XML In-Reply-To: <1032504793.2853.7374.camel@malatesta>; from uche.ogbuji@fourthought.com on Fri, Sep 20, 2002 at 12:53:11AM -0600 References: <1032504793.2853.7374.camel@malatesta> Message-ID: <20020920053951.J30374@redhat.com> On Fri, Sep 20, 2002 at 12:53:11AM -0600, Uche Ogbuji wrote: > I've started a column, "Python & XML" on xml.com. The first installment > is out, and offers a big tour of the world of Python/XML. > > http://www.xml.com/pub/a/2002/09/18/py.html Just for the record, libxml/python also support XInclude and XML Base, (as well as XML Catalogs, etc ...) I think it would be a good idea to also add the associated Licence to each tool presented, if you have the opportunity to make an update. Thanks for the article ! Daniel -- Daniel Veillard | Red Hat Network https://rhn.redhat.com/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/ From uche.ogbuji@fourthought.com Fri Sep 20 15:31:19 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: 20 Sep 2002 08:31:19 -0600 Subject: [XML-SIG] New xml.com column: Python & XML In-Reply-To: <20020920053951.J30374@redhat.com> References: <1032504793.2853.7374.camel@malatesta> <20020920053951.J30374@redhat.com> Message-ID: <1032532280.2853.8753.camel@malatesta> On Fri, 2002-09-20 at 03:39, Daniel Veillard wrote: > On Fri, Sep 20, 2002 at 12:53:11AM -0600, Uche Ogbuji wrote: > > I've started a column, "Python & XML" on xml.com. The first installment > > is out, and offers a big tour of the world of Python/XML. 
> > > > http://www.xml.com/pub/a/2002/09/18/py.html > > Just for the record, libxml/python also support XInclude and XML Base, > (as well as XML Catalogs, etc ...) > I think it would be a good idea to also add the associated Licence to > each tool presented, if you have the opportunity to make an update. > > Thanks for the article ! I've made a note of this in the draft for the next article. Thanks. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html From uche.ogbuji@fourthought.com Fri Sep 20 15:33:59 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: 20 Sep 2002 08:33:59 -0600 Subject: [XML-SIG] Re: [4suite] New xml.com column: Python & XML In-Reply-To: <1032504793.2853.7374.camel@malatesta> References: <1032504793.2853.7374.camel@malatesta> Message-ID: <1032532441.2853.8762.camel@malatesta> On Fri, 2002-09-20 at 00:53, Uche Ogbuji wrote: > I've started a column, "Python & XML" on xml.com. The first installment > is out, and offers a big tour of the world of Python/XML. > > http://www.xml.com/pub/a/2002/09/18/py.html BTW, I already have plans for the next few columns, but if anyone has thoughts on matters they would particularly like me to cover, drop me a line. -- Uche Ogbuji Fourthought, Inc. 
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html
Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html

From vdv@dyomedea.com Mon Sep 23 16:42:42 2002
From: vdv@dyomedea.com (Eric van der Vlist)
Date: 23 Sep 2002 17:42:42 +0200
Subject: [XML-SIG] Issues with unicode type
Message-ID: <1032795762.19185.440.camel@ibook>

Hi,

I have started to work on an implementation of a W3C XML Schema type
library for Relax NG and I am hitting my first problems with unicode.

One of the test cases from the test suite provided by James Clark is:

𐠀

and the length of the text node of the doc element is supposed to be 1
instead of 2 as expected by my (naive) implementation of the length
facet.

What makes me think that it could be a generic issue with python is the
following (kindly contributed by Uche):

>>> hex(67584)
'0x10800'
>>> c = u"\u10800"
>>> c
u'\u10800'
>>> len(c)
2

I am not a Unicode expert (in fact I'd rather say I am a Unicode
newbie), but shouldn't len(c) return 1?

Thanks

Eric
--
Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist http://xmlfr.org http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

From tree@basistech.com Mon Sep 23 17:14:00 2002
From: tree@basistech.com (Tom Emerson)
Date: Mon, 23 Sep 2002 12:14:00 -0400
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: <20020923160005.28564.36214.Mailman@mail.python.org>
References: <20020923160005.28564.36214.Mailman@mail.python.org>
Message-ID: <15759.15816.342144.891607@magrathea.basistech.com>

>
> 𐠀
>
> and the length of the text node of the doc element is supposed to be 1
> instead of 2 as expected by my (naive) implementation of the length
> facet.
>
> What makes me think that it could be a generic issue with python is the
> following (kindly contributed by Uche):
>
> >>> hex(67584)
> '0x10800'
> >>> c = u"\u10800"
> >>> c
> u'\u10800'
> >>> len(c)
> 2

By default Python is using UTF-16 as its Unicode encoding. The
code-point that you specify, U+10800, is outside the BMP and hence is
represented by two surrogate characters in UTF-16.

If you were to recompile your Python installation to use UTF-32 as the
Unicode character type then I expect that you will get the length you
expect.

Consider:

>>> c= u"\u4e00"
>>> c
u'\u4e00'
>>> len(c)
1

--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"

From vdv@dyomedea.com Mon Sep 23 17:48:28 2002
From: vdv@dyomedea.com (Eric van der Vlist)
Date: 23 Sep 2002 18:48:28 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: <15759.15816.342144.891607@magrathea.basistech.com>
References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com>
Message-ID: <1032799708.19185.520.camel@ibook>

On Mon, 2002-09-23 at 18:14, Tom Emerson wrote:

> By default Python is using UTF-16 as its Unicode encoding. The
> code-point that you specify, U+10800, is outside the BMP and hence is
> represented by two surrogate characters in UTF-16.

Arg! Does that mean that by default Python isn't strictly conform to XML
1.0?

> If you were to recompile your Python installation to use UTF-32 as the
> Unicode character type then I expect that you will get the length you
> expect.

But that would also mean that a library relying on this would work only
with Python installations compiled to use UTF-32 :-(

> Consider:
>
> >>> c= u"\u4e00"
> >>> c
> u'\u4e00'
> >>> len(c)
> 1

Yes, my length being "2" was due to the fact that the character takes
more than 16 bits...

Thanks

Eric
--
Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist http://xmlfr.org http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

From martin@v.loewis.de Mon Sep 23 17:55:59 2002
From: martin@v.loewis.de (Martin v.
Loewis)
Date: 23 Sep 2002 18:55:59 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: <1032799708.19185.520.camel@ibook>
References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook>
Message-ID:

Eric van der Vlist writes:

> > By default Python is using UTF-16 as its Unicode encoding. The
> > code-point that you specify, U+10800, is outside the BMP and hence is
> > represented by two surrogate characters in UTF-16.
>
> Arg! Does that mean that by default Python isn't strictly conform to XML
> 1.0?

No. Why do you think this?

>
> > If you were to recompile your Python installation to use UTF-32 as the
> > Unicode character type then I expect that you will get the length you
> > expect.
>
> But that would also mean that a library relying on this would work only
> with Python installations compiled to use UTF-32 :-(
>
> > Consider:
> >
> > >>> c= u"\u4e00"
> > >>> c
> > u'\u4e00'
> > >>> len(c)
> > 1
>
> Yes, my length being "2" was due to the fact that the character takes
> more than 16 bits...
>
> Thanks
>
> Eric
> --
> Rendez-vous à Paris.
> http://www.technoforum.fr/integ2002/index.html
> ------------------------------------------------------------------------
> Eric van der Vlist http://xmlfr.org http://dyomedea.com
> (W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
> ------------------------------------------------------------------------
>
>
> _______________________________________________
> XML-SIG maillist - XML-SIG@python.org
> http://mail.python.org/mailman/listinfo/xml-sig

From vdv@dyomedea.com Mon Sep 23 18:06:16 2002
From: vdv@dyomedea.com (Eric van der Vlist)
Date: 23 Sep 2002 19:06:16 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To:
References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook>
Message-ID: <1032800776.19160.547.camel@ibook>

On Mon, 2002-09-23 at 18:55, Martin v. Loewis wrote:
> Eric van der Vlist writes:
>
> > > By default Python is using UTF-16 as its Unicode encoding. The
> > > code-point that you specify, U+10800, is outside the BMP and hence is
> > > represented by two surrogate characters in UTF-16.
> >
> > Arg! Does that mean that by default Python isn't strictly conform to XML
> > 1.0?
>
> No. Why do you think this?

I would say that since a XML document is defined as set of unicode
characters, a single character "&x10800;" is not the same thing as a
sequence of two characters.

The content of my element 𐠀 doesn't seem to be correctly represented
as a string of two characters like it is when I parse the document!

Or have I missed something?

Eric (meaning no offense!)
--
Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html ------------------------------------------------------------------------ Eric van der Vlist http://xmlfr.org http://dyomedea.com (W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema ------------------------------------------------------------------------ From martin@v.loewis.de Mon Sep 23 18:12:10 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 23 Sep 2002 19:12:10 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <1032799708.19185.520.camel@ibook> References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> Message-ID: Eric van der Vlist writes: > > By default Python is using UTF-16 as its Unicode encoding. The > > code-point that you specify, U+10800, is outside the BMP and hence is > > represented by two surrogate characters in UTF-16. > > Arg! Does that mean that by default Python isn't strictly conform to XML > 1.0? No. Why do you think this? Strictly speaking, XML 1.0 defines a "character" as defined by ISO/IEC 10646:1993 and ISO/IEC 10646-1:2000. This means only characters in the Basic Multilingual Plane are allowed in XML. James Clark's document is, strictly speaking, ill-formed. That aside, Python does process your document, and represents the character U+10800 as defined in the Python language definition. So if you extend XML 1.0 to Unicode 3.2 in a canonical way, Python supports that character as specified. Any applications that want to count Unicode code points might need to take into account surrogates, and possibly might not use the len() builtin. Notice also that U+10800 is unassigned even in Unicode 3.2. Regards, Martin From martin@v.loewis.de Mon Sep 23 18:15:43 2002 From: martin@v.loewis.de (Martin v. 
Loewis)
Date: 23 Sep 2002 19:15:43 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: <1032800776.19160.547.camel@ibook>
References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032800776.19160.547.camel@ibook>
Message-ID:

Eric van der Vlist writes:

> I would say that since a XML document is defined as set of unicode
> characters, a single character "&x10800;"

... is ill-formed. Only characters below ￿ are allowed in XML, strictly
speaking.

> is not the same thing as a sequence of two characters.

So what?

> The content of my element 𐠀 doesn't seem to be
> correctly represented as a string of two characters like it is when
> I parse the document! Or have I missed something?

Yes. Python, in a narrow Unicode build, represents this character as a
Unicode object which has a length of 2. It still is a single Unicode
character.

Regards,
Martin

From vdv@dyomedea.com Mon Sep 23 18:21:41 2002
From: vdv@dyomedea.com (Eric van der Vlist)
Date: 23 Sep 2002 19:21:41 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To:
References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook>
Message-ID: <1032801701.19382.572.camel@ibook>

On Mon, 2002-09-23 at 19:12, Martin v. Loewis wrote:
> Eric van der Vlist writes:
>
> > > By default Python is using UTF-16 as its Unicode encoding. The
> > > code-point that you specify, U+10800, is outside the BMP and hence is
> > > represented by two surrogate characters in UTF-16.
> >
> > Arg! Does that mean that by default Python isn't strictly conform to XML
> > 1.0?
>
> No. Why do you think this? Strictly speaking, XML 1.0 defines a
> "character" as defined by ISO/IEC 10646:1993 and ISO/IEC 10646-1:2000.
> This means only characters in the Basic Multilingual Plane are allowed
> in XML. James Clark's document is, strictly speaking, ill-formed.

That's weird...

> That aside, Python does process your document, and represents the
> character U+10800 as defined in the Python language definition. So if
> you extend XML 1.0 to Unicode 3.2 in a canonical way, Python supports
> that character as specified. Any applications that want to count
> Unicode code points might need to take into account surrogates, and
> possibly might not use the len() builtin.

Yep, and that's what James Clark is doing in his Java implementation:

    public int getLength(Object obj) {
      String str = (String)obj;
      int len = str.length();
      int nSurrogatePairs = 0;
      for (int i = 0; i < len; i++)
        if (Utf16.isSurrogate1(str.charAt(i)))
          nSurrogatePairs++;
      return len - nSurrogatePairs;
    }

And I need to do the same in Python...

> Notice also that U+10800 is unassigned even in Unicode 3.2.

I wonder why he has picked this value!

Thanks

Eric
--
Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist http://xmlfr.org http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

From uche.ogbuji@fourthought.com Mon Sep 23 18:31:51 2002
From: uche.ogbuji@fourthought.com (Uche Ogbuji)
Date: Mon, 23 Sep 2002 11:31:51 -0600
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: Message from Tom Emerson of "Mon, 23 Sep 2002 12:14:00 EDT." <15759.15816.342144.891607@magrathea.basistech.com>
Message-ID:

> >
> > 𐠀
> >
> > and the length of the text node of the doc element is supposed to be 1
> > instead of 2 as expected by my (naive) implementation of the length
> > facet.
> >
> > What makes me think that it could be a generic issue with python is the
> > following (kindly contributed by Uche):
> >
> > >>> hex(67584)
> > '0x10800'
> > >>> c = u"\u10800"
> > >>> c
> > u'\u10800'
> > >>> len(c)
> > 2
>
> By default Python is using UTF-16 as its Unicode encoding. The
> code-point that you specify, U+10800, is outside the BMP and hence is
> represented by two surrogate characters in UTF-16.
>
> If you were to recompile your Python installation to use UTF-32 as the
> Unicode character type then I expect that you will get the length you
> expect.
>
> Consider:
>
> >>> c= u"\u4e00"
> >>> c
> u'\u4e00'
> >>> len(c)
> 1

Hmm. I'm going to open my mouth and show off my ignorance now. I should
probably spend some time with my Tony Graham before ever posting on
Unicode, but I don't have the time right now, and besides, there is no
better way to get Eric an answer than to say something wrong that has to
be corrected by one of the many Unicode gurus who I know hang around
here :-)

IIRC, UTF-16 supports the representation of characters outside the BMP
by using surrogate pairs (SP). If so, then the scary solution of
requiring XML users to compile Python to use UCS-4 can be put aside.

The question would then be how to get a surrogate pair into a Python
unicode object. On a hunch, I tried:

>>> c = u"\uD800\uDC00"
>>> len(c)
2

So I guess the answer isn't just using the literal characters in the SP.

--
Uche Ogbuji
Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html

From uche.ogbuji@fourthought.com Mon Sep 23 18:53:28 2002
From: uche.ogbuji@fourthought.com (Uche Ogbuji)
Date: Mon, 23 Sep 2002 11:53:28 -0600
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: Message from Eric van der Vlist of "23 Sep 2002 18:48:28 +0200." <1032799708.19185.520.camel@ibook>
Message-ID:

> On Mon, 2002-09-23 at 18:14, Tom Emerson wrote:
>
> > By default Python is using UTF-16 as its Unicode encoding. The
> > code-point that you specify, U+10800, is outside the BMP and hence is
> > represented by two surrogate characters in UTF-16.
>
> Arg! Does that mean that by default Python isn't strictly conform to XML
> 1.0?

This is apples and oranges. Python is not an XML app, so I don't think
it means anything for Python to conform to XML.

The question is how easy it is to write a Python app that does conform
to XML. Even if Python does not support characters outside the BMP,
then this can be handled by writing code that does the special
processing for such characters.

The other question is whether PyXML and 4Suite are conformant, since
they are XML apps. That's what we're really trying to figure out here,
I think.

--
Uche Ogbuji
Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html

From tree@basistech.com Mon Sep 23 18:58:49 2002
From: tree@basistech.com (Tom Emerson)
Date: Mon, 23 Sep 2002 13:58:49 -0400
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To:
References: <15759.15816.342144.891607@magrathea.basistech.com>
Message-ID: <15759.22105.546695.694714@magrathea.basistech.com>

Uche Ogbuji writes:
> IIRC, UTF-16 supports the representation of characters outside the BMP by
> using surrogate pairs (SP). If so, then the scary solution of requiring XML
> users to compile Python to use UCS-4 can be put aside.

Yes, that is what I (thought I) said in my previous response: since
internally Python is representing characters outside the BMP as a
surrogate pair in UTF-16, the length of a Unicode string using these
characters is 2 --- two UTF-16 characters.

> The question would then be how to get a surrogate pair into a Python unicode
> object. On a hunch, I tried:
>
> >>> c = u"\uD800\uDC00"
> >>> len(c)
> 2

That works. You can also use \U notation:

>>> c = u"\U00010000"
>>> len(c)
2
>>> c
u'\U00010000'
>>> c[0]
u'\ud800'
>>> c[1]
u'\udc00'

If you compile your Python installation to use "wide" Unicode
characters (i.e., UTF-32), then I expect the behavior to be

>>> c = u"\U00010000"
>>> len(c)
1
>>> c
u'\U00010000'

--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"

From fdrake@acm.org Mon Sep 23 19:35:59 2002
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Mon, 23 Sep 2002 14:35:59 -0400
Subject: [XML-SIG] Broken Link on http://pyxml.sourceforge.net/topics/dtds/index.html
In-Reply-To: <3D838317.6040106@spiritone.com>
References: <3D838317.6040106@spiritone.com>
Message-ID: <15759.24335.306815.281580@grendel.zope.com>

Josh English writes:
> The link to www.schema.net seems to be broken.
I keeping getting a page > stating that the domain name is for sale. Ok, I've fixed this in the CVS version of the website. Can someone who knows how to push changes to the live site on SourceForge pull the right lever to make that happen? Thanks! -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From fdrake@acm.org Mon Sep 23 19:40:51 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Mon, 23 Sep 2002 14:40:51 -0400 Subject: [XML-SIG] SOAPpy and PyXML 0.8.1 In-Reply-To: References: Message-ID: <15759.24627.252979.3925@grendel.zope.com> Juergen Hermann writes: > Since installing 0.8.1, our SOAPpy Unittest fails. Before investigating > further, did anyone else have such problems? Has there been any response to this? I don't use SOAPpy, so haven't seen anything myself, but I'd be interested in knowing what happened since the messages seems to imply that some change in PyXML has caused the failure. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From fdrake@acm.org Mon Sep 23 19:53:09 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Mon, 23 Sep 2002 14:53:09 -0400 Subject: [XML-SIG] preserving doctype declaration with xml.dom? In-Reply-To: References: <200208091837.LAA92083@ocean.lucasdigital.com> Message-ID: <15759.25365.400333.593500@grendel.zope.com> Martin v. Loewis writes: > That is not surprising; none of the DOM implementations preserves this > information. If you need the functionality, you are encouraged to > research this issue, and propose fixes; please expect this to be very > difficult. Just a late followup now that PyXML 0.8.1 is out: Using PyXML 0.8.1 and the minidom, you should be able to get a meaningful XML and DOCTYPE declarations back out if you use the parse() or parseString() functions from the xml.dom.minidom or xml.dom.expatbuilder modules, or the DOM Level 3 (draft) Load/Save interfaces on the DOMImplementation object returned by xml.dom.minidom.getDOMImplementation(). -Fred -- Fred L. Drake, Jr. 
PythonLabs at Zope Corporation From fdrake@acm.org Mon Sep 23 20:03:25 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Mon, 23 Sep 2002 15:03:25 -0400 Subject: [XML-SIG] Re: [XML-checkins]CVS: xml/xml/dom expatbuilder.py,1.5,1.6 In-Reply-To: References: Message-ID: <15759.25981.745476.486311@grendel.zope.com> [Regarding the return of -1/0/1 as the value of the standalone field for the XmlDeclHandler from pyexpat...] Martin v. Loewis writes: > I personally find this change quite ugly. Wouldn't it be much better > if pyexpat would use True and False in the standalone flag? I agree; I'd love it to be None/False/True, but that's an API change, and enough people use the pyexpat interface directly that that's a bad idea at this point. If a new interface to Expat ever emerges, it should definately use a more "Pythonic" value. > Also, what is the purpose of initializing _standalone to -1? -1 indicates that it was never initialized, so we can store None for no-value, rather than False. The truth-behavior works well, and it can be tested for if needed. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From fdrake@acm.org Mon Sep 23 20:05:42 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Mon, 23 Sep 2002 15:05:42 -0400 Subject: [XML-SIG] Re: [XML-checkins]CVS: xml/xml/utils iso8601.py,1.6,1.7 In-Reply-To: <20020419065145.GC17017@orion.logilab.fr> References: <20020419065145.GC17017@orion.logilab.fr> Message-ID: <15759.26118.997231.804925@grendel.zope.com> Alexandre wrote last April: > Now, this may be dumb, since I'm not very familiar with the intricacies > of real date and time manipulations, but is 60 an allowed value for > seconds? In other words, should not this read > > if not 0 <= seconds < 60: No, allowing 60 seconds is for those people who believe in leap seconds. I think there are even cases where 61 should be allowed, but hesitate to jump into that fray long enough to figure out the "right thing to do." -Fred -- Fred L. Drake, Jr. 
PythonLabs at Zope Corporation From uche.ogbuji@fourthought.com Mon Sep 23 20:27:22 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: 23 Sep 2002 13:27:22 -0600 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <15759.22105.546695.694714@magrathea.basistech.com> References: <15759.15816.342144.891607@magrathea.basistech.com> <15759.22105.546695.694714@magrathea.basistech.com> Message-ID: <1032809244.1908.7597.camel@malatesta> On Mon, 2002-09-23 at 11:58, Tom Emerson wrote: > Uche Ogbuji writes: > > IIRC, UTF-16 supports the representation of characters outside the BMP by > > using surrogate pairs (SP). If so, then the scary solution of requiring XML > > users to compile Python to use UCS-4 can be put aside. > > Yes, that is what I (thought I) said in my previous response: since > internally Python is representing characters outside the BMP as a > surrogate pair in UTF-16, the length of a Unicode string using these > characters is 2 --- two UTF-16 characters. No. A surrogate pair is one character. It takes up 2 16-bit values, but this is not the same as taking up 2 characters. The whole point of a variable-length encoding such as UTF-16 is that the number of storage values is not always the same as the number of characters. Eric found this message where Guido does a decent job of summarizing the various issues, though I'm not sure I agree with his conclusion: http://mail.python.org/pipermail/i18n-sig/2001-June/001107.html I should note that based on code Eric found in James Clark's code, Java doesn't treat surrogates specially internally, either, which I guess tends to bolster Guido's POV :-( > > The question would then be how to get a surrogate pair into a Python unicode > > object. On a hunch, I tried: > > > > >>> c = u"\uD800\uDC00" > > >>> len(c) > > 2 > > That works. You can also use \U notation: No. My whole point is that it didn't work. len(c) would be 1, not 2 if the characters were properly treated as a surrogate pair. 
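The difference between storage values and characters described above can be checked directly. This is a minimal sketch, assuming a current Python 3.3+ interpreter rather than the narrow Python 2.2 build used in this thread; on such an interpreter `len()` counts code points. The helper name `utf16_code_units` is hypothetical.

```python
def utf16_code_units(text):
    """Return how many 16-bit code units UTF-16 needs to store text."""
    # Encode without a byte-order mark; every code unit occupies two bytes.
    return len(text.encode("utf-16-le")) // 2


c = "\U00010000"            # first code point outside the BMP
print(len(c))               # number of code points in the string
print(utf16_code_units(c))  # number of UTF-16 code units needed
```

The second figure is what a narrow (UTF-16) build reported from `len()`, which is the behaviour being debated here.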
> >>> c = u"\U00010000" > >>> len(c) > 2 > >>> c > u'\u00010000' > >>> c[0] > u'\ud800' > >>> c[1] > u'\udc00' > > If you compile your Python installation to use "wide" Unicode > characters (i.e., UTF-32), then I expect the behavior to be > > >>> c = u"\U00010000" > >>> len(c) > 1 > >>> len(c) > u'\U00010000' Yes. Don't you see that this means that the behavior as compiled with UTF-16 is wrong from a *character set* point of view? The same code point is *one* character whether encoded in UTF-7, UTF-8, UTF-16, UTF-32, UCS-2, UCS-4, etc. It is never more than one character. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html From tree@basistech.com Mon Sep 23 20:29:06 2002 From: tree@basistech.com (Tom Emerson) Date: Mon, 23 Sep 2002 15:29:06 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <1032809244.1908.7597.camel@malatesta> References: <15759.15816.342144.891607@magrathea.basistech.com> <15759.22105.546695.694714@magrathea.basistech.com> <1032809244.1908.7597.camel@malatesta> Message-ID: <15759.27522.35916.363703@magrathea.basistech.com> Uche Ogbuji writes: > No. A surrogate pair is one character. It takes up 2 16-bit values, > but this is not the same as taking up 2 characters. The whole point of > a variable-length encoding such as UTF-16 is that the number of storage > values is not always the same as the number of characters. Yes, I'm aware of that. The problem is one of me being sloppy in the use of the word 'character'. > Yes. Don't you see that this means that the behavior as compiled with > UTF-16 is wrong from a *character set* point of view? 
The same code > point is *one* character whether encoded in UTF-7, UTF-8, UTF-16, > UTF-32, UCS-2, UCS-4, etc. It is never more than one character. Sure, but the *implementation* within the Python interpreter is treating characters in the astral planes as two 16-bit words, not one. The len() value that you get is the number of UTF-16-encoded words in the string. There was a very long, very drawn out discussion on the representation of Unicode characters in Python a while back on the python-i18n mailing list where this whole thing was beaten to death and which eventually lead to the option to compile the interpreter to use a 32-bit character representation. > > -- > Uche Ogbuji Fourthought, Inc. > http://uche.ogbuji.net http://4Suite.org http://fourthought.com > Apache 2.0 API - > http://www-106.ibm.com/developerworks/linux/library/l-apache/ > Python&XML column: Tour of Python/XML - > http://www.xml.com/pub/a/2002/09/18/py.html > Python/Web Services column: xmlrpclib - > http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html -- Tom Emerson Basis Technology Corp. 
Software Architect http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From vdv@dyomedea.com Mon Sep 23 20:31:57 2002 From: vdv@dyomedea.com (Eric van der Vlist) Date: 23 Sep 2002 21:31:57 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <1032809244.1908.7597.camel@malatesta> References: <15759.15816.342144.891607@magrathea.basistech.com> <15759.22105.546695.694714@magrathea.basistech.com> <1032809244.1908.7597.camel@malatesta> Message-ID: <1032809518.19160.712.camel@ibook> On Mon, 2002-09-23 at 21:27, Uche Ogbuji wrote: > > Eric found this message where Guido does a decent job of summarizing the > various issues, though I'm not sure I agree with his conclusion: > > http://mail.python.org/pipermail/i18n-sig/2001-June/001107.html > > I should note that based on code Eric found in James Clark's code, Java > doesn't treat surrogates specially internally, either, which I guess > tends to bolster Guido's POV :-( Yes... however, there seems to be *some* notion of surrogates at least in the unicode.__repr__() method: >>> print "%r" % c u'\u10800' Eric -- Rendez-vous à Paris. http://www.technoforum.fr/integ2002/index.html ------------------------------------------------------------------------ Eric van der Vlist http://xmlfr.org http://dyomedea.com (W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema ------------------------------------------------------------------------ From uche.ogbuji@fourthought.com Mon Sep 23 21:04:02 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Mon, 23 Sep 2002 14:04:02 -0600 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: Message from Tom Emerson of "Mon, 23 Sep 2002 15:29:06 EDT." <15759.27522.35916.363703@magrathea.basistech.com> Message-ID: > Uche Ogbuji writes: > > No. A surrogate pair is one character. It takes up 2 16-bit values, > > but this is not the same as taking up 2 characters.
The whole point of > > a variable-length encoding such as UTF-16 is that the number of storage > > values is not always the same as the number of characters. > > Yes, I'm aware of that. The problem is one of me being sloppy in the > use of the word 'character'. Ah. I wasn't meaning to leap too hard on that. I thought we had a genuine misunderstanding on this. > > Yes. Don't you see that this means that the behavior as compiled with > > UTF-16 is wrong from a *character set* point of view? The same code > > point is *one* character whether encoded in UTF-7, UTF-8, UTF-16, > > UTF-32, UCS-2, UCS-4, etc. It is never more than one character. > > Sure, but the *implementation* within the Python interpreter is > treating characters in the astral planes as two 16-bit words, not > one. The len() value that you get is the number of UTF-16-encoded > words in the string. There was a very long, very drawn out discussion > on the representation of Unicode characters in Python a while back on > the python-i18n mailing list where this whole thing was beaten to > death and which eventually lead to the option to compile the > interpreter to use a 32-bit character representation. Yes. I'm learning about all this, and learning a lot that I would probably have preferred to be blissfully ignorant of :-( Thanks. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html From uche.ogbuji@fourthought.com Mon Sep 23 21:29:07 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Mon, 23 Sep 2002 14:29:07 -0600 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: Message from Eric van der Vlist of "23 Sep 2002 21:31:57 +0200."
<1032809518.19160.712.camel@ibook> Message-ID: > On Mon, 2002-09-23 at 21:27, Uche Ogbuji wrote: > > > > Eric found this message where Guido does a decent job of summarizing the > > various issues, though I'm not sure I agree with his conclusion: > > > > http://mail.python.org/pipermail/i18n-sig/2001-June/001107.html > > > > I should note that based on code Eric found in James Clark's code, Java > > doesn't treat surrogates specially internally, either, which I guess > > tends to bolster Guido's POV :-( > > Yes... however, there seems to be *some* notion of surrogates at least > in the unicode.__repr__() method: > > >>> print "%r" % c > u'\u10800' Yeah. It seems that the idea has been to make the representation machinery smart enough to handle surrogate pairs: from Python-2.2.1/Objects/unicodeobject.c line 1798 (basically the repr implementation): /* Map UTF-16 surrogate pairs to Unicode \UXXXXXXXX escapes */ This is kept idempotent for round trip: >>> c = u"\uD800\uDC00" >>> len(c) 2 >>> repr(c) "u'\\U00010000'" >>> r = repr(c) >>> roundtrip_c = eval(r) >>> roundtrip_c u'\U00010000' >>> len(roundtrip_c) 2 >>> roundtrip_c == c 1 And yet len and friends are not smart enough to recognize it. I assume re would have the same problem with ".". This just deepens my unease at Guido's reluctance to support surrogates in the code that handles UTF-16 in Python. The inconsistency seems ugly. But as Tom says, it looks like this matter has been beaten to death, and it's pretty much settled. Now I see why Red Hat plumped on compiling Python with UTF-32 support (and wchar_t). I think it's the only route to sanity. Having said all this, Martin is right about XML and the BMP. I'd forgotten.
Here you go, right out of the XML 1.0 spec: """ 4.1 Character and Entity References [Definition:] A character reference refers to a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices. Character Reference [66] CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';' [ WFC: Legal Character ] Well-Formedness Constraint: Legal Character Characters referred to using character references must match the production for Char. """ and so... """ 2.2 Characters [Definition:] A parsed entity contains text, a sequence of characters, which may represent markup or character data. [Definition:] A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal graphic characters of Unicode and ISO/IEC 10646. The use of "compatibility characters", as defined in section 6.8 of [Unicode], is discouraged. Character Range [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */ """ So &#x10800; is not WF XML. I'm not sure why JJC uses it. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.
html Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html From vdv@dyomedea.com Mon Sep 23 21:27:17 2002 From: vdv@dyomedea.com (Eric van der Vlist) Date: 23 Sep 2002 22:27:17 +0200 Subject: [XML-SIG] Potential issue with re too (Was: Issues with Unicode type) In-Reply-To: References: Message-ID: <1032812838.19185.778.camel@ibook> Still in the context of WXS datatypes and their facets, there is a potential issue with regular expressions (needed for the pattern facet): >>> print c.__repr__() u'\u10800' >>> print re.findall(".", c) [u'\u1080', u'0'] >>> print re.findall(c, c) [u'\u10800'] >>> print re.findall(u'\u1080', c) [u'\u1080'] >>> print re.findall(u'0', c) [u'0'] The re module handles surrogates according to their dual nature, counting them as two characters (which is not what's expected by let's say "." or ".{2}") but still recognizing it as u'\u10800' which doesn't seem like a safe basis to build a compliant type library. Eric -- Rendez-vous à Paris. http://www.technoforum.fr/integ2002/index.html ------------------------------------------------------------------------ Eric van der Vlist http://xmlfr.org http://dyomedea.com (W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema ------------------------------------------------------------------------ From veillard@redhat.com Mon Sep 23 21:31:17 2002 From: veillard@redhat.com (Daniel Veillard) Date: Mon, 23 Sep 2002 16:31:17 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: ; from martin@v.loewis.de on Mon, Sep 23, 2002 at 07:15:43PM +0200 References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032800776.19160.547.camel@ibook> Message-ID: <20020923163117.K5635@redhat.com> On Mon, Sep 23, 2002 at 07:15:43PM +0200, Martin v.
Loewis wrote: > Eric van der Vlist writes: > > > I would say that since a XML document is defined as set of unicode > > characters, a single character "&x10800;" > > ... is ill-formed. Only characters below &#xFFFF; are allowed in XML, > strictly speaking. Wrong, sorry, see the spec! http://www.w3.org/TR/REC-xml#NT-Char &x10800; is perfectly legal and should be viewed as a single character for example in XPath expressions. This doesn't mean that you have to change your internal encoding, but you need to make sure the wrappers for all the access computing length etc. are doing the computation right. Eric's problem can probably be solved technically by simply providing such a wrapper function. Daniel -- Daniel Veillard | Red Hat Network https://rhn.redhat.com/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/ From veillard@redhat.com Mon Sep 23 21:33:38 2002 From: veillard@redhat.com (Daniel Veillard) Date: Mon, 23 Sep 2002 16:33:38 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: ; from martin@v.loewis.de on Mon, Sep 23, 2002 at 07:12:10PM +0200 References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> Message-ID: <20020923163338.L5635@redhat.com> On Mon, Sep 23, 2002 at 07:12:10PM +0200, Martin v. Loewis wrote: > Eric van der Vlist writes: > > > > By default Python is using UTF-16 as its Unicode encoding. The > > > code-point that you specify, U+10800, is outside the BMP and hence is > > > represented by two surrogate characters in UTF-16. > > > > Arg! Does that mean that by default Python isn't strictly conform to XML > > 1.0? > > No. Why do you think this? Strictly speaking, XML 1.0 defines a > "character" as defined by ISO/IEC 10646:1993 and ISO/IEC 10646-1:2000. > This means only characters in the Basic Multilingual Plane are allowed > in XML.
James Clark's document is, strictly speaking, ill-formed. No, it's not; it's a well-formed document. Strictly speaking a document is either well formed or not, there is no other definition, and that definition is given in the XML specification. Daniel -- Daniel Veillard | Red Hat Network https://rhn.redhat.com/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/ From vdv@dyomedea.com Mon Sep 23 21:34:17 2002 From: vdv@dyomedea.com (Eric van der Vlist) Date: 23 Sep 2002 22:34:17 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <15759.27522.35916.363703@magrathea.basistech.com> References: <15759.15816.342144.891607@magrathea.basistech.com> <15759.22105.546695.694714@magrathea.basistech.com> <1032809244.1908.7597.camel@malatesta> <15759.27522.35916.363703@magrathea.basistech.com> Message-ID: <1032813257.19160.793.camel@ibook> On Mon, 2002-09-23 at 21:29, Tom Emerson wrote: > Sure, but the *implementation* within the Python interpreter is > treating characters in the astral planes as two 16-bit words, not > one. The len() value that you get is the number of UTF-16-encoded > words in the string. There was a very long, very drawn out discussion > on the representation of Unicode characters in Python a while back on > the python-i18n mailing list where this whole thing was beaten to > death and which eventually lead to the option to compile the > interpreter to use a 32-bit character representation. Having gone through this thread in the archives, I don't want to open it again :-)... OTOH, would it really be an option to say that feature X or Y of PyXML (if such a library was added at some point) would require an interpreter compiled for 32-bit character representation to be compliant? Assuming that all the common distributions are shipped compiled for 16-bit (like the Debian sid on which I am doing these tests), it would become a real nightmare for the users!
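Whether an interpreter uses a 16-bit or a 32-bit character representation can be tested at run time, so a library could at least detect the situation instead of silently miscounting. A minimal sketch, assuming `sys.maxunicode` is available (it is on current Python 3, where it is always 0x10FFFF since version 3.3); the function name `is_wide_build` is hypothetical and not part of PyXML.

```python
import sys


def is_wide_build():
    """Return True when one string item can hold any Unicode code point."""
    # 0xFFFF indicates a narrow (16-bit) build; 0x10FFFF indicates a wide
    # build, which is also what every Python 3.3+ interpreter reports.
    return sys.maxunicode > 0xFFFF


if is_wide_build():
    print("len() counts code points on this interpreter")
else:
    print("len() counts 16-bit code units; surrogate pairs need extra care")
```

A check like this lets code choose a surrogate-aware fallback on narrow builds rather than requiring users to recompile.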
Eric -- Rendez-vous à Paris. http://www.technoforum.fr/integ2002/index.html ------------------------------------------------------------------------ Eric van der Vlist http://xmlfr.org http://dyomedea.com (W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema ------------------------------------------------------------------------ From veillard@redhat.com Mon Sep 23 21:35:22 2002 From: veillard@redhat.com (Daniel Veillard) Date: Mon, 23 Sep 2002 16:35:22 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <1032801701.19382.572.camel@ibook>; from vdv@dyomedea.com on Mon, Sep 23, 2002 at 07:21:41PM +0200 References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> Message-ID: <20020923163522.M5635@redhat.com> On Mon, Sep 23, 2002 at 07:21:41PM +0200, Eric van der Vlist wrote: > Yep, and that's what James Clark is doing in his Java implementation: > > public int getLength(Object obj) { > String str = (String)obj; > int len = str.length(); > int nSurrogatePairs = 0; > for (int i = 0; i < len; i++) > if (Utf16.isSurrogate1(str.charAt(i))) > nSurrogatePairs++; > return len - nSurrogatePairs; > } > > And I need to do the same in Python... yep, that simple, > > Notice also that U+10800 is unassigned even in Unicode 3.2. > > I wonder why he has picked this value!
Because he knew this was well formed and that was in a range where this could give troubles to Java (and now Python) implementations I bet :-) Daniel -- Daniel Veillard | Red Hat Network https://rhn.redhat.com/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/ From uche.ogbuji@fourthought.com Mon Sep 23 21:42:51 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Mon, 23 Sep 2002 14:42:51 -0600 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: Message from Uche Ogbuji of "Mon, 23 Sep 2002 14:29:07 MDT." Message-ID: > > On Mon, 2002-09-23 at 21:27, Uche Ogbuji wrote: > Having said all this, Martin is right about XML and the BMP. I'd forgotten. See, I knew I'd make a silly of myself before this thread went very long. I wasn't even properly reading what I was quoting from the XML spec: > Character Range > [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | > [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, > FFFE, and FFFF. */ > > """ > > So &#x10800; is not WF XML. I'm not sure why JJC uses it. So I was wrong and &#x10800; is indeed WF, and the problem remains that XML processing code will have to augment Python built-ins such as len with intelligence about surrogates :-( -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.
html Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html From rhinkle@tycoint.com Mon Sep 23 21:40:37 2002 From: rhinkle@tycoint.com (rhinkle@tycoint.com) Date: Mon, 23 Sep 2002 16:40:37 -0400 Subject: [XML-SIG] trouble with PyXML Message-ID: <2141C5F52FD2CA4182AF6927090263B901AD4BC9@flbocexu04> I am trying to run a module (amazon.py) that uses XML on my web host (catalog.com) and am getting an error like this back: xmldoc = minidom.parse(usock) File "xml/dom/minidom.py", line 1594, in parse from xml.dom import expatbuilder File "xml/dom/expatbuilder.py", line 32, in ? from xml.parsers import expat File "xml/parsers/expat.py", line 4, in ? from PyExpat import * ImportError: No module named PyExpat I have run "python setup.py build" and "python setup.py install" using popen. I redirected the output to two log files and didn't see anything wrong. Any suggestions? BTW, the server is running Python 2.1 > Richard Hinkle > Tyco Safety Products > RHinkle@Tycoint.com > From vdv@dyomedea.com Mon Sep 23 21:50:34 2002 From: vdv@dyomedea.com (Eric van der Vlist) Date: 23 Sep 2002 22:50:34 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <20020923163522.M5635@redhat.com> References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> <20020923163522.M5635@redhat.com> Message-ID: <1032814235.19185.818.camel@ibook> Hi Daniel, On Mon, 2002-09-23 at 22:35, Daniel Veillard wrote: > On Mon, Sep 23, 2002 at 07:21:41PM +0200, Eric van der Vlist wrote: > > Yep, and that's what James Clark is doing in his Java implementation: > > > > public int getLength(Object obj) { > > String str = (String)obj; > > int len = str.length(); > > int nSurrogatePairs = 0; > > for (int i = 0; i < len; i++) > > if (Utf16.isSurrogate1(str.charAt(i))) > > nSurrogatePairs++; > > return len -
nSurrogatePairs; > > } > > > > And I need to do the same in Python... > > yep, that simple, Except that it's not the only location where it's broken and that won't work with regular expressions. If I define a pattern such as ".{5}" I want to check that this is 5 unicode characters, not 5 words of 16 bits... I am starting to think that compiling Python for 32 bits might be the safest way to solve this issue. Can you confirm that this is what RedHat does by default as mentioned Uche and do you know the motivations (and eventually downsides) for this decision? > > > > Notice also that U+10800 is unassigned even in Unicode 3.2. > > > > I wonder why he has picked this value! > > Because he knew this was well formed and that was in a range where > this could give troubles to Java (and now Python) implementations > I bet :-) Yes, the values in his test cases are usually chosen with care and I was expecting something like that! Thanks Eric -- Rendez-vous à Paris. http://www.technoforum.fr/integ2002/index.html ------------------------------------------------------------------------ Eric van der Vlist http://xmlfr.org http://dyomedea.com (W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema ------------------------------------------------------------------------ From fdrake@acm.org Mon Sep 23 22:07:32 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Mon, 23 Sep 2002 17:07:32 -0400 Subject: [XML-SIG] trouble with PyXML In-Reply-To: <2141C5F52FD2CA4182AF6927090263B901AD4BC9@flbocexu04> References: <2141C5F52FD2CA4182AF6927090263B901AD4BC9@flbocexu04> Message-ID: <15759.33428.598470.237316@grendel.zope.com> rhinkle@tycoint.com writes: > File "xml/parsers/expat.py", line 4, in ? > from PyExpat import * > ImportError: No module named PyExpat It looks to me like this isn't a proper PyXML installation. This line should be: from pyexpat import * which is very different.
I'm not sure how you can get the results you report from a standard PyXML installation. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From uche.ogbuji@fourthought.com Mon Sep 23 22:16:08 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Mon, 23 Sep 2002 15:16:08 -0600 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: Message from Daniel Veillard of "Mon, 23 Sep 2002 16:35:22 EDT." <20020923163522.M5635@redhat.com> Message-ID: > On Mon, Sep 23, 2002 at 07:21:41PM +0200, Eric van der Vlist wrote: > > Yep, and that's what James Clark is doing in his Java implementation: > > > > public int getLength(Object obj) { > > String str = (String)obj; > > int len = str.length(); > > int nSurrogatePairs = 0; > > for (int i = 0; i < len; i++) > > if (Utf16.isSurrogate1(str.charAt(i))) > > nSurrogatePairs++; > > return len - nSurrogatePairs; > > } > > > > And I need to do the same in Python... > > yep, that simple, Oh, but then Python is so much simpler: SP_PAT = re.compile(u"[\uD800-\uDBFF][\uDC00-\uDFFF]") def smart_len(u): sp_count = len(SP_PAT.findall(u)) return len(u) - sp_count Problem solved. The great thing about Python is even when it frustrates you one moment, it finds a way to quickly make up for it. -- Uche Ogbuji Fourthought, Inc. 
http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html From veillard@redhat.com Mon Sep 23 22:26:26 2002 From: veillard@redhat.com (Daniel Veillard) Date: Mon, 23 Sep 2002 17:26:26 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <1032814235.19185.818.camel@ibook>; from vdv@dyomedea.com on Mon, Sep 23, 2002 at 10:50:34PM +0200 References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> <20020923163522.M5635@redhat.com> <1032814235.19185.818.camel@ibook> Message-ID: <20020923172626.N5635@redhat.com> On Mon, Sep 23, 2002 at 10:50:34PM +0200, Eric van der Vlist wrote: > Except that it's not the only location where it's broken and that won't > work with regular expressions. If I define a pattern such as ".{5}" I > want to check that this is 5 unicode characters, not 5 words of 16 > bits... I don't know about Relax regexp, but for schemas I had to rewrite an engine to cope with the full regexps of the beast. > I am starting to think that compiling Python for 32 bits might be the > safest way to solve this issue. You can't make that assumption, it's the safest for your developper but becomes an user nightmare. If you develop a library I assume it's ultimately to have people use it, if they first need to recompile python and handle multiple version, it's a serious mess. > Can you confirm that this is what RedHat does by default as mentioned > Uche and do you know the motivations (and eventually downsides) for this > decision? By default Red Hat compiles python with unicode support in UTF-16. 
I'm not in charge of this, I assume it's the default compilation option. IMHO it's a wrong assumption to think that UTF16 is a good cut, because you end up with variable length encoding anyway, and UCS32 would seriously bloat the app I'm afraid. Daniel -- Daniel Veillard | Red Hat Network https://rhn.redhat.com/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/ From vdv@dyomedea.com Mon Sep 23 22:31:42 2002 From: vdv@dyomedea.com (Eric van der Vlist) Date: 23 Sep 2002 23:31:42 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: Message-ID: <1032816703.19185.879.camel@ibook> On Mon, 2002-09-23 at 23:16, Uche Ogbuji wrote: > > yep, that simple, > > Oh, but then Python is so much simpler: > > > SP_PAT = re.compile(u"[\uD800-\uDBFF][\uDC00-\uDFFF]") > def smart_len(u): > sp_count = len(SP_PAT.findall(u)) > return len(u) - sp_count > > > Problem solved. Unfortunately only half solved (apart from the fact that it won't work on Python interpreters compiled for 32 bits but this would be easy to test) since this won't fix regular expressions that easily! > The great thing about Python is even when it frustrates you one moment, it > finds a way to quickly make up for it. I reckon that this is a smart smart_len :-) Eric -- Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html ------------------------------------------------------------------------ Eric van der Vlist http://xmlfr.org http://dyomedea.com (W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema ------------------------------------------------------------------------ From veillard@redhat.com Mon Sep 23 22:32:04 2002 From: veillard@redhat.com (Daniel Veillard) Date: Mon, 23 Sep 2002 17:32:04 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: ; from uche.ogbuji@fourthought.com on Mon, Sep 23, 2002 at 03:16:08PM -0600 References: Message-ID: <20020923173203.O5635@redhat.com> On Mon, Sep 23, 2002 at 03:16:08PM -0600, Uche Ogbuji wrote: > Oh, but then Python is so much simpler: > > > SP_PAT = re.compile(u"[\uD800-\uDBFF][\uDC00-\uDFFF]") > def smart_len(u): > sp_count = len(SP_PAT.findall(u)) > return len(u) - sp_count > > > Problem solved. modulo the space and CPU requirements for the operation (okay you can tell I'm primarilly a C coder :-) > The great thing about Python is even when it frustrates you one moment, it > finds a way to quickly make up for it. I don't think chars are classes but types, and hence one cannot make a subclass of strings whose instances could have all length/walk/extract operations being special cased to reflect XML unicode string. 
I (and Eric I bet) would like to be wrong on this :-) Daniel -- Daniel Veillard | Red Hat Network https://rhn.redhat.com/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/ From mike@skew.org Mon Sep 23 22:38:52 2002 From: mike@skew.org (Mike Brown) Date: Mon, 23 Sep 2002 15:38:52 -0600 (MDT) Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <15759.22105.546695.694714@magrathea.basistech.com> "from Tom Emerson at Sep 23, 2002 01:58:49 pm" Message-ID: <200209232138.g8NLcrXL071961@chilled.skew.org> Tom Emerson wrote: > internally Python is representing characters outside the BMP as a > surrogate pair in UTF-16, the length of a Unicode string using these > characters is 2 --- two UTF-16 characters. To be pedantic, characters are on a different level of abstraction than surrogate pairs, which are pairs of 16-bit code values. code value != character rather, code value sequence (1 or more) may be equivalent to a character In UTF-16, many characters can be represented with a single code value, but some require two code values, both selected from a range of values that are not individually assigned to characters. Programming languages still take shortcuts by saying that a 'character' data type is whatever approximate kind of code value is correct 99% of the time, which often means you're stuck with no differentiation between the idea of a character and a single 16-bit code value that represents it internally. Consequently you find that len(someString) gives you not the number of characters but the number of code values in the string. And 99% of the time, that's fine ... until your string contains one of the other (1.1 million minus 65536) characters in Unicode. So I think the problem here is not that Python says len(u"\uD800\uDC00") is 2 (unless somewhere it says that Python supports Unicode 3.2) but that someone assumed len() returns a count of Unicode characters... 
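A minimal sketch of that distinction, assuming the text is well formed (no lone surrogates). The helper names are made up for illustration and are not part of Python or PyXML. It counts UTF-16 code values by encoding explicitly, so it should give the same answer whether or not the interpreter stores strings as UTF-16 internally:

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code values of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def count_code_units(text):
    """Number of 16-bit code values needed to store text in UTF-16."""
    return len(utf16_units(text))

def count_code_points(text):
    """Number of characters: a surrogate pair counts once.

    In well-formed UTF-16 every low surrogate is the second half of a
    pair, so skipping low surrogates counts each pair exactly once.
    """
    return sum(1 for unit in utf16_units(text) if not 0xDC00 <= unit <= 0xDFFF)

sample = u"a\U00010000b"
print(count_code_units(sample))   # code values
print(count_code_points(sample))  # characters
```

On a build that stores UTF-16 internally, len(sample) would match the first number rather than the second.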
> If you compile your Python installation to use "wide" Unicode > characters (i.e., UTF-32), then I expect the behavior to be > > >>> c = u"\U00010000" > >>> len(c) > 1 Agreed. > >>> len(c) > u'\U00010000' I think you mean c, not len(c) - Mike ____________________________________________________________________________ mike j. brown | xml/xslt: http://skew.org/xml/ denver/boulder, colorado, usa | resume: http://skew.org/~mike/resume/ From vdv@dyomedea.com Mon Sep 23 22:41:04 2002 From: vdv@dyomedea.com (Eric van der Vlist) Date: 23 Sep 2002 23:41:04 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <20020923172626.N5635@redhat.com> References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> <20020923163522.M5635@redhat.com> <1032814235.19185.818.camel@ibook> <20020923172626.N5635@redhat.com> Message-ID: <1032817264.19160.895.camel@ibook> On Mon, 2002-09-23 at 23:26, Daniel Veillard wrote: > On Mon, Sep 23, 2002 at 10:50:34PM +0200, Eric van der Vlist wrote: > > Except that it's not the only location where it's broken and that won't > > work with regular expressions. If I define a pattern such as ".{5}" I > > want to check that this is 5 unicode characters, not 5 words of 16 > > bits... > > I don't know about Relax regexp, but for schemas I had to rewrite > an engine to cope with the full regexps of the beast. That's the same beast :-( ... there is no such thing as Relax NG regexp and it's just borrowing the datatypes from W3C XML Schema and most of their facets including patterns. Would you have Python bindings available for this regexps engine? > > I am starting to think that compiling Python for 32 bits might be the > > safest way to solve this issue. > > You can't make that assumption, it's the safest for your developer > but becomes a user nightmare.
If you develop a library I assume > it's ultimately to have people use it, if they first need to recompile > python and handle multiple versions, it's a serious mess. > > > Can you confirm that this is what RedHat does by default as mentioned > > Uche and do you know the motivations (and eventually downsides) for this > > decision? > > By default Red Hat compiles python with unicode support in UTF-16. > I'm not in charge of this, I assume it's the default compilation option. > > IMHO it's a wrong assumption to think that UTF16 is a good cut, because > you end up with variable length encoding anyway, and UCS32 would seriously > bloat the app I'm afraid. Yes, looks like the two options are equally bad :-( Eric -- Rendez-vous à Paris. http://www.technoforum.fr/integ2002/index.html ------------------------------------------------------------------------ Eric van der Vlist http://xmlfr.org http://dyomedea.com (W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema ------------------------------------------------------------------------ From uche.ogbuji@fourthought.com Mon Sep 23 22:47:33 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Mon, 23 Sep 2002 15:47:33 -0600 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: Message from Daniel Veillard of "Mon, 23 Sep 2002 17:32:04 EDT." <20020923173203.O5635@redhat.com> Message-ID: > On Mon, Sep 23, 2002 at 03:16:08PM -0600, Uche Ogbuji wrote: > > Oh, but then Python is so much simpler: > > > > > > SP_PAT = re.compile(u"[\uD800-\uDBFF][\uDC00-\uDFFF]") > > def smart_len(u): > > sp_count = len(SP_PAT.findall(u)) > > return len(u) - sp_count > > > > > > Problem solved. > > modulo the space and CPU requirements for the operation (okay you can tell > I'm primarily a C coder :-) I don't see the significant space requirements.
As for CPU, Python's len() is already much slower than wstrlen() anyway, so I don't think your point is very valid once someone has already made the choice to use Python. > > The great thing about Python is even when it frustrates you one moment, it > > finds a way to quickly make up for it. > > I don't think chars are classes but types, and hence one cannot > make a subclass of strings whose instances could have all length/walk/extract > operations being special cased to reflect XML unicode string. I (and Eric > I bet) would like to be wrong on this :-) You can subclass strings in Python 2.2 and more recent. Types and classes were unified in Python 2.2. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html From mike@skew.org Mon Sep 23 22:46:27 2002 From: mike@skew.org (Mike Brown) Date: Mon, 23 Sep 2002 15:46:27 -0600 (MDT) Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: "from Uche Ogbuji at Sep 23, 2002 02:29:07 pm" Message-ID: <200209232146.g8NLkRo1072075@chilled.skew.org> Uche Ogbuji wrote: > [#x10000-#x10FFFF] > > So &#67584; is not WF XML. I'm not sure why JJC uses it. I'm behind on this thread, but just wanted to reiterate, decimal 67,584 most certainly is in the range of hex 10000 (decimal 65,536) to 10FFFF (decimal 1,114,111).
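The arithmetic is easy to check. What follows is only a sketch with hypothetical helper names (nothing here comes from PyXML or the XML spec itself): it tests the [#x10000-#x10FFFF] range quoted above and, as an aside, shows the standard UTF-16 split of such a code point into a pair of surrogate code values.

```python
# Sketch only: the helper names are illustrative assumptions.
SUPPLEMENTARY_MIN = 0x10000
SUPPLEMENTARY_MAX = 0x10FFFF

def in_supplementary_range(code_point):
    """True if code_point lies in the [#x10000-#x10FFFF] range."""
    return SUPPLEMENTARY_MIN <= code_point <= SUPPLEMENTARY_MAX

def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 code values."""
    if not in_supplementary_range(code_point):
        raise ValueError("not a supplementary code point: %r" % (code_point,))
    offset = code_point - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

cp = 67584  # the decimal character reference under discussion
print(hex(cp))
print(in_supplementary_range(cp))
print([hex(unit) for unit in to_surrogate_pair(cp)])
```

Decimal 67,584 is hex 10800, which is where the u"\u10800" attempts later in the thread come from.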
From veillard@redhat.com Mon Sep 23 22:46:59 2002 From: veillard@redhat.com (Daniel Veillard) Date: Mon, 23 Sep 2002 17:46:59 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <1032817264.19160.895.camel@ibook>; from vdv@dyomedea.com on Mon, Sep 23, 2002 at 11:41:04PM +0200 References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> <20020923163522.M5635@redhat.com> <1032814235.19185.818.camel@ibook> <20020923172626.N5635@redhat.com> <1032817264.19160.895.camel@ibook> Message-ID: <20020923174659.P5635@redhat.com> On Mon, Sep 23, 2002 at 11:41:04PM +0200, Eric van der Vlist wrote: > On Mon, 2002-09-23 at 23:26, Daniel Veillard wrote: > > On Mon, Sep 23, 2002 at 10:50:34PM +0200, Eric van der Vlist wrote: > > > Except that it's not the only location where it's broken and that won't > > > work with regular expressions. If I define a pattern such as ".{5}" I > > > want to check that this is 5 unicode characters, not 5 words of 16 > > > bits... > > > > I don't know about Relax regexp, but for schemas I had to rewrite > > an engine to cope with the full regexps of the beast. > > That's the same beast :-( ... there is no such thing as Relax NG regexp > and it's just borrowing the datatypes from W3C XML Schema and most of > their facets including patterns. okay I see, > Would you have Python bindings available for this regexps engine? Should not be too hard but they would operate on UTF8 string like all libxml2 internals. So far regexps were not compiled by default in libxml2, I switched it on last week, so now could be a good time to add bindings, I will try to do this before the next release. 
Daniel -- Daniel Veillard | Red Hat Network https://rhn.redhat.com/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/ From vdv@dyomedea.com Mon Sep 23 22:46:54 2002 From: vdv@dyomedea.com (Eric van der Vlist) Date: 23 Sep 2002 23:46:54 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <20020923173203.O5635@redhat.com> References: <20020923173203.O5635@redhat.com> Message-ID: <1032817615.19185.907.camel@ibook> On Mon, 2002-09-23 at 23:32, Daniel Veillard wrote: > I don't think chars are classes but types, and hence one cannot > make a subclass of strings whose instances could have all length/walk/extract > operations being special cased to reflect XML unicode string. I (and Eric > I bet) would like to be wrong on this :-) Well, the dream has become reality: it's possible since Python 2.2 and I am actually working on an implementation which would use classes (including builtin types) and introspection to define Relax NG (or W3C XML Schema) simple datatypes rather than redefining what a datatype is... Eric -- Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html ------------------------------------------------------------------------ Eric van der Vlist http://xmlfr.org http://dyomedea.com (W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema ------------------------------------------------------------------------ From uche.ogbuji@fourthought.com Mon Sep 23 22:54:35 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: 23 Sep 2002 15:54:35 -0600 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <200209232138.g8NLcrXL071961@chilled.skew.org> References: <200209232138.g8NLcrXL071961@chilled.skew.org> Message-ID: <1032818077.3243.7899.camel@malatesta> On Mon, 2002-09-23 at 15:38, Mike Brown wrote: > So I think the problem here is not that Python says len(u"\uD800\uDC00") is 2 > (unless somewhere it says that Python supports Unicode 3.2) but that someone > assumed len() returns a count of Unicode characters... I think the real problem is rather than nothing says that len() operating on Unicode objects is *not* a count of characters. There is nothing that says that len is strictly a count of storage values. I think it's perfectly natural to assume len() is a count of characters, and Python's docs should be clarified in this regard. Consider that other built-ins such as repr and the literal parsing code does deal in characters and not storage values. So why should anyone expect len() to be different. As I said the main problem I see with all this in Python is inconsistency and lack of docs. -- Uche Ogbuji Fourthought, Inc. 
http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html From uche.ogbuji@fourthought.com Mon Sep 23 22:58:11 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Mon, 23 Sep 2002 15:58:11 -0600 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: Message from Daniel Veillard of "Mon, 23 Sep 2002 17:26:26 EDT." <20020923172626.N5635@redhat.com> Message-ID: > > Can you confirm that this is what RedHat does by default as mentioned > > Uche and do you know the motivations (and eventually downsides) for this > > decision? > > By default Red Hat compiles python with unicode support in UTF-16. > I'm not in charge of this, I assume it's the default compilation option. Not from what we found. Jeremy was the one who encountered this, not me, but I'm pretty sure he said he found that starting with RH 7.3, Red Hat started building Python 2.x with UTF-32 and whchar_t support. > IMHO it's a wrong assumption to think that UTF16 is a good cut, because > you end up with variable lenght encoding anyway, and UCS32 would seriously > bloat the app I'm afraid. Just as a side observation: Guido called this FUD. I'm not so sure. -- Uche Ogbuji Fourthought, Inc. 
http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html From veillard@redhat.com Mon Sep 23 22:59:26 2002 From: veillard@redhat.com (Daniel Veillard) Date: Mon, 23 Sep 2002 17:59:26 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: ; from uche.ogbuji@fourthought.com on Mon, Sep 23, 2002 at 03:58:11PM -0600 References: Message-ID: <20020923175925.Q5635@redhat.com> On Mon, Sep 23, 2002 at 03:58:11PM -0600, Uche Ogbuji wrote: > > > Can you confirm that this is what RedHat does by default as mentioned > > > Uche and do you know the motivations (and eventually downsides) for this > > > decision? > > > > By default Red Hat compiles python with unicode support in UTF-16. > > I'm not in charge of this, I assume it's the default compilation option. > > Not from what we found. Jeremy was the one who encountered this, not me, but > I'm pretty sure he said he found that starting with RH 7.3, Red Hat started > building Python 2.x with UTF-32 and whchar_t support. Hum, here on 2 recent versions :-) paphio:~ -> python2.2 Python 2.2 (#1, Apr 12 2002, 15:29:57) [GCC 2.96 20000731 (Red Hat Linux 7.2 2.96-109)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> c = u"\u10800" >>> len(c) 2 >>> gnome:~ -> python Python 2.2.1 (#1, Aug 30 2002, 12:15:30) [GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> c = u"\u10800" >>> len(c) 2 >>> looks like UTF16 to me ! > > IMHO it's a wrong assumption to think that UTF16 is a good cut, because > > you end up with variable lenght encoding anyway, and UCS32 would seriously > > bloat the app I'm afraid. 
> > Just as a side observation: Guido called this FUD. I'm not so sure. It's just my opinion, and as a whole me and other in the Gnome and KDE projects all went UTF8 without apriori concertation, it was just natural to us (okay this also keep strings 0 terminated which is crucial). Daniel -- Daniel Veillard | Red Hat Network https://rhn.redhat.com/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/ From jeremy.kloth@fourthought.com Mon Sep 23 23:10:52 2002 From: jeremy.kloth@fourthought.com (Jeremy Kloth) Date: Mon, 23 Sep 2002 16:10:52 -0600 Subject: [XML-SIG] Re: Issues with Unicode type References: <20020923175925.Q5635@redhat.com> Message-ID: <00a001c2634e$169aae30$1a01a8c0@zeus> ----- Original Message ----- From: "Daniel Veillard" To: "Uche Ogbuji" Cc: "Eric van der Vlist" ; Sent: Monday, September 23, 2002 3:59 PM Subject: Re: [XML-SIG] Re: Issues with Unicode type > On Mon, Sep 23, 2002 at 03:58:11PM -0600, Uche Ogbuji wrote: > > > > Can you confirm that this is what RedHat does by default as mentioned > > > > Uche and do you know the motivations (and eventually downsides) for this > > > > decision? > > > > > > By default Red Hat compiles python with unicode support in UTF-16. > > > I'm not in charge of this, I assume it's the default compilation option. > > > > Not from what we found. Jeremy was the one who encountered this, not me, but > > I'm pretty sure he said he found that starting with RH 7.3, Red Hat started > > building Python 2.x with UTF-32 and whchar_t support. > > Hum, here on 2 recent versions :-) > > paphio:~ -> python2.2 > Python 2.2 (#1, Apr 12 2002, 15:29:57) > [GCC 2.96 20000731 (Red Hat Linux 7.2 2.96-109)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. 
> >>> c = u"\u10800" > >>> len(c) > 2 > >>> > > gnome:~ -> python > Python 2.2.1 (#1, Aug 30 2002, 12:15:30) > [GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> c = u"\u10800" > >>> len(c) > 2 > >>> > > looks like UTF16 to me ! However that is really two characters 0x1080 and 0x0030. \u (lowercase) only takes 4 hex digits. \U (uppercase) takes 8 digits. So to create the character 0x10800, the sequence should be u'\U00010800'. To truly see if Python has wide unicode support: import sys print sys.maxunicode if the result is >65536, then it was compiled with "--enable-unicode=ucs4", which the RPM spec file for python 2.2.1 does use. -- Jeremy Kloth From martin@v.loewis.de Mon Sep 23 23:16:18 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 24 Sep 2002 00:16:18 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <1032814235.19185.818.camel@ibook> References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> <20020923163522.M5635@redhat.com> <1032814235.19185.818.camel@ibook> Message-ID: Eric van der Vlist writes: > I am starting to think that compiling Python for 32 bits might be the > safest way to solve this issue. Again, I recommend to reconsider your requirements. Why does this problem need to be "solved"? If people use such characters - fine: have them use UCS-4 builds. On Windows, you will not get UCS-4 builds for at least 5 years - since Microsoft won't change the Windows API to UCS-4. Regards, Martin From martin@v.loewis.de Mon Sep 23 23:21:25 2002 From: martin@v.loewis.de (Martin v.
Loewis) Date: 24 Sep 2002 00:21:25 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <20020923163338.L5635@redhat.com> References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <20020923163338.L5635@redhat.com> Message-ID: Daniel Veillard writes: > No it's not it's a well formed document. Strictly speaking you have > either well formed or not, there is not other definition, and that definition > is given in the XML specification. It's ill-formed: it contains illegal characters, according to section 2.2 of http://www.w3.org/TR/2000/REC-xml-20001006: "Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. " The cited version of the standard is both the 1993 revision, and part 1 of the 2000 revision; both revisions have legal characters in the BMP only. Regards, Martin From martin@v.loewis.de Mon Sep 23 23:13:55 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 24 Sep 2002 00:13:55 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <1032801701.19382.572.camel@ibook> References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> Message-ID: Eric van der Vlist writes: > > No. Why do you think this? Strictly speaking, XML 1.0 defines a > > "character" as defined by ISO/IEC 10646:1993 and ISO/IEC 10646-1:2000. > > This means only characters in the Basic Multilingual Plane are allowed > > in XML. James Clark's document is, strictly speaking, ill-formed. > > That's weird... I'm not surprised. James is interested in funny and strange cases. He is, as usual, ahead of his time, and predicts the future - most likely correctly. He does not care about strict conformance, but acts as an early adaptor, making things work that aren't supposed to work just yet. 
You should use his test suite only if you can follow his principles. > And I need to do the same in Python... Not necessarily. You can 1. Ignore the problem. This is probably fine: nobody is using non-BMP characters right now. Most systems have serious problem displaying them, since font systems are restricted to 64k glyphs, and, in many cases, to displaying characters in the BMP only. 2. Declare that this works correctly in UCS-4 builds of Python only. People that need such characters will use an UCS-4 build of Python, anyway; Guido expects Chinese users to be early adaptors here. Notice that James has no such option: Java is inherently tied to UTF-16. 3. Implement it properly. Please understand that you will be trading efficiency for correctness. > > Notice also that U+10800 is unassigned even in Unicode 3.2. > > I wonder why he has picked this value! Out of the blue. He is not really interested in non-BMP characters, but this particular value is "even", so a good choice for a test case. Regards, Martin From martin@v.loewis.de Mon Sep 23 23:33:56 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 24 Sep 2002 00:33:56 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <1032809244.1908.7597.camel@malatesta> References: <15759.15816.342144.891607@magrathea.basistech.com> <15759.22105.546695.694714@magrathea.basistech.com> <1032809244.1908.7597.camel@malatesta> Message-ID: Uche Ogbuji writes: > > Yes, that is what I (thought I) said in my previous response: since > > internally Python is representing characters outside the BMP as a > > surrogate pair in UTF-16, the length of a Unicode string using these > > characters is 2 --- two UTF-16 characters. > > No. A surrogate pair is one character. Yes: the question is what the len() function returns. The number of characters? Apparently not. The number of code units? Yes, definitely. > It takes up 2 16-bit values, but this is not the same as taking up 2 > characters. 
Nobody said the len function would return the number of characters: it returns the number of code units, which is sometimes different from the number of code points. > No. My whole point is that it didn't work. len(c) would be 1, not 2 if > the characters were properly treated as a surrogate pair. No. It depends on what you expect len to return. If len would return the number of code points, it would not be additive, i.e. you could create strings A and B such that len(A) + len(B) <> len(A+B) That would be confusing to implementations; it would also mean that len(X) cannot be computed in O(1), which also would be confusing. > Yes. Don't you see that this means that the behavior as compiled with > UTF-16 is wrong from a *character set* point of view? The same code > point is *one* character whether encoded in UTF-7, UTF-8, UTF-16, > UTF-32, UCS-2, UCS-4, etc. It is never more than one character. Sure. That makes it clear that len() does not count characters. Regards, Martin From martin@v.loewis.de Mon Sep 23 23:35:33 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 24 Sep 2002 00:35:33 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <1032813257.19160.793.camel@ibook> References: <15759.15816.342144.891607@magrathea.basistech.com> <15759.22105.546695.694714@magrathea.basistech.com> <1032809244.1908.7597.camel@malatesta> <15759.27522.35916.363703@magrathea.basistech.com> <1032813257.19160.793.camel@ibook> Message-ID: Eric van der Vlist writes: > Having gone through this thread in the archives, I don't want to open it > again :-)... OTOH, would it really be an option to say that feature X or > Y of PyXML (if such a library was added at some point) would require an > interpreter compiled for 32-bit character representation to be > compliant? Yes, that's a reasonable option.
> Assuming that all the common distributions are shipped > compiled for 16-bit (like the Debian sid on which I am doing these > tests), it would become a real nightmare for the users! No, it wouldn't. Users don't care about non-BMP characters. Regards, Martin From uche.ogbuji@fourthought.com Mon Sep 23 23:41:58 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Mon, 23 Sep 2002 16:41:58 -0600 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: Message from "Jeremy Kloth" of "Mon, 23 Sep 2002 16:10:52 MDT." <00a001c2634e$169aae30$1a01a8c0@zeus> Message-ID: > However that is really two characters 0x1080 and 0x0030. \u (lowercase) > only takes 4 hex digits. \U (uppercase) takes 8 digits. So to create the > character 0x10800, the sequence should be u'\U00010800'. Right, Jeremy. I wasn't squinting hard enough at Daniel's example. In my own examples, I've been using u"\U00010000" or u"\uD800\uDC00" These are actually equivalent if Python is compiled for UTF-16 encoding: In the top example, Python breaks the full code point into its UTF-16 representation, and so ends up with the same internal object as the second form. I'm not sure whether they would be equivalent if Python is compiled for UCS-4 (BTW, there is no diff between UTF-32 and UCS-4, is there?). I would imagine Python would blindly create 2 pseudo code points D800 and DC00. I say "pseudo" since, because these values are in the surrogate blocks, they are not valid characters in themselves. Which leads me to believe that even though u"\uD800\uDC00" would be treated equivalently to u"\U00010000" as long as Python is compiled for UTF-16, that it is a *very* bad idea to write unicode literals that way. I'm learning a lot today :-) -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html From martin@v.loewis.de Mon Sep 23 23:40:36 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 24 Sep 2002 00:40:36 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: Message-ID: Uche Ogbuji writes: > This just deepens my unease at Guido's reluctance to support > surrogates in the code that handles UTF-16 in Python. The > inconsistency seems ugly. However, it is unavoidable. It also has all been decided long ago, see PEP 261. > But as Tom says, it looks like this matter has been beaten to death, > and it's pretty much settled. Now I see why Red Hat plumped on > compiling Python with UTF-32 support (and wchar_t). I think it's > the only route to sanity. On Unix, I was indeed fighting to make Py_UNICODE equal to wchar_t where possible. Guido disliked this on the basis of uniformity, and space savings. > Having said all this, Martin is right about XML and the BMP. I'd > forgotten. Actually, I now think that the XML spec is inconsistent. In one place, it allows non-BMP references; in another place, it points to specifications that restrict themselves to the BMP. Regards, Martin From veillard@redhat.com Mon Sep 23 23:40:57 2002 From: veillard@redhat.com (Daniel Veillard) Date: Mon, 23 Sep 2002 18:40:57 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: ; from martin@v.loewis.de on Tue, Sep 24, 2002 at 12:21:25AM +0200 References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <20020923163338.L5635@redhat.com> Message-ID: <20020923184057.S5635@redhat.com> On Tue, Sep 24, 2002 at 12:21:25AM +0200, Martin v. Loewis wrote: > Daniel Veillard writes: > > > No it's not it's a well formed document.
Strictly speaking you have > > either well formed or not, there is not other definition, and that definition > > is given in the XML specification. > > It's ill-formed: it contains illegal characters, according to section > 2.2 of http://www.w3.org/TR/2000/REC-xml-20001006: > > "Legal characters are tab, carriage return, line feed, and the legal > characters of Unicode and ISO/IEC 10646. " and goes on to say... "Consequently, XML processors must accept any character in the range specified for Char. " It's a must with the IETF semantic (section 1.2). You can call it "ill-formed" if you want, it is well-formed XML-wise and processors are required to accept those ! Daniel -- Daniel Veillard | Red Hat Network https://rhn.redhat.com/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/ From martin@v.loewis.de Mon Sep 23 23:42:31 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 24 Sep 2002 00:42:31 +0200 Subject: [XML-SIG] Broken Link on http://pyxml.sourceforge.net/topics/dtds/index.html In-Reply-To: <15759.24335.306815.281580@grendel.zope.com> References: <3D838317.6040106@spiritone.com> <15759.24335.306815.281580@grendel.zope.com> Message-ID: "Fred L. Drake, Jr." writes: > Josh English writes: > > The link to www.schema.net seems to be broken. I keeping getting a page > > stating that the domain name is for sale. > > Ok, I've fixed this in the CVS version of the website. Can someone > who knows how to push changes to the live site on SourceForge pull the > right lever to make that happen? In theory, a cronjob should do this within six hours. If it fails to do so, please let me know: the magic is to run /home/groups/p/py/pyxml/doupdate. 
Regards, Martin From veillard@redhat.com Mon Sep 23 23:43:16 2002 From: veillard@redhat.com (Daniel Veillard) Date: Mon, 23 Sep 2002 18:43:16 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: ; from martin@v.loewis.de on Tue, Sep 24, 2002 at 12:35:33AM +0200 References: <15759.15816.342144.891607@magrathea.basistech.com> <15759.22105.546695.694714@magrathea.basistech.com> <1032809244.1908.7597.camel@malatesta> <15759.27522.35916.363703@magrathea.basistech.com> <1032813257.19160.793.camel@ibook> Message-ID: <20020923184316.T5635@redhat.com> On Tue, Sep 24, 2002 at 12:35:33AM +0200, Martin v. Loewis wrote: > Eric van der Vlist writes: > > Assumining that all the common distributions are shiped > > compiled for 16-bit (like the Debian sid on which I am doing these > > tests), it would become a real nightmare for the users! > > No, it wouldn't. Users' don't care about non-BMP characters. Users tends to care about conformance of their software. Such an attitude may not be a service for the popularity of python for XML processing ... really ! Daniel -- Daniel Veillard | Red Hat Network https://rhn.redhat.com/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/ From veillard@redhat.com Mon Sep 23 23:46:12 2002 From: veillard@redhat.com (Daniel Veillard) Date: Mon, 23 Sep 2002 18:46:12 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: ; from martin@v.loewis.de on Tue, Sep 24, 2002 at 12:40:36AM +0200 References: Message-ID: <20020923184612.U5635@redhat.com> On Tue, Sep 24, 2002 at 12:40:36AM +0200, Martin v. Loewis wrote: > Uche Ogbuji writes: > > Having said all this, Martin is right about XML and the BMP. I'd > > forgotten. > > Actually, I now think that the XML spec is inconsistent. In one place, > it allows non-BMP references; in another place, it points to > specifications that restrict themselves to the BMP. 
if you read the full paragraph and don't stop at the sentence referencing the Unicode spec I think it is relatively clear. You could ask xml-dev or the xml-editor list if you really think there is a risk of confusion there but this was carefully crafted to avoid confusion precisely. Daniel -- Daniel Veillard | Red Hat Network https://rhn.redhat.com/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/ From veillard@redhat.com Mon Sep 23 23:51:21 2002 From: veillard@redhat.com (Daniel Veillard) Date: Mon, 23 Sep 2002 18:51:21 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <00a001c2634e$169aae30$1a01a8c0@zeus>; from jeremy.kloth@fourthought.com on Mon, Sep 23, 2002 at 04:10:52PM -0600 References: <20020923175925.Q5635@redhat.com> <00a001c2634e$169aae30$1a01a8c0@zeus> Message-ID: <20020923185121.V5635@redhat.com> On Mon, Sep 23, 2002 at 04:10:52PM -0600, Jeremy Kloth wrote: > However that is really two characters 0x1080 and 0x0030. \u (lowercase) > only takes 4 hex digits. \U (uppercase) takes 8 digits. So to create the > character 0x10800, the sequence should be u'\U00010800'. Oops, my bad, I just tried to reproduce Eric's problem case > To truly see if Python has wide unicode support: > > import sys > print sys.maxunicode > > if the result is >65536, then it was compiled with "--enable-unicode=ucs4", > which the RPM spec file for python 2.2.1 does use. 7.3: paphio:~ -> python2.2 Python 2.2 (#1, Apr 12 2002, 15:29:57) [GCC 2.96 20000731 (Red Hat Linux 7.2 2.96-109)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import sys >>> print sys.maxunicode 1114111 >>> latest: gnome:~ -> python Python 2.2.1 (#1, Aug 30 2002, 12:15:30) [GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information.
>>> import sys >>> print sys.maxunicode 65535 >>> Hum, maybe you should not count on it :-\ Daniel -- Daniel Veillard | Red Hat Network https://rhn.redhat.com/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/ From martin@v.loewis.de Tue Sep 24 00:06:10 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 24 Sep 2002 01:06:10 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <1032818077.3243.7899.camel@malatesta> References: <200209232138.g8NLcrXL071961@chilled.skew.org> <1032818077.3243.7899.camel@malatesta> Message-ID: Uche Ogbuji writes: > I think the real problem is rather than nothing says that len() > operating on Unicode objects is *not* a count of characters. There is > nothing that says that len is strictly a count of storage values. I > think it's perfectly natural to assume len() is a count of characters, > and Python's docs should be clarified in this regard. I somewhat disagree. For over a year, I think this is the first time that anybody ever noticed. By the time somebody notices the next time, we might be all using UCS-4 builds, and the problem is gone. > Consider that other built-ins such as repr and the literal parsing > code does deal in characters and not storage values. So why should > anyone expect len() to be different. Actually, up to Python 2.3, literal parsing operates on bytes, not characters. If you have a non-ASCII encoding in your sources, the escape backslash would escape only the next byte - which may or may not be the next character. Again, few people ever notice. > As I said the main problem I see with all this in Python is > inconsistency and lack of docs. You are just not reading all the docs. There is a PEP that spells out all these details, deeper than you ever wanted to know. Regards, Martin From martin@v.loewis.de Tue Sep 24 00:08:36 2002 From: martin@v.loewis.de (Martin v.
Loewis) Date: 24 Sep 2002 01:08:36 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <20020923184316.T5635@redhat.com> References: <15759.15816.342144.891607@magrathea.basistech.com> <15759.22105.546695.694714@magrathea.basistech.com> <1032809244.1908.7597.camel@malatesta> <15759.27522.35916.363703@magrathea.basistech.com> <1032813257.19160.793.camel@ibook> <20020923184316.T5635@redhat.com> Message-ID: Daniel Veillard writes: > Users tends to care about conformance of their software. > Such an attitude may not be a service for the popularity of python > for XML processing ... really ! Only if they know - and then only for checkmark lists, not in real life. In any case, I can't spot any non-conformance here; it works all as designed and specified. Regards, Martin From mal@lemburg.com Tue Sep 24 08:49:33 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 24 Sep 2002 09:49:33 +0200 Subject: [XML-SIG] Re: Issues with Unicode type References: <200209232138.g8NLcrXL071961@chilled.skew.org> <1032818077.3243.7899.camel@malatesta> Message-ID: <3D90190D.9050107@lemburg.com> Martin v. Loewis wrote: > Uche Ogbuji writes: > >>I think the real problem is rather than nothing says that len() >>operating on Unicode objects is *not* a count of characters. There is >>nothing that says that len is strictly a count of storage values. I >>think it's perfectly natural to assume len() is a count of characters, >>and Python's docs should be clarified in this regard. len() counts the number of Unicode code units, not code points and not even close to graphemes, which is what users usually identify "characters" with. It's a technical necessity. Special algorithms would be needed to provide the length and index information in terms of code points and graphemes (and words). 
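To make the code unit versus code point distinction above concrete, here is a small sketch. It assumes a present-day Python 3 interpreter, where len() on a str counts code points; that is not how the 2002 interpreters discussed in this thread behaved. The helper name utf16_code_units is made up for the illustration and is not part of any library.

```python
def utf16_code_units(text):
    """Return how many 16-bit code units `text` occupies in UTF-16."""
    # "utf-16-le" emits no byte-order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


# One BMP character followed by one character outside the BMP.
s = "a\U00010800"

code_points = len(s)              # Python 3 (assumed) counts code points
code_units = utf16_code_units(s)  # what a narrow (UTF-16) build would count

print(code_points, code_units)
```

On a narrow build of the kind shown earlier in the thread, len() would report the second number, because the non-BMP character is stored as a surrogate pair.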
See my Unicode talk for details on the different terms: http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From sjoerd@acm.org Tue Sep 24 09:19:56 2002 From: sjoerd@acm.org (Sjoerd Mullender) Date: Tue, 24 Sep 2002 10:19:56 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <20020923175925.Q5635@redhat.com> References: <20020923175925.Q5635@redhat.com> Message-ID: <200209240819.g8O8JuR08470@indus.ins.cwi.nl> Nobody seems to have bothered looking at the two characters produced by u'\u10800'. I'd say: try it: + python Python 2.3a0 (#78, Sep 20 2002, 11:19:50) [GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-112)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> c = u"\u10800" >>> len(c) 2 >>> c u'\u10800' >>> c[0] u'\u1080' >>> c[1] u'0' >>> In other words, the \u escape takes the next 4 hex digits and uses those to create a unicode character, and what's left over is just appended. If you use the \U escape you need to provide 8 hex digits: >>> c = u'\U00010800' >>> len(c) 2 >>> c[0] u'\ud802' >>> c[1] u'\udc00' >>> And here we see the surrogates appear. It's still 2 characters long. On Mon, Sep 23 2002 Daniel Veillard wrote: > On Mon, Sep 23, 2002 at 03:58:11PM -0600, Uche Ogbuji wrote: > > > > Can you confirm that this is what RedHat does by default as mentioned > > > > Uche and do you know the motivations (and eventually downsides) for this > > > > decision? > > > > > > By default Red Hat compiles python with unicode support in UTF-16. > > > I'm not in charge of this, I assume it's the default compilation option. > > > > Not from what we found. 
Jeremy was the one who encountered this, not me, but > > I'm pretty sure he said he found that starting with RH 7.3, Red Hat started > > building Python 2.x with UTF-32 and whchar_t support. > > Hum, here on 2 recent versions :-) > > paphio:~ -> python2.2 > Python 2.2 (#1, Apr 12 2002, 15:29:57) > [GCC 2.96 20000731 (Red Hat Linux 7.2 2.96-109)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> c = u"\u10800" > >>> len(c) > 2 > >>> > > gnome:~ -> python > Python 2.2.1 (#1, Aug 30 2002, 12:15:30) > [GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> c = u"\u10800" > >>> len(c) > 2 > >>> > > looks like UTF16 to me ! > > > > IMHO it's a wrong assumption to think that UTF16 is a good cut, because > > > you end up with variable lenght encoding anyway, and UCS32 would seriously > > > bloat the app I'm afraid. > > > > Just as a side observation: Guido called this FUD. I'm not so sure. > > It's just my opinion, and as a whole me and other in the Gnome and KDE > projects all went UTF8 without apriori concertation, it was just natural > to us (okay this also keep strings 0 terminated which is crucial). > > Daniel > > -- > Daniel Veillard | Red Hat Network https://rhn.redhat.com/ > veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ > http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/ > > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://mail.python.org/mailman/listinfo/xml-sig > -- Sjoerd Mullender From vdv@dyomedea.com Tue Sep 24 09:30:48 2002 From: vdv@dyomedea.com (Eric van der Vlist) Date: 24 Sep 2002 10:30:48 +0200 Subject: [XML-SIG] Issues with Unicode (wrap-up and moving along) Message-ID: <1032856249.22170.116.camel@ibook> First, thanks for the very helpfull answers! 
As a wrap-up, I think that we can say that: 1) Unicode is supported as code units rather than code points in Python. 2) This is visible on unicode.len() but also in other modules such as re. 3) Even though the impact seems more theoratical than real world, this makes it difficult to be compliant with XML 1.0 in the support of associated specifications (W3C XML Schema datatypes is an example but XPath is probably also impacted). 4) The solution which is most conform with the decisions taken by Python is to give the choice to users between using an interpreter compiled with unicode 16 or 32 bits. In the first case (which is the default) the result will not be totally compliant and will not pass all the test suites. In the second one, the result will eventually be totally compliant. Note that as this is quite easy to detect, implementations could eventually raise exceptions when unaccurate results might happen. To facilitate using 32 bits unicode Python binaries, we could also suggest to major distributions to provide alternative packages compiled with this option. Now, I have also tried to use such an interpreter. The good news is that the unicode class works as expected: vdv@ibook:~$ python Python 2.2.1 (#1, Sep 24 2002, 09:37:13) [GCC 2.95.4 20011002 (Debian prerelease)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import sys >>> print sys.maxunicode 1114111 >>> u = u'\U00010800' >>> print len(u) 1 The bad news is that the migration doesn't seem to be so easy, at least for 4Suite and it blows up when I try to run my test suite: File "/usr/lib/python2.2/site-packages/Ft/Xml/cDomlette.py", line 14, in ? import cDomlettec ImportError: /usr/lib/python2.2/site-packages/Ft/Xml/cDomlettec.so: undefined symbol: PyUnicodeUCS2_AsEncodedString Should I fill a bug :-) ? Thanks Eric -- Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html ------------------------------------------------------------------------ Eric van der Vlist http://xmlfr.org http://dyomedea.com (W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema ------------------------------------------------------------------------ From noreply@sourceforge.net Tue Sep 24 11:51:33 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Tue, 24 Sep 2002 03:51:33 -0700 Subject: [XML-SIG] [ pyxml-Bugs-613759 ] getContentHandler returns instance Message-ID: Bugs item #613759, was opened at 2002-09-24 03:51 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=106473&aid=613759&group_id=6473 Category: SAX Group: None Status: Open Resolution: None Priority: 5 Submitted By: Nobody/Anonymous (nobody) Assigned to: Nobody/Anonymous (nobody) Summary: getContentHandler returns instance Initial Comment: >>> from xml.sax import make_parser >>> p = make_parser() >>> p.getContentHandler() Shouldn't getContentHandler() return None since no ContentHandler has been assigned? ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=106473&aid=613759&group_id=6473 From fredrik@pythonware.com Tue Sep 24 12:38:58 2002 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 24 Sep 2002 13:38:58 +0200 Subject: [XML-SIG] Issues with Unicode (wrap-up and moving along) References: <1032856249.22170.116.camel@ibook> Message-ID: <011201c263be$f95ddf10$0900a8c0@spiff> Eric van der Vlist wrote: > As a wrap-up, I think that we can say that: > > 1) Unicode is supported as code units rather than code points in Python. that's an implementation detail that happens to be exposed in the current crop of interpreters. from a design perspective, Python uses code points, and only fully supports Unicode BMP characters (Unicode 2.0).
if you go outside the BMP, expect version-dependent behaviour, and expect that behaviour to change in future versions. > The bad news is that the migration doesn't seem to be so easy, at least > for 4Suite and it blows up when I try to run my test suite: > > File "/usr/lib/python2.2/site-packages/Ft/Xml/cDomlette.py", line 14, > in ? > import cDomlettec > ImportError: /usr/lib/python2.2/site-packages/Ft/Xml/cDomlettec.so: > undefined symbol: PyUnicodeUCS2_AsEncodedString > > Should I fill a bug :-) ? is the extension UCS-4 aware? From fdrake@acm.org Tue Sep 24 14:59:11 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 24 Sep 2002 09:59:11 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <1032818077.3243.7899.camel@malatesta> References: <200209232138.g8NLcrXL071961@chilled.skew.org> <1032818077.3243.7899.camel@malatesta> Message-ID: <15760.28591.497946.198455@grendel.zope.com> Uche Ogbuji writes: > As I said the main problem I see with all this in Python is > inconsistency and lack of docs. I've just added a note to the docs for Python 2.2.2 and 2.3 that len() returns the number of storage units, not abstract characters. I don't expect that to change given that it's been doing it that way since the Unicode type was introduced. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From tree@basistech.com Tue Sep 24 15:05:10 2002 From: tree@basistech.com (Tom Emerson) Date: Tue, 24 Sep 2002 10:05:10 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <15760.28591.497946.198455@grendel.zope.com> References: <200209232138.g8NLcrXL071961@chilled.skew.org> <1032818077.3243.7899.camel@malatesta> <15760.28591.497946.198455@grendel.zope.com> Message-ID: <15760.28950.855181.718163@magrathea.basistech.com> Fred L. Drake, Jr. writes: > I've just added a note to the docs for Python 2.2.2 and 2.3 that len() > returns the number of storage units, not abstract characters.
I don't > expect that to change given that it's been doing it that way since the > Unicode type was introduced. Since this appears to be a point of some confusion to people who aren't indoctrinated, perhaps a discussion needs to be put into the documentation (extracted from the PEP or written anew) about the reasons for this. Or is there a feeling that such detail doesn't belong in the "regular" docs? -- Tom Emerson Basis Technology Corp. Software Architect http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From fdrake@acm.org Tue Sep 24 15:05:37 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 24 Sep 2002 10:05:37 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: <20020923173203.O5635@redhat.com> Message-ID: <15760.28977.657697.559643@grendel.zope.com> Uche Ogbuji writes: > I don't see the significant space requirments. As for CPU, > Python's len() is already much slower than wstrlen() anyway, so I > don't think your point is very valid once someone has already made > the choice to use Python. What do you wanna bet that a lot of that is the global lookup of len() and not the call itself? Not that either is trivial compared to wstrlen(), unless of course the string is long. ;-) Remember, the slow global looks are already a target for serious optimizations, so expect some of the overhead to disappear. > You can subclass strings in Python 2.2 and more recent. Tyes and > classes were unified in Python 2.2. Python 2.2 implemented the first stage of the type/class unification; it did not complete it. There are still "class" and "instance" types, and you still deal with them using a traditional class statement without a new-style base class of module-global setting of __metatype__. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From fdrake@acm.org Tue Sep 24 15:10:59 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) 
Date: Tue, 24 Sep 2002 10:10:59 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> Message-ID: <15760.29299.779002.254777@grendel.zope.com> Martin v. Loewis writes: > 3. Implement it properly. Please understand that you will be trading > efficiency for correctness. I'm sure a small C extension could provide the needed helpers quite efficiently. Even with a UCS-4 version of Python, a Unicode literal containing a surrogate pair (explicitly, using two \u sequences) will exhibit the behavior that Eric wants to see suppressed. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From mal@lemburg.com Tue Sep 24 15:20:12 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 24 Sep 2002 16:20:12 +0200 Subject: [XML-SIG] Indexing Unicode (Re: Issues with Unicode type) References: <200209232138.g8NLcrXL071961@chilled.skew.org> <1032818077.3243.7899.camel@malatesta> <15760.28591.497946.198455@grendel.zope.com> <15760.28950.855181.718163@magrathea.basistech.com> Message-ID: <3D90749C.7000201@lemburg.com> Tom Emerson wrote: > Fred L. Drake, Jr. writes: > >>I've just added a note to the docs for Python 2.2.2 and 2.3 that len() >>returns the number of storage units, not abstract characters. I don't >>expect that to change given that it's been doing it that way since the >>Unicode type was introduced. > > > Since this appears to be a point of some confusion to people who > aren't indoctrinated, perhaps a discussion needs to be put into the > documentation (extracted from the PEP or written anew) about the > reasons for this. Or is there a feeling that such detail doesn't > belong in the "regular" docs? It would probably help to at least raise the issue in the Unicode docs and make the reader aware of the differences between code units, code points and graphemes. 
len() traditionally refers to the number of items stored in a sequence, so in Unicode terms it returns the number of code units stored in the Unicode object. The same is true for indexing: u[i] will give you the i-th code unit, not necessarily the i-th code point or even i-th grapheme. Depending on how you view this, you could say that any given Unicode implementation is a variable length encoding of graphemes -- the talk I referenced earlier in this thread has a slide explaining this. Would be nice to have a Unicode indexing module which provides different indexing and length measuring methods than just code units. Here's a PEP I started for this last year but which never got finished: PEP: 0XXX Title: Unicode Indexing Helper Module Version: $Revision: 1.0 $ Author: mal@lemburg.com (Marc-André Lemburg) Status: Draft Type: Standards Track Python-Version: 2.3 Created: 06-Jun-2001 Post-History: Abstract This PEP proposes a new module "unicodeindex" which provides means to index Unicode objects in various higher level abstractions of "characters". Problem and Terminology Unicode objects can be indexed just like string object using what in Unicode terms is called a code unit as index basis. Code units are the storage entities used by the Unicode implementation to store a single Unicode information unit and do not necessarily map 1-1 to code points which are the smallest entities encoded by the Unicode standard. Python exposes code units to the programmer via the Unicode object indexing and slicing API, e.g. u[10] or u[12:15] refer to the code units at index 10 and indices 12 to 14. These code points can sometimes be composed to form graphemes which are then displayed by the Unicode output device as one character. A word is then a sequence of characters separated by space characters or punctuation, a line is a sequence of code points separated by line breaking code point sequences.
For addressing Unicode, there are basically five different methods by which you can reference the data: 1. per code unit (codeunit) 2. per code point (codepoint) 3. per grapheme (grapheme) 4. per word (word) 5. per line (line) The indexing type name is given in parenthesis and used in the module interface. Proposed Solution I propose to add a new module to the standard Python library which provides interfaces implementing the above indexing methods. Module Interface The module should provide the following interfaces for all four indexing styles: next_<indextype>(u, index) -> integer Returns the Unicode object index for the start of the next <indextype> found after u[index] or -1 in case no next element of this type exists. prev_<indextype>(u, index) -> integer Returns the Unicode object index for the start of the previous <indextype> found before u[index] or -1 in case no previous element of this type exists. <indextype>_index(u, n) -> integer Returns the Unicode object index for the start of the n-th element in u. Raises an IndexError in case no n-th element can be found. <indextype>_count(u, index) -> integer Counts the number of complete elements found in u[:index] and returns the count as integer. <indextype>_start(u, index) -> integer Returns 1 or 0 depending on u[index] marks the start of an element. <indextype>_end(u, index) -> integer Returns 1 or 0 depending on u[index] marks the end of an element. <indextype>_slice(u, index) -> slice object or None Returns the slice pointing to the element found in u at the given index or None in case no such element can be found at that position. Symbols used in the above definitions: <indextype> one of: codeunit, codepoint, grapheme, word, line u is the Unicode object index the Unicode object index, e.g. 10 in u[10] n is an integer Note that in Unicode terms, the Unicode object index refers to a code unit. Copyright This document has been placed in the public domain.
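As a rough illustration of what the code point flavour of the interface above might look like, here is a minimal sketch. It is not the PEP's implementation (none was posted in this thread). It assumes Python 3, models the Unicode object as a plain list of UTF-16 code units, and treats n as zero-based, which the draft does not specify. The helpers to_code_units, is_high_surrogate and is_low_surrogate are hypothetical names introduced only for this example.

```python
import struct


def to_code_units(text):
    """Split a Python 3 str into a list of UTF-16 code units (ints)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def codepoint_start(u, index):
    """Return 1 if u[index] starts a code point, else 0."""
    # A low surrogate directly after a high surrogate continues a pair.
    if index > 0 and is_low_surrogate(u[index]) and is_high_surrogate(u[index - 1]):
        return 0
    return 1


def codepoint_count(u, index):
    """Count the complete code points found in u[:index]."""
    count = sum(codepoint_start(u, i) for i in range(index))
    # If the slice ends between the two halves of a pair, the last
    # code point that was started is incomplete and must not be counted.
    if 0 < index < len(u) and not codepoint_start(u, index):
        count -= 1
    return count


def codepoint_index(u, n):
    """Return the code unit index where the n-th (zero-based) code point starts."""
    seen = -1
    for i in range(len(u)):
        if codepoint_start(u, i):
            seen += 1
            if seen == n:
                return i
    raise IndexError("no code point number %d" % n)


units = to_code_units("a\U00010800b")   # 4 code units, 3 code points
print(codepoint_count(units, len(units)), codepoint_index(units, 2))
```

Unpaired surrogates are treated here as code points of their own; a stricter helper might reject them instead.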
Local Variables: mode: indented-text indent-tabs-mode: nil End: -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From fredrik@pythonware.com Tue Sep 24 15:34:16 2002 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 24 Sep 2002 16:34:16 +0200 Subject: [XML-SIG] Re: Issues with Unicode type References: <200209232138.g8NLcrXL071961@chilled.skew.org><1032818077.3243.7899.camel@malatesta> <15760.28591.497946.198455@grendel.zope.com> Message-ID: <035301c263d7$782982a0$0900a8c0@spiff> fred wrote: > I've just added a note to the docs for Python 2.2.2 and 2.3 that len() > returns the number of storage units, not abstract characters. imo (as the original author of the unicode type), that's an implementation artifact, not a feature. > I don't expect that to change given that it's been doing it that way since > the Unicode type was introduced. the original Unicode type used UCS-2 for internal storage, and all string operations worked on code points. adding UTF-16 support in a couple of places doesn't really change that; an UTF-16-encoded unicode string should be treated just like an encoded 8-bit string -- standard string operations are not guaranteed to work on encoded strings. (if we document all bugs and half-baked solutions as supported features, we will never be able to fix anything...) From martin@v.loewis.de Tue Sep 24 15:38:28 2002 From: martin@v.loewis.de (Martin v.
Loewis) Date: 24 Sep 2002 16:38:28 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <15760.29299.779002.254777@grendel.zope.com> References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> <15760.29299.779002.254777@grendel.zope.com> Message-ID: "Fred L. Drake, Jr." writes: > I'm sure a small C extension could provide the needed helpers quite > efficiently. Even with a UCS-4 version of Python, a Unicode literal > containing a surrogate pair (explicitly, using two \u sequences) will > exhibit the behavior that Eric wants to see suppressed. Of course, producing such a literal is an application error. Regards, Martin From martin@v.loewis.de Tue Sep 24 15:41:06 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 24 Sep 2002 16:41:06 +0200 Subject: [XML-SIG] Issues with Unicode (wrap-up and moving along) In-Reply-To: <1032856249.22170.116.camel@ibook> References: <1032856249.22170.116.camel@ibook> Message-ID: Eric van der Vlist writes: > ImportError: /usr/lib/python2.2/site-packages/Ft/Xml/cDomlettec.so: > undefined symbol: PyUnicodeUCS2_AsEncodedString > > Should I fill a bug :-) ? No. It is by design that you can't use narrow-unicode extension modules in a wide-unicode build. Extensions may access the internal representation of a Unicode string, and thus crash if silently accepted. Regards, Martin From uche.ogbuji@fourthought.com Tue Sep 24 15:58:36 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: 24 Sep 2002 08:58:36 -0600 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <200209240819.g8O8JuR08470@indus.ins.cwi.nl> References: <20020923175925.Q5635@redhat.com> <200209240819.g8O8JuR08470@indus.ins.cwi.nl> Message-ID: <1032879518.1908.10920.camel@malatesta> On Tue, 2002-09-24 at 02:19, Sjoerd Mullender wrote: > Nobody seems to have bothered looking at the two characters produced > by u'\u10800'. 
I'd say: try it: Man, you Europeans are just *sooooo* yesterday ;-) I sorted this out later on in the thread. -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html From martin@v.loewis.de Tue Sep 24 16:10:58 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 24 Sep 2002 17:10:58 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <035301c263d7$782982a0$0900a8c0@spiff> References: <200209232138.g8NLcrXL071961@chilled.skew.org> <1032818077.3243.7899.camel@malatesta> <15760.28591.497946.198455@grendel.zope.com> <035301c263d7$782982a0$0900a8c0@spiff> Message-ID: "Fredrik Lundh" writes: > > I've just added a note to the docs for Python 2.2.2 and 2.3 that len() > > returns the number of storage units, not abstract characters. > > imo (as the original author of the unicode type), that's an implementation > artifact, not a feature. Yes, but it still needs to be documented. > (if we document all bugs and half-baked solutions as supported features, > we will never be able to fix anything...) Can you present similar problems where it is clear that a proper solution won't be implemented in any foreseeable feature, yet this is not documented? Regards, Martin From fdrake@acm.org Tue Sep 24 16:27:20 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 24 Sep 2002 11:27:20 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> <15760.29299.779002.254777@grendel.zope.com> Message-ID: <15760.33880.359177.435704@grendel.zope.com> Martin v. 
Loewis writes: > Of course, producing such a literal is an application error. That depends on whether you think a Unicode literal is supposed to contain a sequence of characters or a sequence of code units. I suspect this is entirely application-dependent. I can see where it could come in handy when writing test cases, so it may be appropriate at other times as well. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From mal@lemburg.com Tue Sep 24 16:54:34 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 24 Sep 2002 17:54:34 +0200 Subject: [XML-SIG] Re: Issues with Unicode type References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> <15760.29299.779002.254777@grendel.zope.com> <15760.33880.359177.435704@grendel.zope.com> Message-ID: <3D908ABA.3020603@lemburg.com> Fred L. Drake, Jr. wrote: > Martin v. Loewis writes: > > Of course, producing such a literal is an application error. > > That depends on whether you think a Unicode literal is supposed to > contain a sequence of characters or a sequence of code units. I > suspect this is entirely application-dependent. I can see where it > could come in handy when writing test cases, so it may be appropriate > at other times as well. Right. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... 
Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From rodsenra@gpr.com.br Tue Sep 24 19:25:09 2002 From: rodsenra@gpr.com.br (Rodrigo Senra) Date: Tue, 24 Sep 2002 15:25:09 -0300 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <3D90190D.9050107@lemburg.com> References: <200209232138.g8NLcrXL071961@chilled.skew.org> <1032818077.3243.7899.camel@malatesta> <3D90190D.9050107@lemburg.com> Message-ID: <20020924182129.EB0FC684D2@pavuna.terra.com.br> On Tue, 24 Sep 2002 09:49:33 +0200 "M.-A. Lemburg" wrote about Re: [XML-SIG] Re: Issues with Unicode type: -------------------------------------------- | | See my Unicode talk for details on the different terms: | | http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf | I'm afraid it is not *yet* available ;o) wget http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf --15:20:37-- http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf => `Unicode-EPC2002-Talk.pdf' Resolving www.egenix.com... done. Connecting to www.egenix.com[217.115.138.139]:80... connected. HTTP request sent, awaiting response... 403 Forbidden 15:20:39 ERROR 403: Forbidden. regards, Senra rodrigo.senra@ic.unicamp.br From brian@sweetapp.com Tue Sep 24 19:38:08 2002 From: brian@sweetapp.com (Brian Quinlan) Date: Tue, 24 Sep 2002 11:38:08 -0700 Subject: [XML-SIG] ANN: Pyana 0.6.0 Message-ID: <001c01c263f9$879487e0$df7e4e18@brianspiv1700> Pyana 0.6.0 has been released. Source and binary distributions are available at: http://sourceforge.net/project/showfiles.php?group_id=28142 Pyana is an extension module that allows Python scripts to access the Apache Group's Xalan XSLT transformation engine. For usage examples and other information, see: http://pyana.sourceforge.net/ What's new in this release? - Updated to use Xalan 1.4/Xerces 2.1 - A default ErrorHandler and ProblemListener are automatically installed when a Transformer instance is created. 
  The global variables defaultErrorHandlerFactory and
  defaultProblemListenerFactory determine which classes are used.
- Messages generated with xsl:message now appear correctly
- Exception decoding is now done using a preprocessor macro instead of
  an include file. This should fix a build problem on some platforms
  e.g. HP-UX.
- Better exception decoding of XSLExceptions

Cheers,
Brian

From fdrake@acm.org Tue Sep 24 19:42:35 2002
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Tue, 24 Sep 2002 14:42:35 -0400
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: <15760.28950.855181.718163@magrathea.basistech.com>
References: <200209232138.g8NLcrXL071961@chilled.skew.org> <1032818077.3243.7899.camel@malatesta> <15760.28591.497946.198455@grendel.zope.com> <15760.28950.855181.718163@magrathea.basistech.com>
Message-ID: <15760.45595.411596.618828@grendel.zope.com>

Tom Emerson writes:
 > Since this appears to be a point of some confusion to people who
 > aren't indoctrinated, perhaps a discussion needs to be put into the
 > documentation (extracted from the PEP or written anew) about the
 > reasons for this. Or is there a feeling that such detail doesn't
 > belong in the "regular" docs?

Certainly a fair portion of this should be discussed in more detail.
I'm trying to figure out where it should all go; suggestions beyond
"with the Unicode documentation" are welcome. (I know it should go
with the Unicode docs, it's just that we don't really *have* a
specific location for Unicode docs yet. ;-[ )

  -Fred

-- 
Fred L. Drake, Jr.
PythonLabs at Zope Corporation

From martin@v.loewis.de Tue Sep 24 20:03:05 2002
From: martin@v.loewis.de (Martin v.
Loewis) Date: 24 Sep 2002 21:03:05 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <15760.45595.411596.618828@grendel.zope.com> References: <200209232138.g8NLcrXL071961@chilled.skew.org> <1032818077.3243.7899.camel@malatesta> <15760.28591.497946.198455@grendel.zope.com> <15760.28950.855181.718163@magrathea.basistech.com> <15760.45595.411596.618828@grendel.zope.com> Message-ID: "Fred L. Drake, Jr." writes: > Certainly a fair portion of this should be discussed in more detail. > I'm trying to figure out where it should all go; suggestions beyond > "witht the Unicode documentation" are welcome. I'd recommend to place a fairly elaborate text with Unicode literals. This can mention the two forms of Python builds while explaining why len(u"\U00xxyyyy") might be 2. Then, there should be a Unicode section in builtin types, which explains the notion of encodings, and the directions in which .encode and .decode operate (and the relationship to the unicode builtin). Furthermore, the codecs module should: - provide a list of codecs included in a certain Python release, - possibly provide a list of recognized aliases, - explain the notion of error handling, and, for 2.3, the extensibility thereof. Regards, Martin From fdrake@acm.org Tue Sep 24 21:10:14 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 24 Sep 2002 16:10:14 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: <200209232138.g8NLcrXL071961@chilled.skew.org> <1032818077.3243.7899.camel@malatesta> <15760.28591.497946.198455@grendel.zope.com> <15760.28950.855181.718163@magrathea.basistech.com> <15760.45595.411596.618828@grendel.zope.com> Message-ID: <15760.50854.382923.93504@grendel.zope.com> Martin v. Loewis writes: > I'd recommend to place a fairly elaborate text with Unicode > literals. This can mention the two forms of Python builds while > explaining why len(u"\U00xxyyyy") might be 2. 
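
A minimal sketch of why the length of such a literal can come out as 2. It is written for Python 3, which is an assumption made for illustration (the thread concerns Python 2.2 builds), and the helper names utf16_code_units and code_points are hypothetical, not part of any library. Roughly speaking, a narrow (UTF-16) build reports the first count from len(), while a wide (UCS-4) build reports the second.

```python
def utf16_code_units(text):
    """Count the 16-bit code units needed to store text as UTF-16."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" adds no byte order mark.
    return len(text.encode("utf-16-le")) // 2


def code_points(text):
    """Count Unicode code points (what len() returns in Python 3)."""
    return len(text)


sample = "\U00010000"  # first code point outside the Basic Multilingual Plane
print(utf16_code_units(sample), code_points(sample))
```

For the sample above the two counts are 2 and 1; for text that stays inside the BMP they agree.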
I presume you're referring to the language reference, section 2.4.1, which covers all string literals? > Then, there should be a Unicode section in builtin types, which > explains the notion of encodings, and the directions in which .encode > and .decode operate (and the relationship to the unicode builtin). Ok. > Furthermore, the codecs module should: > - provide a list of codecs included in a certain Python release, > - possibly provide a list of recognized aliases, > - explain the notion of error handling, and, for 2.3, the > extensibility thereof. Yep. Sounds like a plan. I won't get it done today, but I'll try to repair my recent change by moving it out of the description of the len() function, which really is not the right place for it. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From martin@v.loewis.de Tue Sep 24 21:32:00 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 24 Sep 2002 22:32:00 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <15760.50854.382923.93504@grendel.zope.com> References: <200209232138.g8NLcrXL071961@chilled.skew.org> <1032818077.3243.7899.camel@malatesta> <15760.28591.497946.198455@grendel.zope.com> <15760.28950.855181.718163@magrathea.basistech.com> <15760.45595.411596.618828@grendel.zope.com> <15760.50854.382923.93504@grendel.zope.com> Message-ID: "Fred L. Drake, Jr." writes: > Martin v. Loewis writes: > > I'd recommend to place a fairly elaborate text with Unicode > > literals. This can mention the two forms of Python builds while > > explaining why len(u"\U00xxyyyy") might be 2. > > I presume you're referring to the language reference, section 2.4.1, > which covers all string literals? Actually, I'd add a section 2.4.3, "Unicode literals", which logical fits there (IMO) after the concatenation section (actually, concatenation of Unicode and non-Unicode literals is underspecified, too...) For 2.4.3, the \N notation should get more prominent notation. 
It is probably not appropriate to list all accepted character names,
but it should refer to the relevant database
(http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt - I
*think*; I'm not sure which version was used to generate the 2.2 data
- that ought to be documented).

Regards,
Martin

From noreply@sourceforge.net Tue Sep 24 22:14:13 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Tue, 24 Sep 2002 14:14:13 -0700
Subject: [XML-SIG] [ pyxml-Bugs-614049 ] expatbuilder Unicode filenames on Win32
Message-ID:

Bugs item #614049, was opened at 2002-09-24 21:14
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=106473&aid=614049&group_id=6473

Category: DOM
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Brian Lenihan (brianl)
Assigned to: Nobody/Anonymous (nobody)
Summary: expatbuilder Unicode filenames on Win32

Initial Comment:
Trent Mick's go.py broke after installing PyXML 0.8.1 for Python 2.2.1

C:\Python22>go d
Traceback (most recent call last):
  File "C:\Python22\go.py", line 435, in ?
    sys.exit( main(sys.argv) )
  File "C:\Python22\go.py", line 380, in main
    generateShellScript(shellScript) # no-op, overwrite old one
  File "C:\Python22\go.py", line 302, in generateShellScript
    shortcuts = getShortcuts()
  File "C:\Python22\go.py", line 291, in getShortcuts
    dom = xml.dom.minidom.parse(shortcutsXml)
  File "C:\Python22\Lib\site-packages\_xmlplus\dom\minidom.py", line 1595, in parse
    return expatbuilder.parse(file)
  File "C:\Python22\lib\site-packages\_xmlplus\dom\expatbuilder.py", line 932, in parse
    result = builder.parseFile(file)
  File "C:\Python22\lib\site-packages\_xmlplus\dom\expatbuilder.py", line 170, in parseFile
    buffer = file.read(16*1024)
AttributeError: 'unicode' object has no attribute 'read'

getShortcuts is using win32.shell.SHGetFolderPath to find the
shortcuts.xml file and it returns a Unicode string.
Lines 926 and 959 in expatbuilder.py both test whether they have been
handed a string or a file like this:

    if isinstance(file, type('')):
        fp = open(file, 'rb')
        result = builder.parseFile(fp)
        fp.close()
    else:
        result = builder.parseFile(file)

I worked around it by adding

    or isinstance(file, type(u''))

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=106473&aid=614049&group_id=6473

From mal@lemburg.com Tue Sep 24 22:55:35 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 24 Sep 2002 23:55:35 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
References: <200209232138.g8NLcrXL071961@chilled.skew.org> <1032818077.3243.7899.camel@malatesta> <3D90190D.9050107@lemburg.com> <20020924182129.EB0FC684D2@pavuna.terra.com.br>
Message-ID: <3D90DF57.7020108@lemburg.com>

Rodrigo Senra wrote:
> On Tue, 24 Sep 2002 09:49:33 +0200
> "M.-A. Lemburg" wrote
> about Re: [XML-SIG] Re: Issues with Unicode type:
> --------------------------------------------
> |
> | See my Unicode talk for details on the different terms:
> |
> | http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf
> |
>
> I'm afraid it is not *yet* available ;o)
>
> wget http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf
> --15:20:37-- http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf
> => `Unicode-EPC2002-Talk.pdf'
> Resolving www.egenix.com... done.
> Connecting to www.egenix.com[217.115.138.139]:80... connected.
> HTTP request sent, awaiting response... 403 Forbidden
> 15:20:39 ERROR 403: Forbidden.

Yeah, was a permissions problem. Should work now.
> regards, > Senra > rodrigo.senra@ic.unicamp.br > > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://mail.python.org/mailman/listinfo/xml-sig -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From uche.ogbuji@fourthought.com Wed Sep 25 00:52:21 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 24 Sep 2002 17:52:21 -0600 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: Message from "Fred L. Drake, Jr." of "Tue, 24 Sep 2002 10:10:59 EDT." <15760.29299.779002.254777@grendel.zope.com> Message-ID: > > Martin v. Loewis writes: > > 3. Implement it properly. Please understand that you will be trading > > efficiency for correctness. > > I'm sure a small C extension could provide the needed helpers quite > efficiently. Even with a UCS-4 version of Python, a Unicode literal > containing a surrogate pair (explicitly, using two \u sequences) will > exhibit the behavior that Eric wants to see suppressed. Yes. That was what I figured to in my recent rumination on such literals. My conclusion was *never* to use "naked" surrogate pairs in Unicode literals, even with UTF-16 Python. I get the sense this is a "best practice" that should be clearly articulated: Do *not* express Unicode literals using direct UTF-16 surrogate pairs, e.g. u"\uD800\uDC00". *Always* use the high-order unicode literal character form (big-U notation), e.g. u"\U00010000". Unless someone weighs in with reasoning against this, I'll plan to add something to this effect to the Akara. -- Uche Ogbuji Fourthought, Inc. 
http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html From oberscheid@doctronic.de Wed Sep 25 08:13:04 2002 From: oberscheid@doctronic.de (Carsten Oberscheid) Date: Wed, 25 Sep 2002 09:13:04 +0200 Subject: [XML-SIG] saxutils.XMLGenerator: Output encoding Message-ID: <20020925071303.GY30717@doctronic.de> Hello everybody, I have not followed this list for some time, so this may have been discussed before: to the XMLGenerator, an output encoding can be given. All output is then written through saxutils.escape() using this encoding. As a result, any character in the document that can not be represented in the output encoding raises a UnicodeException. So one single special character in a file can force me to produce UTF-8 encoding, although for further processing ISO 8859-1 or even ASCII would be much more handy. An alternative would be to catch the UnicodeException and, as a reaction, encode the offensive characters as character references (e.g. "“"). Shouldn't this be the XML way to do it? I can provide a very primitive patch for saxutils.py, if anybody is interested. I even would try to make it less primitive, if there are no objections against taking this fix into the distribution :^) Thanks for your feedback .co. -- carsten oberscheid d o c t r o n i c email oberscheid@doctronic.de information publishing + retrieval phone +49 2222 9292 90 http://www.doctronic.de From vdv@dyomedea.com Wed Sep 25 10:24:19 2002 From: vdv@dyomedea.com (Eric van der Vlist) Date: 25 Sep 2002 11:24:19 +0200 Subject: [XML-SIG] Weirdness (bug?) 
 with smart_len (wasRe: Issues with Unicode type)
In-Reply-To:
References:
Message-ID: <1032945860.12566.23.camel@ibook>

On Mon, 2002-09-23 at 23:16, Uche Ogbuji wrote:

> Oh, but then Python is so much simpler:
>
>
> SP_PAT = re.compile(u"[\uD800-\uDBFF][\uDC00-\uDFFF]")
> def smart_len(u):
>     sp_count = len(SP_PAT.findall(u))
>     return len(u) - sp_count
>

I am trying to use this when python is compiled with ucs2, but I am
seeing a weird behavior when using this function: it seems that it can't
stand being compiled as a .pyc!

I have:

test.py:

#!/usr/bin/env python
import Smart_len
print Smart_len.smart_len(u'\U00010800')

and Smart_len.py:

import re
SP_PAT = re.compile(u"[\uD800-\uDBFF][\uDC00-\uDFFF]")
def smart_len(u):
    sp_count = len(SP_PAT.findall(u))
    return len(u) - sp_count

It's working the 1st time (or when I remove Smart_len.pyc) but fails
after the second execution:

vdv@ibook:~/xmlschemata-cvs/downloads/python/xvif$ rm Smart_len.pyc
vdv@ibook:~/xmlschemata-cvs/downloads/python/xvif$ ./test.py
1
vdv@ibook:~/xmlschemata-cvs/downloads/python/xvif$ ./test.py
Traceback (most recent call last):
  File "./test.py", line 2, in ?
    import Smart_len
UnicodeError: UTF-8 decoding error: unexpected code byte

Weird, isn't it?

Thanks

Eric
-- 
Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

From vdv@dyomedea.com Wed Sep 25 11:13:47 2002
From: vdv@dyomedea.com (Eric van der Vlist)
Date: 25 Sep 2002 12:13:47 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To:
References:
Message-ID: <1032948827.12566.157.camel@ibook>

On Wed, 2002-09-25 at 01:52, Uche Ogbuji wrote:
> >
> > Martin v. Loewis writes:
> > > 3. Implement it properly.
> > > Please understand that you will be trading
> > > efficiency for correctness.
> >
> > I'm sure a small C extension could provide the needed helpers quite
> > efficiently. Even with a UCS-4 version of Python, a Unicode literal
> > containing a surrogate pair (explicitly, using two \u sequences) will
> > exhibit the behavior that Eric wants to see suppressed.
>
> Yes. That was what I figured to in my recent rumination on such literals. My
> conclusion was *never* to use "naked" surrogate pairs in Unicode literals,
> even with UTF-16 Python. I get the sense this is a "best practice" that
> should be clearly articulated:
>
> Do *not* express Unicode literals using direct UTF-16 surrogate pairs, e.g.
> u"\uD800\uDC00". *Always* use the high-order unicode literal character form
> (big-U notation), e.g. u"\U00010000".

I am not 100% sure if this is the same issue, but the script [1] with
the definition of the XML productions generated by chargen [2] which I
am using in my implementation doesn't seem to work correctly on a Python
interpreter compiled with ucs4.

[1] http://downloads.xmlschemata.org/python/xvif/characters.py
[2] http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/pyxml/xml/utils/xmlchargen.py

What makes me say that is the fact that with a Python interpreter
compiled with ucs4, my Relax NG implementation doesn't catch any longer
incorrect XML names such as u'\u0E35' while this is working fine with
the same version compiled for ucs2.

This can be checked quite easily:

1) with a ucs2 interpreter:

vdv@ibook:~/xmlschemata-cvs/downloads/python/xvif$ python
Python 2.2.1 (#1, Sep 13 2002, 22:38:05)
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import characters
>>> print characters.re_NCName().match(u'\u0E35')
None

2) with a ucs4 interpreter:

vdv@ibook:~/xmlschemata-cvs/downloads/python/xvif$ python
Python 2.2.1 (#5, Sep 25 2002, 11:18:57)
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import characters
>>> print characters.re_NCName().match(u'\u0E35')
<_sre.SRE_Match object at 0x10068670>

Does that mean that chargen.py should be rewritten for ucs4? Could a
single avoiding surrogates version handle both?

Thanks

Eric

PS: if someone could help me with chargen.py which looks like black
magic to me, I would really appreciate!
-- 
Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

From fdrake@acm.org Wed Sep 25 14:11:56 2002
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Wed, 25 Sep 2002 09:11:56 -0400
Subject: [XML-SIG] Re: [XML-checkins]xml/xml/dom expatbuilder.py,1.24,1.25
In-Reply-To:
References:
Message-ID: <15761.46620.734779.200240@grendel.zope.com>

Martin v. Loewis writes:
 > > !             if isinstance(file, StringTypes):
 >
 > As another note, I think this won't work in Python 2.0: in that
 > version, isinstance allows a type as the second argument only.

Except that the compatibility version defined in minicompat is being
used in this case, and it does accept a tuple as the second arg.

And I should be getting StringTypes from minicompat; I'll correct that
when I get to the office and can run the tests.

Thanks for catching this!

  -Fred

-- 
Fred L. Drake, Jr.
PythonLabs at Zope Corporation

From tpassin@comcast.net Wed Sep 25 14:15:58 2002
From: tpassin@comcast.net (Thomas B.
 Passin)
Date: Wed, 25 Sep 2002 09:15:58 -0400
Subject: [XML-SIG] Weirdness (bug?) with smart_len (wasRe: Issues with Unicode type)
References: <1032945860.12566.23.camel@ibook>
Message-ID: <000601c26495$b022ca90$fe193044@tbp1>

I can confirm this behavior - tested on Win2000 with Python 2.2. But if the
same code is all in one module it does not fail on the repeated attempts,
only when the function def is imported from another module as Eric has it.

Eric is also right about the .pyc (and .pyo which I also tried) - if you
delete it the next execution succeeds.

Furthermore, it is specifically the regular expression that does it, not its
being called in the function. This is easy to show - if you change the code
so that it is not used, the failure does not happen:

import re
#SP_PAT = re.compile(u"[\uD800-\uDBFF][\uDC00-\uDFFF]")
def smart_len(u):
    #sp_count = len(SP_PAT.findall(u))
    return 0
    # return len(u) - sp_count

Now just uncomment the SP_PAT line - leaving it still unused - and presto!
The failure returns.

Now that is really strange! Hope someone who knows Python really well can
explain this and help get it fixed.

Cheers,

Tom P

[Eric van der Vlist]
[[
I am trying to use this when python is compiled with ucs2, but I am
seeing a weird behavior when using this function: it seems that it can't
stand being compiled as a .pyc!

I have:

test.py:

#!/usr/bin/env python
import Smart_len
print Smart_len.smart_len(u'\U00010800')

and Smart_len.py:

import re
SP_PAT = re.compile(u"[\uD800-\uDBFF][\uDC00-\uDFFF]")
def smart_len(u):
    sp_count = len(SP_PAT.findall(u))
    return len(u) - sp_count

It's working the 1st time (or when I remove Smart_len.pyc) but fails
after the second execution:
]]

From larsga@garshol.priv.no Wed Sep 25 16:15:52 2002
From: larsga@garshol.priv.no (Lars Marius Garshol)
Date: 25 Sep 2002 17:15:52 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: <20020923173203.O5635@redhat.com>
References: <20020923173203.O5635@redhat.com>
Message-ID:

* Uche Ogbuji
|
| SP_PAT = re.compile(u"[\uD800-\uDBFF][\uDC00-\uDFFF]")
| def smart_len(u):
|     sp_count = len(SP_PAT.findall(u))
|     return len(u) - sp_count
|
|
| Problem solved.

In a sense. You now have len(u""), which counts code units (thus
giving different results in UTF-16 and UTF-32 builds) and
smart_len(u""), which counts characters, and thus always gives the
same result.

Java has the same problem, in that length() there counts code units,
but on the other hand String is defined to always contain UTF-16 code
units.

Note that this problem is also inherent in the XML family of
specifications. The DOM 1.0 definition of string was broken, while the
2.0 one equates strings with arrays of UTF-16 code units. In XPath, on
the other hand, strings consist of abstract Unicode characters...

Note also that there is one further problem. How long is this string

  u"\u0041\u030A"

according to RELAX/XPath/XSDL?

* Daniel Veillard
|
| I don't think chars are classes but types, and hence one cannot make
| a subclass of strings whose instances could have all
| length/walk/extract operations being special cased to reflect XML
| unicode string. I (and Eric I bet) would like to be wrong on this
| :-)

This has nothing to do with XML, it's just that XML is one of the few
technologies that are sufficiently modern to make this problem show up.
If you want to have proper Unicode support in any application you will run into this problem. The problem here is that the UTF-16 == Unicode assumption is built into all sorts of technologies, from Python to Java to Ada-95 to Win32 to DOM 2.0 to ..., and in most cases people are not even aware of the problem. -- Lars Marius Garshol, Ontopian ISO SC34/WG3, OASIS GeoLang TC From larsga@garshol.priv.no Wed Sep 25 16:16:09 2002 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 25 Sep 2002 17:16:09 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: Message-ID: * Uche Ogbuji | | Right, Jeremy. I wasn't squinting hard enough at Daniel's example. | In my own examples, I've been using | | u"\U00010000" | | or | | u"\uD800\uDC00" | | [...] | | I'm not sure whether they would be equivalent if Python is compiled | for UCS-4 Python needs to decide what the \uXXXX escape syntax is referring to: UTF-16 code units or Unicode code points. If the former, the first example should be illegal. If the latter, the second example is highly dubious (it's referring to unassigned code points that have a special meaning in one of the encodings). I'm not sure whether the second should be outlawed, but it probably should be. It's a sure way to create problems for yourself and if the Unicode strings actually contain Unicode characters those values are not legal. | (BTW, there is no diff between UTF-32 and UCS-4, is there?). UTF-32 is Unicode, UCS-4 is ISO 10646. The Unicode code space used to be more restricted than the ISO 10646 one, which ISO was supposed to fix. Not sure whether that fix has gone through yet, but probably it has. Once it has there will be no difference. | I would imagine Python would blindly create 2 pseudo code points | D800 and DC00. I say "pseudo" since, because these values are in | the surrogate blocks, they are not valid characters in themselves. Yup. 
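
A small sketch of the "pseudo code points" point, written for Python 3 as an assumption made for illustration (the thread is about Python 2.2). Python 3 strings are sequences of code points, so in this respect they behave like the UCS-4 case discussed here: two \u escapes stay two surrogate code points. The helper join_surrogates is a hypothetical name, not a library function.

```python
pair = "\ud800\udc00"    # two separate surrogate code points
single = "\U00010000"    # one supplementary code point


def join_surrogates(text):
    """Combine well-formed surrogate pairs into single code points."""
    # "surrogatepass" lets the encoder emit surrogate code points as-is, so
    # the UTF-16 decoder can then read each pair back as one character.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


print(len(pair), len(single), pair == single, join_surrogates(pair) == single)
```

The two literals have different lengths and do not compare equal until the pair is explicitly joined, which is one reason to prefer the big-U form in source code.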
| Which leads me to believe that even though u"\uD800\uDC00" would be | treated equivalently to u"\U00010000" as long as Python is compiled | for UTF-16, that it is a *very* bad idea to write unicode literals | that way. Yup. -- Lars Marius Garshol, Ontopian ISO SC34/WG3, OASIS GeoLang TC From larsga@garshol.priv.no Wed Sep 25 16:20:01 2002 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 25 Sep 2002 17:20:01 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> Message-ID: * Martin v. Loewis | | 1. Ignore the problem. This is probably fine: nobody is using non-BMP | characters right now. Most systems have serious problem displaying | them, since font systems are restricted to 64k glyphs, and, in many | cases, to displaying characters in the BMP only. Actually, Windows 2000 displays non-BMP characters just fine. MSIE can be made to do it, Opera 6.0 does it just fine, Mozilla does not (I think) do it. Also, there are locales where non-BMP characters are essential. Cantonese is probably the best example. You can't write the Cantonese equivalent of the "-ing" ending in Cantonese with the BMP... Getting this right is actually more than purely an exercise in conformance, though as you say it is less important now than it will be in 1-2 years. | 2. Declare that this works correctly in UCS-4 builds of Python | only. People that need such characters will use an UCS-4 build of | Python, anyway; Guido expects Chinese users to be early adaptors | here. Notice that James has no such option: Java is inherently tied | to UTF-16. Is the plan that Python will eventually be UCS-4 only? | 3. Implement it properly. Please understand that you will be trading | efficiency for correctness. 
:-) -- Lars Marius Garshol, Ontopian ISO SC34/WG3, OASIS GeoLang TC From larsga@garshol.priv.no Wed Sep 25 16:20:55 2002 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 25 Sep 2002 17:20:55 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <15760.33880.359177.435704@grendel.zope.com> References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> <15760.29299.779002.254777@grendel.zope.com> <15760.33880.359177.435704@grendel.zope.com> Message-ID: * Martin v. Loewis | | Of course, producing such a literal is an application error. * Fred L. Drake, Jr. | | That depends on whether you think a Unicode literal is supposed to | contain a sequence of characters or a sequence of code units. I | suspect this is entirely application-dependent. I can see where it | could come in handy when writing test cases, so it may be | appropriate at other times as well. I think probably it would be best to disallow such literals. Those doing test cases should find other ways to do them. -- Lars Marius Garshol, Ontopian ISO SC34/WG3, OASIS GeoLang TC From mal@lemburg.com Wed Sep 25 16:48:53 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 25 Sep 2002 17:48:53 +0200 Subject: [XML-SIG] Re: Issues with Unicode type References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> Message-ID: <3D91DAE5.1090009@lemburg.com> Lars Marius Garshol wrote: > Is the plan that Python will eventually be UCS-4 only? Eventually, yes, but this can take some time -- cheaper memory, faster machines, etc. 
For now we have the compile time option and since RedHat chose to activate it, there's a good chance that we'll get forced to do the same sooner rather than later ;-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From larsga@garshol.priv.no Wed Sep 25 16:51:16 2002 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 25 Sep 2002 17:51:16 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <3D91DAE5.1090009@lemburg.com> References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> <3D91DAE5.1090009@lemburg.com> Message-ID: * Lars Marius Garshol | | Is the plan that Python will eventually be UCS-4 only? * mal@lemburg.com | | Eventually, yes, but this can take some time -- cheaper memory, | faster machines, etc. For now we have the compile time option and | since RedHat chose to activate it, there's a good chance that we'll | get forced to do the same sooner rather than later ;-) That's good news (both items :-). Of course, the abstract character issue remains. Are we likely to see support for normalization in the Python C core any time soon? Specifically Normalization Form D... -- Lars Marius Garshol, Ontopian ISO SC34/WG3, OASIS GeoLang TC From mal@lemburg.com Wed Sep 25 17:17:23 2002 From: mal@lemburg.com (M.-A. 
 Lemburg)
Date: Wed, 25 Sep 2002 18:17:23 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> <3D91DAE5.1090009@lemburg.com>
Message-ID: <3D91E193.3030904@lemburg.com>

Lars Marius Garshol wrote:
> * Lars Marius Garshol
> |
> | Is the plan that Python will eventually be UCS-4 only?
>
> * mal@lemburg.com
> |
> | Eventually, yes, but this can take some time -- cheaper memory,
> | faster machines, etc. For now we have the compile time option and
> | since RedHat chose to activate it, there's a good chance that we'll
> | get forced to do the same sooner rather than later ;-)
>
> That's good news (both items :-).
>
> Of course, the abstract character issue remains. Are we likely to see
> support for normalization in the Python C core any time soon?
> Specifically Normalization Form D...

Not unless someone contributes the code... we still need support
for normalization and collation.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/

From reagle@w3.org Wed Sep 25 18:13:23 2002
From: reagle@w3.org (Joseph Reagle)
Date: Wed, 25 Sep 2002 13:13:23 -0400
Subject: [XML-SIG] Fwd: Re: XML-DSIG interop test vectors
Message-ID: <200209251313.23785.reagle@w3.org>

--------------Boundary-00=_BU70A0WFRJYGUOVVZG4M
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 8bit

c14n.py makes a number of simplifying assumptions and consequently doesn't
correctly serialize many "exotic" subsets. For instance, if an element is
selected by XPath, then all of its attributes are rendered regardless of
whether they are in the selected subset.
Since I recently encountered this question in the context of a specific
test, I added two tweaks that do the right thing: before an attribute is
added to xml_attrs or other_attrs, I check to see if it's in the subset.

---------- Forwarded Message ----------

Subject: Re: XML-DSIG interop test vectors
Date: Mon, 23 Sep 2002 16:24:56 -0400
From: Joseph Reagle
To: "Ari Kermaier"

On Monday 23 September 2002 01:11 pm, Ari Kermaier wrote:
> The result of the location path in this case is the set of all nodes in
> the document.

Right, resulting from:
  (//. | //@* | namespace::*)

> The predicate is then applied to each node in the location path set,
> resulting in true or false for each node.

So for every node, we're testing with the predicate [@*] which as an
expression "selects all the attributes of the context node". So this
evaluates to true for the "player" element.

> This is true for the element node
> (which has 3 attributes), but should be false for all other nodes in the
> document because they have no attributes. In particular, the attribute
> nodes owned by the element have no attributes themselves, so
> the predicate should evaluate to false for them, and they should be
> excluded from the final result.

Because they are not in the subset. Ok, I understand now, and the pyXML code
isn't very instructive on this front because it (stupidly) renders every
attribute, not testing whether each attribute itself is in the nodeset. When
I do, the result is as you say:

  nodelist is []

Did you get a response from anyone else? If not, you could always feed it to
Aleksey's script.
  http://www.aleksey.com/xmlsec/xmldsig-verifier.html

-------------------------------------------------------

-- 
*Note: I will be traveling and attending meetings Oct 2/3 in California; and
Oct 5-15 in Australia. I will not be very responsive during this period; I
will fully respond to any email as soon as possible after my return.

Joseph Reagle Jr.
http://www.w3.org/People/Reagle/ W3C Policy Analyst mailto:reagle@w3.org IETF/W3C XML-Signature Co-Chair http://www.w3.org/Signature/ W3C XML Encryption Chair http://www.w3.org/Encryption/2001/ --------------Boundary-00=_BU70A0WFRJYGUOVVZG4M Content-Type: text/x-python; charset="iso-8859-1"; name="c14n.py" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="c14n.py" #! /usr/bin/env python '''XML Canonicalization This module generates canonical XML of a document or element. http://www.w3.org/TR/2001/REC-xml-c14n-20010315 and includes a prototype of exclusive canonicalization http://www.w3.org/Signature/Drafts/xml-exc-c14n Requires PyXML 0.7.0 or later. Known issues if using Ft.Lib.pDomlette: 1. Unicode 2. does not white space normalize attributes of type NMTOKEN and ID? 3. seems to be include "\n" after importing external entities? Note, this version processes a DOM tree, and consequently it processes namespace nodes as attributes, not from a node's namespace axis. This permits simple document and element canonicalization without XPath. When XPath is used, the XPath result node list is passed and used to determine if the node is in the XPath result list, but little else. Authors: "Joseph M. Reagle Jr." "Rich Salz" $Date: 2002/04/26 21:22:00 $ by $Author: reagle $ ''' _copyright = '''Copyright 2001, Zolera Systems Inc. All Rights Reserved. Copyright 2001, MIT. All Rights Reserved. Distributed under the terms of: Python 2.0 License or later. 
http://www.python.org/2.0.1/license.html or W3C Software License http://www.w3.org/Consortium/Legal/copyright-software-19980720 ''' import string from xml.dom import Node try: from xml.ns import XMLNS except: class XMLNS: BASE = "http://www.w3.org/2000/xmlns/" XML = "http://www.w3.org/XML/1998/namespace" try: import cStringIO StringIO = cStringIO except ImportError: import StringIO _attrs = lambda E: (E.attributes and E.attributes.values()) or [] _children = lambda E: E.childNodes or [] _IN_XML_NS = lambda n: n.namespaceURI == XMLNS.XML _inclusive = lambda n: n.unsuppressedPrefixes == None # Does a document/PI has lesser/greater document order than the # first element? _LesserElement, _Element, _GreaterElement = range(3) def _sorter(n1,n2): '''_sorter(n1,n2) -> int Sorting predicate for non-NS attributes.''' i = cmp(n1.namespaceURI, n2.namespaceURI) if i: return i return cmp(n1.localName, n2.localName) def _sorter_ns(n1,n2): '''_sorter_ns((n,v),(n,v)) -> int "(an empty namespace URI is lexicographically least)."''' if n1[0] == 'xmlns': return -1 if n2[0] == 'xmlns': return 1 return cmp(n1[0], n2[0]) def _utilized(n, node, other_attrs, unsuppressedPrefixes): '''_utilized(n, node, other_attrs, unsuppressedPrefixes) -> boolean Return true if that nodespace is utilized within the node''' if n.startswith('xmlns:'): n = n[6:] elif n.startswith('xmlns'): n = n[5:] if (n=="" and node.prefix in ["#default", None]) or \ n == node.prefix or n in unsuppressedPrefixes: return 1 for attr in other_attrs: if n == attr.prefix: return 1 return 0 #_in_subset = lambda subset, node: not subset or node in subset _in_subset = lambda subset, node: subset is None or node in subset # rich's tweak class _implementation: '''Implementation class for C14N. This accompanies a node during it's processing and includes the parameters and processing state.''' # Handler for each node type; populated during module instantiation. 
handlers = {} def __init__(self, node, write, **kw): '''Create and run the implementation.''' self.write = write self.subset = kw.get('subset') self.comments = kw.get('comments', 0) self.unsuppressedPrefixes = kw.get('unsuppressedPrefixes') nsdict = kw.get('nsdict', { 'xml': XMLNS.XML, 'xmlns': XMLNS.BASE }) # Processing state. self.state = (nsdict, {'xml':''}, {}) #0422 if node.nodeType == Node.DOCUMENT_NODE: self._do_document(node) elif node.nodeType == Node.ELEMENT_NODE: self.documentOrder = _Element # At document element if not _inclusive(self): self._do_element(node) else: inherited = self._inherit_context(node) self._do_element(node, inherited) elif node.nodeType == Node.DOCUMENT_TYPE_NODE: pass else: raise TypeError, str(node) def _inherit_context(self, node): '''_inherit_context(self, node) -> list Scan ancestors of attribute and namespace context. Used only for single element node canonicalization, not for subset canonicalization.''' # Collect the initial list of xml:foo attributes. xmlattrs = filter(_IN_XML_NS, _attrs(node)) # Walk up and get all xml:XXX attributes we inherit. inherited, parent = [], node.parentNode while parent and parent.nodeType == Node.ELEMENT_NODE: for a in filter(_IN_XML_NS, _attrs(parent)): n = a.localName if n not in xmlattrs: xmlattrs.append(n) inherited.append(a) parent = parent.parentNode return inherited def _do_document(self, node): '''_do_document(self, node) -> None Process a document node. 
documentOrder holds whether the document element has been encountered such that PIs/comments can be written as specified.''' self.documentOrder = _LesserElement for child in node.childNodes: if child.nodeType == Node.ELEMENT_NODE: self.documentOrder = _Element # At document element self._do_element(child) self.documentOrder = _GreaterElement # After document element elif child.nodeType == Node.PROCESSING_INSTRUCTION_NODE: self._do_pi(child) elif child.nodeType == Node.COMMENT_NODE: self._do_comment(child) elif child.nodeType == Node.DOCUMENT_TYPE_NODE: pass else: raise TypeError, str(child) handlers[Node.DOCUMENT_NODE] = _do_document def _do_text(self, node): '''_do_text(self, node) -> None Process a text or CDATA node. Render various special characters as their C14N entity representations.''' if not _in_subset(self.subset, node): return s = string.replace(node.data, "&", "&amp;") s = string.replace(s, "<", "&lt;") s = string.replace(s, ">", "&gt;") s = string.replace(s, "\015", "&#xD;") if s: self.write(s) handlers[Node.TEXT_NODE] = _do_text handlers[Node.CDATA_SECTION_NODE] = _do_text def _do_pi(self, node): '''_do_pi(self, node) -> None Process a PI node. Render a leading or trailing #xA if the document order of the PI is greater or lesser (respectively) than the document element. ''' if not _in_subset(self.subset, node): return W = self.write if self.documentOrder == _GreaterElement: W('\n') W('<?') W(node.nodeName) s = node.data if s: W(' ') W(s) W('?>') if self.documentOrder == _LesserElement: W('\n') handlers[Node.PROCESSING_INSTRUCTION_NODE] = _do_pi def _do_comment(self, node): '''_do_comment(self, node) -> None Process a comment node. Render a leading or trailing #xA if the document order of the comment is greater or lesser (respectively) than the document element. 
''' if not _in_subset(self.subset, node): return if self.comments: W = self.write if self.documentOrder == _GreaterElement: W('\n') W('<!--') W(node.data) W('-->') if self.documentOrder == _LesserElement: W('\n') handlers[Node.COMMENT_NODE] = _do_comment def _do_attr(self, n, value): '''_do_attr(self, n, value) -> None Process an attribute.''' W = self.write W(' ') W(n) W('="') s = string.replace(value, "&", "&amp;") s = string.replace(s, "<", "&lt;") s = string.replace(s, '"', '&quot;') s = string.replace(s, '\011', '&#x9;') s = string.replace(s, '\012', '&#xA;') s = string.replace(s, '\015', '&#xD;') W(s) W('"') def _do_element(self, node, initial_other_attrs = []): '''_do_element(self, node, initial_other_attrs = []) -> None Process an element (and its children).''' # Get state (from the stack) make local copies. # ns_parent -- NS declarations in parent # ns_rendered -- NS nodes rendered by ancestors # ns_local -- NS declarations relevant to this element # xml_attrs -- Attributes in XML namespace from parent # xml_attrs_local -- Local attributes in XML namespace. ns_parent, ns_rendered, xml_attrs = \ self.state[0], self.state[1].copy(), self.state[2].copy() #0422 ns_local = ns_parent.copy() xml_attrs_local = {} # Divide attributes into NS, XML, and others. other_attrs = initial_other_attrs[:] in_subset = _in_subset(self.subset, node) for a in _attrs(node): if a.namespaceURI == XMLNS.BASE: n = a.nodeName if n == "xmlns:": n = "xmlns" # DOM bug workaround ns_local[n] = a.nodeValue elif a.namespaceURI == XMLNS.XML: if _inclusive(self) or (in_subset and _in_subset(self.subset, a)): #020925 Test to see if attribute node in subset xml_attrs_local[a.nodeName] = a #0426 else: if _in_subset(self.subset, a): #020925 Test to see if attribute node in subset other_attrs.append(a) #add local xml:foo attributes to ancestor's xml:foo attributes xml_attrs.update(xml_attrs_local) # Render the node W, name = self.write, None if in_subset: name = node.nodeName W('<') W(name) # Create list of NS attributes to render. 
ns_to_render = [] for n,v in ns_local.items(): # If default namespace is XMLNS.BASE or empty, # and if an ancestor was the same if n == "xmlns" and v in [ XMLNS.BASE, '' ] \ and ns_rendered.get('xmlns') in [ XMLNS.BASE, '', None ]: continue # "omit namespace node with local name xml, which defines # the xml prefix, if its string value is # http://www.w3.org/XML/1998/namespace." if n in ["xmlns:xml", "xml"] \ and v in [ 'http://www.w3.org/XML/1998/namespace' ]: continue # If not previously rendered # and it's inclusive or utilized if (n,v) not in ns_rendered.items() \ and (_inclusive(self) or \ _utilized(n, node, other_attrs, self.unsuppressedPrefixes)): ns_to_render.append((n, v)) # Sort and render the ns, marking what was rendered. ns_to_render.sort(_sorter_ns) for n,v in ns_to_render: self._do_attr(n, v) ns_rendered[n]=v #0417 # If exclusive or the parent is in the subset, add the local xml attributes # Else, add all local and ancestor xml attributes # Sort and render the attributes. if not _inclusive(self) or _in_subset(self.subset,node.parentNode): #0426 other_attrs.extend(xml_attrs_local.values()) else: other_attrs.extend(xml_attrs.values()) other_attrs.sort(_sorter) for a in other_attrs: self._do_attr(a.nodeName, a.value) W('>') # Push state, recurse, pop state. state, self.state = self.state, (ns_local, ns_rendered, xml_attrs) for c in _children(node): _implementation.handlers[c.nodeType](self, c) self.state = state if name: W('</%s>' % name) handlers[Node.ELEMENT_NODE] = _do_element def Canonicalize(node, output=None, **kw): '''Canonicalize(node, output=None, **kw) -> UTF-8 Canonicalize a DOM document/element node and all descendants. 
Return the text; if output is specified then output.write will be called to output the text and None will be returned Keyword parameters: nsdict: a dictionary of prefix:uri namespace entries assumed to exist in the surrounding context comments: keep comments if non-zero (default is 0) subset: Canonical XML subsetting resulting from XPath (default is []) unsuppressedPrefixes: do exclusive C14N, and this specifies the prefixes that should be inherited. ''' if output: apply(_implementation, (node, output.write), kw) else: s = StringIO.StringIO() apply(_implementation, (node, s.write), kw) return s.getvalue() --------------Boundary-00=_BU70A0WFRJYGUOVVZG4M-- From martin@v.loewis.de Wed Sep 25 18:32:03 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 25 Sep 2002 19:32:03 +0200 Subject: [XML-SIG] Weirdness (bug?) with smart_len (wasRe: Issues with Unicode type) In-Reply-To: <1032945860.12566.23.camel@ibook> References: <1032945860.12566.23.camel@ibook> Message-ID: Eric van der Vlist writes: > Weird, isn't it? That's a known bug in Python 2.2, which has been fixed in Python 2.3. MAL says the fix cannot be backported to 2.2.2, since it requires bumping the pyc revision. I recommend to use unicode("your-string-as-utf8","utf-8") instead of u"your-string-as-unicode-literal" Regards, Martin From martin@v.loewis.de Wed Sep 25 18:37:46 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 25 Sep 2002 19:37:46 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: Message-ID: Lars Marius Garshol writes: > | (BTW, there is no diff between UTF-32 and UCS-4, is there?). > > UTF-32 is Unicode, UCS-4 is ISO 10646. The Unicode code space used to > be more restricted than the ISO 10646 one, which ISO was supposed to > fix. Not sure whether that fix has gone through yet, but probably it > has. Once it has there will be no difference. In addition, UTF-32 is a transfer form, UCS-4 is a code set. 
In some revisions, ISO 10646 seems to imply that UTF-32 is thus a byte encoding, but it has now been clarified that it is rather a transfer form based on 32-bit code units, with UTF-32BE and UTF-32LE being possible byte encodings. Apart from that: every character assigned in UCS-4 has the code unit with the same value in UTF-32. Regards, Martin From walter@livinglogic.de Wed Sep 25 18:49:32 2002 From: walter@livinglogic.de (Walter Dörwald) Date: Wed, 25 Sep 2002 19:49:32 +0200 Subject: [XML-SIG] saxutils.XMLGenerator: Output encoding References: <20020925071303.GY30717@doctronic.de> Message-ID: <3D91F72C.7070106@livinglogic.de> Carsten Oberscheid wrote: > Hello everybody, > > I have not followed this list for some time, so this may have been > discussed before: to the XMLGenerator, an output encoding can be > given. All output is then written through saxutils.escape() using this > encoding. As a result, any character in the document that can not be > represented in the output encoding raises a UnicodeException. So one > single special character in a file can force me to produce UTF-8 > encoding, although for further processing ISO 8859-1 or even ASCII > would be much more handy. > > An alternative would be to catch the UnicodeException and, as a > reaction, encode the offensive characters as character references > (e.g. "&#8220;"). Shouldn't this be the XML way to do it? That's exactly the purpose of PEP 293, which will go into Python 2.3. With it you can write: u"x\u201cx".encode("ascii", "xmlcharrefreplace") and you'll get: "x&#8220;x" > I can provide a very primitive patch for saxutils.py, if anybody is > interested. I even would try to make it less primitive, if there are > no objections against taking this fix into the distribution :^) Using the new functionality in PyXML is another matter, because of backwards compatibility. If you'd like to provide a patch for PyXML that works for versions prior to 2.3, go ahead. 
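A minimal sketch of the character-reference fallback discussed above, written in present-day Python 3 syntax rather than the Python 2.x of this thread. The helper name encode_with_charrefs is hypothetical and is not part of PyXML or saxutils; it assumes an ASCII-compatible 8-bit target encoding (ASCII, ISO 8859-1 and the like) and simply mimics, one character at a time, what the PEP 293 "xmlcharrefreplace" error handler does.

```python
def encode_with_charrefs(text, encoding="ascii"):
    """Encode text, replacing unencodable characters with &#NNNN; references.

    Hypothetical helper for illustration only; assumes the target
    encoding is ASCII-compatible, so the reference itself can be
    written as plain ASCII bytes.
    """
    chunks = []
    for ch in text:
        try:
            chunks.append(ch.encode(encoding))
        except UnicodeEncodeError:
            # No representation in the target encoding:
            # emit a decimal character reference instead.
            chunks.append(("&#%d;" % ord(ch)).encode("ascii"))
    return b"".join(chunks)


# Built-in error handler introduced by PEP 293.
builtin = "x\u201cx".encode("ascii", "xmlcharrefreplace")

# Hand-rolled fallback for interpreters without that handler.
manual = encode_with_charrefs("x\u201cx", "ascii")
```

Encoding one character at a time is slow for large documents; a real patch would probably encode the whole string first and fall back to the per-character path only when that raises.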
Bye, Walter Dörwald From martin@v.loewis.de Wed Sep 25 18:47:20 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 25 Sep 2002 19:47:20 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> Message-ID: Lars Marius Garshol writes: > | 1. Ignore the problem. This is probably fine: nobody is using non-BMP > | characters right now. Most systems have serious problems displaying > | them, since font systems are restricted to 64k glyphs, and, in many > | cases, to displaying characters in the BMP only. > > Actually, Windows 2000 displays non-BMP characters just fine. MSIE can > be made to do it, Opera 6.0 does it just fine, Mozilla does not (I > think) do it. Can you demonstrate this? I failed trying for myself, because: - I have no fonts that have characters outside the BMP, - TrueType is limited to 64k glyphs, - OpenType fonts that want to include non-BMP characters need two char-to-glyph tables, one for UCS-2, and one for UCS-4. Reportedly, W2k will only use the UCS-2 table in a font that contains non-BMP characters, so I somewhat doubt your statement. WXP reportedly does support such fonts - but I have none. - charmap.exe cannot display characters outside the BMP. > Also, there are locales where non-BMP characters are essential. > Cantonese is probably the best example. You can't write the Cantonese > equivalent of the "-ing" ending in Cantonese with the BMP... W2k/WXP support GB18030 with a special support package, but the font included (SimSun18030 aka NSimSun) does *not* support CJK Extension B, only CJK Extension A. > Is the plan that Python will eventually be UCS-4 only? It's my plan, but I think I don't share this plan with GvR. 
When I first presented a Unicode type for Python on IPC6, Guido was quite upset about my proposal to use a 4-byte wchar_t as the underlying type, since he considered the space wastage unacceptable. When Fredrik and I implemented PEP 261, I had to back out my change to make Py_UNICODE equal to wchar_t by default if wchar_t is four bytes. Regards, Martin From martin@v.loewis.de Wed Sep 25 19:01:36 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 25 Sep 2002 20:01:36 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: <20020923173203.O5635@redhat.com> Message-ID: Lars Marius Garshol writes: > Note also that there is one further problem. How long is this string > > u"\u0041\u030A" > > according to RELAX/XPath/XSDL? In XML 1.1, you are required to produce NFC "early", i.e. before the XML document becomes visible. XPath points out that things may work incorrectly unless the W3C charmod canonical form is used. This is not only relevant for length operations, but also for string comparison. Relax does not bother mentioning normalization. XSDL seems to be largely ignorant of normalization as well, although it refers to the character model as a non-normative reference. > The problem here is that the UTF-16 == Unicode assumption is built > into all sorts of technologies, from Python to Java to Ada-95 to Win32 > to DOM 2.0 to ..., and in most cases people are not even aware of the > problem. Notice that was even in Unicode a problem for a long time. Some revisions of the Unicode spec ruled out wchar_t implementations as non-conforming which use UCS-4 in memory. Regards, Martin From martin@v.loewis.de Wed Sep 25 19:05:16 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 25 Sep 2002 20:05:16 +0200 Subject: [XML-SIG] Re: [XML-checkins]xml/xml/dom expatbuilder.py,1.24,1.25 In-Reply-To: <15761.46620.734779.200240@grendel.zope.com> References: <15761.46620.734779.200240@grendel.zope.com> Message-ID: "Fred L. Drake, Jr." 
writes: > Except that the compatibility version defined in minicompat is being > used in this case, and it does accept a tuple as the second arg. I didn't notice minicompat is that powerful :-) Regards, Martin From fdrake@acm.org Wed Sep 25 19:10:25 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Wed, 25 Sep 2002 14:10:25 -0400 Subject: [XML-SIG] Re: [XML-checkins]xml/xml/dom expatbuilder.py,1.24,1.25 In-Reply-To: References: <15761.46620.734779.200240@grendel.zope.com> Message-ID: <15761.64529.818172.669181@grendel.zope.com> Martin v. Loewis writes: > I didn't notice minicompat is that powerful :-) Aye, minicompat is truly the work of the devil! ;) -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From martin@v.loewis.de Wed Sep 25 19:55:13 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 25 Sep 2002 20:55:13 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <1032948827.12566.157.camel@ibook> References: <1032948827.12566.157.camel@ibook> Message-ID: Eric van der Vlist writes: > Does that mean that chargen.py should be rewritten for ucs4? No. It means that Unicode character classes don't work in SRE, for ucs4 builds; this is http://python.org/sf/599377. It is likely that it was me who introduced this bug, when I added the optimization for large Unicode character classes, but I haven't found the time to investigate that further, and may not be able to do so in the coming months. Contributions are welcome. > PS: if someone could help me with chargen.py which looks like black > magic to me, I would really appreciate! What do you want to know? It parses the character definitions of XML 1.0 2nd edition, and generates sre definitions from that. Regards, Martin From fdrake@acm.org Wed Sep 25 18:57:13 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) 
Date: Wed, 25 Sep 2002 13:57:13 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> Message-ID: <15761.63737.77586.112928@grendel.zope.com> Martin v. Loewis writes: > It's my plan, but I think I don't share this plan with GvR. When I > first presented a Unicode type for Python on IPC6, Guido was quite > upset about my proposal to use a 4-byte wchar_t as the underlying > type, since he considered the space wastage unacceptable. When Guido saw my documentation updates yesterday, he entered "rant mode" fairly quickly, and didn't want to back down (even though I didn't express any interest in taking any position on any matter related to Unicode!). From that I presume it's safe to think he's still incredibly hostile to anything beyond 7-bit ASCII. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From martin@v.loewis.de Wed Sep 25 20:17:52 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 25 Sep 2002 21:17:52 +0200 Subject: [XML-SIG] saxutils.XMLGenerator: Output encoding In-Reply-To: <3D91F72C.7070106@livinglogic.de> References: <20020925071303.GY30717@doctronic.de> <3D91F72C.7070106@livinglogic.de> Message-ID: Walter Dörwald writes: > Using the new functionality in PyXML is another matter, because of > backwards compatibility. If you'd like to provide a patch for PyXML > that work for versions prior to 2.3, go ahead. Let me second this suggestion. The working model for PyXML is to accept any features that people are willing to contribute or maintain, no matter how much used a certain feature is, assuming it does not break any existing application. For new features, it is ok to require certain minimum Python versions if this is documented. Requiring CVS Python makes a feature probably available to *really* few users, though. 
Having less-than-ideal fallbacks for Python 2.<3 is ok. Please submit patches to sf.net/projects/pyxml. Regards, Martin From uche.ogbuji@fourthought.com Wed Sep 25 20:47:15 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: 25 Sep 2002 13:47:15 -0600 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: <1032948827.12566.157.camel@ibook> Message-ID: <1032983242.8608.669.camel@malatesta> On Wed, 2002-09-25 at 12:55, Martin v. Loewis wrote: > Eric van der Vlist writes: > > > Does that mean that chargen.py should be rewritten for ucs4? > > No. It means that Unicode character classes don't work in SRE, for > ucs4 builds; this is http://python.org/sf/599377. > > It is likely that it was me who introduced this bug, when I added the > optimization for large Unicode character classes, but I haven't found > the time to investigate that further, and may not be able to do so in > the coming months. Contributions are welcome. Hah! And I see that Fred Drake was sneaky enough to assign it to /F instead of you. I guess that means you're off the hook for good ;-) -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html From fdrake@acm.org Wed Sep 25 20:44:15 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Wed, 25 Sep 2002 15:44:15 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <1032983242.8608.669.camel@malatesta> References: <1032948827.12566.157.camel@ibook> <1032983242.8608.669.camel@malatesta> Message-ID: <15762.4623.492178.123510@grendel.zope.com> Uche Ogbuji writes: > Hah! And I see that Fred Drake was sneaky enough to assign it to /F > instead of you. 
I guess that means you're off the hook for good ;-) I'm happy to see it re-assigned to Martin, but I'm more concerned that it actually get fixed. -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From uche.ogbuji@fourthought.com Wed Sep 25 21:15:39 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Wed, 25 Sep 2002 14:15:39 -0600 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: Message from "Fred L. Drake, Jr." of "Wed, 25 Sep 2002 13:57:13 EDT." <15761.63737.77586.112928@grendel.zope.com> Message-ID: > > Martin v. Loewis writes: > > It's my plan, but I think I don't share this plan with GvR. When I > > first presented a Unicode type for Python on IPC6, Guido was quite > > upset about my proposal to use a 4-byte wchar_t as the underlying > > type, since he considered the space wastage unacceptable. > > When Guido saw my documentation updates yesterday, he entered "rant > mode" fairly quickly, and didn't want to back down (even though I > didn't express any interest in taking any position on any matter > related to Unicode!). From that I presume it's safe to think he's > still incredibly hostile to anything beyond 7-bit ASCII. Oh dear. Does Guido not understand that currently things are a bit confused and very confusing in Python? -- Uche Ogbuji Fourthought, Inc. http://uche.ogbuji.net http://4Suite.org http://fourthought.com Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/ Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html From martin@v.loewis.de Wed Sep 25 21:24:11 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 25 Sep 2002 22:24:11 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: Message-ID: Uche Ogbuji writes: > Oh dear. Does Guido not understand that currently things are a bit > confused and very confusing in Python? 
I think he does. However, he might come to the conclusion that we would be better off if the Unicode type had not been added to the language :-) I think there is also some unpleasant feeling about having to make decisions without fully understanding all consequences, together with the feeling, that all people who give guidance lack full understanding as well... Regards, Martin From mal@lemburg.com Wed Sep 25 21:39:38 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 25 Sep 2002 22:39:38 +0200 Subject: [XML-SIG] Re: Issues with Unicode type References: Message-ID: <3D921F0A.9050209@lemburg.com> Martin v. Loewis wrote: > Uche Ogbuji writes: > > >>Oh dear. Does Guido not understand that currently things are a bit >>confused and very confusing in Python? > > > I think he does. However, he might come to the conclusion that we > would be better off if the Unicode type had not been added to the > language :-) > > I think there is also some unpleasant feeling about having to make > decisions without fully understanding all consequences, together with > the feeling, that all people who give guidance lack full understanding > as well... I book all this under FUD. It'll take a bit of time, but we'll eventually move there. For now, I think the issues around surrogates and the need for non-BMP code points in real life applications are a bit overhyped. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From fdrake@acm.org Wed Sep 25 21:40:42 2002 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Wed, 25 Sep 2002 16:40:42 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: Message-ID: <15762.8010.975188.424165@grendel.zope.com> Martin v. Loewis writes: > I think he does. 
However, he might come to the conclusion that we > would be better off if the Unicode type had not been added to the > language :-) Ah, but the concern over backward compatibility will prevent him from removing it. > I think there is also some unpleasant feeling about having to make > decisions without fully understanding all consequences, together with > the feeling, that all people who give guidance lack full understanding > as well... I think there are a couple of other issues, which perhaps you're implying here: - The addition of Unicode has proven to be quite invasive in the C code of the core; it has certainly affected more than we really expected (though others may have had more appropriate expectations). - The Unicode data type is quite contagious -- since string operations that involve Unicode objects tend to produce Unicode objects, even if all the data is ASCII, there's a serious element of surprise. The traditional view that text strings and byte strings are the same certainly fosters a reactionary position with regard to Unicode. I don't know how to work around this historical baggage without calling it "Python3K" (doesn't that sound a lot like "MST3K"? ;). -Fred -- Fred L. Drake, Jr. PythonLabs at Zope Corporation From fredrik@pythonware.com Wed Sep 25 22:07:07 2002 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 25 Sep 2002 23:07:07 +0200 Subject: [XML-SIG] Re: Issues with Unicode type References: <200209232138.g8NLcrXL071961@chilled.skew.org><1032818077.3243.7899.camel@malatesta><15760.28591.497946.198455@grendel.zope.com><035301c263d7$782982a0$0900a8c0@spiff> Message-ID: <005401c264d7$84578670$ced241d5@hagrid> Martin wrote: > > imo (as the original author of the unicode type), that's an implementation > > artifact, not a feature. > > Yes, but it still needs to be documented. sure, as long as the documentation makes it clear that "here be dragons". 
From ps_python@yahoo.com Wed Sep 25 23:53:55 2002 From: ps_python@yahoo.com (kumar s) Date: Wed, 25 Sep 2002 15:53:55 -0700 (PDT) Subject: [XML-SIG] XML Parsing problem Message-ID: <20020925225355.73818.qmail@web13004.mail.yahoo.com> Dear Group, I am trying to parse some 1200 XML files. I am using XMLParser.py script. I wrote a shell script to pass files to this script. However when I try to execute my shell script I get the following error: $ ./format.sh : bad interpreter: Permission deniedn/python2.1 : bad interpreter: Permission deniedn/python2.1 My script is : #!/bin/sh rm entries; for file in ./home/files/xml/* do ./XMLParse1.py $file>>entries ./XMLParse2.py $file >>entries done I have all my xml files in xml directory. and my shell script is residing in /home/files/ Can any one please help me out why I am getting this problem. when I execute my XMLParse1.py via command line I get the result. $ python XMLParse1.py 10245.xml works for me. Is there any way I can parse all 1024 XML files. Please help me. thanks PS My XMLparse.py file #!/usr/bin/python2.1 from xml.dom import minidom import sys from xml.sax._exceptions import SAXParseException import StringIO class XMLParse: def _load(self, source): """ Function to load an XML document from disk/Internet/standard input/XML document as a string. """ sock = self.openAnything(source) try: xmld = minidom.parse(sock).documentElement except SAXParseException: raise "ParseError", "Check tags" sock.close() return xmld # Following function assumes user has uploaded a file instead of # giving a URL pointing to the file on the Internet def parseFile(self, file): """ Opens specified XML document and parses it """ self.xmldoc = self._load(file) return self.xmldoc def parseString(self, str): """ Parses XML formatted string -str- """ self.xmldoc = self._load(str) return self.xmldoc def getTag(self, name): """ Given a tag with name "name", GetTag returns the contents within and including the tags. 
""" reflist = self.xmldoc.getElementsByTagName(name) return reflist def getWithinTag(self, tag_name, name): """ Given a tag name, this function only returs the contents within the tag (NOT the entire XML document, like getTag). """ refList = tag_name.getElementsByTagName(name) return refList def getText(self, object, name): nodelist = self.getTag(name) rc = "" for node in nodelist: if node.nodeType == node.TEXT_NODE: rc = rc + node.data return rc def openAnything(self, source): """URI, filename, or string --> stream This function lets you define parsers that take any input source (URL, pathname to local or network file, or actual data as a string) and deal with it in a uniform manner. Returned object is guaranteed to have all the basic stdio read methods (read, readline, readlines). Just .close() the object when you're done with it. Examples: >>> from xml.dom import minidom >>> sock = openAnything("http://localhost/myfile.xml") >>> doc = minidom.parse(sock) >>> sock.close() >>> sock = openAnything("c:\\inetpub\\wwwroot\\myfile.xml") >>> doc = minidom.parse(sock) >>> sock.close() >>> sock = openAnything(" andor") >>> doc = minidom.parse(sock) >>> sock.close() """ if hasattr(source, "read"): return source if source == '-': import sys return sys.stdin # try to open with urllib (if source is http, ftp, or file # URL) import urllib try: return urllib.urlopen(source) except (IOError, OSError): pass # try to open with native open function (pathname) try: return open(source) except (IOError, OSError): pass # treat source as string return StringIO.StringIO(str(source)) def sanitize(self, data): """ Cleans up XML data into a string. 
""" from re import sub, compile delchars = "[ \t\n]+" return sub(delchars, " ", data.strip()) def xmlProcess(self, ele): rc = "" cNode = ele nodeAttr = None if cNode.hasAttributes(): nodeAttr = cNode.attributes if ele.hasChildNodes(): cNode = ele.firstChild while cNode.nodeType != cNode.TEXT_NODE: cNode = cNode.nextSibling while cNode is not None and cNode.nodeType == cNode.TEXT_NODE: cNode.normalize() rc = rc + cNode.data cNode = cNode.nextSibling else: #if cNode is not None: # if cNode.nodeType == cNode.ELEMENT_NODE: return self.sanitize(rc), nodeAttr def create_set(self, attr): """ Given a node's attributes (a NamedNodeMap object), creates a list that is easy to use. Return value: attr_set[] Usage: attr_set[0].name = name of attribute attr_set[0].value = value of attribute """ keys = attr.keys() attr_set = [] for e in range(len(keys)): attr_set.append(attr[keys[e]]) return attr_set if __name__=='__main__': file = sys.argv[1] doc = XMLParse() doc.parseFile(file) li = doc.getTag("entry_cDNA") # print li res = "" for i in range(len(li)): res, attr = doc.xmlProcess(li[i]) print res #if attr is None: # break #else: # print "Has attributes" # set = doc.create_set(attr) # for e in range(len(set)): # print set[e].name,":",set[e].value __________________________________________________ Do you Yahoo!? New DSL Internet Access from SBC & Yahoo! http://sbc.yahoo.com From veillard@redhat.com Thu Sep 26 00:06:05 2002 From: veillard@redhat.com (Daniel Veillard) Date: Wed, 25 Sep 2002 19:06:05 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <3D91DAE5.1090009@lemburg.com>; from mal@lemburg.com on Wed, Sep 25, 2002 at 05:48:53PM +0200 References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> <3D91DAE5.1090009@lemburg.com> Message-ID: <20020925190605.L5635@redhat.com> On Wed, Sep 25, 2002 at 05:48:53PM +0200, M.-A. 
Lemburg wrote:
> Lars Marius Garshol wrote:
> > Is the plan that Python will eventually be UCS-4 only?
>
> Eventually, yes, but this can take some time -- cheaper memory,
> faster machines, etc. For now we have the compile time option
> and since RedHat chose to activate it, there's a good chance
> that we'll get forced to do the same sooner rather than
> later ;-)

Just in case my post about it was missed, this is apparently not the case for the upcoming release:

gnome:~ -> python
Python 2.2.1 (#1, Aug 30 2002, 12:15:30)
[GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.maxunicode
65535
>>>
gnome:~ ->

I would rather suggest people stabilize their APIs and provide the correct wrappers to a unified view of character counting independent of the specificity of the underlying in-memory representation, especially if the Windows port of python depends on Windows character representation (UTF16).

Daniel

--
Daniel Veillard | Red Hat Network https://rhn.redhat.com/
veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

From Matt Gushee Wed Sep 25 23:58:55 2002
From: Matt Gushee (Matt Gushee)
Date: Wed, 25 Sep 2002 16:58:55 -0600
Subject: [XML-SIG] XML Parsing problem
In-Reply-To: <20020925225355.73818.qmail@web13004.mail.yahoo.com>
References: <20020925225355.73818.qmail@web13004.mail.yahoo.com>
Message-ID: <20020925225854.GI577@swordfish>

On Wed, Sep 25, 2002 at 03:53:55PM -0700, kumar s wrote:
>
> $ ./format.sh
> : bad interpreter: Permission deniedn/python2.1
> : bad interpreter: Permission deniedn/python2.1
>
> My script is :
>
> #!/bin/sh
>
> rm entries;
>
> for file in ./home/files/xml/*
> do
> ./XMLParse1.py $file>>entries
> ./XMLParse2.py $file >>entries
> done

What is the first line of XMLParse1.py?
If it is the same as this: > My XMLparse.py file > > #!/usr/bin/python2.1 then what is the result of $ file /usr/bin/python2.1 and $ ls -l /usr/bin/python2.1 ? By the way, unless you have more than one version of Python on your system, you're probably better off using #!/usr/bin/env python It's much more portable. -- Matt Gushee Englewood, Colorado, USA mgushee@havenrock.com http://www.havenrock.com/ From veillard@redhat.com Thu Sep 26 00:12:33 2002 From: veillard@redhat.com (Daniel Veillard) Date: Wed, 25 Sep 2002 19:12:33 -0400 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: ; from martin@v.loewis.de on Wed, Sep 25, 2002 at 08:01:36PM +0200 References: <20020923173203.O5635@redhat.com> Message-ID: <20020925191233.M5635@redhat.com> On Wed, Sep 25, 2002 at 08:01:36PM +0200, Martin v. Loewis wrote: > Lars Marius Garshol writes: > > > Note also that there is one further problem. How long is this string > > > > u"\u0041\u030A" > > > > according to RELAX/XPath/XSDL? > > In XML 1.1, you are required to produce NFC "early", i.e. before the > XML document becomes visible. XPath points out that things may work I will just say that "required" is too strongly worded, Daniel -- Daniel Veillard | Red Hat Network https://rhn.redhat.com/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/ From uche.ogbuji@fourthought.com Thu Sep 26 03:24:10 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Wed, 25 Sep 2002 20:24:10 -0600 Subject: [XML-SIG] Weirdness (bug?) with smart_len (wasRe: Issues with Unicode type) In-Reply-To: Message from martin@v.loewis.de (Martin v. Loewis) of "25 Sep 2002 19:32:03 +0200." Message-ID: > Eric van der Vlist writes: > > > Weird, isn't it? > > That's a known bug in Python 2.2, which has been fixed in Python > 2.3. MAL says the fix cannot be backported to 2.2.2, since it > requires bumping the pyc revision. Do you know the SF tracker number? 
I'm working on an Akara page on all this.

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html

From martin@v.loewis.de Thu Sep 26 04:51:58 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 26 Sep 2002 05:51:58 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: <20020925191233.M5635@redhat.com>
References: <20020923173203.O5635@redhat.com> <20020925191233.M5635@redhat.com>
Message-ID:

Daniel Veillard writes:

> > In XML 1.1, you are required to produce NFC "early", i.e. before the
> > XML document becomes visible.
>
> I will just say that "required" is too strongly worded,

How else would I interpret "In order to be well-formed, all XML parsed entities (including document entities) must be fully normalized as per the definition of [Charmod] supplemented...", and "It is a fatal error for a parsed entity not to be in fully normalized form."? I assume "fatal error" continues to be defined as "An error which a conforming XML processor must detect and report to the application... Once a fatal error is detected, however, the processor must not continue normal processing."

Regards,
Martin

From uche.ogbuji@fourthought.com Thu Sep 26 04:59:56 2002
From: uche.ogbuji@fourthought.com (Uche Ogbuji)
Date: Wed, 25 Sep 2002 21:59:56 -0600
Subject: [XML-SIG] Weirdness (bug?) with smart_len (wasRe: Issues with Unicode type)
In-Reply-To: Message from Uche Ogbuji of "Wed, 25 Sep 2002 20:24:10 MDT."
Message-ID:

> > Eric van der Vlist writes:
> >
> > > Weird, isn't it?
> >
> > That's a known bug in Python 2.2, which has been fixed in Python
> > 2.3. MAL says the fix cannot be backported to 2.2.2, since it
> > requires bumping the pyc revision.
>
> Do you know the SF tracker number? I'm working on an Akara page on all this.

I think I found it: #593581

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html

From martin@v.loewis.de Thu Sep 26 05:05:03 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 26 Sep 2002 06:05:03 +0200
Subject: [XML-SIG] Weirdness (bug?) with smart_len (wasRe: Issues with Unicode type)
In-Reply-To:
References:
Message-ID:

Uche Ogbuji writes:

> > That's a known bug in Python 2.2, which has been fixed in Python
> > 2.3. MAL says the fix cannot be backported to 2.2.2, since it
> > requires bumping the pyc revision.
>
> Do you know the SF tracker number? I'm working on an Akara page on all this.

I think one of the original bug reports was #433882. It appears that this was fixed with (among others) import.c 2.194, even though the checkin message does not explicitly list this bug number.

For 2.2, this bug is #610783. MAL says this is fixed in the 2.2 branch, although there is no indication what specific changes fixed the problem.

Regards,
Martin

From martin@v.loewis.de Thu Sep 26 05:08:48 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 26 Sep 2002 06:08:48 +0200
Subject: [XML-SIG] Weirdness (bug?) with smart_len (wasRe: Issues with Unicode type)
In-Reply-To:
References:
Message-ID:

Uche Ogbuji writes:

> > Do you know the SF tracker number? I'm working on an Akara page
> > on all this.
> I think I found it: #593581

That's a different issue: it does not affect .pyc files.
Regards,
Martin

From vdv@dyomedea.com Thu Sep 26 08:08:53 2002
From: vdv@dyomedea.com (Eric van der Vlist)
Date: 26 Sep 2002 09:08:53 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: <3D921F0A.9050209@lemburg.com>
References: <3D921F0A.9050209@lemburg.com>
Message-ID: <1033024134.23888.47.camel@ibook>

On Wed, 2002-09-25 at 22:39, M.-A. Lemburg wrote:
>
> I book all this under FUD. It'll take a bit of time, but we'll
> eventually move there. For now, I think the issues around
> surrogates and the need for non-BMP code points in real life
> applications are a bit overhyped.

I think that it depends what we call real life and more precisely if you consider that the full conformance to standards and W3C recommendations is part of the real life or not.

Having never met the need before, I can't consider non BMP code points as an absolute requirement by themselves.

OTOH, working on implementations of standards (or recs) without aiming for complete conformance is something which I consider as dangerous and I am reaching a point where Python doesn't look as an adequate platform to implement W3C XML Schema datatypes (and hardly an adequate platform to implement Relax NG) because of the lack of support of non BMP code points.

1) For Relax NG:

The issue can be solved by using other mechanisms to test "NCName"s but the Regular Expression which I am using right now doesn't work when the Python interpreter has been compiled with support of ucs4.
2) For W3C XML Schema Datatypes:

The two issues which I am currently aware of are the length of the strings which can be solved by implementing an application level length algorithm and, more serious, the support of the regular expressions required for the "pattern" facet for which I don't see how we could rely on the Python regexp features which are buggy when compiled as ucs4 and will not produce the expected result when compiled as ucs2.

Unless we rely on external C extensions such as the ones developed by Daniel for libxml, I just see no way to be "natively conform"!

Again, we can say that it won't matter for "real life applications" and that we don't care about conformance but that's a dangerous path.

Thanks,

Eric
--
Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist http://xmlfr.org http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

From veillard@redhat.com Thu Sep 26 10:18:13 2002
From: veillard@redhat.com (Daniel Veillard)
Date: Thu, 26 Sep 2002 05:18:13 -0400
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: <1033024134.23888.47.camel@ibook>; from vdv@dyomedea.com on Thu, Sep 26, 2002 at 09:08:53AM +0200
References: <3D921F0A.9050209@lemburg.com> <1033024134.23888.47.camel@ibook>
Message-ID: <20020926051813.O5635@redhat.com>

On Thu, Sep 26, 2002 at 09:08:53AM +0200, Eric van der Vlist wrote:
> Unless we rely on external C extensions such as the ones developed by
> Daniel for libxml, I just see no way to be "natively conform"!
Hmm, independently, the XML Schemas regexp support will be in the next version of libxml2 python bindings:

paphio:~/XML/python -> python
Python 1.5.2 (#1, Apr 3 2002, 18:16:26) [GCC 2.96 20000731 (Red Hat Linux 7.2 2 on linux-i386
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import libxml2
>>> re = libxml2.regexpCompile("a(b|c){2,3}d")
>>> re.regexpExec("abcd")
1
>>> re.regexpExec("acccd")
1
>>> re.regexpExec("abd")
0
>>> re.regexpExec("accccd")
0
>>> re = libxml2.regexpCompile("((a|b|\p{Nd}){1,2}|aaa|bbbb){1,2}")
>>> re.regexpExec("bab")
1
>>> re.regexpExec("aaca")
0
>>> re.regexpExec("aaabbbb")
1
>>> re.regexpExec("a0b")
1
>>> re.regexpExec("aa0aaa")
0
>>> re.regexpExec("b0aaa")
1
>>>

Strings consumed are expected to be UTF-8 encoded; there is also support for the block escape \p{IsX} based on the version of the Unicode map database (April 2002).

Daniel

--
Daniel Veillard | Red Hat Network https://rhn.redhat.com/
veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

From dehora@eircom.net Thu Sep 26 10:16:48 2002
From: dehora@eircom.net (Bill de hÓra)
Date: Thu, 26 Sep 2002 10:16:48 +0100
Subject: [XML-SIG] XML Parsing problem
In-Reply-To: <20020925225355.73818.qmail@web13004.mail.yahoo.com>
Message-ID: <002f01c2653d$7126e4e0$1fc8c8c8@mitchum>

> However when I try to execute my shell script I get
> the following error:
>
> $ ./format.sh
> : bad interpreter: Permission deniedn/python2.1
> : bad interpreter: Permission deniedn/python2.1

Kumar,

Check your script for weird eol characters like ^M. That's usually the reason I get a "bad interpreter" message. There's a switch in vi that can highlight ^M (which I don't remember, perhaps someone here knows it).
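
The garbled error text appears consistent with this diagnosis: if the "#!" line ends in a carriage return, the interpreter path becomes "/usr/bin/python2.1" plus "\r", and a terminal printing that "\r" would let the rest of the message overwrite the start of the line, leaving "n/python2.1" at the end. A minimal sketch for checking and cleaning a script follows. It assumes a current Python 3, and the helper names shebang_has_cr and strip_carriage_returns are hypothetical, not part of any code posted in this thread.

```python
def shebang_has_cr(first_line: bytes) -> bool:
    """Return True if a '#!' line ends with a carriage return (DOS line ending)."""
    return first_line.startswith(b"#!") and first_line.rstrip(b"\n").endswith(b"\r")


def strip_carriage_returns(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving everything else untouched."""
    return data.replace(b"\r\n", b"\n")
```

To use it, read the script in binary mode, pass its first line to shebang_has_cr, and write the result of strip_carriage_returns back only if the check is true.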
Bill de hÓra
--
Propylon
www.propylon.com

From noreply@sourceforge.net Thu Sep 26 10:30:32 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Thu, 26 Sep 2002 02:30:32 -0700
Subject: [XML-SIG] [ pyxml-Bugs-614875 ] sigsegv with large input
Message-ID:

Bugs item #614875, was opened at 2002-09-26 11:30
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=106473&aid=614875&group_id=6473

Category: pyexpat
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Joerg Beyer (jbeyer)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: sigsegv with large input

Initial Comment:
When I was parsing a large input file with a sax2 parser, I triggered a sigsegv. I was able to reduce the input to the still large file, that I upload to this bug report. I use pyxml 0.8.1

This is my python parser script:
------------------------------------------
import xml.sax.sax2exts
parser = xml.sax.sax2exts.XMLParserFactory.make_parser()
parser.setFeature(xml.sax.handler.feature_namespaces, 1)
parser.parse('in.xml')
------------------------------------------

This is the traceback:

(gdb) r parser
Starting program: /netsite/python/python2.2/bin/python2.2 parser
[New Thread 1024 (runnable)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1024 (runnable)]
0x2ac04302 in chunk_free (ar_ptr=0x2ac8f2c0, p=0x8356d30) at malloc.c:3100
3100 malloc.c: Datei oder Verzeichnis nicht gefunden.
in malloc.c Current language: auto; currently c (gdb) bt #0 0x2ac04302 in chunk_free (ar_ptr=0x2ac8f2c0, p=0x8356d30) at malloc.c:3100 #1 0x2ac041cf in __libc_free (mem=0x8356d38) at malloc.c:3023 #2 0x2acbb520 in XML_ParserFree (parser=0x83674f0) at extensions/expat/lib/xmlparse.c:1003 #3 0x2acb8f17 in xmlparse_dealloc (self=0x82b919c) at extensions/pyexpat.c:1294 #4 0x080fa3e7 in PyDict_SetItem (op=0x81f125c, key=0x82b84c8, value=0x819ab5c) at Objects/dictobject.c:373 #5 0x080e1fe7 in instance_setattr (inst=0x82b6e64, name=0x82b84c8, v=0x819ab5c) at Objects/classobject.c:741 #6 0x0806d668 in PyObject_SetAttr (v=0x82b6e64, name=0x82b84c8, value=0x819ab5c) at Objects/object.c:1153 #7 0x0808dc6f in eval_frame (f=0x822f544) at Python/ceval.c:1606 #8 0x0808f78e in PyEval_EvalCodeEx (co=0x82bac38, globals=0x82b817c, locals=0x0, args=0x8210094, argcount=1, kws=0x8210098, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2590 #9 0x0809193d in fast_function (func=0x8353d1c, pp_stack=0x7fffef84, n=1, na=1, nk=0) at Python/ceval.c:3166 #10 0x0808e811 in eval_frame (f=0x820ff34) at Python/ceval.c:2029 #11 0x0808f78e in PyEval_EvalCodeEx (co=0x823b520, globals=0x824221c, locals=0x0, args=0x8280908, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2590 #12 0x080f07fe in function_call (func=0x82563fc, arg=0x82808fc, kw=0x0) at Objects/funcobject.c:374 #13 0x080de562 in PyObject_Call (func=0x82563fc, arg=0x82808fc, kw=0x0) at Objects/abstract.c:1684 #14 0x080e5390 in instancemethod_call (func=0x82563fc, arg=0x82808fc, kw=0x0) at Objects/classobject.c:2276 #15 0x080de562 in PyObject_Call (func=0x82b9494, arg=0x82808fc, kw=0x0) at Objects/abstract.c:1684 #16 0x080919c9 in do_call (func=0x82b9494, pp_stack=0x7ffff1a4, na=2, nk=0) at Python/ceval.c:3267 #17 0x0808e82f in eval_frame (f=0x8292954) at Python/ceval.c:2032 #18 0x0808f78e in PyEval_EvalCodeEx (co=0x82b9ab8, globals=0x82b817c, locals=0x0, args=0x81f82c8, argcount=2, 
kws=0x81f82d0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2590 #19 0x0809193d in fast_function (func=0x8353324, pp_stack=0x7ffff2f4, n=2, na=2, nk=0) at Python/ceval.c:3166 #20 0x0808e811 in eval_frame (f=0x81f817c) at Python/ceval.c:2029 #21 0x0808f78e in PyEval_EvalCodeEx (co=0x8231c38, globals=0x81f11a4, locals=0x81f11a4, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2590 #22 0x080918c8 in PyEval_EvalCode (co=0x8231c38, globals=0x81f11a4, locals=0x81f11a4) at Python/ceval.c:488 #23 0x080ac9d3 in run_node (n=0x8203358, filename=0x7ffff737 "parser", globals=0x81f11a4, locals=0x81f11a4, flags=0x7ffff4f8) at Python/pythonrun.c:1079 #24 0x080ac986 in run_err_node (n=0x8203358, filename=0x7ffff737 "parser", globals=0x81f11a4, locals=0x81f11a4, flags=0x7ffff4f8) at Python/pythonrun.c:1066 #25 0x080ac5ad in PyRun_FileExFlags (fp=0x81e16e8, filename=0x7ffff737 "parser", start=257, globals=0x81f11a4, locals=0x81f11a4, closeit=1, flags=0x7ffff4f8) at Python/pythonrun.c:1057 #26 0x080ab0c1 in PyRun_SimpleFileExFlags (fp=0x81e16e8, filename=0x7ffff737 "parser", closeit=1, flags=0x7ffff4f8) at Python/pythonrun.c:685 #27 0x080ac0ac in PyRun_AnyFileExFlags (fp=0x81e16e8, filename=0x7ffff737 "parser", closeit=1, flags=0x7ffff4f8) at Python/pythonrun.c:495 #28 0x0806a36b in Py_Main (argc=2, argv=0x7ffff584) at Modules/main.c:364 #29 0x08069c16 in main (argc=2, argv=0x7ffff584) at ./Modules/ccpython.cc:10 #30 0x2abcaa8e in __libc_start_main (main=0x8069c00
, argc=2, argv=0x7ffff584, init=0x8068260 <_init>, fini=0x8173edc <_fini>, rtld_fini=0x2aab5a20 <_dl_fini>, stack_end=0x7ffff57c) at ../sysdeps/generic/libc-start.c:92 (gdb) please ask, if any further information might help you. TIA Joerg ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=106473&aid=614875&group_id=6473 From mal@lemburg.com Thu Sep 26 10:41:04 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 26 Sep 2002 11:41:04 +0200 Subject: [XML-SIG] Re: Issues with Unicode type References: <3D921F0A.9050209@lemburg.com> <1033024134.23888.47.camel@ibook> Message-ID: <3D92D630.8090505@lemburg.com> Eric van der Vlist wrote: > On Wed, 2002-09-25 at 22:39, M.-A. Lemburg wrote: > >>I book all this under FUD. It'll take a bit of time, but we'll >>eventually move there. For now, I think the issues around >>surrogates and the need for non-BMP code points in real life >>applications are a bit overhyped. > > > I think that it depends what we call real life and more precisely if you > consider that the full conformance to standards and W3C recommendations > is part of the real life or not. > > Having never met the need before, I can't consider non BMP code points > as an absolute requirement by themselves. See, that's what I meant :-) We'll get there in time; until then, I'd suggest to use UCS4 builds to write standards implementations. > ... > Again, we can say that it won't matter for "real life applications" and > that we don't care about conformance but that's a dangerous path. I never suggested that; only to give it some time... heck, Java isn't even near being standards conform and neither is Windows. Both were built on top of Unicode 2.x at a time when people thought that 65k chars would be more than enough for all time (hmm, I remember I thought the same a few years back when I bought a 2GB fixed disk ;-). 
-- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From larsga@garshol.priv.no Thu Sep 26 10:42:58 2002 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 26 Sep 2002 11:42:58 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> Message-ID: * Lars Marius Garshol | | Actually, Windows 2000 displays non-BMP characters just fine. MSIE | can be made to do it, Opera 6.0 does it just fine, Mozilla does not | (I think) do it. * Martin v. Loewis | | Can you demonstrate this? I don't know why my word alone is not enough, but here you go: The page contains instructions for how to enable the display of such characters. Note that I never did anything to enable surrogate support. | I failed trying for myself, because: | | - I have no fonts that has characters outside the BMP, Use James Kass's Code 2001. | - OpenType fonts that want to include non-BMP characters need | to char-to-glyph tables, one for UCS-2, and one for UCS-4. | | Reportedly, W2k will only use the UCS-2 table in a font that | contains non-BMP characters, so I somewhat doubt your statement. WXP | reportedly does support such fonts - but I have none. The screenshot above is taken on Windows 2000. The font is Code 2001. * Lars Marius Garshol | | Also, there are locales where non-BMP characters are essential. | Cantonese is probably the best example. You can't write the | Cantonese equivalent of the "-ing" ending in Cantonese with the | BMP... * Martin v. 
Loewis
|
| W2k/WXP support GB18030 with a special support package, but the font
| included (SimSun18030 aka NSimSun) does *not* support the CJK
| Extensions B, only CJK extensions A.

That may well be. In Opera we have our own GB 18030 converter. I would prefer to pretend that the wretched mess does not exist, but contracts with mainland Chinese companies require us to support it.

* Lars Marius Garshol
|
| Is the plan that Python will eventually be UCS-4 only?

* Martin v. Loewis
|
| It's my plan, but I think I don't share this plan with GvR. When I
| first presented a Unicode type for Python on IPC6, Guido was quite
| upset about my proposal to use a 4-byte wchar_t as the underlying
| type, since he considered the space wastage unacceptable.
|
| When Fredrik and I implemented PEP 261, I had to back out my change
| to make Py_UNICODE equal to wchar_t by default if wchar_t is four
| bytes.

That's sad. It would be good if we could eventually get Python to be all UCS-4.

--
Lars Marius Garshol, Ontopian
ISO SC34/WG3, OASIS GeoLang TC

From vdv@dyomedea.com Thu Sep 26 10:52:19 2002
From: vdv@dyomedea.com (Eric van der Vlist)
Date: 26 Sep 2002 11:52:19 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: <3D92D630.8090505@lemburg.com>
References: <3D921F0A.9050209@lemburg.com> <1033024134.23888.47.camel@ibook> <3D92D630.8090505@lemburg.com>
Message-ID: <1033033939.23902.221.camel@ibook>

On Thu, 2002-09-26 at 11:41, M.-A. Lemburg wrote:
> Eric van der Vlist wrote:
> > On Wed, 2002-09-25 at 22:39, M.-A. Lemburg wrote:
> >
> >>I book all this under FUD. It'll take a bit of time, but we'll
> >>eventually move there. For now, I think the issues around
> >>surrogates and the need for non-BMP code points in real life
> >>applications are a bit overhyped.
> >
> >
> > I think that it depends what we call real life and more precisely if you
> > consider that the full conformance to standards and W3C recommendations
> > is part of the real life or not.
> >
> > Having never met the need before, I can't consider non BMP code points
> > as an absolute requirement by themselves.
>
> See, that's what I meant :-) We'll get there in time; until then,
> I'd suggest to use UCS4 builds to write standards implementations.

That would seem reasonable, except that regexp doesn't seem to be usable yet on these platforms and that, I have more failures when I run the Relax NG test suite on these platforms than on UCS2 builds :-( !

I would be quite happy to say: if you want to be 100% compliant, use a UCS4 build, but it looks like the impact of unicode has been too invasive (to quote Fred) and that it's not ready yet either...

>
> > ...
> > Again, we can say that it won't matter for "real life applications" and
> > that we don't care about conformance but that's a dangerous path.
>
> I never suggested that; only to give it some time... heck, Java
> isn't even near being standards conform and neither is Windows.
> Both were built on top of Unicode 2.x at a time when people thought
> that 65k chars would be more than enough for all time (hmm, I remember
> I thought the same a few years back when I bought a 2GB fixed disk ;-).

Sure!

Thanks

Eric

> --
> Marc-Andre Lemburg
> CEO eGenix.com Software GmbH
> _______________________________________________________________________
> eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
> Python Consulting: http://www.egenix.com/
> Python Software: http://www.egenix.com/files/python/
>
>
--
Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist http://xmlfr.org http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

From Matt Gushee Thu Sep 26 10:51:07 2002
From: Matt Gushee (Matt Gushee)
Date: Thu, 26 Sep 2002 03:51:07 -0600
Subject: [XML-SIG] XML Parsing problem
In-Reply-To: <002f01c2653d$7126e4e0$1fc8c8c8@mitchum>
References: <20020925225355.73818.qmail@web13004.mail.yahoo.com> <002f01c2653d$7126e4e0$1fc8c8c8@mitchum>
Message-ID: <20020926095107.GM577@swordfish>

On Thu, Sep 26, 2002 at 10:16:48AM +0100, Bill de hÓra wrote:
>
> > However when I try to execute my shell script I get
> > the following error:
> >
> > $ ./format.sh
> > : bad interpreter: Permission deniedn/python2.1
> > : bad interpreter: Permission deniedn/python2.1
>
> Kumar,
>
> Check your script for weird eol characters like ^M. That's usually the
> reason I get a "bad interpreter" message. There's a switch in vi that
> can highlight ^M (which I don't remember, perhaps someone here knows
> it).

Oh, that's right. I knew I had seen this problem before, but didn't remember why. I'm not sure how to highlight the ^Ms in vi, but I know how to globally search-and-replace them (in Vim, that is):

:%s/^M/^M/g

(in case you're not very familiar with vim '^M' is the single character produced by typing Ctrl-v Enter)

That's a little counterintuitive, but you can think of it as 'find the character ^M and replace it with the character produced by the Enter key'.

--
Matt Gushee
Englewood, Colorado, USA
mgushee@havenrock.com
http://www.havenrock.com/

From martin@v.loewis.de Thu Sep 26 13:17:26 2002
From: martin@v.loewis.de (Martin v.
Loewis)
Date: 26 Sep 2002 14:17:26 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: <1033024134.23888.47.camel@ibook>
References: <3D921F0A.9050209@lemburg.com> <1033024134.23888.47.camel@ibook>
Message-ID:

Eric van der Vlist writes:

> OTOH, working on implementations of standards (or recs) without aiming
> for complete conformance is something which I consider as dangerous and
> I am reaching a point where Python doesn't look as an adequate platform
> to implement W3C XML Schema datatypes (and hardly an adequate platform
> to implement Relax NG) because of the lack of support of non BMP code
> points.

Please understand that Python is free software. So if it does not fit your needs, you can:
a) adjust your needs, or
b) adjust Python, or
c) not use Python.

It is only for non-free software where b) is no option.

> The two issues which I am currently aware of are the length of the
> strings which can be solved by implementing an application level length
> algorithm and, more serious, the support of the regular expressions
> required for the "pattern" facet for which I don't see how we could rely
> on the Python regexp features which are buggy when compiled as ucs4 and
> will not produce the expected result when compiled as ucs2.
>
> Unless we rely on external C extensions such as the ones developed by
> Daniel for libxml, I just see no way to be "natively conform"!

I think this is a simplification: You can certainly implement the len algorithm without regular expressions at all:

import sys

if sys.maxunicode == 65535:
    def smart_len(s):
        l = 0
        for c in s:
            if not 0xd800 <= ord(c) < 0xdc00:
                # skip high surrogates - only count the low surrogates
                l += 1
        return l
else:
    smart_len = len

The same applies for NCName: You do not *have* to use regular expressions. Instead, build a dictionary

NCName = {}
for char in all_ncname_chars:
    NCName[char] = 1

With that, you can test whether a character is allowed with NCName.has_key(char).
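
The two workarounds above can be sketched for a current Python 3, where a str is already a sequence of code points and there is no narrow build. To mimic the UCS-2 case, this sketch counts over explicit UTF-16 code units instead. The helper utf16_units, the placeholder SAMPLE_NAME_CHARS and the function all_chars_allowed are assumptions made for illustration, and the placeholder set is deliberately not the real NCName character table.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def smart_len(units):
    """Count code points in a sequence of UTF-16 code units.

    High surrogates (0xD800-0xDBFF) are skipped, so a surrogate pair is
    counted once, through its low surrogate.
    """
    count = 0
    for unit in units:
        if not 0xD800 <= unit < 0xDC00:
            count += 1
    return count


# Placeholder only: a real table would hold every character the NCName
# production allows, which this sketch does not attempt to list.
SAMPLE_NAME_CHARS = set("abcdefghijklmnopqrstuvwxyz_")


def all_chars_allowed(name, allowed=SAMPLE_NAME_CHARS):
    """Return True if every character of name is in the allowed set."""
    return all(ch in allowed for ch in name)
```

A set plays the role of the dictionary here, since membership is the only operation needed. A real NCName check would also have to treat the first character separately from the rest.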
> Again, we can say that it won't matter for "real life applications" and
> that we don't care about conformance but that's a dangerous path.

My code shows that there is a fourth option, in addition to fixing Python:

d) work around the bugs and limitations

Python is Turing-complete, so there is no algorithmic problem that cannot be solved in Python. So, saying that you cannot "natively conform" is an oversimplification.

Regards,
Martin

From vdv@dyomedea.com Thu Sep 26 13:32:49 2002
From: vdv@dyomedea.com (Eric van der Vlist)
Date: 26 Sep 2002 14:32:49 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To:
References: <3D921F0A.9050209@lemburg.com> <1033024134.23888.47.camel@ibook>
Message-ID: <1033043569.23888.390.camel@ibook>

On Thu, 2002-09-26 at 14:17, Martin v. Loewis wrote:
> Eric van der Vlist writes:
>
> > OTOH, working on implementations of standards (or recs) without aiming
> > for complete conformance is something which I consider as dangerous and
> > I am reaching a point where Python doesn't look as an adequate platform
> > to implement W3C XML Schema datatypes (and hardly an adequate platform
> > to implement Relax NG) because of the lack of support of non BMP code
> > points.
>
> Please understand that Python is free software. So if it does not fit
> your needs, you can:
> a) adjust your needs, or
> b) adjust Python, or
> c) not use Python.
>
> It is only for non-free software where b) is no option.

Sure, sorry if I have given the impression I was complaining while I am just trying to evaluate the situation!
>=20 > > The two issues which I am currently aware of are the length of the > > strings which can be solved by implementing an application level length > > algorithm and, more serious, the support of the regular expressions > > required for the "pattern" facet for which I don't see how we could rel= y > > on the Python regexp features which are buggy when compiled as ucs4 and > > will not produce the expected result when compiled as ucs2.=20 > >=20 > > Unless we rely on external C extensions such as the ones developed by > > Daniel for libxml, I just see no way to be "natively conform"! >=20 > I think this is a simplification: You can certainly implement the len > algorithm without regular expressions at all: >=20 > if sys.maxunicode =3D=3D 65535: > def smart_len(s): > l =3D 0 > for c in s: > if not 0xd800 <=3D ord(i) < 0xdc00: > # skip high surrogates - only count the low surrogates > l +=3D 1 > return l > else: > smart_len =3D len >=20 > The same applies for NCName: You do not *have* to use regular > expressions. Instead, build a dictionary=20 >=20 > NCName =3D {} > for char in all_ncname_chars: > NCName[char] =3D 1 >=20 > With that, you can test whether a character is allowed with > NCName.has_key(char). >=20 > > Again, we can say that it won't matter for "real life applications" and > > that we don't care about conformance but that's a dangerous path. >=20 > My code shows that there is a fourth option, in addition to fixing > Python:=20 >=20 > d) work around the bugs and limitations >=20 > Python is Turing-complete, so there is no algorithmic problem that > cannot be solved in Python. So, saying that you cannot "natively > conform" is an oversimplification. Yes, but when it comes to implement the W3C XML Schema "pattern" facet which is basically regular expressions embedded in schemas, this seems to require rewriting a full regular expressions engine. 
What I meant by "not natively conform" is that it *seems* not feasable with the builtin re module in its current state. Eric (just trying to see where he is stepping into) >=20 > Regards, > Martin >=20 >=20 --=20 Rendez-vous =E0 Paris. http://www.technoforum.fr/integ2002/index.html ------------------------------------------------------------------------ Eric van der Vlist http://xmlfr.org http://dyomedea.com (W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema ------------------------------------------------------------------------ From larsga@garshol.priv.no Thu Sep 26 13:41:09 2002 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 26 Sep 2002 14:41:09 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <3D91E193.3030904@lemburg.com> References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> <3D91DAE5.1090009@lemburg.com> <3D91E193.3030904@lemburg.com> Message-ID: * Lars Marius Garshol | | Of course, the abstract character issue remains. Are we likely to | see support for normalization in the Python C core any time soon? | Specifically Normalization Form D... * mal@lemburg.com | | Not unless someone contributes the code... we still need support for | normalization and collation. That makes sense. I guess PyXML is likely to be the first Python code to need this (both of those, actually), once XML 1.1 is ready. -- Lars Marius Garshol, Ontopian ISO SC34/WG3, OASIS GeoLang TC From larsga@garshol.priv.no Thu Sep 26 13:41:50 2002 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 26 Sep 2002 14:41:50 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: Message-ID: * Martin v. Loewis | | In addition, UTF-32 is a transfer form, UCS-4 is a code set. That's interesting. I wasn't aware of that distinction. I assume the same applies to UTF-16/UCS-2, then? 
--
Lars Marius Garshol, Ontopian ISO SC34/WG3, OASIS GeoLang TC

From veillard@redhat.com Thu Sep 26 13:48:40 2002
From: veillard@redhat.com (Daniel Veillard)
Date: Thu, 26 Sep 2002 08:48:40 -0400
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: <1033043569.23888.390.camel@ibook>; from vdv@dyomedea.com on Thu, Sep 26, 2002 at 02:32:49PM +0200
References: <3D921F0A.9050209@lemburg.com> <1033024134.23888.47.camel@ibook> <1033043569.23888.390.camel@ibook>
Message-ID: <20020926084840.A28714@redhat.com>

On Thu, Sep 26, 2002 at 02:32:49PM +0200, Eric van der Vlist wrote:
> Yes, but when it comes to implement the W3C XML Schema "pattern" facet
> which is basically regular expressions embedded in schemas, this seems
> to require rewriting a full regular expressions engine. What I meant by
> "not natively conform" is that it *seems* not feasable with the builtin
> re module in its current state.

Hum, I think you would need a rewrite anyway for full conformance:
the XML Schema regexps have more complex constructs than standard regexps,
the quantifiers may be richer (not 100% sure, I didn't check fully),
and all the character classes/groups/categories/blocks are not part of
"normal" regexps (well, I never saw any such description in regexp help
or man pages before, so I doubt it appeared magically in Python).
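
For readers wondering what handling the category escapes mentioned above might involve, here is a rough sketch that rewrites \p{...} general-category escapes into explicit character classes built from the unicodedata module. The function names are hypothetical, it only handles escapes that appear outside square brackets, and it ignores blocks (\p{Is...}) and class subtraction entirely, so it is a proof of concept rather than a conformant translator.

```python
import re
import sys
import unicodedata

def category_class(category):
    """Build a character class for a Unicode general category.

    A two-letter name such as "Nd" selects one category; a one-letter
    name such as "L" selects every category starting with that letter.
    """
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith(category):
            if start is None:
                start = cp
            end = cp
        elif start is not None:
            ranges.append((start, end))
            start = None
    if start is not None:
        ranges.append((start, end))
    if not ranges:
        raise ValueError("unknown category: %r" % category)
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append("\\U%08x" % first)
        else:
            parts.append("\\U%08x-\\U%08x" % (first, last))
    return "[" + "".join(parts) + "]"

def translate_pattern(pattern):
    """Replace each \\p{Name} escape with an explicit character class."""
    return re.sub(r"\\p\{(\w+)\}",
                  lambda m: category_class(m.group(1)), pattern)

# Schema patterns must match the whole value, hence fullmatch().
digits = re.compile(translate_pattern(r"\p{Nd}+"))
print(bool(digits.fullmatch("2002")), bool(digits.fullmatch("20a2")))
```

Each call to category_class scans the whole code point range, so a real implementation would presumably cache or precompute the classes for a fixed version of the character database.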

For those wondering about this, see
http://www.w3.org/TR/xmlschema-2/#regexs

Daniel

--
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard@redhat.com  | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

From veillard@redhat.com Thu Sep 26 13:50:54 2002
From: veillard@redhat.com (Daniel Veillard)
Date: Thu, 26 Sep 2002 08:50:54 -0400
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: ; from larsga@garshol.priv.no on Thu, Sep 26, 2002 at 02:41:09PM +0200
References: <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> <3D91DAE5.1090009@lemburg.com> <3D91E193.3030904@lemburg.com>
Message-ID: <20020926085054.B28714@redhat.com>

On Thu, Sep 26, 2002 at 02:41:09PM +0200, Lars Marius Garshol wrote:
> * mal@lemburg.com
> |
> | Not unless someone contributes the code... we still need support for
> | normalization and collation.
>
> That makes sense. I guess PyXML is likely to be the first Python code
> to need this (both of those, actually), once XML 1.1 is ready.

Actually XML 1.1 parsers would need normalization *checking* only;
it is only serialization that would need normalization and collation.
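
To make the distinction between checking and normalizing concrete, here is a minimal sketch of a normalization check. It relies on unicodedata.normalize from the standard library of current Python versions, and it assumes Normalization Form C as the target form; both are choices made for this illustration, not something settled in the thread.

```python
import unicodedata

def is_nfc(text):
    """Return True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text

composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent
print(is_nfc(composed), is_nfc(decomposed))
```

A checker in this style only reports whether the input is normalized; it never rewrites the data, which is the lighter requirement described above for parsers.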

Daniel

--
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard@redhat.com  | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

From vdv@dyomedea.com Thu Sep 26 14:07:15 2002
From: vdv@dyomedea.com (Eric van der Vlist)
Date: 26 Sep 2002 15:07:15 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: <20020926084840.A28714@redhat.com>
References: <3D921F0A.9050209@lemburg.com> <1033024134.23888.47.camel@ibook> <1033043569.23888.390.camel@ibook> <20020926084840.A28714@redhat.com>
Message-ID: <1033045636.23902.438.camel@ibook>

On Thu, 2002-09-26 at 14:48, Daniel Veillard wrote:

> Hum, I think you would need a rewrite anyway for full conformance,
> the XML Schemas regexp have more complext constructs than standard regexps
> the quantifiers may be more rich (not 100% sure I didn't checked fully)
> and all the character classes/group/category/blocks are not part of
> "normal" regexps (well I never saw any such description in regexps help
> or man before, so I doubt it appeared magically in python).

I am not 100% sure...

The quantifiers are "*", "+", "?" and the "{}" constructs and they seem
to work fine in Python...

As for the character classes/group/category/blocks, I was wondering if
they couldn't be described and generated with chargen.py.

A preparsing of the W3C XML Schema patterns could then be done to use
them.

What might be tougher are the features such as the complement of classes
(ex: [\p{IsBasicLatin}-[^\p{L}]]) which AFAIK is an extension over perl
regexps.

Eric (impatient to see the libxml bindings for this)
--
Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

From fredrik@pythonware.com Thu Sep 26 14:47:07 2002
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Thu, 26 Sep 2002 15:47:07 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
References: <20020923160005.28564.36214.Mailman@mail.python.org><15759.15816.342144.891607@magrathea.basistech.com><1032799708.19185.520.camel@ibook><1032801701.19382.572.camel@ibook>
Message-ID: <029301c26563$361c0990$0900a8c0@spiff>

Lars Marius Garshol wrote:

> That's sad. It would be good if we could eventually get Python to be
> all UCS-4.

I think the right phrase is "get Python to behave as if it were all UCS-4"

(at least that's my goal -- MAL and MvL might have other goals...)

From mike@skew.org Thu Sep 26 17:04:55 2002
From: mike@skew.org (Mike Brown)
Date: Thu, 26 Sep 2002 10:04:55 -0600 (MDT)
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: "from Lars Marius Garshol at Sep 26, 2002 02:41:50 pm"
Message-ID: <200209261604.g8QG4tWx081129@chilled.skew.org>

Lars Marius Garshol wrote:
>
> * Martin v. Loewis
> |
> | In addition, UTF-32 is a transfer form, UCS-4 is a code set.
>
> That's interesting. I wasn't aware of that distinction. I assume the
> same applies to UTF-16/UCS-2, then?

Sorta. UCS-4 is more than just a "code set" though. And IIRC there was some
debate over whether UTF-32 fit the definition of being a true UTF.
If you're going to get into that level of understanding, carefully read the following: http://www.unicode.org/unicode/reports/tr17/ and then reconcile its terminology and examples with this (from Unicode 3.0 chapter 3.8): D29 A Unicode (or UCS) transformation format (UTF) transforms each Unicode scalar value into a sequence of code values. A UTF may also specify a byte order for the serialization of the code values into bytes. A UTF may also specify the use of a byte order mark. and this (from Unicode 3.0 appendix C.2): ISO/IEC 10646 defines two alternative forms of encoding: - A four-octet (32-bit) encoding containing 2^31 code positions. These code positions are conceptually divided into 128 groups of 256 planes, each plane containing 256 rows of 256 cells. - A two-octet (16-bit) encoding consisting of plane zero, the Basic Multilingual Plane. The 32-bit form is referred to as UCS-4 (Universal Character Set coded in 4 octets) and the 16-bit form is referred to as UCS-2 (Universal Character Set coded in 2 octets). Have fun :) - Mike ____________________________________________________________________________ mike j. brown | xml/xslt: http://skew.org/xml/ denver/boulder, colorado, usa | resume: http://skew.org/~mike/resume/ From martin@v.loewis.de Thu Sep 26 18:40:34 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 26 Sep 2002 19:40:34 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <200209261604.g8QG4tWx081129@chilled.skew.org> References: <200209261604.g8QG4tWx081129@chilled.skew.org> Message-ID: Mike Brown writes: > Sorta. UCS-4 is more than just a "code set" though. And IIRC there > was some debate over whether UTF-32 fit the definition of being a > true UTF. If you're going to get into that level of understanding, > carefully read the following: I think the answer also somewhat varies depending on whether you read Unicode or ISO 10646. Regards, Martin From martin@v.loewis.de Thu Sep 26 18:45:48 2002 From: martin@v.loewis.de (Martin v. 
Loewis)
Date: 26 Sep 2002 19:45:48 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: <1033043569.23888.390.camel@ibook>
References: <3D921F0A.9050209@lemburg.com> <1033024134.23888.47.camel@ibook> <1033043569.23888.390.camel@ibook>
Message-ID:

Eric van der Vlist writes:

> Yes, but when it comes to implement the W3C XML Schema "pattern" facet
> which is basically regular expressions embedded in schemas, this seems
> to require rewriting a full regular expressions engine. What I meant by
> "not natively conform" is that it *seems* not feasable with the builtin
> re module in its current state.

Then you may consider the following patch, which I just checked into
Python 2.2.2 and 2.3. It should fix your test case.

Regards,
Martin

Index: sre_compile.py
===================================================================
RCS file: /cvsroot/python/python/dist/src/Lib/sre_compile.py,v
retrieving revision 1.43
retrieving revision 1.44
diff -u -r1.43 -r1.44
--- sre_compile.py 27 Jun 2002 20:08:25 -0000 1.43
+++ sre_compile.py 26 Sep 2002 16:39:20 -0000 1.44
@@ -188,6 +188,9 @@
                 # XXX: could append to charmap tail
                 return charset # cannot compress
     except IndexError:
+        if sys.maxunicode != 65535:
+            # XXX: big charsets don't work in UCS-4 builds
+            return charset
         # character set contains unicode characters
         return _optimize_unicode(charset, fixup)
     # compress character map

From martin@v.loewis.de Thu Sep 26 18:47:22 2002
From: martin@v.loewis.de (Martin v.
Loewis) Date: 26 Sep 2002 19:47:22 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <20020926084840.A28714@redhat.com> References: <3D921F0A.9050209@lemburg.com> <1033024134.23888.47.camel@ibook> <1033043569.23888.390.camel@ibook> <20020926084840.A28714@redhat.com> Message-ID: Daniel Veillard writes: > Hum, I think you would need a rewrite anyway for full conformance, > the XML Schemas regexp have more complext constructs than standard regexps > the quantifiers may be more rich (not 100% sure I didn't checked fully) > and all the character classes/group/category/blocks are not part of > "normal" regexps (well I never saw any such description in regexps help > or man before, so I doubt it appeared magically in python). You can get categories and blocks by mapping them onto "normal" Unicode character classes. For a specific version of the Unicode character database, this is a fixed mapping. Regards, Martin From martin@v.loewis.de Thu Sep 26 18:49:54 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 26 Sep 2002 19:49:54 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <1033045636.23902.438.camel@ibook> References: <3D921F0A.9050209@lemburg.com> <1033024134.23888.47.camel@ibook> <1033043569.23888.390.camel@ibook> <20020926084840.A28714@redhat.com> <1033045636.23902.438.camel@ibook> Message-ID: Eric van der Vlist writes: > As for the character classes/group/category/blocks, I was wondering if > they couldn't be described and generated with chargen.py. No; this doesn't parse the Unicode character database; Tools/unicode/makeunicodedata.py parses the Unicode character database. Generating regexes for classes is straight-forward from that. Generating regexes for blocks is not possible, since the standard Unicode database file does not list the blocks; that's a different file (AFAIK). Regards, Martin From martin@v.loewis.de Thu Sep 26 18:58:35 2002 From: martin@v.loewis.de (Martin v. 
Loewis) Date: 26 Sep 2002 19:58:35 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> Message-ID: Lars Marius Garshol writes: > I don't know why my word alone is not enough, but here you go: > I wanted to see myself :-) Thanks for the pointers. Sorry if I offended you. > | Reportedly, W2k will only use the UCS-2 table in a font that > | contains non-BMP characters, so I somewhat doubt your statement. WXP > | reportedly does support such fonts - but I have none. > > The screenshot above is taken on Windows 2000. The font is Code 2001. I was confused by the requirement in the OpenType spec that a UCS-2 cmap table must be included "for backward compatibility needs" if the font is has "UCS-4 character support for Windows 2000 and later". Apparently, that backward compatibility is meant for other systems, but not for Windows 2000. Regards, Martin From noreply@sourceforge.net Thu Sep 26 19:31:47 2002 From: noreply@sourceforge.net (noreply@sourceforge.net) Date: Thu, 26 Sep 2002 11:31:47 -0700 Subject: [XML-SIG] [ pyxml-Patches-615114 ] saxutils.py: CharRef escaping Message-ID: Patches item #615114, was opened at 2002-09-26 20:31 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=306473&aid=615114&group_id=6473 Category: SAX Group: None Status: Open Resolution: None Priority: 5 Submitted By: Carsten Oberscheid (oberscheid) Assigned to: Nobody/Anonymous (nobody) Summary: saxutils.py: CharRef escaping Initial Comment: saxutils.XMLGenerator selects a codec for output according to the encoding argument given to its constructor. All output is written through this codec, and any character in the data that doesn't fit the selected encoding raises a UnicodeError. 
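
The failure mode described here, and one brute-force way around it, can be sketched as follows. The helper name char_ref_escape is hypothetical and this is not the code from the submitted patch; it only illustrates replacing every character above 127 with a numeric character reference. Current Python versions also ship a built-in "xmlcharrefreplace" codec error handler, shown on the last line for comparison.

```python
def char_ref_escape(text):
    """Replace every character above 127 with a numeric character reference."""
    return "".join(ch if ord(ch) < 128 else "&#%d;" % ord(ch) for ch in text)

data = "D\u00f6rwald"

# Writing through a codec that cannot represent the data fails outright.
try:
    data.encode("ascii")
except UnicodeError as exc:
    print("plain encode fails:", type(exc).__name__)

# After escaping, the text fits any ASCII-compatible output encoding.
print(char_ref_escape(data).encode("ascii"))

# Built-in alternative in current Python versions.
print(data.encode("ascii", "xmlcharrefreplace"))
```

Note that this only deals with the output encoding; markup characters such as "&" and "<" in character data still need the usual escaping.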

The patch adds a cr_escape() function that replaces all characters with
codes > 127 by XML character references. So the output encoding can be
selected independently of the actual characters in the document.

This is done for character data and for attribute values, where CharRefs
are allowed. It is not done for element names, attribute names etc.,
where CharRefs are not allowed (although there can be non-ASCII
characters as well -- these still have to fit the output encoding).

It's a brute force thing, it can be slow, but it should do what it's
supposed to do. Walter Dörwald pointed out that PEP 293 should deprecate
this for Python 2.3, but for Python < 2.3 it may be useful.

It's my first patch, so if there's anything wrong with it, give me a
chance to learn and tell me. If there's a better way to do it (I'm sure
there is), ditto.

Nearly forgot: Patch against saxutils.py from 0.8.1, but I checked the
CVS version and it seemed to be unchanged.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=306473&aid=615114&group_id=6473

From rsalz@datapower.com Thu Sep 26 19:31:52 2002
From: rsalz@datapower.com (Rich Salz)
Date: Thu, 26 Sep 2002 14:31:52 -0400
Subject: [XML-SIG] Re: [XML-checkins]CVS: xml/xml/utils iso8601.py,1.6,1.7
References: <20020419065145.GC17017@orion.logilab.fr> <15759.26118.997231.804925@grendel.zope.com>
Message-ID: <3D935298.8040503@datapower.com>

If you want to handle US local time, then you have to allow seconds to
be between 0 and 61, inclusive. At least twice the US has had
two-leap-seconds in the same year, resulting in a local time of 11:59:61.

The details are probably in some POSIX spec.
/r$ From larsga@garshol.priv.no Thu Sep 26 22:28:12 2002 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 26 Sep 2002 23:28:12 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: <029301c26563$361c0990$0900a8c0@spiff> References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> <029301c26563$361c0990$0900a8c0@spiff> Message-ID: * Lars Marius Garshol | | That's sad. It would be good if we could eventually get Python to be | all UCS-4. * Fredrik Lundh | | I think the right phrase is "get Python to behave as if it were all | UCS-4" You are right, of course. The easiest way to do that is to make it actually *be* UCS-4, but of course that can be simulated. -- Lars Marius Garshol, Ontopian ISO SC34/WG3, OASIS GeoLang TC From larsga@garshol.priv.no Thu Sep 26 22:33:02 2002 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 26 Sep 2002 23:33:02 +0200 Subject: [XML-SIG] Re: Issues with Unicode type In-Reply-To: References: <20020923160005.28564.36214.Mailman@mail.python.org> <15759.15816.342144.891607@magrathea.basistech.com> <1032799708.19185.520.camel@ibook> <1032801701.19382.572.camel@ibook> Message-ID: * Lars Marius Garshol | | I don't know why my word alone is not enough, but here you go: | * Martin v. Loewis | | I wanted to see myself :-) Thanks for the pointers. You're welcome. Note that it also works in MSIE. | Sorry if I offended you. Don't worry. If you had it would have been very clear. :-) | I was confused by the requirement in the OpenType spec that a UCS-2 | cmap table must be included "for backward compatibility needs" if | the font is has "UCS-4 character support for Windows 2000 and | later". | | Apparently, that backward compatibility is meant for other systems, | but not for Windows 2000. That's how I would interpret it, given the knowledge that Windows 2000 can indeed do this. 
I don't know much about Windows internals, but as far as I know
it's actually Uniscribe that does this, so the Windows version in use
may not even matter as long as you have the right Uniscribe version.

--
Lars Marius Garshol, Ontopian ISO SC34/WG3, OASIS GeoLang TC

From uche.ogbuji@fourthought.com Thu Sep 26 22:52:17 2002
From: uche.ogbuji@fourthought.com (Uche Ogbuji)
Date: Thu, 26 Sep 2002 15:52:17 -0600
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: Message from Eric van der Vlist of "26 Sep 2002 09:08:53 +0200." <1033024134.23888.47.camel@ibook>
Message-ID:

> On Wed, 2002-09-25 at 22:39, M.-A. Lemburg wrote:
> >
> > I book all this under FUD. It'll take a bit of time, but we'll
> > eventually move there. For now, I think the issues around
> > surrogates and the need for non-BMP code points in real life
> > applications are a bit overhyped.
>
> I think that it depends what we call real life and more precisely if you
> consider that the full conformance to standards and W3C recommendations
> is part of the real life or not.
>
> Having never met the need before, I can't consider non BMP code points
> as an absolute requirement by themselves.
>
> OTH, working on implementations of standards (or recs) without aiming
> for complete conformance is something which I consider as dangerous and
> I am reaching a point where Python doesn't look as a adequate plateform
> to implement W3C XML Schema datatypes (and hardly an adequate platform
> to implement Relax NG) because of the lack of support of non BMP code
> points.

This is very unfair.

First of all, if Python is inadequate for conformant XML technologies, then
you're out of luck. No language is immune from Unicode bugs, and I know I ran
across some howlers in JDK 1.3. Java doesn't even have built-in regex
capabilities, so people either have to write their own or borrow Oromatcher or
the like.

In real life, conformance is nice, but people need to prioritize bug fixes and
development. You say that you didn't run into these problems in a real life
scenario but in trying to conform to some odd bits of a test suite you're
using. Can you credibly put this forth as a reason for the Python team to
drop everything and fix all wide unicode bugs?

--
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html

From uche.ogbuji@fourthought.com Thu Sep 26 23:07:17 2002
From: uche.ogbuji@fourthought.com (Uche Ogbuji)
Date: Thu, 26 Sep 2002 16:07:17 -0600
Subject: [XML-SIG] Fwd: Re: XML-DSIG interop test vectors
In-Reply-To: Message from Joseph Reagle of "Wed, 25 Sep 2002 13:13:23 EDT." <200209251313.23785.reagle@w3.org>
Message-ID:

> c14n.py makes a number of simplifying assumptions and consequently doesn't
> correctly serialize many "exotic" subsets. For instance, if an element is
> selected by XPath, then all of its attributes are rendered regardless of
> whether they are in the selected subset. Since I recently encountered this
> question in the context of a specific test, I added two tweaks that does
> the right thing: before an attribute is added to xml_attrs or other_attrs,
> I check to see if it's in the subset.

Are you wanting someone to check in this updated c14n.py for you?

--
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html

From vdv@dyomedea.com Thu Sep 26 23:16:37 2002
From: vdv@dyomedea.com (Eric van der Vlist)
Date: 27 Sep 2002 00:16:37 +0200
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To:
References:
Message-ID: <1033078598.25577.269.camel@ibook>

On Thu, 2002-09-26 at 23:52, Uche Ogbuji wrote:

> > OTH, working on implementations of standards (or recs) without aiming
> > for complete conformance is something which I consider as dangerous and
> > I am reaching a point where Python doesn't look as a adequate plateform
> > to implement W3C XML Schema datatypes (and hardly an adequate platform
> > to implement Relax NG) because of the lack of support of non BMP code
> > points.
>
> This is very unfair.

And maybe too strong, sorry for that!

> First of all, if Python is inadequate for conformant XML technologies, then
> you're out of luck. No language is immune from Unicode bugs, and I know I ran
> across some howlers in JDK 1.3. Java doesn't even have built-in regex
> capabilities, so people either have to write their own or borrow Oromatcher or
> the like.

Right.

> In real life, conformance is nice, but people need to prioritize bug fixes and
> development. You say that you didn't run into these problems in a real life
> scenario but in trying to conform to some odd bits of a test suite you're
> using. Can you credibly put this forth as a reason for the Python team to
> drop everything and fix all wide unicode bugs?

No, I am not asking people to shift priorities but just trying to figure
out what can be done (given the time I have to spend on the subject) and
what can't be done.

Thanks to the help of this list, I can see that this includes several layers:

1) Core Relax NG

Should be fine for both ucs2 and ucs4 platforms if I follow the
suggestion from Martin about NCNames (or use the patch he proposes).

2) Length facet

Martin's smart_len alternative works just fine.

3) Pattern facet

I could propose that people wanting full conformance use the libxml bindings,
and see what I can propose built on the builtin re module for those who
don't want to make the effort of installing libxml.

All this without moving any priority and thanks to the reactivity of
XML-SIG!

Thanks

Eric
--
Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

From mark@mceahern.com Fri Sep 27 01:08:48 2002
From: mark@mceahern.com (Mark McEahern)
Date: Thu, 26 Sep 2002 19:08:48 -0500
Subject: [XML-SIG] well-formed xml
Message-ID:

I'm obviously missing something because this seemingly innocent chunk of
xhtml:

    from xml.dom import minidom

    s = "search"
    # ^
    # - seems to be the problem
    #
    # maybe it thinks I'm trying to reference the &q entity?

    doc = minidom.parseString(s)

Exception traceback follows. Is there a way for me to tell it to ignore
apparent entity references inside attribute values?

// m

$ python junk.py
Traceback (most recent call last):
  File "junk.py", line 5, in ?
    doc = minidom.parseString(s)
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 965, in parseString
    return _doparse(pulldom.parseString, args, kwargs)
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 952, in _doparse
    toktype, rootNode = events.getEvent()
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/pulldom.py", line 256, in getEvent
    self.parser.feed(buf)
  File "/usr/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py", line 148, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: :1:41: not well-formed (invalid token)

This is with Python 2.2.1 without PyXML installed separately. The same thing
happens with PyXML 0.8.1:

$ python junk.py
Traceback (most recent call last):
  File "junk.py", line 5, in ?
    doc = minidom.parseString(s)
  File "/usr/local/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 1605, in parseString
    return expatbuilder.parseString(string)
  File "/usr/local/lib/python2.2/site-packages/_xmlplus/dom/expatbuilder.py", line 943, in parseString
    return builder.parseString(string)
  File "/usr/local/lib/python2.2/site-packages/_xmlplus/dom/expatbuilder.py", line 189, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 41

-

From Matt Gushee Fri Sep 27 01:21:22 2002
From: Matt Gushee (Matt Gushee)
Date: Thu, 26 Sep 2002 18:21:22 -0600
Subject: [XML-SIG] well-formed xml
In-Reply-To:
References:
Message-ID: <20020927002121.GA8500@swordfish>

On Thu, Sep 26, 2002 at 07:08:48PM -0500, Mark McEahern wrote:
> I'm obviously missing something because this seemingly innocent chunk of
> xhtml:
>
> from xml.dom import minidom
>
> s = "search"
> # ^
> # - seems to be the problem
> #
> # maybe it thinks I'm trying to reference the &q entity?

Yep. You need to escape that ampersand:

    ...?hl=en&amp;q=...
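
A small, self-contained sketch of the problem and the fix described above. The URL is a made-up stand-in, since the original string did not survive in the archive; the calls themselves (minidom.parseString and saxutils.quoteattr) are standard library functions in current Python versions.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError
from xml.sax.saxutils import quoteattr

# Hypothetical URL with a bare ampersand in the query string.
url = "http://example.org/search?hl=en&q=python"

# Pasting the URL in as-is yields a document that is not well-formed.
bad = '<a href="%s">search</a>' % url
try:
    minidom.parseString(bad)
except ExpatError as exc:
    print("rejected:", exc)

# quoteattr() escapes the ampersand and adds the surrounding quotes.
good = "<a href=%s>search</a>" % quoteattr(url)
doc = minidom.parseString(good)
print(good)
print(doc.documentElement.getAttribute("href"))
```

The parser hands the attribute value back with a plain "&", so the escaping is purely a matter of how the document is written, not of what the application sees.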

--
Matt Gushee
Englewood, Colorado, USA
mgushee@havenrock.com
http://www.havenrock.com/

From mike@skew.org Fri Sep 27 06:33:12 2002
From: mike@skew.org (Mike Brown)
Date: Thu, 26 Sep 2002 23:33:12 -0600 (MDT)
Subject: [XML-SIG] well-formed xml
In-Reply-To: "from Mark McEahern at Sep 26, 2002 07:08:48 pm"
Message-ID: <200209270533.g8R5XCdW082904@chilled.skew.org>

Mark McEahern wrote:
> I'm obviously missing something because this seemingly innocent chunk of
> xhtml:
>
> from xml.dom import minidom
>
> s = "search"

FAQ. It's not well-formed XML. In an XML document, a bare "&" always denotes
the beginning of an entity reference, unless it is in a CDATA section.

HTML has a similar rule, but you're allowed to get away with bare ampersands
in part because HTML has a fixed set of entities (so a reference to one that's
unknown is probably not a reference at all, therefore "&amp;" can be assumed),
and in part because HTML browsers are not required to report such things as
errors (lenience=easier document authoring and more usable documents), whereas
XML parsers are required to do so (stricter rules force more predictable
documents, for easier processing).

Please be aware that things like

 - 'raw' characters vs numeric character references vs entity references,
 - whether or not character data is in CDATA sections,
 - the character-to-byte encoding of the document,
 - attribute order,
 - the type of quotes around attribute values,
 - whitespace between attributes in an element's start tag,
 - extraneous whitespace in attribute values, and
 - whether an empty element is written as a single empty-element tag or as
   a start tag immediately followed by an end tag,

are all considered lexical fluff, things that have no bearing on what
semantic, logical information is carried in the document. It is the parser's
job to see past all that stuff and just tell the application what the
important bits are: the hierarchy of elements, attributes, character data, and
processing instructions.

HTML processors do pretty much the same thing. Thus it is more correct to use
"&amp;" in HTML where an ampersand is *meant*, even though you can often get
away with a bare one.

 - Mike
____________________________________________________________________________
mike j. brown                 | xml/xslt: http://skew.org/xml/
denver/boulder, colorado, usa | resume: http://skew.org/~mike/resume/

From Juergen Hermann" Message-ID:

On Thu, 26 Sep 2002 19:08:48 -0500, Mark McEahern wrote:

>I'm obviously missing something because this seemingly innocent chunk of
>xhtml:
>
> from xml.dom import minidom
>
> s = "search"

This is not well-formed and thus not XHTML.

    ... en&amp;q= ...

is the correct form.

Ciao, Jürgen

--
Jürgen Hermann, Developer
WEB.DE AG, http://webde-ag.de/

From ht@cogsci.ed.ac.uk Fri Sep 27 11:55:29 2002
From: ht@cogsci.ed.ac.uk (Henry S. Thompson)
Date: 27 Sep 2002 11:55:29 +0100
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: <1033045636.23902.438.camel@ibook>
References: <3D921F0A.9050209@lemburg.com> <1033024134.23888.47.camel@ibook> <1033043569.23888.390.camel@ibook> <20020926084840.A28714@redhat.com> <1033045636.23902.438.camel@ibook>
Message-ID:

Eric van der Vlist writes:

> On Thu, 2002-09-26 at 14:48, Daniel Veillard wrote:
>
> > Hum, I think you would need a rewrite anyway for full conformance,
> > the XML Schemas regexp have more complext constructs than standard regexps
> > the quantifiers may be more rich (not 100% sure I didn't checked fully)
> > and all the character classes/group/category/blocks are not part of
> > "normal" regexps (well I never saw any such description in regexps help
> > or man before, so I doubt it appeared magically in python).
>
> I am not 100% sure...
>
> The quantifiers are "*", "+", "?" and the "{}" constructs and they seem
> to work fine in Python...
>
> As for the character classes/group/category/blocks, I was wondering if
> they couldn't be described and generated with chargen.py.
>
> A preparsing of the W3C XML Schema patterns could then be done to use
> them.
>
> What might be tougher are the features such as the complement of classes
> (ex: [\p{IsBasicLatin}-[^\p{L}]]) which AFAIK is an extension over perl
> regexps.

So I think the correct answer is to pre-process the XML Schema regexps
into Python regexps. It's not terribly hard to do this, and I think the
result is likely to be efficient enough to be acceptable.

For example, I append herewith my semi-automatically generated
translation of the NCName pattern constraint from the W3C XML Schema REC
([\i-[:]][\c-[:]]*), from the forthcoming partial support for pattern in
XSV. Adding this to XSV added less than 5% to overall processing time for
the validation of large schemas (where every name is of type NCName, so a
lot of use is made of this pattern).

Not all subtraction will be as easy to handle (or will it . . .?), I
agree.

Note, further to an observation Eric made some time back, that
pre-compilation involves _both_ Unicode 3.1 classes _and_ XML 1.0 2e
classes -- the pattern below was produced by mechanically recovering the
class definitions from the XML 1.0 2e REC itself (XML version, of course
:-)

ht

[this is a pseudo-dump of the internal form of the NCName built-in simple
type from the next XSV release]

--
Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
W3C Fellow 1999--2002, part-time member of W3C Team
2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]

From vdv@dyomedea.com Fri Sep 27 12:48:14 2002
From: vdv@dyomedea.com (Eric van der Vlist)
Date: 27 Sep 2002 13:48:14 +0200
Subject: [XML-SIG] Ann: xvif 0.2.0
Message-ID: <1033127295.27425.171.camel@ibook>

I am happy to announce the release 0.2.0 of my xvif (XML Validation
Interoperability Framework) including a partial implementation of Relax
NG and a very partial implementation of W3C XML Schema datatypes.

The major change in this version is a clean up of the xvif syntax (the
xvif "pipe" element now behaves as a full class Relax NG pattern) which
becomes slightly more verbose but allows one to write schemas with
"fallback" Relax NG patterns which fully conform to Relax NG and yet can
define pipes which will be used by xvif processors.

The test suites can now be browsed online and links have been created
between the online validator, the strawman and the test cases.

Links:

http://downloads.xmlschemata.org/python/xvif/
http://downloads.xmlschemata.org/python/xvif/xvif.html
http://downloads.xmlschemata.org/python/xvif/tryMe.cgi
http://downloads.xmlschemata.org/python/xvif-0.2.0.tgz
http://ibook.paris.dyomedea.com/downloads/python/xvif/tests/

Maybe more interesting for this list even though in a very early stage, I
have started to implement some minimal support of datatype libraries.

This should be considered as a proof of concept, but I am trying to see to
what extent simple types can be assimilated to Python classes and builtin
types and, so far, I quite like the result.

For a simple type library such as Relax NG core datatypes, the definition
of the 2 builtin types is as simple as:

    import re
    import string

    class stringType(unicode):
        """
        This class is strictly identical to the python's unicode type
        """

    class tokenType(unicode):

        def __new__(cls, value=""):
            # collapse runs of whitespace, then trim both ends
            return unicode.__new__(cls, string.strip(re.sub("[\n\t ]+", " ", value)))

(http://downloads.xmlschemata.org/python/xvif/rngCoreTypeLib.py)

My Relax NG implementation uses a dictionary associating a library URI
and a module and does all the associations by introspection of the
module: the name of the types match the name of the classes defined in
this module (modulo a suffix) and the validation is done by creating a
new instance in a "try/except" block.

Things are sometimes less simple for W3C XML Schema and its facets but
still I think that the approach is interesting.
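
The lookup-by-introspection scheme described above might be sketched as
follows. This is only an illustration in current Python 3 (str in place of
unicode), not xvif's actual code: the library URI, the CoreTypes class
standing in for a module, the simplified byteType and the is_valid() helper
are all hypothetical names made up for the example.

```python
import re

class CoreTypes:
    """Stand-in for a datatype library module (hypothetical)."""

    class stringType(str):
        """Identical to the built-in str type."""

    class tokenType(str):
        def __new__(cls, value=""):
            # Collapse runs of whitespace, then trim both ends.
            return super().__new__(cls, re.sub(r"[\n\t ]+", " ", value).strip())

    class byteType(int):
        """Simplified bounded integer, only here so that validation can fail."""
        def __new__(cls, value):
            number = super().__new__(cls, value)
            if not -128 <= number <= 127:
                raise ValueError("out of range for byte: %r" % (value,))
            return number

# Library URI -> namespace holding the type classes (the URI is invented).
LIBRARIES = {"urn:example:datatypes": CoreTypes}

def is_valid(library_uri, type_name, lexical_value):
    """Find <type_name>Type by introspection and try to instantiate it."""
    cls = getattr(LIBRARIES[library_uri], type_name + "Type")
    try:
        cls(lexical_value)
    except ValueError:
        return False
    return True

print(is_valid("urn:example:datatypes", "byte", "100"))
print(is_valid("urn:example:datatypes", "byte", "200"))
print(repr(CoreTypes.tokenType("  a \n\t b ")))
```

The appeal of this design is that adding a datatype means adding a class:
no separate registration table has to be kept in step with the code.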

(http://downloads.xmlschemata.org/python/xvif/wxsTypeLib.py)

The definition of xs:integer for instance matches quite well the Python
"long" builtin type:

    class integerType(long, _Numeric):
        """
        """

(_Numeric is a generic class with the definition of facets common to the
numeric types.)

and a type such as byte can be defined as:

    class byteType(_Bounded, intType):
        min = -128
        max = 127

I don't know if there will be concrete applications, but I like the idea
that I can use these types by themselves like:

    >>> import wxsTypeLib
    >>> x = wxsTypeLib.byteType(1)
    >>> x = wxsTypeLib.byteType(100+x)
    >>> x = wxsTypeLib.byteType(100+x)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "./wxsTypeLib.py", line 79, in __init__
        raise ValueError
    ValueError

Things are more tricky for types such as decimals or datetimes, but they
could be useful by themselves too.

Finally, about unicode support, you won't find yet much of what has been
recently discussed in this version with the only exception of
smart_len()...

(http://downloads.xmlschemata.org/python/xvif/Smart_len.py)

More will come in next versions!

Thanks for your help (and of course, your comments are welcome),

Eric
--
Rendez-vous à Paris.
http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

From tpassin@comcast.net Fri Sep 27 14:04:54 2002
From: tpassin@comcast.net (Thomas B. Passin)
Date: Fri, 27 Sep 2002 09:04:54 -0400
Subject: [XML-SIG] Re: Issues with Unicode type
References: <3D921F0A.9050209@lemburg.com> <1033024134.23888.47.camel@ibook>
 <1033043569.23888.390.camel@ibook> <20020926084840.A28714@redhat.com>
 <1033045636.23902.438.camel@ibook>
Message-ID: <000c01c26626$7902eb20$fe193044@tbp1>

[Henry S.
Thompson]

Wow, Henry, this is a real contribution - who would want to compile that RE
by hand? Is there not a typo in the last bit, though?

> > > > > >

The XML Namespaces Rec says that an NCName is

    NCName ::= (Letter | '_') (NCNameChar)*

but you have it equivalent to

    NCName ::= (NCNameChar) (NCNameChar)*

Cheers,

Tom P

From uche.ogbuji@fourthought.com Fri Sep 27 18:37:13 2002
From: uche.ogbuji@fourthought.com (Uche Ogbuji)
Date: 27 Sep 2002 11:37:13 -0600
Subject: [XML-SIG] Second Python/XML column article out
Message-ID: <1033148239.5892.4618.camel@malatesta>

On XML.com. This time I lead a tour of PyXML.

http://www.xml.com/pub/a/2002/09/25/py.html

--
Uche Ogbuji
Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html

From marklists@mceahern.com Fri Sep 27 20:31:10 2002
From: marklists@mceahern.com (Mark McEahern)
Date: Fri, 27 Sep 2002 14:31:10 -0500
Subject: [XML-SIG] well-formed xml
In-Reply-To:
Message-ID:

[Mike Rovner]
> IIRC, double quotes are the must:
> s = 'search'

Thanks for the reply. It seems to be the ampersand:

    from xml.dom import minidom

    s = "search"
    # escape every bare ampersand as an entity reference
    s2 = s.replace("&", "&amp;")

    for x in (s, s2):
        try:
            doc = minidom.parseString(x)
        except Exception, e:
            print "Failed: %s" % x
        else:
            print "Succeeded: %s" % x

Output:

    Failed: search
    Succeeded: search

// m

From mike@bindkey.com Fri Sep 27 20:20:12 2002
From: mike@bindkey.com (Mike Rovner)
Date: Fri, 27 Sep 2002 12:20:12 -0700
Subject: [XML-SIG] well-formed xml
References:
Message-ID:

IIRC, double quotes are the must:

    s = 'search'

Cheers,
Mike

"Juergen Hermann" wrote in message news:E17uppg-0007LF-00@smtp.web.de...
On Thu, 26 Sep 2002 19:08:48 -0500, Mark McEahern wrote:

>I'm obviously missing something because this seemingly innocent chunk of
>xhtml:
>
>  from xml.dom import minidom
>
>  s = "search"

This is not well-formed and thus not XHTML.

... en&amp;q= ...

is the correct form.

From Matt Gushee Fri Sep 27 20:37:33 2002
From: Matt Gushee (Matt Gushee)
Date: Fri, 27 Sep 2002 13:37:33 -0600
Subject: [XML-SIG] well-formed xml
In-Reply-To:
References:
Message-ID: <20020927193732.GE1574@swordfish>

On Fri, Sep 27, 2002 at 02:31:10PM -0500, Mark McEahern wrote:
> [Mike Rovner]
> > IIRC, double quotes are the must:
> > s = 'search'

Just for the record, both Python and XML allow you to use single and
double quotes interchangeably--keeping in mind, of course, that strings
within strings need to either use a different type of quote than the
outer string, or have their quotes escaped.

--
Matt Gushee
Englewood, Colorado, USA
mgushee@havenrock.com
http://www.havenrock.com/

From ht@cogsci.ed.ac.uk Sat Sep 28 14:03:26 2002
From: ht@cogsci.ed.ac.uk (Henry S. Thompson)
Date: 28 Sep 2002 14:03:26 +0100
Subject: [XML-SIG] Re: Issues with Unicode type
In-Reply-To: <000c01c26626$7902eb20$fe193044@tbp1>
References: <3D921F0A.9050209@lemburg.com> <1033024134.23888.47.camel@ibook>
 <1033043569.23888.390.camel@ibook> <20020926084840.A28714@redhat.com>
 <1033045636.23902.438.camel@ibook> <000c01c26626$7902eb20$fe193044@tbp1>
Message-ID:

"Thomas B. Passin" writes:

> [Henry S. Thompson]
>
> Wow, Henry, this is a real contribution - who would want to compile that RE
> by hand? Is there not a typo in the last bit, though?
> > > > > > > > > > > > > > > The XML Namespaces Rec says that an NCName is > > NCName ::= (Letter | '_') (NCNameChar)* > > but you have it equivalent to > > NCName ::= (NCNameChar) (NCNameChar)* I _think_ I have that right -- Name is defined as > > SQL 'CREATE TABLE' statements) gnosis.util.sql2dtd (SQL query -> DTD for query results) gnosis.util.xml2sql (XML -> SQL 'INSERT INTO' statements) gnosis.util.combinators (Combinatorial higher-order functions) gnosis.util.introspect (Introspect Python objects) ...and so much more! :-) ------------------------------------------------------------------------ This release contains a variety of minor bugfixes contributed by users: * gnosis.xml.validity accidentally left out of 1.0.3 distribution. * gnosis.introspect has minor improvements/corrections. * User contributed, but not really tested, improvements to gnosis.indexer. Let me know if this breaks something (it has to do with reindexing working correctly). * Cleanup of gnosis.xml.objectify bug introduced in 1.0.3 (problem was detecting file object rather than string or DOM). * Probably something else I forgot. It may be obtained at: http://gnosis.cx/download/Gnosis_Utils-1.0.4.tar.gz The current release is always available as: http://gnosis.cx/download/Gnosis_Utils-current.tar.gz Try it out, have fun, send feedback! David Mertz (mertz@gnosis.cx) From uche.ogbuji@fourthought.com Mon Sep 30 07:38:54 2002 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: 30 Sep 2002 00:38:54 -0600 Subject: [XML-SIG] Python interface to mnogosearch Message-ID: <1033367936.12965.1900.camel@malatesta> Does anyone know of such a beast? Looks like there are Perl and PHP interfaces. http://www.mnogosearch.org/ -- Uche Ogbuji Fourthought, Inc. 
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html

From bill@rfa.org Mon Sep 30 09:18:26 2002
From: bill@rfa.org (Bill Eldridge)
Date: Mon, 30 Sep 2002 10:18:26 +0200
Subject: [XML-SIG] Re: [4suite] Python interface to mnogosearch
References: <1033367936.12965.1900.camel@malatesta>
Message-ID: <3D9808D2.6060002@rfa.org>

No, but the .pm files for udm-perl are only 449 lines long, 561 lines
with news extensions. A line-for-line conversion to Python would be
extremely easy, no more than a day with lots of coffee breaks.

Lots and lots of this kind of stuff:

    $str =~ s/\$f/$template_env{from1}/gs;
    $str =~ s/\$l/$template_env{to}/gs;
    $str =~ s/\$t/$template_env{found}/gs;
    $str =~ s/\$A/$template_env{self}/gs;
    $str =~ s/\$Q/$template_env{query}/gs;

Uche Ogbuji wrote:

> Does anyone know of such a beast? Looks like there are Perl and PHP
> interfaces.
>
> http://www.mnogosearch.org/
>

--
Bill Eldridge
Radio Free Asia
bill@rfa.org

From noreply@sourceforge.net Mon Sep 30 10:04:34 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Mon, 30 Sep 2002 02:04:34 -0700
Subject: [XML-SIG] [ pyxml-Bugs-616431 ] prepare_input_source and relative path
Message-ID:

Bugs item #616431, was opened at 2002-09-30 01:04
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=106473&aid=616431&group_id=6473

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Stéphane Bidoul (sbidoul)
Assigned to: Nobody/Anonymous (nobody)
Summary: prepare_input_source and relative path

Initial Comment:
I'm trying to upgrade to PyXML 0.8.1 from 0.7.

Now that expatreader tries to resolve external entities by default (which
is a good thing), I'm discovering problems with relative references to
the external subset.

The problem lies in prepare_input_source that reports:

[...]
  File "C:\soft\Python22\lib\site-packages\_xmlplus\sax\saxutils.py", line 465, in prepare_input_source
    f = urllib2.urlopen(source.getSystemId())
  File "c:\soft\python22\lib\urllib2.py", line 138, in urlopen
    return _opener.open(url, data)
  File "c:\soft\python22\lib\urllib2.py", line 320, in open
    type_ = req.get_type()
  File "c:\soft\python22\lib\urllib2.py", line 224, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: ../../../../test/suitedata/lib/suite.dtd

At the end of prepare_input_source, there is:

        if os.path.isfile(sysid):
            basehead = os.path.split(os.path.normpath(base))[0]
            source.setSystemId(os.path.join(basehead, sysid))
            f = open(sysid, "rb")
        else:
            source.setSystemId(urlparse.urljoin(base, sysid))
            f = urllib2.urlopen(source.getSystemId())

In my case:

    sysid = ..\lib\suite.dtd
    base = ..\..\..\..\test\suitedata\cases\cases1.xml

In my test, since the current directory is not the same as base's
directory, os.path.isfile("..\lib\suite.dtd") fails, but
os.path.isfile("..\..\..\..\test\suitedata\cases\..\lib\suite.dtd")
succeeds.

Here is a proposed patch that tries to open as a file also when the sysid
is relative AND the base is a file.

***************
*** 454,463 ****
      if source.getByteStream() is None:
          sysid = source.getSystemId()
!         if os.path.isfile(sysid):
              basehead = os.path.split(os.path.normpath(base))[0]
              source.setSystemId(os.path.join(basehead, sysid))
!             f = open(sysid, "rb")
          else:
              source.setSystemId(urlparse.urljoin(base, sysid))
              f = urllib2.urlopen(source.getSystemId())
--- 454,464 ----
      if source.getByteStream() is None:
          sysid = source.getSystemId()
!         if os.path.isfile(sysid) or \
!            (sysid.startswith(".") and os.path.isfile(base)):
              basehead = os.path.split(os.path.normpath(base))[0]
              source.setSystemId(os.path.join(basehead, sysid))
!             f = open(source.getSystemId(), "rb")
          else:
              source.setSystemId(urlparse.urljoin(base, sysid))
              f = urllib2.urlopen(source.getSystemId())

This fix works for me, although it is probably not the optimal
solution...

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=106473&aid=616431&group_id=6473

From scjuonline@web.de Mon Sep 30 10:51:57 2002
From: scjuonline@web.de (Jürgen Schmidt)
Date: Mon, 30 Sep 2002 11:51:57 +0200
Subject: [XML-SIG] replace ENTITY_NODE ?
Message-ID: <200209301151.57412.scjuonline@web.de>

Dear List,

I'm trying to change an entity declaration within the internal subset of
the DTD using DOM.

The first problem I face is that ENTITY_NODEs won't show up in my
DOM-Document if I use Reader.fromStream(...).

The second: will I be able to replace this Node once it shows up or is it
read-only?

thx for your help
Juergen

From martin@v.loewis.de Mon Sep 30 14:06:39 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 30 Sep 2002 15:06:39 +0200
Subject: [XML-SIG] replace ENTITY_NODE ?
In-Reply-To: <200209301151.57412.scjuonline@web.de>
References: <200209301151.57412.scjuonline@web.de>
Message-ID:

Jürgen Schmidt writes:

> The first problem I face is that ENTITY_NODEs wont show up im my
> DOM-Document if I use Reader.fromStream(...).

In general, processing of the DTD is very weak in the DOM; it is
particularly weak in PyXML, and very weak in PyXML < 0.8.1. What version
have you been using? What XML parser?

> The second: will I be able to replace this Node once it shows up or
> is it read-only?

In the DOM, the entities attribute of the DocumentType interface is
readonly, see

http://www.w3.org/TR/2002/WD-DOM-Level-3-Core-20020409/core.html#ID-412266927

Editing an individual Entity is not supported, either.

Of course, when you come up with patches to PyXML that go beyond those
specified interfaces, in a canonical way, without breaking anything, we'd
happily include those in PyXML 0.8.2.

Regards,
Martin
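
The read-only behaviour can be observed with the minidom that ships with a
current Python 3, which is an assumption on top of this thread: the PyXML
4DOM Reader discussed above may behave differently. In this small sketch
the document and the "greeting" entity are invented for illustration:

```python
import xml.dom
from xml.dom import minidom

# A made-up document with one entity declared in the internal subset.
SOURCE = """<!DOCTYPE note [
  <!ENTITY greeting "hello">
]>
<note>&greeting;</note>"""

doc = minidom.parseString(SOURCE)
entities = doc.doctype.entities  # NamedNodeMap of the declared entities

names = [entities.item(i).nodeName for i in range(entities.length)]
print("declared entities:", names)

# The DOM specifies this map as read-only, so replacing an entry is refused.
try:
    entities.setNamedItem(doc.createTextNode("x"))
    replaced = True
except xml.dom.NoModificationAllowedErr:
    replaced = False
print("entity map accepted a replacement:", replaced)
```

Anyone who really needs to change a declaration is therefore likely to end
up rewriting the internal subset as text and re-parsing the document.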