From tismer@tismer.com Mon May 1 13:03:03 2000
From: tismer@tismer.com (Christian Tismer)
Date: Mon, 01 May 2000 14:03:03 +0200
Subject: [XML-SIG] Windows install problems
References: <01BFB306.0859C5E0@JOHAN>
Message-ID: <390D7277.1D60B44B@tismer.com>

Johan De Smedt wrote:
>
> Hi,
>
> I've had the following problem while trying to install the python xml package on windows NT:

I'd recommend using my Windows installer.
Please report any problems to me.

http://www.tismer.com/xml/ contains my build from April 18.

ciao - chris

--
Christian Tismer :^)
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaunstr. 26 : *Starship* http://starship.python.net
14163 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
where do you want to jump today? http://www.stackless.com

From paul@prescod.net Mon May 1 21:38:29 2000
From: paul@prescod.net (Paul Prescod)
Date: Mon, 01 May 2000 15:38:29 -0500
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>
Message-ID: <390DEB45.D8D12337@prescod.net>

Uche asked for a summary so I cc:ed the xml-sig.

Guido van Rossum wrote:
>
> ...
>
> OK. I really meant recoding in UTF-8 -- I maintain that there are
> lots of forces that prevent recoding most ISO-2022-JP documents in
> UTF-8.

Absolutely agree.

> Are you sure you understand what we are arguing about?

Here's what I thought we were arguing about:

If you put a bunch of "funny characters" into a Python string literal, and then compare that string literal against a Unicode object, should those funny characters be treated as logical units of text (characters) or as bytes?
And if bytes, should some transformation be automatically performed to have those bytes be reinterpreted as characters according to some particular encoding scheme (probably UTF-8)?

I claim that we should *as far as possible* treat strings as character lists and not add any new functionality that depends on them being byte lists. Ideally, we could add a byte array type and start deprecating the use of strings in that manner. Yes, it will take a long time to fix this bug but that's what happens when good software lives a long time and the world changes around it.

> Earlier, you quoted some reference documentation that defines 8-bit
> strings as containing characters. That's taken out of context -- this
> was written in a time when there was (for most people anyway) no
> difference between characters and bytes, and I really meant bytes.

Actually, I think that that was Fredrik.

Anyhow, you wrote the documentation that way because it was the most intuitive way of thinking about strings. It remains the most intuitive way. I think that that was the point Fredrik was trying to make.

We can't make "byte-list" strings go away soon but we can start moving people towards the "character-list" model. In concrete terms I would suggest that old fashioned lists be automatically coerced to Unicode by interpreting each byte as a Unicode character. Trying to go the other way could cause the moral equivalent of an OverflowError but that's not a problem.

>>> a=1000000000000000000000000000000000000L
>>> int(a)
Traceback (innermost last):
  File "", line 1, in ?
OverflowError: long int too long to convert

And just as with ints and longs, we would expect to eventually unify strings and unicode strings (but not byte arrays).

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only communication coin we can count on.
- http://www.cs.yale.edu/~perlis-alan/quotes.html

From guido@python.org Mon May 1 22:32:38 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 01 May 2000 17:32:38 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Mon, 01 May 2000 15:38:29 CDT." <390DEB45.D8D12337@prescod.net>
References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net>
Message-ID: <200005012132.RAA23319@eric.cnri.reston.va.us>

> > Are you sure you understand what we are arguing about?
>
> Here's what I thought we were arguing about:
>
> If you put a bunch of "funny characters" into a Python string literal,
> and then compare that string literal against a Unicode object, should
> those funny characters be treated as logical units of text (characters)
> or as bytes? And if bytes, should some transformation be automatically
> performed to have those bytes be reinterpreted as characters according
> to some particular encoding scheme (probably UTF-8).
>
> I claim that we should *as far as possible* treat strings as character
> lists and not add any new functionality that depends on them being byte
> list. Ideally, we could add a byte array type and start deprecating the
> use of strings in that manner. Yes, it will take a long time to fix this
> bug but that's what happens when good software lives a long time and the
> world changes around it.
>
> > Earlier, you quoted some reference documentation that defines 8-bit
> > strings as containing characters. That's taken out of context -- this
> > was written in a time when there was (for most people anyway) no
> > difference between characters and bytes, and I really meant bytes.
>
> Actually, I think that that was Fredrik.

Yes, I came across the post again later. Sorry.
> Anyhow, you wrote the documentation that way because it was the most
> intuitive way of thinking about strings. It remains the most intuitive
> way. I think that that was the point Fredrik was trying to make.

I just wish he made the point more eloquently. The eff-bot seems to be in a crunchy mood lately...

> We can't make "byte-list" strings go away soon but we can start moving
> people towards the "character-list" model. In concrete terms I would
> suggest that old fashioned lists be automatically coerced to Unicode by
> interpreting each byte as a Unicode character. Trying to go the other
> way could cause the moral equivalent of an OverflowError but that's not
> a problem.
>
> >>> a=1000000000000000000000000000000000000L
> >>> int(a)
> Traceback (innermost last):
>   File "", line 1, in ?
> OverflowError: long int too long to convert
>
> And just as with ints and longs, we would expect to eventually unify
> strings and unicode strings (but not byte arrays).

OK, you've made your claim -- like Fredrik, you want to interpret 8-bit strings as Latin-1 when converting (not just comparing!) them to Unicode.

I don't think I've heard a good *argument* for this rule though. "A character is a character is a character" sounds like an axiom to me -- something you can't prove or disprove rationally.

I have a bunch of good reasons (I think) for liking UTF-8: it allows you to convert between Unicode and 8-bit strings without losses, Tcl uses it (so displaying Unicode in Tkinter *just* *works*...), it is not Western-language-centric.

Another reason: while you may claim that your (and /F's, and Just's) preferred solution doesn't enter into the encodings issue, I claim it does: Latin-1 is just as much an encoding as any other one.

I claim that as long as we're using an encoding we might as well use the most accepted 8-bit encoding of Unicode as the default encoding.
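
The two readings of an 8-bit value argued over here can be put side by side. What follows is only a minimal sketch in present-day Python 3, which is an assumption: the thread concerns Python 1.6, where bytes and text were not yet separate types. The sample values are illustrative and do not come from the messages.

```python
# Sketch in present-day Python 3, where bytes and str are separate types.
# The sample values are illustrative, not taken from the thread.

raw = b"\xa4"  # one 8-bit value, written "\244" in octal

# Reading the byte as Latin-1 maps it straight to code point 0xA4.
as_latin1 = raw.decode("latin-1")

# Reading the same byte as UTF-8 fails: 0xA4 is a continuation byte
# and cannot start a character.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# UTF-8 can encode every code point, so text -> bytes -> text is lossless.
text = "\u20ac and \u65e5\u672c"  # euro sign and two CJK ideographs
round_trip = text.encode("utf-8").decode("utf-8")

# Latin-1 only covers code points below 256, so the same text cannot
# be narrowed to it.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(repr(as_latin1), utf8_ok, round_trip == text, latin1_ok)
```

Neither reading is free: Latin-1 always decodes but cannot carry arbitrary Unicode text back into 8 bits, while UTF-8 carries everything but rejects some byte sequences.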
I also think that the issue is blown out of proportions: this ONLY happens when you use Unicode objects, and it ONLY matters when some other part of the program uses 8-bit string objects containing non-ASCII characters. Given the long tradition of using different encodings in 8-bit strings, at that point it is anybody's guess what encoding is used, and UTF-8 is a better guess than Latin-1.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us>
Message-ID: <017d01bfb3bc$c3734c00$34aab5d4@hagrid>

Guido van Rossum wrote:
> I just wish he made the point more eloquently. The eff-bot seems to
> be in a crunchy mood lately...

I've posted a few thousand messages on this topic, most of which seem to have been ignored. if you'd read all my messages, and seen all the replies, you'd be cranky too...

> I don't think I've heard a good *argument* for this rule though. "A
> character is a character is a character" sounds like an axiom to me --
> something you can't prove or disprove rationally.

maybe, but it's a darn good axiom, and it's used by everyone else. Perl uses it, Tcl uses it, XML uses it, etc. see:

http://www.python.org/pipermail/python-dev/2000-April/005218.html

> I have a bunch of good reasons (I think) for liking UTF-8: it allows
> you to convert between Unicode and 8-bit strings without losses, Tcl
> uses it (so displaying Unicode in Tkinter *just* *works*...), it is
> not Western-language-centric.

the "Tcl uses it" is a red herring -- their internal implementation uses 16-bit integers, and the external interface works very hard to keep the "strings are character sequences" illusion.
in other words, the length of a string is *always* the number of characters, the character at index i is *always* the i'th character in the string, etc.

that's not true in Python 1.6a2.

(as for Tkinter, you only have to add 2-3 lines of code to make it use 16-bit strings instead...)

> Another reason: while you may claim that your (and /F's, and Just's)
> preferred solution doesn't enter into the encodings issue, I claim it
> does: Latin-1 is just as much an encoding as any other one.

this is another red herring: my argument is that 8-bit strings should contain unicode characters, using unicode character codes. there should be only one character repertoire, and that repertoire is unicode. for a definition of these terms, see:

http://www.python.org/pipermail/python-dev/2000-April/005225.html

obviously, you can only store 256 different values in a single 8-bit character (just like you can only store 4294967296 different values in a single 32-bit int).

to store larger values, use unicode strings (or long integers).

conversion from a small type to a large type always works, conversion from a large type to a small one may result in an OverflowError.

it has nothing to do with encodings.

> I claim that as long as we're using an encoding we might as well use
> the most accepted 8-bit encoding of Unicode as the default encoding.

yeah, and I claim that it won't fly, as long as it breaks the "strings are character sequences" rule used by all other contemporary (and competing) systems.

(if you like, I can post more "fun with unicode" messages ;-)

and as I've mentioned before, there are (at least) two ways to solve this:

1. teach 8-bit strings about UTF-8 (this is how it's done in Tcl and Perl). make sure len(s) returns the number of characters in the string, make sure s[i] returns the i'th character (not necessarily starting at the i'th byte, and not necessarily one byte), etc.
   to make this run reasonably fast, use as many implementation tricks as you can come up with (I've described three ways to implement this in an earlier post).

2. define 8-bit strings as holding an 8-bit subset of unicode: ord(s[i]) is a unicode character code, whether s is an 8-bit string or a unicode string.

for alternative 1 to work, you need to add some way to explicitly work with binary strings (like it's done in Perl and Tcl).

alternative 2 doesn't need that; 8-bit strings can still be used to hold any kind of binary data, as in 1.5.2. just keep in mind you cannot use all methods on such an object...

> I also think that the issue is blown out of proportions: this ONLY
> happens when you use Unicode objects, and it ONLY matters when some
> other part of the program uses 8-bit string objects containing
> non-ASCII characters. Given the long tradition of using different
> encodings in 8-bit strings, at that point it is anybody's guess what
> encoding is used, and UTF-8 is a better guess than Latin-1.

I still think it's very unfortunate that you think that unicode strings are a special kind of strings. Perl and Tcl don't, so why should we?

From paul@prescod.net Tue May 2 01:19:20 2000
From: paul@prescod.net (Paul Prescod)
Date: Mon, 01 May 2000 19:19:20 -0500
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us>
Message-ID: <390E1F08.EA91599E@prescod.net>

Sorry for the long message. Of course you need only respond to that which is interesting to you. I don't think that most of it is redundant.

Guido van Rossum wrote:
>
> ...
>
> OK, you've made your claim -- like Fredrik, you want to interpret
> 8-bit strings as Latin-1 when converting (not just comparing!)
> them to Unicode.

If the user provides an explicit conversion function (e.g. UTF-8-decode) then of course we should use that function. Under my character is a character is a character model, this "conversion" is morally equivalent to ROT-13, strupr or some other text->text translation. So you could apply UTF-8-decode even to a Unicode string as long as each character in the string has ord()<256 (so that it could be interpreted as a character representation for a byte).

> I don't think I've heard a good *argument* for this rule though. "A
> character is a character is a character" sounds like an axiom to me --
> something you can't prove or disprove rationally.

I don't see it as an axiom, but rather as a design decision you make to keep your language simple. Along the lines of "all values are objects" and (now) all integer values are representable with a single type. Are you happy with this?

a="\244"
b=u"\244"
assert len(a)==len(b)
assert ord(a[0])==ord(b[0])

# same thing, right?

print b==a
# Traceback (most recent call last):
#   File "", line 1, in ?
# UnicodeError: UTF-8 decoding error: unexpected code byte

If I type "\244" it means I want character 244, not the first half of a UTF-8 escape sequence. "\244" is a string with one character. It has no encoding. It is not latin-1. It is not UTF-8. It is a string with one character and should compare as equal with another string with the same character.

I would laugh my ass off if I was using Perl and it did something weird like this to me (as long as it didn't take a month to track down the bug!). Now it isn't so funny.

> I have a bunch of good reasons (I think) for liking UTF-8:

I'm not against UTF-8. It could be an internal representation for some Unicode objects.

> it allows
> you to convert between Unicode and 8-bit strings without losses,

Here's the heart of our disagreement:

******
I don't want, in Py3K, to think about "converting between Unicode and 8-bit strings."
I want strings and I want byte-arrays and I want to worry about converting between *them*. There should be only one string type, its characters should all live in the Unicode character repertoire and the character numbers should all come from Unicode. "Special" characters can be assigned to the Unicode Private User Area. Byte arrays would be entirely separate and would be converted to Unicode strings with explicit conversion functions.
*****

In the meantime I'm just trying to get other people thinking in this mode so that the transition is easier. If I see people embedding UTF-8 escape sequences in literal strings today, I'm going to hit them.

I recognize that we can't design the universe right now but we could agree on this direction and use it to guide our decision-making.

By the way, if we DID think of 8-bit strings as essentially "byte arrays" then let's use that terminology and imagine some future documentation:

"Python's string type is equivalent to a list of bytes. For clarity, we will call this type a byte list from now on. In contexts where a Unicode character-string is desired, Python automatically converts byte lists to character strings by doing a UTF-8 decode on them."

What would you think if Java had a default (I say "magical") conversion from byte arrays to character strings?

The only reason we are discussing this is because Python strings have a dual personality which was useful in the past but will (IMHO, of course) become increasingly confusing in the future. We want the best of both worlds without confusing anybody and I don't think that we can have it.

If you want 8-bit strings to be really byte arrays in perpetuity then let's be consistent in that view. We can compare them to Unicode as we would two completely separate types. "U" comes after "S" so unicode strings always compare greater than 8-bit strings. The use of the word "string" for both objects can be considered just a historical accident.
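
The model asked for above (one string type, a separate byte-array type, conversion only through explicit functions) can be sketched roughly as follows. This assumes present-day Python 3, which postdates the thread; the helper name `to_text` is hypothetical and only for illustration.

```python
# Sketch in present-day Python 3; the helper name to_text is hypothetical.

def to_text(data: bytes, encoding: str) -> str:
    """Explicit conversion from a byte array to a character string."""
    return data.decode(encoding)

payload = bytes([0xC2, 0xA4])  # two bytes, e.g. read from a file or socket

# The caller states the encoding, so the outcome is never a guess.
as_utf8 = to_text(payload, "utf-8")      # one character, U+00A4
as_latin1 = to_text(payload, "latin-1")  # two characters, U+00C2 U+00A4

# Byte arrays and strings are separate types and never compare equal.
same = (payload == as_utf8)

print(len(payload), len(as_utf8), len(as_latin1), same)
```

With the conversion spelled out at the call site there is no default to argue about; the same two bytes yield one character or two depending on what the caller asks for.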
> Tcl uses it (so displaying Unicode in Tkinter *just* *works*...),

Don't follow this entirely. Shouldn't the next version of Tkinter accept and return Unicode strings? It would be rather ugly for two Unicode-aware systems (Python and Tk) to talk to each other in 8-bit strings. I mean I don't care what you do at the C level but at the Python level arguments should be "just strings."

Consider that len() on the Tkinter side would return a different value than on the Python side. What about integral indexes into buffers? I'm totally ignorant about Tkinter but let me ask: wouldn't Tkinter say (e.g.) that the cursor is between the 5th and 6th character when in an 8-bit string the equivalent index might be the 11th or 12th byte?

> it is not Western-language-centric.

If you look at encoding efficiency it is.

> Another reason: while you may claim that your (and /F's, and Just's)
> preferred solution doesn't enter into the encodings issue, I claim it
> does: Latin-1 is just as much an encoding as any other one.

The fact that my proposal has the same effect as making Latin-1 the "default encoding" is a near-term side effect of the definition of Unicode. My long term proposal is to do away with the concept of 8-bit strings (and thus, conversions from 8-bit to Unicode) altogether. One string to rule them all!

Is Unicode going to be the canonical Py3K character set or will we have different objects for different character sets/encodings with different default (I say "magical") conversions between them? Such a design would not be entirely insane though it would be a PITA to implement and maintain.

If we aren't ready to establish Unicode as the one true character set then we should probably make no special concessions for Unicode at all. Let a thousand string objects bloom!

Even if we agreed to allow many string objects, byte==character should not be the default string object. Unicode should be the default.
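
The indexing and efficiency remarks above can be made concrete. This is a rough sketch in present-day Python 3 (an assumption, since the thread predates it); the helper `byte_offset` and the sample strings are hypothetical.

```python
# Sketch in present-day Python 3; helper and sample strings are illustrative.

def byte_offset(text: str, char_index: int) -> int:
    """UTF-8 byte offset of the character at char_index in text."""
    return len(text[:char_index].encode("utf-8"))

word = "na\u00efve caf\u00e9"  # "naive cafe" with two accented letters

chars = len(word)                   # length counted in characters
octets = len(word.encode("utf-8"))  # length counted in UTF-8 bytes
offset_of_c = byte_offset(word, 6)  # byte position of the "c" at index 6

# Bytes per character in UTF-8 depend on the script.
sizes = {
    "ascii a": len("a".encode("utf-8")),
    "latin e-acute": len("\u00e9".encode("utf-8")),
    "cjk U+65E5": len("\u65e5".encode("utf-8")),
}

# The same CJK character in UTF-16 (no byte order mark).
cjk_utf16 = len("\u65e5".encode("utf-16-le"))

print(chars, octets, offset_of_c, sizes, cjk_utf16)
```

A character index and a byte index agree only while the text is pure ASCII, which is why the two sides of an interface need to agree on which one they are exchanging.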
> I also think that the issue is blown out of proportions: this ONLY
> happens when you use Unicode objects, and it ONLY matters when some
> other part of the program uses 8-bit string objects containing
> non-ASCII characters.

Won't this be totally common? Most people are going to use 8-bit literals in their program text but work with Unicode data from XML parsers, COM, WebDAV, Tkinter, etc?

> Given the long tradition of using different
> encodings in 8-bit strings, at that point it is anybody's guess what
> encoding is used, and UTF-8 is a better guess than Latin-1.

If we are guessing then we are doing something wrong. My answer to the question of "default encoding" falls out naturally from a certain way of looking at text, popularized in various other languages and increasingly "the norm" on the Web. If you accept the model (a character is a character is a character), the right behavior is obvious.

"\244"==u"\244"

Nobody is ever going to have trouble understanding how this works. Choose simplicity!

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only communication coin we can count on.
- http://www.cs.yale.edu/~perlis-alan/quotes.html

From guido@python.org Tue May 2 01:53:26 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 01 May 2000 20:53:26 -0400
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Mon, 01 May 2000 19:19:20 CDT." <390E1F08.EA91599E@prescod.net>
References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net>
Message-ID: <200005020053.UAA23665@eric.cnri.reston.va.us>

Paul, we're both just saying the same thing over and over without convincing each other.
I'll wait till someone who wasn't in this debate before chimes in.

Have you tried using this?

--Guido van Rossum (home page: http://www.python.org/~guido/)

From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net>
Message-ID: <002301bfb3d5$8fd57440$34aab5d4@hagrid>

Paul Prescod wrote:
> I would laugh my ass off if I was using Perl and it did something weird
> like this to me.

you don't have to -- in Perl 5.6, a character is a character...

does anyone on this list follow the perl-porters list? was this as controversial over in Perl land as it appears to be over here?

From Fredrik Lundh"

reading the XML namespace specification makes my brain hurt, so I thought I'd ask here before it explodes...

given the following XML snippet, what's the correct namespace for the "attribute" attribute?

I'm not smart enough to figure that out from the specification, my intuition says "no namespace", and so does James Clark's namespace note (http://www.jclark.com/xml/xmlns.htm) where

Check Status

is mapped to:

<{http://www.w3.org/TR/REC-html40}A HREF='/cgi-bin/ResStatus'
>Check Status

(slightly edited -- see the note for the full example).

but 1.5.2's xmllib doesn't agree with this:

import xmllib

class Parser(xmllib.XMLParser):

    def unknown_starttag(self, tag, attr):
        print "S", repr(tag), attr

    def unknown_endtag(self, tag):
        print "E", repr(tag)

p = Parser()
p.feed(""" """)
p.close()

gives the following output:

S 'namespace: body' {}
S 'namespace: member' {'namespace: attribute': 'value'}
E 'namespace: member'
E 'namespace: body'

instead of

S 'namespace: body' {}
S 'namespace: member' {'attribute': 'value'}
E 'namespace: member'
E 'namespace: body'

can anyone sort this out for me?
(and no, I really have to be able to use xmllib, to make sure soaplib.py works under an off-the-shelf Python distribution...)

From ludvig.svenonius@excosoft.se Tue May 2 15:52:03 2000
From: ludvig.svenonius@excosoft.se (Ludvig Svenonius)
Date: Tue, 2 May 2000 16:52:03 +0200
Subject: [XML-SIG] namespace headache
In-Reply-To: <004201bfb441$670f67c0$34aab5d4@hagrid>
Message-ID:

I'm pretty sure an unprefixed attribute will default to the same namespace URI as its host element, so in the snippet:

'attribute' would have the namespace URI 'namespace:', as its host element, whereas in:

it would have no namespace URI. I think I read about this somewhere in the namespace specification at W3C. This also explains why XSL-specific attributes in XSLT elements needn't be prefixed (they will conveniently default to the same namespaces as their host elements, i.e. the XSLT namespace). The same goes for XHTML, I guess.

I think xmllib has it right. I have no explanation for James Clark's note however. The alternative of forcing the XML author to explicitly prefix every attribute in elements that belong to a certain namespace just to declare that they belong to the same namespace seems pretty inconvenient.

--
Ludvig Svenonius
Excosoft AB
ludvig@excosoft.se

-----Original Message-----
From: xml-sig-admin@python.org [mailto:xml-sig-admin@python.org]On Behalf Of Fredrik Lundh
Sent: Tuesday, May 02, 2000 4:19 PM
To: xml-sig@python.org
Subject: [XML-SIG] namespace headache

reading the XML namespace specification makes my brain hurt, so I thought I'd ask here before it explodes...

given the following XML snippet, what's the correct namespace for the "attribute" attribute?
I'm not smart enough to figure that out from the specification, my intuition says "no namespace", and so does James Clark's namespace note (http://www.jclark.com/xml/xmlns.htm) where

Check Status

is mapped to:

<{http://www.w3.org/TR/REC-html40}A HREF='/cgi-bin/ResStatus'
>Check Status

(slightly edited -- see the note for the full example).

but 1.5.2's xmllib doesn't agree with this:

import xmllib

class Parser(xmllib.XMLParser):

    def unknown_starttag(self, tag, attr):
        print "S", repr(tag), attr

    def unknown_endtag(self, tag):
        print "E", repr(tag)

p = Parser()
p.feed(""" """)
p.close()

gives the following output:

S 'namespace: body' {}
S 'namespace: member' {'namespace: attribute': 'value'}
E 'namespace: member'
E 'namespace: body'

instead of

S 'namespace: body' {}
S 'namespace: member' {'attribute': 'value'}
E 'namespace: member'
E 'namespace: body'

can anyone sort this out for me?

(and no, I really have to be able to use xmllib, to make sure soaplib.py works under an off-the-shelf Python distribution...)

_______________________________________________
XML-SIG maillist - XML-SIG@python.org
http://www.python.org/mailman/listinfo/xml-sig

From Troy.Nordine@westgroup.com Tue May 2 16:55:54 2000
From: Troy.Nordine@westgroup.com (Nordine, Troy)
Date: Tue, 2 May 2000 10:55:54 -0500
Subject: [XML-SIG] namespace headache
Message-ID: <9DDF5FF45501D211BC22006094238FB00276C05E@elfie.int.westgroup.com>

>
> given the following XML snippet, what's the correct namespace
> for the "attribute" attribute?
>
>
>
>
>

This topic came up on XML-DEV back in early Feb. under the topic "XML Schemas Question: default namespace misses attributes". The essence of the discussion (as I understood it) was that attributes that don't have prefixes are in the namespace defined by the element, not the namespace that the element is in. See Henry Thompson's reply to the original post for a better explanation: http://www.xml.org/archives/xml-dev/2000/02/0097.html.
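
How a namespace-aware parser reports prefixed and unprefixed attributes can be checked directly. What follows is a minimal sketch, assuming Python 3's standard xml.etree.ElementTree rather than the 1.5.2 xmllib under discussion. The document is a hypothetical stand-in, since the snippet from the original post did not survive in this archive, and the `p` prefix and `other` attribute are made up for the comparison.

```python
# Sketch using the standard library's ElementTree (Python 3).
# The document is a hypothetical stand-in, not the snippet from the post.
import xml.etree.ElementTree as ET

doc = """
<body xmlns="namespace:" xmlns:p="namespace:">
  <member attribute="value" p:other="x" />
</body>
"""

root = ET.fromstring(doc)
member = root[0]

# Element names pick up the default namespace declared on <body>.
element_names = (root.tag, member.tag)

# The unprefixed attribute keeps its bare name; only the attribute
# written with an explicit prefix is reported with a namespace.
attributes = sorted(member.attrib.items())

print(element_names)
print(attributes)
```

ElementTree writes qualified names as `{uri}local`, the same notation as the James Clark note cited in this thread, so the bare `attribute` key is the "no namespace" case.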
> I'm not smart enough to figure that out from the specification,
> my intuition says "no namespace", and so does James Clark's
> namespace note (http://www.jclark.com/xml/xmlns.htm) where
>
>
> Check Status
>
>
> is mapped to:
>
>
> <{http://www.w3.org/TR/REC-html40}A HREF='/cgi-bin/ResStatus'
> >Check Status
>
>
> (slightly edited -- see the note for the full example).
>
> but 1.5.2's xmllib doesn't agree with this:
>

So as far as I can tell, James is right and xmllib is wrong.

Troy

From paul@prescod.net Tue May 2 17:51:34 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:51:34 -0500
Subject: [XML-SIG] namespace headache
References: <004201bfb441$670f67c0$34aab5d4@hagrid>
Message-ID: <390F0796.3515445C@prescod.net>

Fredrik Lundh wrote:
>
> ...
>
> I'm not smart enough to figure that out from the specification,
> my intuition says "no namespace", and so does James Clark's
> namespace note (http://www.jclark.com/xml/xmlns.htm) where

Your intuition is right.

> (and no, I really have to be able to use xmllib, to make sure
> soaplib.py works under an off-the-shelf Python distribution...)

You'll have to fix xmllib then. Attributes without prefixes have no namespace. They neither inherit their namespace nor use the default namespace.

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only communication coin we can count on.
- http://www.cs.yale.edu/~perlis-alan/quotes.html

From paul@prescod.net Tue May 2 18:21:04 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 12:21:04 -0500
Subject: [XML-SIG] namespace headache
References:
Message-ID: <390F0E80.FD3620F8@prescod.net>

It is a reasonable convention, built *on top of* the XML namespaces specification, to treat "href" on an "html:a" element as equivalent to "html:href". You could imagine this as another layer termed "Simplified Attribute-Inherited Namespaces".
But such a document doesn't exist...some particular XML vocabularies just "work that way."

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only communication coin we can count on.
- http://www.cs.yale.edu/~perlis-alan/quotes.html

From tgraham@mulberrytech.com Tue May 2 19:18:13 2000
From: tgraham@mulberrytech.com (Tony Graham)
Date: Tue, 2 May 2000 14:18:13 -0400 (EST)
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate (Guido van Rossum)
In-Reply-To: <20000502155608.6D41C1CD6E@dinsdale.python.org>
References: <20000502155608.6D41C1CD6E@dinsdale.python.org>
Message-ID: <14607.7141.80000.709929@menteith.com>

I subscribe to the Digest, so I'm a bit behind...

At 2 May 2000 11:56 -0400, xml-sig-request@python.org wrote:
> From: Guido van Rossum
> Date: Mon, 01 May 2000 17:32:38 -0400
> Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
...
> I have a bunch of good reasons (I think) for liking UTF-8: it allows
> you to convert between Unicode and 8-bit strings without losses, Tcl

UTF-8 is a variable-length 8-bit encoding of Unicode characters. The only characters that cleanly convert between UTF-8 and fixed-length 8-bit strings are the ASCII characters.

> uses it (so displaying Unicode in Tkinter *just* *works*...), it is
> not Western-language-centric.

UTF-8 is Western-language-centric. In fact, it's practically English-centric since only the ASCII characters are 1 byte per character, the characters for writing most European languages plus Arabic and Hebrew are 2 bytes per character, and the rest -- including Hangul and the CJK ideographs -- are 3 bytes per character. Japanese text files, for example, are 50% larger as UTF-8 text than as UTF-16 text.

> Another reason: while you may claim that your (and /F's, and Just's)
> preferred solution doesn't enter into the encodings issue, I claim it
> does: Latin-1 is just as much an encoding as any other one.
>
> I claim that as long as we're using an encoding we might as well use
> the most accepted 8-bit encoding of Unicode as the default encoding.

There have been other proposals for variable-length 8-bit transformation formats of Unicode characters, but UTF-8 is the only one that is specified in the Unicode Standard and ISO/IEC 10646.

There is less hassle with characters outside the 16-bit Basic Multilingual Plane (BMP) with UTF-8 than with, for example, UTF-16. When working with UTF-8, you have to consider that all characters are encoded as varying numbers of bytes. When working with UTF-16, it's easy to assume that all characters are 16-bit and write your code accordingly, but there will shortly be characters defined outside of the BMP -- including math characters used in MathML and new but essential CJK ideographs -- so you have to work with UTF-16 data as being "16-bit except when it isn't".

It shouldn't matter what encoding or transformation format is used for the internal representation of strings. Python should be able to read and write files in a number of encodings so that it plays well with others. I compared eight languages in the "Programming Language Support" chapter of "Unicode: A Primer" (ISBN: 0-7645-4625-2) and found that there was no Unicode encoding that all eight languages could read and write. Playing well with others also means reading and writing whatever non-Unicode encoding a user keeps his data in.

Python should also be able to read Python programs in a number of encodings, including UTF-8 and UTF-16, plus it should include a mechanism for referencing Unicode characters by number (or name) within strings.

> I also think that the issue is blown out of proportions: this ONLY
> happens when you use Unicode objects, and it ONLY matters when some
> other part of the program uses 8-bit string objects containing
> non-ASCII characters.
Given the long tradition of using different > encodings in 8-bit strings, at that point it is anybody's guess what > encoding is used, and UTF-8 is a better guess than Latin-1. Given the long tradition of using different encodings in 8-bit strings, surely there's no safe assumption about the encoding in any 8-bit string? ISO 8859-1 (Latin-1) is being superseded by ISO 8859-15 (which shuffled a few things and added the euro); Windows' CP 1252 isn't really ISO 8859-1 despite how some mailers and HTML editors label it; and even I've processed multi-byte Japanese, Chinese, and Korean text using 8-bit scripting languages. Perl, for example, has "byte" and "utf8" pragmata for controlling whether strings are treated as fixed-length 1-byte characters or as variable-length UTF-8 characters, with the current default being "byte". Tcl, to use another example, can read and write files in a number of encodings, but it defaults to using the system encoding, or ISO 8859-1 if it can't determine the system encoding. Python, similarly, should not make assumptions about the encoding used in strings in existing programs and should be flexible in supporting the encodings that people do use. Regards, Tony Graham ====================================================================== Tony Graham mailto:tgraham@mulberrytech.com Mulberry Technologies, Inc. 
http://www.mulberrytech.com 17 West Jefferson Street Direct Phone: 301/315-9632 Suite 207 Phone: 301/315-9631 Rockville, MD 20850 Fax: 301/315-8285 ---------------------------------------------------------------------- Mulberry Technologies: A Consultancy Specializing in SGML and XML ====================================================================== From paul@prescod.net Tue May 2 19:23:24 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 02 May 2000 13:23:24 -0500 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> Message-ID: <390F1D1C.6EAF7EAD@prescod.net> Guido van Rossum wrote: > > .... > > Have you tried using this? Yes. I haven't had large problems with it. As long as you know what is going on, it doesn't usually hurt anything because you can just explicitly set up the decoding you want. It's like the int division problem. You get bitten a few times and then get careful. It's the naive user who will be surprised by these random UTF-8 decoding errors. That's why this is NOT a convenience issue (are you listening MAL???). It's a short and long term simplicity issue. There are lots of languages where it is de rigeur to discover and work around inconvenient and confusing default behaviors. I just don't think that we should be ADDING such behaviors. -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. 
- http://www.cs.yale.edu/~perlis-alan/quotes.html From guido@python.org Tue May 2 19:56:34 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 14:56:34 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 13:23:24 CDT." <390F1D1C.6EAF7EAD@prescod.net> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> Message-ID: <200005021856.OAA26104@eric.cnri.reston.va.us> > It's the naive user who will be surprised by these random UTF-8 decoding > errors. > > That's why this is NOT a convenience issue (are you listening MAL???). > It's a short and long term simplicity issue. There are lots of languages > where it is de rigeur to discover and work around inconvenient and > confusing default behaviors. I just don't think that we should be ADDING > such behaviors. So what do you think of my new proposal of using ASCII as the default "encoding"? It takes care of "a character is a character" but also (almost) guarantees an error message when mixing encoded 8-bit strings with Unicode strings without specifying an explicit conversion -- *any* 8-bit byte with the top bit set is rejected by the default conversion to Unicode. I think this is less confusing than Latin-1: when an unsuspecting user is reading encoded text from a file into 8-bit strings and attempts to use it in a Unicode context, an error is raised instead of producing garbage Unicode characters. 
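The difference between an ASCII default and a Latin-1 default can be sketched in present-day Python 3 (an assumption: the thread concerns Python 1.6, whose API differed, and decode_default is a hypothetical helper invented for this illustration). Under ASCII, a byte with the top bit set is rejected; under Latin-1 the same bytes decode without complaint into the wrong characters.

```python
# Python 3 sketch; not the Python 1.6 API under discussion in the thread.
raw = "\u00e9".encode("utf-8")   # b'\xc3\xa9': e-acute as UTF-8 bytes from a file


def decode_default(data, encoding):
    """Hypothetical helper: decode bytes, returning the text or the error raised."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc


ascii_result = decode_default(raw, "ascii")      # rejected: top bit is set
latin1_result = decode_default(raw, "latin-1")   # accepted, but two wrong characters

print(type(ascii_result).__name__)
print(ascii(latin1_result))
```

Pure ASCII input passes through either default unchanged, which is why the error only shows up once encoded non-ASCII data meets Unicode.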
It encourages the use of Unicode strings for everything beyond ASCII -- there's no way around ASCII since that's the source encoding etc., but Latin-1 is an inconvenient default in most parts of the world. ASCII is accepted everywhere as the base character set (e.g. for email and for text-based protocols like FTP and HTTP), just like English is the one natural language that we can all sue to communicate (to some extent). --Guido van Rossum (home page: http://www.python.org/~guido/) From dieter@handshake.de Tue May 2 19:44:41 2000 From: dieter@handshake.de (Dieter Maurer) Date: Tue, 2 May 2000 20:44:41 +0200 (CEST) Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <390E1F08.EA91599E@prescod.net> References: <390E1F08.EA91599E@prescod.net> Message-ID: <14607.7798.510723.419556@lindm.dm> Paul Prescod writes: > The fact that my proposal has the same effect as making Latin-1 the > "default encoding" is a near-term side effect of the definition of > Unicode. My long term proposal is to do away with the concept of 8-bit > strings (and thus, conversions from 8-bit to Unicode) altogether. One > string to rule them all! Why must this be a long term proposal? I would find it quite attractive if * the old string type became an immutable list of bytes * automatic conversion between byte lists and unicode strings were performed via user customizable conversion functions (a la __import__). Dieter From jkraai@murlmail.com Tue May 2 20:46:49 2000 From: jkraai@murlmail.com (jkraai@murlmail.com) Date: Tue, 2 May 2000 14:46:49 -0500 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate Message-ID: <200005021946.OAA03609@www.polytopic.com> The ever quotable Guido: > English is the one natural language that we can all sue to communicate
From paul@prescod.net Tue May 2 20:23:27 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 02 May 2000 14:23:27 -0500 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> Message-ID: <390F2B2F.2953C72D@prescod.net> Guido van Rossum wrote: > > ... > > So what do you think of my new proposal of using ASCII as the default > "encoding"? I can live with it. I am mildly uncomfortable with the idea that I could write a whole bunch of software that works great until some European inserts one of their name characters. Nevertheless, being hard-assed is better than being permissive because we can loosen up later. What do we do about str( my_unicode_string )? Perhaps escape the Unicode characters with backslashed numbers? -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html From guido@python.org Tue May 2 20:58:20 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 15:58:20 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 14:23:27 CDT."
<390F2B2F.2953C72D@prescod.net> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> Message-ID: <200005021958.PAA26760@eric.cnri.reston.va.us> [me] > > So what do you think of my new proposal of using ASCII as the default > > "encoding"? [Paul] > I can live with it. I am mildly uncomfortable with the idea that I could > write a whole bunch of software that works great until some European > inserts one of their name characters. Better than that when some Japanese insert *their* name characters and it produces gibberish instead. > Nevertheless, being hard-assed is > better than being permissive because we can loosen up later. Exactly -- just as nobody should *count* on 10**10 raising OverflowError, nobody (except maybe parts of the standard library :-) should *count* on unicode("\347") raising ValueError. I think that's fine. > What do we do about str( my_unicode_string )? Perhaps escape the Unicode > characters with backslashed numbers? Hm, good question. Tcl displays unknown characters as \x or \u escapes. I think this may make more sense than raising an error. But there must be a way to turn on Unicode-awareness on e.g. stdout and then printing a Unicode object should not use str() (as it currently does). --Guido van Rossum (home page: http://www.python.org/~guido/) From uogbuji@fourthought.com Tue May 2 19:33:18 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 02 May 2000 12:33:18 -0600 Subject: [XML-SIG] namespace headache In-Reply-To: Message from "Fredrik Lundh" of "Tue, 02 May 2000 16:19:12 +0200." 
<004201bfb441$670f67c0$34aab5d4@hagrid> Message-ID: <200005021833.MAA02854@localhost.localdomain> > reading the XML namespace specification makes my brain > hurt, so I thought I'd ask here before it explodes... > > given the following XML snippet, what's the correct namespace > for the "attribute" attribute? > > > > > > > I'm not smart enough to figure that out from the specification, > my intuition says "no namespace", and so does James Clark's > namespace note (http://www.jclark.com/xml/xmlns.htm) where [snip] > class Parser(xmllib.XMLParser): > def unknown_starttag(self, tag, attr): > print "S", repr(tag), attr > def unknown_endtag(self, tag): > print "E", repr(tag) > > p = Parser() > p.feed(""" > > > > > """) > p.close() > > gives the following output: > > S 'namespace: body' {} > S 'namespace: member' {'namespace: attribute': 'value'} > E 'namespace: member' > E 'namespace: body' > > instead of > > S 'namespace: body' {} > S 'namespace: member' {'attribute': 'value'} > E 'namespace: member' > E 'namespace: body' > > can anyone sort this out for me? Easy enough. Your instinct (and Mr. Clark) are right and xmllib is wrong. > (and no, I really have to be able to use xmllib, to make sure > soaplib.py works under an off-the-shelf Python distribution...) So the next part of my comment is of no use to you but I'll make it anyway: 4DOM does get it right: [uogbuji@borgia uogbuji]$ python Python 1.5.2 (#1, Mar 21 2000, 18:17:19) [GCC 2.95.3 19991030 (prerelease)] on linux-i386 Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam >>> from Ft.Dom.Ext.Reader import Sax2 >>> source = """ ... ... ... """ >>> doc = Sax2.FromXml(source) >>> member = doc.documentElement.childNodes[1] >>> member >>> attr = member.attributes[0] >>> attr >>> att.namespaceURI >>> attr.localName 'attribute' >>> attr.nodeName 'attribute' >>> import Ft.Dom.Ext >>> Ft.Dom.Ext.GetAllNs(attr) {'ns': 'namespace:', 'xml': 'http://www.w3.org/XML/1998/namespace'} >>> Hmm. 
I just noticed that "1 attributes and 1 children". Silliness. I'll sort that out... -- Uche Ogbuji Senior Software Engineer uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-9036, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From uogbuji@fourthought.com Tue May 2 19:41:40 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 02 May 2000 12:41:40 -0600 Subject: [XML-SIG] namespace headache In-Reply-To: Message from "Ludvig Svenonius" of "Tue, 02 May 2000 16:52:03 +0200." Message-ID: <200005021841.MAA02895@localhost.localdomain> > I'm pretty sure an unprefixed attribute will default to the same namespace > URI as its host element, so in the snippet: No! A million times "no"! There is _no_ defaulting for attributes. None whatsoever. This is a major XML FAQ and I refer everyone to James Clark's excellent note, which /F already mentioned. http://www.jclark.com/xml/xmlns.htm > > > > > > 'attribute' would have the namespace URI 'namespace:', as its host element, > whereas in: > > > > > > > it would have no namespace URI. I think I read about this somewhere in the > namespace specification at W3C. This also explains why XSL-specific > attributes in XSLT elements needn't be prefixed (they will conveniently > default to the same namespaces as their host elements, i.e. the XSLT > namespace). The same goes for XHTML, I guess. I think xmllib has it right. I > have no explanation for James Clark's note however. The alternative of > forcing the XML author to explicitly prefix every attribute in elements that > belong to a certain namespace just to declare that they belong to the same > namespace seems pretty inconvenient. I'll grant that you pretty much explained the reason for much of the confusion. However, the fact that XSLT uses unprefixed attributes has no bearing on their namespace status. 
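The "no defaulting for attributes" rule can be checked with a namespace-aware parser. The following is only a sketch, assuming present-day Python 3 and its xml.etree.ElementTree module (not xmllib or 4DOM as used in the thread). The original XML snippets were lost from this archive, so both documents below are hypothetical reconstructions modelled on the parser output quoted earlier, and member_info is a helper invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: the element and attribute names are assumptions
# modelled on the parser output quoted in the thread.
prefixed = '<ns:body xmlns:ns="namespace:"><ns:member attribute="value"/></ns:body>'
defaulted = '<body xmlns="namespace:"><member attribute="value"/></body>'


def member_info(source):
    """Return the qualified tag and the attributes of the first child element."""
    member = ET.fromstring(source)[0]
    return member.tag, dict(member.attrib)


# ElementTree writes a qualified name as "{uri}local"; a bare name has no namespace.
print(member_info(prefixed))
print(member_info(defaulted))
```

In both cases the element name is qualified with the namespace URI while the attribute name is left bare, even when the element picked up its namespace from a default declaration.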
XSLT does this as a convenience: after all, XSLT processors are element-oriented (if that means anything), and will recognize the standard attributes of an XSL instruction without help from namespaces. Since there is no attr namespace defaulting, my guess is that unprefixed instruction attributes is a way to make XSLT a tad less verbose, but no more than that. -- Uche Ogbuji Senior Software Engineer uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-9036, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From uogbuji@fourthought.com Tue May 2 23:01:49 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 02 May 2000 16:01:49 -0600 Subject: [XML-SIG] namespace headache In-Reply-To: Message from Uche Ogbuji of "Tue, 02 May 2000 12:33:18 MDT." <200005021833.MAA02854@localhost.localdomain> Message-ID: <200005022201.QAA03705@localhost.localdomain> > So the next part of my comment is of no use to you but I'll make it anyway: > 4DOM does get it right: > > [uogbuji@borgia uogbuji]$ python > Python 1.5.2 (#1, Mar 21 2000, 18:17:19) [GCC 2.95.3 19991030 (prerelease)] > on linux-i386 > Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam > >>> from Ft.Dom.Ext.Reader import Sax2 > >>> source = """ > ... > ... > ... """ > >>> doc = Sax2.FromXml(source) > >>> member = doc.documentElement.childNodes[1] > >>> member > children> > >>> attr = member.attributes[0] > >>> attr > > >>> att.namespaceURI > >>> attr.localName > 'attribute' > >>> attr.nodeName > 'attribute' > >>> import Ft.Dom.Ext > >>> Ft.Dom.Ext.GetAllNs(attr) > {'ns': 'namespace:', 'xml': 'http://www.w3.org/XML/1998/namespace'} > >>> What I get for C n P-ing from a terminal screen chunks at a time. Looks like not only did I swipe a typo I'd meant to leave out, but I left out 4DOM's answer to "attr.namespaceURI". For the record, it is '', but of course you don't have to take my word for it. 
Give it a try. -- Uche Ogbuji Senior Software Engineer uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-9036, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From mal@lemburg.com Wed May 3 00:11:37 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 03 May 2000 01:11:37 +0200 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> Message-ID: <390F60A9.A3AA53A9@lemburg.com> Guido van Rossum wrote: > > > > So what do you think of my new proposal of using ASCII as the default > > > "encoding"? How about using unicode-escape or raw-unicode-escape as default encoding ? (They would have to be adapted to disallow Latin-1 char input, though.) The advantage would be that they are compatible with ASCII while still providing loss-less conversion and since they use escape characters, you can even read them using an ASCII based editor. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paul@prescod.net Tue May 2 23:54:41 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 02 May 2000 17:54:41 -0500 Subject: [XML-SIG] RAX Message-ID: <390F5CB1.FBE70A92@prescod.net> RAX has been getting many good reviews. I propose it for inclusion in the xml-sig as another "higher level" API. 
-- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html From guido@python.org Wed May 3 03:31:21 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 22:31:21 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Wed, 03 May 2000 01:11:37 +0200." <390F60A9.A3AA53A9@lemburg.com> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com> Message-ID: <200005030231.WAA02678@eric.cnri.reston.va.us> > Guido van Rossum wrote: > > > > So what do you think of my new proposal of using ASCII as the default > > > > "encoding"? [MAL] > How about using unicode-escape or raw-unicode-escape as > default encoding ? (They would have to be adapted to disallow > Latin-1 char input, though.) > > The advantage would be that they are compatible with ASCII > while still providing loss-less conversion and since they > use escape characters, you can even read them using an > ASCII based editor. No, the backslash should mean itself when encoding from ASCII to Unicode. 
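The objection can be illustrated in present-day Python 3 (an assumption: the codecs below are the Python 3 ones, not the 1.6 alpha implementations, and the sample bytes are invented). With ASCII a backslash decodes to a backslash; with unicode_escape it starts an escape sequence, so ordinary ASCII data such as a Windows path changes meaning.

```python
# Twelve ASCII bytes that contain two literal backslashes.
data = b"C:\\new\\table"

as_ascii = data.decode("ascii")             # backslash stays a backslash
as_escape = data.decode("unicode_escape")   # "\n" and "\t" become control characters

print(repr(as_ascii))
print(repr(as_escape))
```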
--Guido van Rossum (home page: http://www.python.org/~guido/) From tim_one@email.msn.com Wed May 3 06:19:28 2000 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 3 May 2000 01:19:28 -0400 Subject: [XML-SIG] RE: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <017d01bfb3bc$c3734c00$34aab5d4@hagrid> Message-ID: <000401bfb4bf$27ec1600$622d153f@tim> [Fredrik Lundh] > ... > (if you like, I can post more "fun with unicode" messages ;-) By all means! Exposing a gotcha to ridicule does more good than a dozen abstract arguments. But next time stoop to explaining what it is that's surprising. From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com> Message-ID: <01ed01bfb4df$8feddb60$34aab5d4@hagrid> M.-A. Lemburg wrote: > Guido van Rossum wrote: > > > > > > So what do you think of my new proposal of using ASCII as the default > > > > "encoding"? > > How about using unicode-escape or raw-unicode-escape as > default encoding ? (They would have to be adapted to disallow > Latin-1 char input, though.) > > The advantage would be that they are compatible with ASCII > while still providing loss-less conversion and since they > use escape characters, you can even read them using an > ASCII based editor. umm. if you disallow latin-1 characters, how can you call this one loss-less? looks like political correctness taken to an entirely new level... From ht@cogsci.ed.ac.uk Wed May 3 10:59:28 2000 From: ht@cogsci.ed.ac.uk (Henry S.
Thompson) Date: 03 May 2000 10:59:28 +0100 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Guido van Rossum's message of "Mon, 01 May 2000 20:53:26 -0400" References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> Message-ID: Guido van Rossum writes: > Paul, we're both just saying the same thing over and over without > convincing each other. I'll wait till someone who wasn't in this > debate before chimes in. OK, I've never contributed to this discussion, but I have a long history of shipping widely used Python/Tkinter/XML tools (see my homepage). I care _very_ much that heretofore I have been unable to support full XML because of the lack of Unicode support in Python. I've already started playing with 1.6a2 for this reason. I notice one apparent mis-communication between the various contributors: Treating narrow-strings as consisting of UNICODE code points <= 255 is not necessarily the same thing as making Latin-1 the default encoding. I don't think on Paul and Fredrik's account encoding are relevant to narrow-strings at all. I'd rather go right away to the coherent position of byte-arrays, narrow-strings and wide-strings. Encodings are only relevant to conversion between byte-arrays and strings. Decoding a byte-array with a UTF-8 encoding into a narrow string might cause overflow/truncation, just as decoding a byte-array with a UTF-8 encoding into a wide-string might. The fact that decoding a byte-array with a Latin-1 encoding into a narrow-string is a memcopy is just a side-effect of the courtesy of the UNICODE designers wrt the code points between 128 and 255. 
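The point about Latin-1 being a plain copy, and about UTF-8 decoding overflowing a narrow string, can be sketched in present-day Python 3 (an assumption: Python 3 has a single string type, so "narrow" is modelled here simply as "every code point is at most 255"; the variable names are invented for the illustration).

```python
all_bytes = bytes(range(256))

# Latin-1 decoding maps each byte value to the code point with the same number,
# which is why it amounts to a straight copy into a narrow string.
narrow = all_bytes.decode("latin-1")
same_numbers = [ord(ch) for ch in narrow] == list(range(256))

# UTF-8 decoding can yield code points that do not fit in 8 bits.
wide = b"\xe2\x82\xac".decode("utf-8")       # the euro sign, U+20AC
fits_in_narrow = all(ord(ch) <= 255 for ch in wide)

print(same_numbers, hex(ord(wide)), fits_in_narrow)
```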
This is effectively the way our C-based XML toolset (which we embed in Python) works today -- we build an 8-bit version which uses char* strings, and a 16-bit version which uses unsigned short* strings, and convert from/to byte-streams in any supported encoding at the margins. I'd like to keep byte-arrays at the margins in Python as well, for all the reasons advanced by Paul and Fredrik. I think treating existing strings as a sort of pun between narrow-strings and byte-arrays is a recipe for ongoing confusion. ht -- Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From guido@python.org Wed May 3 13:16:56 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 08:16:56 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "03 May 2000 10:59:28 BST." References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> Message-ID: <200005031216.IAA03274@eric.cnri.reston.va.us> [Henry S. Thompson] > OK, I've never contributed to this discussion, but I have a long > history of shipping widely used Python/Tkinter/XML tools (see my > homepage). I care _very_ much that heretofore I have been unable to > support full XML because of the lack of Unicode support in Python. > I've already started playing with 1.6a2 for this reason. Thanks for chiming in! 
> I notice one apparent mis-communication between the various > contributors: > > Treating narrow-strings as consisting of UNICODE code points <= 255 is > not necessarily the same thing as making Latin-1 the default encoding. > I don't think on Paul and Fredrik's account encoding are relevant to > narrow-strings at all. I agree that's what they are trying to tell me. > I'd rather go right away to the coherent position of byte-arrays, > narrow-strings and wide-strings. Encodings are only relevant to > conversion between byte-arrays and strings. Decoding a byte-array > with a UTF-8 encoding into a narrow string might cause > overflow/truncation, just as decoding a byte-array with a UTF-8 > encoding into a wide-string might. The fact that decoding a > byte-array with a Latin-1 encoding into a narrow-string is a memcopy > is just a side-effect of the courtesy of the UNICODE designers wrt the > code points between 128 and 255. > > This is effectively the way our C-based XML toolset (which we embed in > Python) works today -- we build an 8-bit version which uses char* > strings, and a 16-bit version which uses unsigned short* strings, and > convert from/to byte-streams in any supported encoding at the margins. > > I'd like to keep byte-arrays at the margins in Python as well, for all > the reasons advanced by Paul and Fredrik. > > I think treating existing strings as a sort of pun between > narrow-strings and byte-arrays is a recipe for ongoing confusion. Very good analysis. Unfortunately this is where we're stuck, until we have a chance to redesign this kind of thing from scratch. Python 1.5.2 programs use strings for byte arrays probably as much as they use them for character strings. This is because way back in 1990 I when I was designing Python, I wanted to have smallest set of basic types, but I also wanted to be able to manipulate byte arrays somewhat. 
Influenced by K&R C, I chose to make strings and string I/O 8-bit clean so that you could read a binary "string" from a file, manipulate it, and write it back to a file, regardless of whether it was character or binary data. This model has never been challenged until now. I agree that the Java model (byte arrays and strings) or perhaps your proposed model (byte arrays, narrow and wide strings) looks better. But, although Python has had rudimentary support for byte arrays for a while (the array module, introduced in 1993), the majority of Python code manipulating binary data still uses string objects. My ASCII proposal is a compromise that tries to be fair to both uses for strings. Introducing byte arrays as a more fundamental type has been on the wish list for a long time -- I see no way to introduce this into Python 1.6 without totally botching the release schedule (June 1st is very close already!). I'd like to be able to move on, there are other important things still to be added to 1.6 (Vladimir's malloc patches, Neil's GC, Fredrik's completed sre...). For 1.7 (which should happen later this year) I promise I'll reopen the discussion on byte arrays. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Wed May 3 14:06:27 2000 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Wed, 03 May 2000 15:06:27 +0200 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com> <01ed01bfb4df$8feddb60$34aab5d4@hagrid> Message-ID: <39102453.6923B10@lemburg.com> Fredrik Lundh wrote: > > M.-A. Lemburg wrote: > > Guido van Rossum wrote: > > > > > > > > So what do you think of my new proposal of using ASCII as the default > > > > > "encoding"? > > > > How about using unicode-escape or raw-unicode-escape as > > default encoding ? (They would have to be adapted to disallow > > Latin-1 char input, though.) > > > > The advantage would be that they are compatible with ASCII > > while still providing loss-less conversion and since they > > use escape characters, you can even read them using an > > ASCII based editor. > > umm. if you disallow latin-1 characters, how can you call this > one loss-less? [Guido didn't like this one, so its probably moot investing any more time on this...] I meant that the unicode-escape codec should only take ASCII characters as input and disallow non-escaped Latin-1 characters. Anyway, I'm out of this discussion... I'll wait a week or so until things have been sorted out. 
Have fun, -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From dick.wall@bigfoot.com Wed May 3 14:21:44 2000 From: dick.wall@bigfoot.com (Richard Wall) Date: Wed, 03 May 2000 09:21:44 -0400 Subject: [XML-SIG] Guidance sought Message-ID: <391027E8.BCC39E4F@bigfoot.com> Hello all, I know that this is partially my own doing, but I have not had cause to really get into XML in Python up until now. Suddenly I find myself with a need to do it, and everything I know about Python tells me it ought to be the natural choice with XML. However, I am finding the information about the 5 or 6 different approaches to XML confusing, and it is hard to find a place to start. Probably the easiest thing is to describe what I am trying to do, and maybe you guys can point me in the right direction of what to learn. We have an XML Java interface that we use at my company to publish and subscribe data from one of our systems. The output is created directly from the contents of Java objects, and represents exactly the structure of the object model in Java. All objects in this data are wholly contained within parent objects, that is to say that there are no references to other objects to deal with. The basic objects are things like Case and Account, and these have Attributes like Name and value and the like. An example output from a simple class might look something like this: - Allowance For Funds Used During Construction Account 1.1 - _AccountCategory AFUDC value - value ImpactDDItem 1.1 - Inputs _AccountType 1 Label Allowance For Funds Used During Construction I have been able to get the whole thing loaded into Python very easily using the DOM xml stuff (it was very easy actually) but I have a feeling that writing my own document handler would be a better way to do this.
There are a large number of different classes in the system and I want to make classes responsible for their own xml importing and exporting, having them recognize and fish out attributes for themselves, and create child objects as necessary to handle embedded objects in the XML. Is this the right approach, and if so are there any tutorials covering this? The majority of stuff I have seen so far tends to deal with batch processing of XML with python, and what I really want to do in this case is to import this XML document directly into an equivalent python object model to the java one from which it came. I understand that I will have to convert Java types to python, but that should be pretty easy (java.util.Hashtables to dictionaries, java.lang.String to string). In fact I am thinking that for the hashtable in this case, the __dict__ for the class can be set directly (so that the attribute key/value pairs simply become python attributes). Any pointers would be greatly appreciated, even if they are of the form of "It's right here in this tutorial, moron!". Thanks Dick -- dick.wall@bigfoot.com - Home dwall@newenergyassoc.com - Work QuaintRcky - AIM From ken@bitsko.slc.ut.us Wed May 3 15:58:16 2000 From: ken@bitsko.slc.ut.us (Ken MacLeod) Date: 03 May 2000 09:58:16 -0500 Subject: [XML-SIG] RAX In-Reply-To: Paul Prescod's message of "Tue, 02 May 2000 17:54:41 -0500" References: <390F5CB1.FBE70A92@prescod.net> Message-ID: Paul Prescod writes: > RAX has been getting many good reviews. I propose it for inclusion > in the xml-sig as another "higher level" API. One of the main features of RAX is that it implements a "pull" style event interface rather than the "push" style interface that SAX implements currently. Since that's good for a reason, it may also be good if there were a version of SAX that _was_ pull-style so that it could be used in applications like RAX.
In this way, one could stack SAX modules and filters in a pull-style chain as an alternative to a push-style chain. -- Ken From larsga@garshol.priv.no Wed May 3 18:15:28 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 03 May 2000 19:15:28 +0200 Subject: [XML-SIG] RAX In-Reply-To: References: <390F5CB1.FBE70A92@prescod.net> Message-ID: * Ken MacLeod | | Since that's good for a reason, it may also be good if there were a | version of SAX that _was_ pull-style so that it could be used in | applications like RAX. In this way, one could stack SAX modules and | filters in a pull-style chain as an alternative to a push-style chain. I agree with this and have been thinking about this for a while, but I'm not sure how we would actually implement this. The only XML parser we have that supports a pull-style interface is RXP, and I'm not sure if we can convert the other interfaces to pull-style interfaces in a sensible way (at least not on a level as low as SAX) without storing the entire sequence of events. Good ideas are welcome... --Lars M. From ken@bitsko.slc.ut.us Wed May 3 19:19:26 2000 From: ken@bitsko.slc.ut.us (Ken MacLeod) Date: 03 May 2000 13:19:26 -0500 Subject: [XML-SIG] RAX In-Reply-To: Lars Marius Garshol's message of "03 May 2000 19:15:28 +0200" References: <390F5CB1.FBE70A92@prescod.net> Message-ID: Lars Marius Garshol writes: > * Ken MacLeod > | > | Since that's good for a reason, it may also be good if there were a > | version of SAX that _was_ pull-style so that it could be used in > | applications like RAX. In this way, one could stack SAX modules and > | filters in a pull-style chain as an alternative to a push-style chain. > > I agree with this and have been thinking about this for a while, but > I'm not sure how we would actually implement this. 
The only XML parser > we have that supports a pull-style interface is RXP, and I'm not sure > if we can convert the other interfaces to pull-style interfaces in a > sensible way (at least not on a level as low as SAX) without storing > the entire sequence of events. > > Good ideas are welcome... I don't think existing push-style parsers need to be converted, or implied that they could be used in a pull-style chain. I was thinking more of creating the interface definition of a pull-style SAX parser and allowing for new parsers to be developed rather than a wholesale conversion of push-style parsers. RXP and PYX are both good candidates for pull-style parsing. I think an EasySAX-like approach would work best, where next_event() returns a DOM/mini-DOM node:

    node, is_end = pull_parser.next()
    while node != None:
        if node.nodeType == ELEMENT:
            if is_end:
                """ do end element processing """
            else:
                """ do start element processing """
        elif node.nodeType == TEXT:
            """ do text processing """
        node, is_end = pull_parser.next()

Most of the rest of the SAX interface (sources, creating parsers, exceptions, locators) could probably be used without change. If two threads are used, any push-style parser can be used to queue events to be read by a pull-style adapter in the other thread. -- Ken From jday@csihq.com Wed May 3 19:11:15 2000 From: jday@csihq.com (John Day) Date: Wed, 03 May 2000 14:11:15 -0400 Subject: [XML-SIG] RAX In-Reply-To: References: <390F5CB1.FBE70A92@prescod.net> Message-ID: <3.0.6.32.20000503141115.0091ec00@mail.csihq.com> What is RAX? I did a search for "rax" at the Cover XML site and came up with 0 hits. Evidently some SAX-like API for XML? Could someone provide some links please. Tnx, John Day At 07:15 PM 5/3/00 +0200, Lars Marius Garshol wrote: > >* Ken MacLeod >| >| Since that's good for a reason, it may also be good if there were a >| version of SAX that _was_ pull-style so that it could be used in >| applications like RAX.
In this way, one could stack SAX modules and >| filters in a pull-style chain as an alternative to a push-style chain. > >I agree with this and have been thinking about this for a while, but >I'm not sure how we would actually implement this. The only XML parser >we have that supports a pull-style interface is RXP, and I'm not sure >if we can convert the other interfaces to pull-style interfaces in a >sensible way (at least not on a level as low as SAX) without storing >the entire sequence of events. > >Good ideas are welcome... > >--Lars M. > > >_______________________________________________ >XML-SIG maillist - XML-SIG@python.org >http://www.python.org/mailman/listinfo/xml-sig > From robin@alldunn.com Wed May 3 19:37:49 2000 From: robin@alldunn.com (Robin Dunn) Date: Wed, 3 May 2000 11:37:49 -0700 Subject: [XML-SIG] RAX References: <390F5CB1.FBE70A92@prescod.net> <3.0.6.32.20000503141115.0091ec00@mail.csihq.com> Message-ID: <016501bfb52e$b0265b10$3225d2d1@ARES> > What is RAX? I did a search for "rax" at the Cover XML site > and came up with 0 hits. Evidently some SAX-like API for XML? > Could someone provide some links please. > http://xml.com/pub/2000/04/26/rax/index.html -- Robin Dunn Software Craftsman robin@AllDunn.com http://wxpython.org Java give you jitters? http://wxpros.com Relax with wxPython! From paul@prescod.net Wed May 3 23:23:25 2000 From: paul@prescod.net (Paul Prescod) Date: Wed, 03 May 2000 15:23:25 -0700 Subject: [XML-SIG] RAX References: <390F5CB1.FBE70A92@prescod.net> Message-ID: <3910A6DD.642B1963@prescod.net> Lars Marius Garshol wrote: > > ... > > I agree with this and have been thinking about this for a while, but > I'm not sure how we would actually implement this. The only XML parser > we have that supports a pull-style interface is RXP, and I'm not sure > if we can convert the other interfaces to pull-style interfaces in a > sensible way (at least not on a level as low as SAX) without storing > the entire sequence of events. 
Sean's pyx does that. Threads are another solution, but not a very efficient one. I think that the more performant solution for converting push-style parsers into pull-style parsers is Stackless Python. It seems to be the solution to a lot of problems. -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html From Fredrik Lundh" Message-ID: <011001bfb555$bdc64b00$34aab5d4@hagrid> Lars Marius Garshol wrote: > I'm not sure how we would actually implement this. The only XML parser > we have that supports a pull-style interface is RXP, and I'm not sure > if we can convert the other interfaces to pull-style interfaces in a > sensible way (at least not on a level as low as SAX) without storing > the entire sequence of events. assuming that a pull-style parser is what I think it is, here's how to convert any incremental parser (xmllib, sgmlop, expat, etc) to a pull-style parser:

    import xmllib

    START, DATA, END = "start", "data", "end"

    class XMLPuller(xmllib.XMLParser):

        def __init__(self, stream):
            xmllib.XMLParser.__init__(self)
            self.__stream = stream
            self.__tokens = []  # events parsed but not yet pulled

        def get(self):
            # feed the parser until at least one event is queued
            while not self.__tokens:
                data = self.__stream.read(10000)
                if not data:
                    self.close()
                    break
                self.feed(data)
            if self.__tokens:
                return self.__tokens.pop(0)
            return None # end of stream

        # each callback queues a single tuple; append takes one argument
        def unknown_starttag(self, tag, attr):
            self.__tokens.append((START, tag, attr))

        def handle_data(self, data):
            self.__tokens.append((DATA, data))

        def unknown_endtag(self, tag):
            self.__tokens.append((END, tag))

    puller = XMLPuller(open("myfile.xml"))

    while 1:
        next = puller.get()
        if not next:
            break
        print next

From ht@cogsci.ed.ac.uk Thu May 4 09:51:39 2000 From: ht@cogsci.ed.ac.uk (Henry S.
Thompson) Date: 04 May 2000 09:51:39 +0100 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Guido van Rossum's message of "Wed, 03 May 2000 08:16:56 -0400" References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: Guido van Rossum writes: > My ASCII proposal is a compromise that tries to be fair to both uses > for strings. Introducing byte arrays as a more fundamental type has > been on the wish list for a long time -- I see no way to introduce > this into Python 1.6 without totally botching the release schedule > (June 1st is very close already!). I'd like to be able to move on, > there are other important things still to be added to 1.6 (Vladimir's > malloc patches, Neil's GC, Fredrik's completed sre...). > > For 1.7 (which should happen later this year) I promise I'll reopen > the discussion on byte arrays. I think I hear a moderate consensus developing that the 'ASCII proposal' is a reasonable compromise given the time constraints. But let's not fail to come back to this ASAP -- it _really_ narcs me that every time I load XML into my Python-based editor I'm going to convert large amounts of wide-string data into UTF-8 just so Tk can convert it back to wide-strings in order to display it! ht -- Henry S. 
Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From guido@python.org Thu May 4 13:40:35 2000 From: guido@python.org (Guido van Rossum) Date: Thu, 04 May 2000 08:40:35 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "04 May 2000 09:51:39 BST." References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: <200005041240.IAA08277@eric.cnri.reston.va.us> > I think I hear a moderate consensus developing that the 'ASCII > proposal' is a reasonable compromise given the time constraints. But > let's not fail to come back to this ASAP -- it _really_ narcs me that > every time I load XML into my Python-based editor I'm going to convert > large amounts of wide-string data into UTF-8 just so Tk can convert it > back to wide-strings in order to display it! Thanks -- but that's really Tcl's fault, since the only way to get character data *into* Tcl (or out of it) is through the UTF-8 encoding. And is your XML really stored on disk in its 16-bit format? 
--Guido van Rossum (home page: http://www.python.org/~guido/) From fredrik@pythonware.com Thu May 4 14:21:25 2000 From: fredrik@pythonware.com (Fredrik Lundh) Date: Thu, 4 May 2000 15:21:25 +0200 Subject: [XML-SIG] Re: Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> <200005041240.IAA08277@eric.cnri.reston.va.us> Message-ID: <00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com> Guido van Rossum wrote: > Thanks -- but that's really Tcl's fault, since the only way to get > character data *into* Tcl (or out of it) is through the UTF-8 > encoding. from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars) Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new object or modify an existing object to hold a copy of the Unicode string given by unicode and numChars. 
(Tcl_UniChar* is currently the same thing as Py_UNICODE*) From m.favas@per.dem.csiro.au Thu May 4 20:11:58 2000 From: m.favas@per.dem.csiro.au (Mark Favas) Date: Fri, 05 May 2000 03:11:58 +0800 Subject: [XML-SIG] PyXML 0.5.4 installation glitches Message-ID: <3911CB7E.72CA7564@per.dem.csiro.au> Platform: DEC Alpha, Tru64 Unix V4.0F, Compaq C V6.1-110, Python 1.6a2 (#91, May 5 2000, 01:57:36) (from CVS) Running "python setup.py build" produces the following error: building 'xml.parsers.pyexpat' extension cc -c -Iextensions/expat/xmltok -Iextensions/expat/xmlparse -I/usr/local/include/python1.6 -O -Olimit 1500 extensions/pyexpat.c -o build/temp.osf1V-alpha/extensions/pyexpat.o cc: Error: extensions/pyexpat.c, line 82: The static declaration of "handler_info" is a tentative definition and specifies an incomplete type. (incompstat) static struct HandlerInfo handler_info[]; --------------------------^ error: command 'cc' failed with exit status 1 Changing the indicated line to static struct HandlerInfo handler_info[64]; allows the compilation to proceed with the following warnings: cc: Warning: extensions/pyexpat.c, line 821: In the initializer for handler_info [0].handler, the referenced type of the pointer value "my_StartElementHandler" i s "function (pointer to void, pointer to const char, pointer to pointer to const char) returning void", which is not compatible with "void". (ptrmismatch) my_StartElementHandler}, --------^ cc: Warning: extensions/pyexpat.c, line 824: In the initializer for handler_info [1].handler, the referenced type of the pointer value "my_EndElementHandler" is "function (pointer to void, pointer to const char) returning void", which is not compatible with "void". 
(ptrmismatch) my_EndElementHandler}, --------^ cc: Warning: extensions/pyexpat.c, line 827: In the initializer for handler_info [2].handler, the referenced type of the pointer value "my_ProcessingInstructionH andler" is "function (pointer to void, pointer to const char, pointer to const c har) returning void", which is not compatible with "void". (ptrmismatch) my_ProcessingInstructionHandler}, --------^ cc: Warning: extensions/pyexpat.c, line 830: In the initializer for handler_info [3].handler, the referenced type of the pointer value "my_CharacterDataHandler" is "function (pointer to void, pointer to const char, int) returning void", whic h is not compatible with "void". (ptrmismatch) my_CharacterDataHandler}, --------^ cc: Warning: extensions/pyexpat.c, line 833: In the initializer for handler_info [4].handler, the referenced type of the pointer value "my_UnparsedEntityDeclHand ler" is "function (pointer to void, pointer to const char, pointer to const char , pointer to const char, pointer to const char, pointer to const char) returning void", which is not compatible with "void". (ptrmismatch) my_UnparsedEntityDeclHandler }, --------^ cc: Warning: extensions/pyexpat.c, line 836: In the initializer for handler_info [5].handler, the referenced type of the pointer value "my_NotationDeclHandler" i s "function (pointer to void, pointer to const char, pointer to const char, poin ter to const char, pointer to const char) returning void", which is not compatib le with "void". (ptrmismatch) my_NotationDeclHandler }, --------^ cc: Warning: extensions/pyexpat.c, line 839: In the initializer for handler_info [6].handler, the referenced type of the pointer value "my_StartNamespaceDeclHand ler" is "function (pointer to void, pointer to const char, pointer to const char ) returning void", which is not compatible with "void". 
(ptrmismatch) my_StartNamespaceDeclHandler }, --------^ cc: Warning: extensions/pyexpat.c, line 842: In the initializer for handler_info [7].handler, the referenced type of the pointer value "my_EndNamespaceDeclHandle r" is "function (pointer to void, pointer to const char) returning void", which is not compatible with "void". (ptrmismatch) my_EndNamespaceDeclHandler }, --------^ cc: Warning: extensions/pyexpat.c, line 845: In the initializer for handler_info [8].handler, the referenced type of the pointer value "my_CommentHandler" is "fu nction (pointer to void, pointer to const char) returning void", which is not co mpatible with "void". (ptrmismatch) my_CommentHandler}, --------^ cc: Warning: extensions/pyexpat.c, line 848: In the initializer for handler_info [9].handler, the referenced type of the pointer value "my_StartCdataSectionHandl er" is "function (pointer to void) returning void", which is not compatible with "void". (ptrmismatch) my_StartCdataSectionHandler}, --------^ cc: Warning: extensions/pyexpat.c, line 851: In the initializer for handler_info [10].handler, the referenced type of the pointer value "my_EndCdataSectionHandle r" is "function (pointer to void) returning void", which is not compatible with "void". (ptrmismatch) my_EndCdataSectionHandler}, --------^ cc: Warning: extensions/pyexpat.c, line 854: In the initializer for handler_info [11].handler, the referenced type of the pointer value "my_DefaultHandler" is "f unction (pointer to void, pointer to const char, int) returning void", which is not compatible with "void". (ptrmismatch) my_DefaultHandler}, --------^ cc: Warning: extensions/pyexpat.c, line 857: In the initializer for handler_info [12].handler, the referenced type of the pointer value "my_DefaultHandlerExpandH andler" is "function (pointer to void, pointer to const char, int) returning voi d", which is not compatible with "void". 
(ptrmismatch) my_DefaultHandlerExpandHandler}, --------^ cc: Warning: extensions/pyexpat.c, line 860: In the initializer for handler_info [13].handler, the referenced type of the pointer value "my_NotStandaloneHandler" is "function (pointer to void) returning int", which is not compatible with "vo id". (ptrmismatch) my_NotStandaloneHandler}, --------^ cc: Warning: extensions/pyexpat.c, line 863: In the initializer for handler_info [14].handler, the referenced type of the pointer value "my_ExternalEntityRefHand ler" is "function (pointer to void, pointer to const char, pointer to const char , pointer to const char, pointer to const char) returning int", which is not com patible with "void". (ptrmismatch) my_ExternalEntityRefHandler }, --------^ The link step also appears to have a wildcard quoting problem. The ld command used is: ld -shared -expect_unresolved "*" build/temp.osf1V-alpha/extensions/pyexpat.o build/temp.osf1V-alpha/extensions/expat/xmltok/xmltok.o build/temp.osf1V-alpha/extensions/expat/xmltok/xmlrole.o build/temp.osf1V-alpha/extensions/expat/xmlwf/xmlfile.o build/temp.osf1V-alpha/extensions/expat/xmlwf/xmlwf.o build/temp.osf1V-alpha/extensions/expat/xmlwf/codepage.o build/temp.osf1V-alpha/extensions/expat/xmlparse/xmlparse.o build/temp.osf1V-alpha/extensions/expat/xmlparse/hashtable.o build/temp.osf1V-alpha/extensions/expat/xmlwf/unixfilemap.o -o build/lib.osf1V-alpha/xml/parsers/pyexpat.so which works correctly if put into a /bin/sh script produces pyexpat.so without warnings of unresolved externals (the -expect_unresolved "*" pattern matches all). 
However, when run by Python via the "python setup.py build" command, ld complains about all the unresolved externals: ld: Warning: Unresolved: fread strlen strncpy strcmp free malloc PyType_Type PyObject_GetAttrString _Py_NoneStruct PyObject_Init as if the pattern that ld is trying to match is literally "*" instead of * -- Email - m.favas@per.dem.csiro.au Mark C Favas Phone - +61 8 9333 6268, 0418 926 074 CSIRO Exploration & Mining Fax - +61 8 9383 9891 Private Bag Post Office Wembley GPS - 31.97 S, 115.81 E Western Australia 6014 From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@prescod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@prescod.net><200005011802.OAA21612@eric.cnri.reston.va.us><390DEB45.D8D12337@prescod.net><200005012132.RAA23319@eric.cnri.reston.va.us><390E1F08.EA91599E@prescod.net><200005020053.UAA23665@eric.cnri.reston.va.us><200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: <007701bfb60c$1543f060$34aab5d4@hagrid> Henry S. Thompson wrote: > I think I hear a moderate consensus developing that the 'ASCII > proposal' is a reasonable compromise given the time constraints. agreed. (but even if we settle for "7-bit unicode" in 1.6, there are still a few issues left to sort out before 1.6 final. but it might be best to get back to that after we've added SRE and GC to 1.6a3. we might all need a short break...) > But let's not fail to come back to this ASAP first week in june, promise ;-) From ht@cogsci.ed.ac.uk Fri May 5 09:19:07 2000 From: ht@cogsci.ed.ac.uk (Henry S. 
Thompson) Date: 05 May 2000 09:19:07 +0100 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Guido van Rossum's message of "Thu, 04 May 2000 08:40:35 -0400" References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> <200005041240.IAA08277@eric.cnri.reston.va.us> Message-ID: Guido van Rossum writes: > > I think I hear a moderate consensus developing that the 'ASCII > > proposal' is a reasonable compromise given the time constraints. But > > let's not fail to come back to this ASAP -- it _really_ narcs me that > > every time I load XML into my Python-based editor I'm going to convert > > large amounts of wide-string data into UTF-8 just so Tk can convert it > > back to wide-strings in order to display it! > > Thanks -- but that's really Tcl's fault, since the only way to get > character data *into* Tcl (or out of it) is through the UTF-8 > encoding. > > And is your XML really stored on disk in its 16-bit format? No, I have no idea what encoding it's in, my XML parser supports over a dozen encodings, and quite sensibly always delivers the content, as per the XML REC, as wide-strings. ht -- Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From ht@cogsci.ed.ac.uk Fri May 5 09:21:41 2000 From: ht@cogsci.ed.ac.uk (Henry S. 
Thompson) Date: 05 May 2000 09:21:41 +0100 Subject: [XML-SIG] Re: Unicode debate In-Reply-To: "Fredrik Lundh"'s message of "Thu, 4 May 2000 15:21:25 +0200" References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> <200005041240.IAA08277@eric.cnri.reston.va.us> <00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com> Message-ID: "Fredrik Lundh" writes: > Guido van Rossum wrote: > > Thanks -- but that's really Tcl's fault, since the only way to get > > character data *into* Tcl (or out of it) is through the UTF-8 > > encoding. > > from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm > > Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars) > > Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new > object or modify an existing object to hold a copy of the > Unicode string given by unicode and numChars. > > (Tcl_UniChar* is currently the same thing as Py_UNICODE*) > Any way this can be exploited in Tkinter? ht -- Henry S. 
Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From fredrik@pythonware.com Fri May 5 10:08:41 2000 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 5 May 2000 11:08:41 +0200 Subject: [Python-Dev] Re: [XML-SIG] Re: Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@prescod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@prescod.net><200005011802.OAA21612@eric.cnri.reston.va.us><390DEB45.D8D12337@prescod.net><200005012132.RAA23319@eric.cnri.reston.va.us><390E1F08.EA91599E@prescod.net><200005020053.UAA23665@eric.cnri.reston.va.us><200005031216.IAA03274@eric.cnri.reston.va.us><200005041240.IAA08277@eric.cnri.reston.va.us><00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com> Message-ID: <010401bfb671$82bc6e50$0500a8c0@secret.pythonware.com> Henry S. Thompson wrote: > > from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm > > > > Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars) > > Any way this can be exploited in Tkinter? fixes for this were checked into CVS last night, so they'll be in the next alpha. From guido@python.org Fri May 5 16:07:48 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 05 May 2000 11:07:48 -0400 Subject: [XML-SIG] Moving Unicode debate to i18n-sig@python.org Message-ID: <200005051507.LAA14262@eric.cnri.reston.va.us> I've moved all my responses to the Unicode debate to the i18n-sig mailing list, where it belongs. Please don't cross-post any more. If you're interested in this issue but aren't subscribed to the i18n-sig list, please subscribe at http://www.python.org/mailman/listinfo/i18n-sig/. To view the archives, go to http://www.python.org/pipermail/i18n-sig/. See you there!
--Guido van Rossum (home page: http://www.python.org/~guido/) From pwolff@mgfairfax.rr.com Sun May 7 02:49:28 2000 From: pwolff@mgfairfax.rr.com (Greg Wolff) Date: Sat, 06 May 2000 21:49:28 -0400 Subject: [XML-SIG] how to obtain Byte offset from the Locator... Message-ID: <3914CBA8.B54C5189@mgfairfax.rr.com> I've a question for this list about obtaining location information during an event call back to the document handler. I'm writing my first Python xml script and having a good time with it. (This C++ dude thinks Python is great...) But, I can't see how to obtain the byte offset from the locator. In expat's C/C++ interface there is a routine long XMLPARSEAPI XML_GetCurrentByteIndex(XML_Parser parser); that allows me to acquire the current byte offset during an event call back. In the Python interface in SAX there is no equivalent routine. Of course, the java documentation of SAX does not document any way to get the byte offset either. My question is: Is there any way to acquire the byte offset of the current Line and Column that the Locator is pointing to during an event call back? I need the information for search indices that I'm building and would rather build the code in Python than C++. Thanks for the help! Greg Wolff pwolff@cox.rr.com From mwh21@cam.ac.uk Sun May 7 11:58:11 2000 From: mwh21@cam.ac.uk (Michael Hudson) Date: 07 May 2000 11:58:11 +0100 Subject: [XML-SIG] whither www.w3.org? Message-ID: Vaguely on topic ... I was just starting to learn about things XML-ish when www.w3.org fell of the 'net, which makes reading specifications a bit difficult. Yesterday I got "connection refused", today i get "host not found". Does anybody know (a) what's going on (b) if there is a web mirror anywhere? TIA, Michael From larsga@garshol.priv.no Sun May 7 12:41:59 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 07 May 2000 13:41:59 +0200 Subject: [XML-SIG] how to obtain Byte offset from the Locator... 
In-Reply-To: <3914CBA8.B54C5189@mgfairfax.rr.com> References: <3914CBA8.B54C5189@mgfairfax.rr.com> Message-ID: * Greg Wolff | | I've a question for this list about obtaining location information | during an event call back to the document handler. I'm writing my first | Python xml script and having a good time with it. (This C++ dude thinks | Python is great...) But, I can't see how to obtain the byte offset from | the locator. There is no way to do that with the Locator. I plan to add SAX 2.0 properties for the byte offset to the expat and xmlproc drivers, since both support this functionality, but at the moment there is no standard way to do this. For speed of access the value of the property should probably be a function (really a method tied to an object). BTW, I've been wondering what namespace to use for this. Should we define common properties/features in the http://www.python.org/ namespace, or should I use my own garshol.priv.no? | I need the information for search indices that I'm building and would | rather build the code in Python than C++. If you _know_ that you are using the expat driver you can look at the drv_pyexpat.py code and see how to find a reference to the expat Parser object and try to get the information from there. Not really the recommended way to do it, but it should work. --Lars M. From larsga@garshol.priv.no Sun May 7 12:43:04 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 07 May 2000 13:43:04 +0200 Subject: [XML-SIG] whither www.w3.org? In-Reply-To: References: Message-ID: * Michael Hudson | | Vaguely on topic ... I was just starting to learn about things XML-ish | when www.w3.org fell of the 'net, which makes reading specifications a | bit difficult. Yesterday I got "connection refused", today i get "host | not found". This may be a local problem. FWIW I can access w3.org with no problems from Norway right now. --Lars M. 
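
On the byte-offset question in the Locator thread above: for anyone who can talk to expat directly rather than through the SAX drivers, the following is a minimal sketch of reading the offset during a callback. It assumes a current Python 3, where parser objects from the standard xml.parsers.expat module expose a CurrentByteIndex attribute (the Python-level view of expat's XML_GetCurrentByteIndex). It is not the drv_pyexpat.py route described above, and the function name element_offsets is made up for illustration.

```python
import xml.parsers.expat


def element_offsets(data):
    """Collect (tag name, byte offset) for every start tag in `data` (bytes).

    Hypothetical helper for illustration; assumes Python 3's
    xml.parsers.expat, not the SAX driver discussed in this thread.
    """
    offsets = []
    parser = xml.parsers.expat.ParserCreate()

    def start_element(name, attrs):
        # Inside a handler, CurrentByteIndex refers to the event being
        # reported, i.e. the start tag that triggered this callback.
        offsets.append((name, parser.CurrentByteIndex))

    parser.StartElementHandler = start_element
    parser.Parse(data, True)  # True: this is the final chunk of input
    return offsets


if __name__ == "__main__":
    for tag, offset in element_offsets(b"<a><b/>text<c></c></a>"):
        print(tag, offset)
```

Passing the document as bytes keeps the reported index aligned with offsets in the original file, which is what a search index would need.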
From tpassin@home.com Sun May 7 15:48:22 2000 From: tpassin@home.com (tpassin@home.com) Date: Sun, 7 May 2000 10:48:22 -0400 Subject: [XML-SIG] whither www.w3.org? Message-ID: <004d01bfb833$4b16ede0$7cac1218@reston1.va.home.com> Michael Hudson asked > Vaguely on topic ... I was just starting to learn about things XML-ish > when www.w3.org fell of the 'net, which makes reading specifications a > bit difficult. Yesterday I got "connection refused", today i get "host > not found". > > Does anybody know (a) what's going on (b) if there is a web mirror > anywhere? > I connected fine today, Sunday 7 May. Maybe it was a transient result of the I-LOVE-U virus. I stopped getting postings from xml-dev the same day it hit, and I still am not getting them. Tom Passin From ht@cogsci.ed.ac.uk Mon May 8 10:03:39 2000 From: ht@cogsci.ed.ac.uk (Henry S. Thompson) Date: 08 May 2000 10:03:39 +0100 Subject: [XML-SIG] whither www.w3.org? In-Reply-To: Michael Hudson's message of "07 May 2000 11:58:11 +0100" References: Message-ID: Michael Hudson writes: > Vaguely on topic ... I was just starting to learn about things XML-ish > when www.w3.org fell of the 'net, which makes reading specifications a > bit difficult. Yesterday I got "connection refused", today i get "host > not found". > > Does anybody know (a) what's going on (b) if there is a web mirror > anywhere? It's the Rutherford mirror that's fallen on its face, not www.w3.org. Maybe when those guys in Oxfordshire come back from their weekend things will get better. ht -- Henry S. 
Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From mwh21@cam.ac.uk Mon May 8 20:22:20 2000 From: mwh21@cam.ac.uk (Michael Hudson) Date: 08 May 2000 20:22:20 +0100 Subject: [XML-SIG] How to get 4DOM to output empty Message-ID: I'm currently using 4DOM to generate XHTML (in a very crufty way that I will probably ask for more help on soon), and I'm finding that 4DOM produces stuff like which I don't *think* is valid XHTML; certainly validator.w3.org doesn't like it. Currently I produce the HTML by doing: p = PrettyPrintVisitor(" ",80,["img"]) open("books.html","w").write(p.visit(newdoc)) Is this normal/sane? I await your wisdom... Cheers, Michael From akuchlin@mems-exchange.org Mon May 8 23:19:33 2000 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Mon, 8 May 2000 18:19:33 -0400 (EDT) Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: References: Message-ID: <14615.15733.667016.982985@amarok.cnri.reston.va.us> Michael Hudson writes: > >which I don't *think* is valid XHTML; certainly validator.w3.org >doesn't like it. Then the validator is broken; the XML 1.0 spec says "If an element is empty, it must be represented either by a start-tag immediately followed by an end-tag or by an empty-element tag." (Unless XHTML specifies that only the empty-element tag is legal. In which the XHTML spec is what's broken.) Can't say off-hand if there's a way to make 4DOM produce empty-element tags; don't have the source code here at work... --amk From mwh21@cam.ac.uk Mon May 8 23:42:16 2000 From: mwh21@cam.ac.uk (Michael Hudson) Date: 08 May 2000 23:42:16 +0100 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: "Andrew M. 
Kuchling"'s message of "Mon, 8 May 2000 18:19:33 -0400 (EDT)" References: <14615.15733.667016.982985@amarok.cnri.reston.va.us> Message-ID: "Andrew M. Kuchling" writes: > Michael Hudson writes: > > > >which I don't *think* is valid XHTML; certainly validator.w3.org > >doesn't like it. > > Then the validator is broken; the XML 1.0 spec says "If an element is > empty, it must be represented either by a start-tag immediately > followed by an end-tag or by an empty-element tag." (Unless XHTML > specifies that only the empty-element tag is legal. In which the > XHTML spec is what's broken.) That's what I thought. And in fact the XHTML recommendation says: Empty elements must either have an end tag or the start tag must end with />. But it also says (in the "informative" appendix C): Also, use the minimized tag syntax for empty elements, e.g.
<br />, as the alternative syntax <br></br>
allowed by XML gives uncertain results in many existing user agents. ... > Can't say off-hand if there's a way to make 4DOM produce empty-element > tags; don't have the source code here at work... ... so I'd still like to know the answer to this question. Plus the empty-element style just looks better to my eyes. Cheers, Michael -- 6. Symmetry is a complexity-reducing concept (co-routines include subroutines); seek it everywhere. -- Alan Perlis, http://www.cs.yale.edu/~perlis-alan/quotes.html From Norman Walsh Mon May 8 23:52:20 2000 From: Norman Walsh (Norman Walsh) Date: 08 May 2000 18:52:20 -0400 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Michael Hudson's message of "08 May 2000 20:22:20 +0100" References: Message-ID: <873dnspsd7.fsf@eris.nwalsh.com> / Michael Hudson was heard to say: | I'm currently using 4DOM to generate XHTML (in a very crufty way that | I will probably ask for more help on soon), and I'm finding that 4DOM | produces stuff like | | Perfectly legit. In XML, there is no distinction between and . Be seeing you, norm -- Norman Walsh | Science is a way of talking about the http://nwalsh.com/ | universe in words that bind it to a | common reality. Magic is a method of | talking to the universe in words that | it cannot ignore. The two are rarely | compatible.--Neil Gaiman From Norman Walsh Mon May 8 23:54:19 2000 From: Norman Walsh (Norman Walsh) Date: 08 May 2000 18:54:19 -0400 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Michael Hudson's message of "08 May 2000 23:42:16 +0100" References: <14615.15733.667016.982985@amarok.cnri.reston.va.us> Message-ID: <87ya5kodpg.fsf@eris.nwalsh.com> / Michael Hudson was heard to say: | But it also says (in the "informative" appendix C): | | Also, use the minimized tag syntax for empty elements, e.g.
<br />, as the alternative syntax <br></br>
allowed by XML gives | uncertain results in many existing user agents. Broken (or not properly XML-aware) user agents. Be seeing you, norm -- Norman Walsh | Blessed is he who expects nothing, for http://nwalsh.com/ | he shall never be disappointed.--Pope From mwh21@cam.ac.uk Tue May 9 00:06:33 2000 From: mwh21@cam.ac.uk (Michael Hudson) Date: 09 May 2000 00:06:33 +0100 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Norman Walsh's message of "08 May 2000 18:54:19 -0400" References: <14615.15733.667016.982985@amarok.cnri.reston.va.us> <87ya5kodpg.fsf@eris.nwalsh.com> Message-ID: Norman Walsh writes: > / Michael Hudson was heard to say: > | But it also says (in the "informative" appendix C): > | > | Also, use the minimized tag syntax for empty elements, e.g.
<br />, as the alternative syntax <br></br>
allowed by XML gives > | uncertain results in many existing user agents. > > Broken (or not properly XML-aware) user agents. Or just old. M. -- incidentally, asking why things are "left out of the language" is a good sign that the asker is fairly clueless. -- Erik Naggum, comp.lang.lisp From hannu@tm.ee Mon May 8 23:08:01 2000 From: hannu@tm.ee (Hannu Krosing) Date: Tue, 09 May 2000 01:08:01 +0300 Subject: [XML-SIG] How to get 4DOM to output empty References: <873dnspsd7.fsf@eris.nwalsh.com> Message-ID: <39173AC1.E1D03550@tm.ee> Norman Walsh wrote: > > / Michael Hudson was heard to say: > | I'm currently using 4DOM to generate XHTML (in a very crufty way that > | I will probably ask for more help on soon), and I'm finding that 4DOM > | produces stuff like > | > | > > Perfectly legit. In XML, there is no distinction between > and . He said XHTML not XML, a standard supposed to be bacwards compatible. ------- Hannu From jsydik@virtualparadigm.com Tue May 9 02:42:59 2000 From: jsydik@virtualparadigm.com (Jeremy J. Sydik) Date: Mon, 08 May 2000 20:42:59 -0500 Subject: [XML-SIG] How to get 4DOM to output empty References: <14615.15733.667016.982985@amarok.cnri.reston.va.us> <87ya5kodpg.fsf@eris.nwalsh.com> Message-ID: <39176D23.8FA5B72D@virtualparadigm.com> The entire Reference is: XHTML 1.0: The Extensible HyperText Markup Language A Reformulation of HTML 4 in XML 1.0 W3C Recommendation 26 January 2000 . . Appendix C. HTML Compatibility Guidelines . . C.2 Empty Elements Include a space before the trailing / and > of empty elements, e.g.
<br />, <hr /> and <img src="karen.jpg" alt="Karen" />. Also, use the minimized tag syntax for empty elements, e.g. <br />, as the alternative syntax <br></br>
allowed by XML gives uncertain results in many existing user agents. As I'm reading this, the point is EXACTLY that we're working with non-aware agents, in particular, those browsers not capable of handling XML (So, really most of the current market last I knew). As far as the original question, output of

<br></br> IS valid XHTML, but not compatible with the current HTML browser base for the most part, hence
<br />. That aside, I think this might work as a workaround for you until an answer from the FT crew shows up:

Make the following changes to Ft/Dom/Ext/PrettyPrintVisitor:

Change __init__ to be:

    def __init__(self, indent, width, plainElements,singleElements=[]):
        self.__indent = indent
        self.__depth = 0
        self.__width = width
        self.__plainElements = plainElements
        self.__singleElements = singleElements
        self.__printPlain = 0
        self.__plainPrinter = PrintVisitor()
        self.__prevNodeIsText = 0
        self.__emptyReturn = 0
        self.__namespaces = [{}]

In visitElement:

Replace:

    st = string.rstrip(st) + '>'

With:

    if node.tagName in self.__singleElements:
        st=string.rstrip(st) + ' />'
    else:
        st = string.rstrip(st) + '>'

Replace:

    if node.ownerDocument.isXml() or node.hasChildNodes() or node.tagName not in HTML_SINGLE_TAGS:

With:

    if node.ownerDocument.isXml() or node.hasChildNodes() or node.tagName not in HTML_SINGLE_TAGS or node.tagName not in self.__singleElements:

At which time your code example would look like:

    p = PrettyPrintVisitor(" ",80,[""],["IMG"])
    open("books.html","w").write(p.visit(newdoc))

Not seeing your full code example, I don't know if this will actually work or not.

Have Fun, Jeremy

Norman Walsh wrote: > > / Michael Hudson was heard to say: > | But it also says (in the "informative" appendix C): > | > | Also, use the minimized tag syntax for empty elements, e.g.
| />, as the alternative syntax

allowed by XML gives > | uncertain results in many existing user agents. > > Broken (or not properly XML-aware) user agents. > > Be seeing you, > norm > > -- > Norman Walsh | Blessed is he who expects nothing, for > http://nwalsh.com/ | he shall never be disappointed.--Pope > > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://www.python.org/mailman/listinfo/xml-sig From mickael.remond@IDEALX.com Tue May 9 08:44:12 2000 From: mickael.remond@IDEALX.com (Mickael Remond) Date: 09 May 2000 09:44:12 +0200 Subject: [XML-SIG] Bug report in DOM: ' instead of " in attribs Message-ID: <7od7mwyxpv.fsf@snake.ird.idealx.com> Hello to all, I think I have found a bug in the DOM source code (pyMXL 0.5.1). This bug prevent me from reading back the XML I have written. I was using the XMLproc Saxdriver. This bug does not seem to be corrected in PyXML 0.5.4 release candidate. The toxml method in the class Element write the attributes with two single quotes instead of using two double quotes as this should be done usually done in XML. Example: doc.toxml writes : and should writes: The diff is the following on dom/core.py: 803c803 < s = s + " %s='" % (attr,) --- > s = s + " %s=\"" % (attr,) 810c810 < s = s + "'" --- > s = s + "\"" Has this bug been identified before ? I hope this bug will be fixed in pyXML 0.5.4 final release. Thank you in advance. -- Mickaël Rémond - mickael.remond@IDEALX.com - http://IDEALX.com From larsga@garshol.priv.no Tue May 9 08:48:34 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 09 May 2000 09:48:34 +0200 Subject: [XML-SIG] Bug report in DOM: ' instead of " in attribs In-Reply-To: <7od7mwyxpv.fsf@snake.ird.idealx.com> References: <7od7mwyxpv.fsf@snake.ird.idealx.com> Message-ID: * Mickael Remond | | I think I have found a bug in the DOM source code (pyMXL | 0.5.1). This bug prevent me from reading back the XML I have | written. I was using the XMLproc Saxdriver. 
This bug does not seem | to be corrected in PyXML 0.5.4 release candidate. | | The toxml method in the class Element write the attributes with two | single quotes instead of using two double quotes as this should be | done usually done in XML. XML allows both single and double quotes, so this should be perfectly OK. Any parser which does not support single quotes is simply broken. Which XML parser does not allow you to read the document back? And can we see the XML that fails? --Lars M. From Fredrik Lundh" Message-ID: <00e001bfb98b$c674cf80$34aab5d4@hagrid> Mickael Remond wrote: > The toxml method in the class Element write the attributes with two single > quotes instead of using two double quotes as this should be done usually done > in XML. XML allows you to use either double quotes or single quotes for attribute values (see the AttValue production). > doc.toxml writes : > that's perfectly valid XML. > Has this bug been identified before ? it's not a bug -- at least not where you think it is. since I strongly doubt that xmlproc messes up on this one, maybe the real bug is that the DOM writer doesn't look for quotes in the attribute content? the following is *not* a valid tag: it should be written as: or From Norman Walsh Tue May 9 11:37:07 2000 From: Norman Walsh (Norman Walsh) Date: 09 May 2000 06:37:07 -0400 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Hannu Krosing's message of "Tue, 09 May 2000 01:08:01 +0300" References: <873dnspsd7.fsf@eris.nwalsh.com> <39173AC1.E1D03550@tm.ee> Message-ID: <87u2g8nh64.fsf@eris.nwalsh.com> / Hannu Krosing was heard to say: | > Perfectly legit. In XML, there is no distinction between | > and . | | He said XHTML not XML, a standard supposed to be bacwards compatible. I understand that, but it's also supposed to be XML. The most emphatic thing that the XHTML spec could say is that one form or the other is preferred. XHTML has to obey the rules of XML.
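The claim that XML itself draws no distinction between the two empty-element forms can be checked with any conforming parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree rather than 4DOM (whose printer options are what this thread is asking about); the helper name shape and the sample markup are illustrative only. It also shows a serializer that lets the caller pick the output form, which is roughly the option being requested for 4DOM:

```python
import xml.etree.ElementTree as ET


def shape(elem):
    """Reduce an element to plain data so two parsed trees can be compared."""
    return (elem.tag, dict(elem.attrib), elem.text, elem.tail,
            [shape(child) for child in elem])


# The same paragraph written with each empty-element form.
short_form = ET.fromstring("<p>line<br /></p>")
long_form = ET.fromstring("<p>line<br></br></p>")

# Both spellings yield the same tree once parsed.
same_tree = shape(short_form) == shape(long_form)

# The serializer, not the tree, decides which spelling is written out.
as_short = ET.tostring(short_form, encoding="unicode", short_empty_elements=True)
as_long = ET.tostring(short_form, encoding="unicode", short_empty_elements=False)
print(same_tree)
print(as_short)
print(as_long)
```

The choice of form is therefore purely an output concern, which is why a per-printer switch is a reasonable place for it.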
Be seeing you, norm -- Norman Walsh | If you settle for what they're giving http://nwalsh.com/ | you, you deserve what you get. From hannu@tm.ee Tue May 9 12:46:29 2000 From: hannu@tm.ee (Hannu Krosing) Date: Tue, 09 May 2000 14:46:29 +0300 Subject: [XML-SIG] How to get 4DOM to output empty References: <873dnspsd7.fsf@eris.nwalsh.com> <39173AC1.E1D03550@tm.ee> <87u2g8nh64.fsf@eris.nwalsh.com> Message-ID: <3917FA95.17F41A79@tm.ee> Norman Walsh wrote: > > / Hannu Krosing was heard to say: > | > Perfectly legit. In XML, there is no distinction between > | > and . > | > | He said XHTML not XML, a standard supposed to be bacwards compatible. > > I understand that, but it's also supposed to be XML. The most emphatic > thing that the XHTML spec could say is that one form or the other is > preferred. XHTML has to obey the rules of XML. My understanding was that XHTML was supposed to define a subset of XML that is also HTML (and actually accepted and rendered more-or-less ok). It is always hard to tell what a recommendation in a "standard" means. For example, if you follow just the requirements and not the recommendations when programming java applets, they usually won't work in the same way on different browsers (or don't work at all). ---------------- Hannu From Norman Walsh Tue May 9 14:10:29 2000 From: Norman Walsh (Norman Walsh) Date: 09 May 2000 09:10:29 -0400 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Hannu Krosing's message of "Tue, 09 May 2000 14:46:29 +0300" References: <873dnspsd7.fsf@eris.nwalsh.com> <39173AC1.E1D03550@tm.ee> <87u2g8nh64.fsf@eris.nwalsh.com> <3917FA95.17F41A79@tm.ee> Message-ID: <87bt2fna2i.fsf@eris.nwalsh.com> / Hannu Krosing was heard to say: | Norman Walsh wrote: | > / Hannu Krosing was heard to say: | > | He said XHTML not XML, a standard supposed to be bacwards compatible. | > | > I understand that, but it's also supposed to be XML. 
The most emphatic | > thing that the XHTML spec could say is that one form or the other is | > preferred. XHTML has to obey the rules of XML. | | My understanding was that XHTML was supposed to define a subset of XML | that is also HTML (and actually accepted and rendered more-or-less ok). | | It is always hard to tell what a recommendation in a "standard" means. Yep. FWIW, if my concern is for presentation rather than compliance with the standard, I usually just add a bogus attribute:
That works just as well as "
" and is often easier to get tools to render. Be seeing you, norm -- Norman Walsh | Our years, our debts, and our enemies http://nwalsh.com/ | are always more numerous than we | imagine.--Charles Nodier From pwolff@mgfairfax.rr.com Tue May 9 19:52:31 2000 From: pwolff@mgfairfax.rr.com (Greg Wolff) Date: Tue, 09 May 2000 14:52:31 -0400 Subject: [XML-SIG] how to obtain Byte offset from the Locator... References: <3914CBA8.B54C5189@mgfairfax.rr.com> Message-ID: <39185E6F.F794E2D3@mgfairfax.rr.com> I have a copy of expat for my C++ code but I have found that I don't have a copy of the driver for expat for the Python code. I have the xmlproc code and it works just fine, but it doesn't have byte offset as far as I can tell. (My first cursory look at the code suggests that it would be better to ask you'all for help rather than try to hack it...) Which file on the xml-sig download page has the Python Expat code in it? I have tried to download a pyexpat file but the link was broken last night when I tried it. If I can get the pyexpat code I'll hack it as Lars M. suggests below (Thanks!). Also, is there any chance of trying to work with the SAX 2.0 Python code? Thanks for the help. /pgw Greg Wolff Lars Marius Garshol wrote: > > * Greg Wolff > | > | But, I can't see how to obtain the byte offset from > | the locator. > > There is no way to do that with the Locator. > > I plan to add SAX 2.0 properties for the byte offset to the expat and > xmlproc drivers, since both support this functionality, but at the > moment there is no standard way to do this. > > ..... > > | I need the information for search indices that I'm building and would > | rather build the code in Python than C++. > > If you _know_ that you are using the expat driver you can look at the > drv_pyexpat.py code and see how to find a reference to the expat > Parser object and try to get the information from there. Not really > the recommended way to do it, but it should work. > > --Lars M. 
> > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://www.python.org/mailman/listinfo/xml-sig From larsga@garshol.priv.no Tue May 9 20:07:13 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 09 May 2000 21:07:13 +0200 Subject: [XML-SIG] how to obtain Byte offset from the Locator... In-Reply-To: <39185E6F.F794E2D3@mgfairfax.rr.com> References: <3914CBA8.B54C5189@mgfairfax.rr.com> <39185E6F.F794E2D3@mgfairfax.rr.com> Message-ID: * Greg Wolff | | I have a copy of expat for my C++ code but I have found that I don't | have a copy of the driver for expat for the Python code. If you download either the XML-SIG package or the saxlib 1.0 package you will get it. | I have the xmlproc code and it works just fine, but it doesn't have | byte offset as far as I can tell. Actually, it does. The get_offset method on the Parser interface will give you what you want. | Which file on the xml-sig download page has the Python Expat code in | it? This is the one you want. | Also, is there any chance of trying to work with the SAX 2.0 Python | code? Well, you can download the SAX 2.0 release for Python and try to use it with xmlproc, but I wouldn't really recommend it. Also, you'd have to add the property yourself. I am working on SAX 2.0 right now, and will try to get this feature in as soon as I can. --Lars M. From larsga@garshol.priv.no Tue May 9 20:10:46 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 09 May 2000 21:10:46 +0200 Subject: [XML-SIG] Updated pyexpat and sgmlop for Windows? Message-ID: I currently don't have access to MSVC++ and as my home machine is Win32 (for the time being) I have problems developing SAX 2.0 drivers for pyexpat and sgmlop. I noticed that the versions in the 0.5.4 release are out of date (at least they don't seem to fit with the pyexpat.c source). 
If anyone could email me the binaries for these or make them available for download somewhere that would make me very happy as I really need this to be able to write the drivers. Thanks! --Lars M. From uogbuji@fourthought.com Wed May 10 03:55:20 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 09 May 2000 20:55:20 -0600 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Message from Michael Hudson of "08 May 2000 23:42:16 BST." Message-ID: <200005100255.UAA04394@localhost.localdomain> > > Can't say off-hand if there's a way to make 4DOM produce empty-element > > tags; don't have the source code here at work... > > ... so I'd still like to know the answer to this question. Plus the > empty-element style just looks better to my eyes. There isn't and there should be, if only for readibility. What do users think? There are two issues here: A) Should the default for printing empty XML elements be the short or long form? Any different for HTML (note that the CVS 4DOM fixes bugs with HTML 4.0 elements forbidden to have an end tag). B) Should the printers accept optional argumuments that are lists of elements to always shorten if empty (or vice-versa if the answer to A is "yes) I say "yes" to both of the above. -- Uche Ogbuji Senior Software Engineer uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-9036, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From uogbuji@fourthought.com Wed May 10 04:01:16 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 09 May 2000 21:01:16 -0600 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Message from Hannu Krosing of "Tue, 09 May 2000 01:08:01 +0300." 
<39173AC1.E1D03550@tm.ee> Message-ID: <200005100301.VAA04423@localhost.localdomain> > Norman Walsh wrote: > > > > / Michael Hudson was heard to say: > > | I'm currently using 4DOM to generate XHTML (in a very crufty way that > > | I will probably ask for more help on soon), and I'm finding that 4DOM > > | produces stuff like > > | > > | > > > > Perfectly legit. In XML, there is no distinction between > > and . > > He said XHTML not XML, a standard supposed to be bacwards compatible. Quite incorrect. First of all, XHTML _is_ XML, or more precisely, an XML application. Second, XHTML is _not_ meant to be backwards-compatible. In fact, as I'm sure Norm Walsh would be quick to point out, one of its key aims is to break the mess caused by endless layers of backwards-compatibility. For one thing, HTML allows minimizations that are not well-formed XML (ergo XHTML), such as "
<br>" rather than "<br />" or "<br></br>
", but that's just the tip 'o the backwards-breaking iceberg, so to speak. -- Uche Ogbuji Senior Software Engineer uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-9036, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From uogbuji@fourthought.com Wed May 10 04:06:32 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 09 May 2000 21:06:32 -0600 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Message from "Jeremy J. Sydik" of "Mon, 08 May 2000 20:42:59 CDT." <39176D23.8FA5B72D@virtualparadigm.com> Message-ID: <200005100306.VAA04457@localhost.localdomain> > That aside, I think this might work as a workaround for you until an > answer from the FT > crew shows up: > > Make the Following Changes to Ft/Dom/Ext/PrettyPrintVisitor: [SNIP] Excellent! Not too far off from what we would have done based on the answers to the survey in the last question. Thanks. -- Uche Ogbuji Senior Software Engineer uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-9036, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From mwh21@cam.ac.uk Wed May 10 08:12:49 2000 From: mwh21@cam.ac.uk (Michael Hudson) Date: 10 May 2000 08:12:49 +0100 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Uche Ogbuji's message of "Tue, 09 May 2000 20:55:20 -0600" References: <200005100255.UAA04394@localhost.localdomain> Message-ID: Uche Ogbuji writes: > A) Should the default for printing empty XML elements be the short > or long form? Any different for HTML (note that the CVS 4DOM fixes > bugs with HTML 4.0 elements forbidden to have an end tag). CVS 4DOM? 
This won't entirely help; but tho only tag I have that is sometimes empty and sometimes not is ; I'd still like to see => > B) Should the printers accept optional argumuments that are lists of > elements to always shorten if empty (or vice-versa if the answer to > A is "yes) > > I say "yes" to both of the above. Me too, but what do I know... Thanks, Michael -- There are 'infinite' number of developed artifacts and one cannot develop appreciation for them all. It would be all right to not understand something, but it would be imbecilic to put judgements on things one don't understand. -- Xah, comp.lang.lisp From Daniel Graham Wed May 10 13:22:32 2000 From: Daniel Graham (Daniel Graham) Date: Wed, 10 May 2000 08:22:32 -0400 (EDT) Subject: [XML-SIG] Problems Building PyXML Message-ID: Hi, I've run into a problem with both PyXML-0.5.3 and PyXML-0.5.4 on my linux box (RedHat 6.1). The output from "python setup.py build" for each follows. I'm a newcomer to python (but already love it) and not much of a hand at c and would greatly appreciate any pointers you might give me. ################ PyXML-0.5.3 ################### rm -f *.o *~ rm -f `find . -name '*.pyc'` rm -f `find . -name '*.o'` rm -f `find . 
-name '*~'` cd expat ; make clean make[1]: Entering directory `/usr/local/PyXML-0.5.3/extensions/expat' rm -f xmltok/xmltok.o xmltok/xmlrole.o xmlwf/xmlwf.o xmlwf/xmlfile.o xmlwf/codepage.o xmlparse/xmlparse.o xmlparse/hashtable.o xmlwf/unixfilemap.o xmlwf/xmlwf make[1]: Leaving directory `/usr/local/PyXML-0.5.3/extensions/expat' rm -f *.a tags TAGS config.c Makefile.pre python sedscript rm -f *.so *.sl so_locations cd expat ; make clobber make[1]: Entering directory `/usr/local/PyXML-0.5.3/extensions/expat' rm -f xmltok/xmltok.o xmltok/xmlrole.o xmlwf/xmlwf.o xmlwf/xmlfile.o xmlwf/codepage.o xmlparse/xmlparse.o xmlparse/hashtable.o xmlwf/unixfilemap.o xmlwf/xmlwf rm -f libexpat.a make[1]: Leaving directory `/usr/local/PyXML-0.5.3/extensions/expat' VERSION=`python -c "import sys; print sys.version[:3]"`; \ installdir=`python -c "import sys; print sys.prefix"`; \ exec_installdir=`python -c "import sys; print sys.exec_prefix"`; \ make -f ./Makefile.pre.in VPATH=. srcdir=. \ VERSION=$VERSION \ installdir=$installdir \ exec_installdir=$exec_installdir \ Makefile make[1]: Entering directory `/usr/local/PyXML-0.5.3/extensions' make[1]: *** No rule to make target `/usr/lib/python1.5/config/Makefile', needed by `sedscript'. Stop. make[1]: Leaving directory `/usr/local/PyXML-0.5.3/extensions' make: *** [boot] Error 2 make: *** No targets. Stop. Executing 'build' action... Running command: make -f Makefile.pre.in boot Running command: make Traceback (innermost last): File "setup.py", line 173, in ? func() File "setup.py", line 143, in build_unix shutil.copy('extensions/' + filename, 'build/xml/parsers/') File "/usr/lib/python1.5/shutil.py", line 52, in copy copyfile(src, dst) File "/usr/lib/python1.5/shutil.py", line 17, in copyfile fsrc = open(src, 'rb') IOError: [Errno 2] No such file or directory: 'extensions/pyexpat.so' ################ PyXML-0.5.4 ################### rm -f *.o *~ rm -f `find . -name '*.pyc'` rm -f `find . -name '*.o'` rm -f `find . 
-name '*~'` cd expat ; make clean make[1]: Entering directory `/usr/local/PyXML-0.5.4/extensions/expat' rm -f xmltok/xmltok.o xmltok/xmlrole.o xmlwf/xmlwf.o xmlwf/xmlfile.o xmlwf/codepage.o xmlparse/xmlparse.o xmlparse/hashtable.o xmlwf/unixfilemap.o xmlwf/xmlwf make[1]: Leaving directory `/usr/local/PyXML-0.5.4/extensions/expat' rm -f *.a tags TAGS config.c Makefile.pre python sedscript rm -f *.so *.sl so_locations cd expat ; make clobber make[1]: Entering directory `/usr/local/PyXML-0.5.4/extensions/expat' rm -f xmltok/xmltok.o xmltok/xmlrole.o xmlwf/xmlwf.o xmlwf/xmlfile.o xmlwf/codepage.o xmlparse/xmlparse.o xmlparse/hashtable.o xmlwf/unixfilemap.o xmlwf/xmlwf rm -f libexpat.a make[1]: Leaving directory `/usr/local/PyXML-0.5.4/extensions/expat' VERSION=`python -c "import sys; print sys.version[:3]"`; \ installdir=`python -c "import sys; print sys.prefix"`; \ exec_installdir=`python -c "import sys; print sys.exec_prefix"`; \ make -f ./Makefile.pre.in VPATH=. srcdir=. \ VERSION=$VERSION \ installdir=$installdir \ exec_installdir=$exec_installdir \ Makefile make[1]: Entering directory `/usr/local/PyXML-0.5.4/extensions' make[1]: *** No rule to make target `/usr/lib/python1.5/config/Makefile', needed by `sedscript'. Stop. make[1]: Leaving directory `/usr/local/PyXML-0.5.4/extensions' make: *** [boot] Error 2 make: *** No targets. Stop. Executing 'build' action... Running command: make -f Makefile.pre.in boot Running command: make Traceback (innermost last): File "setup.py", line 185, in ? func() File "setup.py", line 155, in build_unix shutil.copy('extensions/' + filename, 'build/xml/parsers/') File "/usr/lib/python1.5/shutil.py", line 52, in copy copyfile(src, dst) File "/usr/lib/python1.5/shutil.py", line 17, in copyfile fsrc = open(src, 'rb') IOError: [Errno 2] No such file or directory: 'extensions/pyexpat.so' ##################################################### Thanks, Dan -- Daniel A. 
Graham Duke University Professor and Director Durham NC 27708-0097 of Graduate Studies daniel.graham@duke.edu Department of Economics (919) 660-1802 From Daniel Graham Wed May 10 14:17:14 2000 From: Daniel Graham (Daniel Graham) Date: Wed, 10 May 2000 09:17:14 -0400 (EDT) Subject: [XML-SIG] Problems Building PyXML Message-ID: Dumb mistake - sorry! The missing /usr/lib/python1.5/config files are in the devel rpm (currently python-devel-1.5.2-13.i386.rpm). After installing this rpm, the build and install went without a hitch. Dan -- Daniel A. Graham Duke University Professor and Director Durham NC 27708-0097 of Graduate Studies daniel.graham@duke.edu Department of Economics (919) 660-1802 From fdrake@acm.org Wed May 10 15:59:12 2000 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Wed, 10 May 2000 10:59:12 -0400 (EDT) Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: <200005100255.UAA04394@localhost.localdomain> References: <200005100255.UAA04394@localhost.localdomain> Message-ID: <14617.31040.194997.275911@seahag.cnri.reston.va.us> Uche Ogbuji writes: > A) Should the default for printing empty XML elements be the short or long > form? Any different for HTML (note that the CVS 4DOM fixes bugs with HTML 4.0 > elements forbidden to have an end tag). > > B) Should the printers accept optional argumuments that are lists of elements > to always shorten if empty (or vice-versa if the answer to A is "yes) > > I say "yes" to both of the above. I agree; I think the default should be to use the short form for empty elements. -Fred -- Fred L. Drake, Jr. Corporation for National Research Initiatives From paul@prescod.net Wed May 10 17:36:56 2000 From: paul@prescod.net (Paul Prescod) Date: Wed, 10 May 2000 09:36:56 -0700 Subject: [XML-SIG] Pull Parsing References: <390F5CB1.FBE70A92@prescod.net> <011001bfb555$bdc64b00$34aab5d4@hagrid> Message-ID: <39199028.2CE477C8@prescod.net> Fredrik showed how to turn incremental push parsers into pull parsers. Neat. 
I must admit that I always presumed that the conversion would be done at the SAX level so I couldn't think of a way to do it. The incremental API makes all the difference (maybe we should propose an incremental extension to SAX).

I wonder if a pull style interface is intrinsically easier for the average Python programmer to get their heads around.

* People tend to get uncomfortable when you take flow control out of their hands (as push parsers do).
* There are people out there who are against inheritance and other trappings of object orientation (except struct-like field access, it seems).
* Push parsers require a standardization of the parser *and* the handler. Pull parsers do away with the concept of a handler (and filter) altogether.
* Pull parsers allow a very basic form of cooperative multithreading where you could read from several files and check other event queues
* Pull parsers can always be turned into push parsers trivially

A very simple API is forming in my head:

    domnode="dummy"
    while domnode:
        domnode = puller.get()
        if domnode.nodeType==TEXT:
            ...
        elif domnode.nodeType==ELEMENT_NODE:
            if domnode.tagName=="Foo":
                puller.expandTree( domnode )
                (walk around the tree)
            elif domnode.tagName=="Bar":
                ...
        elif domnode.nodeType==...:
            ...
        elif domnode.nodeType== :
            ...

If you want to mix in XPaths:

    while domnode:
        if xpath.matches( "some/xpath", node ):
            ...
        elif xpath.matches( "some/other/xpath", node ):
            ...
        elif xpath.matches( "another/xpath", node ):
            ...

Opinions?

-- Paul Prescod - ISOGEN Consulting Engineer speaking for himself Art is always at peril in universities, where there are so many people, young and old, who love art less than argument, and dote upon a text that provides the nutritious pemmican on which scholars love to chew.
-- Robertson Davies in "The Cunning Man" From uogbuji@fourthought.com Thu May 11 01:32:08 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Wed, 10 May 2000 18:32:08 -0600 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Message from Michael Hudson of "10 May 2000 08:12:49 BST." Message-ID: <200005110032.SAA12792@localhost.localdomain> > Uche Ogbuji writes: > > > A) Should the default for printing empty XML elements be the short > > or long form? Any different for HTML (note that the CVS 4DOM fixes > > bugs with HTML 4.0 elements forbidden to have an end tag). > > CVS 4DOM? This won't entirely help; but the only tag I have that is > sometimes empty and sometimes not is ; I'd still like to see I should clarify: "The current state of 4DOM in our CVS repository" > => > > > B) Should the printers accept optional arguments that are lists of > > elements to always shorten if empty (or vice-versa if the answer to > > A is "yes") > > > > I say "yes" to both of the above. > > Me too, but what do I know... This seems to be the consensus from both public and private messages. We'll put in the fix. --Uche From Juergen Hermann" On Wed, 10 May 2000 09:36:56 -0700, Paul Prescod wrote: >A very simple API is forming in my head: I would not return DOM nodes, but PYX-like tuples (node-type, node-name, node-value). You wanted it simple! :) You can then put higher levels of abstraction above that, like RAX or your simple-DOM interface.
Ciao, Jürgen -- Jürgen Hermann (jhe@webde-ag.de) WEB.DE AG, Amalienbadstr.41, D-76227 Karlsruhe Tel.: 0721/94329-0, Fax: 0721/94329-22 From paul@prescod.net Thu May 11 13:06:46 2000 From: paul@prescod.net (Paul Prescod) Date: Thu, 11 May 2000 05:06:46 -0700 Subject: [XML-SIG] Pull Parsing References: <20000511083227234.AAA325.334@pcfue9> Message-ID: <391AA256.1ABFACB8@prescod.net> Juergen Hermann wrote: > > On Wed, 10 May 2000 09:36:56 -0700, Paul Prescod wrote: > > >A very simple API is forming in my head: > > I would not return DOM nodes, but PYX-like tuples (node-type, node-name, > node-value). You wanted it simple! :) Some nodes have no meaningful value, some have no meaningful name and some have "extra" information like attributes and processing instruction targets. Overall I don't think that it is simpler. > You can then put higher levels of abstraction above that, like RAX or your > simple-DOM interface. Unfortunately we have to be careful about the number of levels of abstraction we build in. The push->pull abstraction would already suck some speed... -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself Art is always at peril in universities, where there are so many people, young and old, who love art less than argument, and dote upon a text that provides the nutritious pemmican on which scholars love to chew.
-- Robertson Davies in "The Cunning Man" From larsga@garshol.priv.no Thu May 11 13:50:58 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: Thu, 11 May 2000 14:50:58 +0200 Subject: [XML-SIG] xmlproc: Version 0.70 released Message-ID: <200005111250.OAA11417@lambda.garshol.priv.no>

Changes since version 0.61:

- lots of bug fixes
- some internal code cleanups
- optimizations (it is now even faster than before)
- xmlproc now has a formal license statement (BSD-ish)
- the APIs have been extended
- a tool for converting DTDs to XML Schemas has been added
- a GUI interface to the parser has been added

The home page has moved to a permanent new location, which is: --Lars M. From paul@prescod.net Thu May 11 15:46:26 2000 From: paul@prescod.net (Paul Prescod) Date: Thu, 11 May 2000 07:46:26 -0700 Subject: [XML-SIG] Pull Parsing References: <390F5CB1.FBE70A92@prescod.net> <011001bfb555$bdc64b00$34aab5d4@hagrid> <39199028.2CE477C8@prescod.net> Message-ID: <391AC7C2.9789028F@prescod.net> In retrospect, pull parsing is not much different than push parsing for complex applications. You still want a level of indirection between the parser, the matcher and the code that is run. Pull parsing seems most ideal for simple applications where people want to get started immediately without understanding handlers, parsers and the interaction between them. I still feel that for many people it would be simpler in that context. -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself Art is always at peril in universities, where there are so many people, young and old, who love art less than argument, and dote upon a text that provides the nutritious pemmican on which scholars love to chew. -- Robertson Davies in "The Cunning Man" From larsga@garshol.priv.no Fri May 12 09:38:52 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 12 May 2000 10:38:52 +0200 Subject: [XML-SIG] Updated pyexpat and sgmlop for Windows?
In-Reply-To: References: Message-ID: Thanks to Oliver Gathmann and Chris Olds I now have the binaries I need, so you can disregard this request now. However, I do think we should have the new binaries in the CVS tree. Should I put the ones I got there? --Lars M. From gstein@lyra.org Fri May 12 09:46:17 2000 From: gstein@lyra.org (Greg Stein) Date: Fri, 12 May 2000 01:46:17 -0700 (PDT) Subject: [XML-SIG] Updated pyexpat and sgmlop for Windows? In-Reply-To: Message-ID: On 12 May 2000, Lars Marius Garshol wrote: > Thanks to Oliver Gathmann and Chris Olds I now have the binaries I > need, so you can disregard this request now. > > However, I do think we should have the new binaries in the CVS tree. > Should I put the ones I got there? I don't think binaries should be in the tree. The tree is *source* and should be used to build binaries for particular platforms. You are opening yourself up to trouble if you put binaries in there. It is simply too easy to fall out of sync. Cheers, -g -- Greg Stein, http://www.lyra.org/ From larsga@garshol.priv.no Fri May 12 09:51:53 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 12 May 2000 10:51:53 +0200 Subject: [XML-SIG] Updated pyexpat and sgmlop for Windows? In-Reply-To: References: Message-ID: * Greg Stein | | I don't think binaries should be in the tree. The tree is *source* | and should be used to build binaries for particular platforms. That means that we require all Win32 users to have MSVC++ installed. I don't think that's reasonable. In other words, if the binaries are not in the CVS tree then I think they should be available somewhere else. | You are opening yourself up to trouble if you put binaries in | there. It is simply too easy to fall out of sync. That is a valid point. Perhaps the binaries should be in the distribution, but not in the source tree? --Lars M. 
From gstein@lyra.org Fri May 12 09:52:44 2000 From: gstein@lyra.org (Greg Stein) Date: Fri, 12 May 2000 01:52:44 -0700 (PDT) Subject: [XML-SIG] Updated pyexpat and sgmlop for Windows? In-Reply-To: Message-ID: On 12 May 2000, Lars Marius Garshol wrote: >... > | You are opening yourself up to trouble if you put binaries in > | there. It is simply too easy to fall out of sync. > > That is a valid point. Perhaps the binaries should be in the > distribution, but not in the source tree? I *totally* agree on this point. Yes: the distro. No: the tree. :-) Of course, it is always possible for somebody to periodically build snapshots of the distro (or just the Win32 stuff) and drop them onto python.org for people to pick up. (or wherever access rights are available for this task) Cheers, -g -- Greg Stein, http://www.lyra.org/ From anthony@interlink.com.au Fri May 12 14:33:37 2000 From: anthony@interlink.com.au (Anthony Baxter) Date: Fri, 12 May 2000 23:33:37 +1000 Subject: [XML-SIG] Builder doesn't quote ' or " properly? Message-ID: <200005121333.XAA31777@mbuna.arbhome.com.au> Just for confirmation, the following is bad behaviour by the Builder code, right? This code:

#-------------
from xml.dom.builder import Builder
b = Builder()
b.startElement("blob"); b.text("\012")
b.startElement("line", {"text":"it's..."})
b.endElement("line"); b.text("\012")
b.startElement("line", {"text":"doug & dinsdale"})
b.endElement("line"); b.text("\012")
b.startElement("line", {"text":'"spiny" norman'})
b.endElement("line"); b.text("\012")
b.endElement("blob")
print b.document.toxml()
#-------------

produces this output

#-------------
#-------------

Note that it quotes the &, but not the ' or the ". Suggestions? Should I be quoting this by hand? Should I be hunting into the code to find out where it's getting it wrong? ta, Anthony -- Anthony Baxter It's never too late to have a happy childhood. From akuchlin@mems-exchange.org Fri May 12 15:15:47 2000 From: akuchlin@mems-exchange.org (Andrew M.
Kuchling) Date: Fri, 12 May 2000 10:15:47 -0400 (EDT) Subject: [XML-SIG] Builder doesn't quote ' or " properly? In-Reply-To: <200005121333.XAA31777@mbuna.arbhome.com.au> References: <200005121333.XAA31777@mbuna.arbhome.com.au> Message-ID: <14620.4627.784694.268632@amarok.cnri.reston.va.us> Anthony Baxter writes: >Suggestions? Should I be quoting this by hand? Should I be hunting into >the code to find out where it's getting it wrong? It's clearly a bug; the toxml() method of the Element class should be escaping ' in attributes. The " doesn't need to be escaped inside an attribute surrounded by ' characters, though. --amk From Anthony Baxter Fri May 12 15:41:41 2000 From: Anthony Baxter (Anthony Baxter) Date: Sat, 13 May 2000 00:41:41 +1000 Subject: [XML-SIG] Builder doesn't quote ' or " properly? In-Reply-To: Message from "Andrew M. Kuchling" of "Fri, 12 May 2000 10:15:47 -0400." <14620.4627.784694.268632@amarok.cnri.reston.va.us> Message-ID: <200005121441.AAA32597@mbuna.arbhome.com.au> >>> "Andrew M. Kuchling" wrote > Anthony Baxter writes: > >Suggestions? Should I be quoting this by hand? Should I be hunting into > >the code to find out where it's getting it wrong? > > It's clearly a bug; the toxml() method of the Element class should be > escaping ' in attributes. The " doesn't need to be escaped inside an > attribute surrounded by ' characters, though. with that pointer, finding the problem was relatively easy. Rather than unconditionally quoting ' in xml.utils.escape, I only did it for xml.dom.core.Element.toxml Would replacing the three 'string.replace' calls in xml.utils.escape with a regex be a worthwhile optimisation? patch is against CVS version, if that helps. *** xml/dom/core.py.dist Sat May 13 00:38:20 2000 --- xml/dom/core.py Sat May 13 00:38:47 2000 *************** *** 870,876 **** append(" %s='" % (attr,)) for value in attrnode.children: if value.type == TEXT_NODE: ! 
append(escape(value.value) ) else: n = NODE_CLASS[ value.type ] (value, self._document) append(value.toxml()) --- 870,876 ---- append(" %s='" % (attr,)) for value in attrnode.children: if value.type == TEXT_NODE: ! append(escape(value.value, {"'":"'"}) ) else: n = NODE_CLASS[ value.type ] (value, self._document) append(value.toxml()) Anthony From larsga@garshol.priv.no Fri May 12 15:50:25 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 12 May 2000 16:50:25 +0200 Subject: [XML-SIG] Builder doesn't quote ' or " properly? In-Reply-To: <200005121441.AAA32597@mbuna.arbhome.com.au> References: <200005121441.AAA32597@mbuna.arbhome.com.au> Message-ID: * Anthony Baxter | | Would replacing the three 'string.replace' calls in xml.utils.escape | with a regex be a worthwhile optimisation? My experience suggests that it would actually slow things down. --Lars M. From anthony@interlink.com.au Mon May 15 04:04:08 2000 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 15 May 2000 13:04:08 +1000 Subject: [XML-SIG] problems using XML-sig code to read large XML files. Message-ID: <200005150304.NAA02965@mbuna.arbhome.com.au> I'm using the XML-sig code to read in a largish (2.5M) XML document. This document consists of a very very simple structure, like this: [..more countries..] this was generated using the xml-sig code. However, when I try to read it in using something like: from xml.dom import utils reader = utils.FileReader('out.xml') doc = reader.document I get an error: File "read.py", line 2, in ? 
reader = utils.FileReader('out.xml') File "/opt/python/lib/python1.5/site-packages/xml/dom/utils.py", line 131, in __init__ self.document = self.readFile(filename) File "/opt/python/lib/python1.5/site-packages/xml/dom/utils.py", line 140, in readFile document = self.readStream(file,type) File "/opt/python/lib/python1.5/site-packages/xml/dom/utils.py", line 148, in readStream document = self.readXml(stream) File "/opt/python/lib/python1.5/site-packages/xml/dom/utils.py", line 165, in readXml p.feed(stream.read()) File "/opt/python/lib/python1.5/site-packages/xml/sax/drivers/drv_pyexpat.py", line 123, in feed if not self.parser.Parse(data): pyexpat.error: not well-formed: line 37162, column 19 Using the other example on http://www.python.org/doc/howto/xml/node12.html I get something like Traceback (innermost last): File "read.py", line 16, in ? p.close() File "/opt/python/lib/python1.5/site-packages/xml/sax/drivers/drv_pyexpat.py", line 127, in close if not self.parser.Parse("",1): pyexpat.error: no element found: line 16148, column 16 Running both of them repeatedly gives different positions in the file. None of the lines mentioned in the file have a problem. Zope with the Ft ZDOM or the normal Zope DOM code have no problems with it. nsgmls has no problem with it. I've tried both the 0.5.4 and current CVS versions, to no avail. The dom_from_xml_file.py demo in Ft.Dom.demo also breaks. I can make the file available if anyone wants it, although just taking the example above and making 10,000 copies of the country into a file will do the trick. anyone? thanks, Anthony From anthony@interlink.com.au Mon May 15 10:13:37 2000 From: anthony@interlink.com.au (Anthony Baxter) Date: Mon, 15 May 2000 19:13:37 +1000 Subject: [XML-SIG] fast dump/restore of an XML document? Message-ID: <200005150913.TAA05742@mbuna.arbhome.com.au> Once you've built an XML document in memory, what's the fastest way to get it to disk, and then re-read it in again afterwards? 
cPickle is appallingly slow, and generates _huge_ output. toxml() and then parsing it in again is also quite slow. I would have thought ESIS would be fast to read in, but nope. EsisBuilder seems to take (based on a number of runs) something like 40-50 times as long as utils.FileReader() so, what do other folks use? Anthony From sf@fermigier.com Mon May 15 11:08:48 2000 From: sf@fermigier.com (Stefane Fermigier) Date: Mon, 15 May 2000 12:08:48 +0200 Subject: [XML-SIG] SAX drivers comparison (PyXML 0.54). Message-ID: <20000515120848.L9720@cantor.math.jussieu.fr> Hi, I wrote the following script to test SAX drivers' speed and compatibility. The results, when run on http://www.dmoz.org/rdf/content.example.txt are:

Parser: xml.sax.drivers.drv_sgmlop, time: 0.001875, 0 bytes written.
Parser: xml.sax.drivers.drv_pyexpat, time: 0.109332, 4533 bytes written.
!!! xml.sax.drivers.drv_xmltok Error No parsers found
Parser: xml.sax.drivers.drv_xmlproc, time: 0.611368, 4996 bytes written.
!!! xml.sax.drivers.drv_xmltoolkit Error No parsers found
Parser: xml.sax.drivers.drv_xmllib, time: 1.250232, 28223 bytes written.
!!! xml.sax.drivers.drv_xmldc Error No parsers found

What I find most annoying is the fact that no two of the 3 drivers that work (and I had to change /usr/lib/python1.5/site-packages/xml/sax/drivers/drv_sgmlop.py line 76 so that sgmlop works, maybe that was a mistake ?) give the same result (bytes written by a trivial document handler).
Here's the script:

###############################################################################
import time, traceback, sys, StringIO
import xml.sax.saxexts, xml.sax.saxlib

# SAX 1.0 drivers to try, in order.
parser_names = ["xml.sax.drivers.drv_sgmlop",
                "xml.sax.drivers.drv_pyexpat",
                "xml.sax.drivers.drv_xmltok",
                "xml.sax.drivers.drv_xmlproc",
                "xml.sax.drivers.drv_xmltoolkit",
                "xml.sax.drivers.drv_xmllib",
                "xml.sax.drivers.drv_xmldc"]

class ContentHandler(xml.sax.saxlib.DocumentHandler):
    # Trivial handler: writes one line per start tag into the buffer.
    def __init__(self, buff):
        self.buff = buff

    def startElement(self, name, attrs):
        self.buff.write(name + '\n')

for parser_name in parser_names:
    try:
        parser = xml.sax.saxexts.make_parser(parser_name)
        buff = StringIO.StringIO()
        parser.setDocumentHandler(ContentHandler(buff))
        start = time.time()
        parser.parseFile(open(sys.argv[1]))
        buff.seek(0)
        print "Parser: %s, time: %f, %d bytes written." % (
            parser_name, time.time() - start, len(buff.read()))
    except:
        # Report any failure (including a missing parser) and move on.
        #traceback.print_exc()
        print '!!!', parser_name, 'Error', sys.exc_info()[1]
###############################################################################

Regards, (FYI my current goal is to parse as fast as possible something like http://www.dmoz.org/rdf/content.rdf.u8.gz which is a 500+ Mb XML file). S. -- Stéfane Fermigier, Tel: 06 63 04 12 77 (mobile). : the Linux / free software portal. "Amazon: we patent the dot in .com" From larsga@garshol.priv.no Mon May 15 12:15:14 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 15 May 2000 13:15:14 +0200 Subject: [XML-SIG] SAX drivers comparison (PyXML 0.54). In-Reply-To: <20000515120848.L9720@cantor.math.jussieu.fr> References: <20000515120848.L9720@cantor.math.jussieu.fr> Message-ID: * Stefane Fermigier | | What I find most annoying is the fact that no two of the 3 drivers | that work [...] give the same result (bytes written by a | trivial document handler). Strangely, this does not match my results, which were run on the code currently in the CVS tree on Linux with Python 1.5.1. !!!
xml.sax.drivers.drv_sgmlop Error No parsers found !!! xml.sax.drivers.drv_pyexpat Error No parsers found !!! xml.sax.drivers.drv_xmltok Error No parsers found Parser: xml.sax.drivers.drv_xmlproc, time: 3.286604, 2877 bytes written. !!! xml.sax.drivers.drv_xmltoolkit Error No parsers found Parser: xml.sax.drivers.drv_xmllib, time: 5.625473, 2877 bytes written. !!! xml.sax.drivers.drv_xmldc Error No parsers found I don't have pyexpat on that machine, mainly because the makefile seems to make some assumptions about file structure that are not correct, and I don't have the time to fix that now. However, xmlproc and xmllib are in perfect agreement here, as I would expect them to be. Could you try with the code from the CVS tree as well and see what happens? | (and I had to change | /usr/lib/python1.5/site-packages/xml/sax/drivers/drv_sgmlop.py line | 76 so that sgmlop works, maybe that was a mistake ?) Line 76 is return Parser() in the CVS tree. What is wrong with that? --Lars M. From larsga@garshol.priv.no Mon May 15 16:26:22 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 15 May 2000 17:26:22 +0200 Subject: [XML-SIG] SAX drivers comparison (PyXML 0.54). In-Reply-To: References: <20000515120848.L9720@cantor.math.jussieu.fr> Message-ID: I tried this experiment again, this time on a different Linux box with Python 1.5.2 and with pyexpat compiled and installed, and also with the CVS tree first on the PYTHONPATH. This time I got this result: [larsga@pc-larsga tmp]$ python fermigier.py content.example.txt !!! xml.sax.drivers.drv_sgmlop Error No parsers found Parser: xml.sax.drivers.drv_pyexpat, time: 0.180707, 2877 bytes written. !!! xml.sax.drivers.drv_xmltok Error No parsers found Parser: xml.sax.drivers.drv_xmlproc, time: 0.613989, 2877 bytes written. !!! xml.sax.drivers.drv_xmltoolkit Error No parsers found Parser: xml.sax.drivers.drv_xmllib, time: 1.596865, 13025 bytes written. !!! 
xml.sax.drivers.drv_xmldc Error No parsers found xmlproc gives the same result as it did last time, and pyexpat agrees. Neither agrees with Stefane's results. This time xmllib does not agree, though, and like Stefane I get a much larger result. A quick look at the output from xmllib shows that the problem is the newest version of xmllib (which I didn't use in the previous test), which does namespace processing, so all the element names come out as 'http://purl.org/dc/elements/1.0/ Title'. This processing can't be turned off, so there is no cure for it except to use an older version. With SAX 2.0 this problem will be handled, since there namespace processing is the default, and namespace-less processing is optional. When you try to turn it off the xmllib driver will complain, whereas the ones for pyexpat and xmlproc will allow you to do it. However, I'm still unable to find any reason why pyexpat and xmlproc misbehave in Stefane's experiment. --Lars M. From sf@fermigier.com Mon May 15 16:40:45 2000 From: sf@fermigier.com (Stefane Fermigier) Date: Mon, 15 May 2000 17:40:45 +0200 Subject: [XML-SIG] Re: SAX Drivers comparisons. Message-ID: <20000515174044.A46559@cantor.math.jussieu.fr> Lars Marius Garshol larsga@garshol.priv.no: > I tried this experiment again, this time on a different Linux box with > Python 1.5.2 and with pyexpat compiled and installed, and also with > the CVS tree first on the PYTHONPATH. > > This time I got this result: > > [larsga@pc-larsga tmp]$ python fermigier.py content.example.txt > !!! xml.sax.drivers.drv_sgmlop Error No parsers found > Parser: xml.sax.drivers.drv_pyexpat, time: 0.180707, 2877 bytes written. > !!! xml.sax.drivers.drv_xmltok Error No parsers found > Parser: xml.sax.drivers.drv_xmlproc, time: 0.613989, 2877 bytes written. > !!! xml.sax.drivers.drv_xmltoolkit Error No parsers found > Parser: xml.sax.drivers.drv_xmllib, time: 1.596865, 13025 bytes written. > !!!
xml.sax.drivers.drv_xmldc Error No parsers found OK, now I get the same results (strange), using either PyXML-0.54 or the CVS source (python 1.5.2 on Mandrake 7.0.) Installing the CVS version gives me annoying error messages:

--
[root@r76m64 xml]# python setup.py install 2>&1 | more
make: Nothing to be done for `default'.
  File "/usr/lib/python1.5/site-packages/xml/parsers/xmlproc/catalog.py", line 4
    """ An SGML Open catalog file parser. $Id: catalog.py,v 1.8 2000/05/12 18:39:58 lars Exp $ """
    ^
SyntaxError: invalid syntax
--

Regarding sgmlop, I still get:

!!! xml.sax.drivers.drv_sgmlop Error function requires exactly 1 argument; 2 given

The exact traceback is:

Traceback (innermost last):
  File "feed_test", line 21, in ?
    parser.setDocumentHandler(ContentHandler(buff))
  File "/usr/lib/python1.5/site-packages/xml/sax/drivers/drv_sgmlop.py", line 29, in setDocumentHandler
    self.parser.register(DHWrapper(dh), 1)
TypeError: function requires exactly 1 argument; 2 given

S. -- Stéfane Fermigier, Tel: 06 63 04 12 77 (mobile). : the Linux / free software portal. "Amazon: we patent the dot in .com" From Fredrik Lundh" Message-ID: <00cf01bfbe93$a1653e20$34aab5d4@hagrid> Stefane Fermigier wrote: > Parser: xml.sax.drivers.drv_sgmlop, time: 0.001875, 0 bytes written. > and I had to change /usr/lib/python1.5/site-packages/xml/sax/drivers/drv_sgmlop.py > line 76 so that sgmlop works, maybe that was a mistake ?) I think you're seeing a version mismatch here; the sax driver fails to register callbacks, so you never see any data on the Python level. guess someone who knows both saxlib and sgmlop should dig into this... > parse as fast as possible if I were you, I'd go for native sgmlop.
the latest version is here: http://w1.132.telia.com/~u13208596/sgmlop.htm From larsga@garshol.priv.no Mon May 15 20:43:55 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 15 May 2000 21:43:55 +0200 Subject: [XML-SIG] SAX 2.0 alpha 2 Message-ID: I've now put together a new, and much improved, SAX 2.0 alpha distribution. We are getting much closer to the final form of this thing now, and I urge everyone to have a look at it. I'm also working as hard as I can on making drivers for the different parsers. Question: should I put this stuff in the XML-SIG CVS tree as I improve on it? It is now at the stage where this shouldn't cause too much disturbance, in fact, hopefully none at all. Or should I wait until it's done? I will post four followup emails to this one with more points for discussion. Again, I urge everyone to have a look at them and provide their feedback on them, whether to the group (preferred) or to me personally if you are uncomfortable with addressing the group. The new alpha is available from --Lars M. From larsga@garshol.priv.no Mon May 15 20:46:06 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 15 May 2000 21:46:06 +0200 Subject: [XML-SIG] SAX 2.0: Package structure Message-ID: Proposal for SAX 2.0 package structure: --------------------------------------- Goal: It should be possible to: - install SAX 2.0 over SAX 1.0 (thus deleting the old SAX 1.0 installation) and have SAX 1.0 applications continue to work unchanged with no problems Structure: - Everything should be in the xml.sax package, just as before - saxlib.py - This should contain the core base classes, just as before, but adding the SAX 2.0 ones to the SAX 1.0 ones and marking the deprecated ones as such. - saxutils.py - This should also contain the core utility classes, and extend the existing set with a new set for SAX 2.0. There should also be bug fixes and improvements to the existing classes. 
- saxexts.py - This should remain unchanged, since the make_parser function should still return SAX 1.0 parser drivers. - sax2exts.py - This should contain: - parser factories and the make_parser function - the org.xml.sax.ext package classes? - anything else? - drivers - This package should contain the SAX 1.0 drivers, mostly unchanged, but with updates where necessary. - drivers2 - This package should contain the new SAX 2.0 drivers. --Lars M. From larsga@garshol.priv.no Mon May 15 20:46:40 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 15 May 2000 21:46:40 +0200 Subject: [XML-SIG] SAX 2.0: Namespaces Message-ID: Namespace handling in SAX 2.0 ----------------------------- --- Processing modes: - Namespace processing. In this mode full namespace processing is on, and namespace declaration attributes are hidden. This is the default mode, and support for it is required. http://xml.org/sax/features/namespaces on http://xml.org/sax/features/namespace-prefixes off - Namespace processing with prefixes available. In this mode namespace processing is on, and namespace declaration attributes are not hidden. Support for this mode is not required. http://xml.org/sax/features/namespaces on http://xml.org/sax/features/namespace-prefixes on - XML 1.0 processing. In this mode there is no namespace processing. Support for this mode is not required. http://xml.org/sax/features/namespaces off http://xml.org/sax/features/namespace-prefixes on --- Name representation (What is set out here goes for both ContentHandler and Attributes.) - All element and attribute names consist of a namespace name (represented as a (uri, localname) tuple) and a qualified name (the raw name, represented by a string). Which parts of this information is available depends on the processing mode and the parser, but the API provides for all this information. 
- During namespace processing, what namespace declarations are in effect at any point in the document can be found from the startPrefixMapping/endPrefixMapping events on the ContentHandler interface.
- The namespace name:
  - Required during namespace processing, optional otherwise.
  - If the name is not connected to a namespace, the name tuple takes the form (None, localname).
  - If namespace processing is off and the parser is not providing namespace names, this value should be the same as the qualified name. (Alternatively, it could be None. My mind is not made up on this.)
- The qualified name:
  - Required when namespace-prefixes is on, optional otherwise. That is, optional during pure namespace processing mode, required with XML 1.0 processing mode and namespace processing with prefixes.
  - If the parser does not make qualified names available, the value is None. If it does make them available, the value is a string.

--- An example

- In namespace processing (without prefixes) mode, the following will be reported:
  - an element
    - name: ("http://www.greeting.com/ns/", "hello")
    - qname: "h:hello" or None
  - an attribute
    - name: (None, "id")
    - qname: "id" or None
  - an attribute
    - name: ("http://www.greeting.com/ns/", "person")
    - qname: "h:person" or None
- In namespace processing with prefixes mode, the following will be reported:
  - an element
    - name: ("http://www.greeting.com/ns/", "hello")
    - qname: "h:hello"
  - an attribute
    - name: (None, "id")
    - qname: "id"
  - an attribute
    - name: ("http://www.greeting.com/ns/", "person")
    - qname: "h:person"
  - an attribute
    - name: (None, None) # the attribute cannot be looked up by this name...
    - qname: "xmlns:h"
- In XML 1.0 processing mode, the following will be reported:
  - an element
    - name: ("http://www.greeting.com/ns/", "hello") or None
    - qname: "h:hello"
  - an attribute
    - name: (None, "id") or None
    - qname: "id"
  - an attribute
    - name: ("http://www.greeting.com/ns/", "person") or None
    - qname: "h:person"
  - an attribute
    - name: (None, None) or None
    - qname: "xmlns:h"

From larsga@garshol.priv.no Mon May 15 20:48:02 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 15 May 2000 21:48:02 +0200 Subject: [XML-SIG] SAX 2.0: Properties and features Message-ID:

Extra properties/features:
--------------------------

- feature: is-incremental, used to tell whether the parser supports the IncrementalParser interface or not.
- property: parser-name, the name of the parser
- property: parser-version, the version of the parser
- property: driver-version, the version of the driver

--Lars M.

From larsga@garshol.priv.no Mon May 15 20:48:10 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 15 May 2000 21:48:10 +0200 Subject: [XML-SIG] SAX 2.0: Main open issues Message-ID:

Main open issues:
-----------------

- Unicode handling:
  - Should parsers accept Unicode input? If so, what form does it take in the InputSource object?
- Extra properties/features:
  - Which are they?
  - What domain are they in? python.org? Something else? One alternative may be garshol.priv.no, which I own. python.org seems by far the best.
- Name representation:
  - Agree 100% on the representation of namespace-affected names.
- Bundling with Python 1.6:
  - Can we finish on time?
  - What should be included?
- What to do with the org.xml.sax.ext package:
  - Include it? How?
- Test suite:
  - I will make one. Should I make it available for download?
  - Should it be in the XML-SIG package?

--Lars M.
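
[Editorial sketch] The name representation proposed in the "SAX 2.0: Namespaces" message can be tried out with the xml.sax package found in present-day Python. This is only a minimal sketch under stated assumptions, not the alpha code discussed in the thread: it assumes a Python 3 interpreter with the standard expat driver, and the sample document, NameCollector and collect_names are hypothetical names made up for illustration (the example document from the original message did not survive in this archive).

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

NS = "http://www.greeting.com/ns/"

# Assumed sample document; the one from the original message is not in the archive.
SAMPLE = b'<h:hello xmlns:h="http://www.greeting.com/ns/" id="a" h:person="David"/>'


class NameCollector(ContentHandler):
    """Hypothetical handler that records the names the parser reports."""

    def __init__(self):
        super().__init__()
        self.prefixes = []    # (prefix, uri) pairs from startPrefixMapping
        self.elements = []    # ((uri, localname), qname) pairs
        self.attributes = []  # (uri, localname) tuples

    def startPrefixMapping(self, prefix, uri):
        # Namespace declarations arrive as events, not as attributes.
        self.prefixes.append((prefix, uri))

    def startElementNS(self, name, qname, attrs):
        # name is a (uri, localname) tuple; qname may be None when the
        # driver does not supply the raw name.
        self.elements.append((name, qname))
        self.attributes.extend(attrs.getNames())


def collect_names(data):
    """Parse bytes with namespace processing on and return the handler."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    handler = NameCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler


if __name__ == "__main__":
    result = collect_names(SAMPLE)
    print(result.prefixes)
    print(result.elements)
    print(result.attributes)
```

In this mode the xmlns:h declaration is not reported among the attributes; it shows up only through startPrefixMapping, and an attribute without a namespace gets a name of the form (None, localname), as in the proposal.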
From larsga@garshol.priv.no Mon May 15 20:57:26 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 15 May 2000 21:57:26 +0200 Subject: [XML-SIG] Pull Parsing In-Reply-To: <39199028.2CE477C8@prescod.net> References: <390F5CB1.FBE70A92@prescod.net> <011001bfb555$bdc64b00$34aab5d4@hagrid> <39199028.2CE477C8@prescod.net> Message-ID: * Paul Prescod | | Fredrik showed how to turn incremental push parsers into pull parsers. | Neat. Agreed. :-) | I must admit that I always presumed that the conversion would be | done at the SAX level so I couldn't think of a way to do it. The | incremental API makes all the difference (maybe we should propose an | incremental extension to SAX). Python SAX 1.0 already has this, and I've now put this into SAX 2.0 as an interface IncrementalParser (with feed, close and reset methods), which parsers _may_ support. | I wonder if a pull style interface is intrinsically easier for the | average Python programmer to get their heads around. I think it is. | * Pull parsers can always be turned into push parsers trivially I wouldn't say trivially, but they can be. My opinion on this is that this is definitely more intuitive for people to understand, but it's more awkward to use, since it forces you into doing your own dispatch of tokens to token handlers. (I call the things the parser returns structured tokens.) For the same reason it costs performance-wise: the parser knows what kind of token it has, and puts this information into the token object, from where the application must extract it again to do dispatch. With an event-based approach one jumps directly from the parser to the application, with no need for special dispatch on token type. I think the approach does have some merit, but not to the extent that I will personally sit down and implement a framework for it. If someone else will I'll be happy to provide input to the extent that I'm able to. --Lars M. 
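
[Editorial sketch] The pull-style loop sketched earlier in this thread, and the point above about the application doing its own dispatch, can be illustrated with the xml.dom.pulldom module in present-day Python's standard library. This is a minimal sketch assuming Python 3; it is not the API proposed in the thread (there is no puller.get() or expandTree() here), and the sample document and the summarize function are hypothetical names chosen for illustration.

```python
from xml.dom import pulldom

# Assumed sample document, made up for this sketch.
SAMPLE = "<root><Foo a='1'><child>hi</child></Foo><Bar/>text</root>"


def summarize(xml_text):
    """Pull events one at a time and dispatch on the event type by hand."""
    seen = []
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT:
            if node.tagName == "Foo":
                # Build the whole subtree under this element, then treat
                # it as an ordinary DOM node.
                events.expandNode(node)
                seen.append(("tree", node.tagName, len(node.childNodes)))
            else:
                seen.append(("start", node.tagName))
        elif event == pulldom.CHARACTERS:
            seen.append(("text", node.data))
        # Other event types (END_ELEMENT, START_DOCUMENT, ...) are ignored.
    return seen


if __name__ == "__main__":
    for item in summarize(SAMPLE):
        print(item)
```

The if/elif chain is exactly the hand-written dispatch described above: the parser already knew the token type, and the application has to test it again. In exchange, the application keeps control of the loop and only builds a tree for the elements it asks to expand.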
From akuchlin@mems-exchange.org Mon May 15 21:03:54 2000 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Mon, 15 May 2000 16:03:54 -0400 (EDT) Subject: [XML-SIG] SAX 2.0 alpha 2 In-Reply-To: References: Message-ID: <14624.22570.715019.1413@amarok.cnri.reston.va.us> Lars Marius Garshol writes: >Question: should I put this stuff in the XML-SIG CVS tree as I improve >on it? It is now at the stage where this shouldn't cause too much >disturbance, in fact, hopefully none at all. Or should I wait until >it's done? The whole point of having a CVS tree is to use it for distributing code that's in development, so I'd suggest erring on the side of checking it in sooner rather than later. Even if you think the code will break things, I'd recommend checking it in anyway, and sending a warning to the SIG first. --amk From larsga@garshol.priv.no Mon May 15 21:16:31 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 15 May 2000 22:16:31 +0200 Subject: [XML-SIG] SAX 2.0 alpha 2 In-Reply-To: <14624.22570.715019.1413@amarok.cnri.reston.va.us> References: <14624.22570.715019.1413@amarok.cnri.reston.va.us> Message-ID: * Lars Marius Garshol | | Question: should I put this stuff in the XML-SIG CVS tree as I improve | on it? It is now at the stage where this shouldn't cause too much | disturbance, in fact, hopefully none at all. Or should I wait until | it's done? * Andrew M. Kuchling | | The whole point of having a CVS tree is to use it for distributing | code that's in development, so I'd suggest erring on the side of | checking it in sooner rather than later. Even if you think the code | will break things, I'd recommend checking it in anyway, and sending a | warning to the SIG first. OK. Then it goes in there now. You can consider this the warning. :-) --Lars M. From gstein@lyra.org Tue May 16 01:53:55 2000 From: gstein@lyra.org (Greg Stein) Date: Mon, 15 May 2000 17:53:55 -0700 (PDT) Subject: [XML-SIG] fast dump/restore of an XML document? 
In-Reply-To: <200005150913.TAA05742@mbuna.arbhome.com.au> Message-ID: On Mon, 15 May 2000, Anthony Baxter wrote: > Once you've built an XML document in memory, what's the fastest > way to get it to disk, and then re-read it in again afterwards? > > cPickle is appallingly slow, and generates _huge_ output. > > toxml() and then parsing it in again is also quite slow. > I would have thought ESIS would be fast to read in, but nope. EsisBuilder > seems to take (based on a number of runs) something like 40-50 times as > long as utils.FileReader() > > so, what do other folks use? xml.utils.qp_xml parses XML into a very lightweight structure. Those should be easily pickle-able. With a little bit of work, I bet they could be marshalled. It definitely isn't a solution out of the box, but (IMO) it is a great head start on a Pythonic structure that is easily marshalled/pickled. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Tue May 16 04:49:27 2000 From: gstein@lyra.org (Greg Stein) Date: Mon, 15 May 2000 20:49:27 -0700 (PDT) Subject: [XML-SIG] SAX 2.0 alpha 2 In-Reply-To: <14624.22570.715019.1413@amarok.cnri.reston.va.us> Message-ID: On Mon, 15 May 2000, Andrew M. Kuchling wrote: > Lars Marius Garshol writes: > >Question: should I put this stuff in the XML-SIG CVS tree as I improve > >on it? It is now at the stage where this shouldn't cause too much > >disturbance, in fact, hopefully none at all. Or should I wait until > >it's done? > > The whole point of having a CVS tree is to use it for distributing > code that's in development, so I'd suggest erring on the side of > checking it in sooner rather than later. Even if you think the code > will break things, I'd recommend checking it in anyway, and sending a > warning to the SIG first. Yah. What Andrew said. Doubled. 
:-) Cheers, -g -- Greg Stein, http://www.lyra.org/ From Anthony Baxter Tue May 16 10:10:37 2000 From: Anthony Baxter (Anthony Baxter) Date: Tue, 16 May 2000 19:10:37 +1000 Subject: [XML-SIG] fast dump/restore of an XML document? In-Reply-To: Message from Greg Stein of "Mon, 15 May 2000 17:53:55 MST." Message-ID: <200005160910.TAA05002@mbuna.arbhome.com.au> >>> Greg Stein wrote > xml.utils.qp_xml parses XML into a very lightweight structure. Those > should be easily pickle-able. With a little bit of work, I bet they could > be marshalled. > > It definitely isn't a solution out of the box, but (IMO) it is a great > head start on a Pythonic structure that is easily marshalled/pickled. Hm - when I use qp_xml, I get File "/opt/python/lib/python1.5/site-packages/xml/utils/qp_xml.py", line 144, in parse p.Parse(input, 1) File "/opt/python/lib/python1.5/site-packages/xml/utils/qp_xml.py", line 88, in start name = attrs[i] KeyError: 0 putting print "attrlen", len(attrs), type(attrs), attrs before the offending line shows: attrlen 2 {'name': 'Svalbard', 'ccode': 'AAX'} a quick fix that wfm: --- qp_xml.py.dist Tue May 16 19:06:51 2000 +++ qp_xml.py Tue May 16 19:04:55 2000 @@ -83,9 +83,9 @@ work_attrs = [ ] # scan for namespace declarations (and xml:lang while we're at it) - for i in range(0, len(attrs), 2): - name = attrs[i] - value = attrs[i+1] + for i in attrs.keys(): + name = i + value = attrs[i] if name == 'xmlns': elem.ns_scope[''] = value with this fix, it happily parses my dataset. yay. Anthony From Anthony Baxter Tue May 16 11:19:54 2000 From: Anthony Baxter (Anthony Baxter) Date: Tue, 16 May 2000 20:19:54 +1000 Subject: [XML-SIG] fast dump/restore of an XML document? In-Reply-To: Message from Anthony Baxter of "Tue, 16 May 2000 19:10:37 +1000." <200005160910.TAA05002@mbuna.arbhome.com.au> Message-ID: <200005161019.UAA05545@mbuna.arbhome.com.au> >>> Anthony Baxter wrote > >>> Greg Stein wrote > > xml.utils.qp_xml parses XML into a very lightweight structure. 
Those
> > should be easily pickle-able. With a little bit of work, I bet they could
> > be marshalled.
> >
> > It definitely isn't a solution out of the box, but (IMO) it is a great
> > head start on a Pythonic structure that is easily marshalled/pickled.

Something like this works quite nicely:

    def qp_dump(obj, file):
        import cPickle, gzip
        f = gzip.open(file, 'wb', 4)   # binary mode, compression level 4
        cPickle.dump(obj, f)
        f.close()

    def qp_load(file):
        import cPickle, gzip
        f = gzip.open(file, 'rb')
        obj = cPickle.load(f)
        f.close()
        return obj                     # hand the unpickled object back

On my test sparc, a 2.5M XML dump gets dumped out as a 990K gzipped pickle.

Using the pyexpat full dom thing, it's 120s to build the DOM, 27s to dump
out the XML.

With qp_xml.Parser().parse(file) it reads it in 22s, and dumps a gzipped
pickle in 14s. It dumps a normal pickle in 13s, but it's an order of
magnitude larger.

Unfortunately the load time bites: 71s for the uncompressed pickle,
340s(!) for the gzipped one. blah.

(3 runs each test, storing on local disk)

(ps: as an aside: marshallable? how can you make an object marshallable?)

Anthony

From larsga@garshol.priv.no Tue May 16 11:31:41 2000
From: larsga@garshol.priv.no (Lars Marius Garshol)
Date: 16 May 2000 12:31:41 +0200
Subject: [XML-SIG] fast dump/restore of an XML document?
In-Reply-To: <200005161019.UAA05545@mbuna.arbhome.com.au>
References: <200005161019.UAA05545@mbuna.arbhome.com.au>
Message-ID:

* Anthony Baxter
|
| (ps: as an aside: marshallable? how can you make an object
| marshallable?)

Marshal can basically store/load anything except object instances.
So lists, dictionaries and so on are possible, but not objects.

Marshal is, I think, much faster than pickle, so it's certainly worth
a try.

What I've been wondering is why you want to do this, though. Why not
use some kind of database system with the possibility to export to and
import from XML?

--Lars M.

From Anthony Baxter Tue May 16 13:01:01 2000
From: Anthony Baxter (Anthony Baxter)
Date: Tue, 16 May 2000 22:01:01 +1000
Subject: [XML-SIG] fast dump/restore of an XML document?
In-Reply-To: Message from Lars Marius Garshol of "16 May 2000 12:31:41 +0200." Message-ID: <200005161201.WAA05895@mbuna.arbhome.com.au> >>> Lars Marius Garshol wrote > Marshal can basically store/load anything except object instances. > So lists, dictionaries and so on are possible, but not objects. That's what I was thinking - so to marshal, you'd have to convert it to a structure of lists and dictionaries. I was just curious if there was some secret protocol an object can use to marshal itself (there didn't seem to be in Python/marshal.c) > > Marshal is, I think, much faster than pickle, so it's certainly worth > a try. Only if it's not more expensive to convert it into a marshallable form. > What I've been wondering is why you want to do this, though. Why not > use some kind of database system with the possibility to export to and > import from XML? speed - loading up a document with multiple megabytes of XML and thousands of nodes is horribly slow. Unfortunately, pickling them is even slower :/ I think I'm just going to have to build my own wierd structures and wrap a DOM or something on top of them. Anthony From larsga@garshol.priv.no Tue May 16 13:08:31 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 16 May 2000 14:08:31 +0200 Subject: [XML-SIG] fast dump/restore of an XML document? In-Reply-To: <200005161201.WAA05895@mbuna.arbhome.com.au> References: <200005161201.WAA05895@mbuna.arbhome.com.au> Message-ID: * Lars Marius Garshol | | | Marshal is, I think, much faster than pickle, so it's certainly worth | a try. * Anthony Baxter | | Only if it's not more expensive to convert it into a marshallable form. Well, you can represent the XML document purely as lists, dictionaries and tuples. With some access functions it's likely to be at least as convenient as the DOM, although harder to learn for new comers. * Lars Marius Garshol | | What I've been wondering is why you want to do this, though. 
Why not | use some kind of database system with the possibility to export to | and import from XML? * Anthony Baxter | | speed - loading up a document with multiple megabytes of XML and | thousands of nodes is horribly slow. Unfortunately, pickling them is | even slower :/ | | I think I'm just going to have to build my own wierd structures and | wrap a DOM or something on top of them. You misunderstand the question. For some reason you seem to use XML as the underlying data model for your data, rather than to load from XML into some application-specific representation and dump from that representation and back out. Why not use a relational database for these data? Or ZODB? Or Metakit? Or application-specific classes with shelve and/or pickle? Why XML? --Lars M. From Anthony Baxter Tue May 16 13:42:50 2000 From: Anthony Baxter (Anthony Baxter) Date: Tue, 16 May 2000 22:42:50 +1000 Subject: [XML-SIG] fast dump/restore of an XML document? In-Reply-To: Message from Lars Marius Garshol of "16 May 2000 14:08:31 +0200." Message-ID: <200005161242.WAA00467@mbuna.arbhome.com.au> >>> Lars Marius Garshol wrote > Well, you can represent the XML document purely as lists, dictionaries > and tuples. With some access functions it's likely to be at least as > convenient as the DOM, although harder to learn for new comers. That's something like what I'm going to be doing, yes. > You misunderstand the question. For some reason you seem to use XML as > the underlying data model for your data, rather than to load from XML > into some application-specific representation and dump from that > representation and back out. Ah, I see what you mean - convenience for data interchange. I thought it would be a nice simple format to store things in, and it would allow me to switch in and out various tools. 
Anthony From uogbuji@fourthought.com Tue May 16 16:41:30 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 16 May 2000 09:41:30 -0600 Subject: [XML-SIG] SAX 2.0: Package structure In-Reply-To: Message from Lars Marius Garshol of "15 May 2000 21:46:06 +0200." Message-ID: <200005161541.JAA03291@localhost.localdomain> > Goal: It should be possible to: > > - install SAX 2.0 over SAX 1.0 (thus deleting the old SAX 1.0 > installation) and have SAX 1.0 applications continue > to work unchanged with no problems Useful, but I wouldn't make it a do-or-die goal. > Structure: > > - Everything should be in the xml.sax package, just as before If nothing is done to make it hard to rename this to "xml.sax2" if necessary, I'd think it would soften any backward-compatability problems. The package structure looks great. -- Uche Ogbuji Senior Software Engineer uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-9036, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From uogbuji@fourthought.com Tue May 16 16:48:35 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 16 May 2000 09:48:35 -0600 Subject: [XML-SIG] SAX 2.0: Namespaces In-Reply-To: Message from Lars Marius Garshol of "15 May 2000 21:46:40 +0200." Message-ID: <200005161548.JAA03315@localhost.localdomain> > --- Name representation > > (What is set out here goes for both ContentHandler and Attributes.) > > - All element and attribute names consist of a namespace name > (represented as a (uri, localname) tuple) and a qualified name > (the raw name, represented by a string). Which parts of this > information is available depends on the processing mode and the > parser, but the API provides for all this information. 
> > - During namespace processing, what namespace declarations are in > effect at any point in the document can be found from the > startPrefixMapping/endPrefixMapping events on the ContentHandler > interface. > > - The namespace name: > > - Required during namespace processing, optional otherwise. > > - If the name is not connected to a namespace, the name tuple > takes the form (None, localname). > > - Namespace processing is off and the parser is not providing > namespace names this this value should be the same as the > qualified name. (Alternatively, it could be None. My mind is not > made up on this.) I'd say "None" for faster checking (in the very rare cases where checking at this level makes sense). Otherwise, more good stuff. -- Uche Ogbuji Senior Software Engineer uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-9036, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From uogbuji@fourthought.com Tue May 16 16:56:41 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 16 May 2000 09:56:41 -0600 Subject: [XML-SIG] SAX 2.0: Main open issues In-Reply-To: Message from Lars Marius Garshol of "15 May 2000 21:48:10 +0200." Message-ID: <200005161556.JAA03337@localhost.localdomain> > - Unicode handling: > > - Should parsers accept Unicode input? If so, what form does it > take > in the InputSource object? I think they should definitely support Unicode input. I'm not clear on what you mean by "form", but I think they should be either UTF-8 or the new u"foo" strings from Python 1.6. Of course, there are people who know better than I do about Unicode so take my comments lightly. > - Extra properties/features: > > - Which are they? Until the whole Unicode story is straightened out, do we make the answers to your above questions properties? > - What domain are they in? python.org? Something else? 
One > alternative > may be garshol.priv.no, which I own. python.org seems by far the > best. I'd say python.org, if we can get it. If not, we can ask David Megginson about python.sax.org. If not, your domain would do. > - Name representation: > > - Agree 100% on the representation of namespace-affected names. I agree 100% > - Bundling with Python 1.6: > > - Can we finish on time? Ha! And of course, would GvR accept somethign new in the second beta? Is there no feature-freeze? > - What should be included? That's the major sticking point. There is little agreement even within this group. The only thing I've heard everyone champion for 1.6 is EasyDOM/EasySAX. > - What to do with the org.xml.sax.ext package: > > - Include it? How? > > - Test suite: > > - I will make one. Should I make it available for download? Yes, I think. > - Should it be in the XML-SIG package? I'm curious about the answers to this question as well, since we have good-sized test-suites for the 4Suite which we've never released. -- Uche Ogbuji Senior Software Engineer uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-9036, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From akuchlin@mems-exchange.org Tue May 16 17:01:15 2000 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Tue, 16 May 2000 12:01:15 -0400 (EDT) Subject: [XML-SIG] SAX 2.0: Main open issues In-Reply-To: <200005161556.JAA03337@localhost.localdomain> References: <200005161556.JAA03337@localhost.localdomain> Message-ID: <14625.28875.578969.19146@amarok.cnri.reston.va.us> Uche Ogbuji writes: >> - Should it be in the XML-SIG package? >I'm curious about the answers to this question as well, since we have >good-sized test-suites for the 4Suite which we've never released. Tests are useful to people hacking on the code, too, since they can re-run them to verify that their changes haven't broken anything. 
Unless the test suites are incredibly huge (several megabytes), or unless they contain proprietary information (old customer documents that tripped bugs), I'd say check them in. Test suites wouldn't actually be installed into site-packages, though. -- A.M. Kuchling http://starship.python.net/crew/amk/ "Hey. Thanks for listening. I suppose you must think I'm crazy." "No. I don't. Maybe I ought to. But I don't. You hear a lot of weird stories behind a bar." -- Brant Tucker and the bartender, in SANDMAN #56: "World's End" From gstein@lyra.org Wed May 17 05:24:18 2000 From: gstein@lyra.org (Greg Stein) Date: Tue, 16 May 2000 21:24:18 -0700 (PDT) Subject: [XML-SIG] fast dump/restore of an XML document? In-Reply-To: <200005160910.TAA05002@mbuna.arbhome.com.au> Message-ID: On Tue, 16 May 2000, Anthony Baxter wrote: > >>> Greg Stein wrote > > xml.utils.qp_xml parses XML into a very lightweight structure. Those > > should be easily pickle-able. With a little bit of work, I bet they could > > be marshalled. > > > > It definitely isn't a solution out of the box, but (IMO) it is a great > > head start on a Pythonic structure that is easily marshalled/pickled. > > Hm - when I use qp_xml, I get > > File "/opt/python/lib/python1.5/site-packages/xml/utils/qp_xml.py", line 144, in parse > p.Parse(input, 1) > File "/opt/python/lib/python1.5/site-packages/xml/utils/qp_xml.py", line 88, in start > name = attrs[i] > KeyError: 0 Oh... this is caused by the API change to pyexpat. How damn annoying. All righty, then. I just checked in some changes to qp_xml to fix this problem, among others. Please grab a new copy from anon-cvs, or download it from: http://www.lyra.org/cgi-bin/viewcvs.cgi/xml/xml/utils/qp_xml.py Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Wed May 17 05:29:37 2000 From: gstein@lyra.org (Greg Stein) Date: Tue, 16 May 2000 21:29:37 -0700 (PDT) Subject: [XML-SIG] fast dump/restore of an XML document? 
In-Reply-To: <200005161201.WAA05895@mbuna.arbhome.com.au>
Message-ID:

On Tue, 16 May 2000, Anthony Baxter wrote:
> >>> Lars Marius Garshol wrote
> > Marshal can basically store/load anything except object instances.
> > So lists, dictionaries and so on are possible, but not objects.
>
> That's what I was thinking - so to marshal, you'd have to convert it
> to a structure of lists and dictionaries. I was just curious if there
> was some secret protocol an object can use to marshal itself (there
> didn't seem to be in Python/marshal.c)

Yup... this was my thought. I think qp_xml gives you a good basis for
taking the pyexpat callbacks and constructing list/dict structures.

> >
> > Marshal is, I think, much faster than pickle, so it's certainly worth
> > a try.
>
> Only if it's not more expensive to convert it into a marshallable form.

The conversion would probably take a while on a large data set -- you'd be
iterating over every XML node. I'd recommend gutting qp_xml.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/

From mangold@hft.ei.tum.de Wed May 17 07:42:44 2000
From: mangold@hft.ei.tum.de (Tobias Mangold)
Date: Wed, 17 May 2000 08:42:44 +0200
Subject: [XML-SIG] Build Problem for PyXML 0.5.4 on HPUX
Message-ID: <39223F64.ACED99BD@hft.ei.tum.de>

Hi,

the build problem I encountered is not a big deal, but it breaks the
build run.

The problem: Shared libs on HPUX have the file extension '.sl', and
not '.so' like on Linux. Since the '.so' extension is hardcoded into
the setup.py script, I had to adapt it manually.

Have a nice day ...

Toby

--
T. Mangold, Lehrstuhl fuer Hochfrequenztechnik
Technische Universitaet Muenchen, Arcisstr.
21, D-80333 Muenchen, Germany TEL: ++49-89-289-23371 / FAX: ++49-89-289-23365 Email: mangold@hft.ei.tum.de From larsga@garshol.priv.no Wed May 17 10:47:10 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 17 May 2000 11:47:10 +0200 Subject: [XML-SIG] SAX 2.0: Main open issues In-Reply-To: <14625.28875.578969.19146@amarok.cnri.reston.va.us> References: <200005161556.JAA03337@localhost.localdomain> <14625.28875.578969.19146@amarok.cnri.reston.va.us> Message-ID: * Andrew M. Kuchling | | Tests are useful to people hacking on the code, too, since they can | re-run them to verify that their changes haven't broken anything. | Unless the test suites are incredibly huge (several megabytes), or | unless they contain proprietary information (old customer documents | that tripped bugs), I'd say check them in. Then I will do that once I have something that's approaching stability. The SAX tests don't need to be so big, since we're testing the drivers rather than the parsers. I also have some test suites (several, in fact) for xmlproc, and they are command-line interface: 52k using James Clark's stuff: 175k API tests: 45k using my own stuff: 2030k What say ye? Should this, or parts of it, go in? I'll make it available separately, anyway. There is and will be nothing proprietary in any of these. | Test suites wouldn't actually be installed into site-packages, | though. Sounds good to me. --Lars M. From Anthony Baxter Wed May 17 15:35:51 2000 From: Anthony Baxter (Anthony Baxter) Date: Thu, 18 May 2000 00:35:51 +1000 Subject: [XML-SIG] fast dump/restore of an XML document? In-Reply-To: Message from Greg Stein of "Tue, 16 May 2000 21:29:37 MST." Message-ID: <200005171435.AAA08730@mbuna.arbhome.com.au> >>> Greg Stein wrote > Yup... this was my thought. I think qp_xml gives you a good basis for > taking the pyexpat callbacks and constructing list/dict structures. > The conversion would probably take a while on a large data set -- you'd be > iterating over every XML node. 
I'd recommend gutting qp_xml.

Ok, I've done this - is this of interest at all? I've now got something
that's just a big list containing lists and dictionaries, and I'm
wrapping a class around it to give a lot of DOM-like access to it (while
still allowing the under-the-hood fast dump/restore.)

Some timings:

qp is Greg's latest qp_xml. qph is my hacked up version.
Test was my standard 2.5M XML file

    qp  parse         done in 26.0s
    qp  pickle.dump   done in 20.2s
    qp  pickle.load   done in 88.1s
    qph parse         done in 21.0s
    qph pickle.dump   done in 12.6s
    qph pickle.load   done in 89.5s
    qph marshal.dump  done in 5.13s
    qph marshal.load  done in 6.92s

Some file sizes:

    qp  pickle file           was 8.9M
    qph pickle file           was 6.6M
    qph marshal file          was 6.5M
    qp  gzipped pickle file   was 990K
    qph gzipped pickle file   was 940K
    qph gzipped marshal file  was 590K

(but marshal can't write directly to a gzip.open() :( )

Memory consumption was also about 60% of the size of qp_xml's structure
(and qp_xml's is about 40% of the size of the full xml size).

So, is it worth pursuing this as something suitable for release, or
just for my own internal use? Is there any interest in it?

Anthony

From fdrake@acm.org Wed May 17 15:35:31 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Wed, 17 May 2000 07:35:31 -0700 (PDT)
Subject: [XML-SIG] Re: [XML-checkins] CVS: xml/xml/dom javadom.py
In-Reply-To: <200005171409.HAA20402@nebula.lyra.org>
Message-ID:

On Wed, 17 May 2000, Lars Marius Garshol wrote:
> # - more 4DOM-like interface? support _get_* ?

Wasn't this the interface we agreed to a few months ago? Since 4DOM
will be the standard Python DOM implementation, this seems like the
interface to support.

-Fred

--
Fred L. Drake, Jr.

From akuchlin@mems-exchange.org Wed May 17 16:53:40 2000
From: akuchlin@mems-exchange.org (Andrew M.
Kuchling) Date: Wed, 17 May 2000 11:53:40 -0400 (EDT) Subject: [XML-SIG] SAX 2.0: Main open issues In-Reply-To: References: <200005161556.JAA03337@localhost.localdomain> <14625.28875.578969.19146@amarok.cnri.reston.va.us> Message-ID: <14626.49284.368426.925284@amarok.cnri.reston.va.us> Lars Marius Garshol writes: >I also have some test suites (several, in fact) for xmlproc, and they >are ... >What say ye? Should this, or parts of it, go in? I'll make it >available separately, anyway. Check in what you like; I have no objection to all of the test material, including the 2030K collection. It's a good idea to put them in a directory of their own (xml/test/xmlproc, maybe) to make it easy to exclude them from distributions. --amk From RTowster@emerging.com Wed May 17 16:56:31 2000 From: RTowster@emerging.com (Robert Towster) Date: Wed, 17 May 2000 10:56:31 -0500 Subject: [XML-SIG] http://www.python.org/doc/howto/xml/node4.html Message-ID: Hi I got a problem downloading =) Windows users should get the precompiled version at XXX; looks like someone forgot to replace the placeholders =) Robert Towster Technology Services e m e r g i n g | Houston 713.544.1391 voice 713.544.1230 fax http://www.emerging.com From larsga@garshol.priv.no Wed May 17 17:30:16 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 17 May 2000 18:30:16 +0200 Subject: [XML-SIG] Re: [XML-checkins] CVS: xml/xml/dom javadom.py In-Reply-To: References: Message-ID: * Lars Marius Garshol | | # - more 4DOM-like interface? support _get_* ? * Fred L. Drake | | Wasn't this the interface we agreed to a few months ago? Might be. I frankly can't remember. | Since 4DOM will be the standard Python DOM implementation, | this seems like the interface to support. Then I'll update javadom. I've already generalized the support for Java DOM implementations and added support for Xerces, so the thing is due for an update anyway. --Lars M. 
From larsga@garshol.priv.no Wed May 17 17:34:14 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 17 May 2000 18:34:14 +0200 Subject: [XML-SIG] SAX 2.0: Main open issues In-Reply-To: <14626.49284.368426.925284@amarok.cnri.reston.va.us> References: <200005161556.JAA03337@localhost.localdomain> <14625.28875.578969.19146@amarok.cnri.reston.va.us> <14626.49284.368426.925284@amarok.cnri.reston.va.us> Message-ID: * Lars Marius Garshol | | I also have some test suites (several, in fact) for xmlproc, and they | are | ... | What say ye? Should this, or parts of it, go in? I'll make it | available separately, anyway. * Andrew M. Kuchling | | Check in what you like; I have no objection to all of the test | material, including the 2030K collection. Then I will put it in. I'll have to go over it and ensure that it is ready for public consumption first, so it probably won't happen until the weekend. | It's a good idea to put them in a directory of their own | (xml/test/xmlproc, maybe) to make it easy to exclude them from | distributions. I agree. I'll put the SAX tests in xml/test/sax also. --Lars M. From larsga@garshol.priv.no Wed May 17 17:36:02 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 17 May 2000 18:36:02 +0200 Subject: [XML-SIG] SAX 2.0: Package structure In-Reply-To: <200005161541.JAA03291@localhost.localdomain> References: <200005161541.JAA03291@localhost.localdomain> Message-ID: * Lars Marius Garshol | | Goal: It should be possible to: | | - install SAX 2.0 over SAX 1.0 (thus deleting the old SAX 1.0 | installation) and have SAX 1.0 applications continue | to work unchanged with no problems * Uche Ogbuji | | Useful, but I wouldn't make it a do-or-die goal. Agreed. However, as near as I can tell we have achieved it already. 
* Lars Marius Garshol | | Structure: | | - Everything should be in the xml.sax package, just as before * Uche Ogbuji | | If nothing is done to make it hard to rename this to "xml.sax2" if | necessary, I'd think it would soften any backward-compatability | problems. I can't think of anything that would make it hard, so I think we're safe for now. Good to see your comments! --Lars M. From fdrake@acm.org Wed May 17 17:37:34 2000 From: fdrake@acm.org (Fred L. Drake) Date: Wed, 17 May 2000 09:37:34 -0700 (PDT) Subject: [XML-SIG] Re: [XML-checkins] CVS: xml/xml/dom javadom.py In-Reply-To: Message-ID: On 17 May 2000, Lars Marius Garshol wrote: > Then I'll update javadom. I've already generalized the support for > Java DOM implementations and added support for Xerces, so the thing is > due for an update anyway. Great news! I'll need to check out the Java support on this nifty new laptop, then perhaps I can play with some of these things a little more. (It's amazing; this little laptop costs less than my old home desktop and packs a bigger whallop, and the networking was easier to configure! I'm staying home today to play with this new toy. ;) -Fred -- Fred L. Drake, Jr. From larsga@garshol.priv.no Wed May 17 17:40:46 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 17 May 2000 18:40:46 +0200 Subject: [XML-SIG] SAX 2.0: Namespaces In-Reply-To: <200005161548.JAA03315@localhost.localdomain> References: <200005161548.JAA03315@localhost.localdomain> Message-ID: * Lars Marius Garshol | | - The namespace name: | | [...] | | - Namespace processing is off and the parser is not providing | namespace names this this value should be the same as the | qualified name. (Alternatively, it could be None. My mind is not | made up on this.) * Uche Ogbuji | | I'd say "None" for faster checking (in the very rare cases where | checking at this level makes sense). I agree that this option seems tempting, and it's certainly more in keeping with what Java SAX does. 
However, this means that there is no place where a general SAX
utility can know, a priori, that there will be a useful name. If we
choose this alternative, the namespace name may be None in some cases
and similarly the qualified name may be None in some cases (all
depending on the setting of various features).

I'm very uneasy about doing that since it will obviously complicate
writing general utilities, which are part of what SAX (IMHO) is all
about.

I'm not really happy with either alternative, and in general I think
this namespace thing has proved to be a horror to implement. To me it
certainly has made XML a lot less fun. There's no way out of it now,
though.

--Lars M.

From larsga@garshol.priv.no Wed May 17 17:50:24 2000
From: larsga@garshol.priv.no (Lars Marius Garshol)
Date: 17 May 2000 18:50:24 +0200
Subject: [XML-SIG] SAX 2.0: Main open issues
In-Reply-To: <200005161556.JAA03337@localhost.localdomain>
References: <200005161556.JAA03337@localhost.localdomain>
Message-ID:

* Lars Marius Garshol
|
| - Should parsers accept Unicode input? If so, what form does it
|   take in the InputSource object?

* Uche Ogbuji
|
| I think they should definitely support Unicode input.

Agreed; that was kind of a rhetorical question.

| I'm not clear on what you mean by "form", but I think they should be
| either UTF-8 or the new u"foo" strings from Python 1.6.

For IncrementalParser.feed this does indeed make sense and I think it
should be that way. However, in InputSource we need a stream. Python
1.6a2 has Unicode streams (wrappers around file-like objects) in the
codecs module, although they aren't usable yet.

What I'm wondering is basically:

 - would pyexpat accept Unicode string objects as arguments to
   parser.Parse? (xmllib does (on feed), and I would think xmlproc
   also does, although I haven't tried yet)

 - if we support codecs.StreamReader as character streams, how do we
   handle this in JPython?

* Lars Marius Garshol
|
| - Extra properties/features:
|
|   - Which are they?
* Uche Ogbuji | | Until the whole Unicode story is straightened out, do we make the | answers to your above questions properties? No. See the mail about extra features and properties. * Lars Marius Garshol | | - What domain are they in? python.org? Something else? One | alternative may be garshol.priv.no, which I own. python.org seems | by far the best. * Uche Ogbuji | | I'd say python.org, if we can get it. If not, we can ask David | Megginson about python.sax.org. If not, your domain would do. This was exactly what I was thinking. I use my domain now, in a way that makes it easy to change painlessly. Who should I contact to ask for delegation of part of the python.org namespace? * Lars Marius Garshol | | - Name representation: | | - Agree 100% on the representation of namespace-affected names. * Uche Ogbuji | | I agree 100% :-) * Lars Marius Garshol | | - Bundling with Python 1.6: | | - Can we finish on time? * Uche Ogbuji | | Ha! And of course, would GvR accept somethign new in the second | beta? Is there no feature-freeze? I have no idea. | [What should be included?] | | That's the major sticking point. There is little agreement even | within this group. The only thing I've heard everyone champion for | 1.6 is EasyDOM/EasySAX. In that case I can relax. :-) That seems difficult to achieve, though. Paul is busy (I think), and I'm too bound up with SAX 2.0 and the book to be able to do it. --Lars M. From fdrake@acm.org Wed May 17 18:09:26 2000 From: fdrake@acm.org (Fred L. Drake) Date: Wed, 17 May 2000 10:09:26 -0700 (PDT) Subject: [XML-SIG] SAX 2.0: Main open issues In-Reply-To: Message-ID: On 17 May 2000, Lars Marius Garshol wrote: > * Uche Ogbuji > | I'd say python.org, if we can get it. If not, we can ask David > | Megginson about python.sax.org. If not, your domain would do. > > This was exactly what I was thinking. I use my domain now, in a way > that makes it easy to change painlessly. 
Who should I contact to ask > for delegation of part of the python.org namespace? The specific properties and identifier suffixes should be arrived at through discussion here. Andrew and I can beat each other up over the prefix. ;) I'll be so bold as to propose we use the same namespace we're using for Python-related system identifiers (remember XBEL?). We currently have http://www.python.org/topics/xml/dtds/...; perhaps we should make SAX property identifiers live at http://www.python.org/topics/xml/sax/properties/...? Properties specific to individual parsers should live outside the python.org space. (Andrew, any objections?) -Fred -- Fred L. Drake, Jr. From catpro@manx.dreamhaven.net Thu May 18 05:44:01 2000 From: catpro@manx.dreamhaven.net (Joshua D. Boyd) Date: Thu, 18 May 2000 00:44:01 -0400 (EDT) Subject: [XML-SIG] [ot] Documentation In-Reply-To: <00cf01bfbe93$a1653e20$34aab5d4@hagrid> Message-ID: Sorry to bring up the latex documentation again, but anyway... I'm trying to print the xml-howto.tex file (latex xml-howto.tex; dvips xml-howto.dvi), but I'm having trouble with the print running off the top of the page. I don't have the same problem with the documents I've written myself using the article and letter classes, but I did have to do some tweeking to stop those from having the same problem. I tried applying the same tweek to the xml-howto.tex file, and it started looking like crap. It's not like my printer is unusual. It has a fairly normal margin requirement of .25 in. Oh, and postscript files generated from sources other than latex have never had any problem, so I believe that this is just a latex oddity. I suspect that my tweek to make my own files work was not the right way to fix the problem. Oh, this is on a Redhat 6.1 box, btw. 
-- Joshua Boyd http://catpro.dragonfire.net/joshua From uche.ogbuji@fourthought.com Thu May 18 16:49:17 2000 From: uche.ogbuji@fourthought.com (uche.ogbuji@fourthought.com) Date: Thu, 18 May 2000 09:49:17 -0600 Subject: [XML-SIG] 4Suite betas Message-ID: <200005181549.JAA10073@localhost.localdomain> As I posted to 4suite@lists.fourthought.com: We've been trying to get a final release out for a couple of days, but we've been held back by Windows build problems. We'd been sitting on the latest betas because we expected to have the releases ready by now. I just put them up at ftp://fourthought.com/pub/4Suite/ Note that as of the next release, we'll be simplifying the FTP directory structure. All 4Suite files will be available at the above location. BTW, does anyone know of a Windows version of FLEX more recent than 2.5.2? Or at least one that recognizes the -P option to change the "yy" prefixes for global variable and function names in generated scanners? -- Uche Ogbuji Senior Software Engineer uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-9036, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From mclay@nist.gov Thu May 18 20:55:25 2000 From: mclay@nist.gov (Michael McLay) Date: Thu, 18 May 2000 15:55:25 -0400 (EDT) Subject: [XML-SIG] 4Suite betas In-Reply-To: <200005181549.JAA10073@localhost.localdomain> References: <200005181549.JAA10073@localhost.localdomain> Message-ID: <14628.19117.804546.794074@fermi.eeel.nist.gov> uche.ogbuji@fourthought.com writes: > As I posted to 4suite@lists.fourthought.com: > > We've been trying to get a final release out for a couple of days, but > we've been held back by Windows build problems. We'd been sitting on > the latest betas because we expected to have the releases ready by now. 
> I just put them up at > > ftp://fourthought.com/pub/4Suite/ > > Note that as of the next release, we'll be simplifying the FTP directory > structure. All 4Suite files will be available at the above location. > > BTW, does anyone know of a Windows version of FLEX more recent than > 2.5.2? Or at least one that recognizes the -P option to change the "yy" > prefixes for global variable and function names in generated scanners? The Bison/Flex Wizard at http://www.fg-soup.com/products.html has version 2.5.4 of FLEX.exe. It includes the -P option. From wayne@idini.com Thu May 18 20:21:14 2000 From: wayne@idini.com (Wayne) Date: Thu, 18 May 2000 12:21:14 -0700 Subject: [XML-SIG] confirm 151262 Message-ID: <005601bfc0fe$3bca40c0$2d00a8c0@idini.com> This is a multi-part message in MIME format. ------=_NextPart_000_0053_01BFC0C3.8F4235E0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable ------=_NextPart_000_0053_01BFC0C3.8F4235E0 Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable
------=_NextPart_000_0053_01BFC0C3.8F4235E0-- From mclay@nist.gov Sat May 20 02:24:42 2000 From: mclay@nist.gov (Michael McLay) Date: Fri, 19 May 2000 21:24:42 -0400 (EDT) Subject: [XML-SIG] XML Schema validator? Message-ID: <14629.59739.66134.318367@fermi.eeel.nist.gov> I'm looking for a validator for XML Schema instance files. The XML Schema validator at http://www.ltg.ed.ac.uk/~ht/xsv-status.html is close to what I need, but it has a GPLed copyright on the validator and a non-commercial restriction on the PyXML module that it is dependant on. The ideal mechanism for checking XML files against an XML Schema would be built on top of the standard Python 1.6 XML package and it would have a Python-like copyright. From ht@cogsci.ed.ac.uk Mon May 22 10:31:43 2000 From: ht@cogsci.ed.ac.uk (Henry S. Thompson) Date: 22 May 2000 10:31:43 +0100 Subject: [XML-SIG] XML Schema validator? In-Reply-To: Michael McLay's message of "Fri, 19 May 2000 21:24:42 -0400 (EDT)" References: <14629.59739.66134.318367@fermi.eeel.nist.gov> Message-ID: Michael McLay writes: > I'm looking for a validator for XML Schema instance files. > The XML Schema validator at http://www.ltg.ed.ac.uk/~ht/xsv-status.html > is close to what I need, but it has a GPLed copyright on the validator > and a non-commercial restriction on the PyXML module that it is > dependant on. The ideal mechanism for checking XML files against an > XML Schema would be built on top of the standard Python 1.6 XML > package and it would have a Python-like copyright. Our PyXML module will shortly be re-released for Python1.6 with a GPL license, and a new name so as not to conflict with the existing PyXML module. We will happily discuss with you less-restrictive licensing terms for [whatever PyXML becomes], and we would also be happy if someone ported XSV to run on top of some other validating Python XML interface -- the hooks are there (in layer.py). ht -- Henry S. 
Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From masjober@ra.abo.fi Mon May 22 13:27:55 2000 From: masjober@ra.abo.fi (Mats Sjoberg IB) Date: Mon, 22 May 2000 15:27:55 +0300 Subject: [XML-SIG] The PyXML distribution Message-ID: <200005221227.PAA05511@rafael.ABO.RA> Is there any possibility to install this package for one user only? I do not have root access so I cannot install the package globally. Mats Sjöberg (mats.sjoberg@abo.fi) Turku Centre for Computer Science Finland From larsga@garshol.priv.no Mon May 22 13:48:42 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 22 May 2000 14:48:42 +0200 Subject: [XML-SIG] The PyXML distribution In-Reply-To: <200005221227.PAA05511@rafael.ABO.RA> References: <200005221227.PAA05511@rafael.ABO.RA> Message-ID: * Mats Sjoberg | | Is there any possibility to install this package for one user only? | I do not have root access so I cannot install the package globally. What platform are you on? And what installer are you using? --Lars M. From mikl@club-internet.fr Wed May 24 12:45:02 2000 From: mikl@club-internet.fr (mikl@club-internet.fr) Date: 24 May 2000 13:45:02 +0200 Subject: [XML-SIG] XML status, historical data and hierarchy Message-ID: <87d7mcuq81.fsf@western.ird.idealx.com> Hi, I have a question on the way I could structure my XML document to solve this problem: I need to manipulate actions. Actions are created as prevision, with a previsionnal duration. If someone is intereted in achieving this action, he can propose to be responsible for this action. He can even propose a different estimated duration. The status is changing from prevision to proposition. If this action is accepted, the status change to currently being processed, and the estimated duration can be renegociated. 
Then, the personn can finish his duration with one or several achieved actions, each with a definitive duration. This is the simplest case, because the personn in charge of this action can divide it into several other previsionnal actions and ask for a volunteer. The same process can be deeply nested. My problem is that I hardly can figure out how you would model this thing in XML. Thank you in advance for your help. This problem is not typically Python related but, this group is usually very helpful... -- Mickaël From tpassin@home.com Wed May 24 13:05:44 2000 From: tpassin@home.com (tpassin@home.com) Date: Wed, 24 May 2000 08:05:44 -0400 Subject: [XML-SIG] XML status, historical data and hierarchy References: <87d7mcuq81.fsf@western.ird.idealx.com> Message-ID: <000b01bfc578$63c893a0$7cac1218@reston1.va.home.com> asked > > Hi, > > I have a question on the way I could structure my XML document to solve this > problem: > > I need to manipulate actions. Actions are created as prevision, with a > previsionnal duration. > If someone is intereted in achieving this action, he can propose to be > responsible for this action. He can even propose a different estimated > duration. The status is changing from prevision to proposition. > > If this action is accepted, the status change to currently being processed, > and the estimated duration can be renegociated. > > Then, the personn can finish his duration with one or several achieved > actions, each with a definitive duration. > > This is the simplest case, because the personn in charge of this action can > divide it into several other previsionnal actions and ask for a volunteer. The > same process can be deeply nested. > > My problem is that I hardly can figure out how you would model this thing in > XML. > The problem is not in the XML, but how you model this as an abstract data model. Once you know that, you can translate it into XML. 
From your description, it sounds like the model would be recursive, each
project action possibly containing other project actions subject to certain
constraints. Get your data model designed, then the XML will probably be
apparent.

Tom Passin

From andy@reportlab.com Wed May 24 13:05:06 2000
From: andy@reportlab.com (Andy Robinson)
Date: Wed, 24 May 2000 13:05:06 +0100
Subject: [XML-SIG] XML status, historical data and hierarchy
In-Reply-To: <87d7mcuq81.fsf@western.ird.idealx.com>
Message-ID:

> My problem is that I hardly can figure out how you would model
> this thing in
> XML.
>

I'd start with an easier problem, and try to model it with Python objects.
Then when you have a model you like, start to think of coding it in XML.
The w3c hypes XML for lots of things, but not yet as a RAD tool :-)

- Andy Robinson

From bjorn@roguewave.com Wed May 24 17:11:30 2000
From: bjorn@roguewave.com (Bjorn Pettersen)
Date: Wed, 24 May 2000 10:11:30 -0600
Subject: [XML-SIG] speed question re DOM parsing
Message-ID: <392BFF32.5C0AECE4@roguewave.com>

I'm just starting to work with XML, so be gentle

The problem is that I'm reading in a 280K xml file using the sample code
from the XML howto:

    def getXmlDomDocument(name):
        p = saxexts.make_parser()
        dh = SaxBuilder()
        p.setDocumentHandler(dh)
        p.parseFile(open(name))
        p.close()
        doc = dh.document
        xml.dom.utils.strip_whitespace(doc)
        return doc

it takes about five seconds to read and parse the file...

Is there a better way to read the file (or is there updated code that is
faster)?
-- bjorn From gstein@lyra.org Wed May 24 22:01:27 2000 From: gstein@lyra.org (Greg Stein) Date: Wed, 24 May 2000 14:01:27 -0700 (PDT) Subject: [XML-SIG] speed question re DOM parsing In-Reply-To: <392BFF32.5C0AECE4@roguewave.com> Message-ID: On Wed, 24 May 2000, Bjorn Pettersen wrote: > I'm just starting to work with XML, so be gentle > > The problem is that I'm reading in a 280K xml file using the sample code > from the XML howto: > > def getXmlDomDocument(name): > p = saxexts.make_parser() > dh = SaxBuilder() > p.setDocumentHandler(dh) > p.parseFile(open(name)) > p.close() > doc = dh.document > xml.dom.utils.strip_whitespace(doc) > return doc > > it takes about five seconds to read and parse the file... > > Is there a better way to read the file (or is there updated code that is > faster)? If you want a DOM for the output, then no... you'll have to deal with the speed. If you have simple requirements for the Python representation of the XML, then take a look at xml.utils.qp_xml. Cheers, -g -- Greg Stein, http://www.lyra.org/ From bjorn@roguewave.com Thu May 25 01:49:39 2000 From: bjorn@roguewave.com (Bjorn Pettersen) Date: Wed, 24 May 2000 18:49:39 -0600 Subject: [XML-SIG] speed question re DOM parsing References: Message-ID: <392C78A3.C4176635@roguewave.com> Greg Stein wrote: > > On Wed, 24 May 2000, Bjorn Pettersen wrote: > > I'm just starting to work with XML, so be gentle > > > > The problem is that I'm reading in a 280K xml file using the sample code > > from the XML howto: > > > > def getXmlDomDocument(name): > > p = saxexts.make_parser() > > dh = SaxBuilder() > > p.setDocumentHandler(dh) > > p.parseFile(open(name)) > > p.close() > > doc = dh.document > > xml.dom.utils.strip_whitespace(doc) > > return doc > > > > it takes about five seconds to read and parse the file... > > > > Is there a better way to read the file (or is there updated code that is > > faster)? > > If you want a DOM for the output, then no... you'll have to deal with the > speed. 
If you have simple requirements for the Python representation of > the XML, then take a look at xml.utils.qp_xml. Hey, that works great! (down to ~0.5 seconds, and it doesn't have problems with installer either -- life is good ;-) -- bjorn From uche.ogbuji@fourthought.com Thu May 25 05:37:32 2000 From: uche.ogbuji@fourthought.com (uche.ogbuji@fourthought.com) Date: Wed, 24 May 2000 22:37:32 -0600 Subject: [XML-SIG] ANN: 4DOM 0.10.0 Message-ID: <200005250437.WAA02614@localhost.localdomain>  From uche.ogbuji@fourthought.com Thu May 25 05:37:53 2000 From: uche.ogbuji@fourthought.com (uche.ogbuji@fourthought.com) Date: Wed, 24 May 2000 22:37:53 -0600 Subject: [XML-SIG] ANN: 4XPath 0.9.0 and 4XSLT 0.9.0 Message-ID: <200005250437.WAA02657@localhost.localdomain>  From uche.ogbuji@fourthought.com Thu May 25 05:45:45 2000 From: uche.ogbuji@fourthought.com (uche.ogbuji@fourthought.com) Date: Wed, 24 May 2000 22:45:45 -0600 Subject: [XML-SIG] ANN: 4DOM 0.10.0 Message-ID: <200005250445.WAA02711@localhost.localdomain> Fourthought, Inc. (http://Fourthought.com) announces the release of 4DOM 0.10.0 ----------------------- An XML/HTML Python library using the Document Object Model interface 4DOM is a Python library for XML and HTML processing and manipulation using the W3C's Document Object Model for interface. 4DOM implements DOM Core level 2, HTML level 2 and Level 2 Document Traversal. 4DOM should work on all platforms supported by Python. If you have any problems with a particular platform, please e-mail the authors. 4DOM is designed to allow developers rapidly design applications that read, write or manipulate HTML and XML. 
News ---- - Moved all static variables to class variables - Fixed printing to work with empty elements - Removed all tabs from files - Change package to xml.dom - major change to the internals to use Node as a Python attribute manager this improves efficiency: cutting down on __g/setattrs__ and simplifies some things More info and Obtaining 4DOM ---------------------------- Please see http://Fourthought.com/4Suite/4DOM Or you can download 4DOM from ftp://Fourthought.com/pub/4Suite There are Linux RPMs available at ftp://Fourthought.com/pub/mirrors/python4linux/redhat/ 4DOM is distributed under a license similar to that of Python. -- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From uche.ogbuji@fourthought.com Thu May 25 05:45:53 2000 From: uche.ogbuji@fourthought.com (uche.ogbuji@fourthought.com) Date: Wed, 24 May 2000 22:45:53 -0600 Subject: [XML-SIG] ANN: 4XPath 0.9.0 and 4XSLT 0.9.0 Message-ID: <200005250445.WAA02755@localhost.localdomain> Fourthought, Inc. (http://Fourthought.com) announces the release of 4XSLT and 4XPath 0.9.0 ---------------------- A python implementation of the W3C's XSLT language 4XSLT is an XML transformation processor based on the W3C's specification for the XSLT transform language. 4XPath implements the W3C XPath language for indicating and selecting XML document components. http://www.w3.org/TR/xslt 4XPath implements the full 4XPath recommendation except for the 'lang' core function. 4XSLT all of the XSLT 1.0 Recommendation, except for extension elements and fallback. Note: 4XSLT and 4XPath cannot work with JPython. 
News ---- - Moved some parsing functionality to C for performance increase - Fixed bugs for Windows build - Converted to BisonGen for performance increase - Fix namespace axis - Change package name to xml.xpath / xml.xslt - Implemented node-set and match proprietary ft extensions - Cleaned up extension function code and simplified use of user ext functions - Changed xml output method to use short form for empty elements - Fixed automatic detection of html output method - Fixed xsl:apply-templates to support with-param - Split Processor from output Writer classes (improved coupling/cohesion) and implemented the core writer as a plain text outputter to avoid messing with SAX output unless necessary - Implemented xsl:attribute-set - Implemented xsl:decimal-format - Implemented disable-output-escaping on xsl:text and xsl:value-of - Implemented number-format extension function - Add proper support for qualified names in vars, params, functions, etc. - Fixed bug with xsl:element and namespaces - Fixed performance bugs - Other bug-fixes More info and Obtaining 4XPath and 4XSLT ---------------------------------------- Please see http://Fourthought.com/4Suite/4XPath http://Fourthought.com/4Suite/4XSLT Or you can download 4XSLT from ftp://Fourthought.com/pub/4Suite/ Source files with "-all" in the name include 4DOM and 4XPath. There are Linux RPMs available at ftp://Fourthought.com/pub/mirrors/python4linux/redhat/ And Windows binaries at ftp://Fourthought.com/pub/4Suite/windows 4XPath and 4XSLT are distributed under a license similar to that of Python. -- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. 
C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From tpassin@home.com Thu May 25 12:52:57 2000 From: tpassin@home.com (tpassin@home.com) Date: Thu, 25 May 2000 07:52:57 -0400 Subject: [XML-SIG] ANN: 4XPath 0.9.0 and 4XSLT 0.9.0 References: <200005250445.WAA02755@localhost.localdomain> Message-ID: <002601bfc63f$c4e14480$7cac1218@reston1.va.home.com> At last! WIndows binaries for these little babies!. Thanks, guys. Looks like they are really located at ftp://fourthought.com/pub/4Suite/binaries/windows/ Tom announced: > Fourthought, Inc. (http://Fourthought.com) announces the release of > > 4XSLT and 4XPath 0.9.0 > ---------------------- > A python implementation > of the W3C's XSLT language > > > 4XSLT is an XML transformation processor based on the W3C's specification > for the XSLT transform language. 4XPath implements the W3C XPath language > for indicating and selecting XML document components. > > > And Windows binaries at > > ftp://Fourthought.com/pub/4Suite/windows > From uogbuji@fourthought.com Thu May 25 19:17:42 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Thu, 25 May 2000 12:17:42 -0600 Subject: [XML-SIG] ANN: 4XPath 0.9.0 and 4XSLT 0.9.0 In-Reply-To: Message from of "Thu, 25 May 2000 07:52:57 EDT." <002601bfc63f$c4e14480$7cac1218@reston1.va.home.com> Message-ID: <200005251817.MAA03789@localhost.localdomain> > At last! WIndows binaries for these little babies!. Thanks, guys. > Looks like they are really located at > > ftp://fourthought.com/pub/4Suite/binaries/windows/ Oops. Yes. Getting the Windows binaries out was a royal pain, but we'll keep them coming. -- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. 
C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python

From walter@bnbt.de Fri May 26 15:27:02 2000
From: walter@bnbt.de (Walter Dörwald)
Date: Fri, 26 May 2000 16:27:02 +0200
Subject: [XML-SIG] Bug in sgmlop?
Message-ID: <4.3.1.0.20000526162422.00ac8de0@mail.bnbt.de>

Hello all!

I'm having a little problem with sgmlop from the 0.5.4 release. sgmlop
seems to drop the last character in the string passed to parse:

    import sgmlop

    class handler:
        def handle_data(self,data):
            print repr(data)

    parser = sgmlop.SGMLParser()
    parser.register(handler())
    parser.parse("gurk")
    parser.close()

This script outputs 'gur' instead of 'gurk'.

Bye,
Walter Dörwald

From Fredrik Lundh" Message-ID: <002a01bfc8ae$395994a0$f2a6b5d4@hagrid>

Walter Dörwald wrote:
> I'm having a little problem with sgmlop from
> the 0.5.4 release. sgmlop seems to drop the
> last character in the string passed to parse:

I've verified this in the 1990620 release. here's a tentative patch:

--- sgmlop.c.old Sun Jun 20 13:43:17 1999
+++ sgmlop.c Sun May 28 16:02:55 2000
@@ -1080,8 +1080,10 @@
         } else {
 
             /* raw data */
-            if (++p >= end)
+            if (++p >= end) {
+                q = p;
                 goto eol;
+            }
             continue;
 
         }

From Fredrik Lundh" Message-ID: <003801bfc8ae$f25b5600$f2a6b5d4@hagrid>

Walter Dörwald wrote:
> parser.register(handler())
> parser.parse("gurk")
> parser.close()

footnote: the correct way to use the parser is to
either call "feed" a couple of time, and call "close"
when you don't have more data, or to call "parse"
just once, with all the data you have.

not that it matters much in the current release...

From walter@bnbt.de Sun May 28 18:52:03 2000
From: walter@bnbt.de (Walter Dörwald)
Date: Sun, 28 May 2000 19:52:03 +0200
Subject: [XML-SIG] Bug in sgmlop?
In-Reply-To: <003801bfc8ae$f25b5600$f2a6b5d4@hagrid>
References: <4.3.1.0.20000526162422.00ac8de0@mail.bnbt.de>
Message-ID: <4.3.1.0.20000528194212.00ae4de0@mail.bnbt.de>

At 16:13 28.05.00, you wrote:
>Walter Dörwald wrote:
> > parser.register(handler())
> > parser.parse("gurk")
> > parser.close()
>
>footnote: the correct way to use the parser is to
>either call "feed" a couple of time, and call "close"
>when you don't have more data, or to call "parse"
>just once, with all the data you have.

Thanks for the tips, I'm doing a feed/close loop

    self.lineno = 1
    for line in lines:
        parser.feed(line)
        self.lineno = self.lineno + 1
    parser.close()

but only because I need line number information so I'm splitting the
source into lines. I suppose parsing the string in one go would be faster.

Are there any plans to provide line and column number information to the
sgmlop user? E.g. the function

    finish_starttag(self,name,attrs)

could be changed to

    finish_starttag(self,name,attrs,row,col)

and should be passed the position in the string where the tag started.
(and similar for the other functions). This would greatly simplify finding
"bugs" in an XML file and could be used by a XML editor to highlight the
position of the error.

Bye,
Walter Dörwald

>not that it matters much in the current release...

From Fredrik Lundh" <4.3.1.0.20000528194212.00ae4de0@mail.bnbt.de> Message-ID: <000c01bfc8d4$846e3420$f2a6b5d4@hagrid>

I just posted an updated version of sgmlop to the "eff-bot staging site":

http://w1.132.telia.com/~u13208596/sgmlop.htm

if I don't hear anything negative, I'll move it over to the pythonware
site later this week.

enjoy /F

From Fredrik Lundh" <4.3.1.0.20000528194212.00ae4de0@mail.bnbt.de> Message-ID: <002301bfc8d5$0c543e20$f2a6b5d4@hagrid>

(oops. pilot error.
please ignore my last mail)

I just posted an updated version of sgmlop to the "staging area" at:

http://w1.132.telia.com/~u13208596/sgmlop.htm

This release addresses the following issues:

SGMLOP1: SGML files containing text only wasn't properly handled.
         the parser never consumed the last character, not even if
         the 'close' method was called (reported by Walter Dörwald)

SGMLOP2: Unicode strings (under 1.6) were treated as binary buffers.
         In this release, the parser can properly parse 16-bit strings,
         but the callbacks get 8-bit UTF-8 strings, not true Unicode
         strings. This will be fixed in a future release.

SGMLOP3: The 'close' method no longer accepts an optional argument.
         Use a separate 'feed' call instead.

SGMLOP4: Recursive calls to 'feed' or 'close' (from within a callback)
         could lead to all sorts of weird problems. This version checks
         for this condition, and raises an AssertionError instead.

I'll move it over to the pythonware site later this week. Please wait for
that announcement before linking to this library.

enjoy /F

From info@pythonware.com Mon May 29 14:11:44 2000
From: info@pythonware.com (PythonWare)
Date: Mon, 29 May 2000 15:11:44 +0200
Subject: [XML-SIG] Re: new sgmlop release (may 28, 2000)
Message-ID: <000901bfc96f$722fa0a0$0500a8c0@secret.pythonware.com>

(same, but with the official link)

It's release week at the labs, and we'll start with something small but
tasty:

Secret Labs' sgmlop module is a fast replacement for the regular
expression-based parsers used in Python's sgmllib, htmllib, and xmllib
modules. A new version is now available from:

http://www.pythonware.com/products/xml

Changes since the last release include:

- if a file ends with cdata, make sure all characters are sent to the
  callback
- Unicode strings (under 1.6) are now translated to UTF-8 on the fly
  (future versions will be fully unicode-aware)
- the 'close' method no longer accepts an optional argument.
- recursive calls to 'feed' or 'close' now raises an exception.
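The feed/close protocol and the line-number question raised earlier in this
thread can be sketched roughly as follows. sgmlop is not assumed to be
installed here, so this uses the standard library's html.parser.HTMLParser
as a stand-in (a different parser, not sgmlop's API): it offers the same
feed()/close() calls, plus getpos() for the (line, column) of the construct
being handled. The names PositionHandler and parse_lines are made up for
illustration.

```python
from html.parser import HTMLParser


class PositionHandler(HTMLParser):
    """Stand-in handler: records start tags with their position, and all text."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []   # (tag name, line, column) for each start tag
        self.text = []   # every chunk of character data, in order

    def handle_starttag(self, tag, attrs):
        line, col = self.getpos()   # position where this tag starts
        self.tags.append((tag, line, col))

    def handle_data(self, data):
        self.text.append(data)


def parse_lines(lines):
    """Feed the source piece by piece, then close once no data is left."""
    parser = PositionHandler()
    for line in lines:
        parser.feed(line)
    parser.close()   # flushes anything the parser was still holding back
    return parser


result = parse_lines(["<a>\n", "<b>gurk"])
print(result.tags)
print(repr("".join(result.text)))
```

Because the parser tracks positions itself, the caller does not need to keep
its own line counter while feeding, and trailing text such as "gurk" is
delivered in full once close() has been called.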
enjoy,

the pythonware team
"Secret Labs -- makers of fine pythonware since 1997."

From Juergen Hermann"

The following changes are necessary to extensions/pyexpat.c in order to
get it to compile with VC5:

--- pyexpat.c.orig Fri Mar 31 03:44:28 2000
+++ pyexpat.c Mon May 29 13:44:10 2000
@@ -79,7 +79,7 @@
     xmlhandler handler;
 };
 
-static struct HandlerInfo handler_info[];
+staticforward struct HandlerInfo handler_info[];
 
 static PyObject *conv_atts( XML_Char **atts){
     PyObject *attrs_obj=NULL;
@@ -148,7 +148,7 @@
 }
 
 #define VOID_HANDLER( NAME, PARAMS, PARAM_FORMAT ) \
-    RC_HANDLER( void, NAME, PARAMS, , PARAM_FORMAT, , ,\
+    RC_HANDLER( void, NAME, PARAMS, ; , PARAM_FORMAT, ; , ; ,\
     (xmlparseobject *)userData )
 
 #define INT_HANDLER( NAME, PARAMS, PARAM_FORMAT )\

Ciao, Jürgen

--
Jürgen Hermann (jhe@webde-ag.de)
WEB.DE AG, Amalienbadstr.41, D-76227 Karlsruhe
Tel.: 0721/94329-0, Fax: 0721/94329-22

From jarek@sonic.net Wed May 31 00:11:18 2000
From: jarek@sonic.net (Jarek Wilkiewicz)
Date: Tue, 30 May 2000 16:11:18 -0700
Subject: [XML-SIG] 0.5.4 and documentType
Message-ID: <00c601bfca8c$5d2548e0$010a0a0a@nonia>

Hello,

I tried creating a DOM tree from an xml file, and the
document.documentType returns a None. Is implementation of the
DocumentType missing from the current PyXML release, or am I doing
something wrong?

Thanks,
Jarek

From anthony@interlink.com.au Wed May 31 04:46:17 2000
From: anthony@interlink.com.au (Anthony Baxter)
Date: Wed, 31 May 2000 13:46:17 +1000
Subject: [XML-SIG] questions on pyexpat usage - CdataSectionHandler?
Message-ID: <200005310346.NAA31890@mbuna.arbhome.com.au>

I'm trying to figure out how to use pyexpat's CdataSectionHandler

If I create small methods and assign them to parser.StartCdataSectionHandler
and parser.EndCdataSectionHandler, they get called correctly, but with no
arguments. How do you retrieve the actual data from the CDATA section?

A DefaultHandler only sees the open and closing CDATA tags, nothing else.

I'd RTFM but there doesn't seem to be a FM. The source was unhelpful...

thanks,
Anthony

From Anthony Baxter Wed May 31 04:53:36 2000
From: Anthony Baxter (Anthony Baxter)
Date: Wed, 31 May 2000 13:53:36 +1000
Subject: [XML-SIG] never mind.. (was Re: questions on pyexpat usage - CdataSectionHandler? )
In-Reply-To: Message from Anthony Baxter of "Wed, 31 May 2000 13:46:17 +1000."
Message-ID: <200005310353.NAA31955@mbuna.arbhome.com.au>

Never mind, I figured out what's going on - it's calling the
CharacterDataHandler for the actual data. Sorry for the noise.

Anthony

>>> Anthony Baxter wrote
> I'm trying to figure out how to use pyexpat's CdataSectionHandler
>
> If I create small methods and assign them to parser.StartCdataSectionHandler
> and parser.EndCdataSectionHandler, they get called correctly, but with no
> arguments. How do you retrieve the actual data from the CDATA section?
>
> A DefaultHandler only sees the open and closing CDATA tags, nothing else.
>
> I'd RTFM but there doesn't seem to be a FM. The source was unhelpful...
>
> thanks,
> Anthony

--
Anthony Baxter
It's never too late to have a happy childhood.
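For the archive, the behaviour Anthony describes can be sketched as below.
This is a minimal illustration, assuming the present-day standard-library
module xml.parsers.expat (the descendant of the pyexpat discussed here; the
PyXML build of May 2000 may differ in detail). The two CDATA section
handlers take no arguments and only mark the boundaries, while the text
itself arrives through CharacterDataHandler. The helper name collect_cdata
is hypothetical.

```python
import xml.parsers.expat


def collect_cdata(document):
    """Return the text found inside the CDATA sections of `document`."""
    chunks = []
    state = {"in_cdata": False}

    def start_cdata():
        # Called with no arguments when "<![CDATA[" is seen.
        state["in_cdata"] = True

    def end_cdata():
        # Called with no arguments when "]]>" is seen.
        state["in_cdata"] = False

    def char_data(data):
        # The section's content is delivered here, possibly in several pieces.
        if state["in_cdata"]:
            chunks.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartCdataSectionHandler = start_cdata
    parser.EndCdataSectionHandler = end_cdata
    parser.CharacterDataHandler = char_data
    parser.Parse(document, True)   # True: this is the final chunk of input
    return "".join(chunks)


text = collect_cdata("<doc>before<![CDATA[a < b & c]]>after</doc>")
print(text)
```

The flag set between the start and end callbacks is what separates CDATA
text from ordinary character data; without it the handler cannot tell the
two apart, since both arrive through the same callback.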
From paul@prescod.net Tue May 30 14:38:41 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 30 May 2000 08:38:41 -0500
Subject: [XML-SIG] questions on pyexpat usage - CdataSectionHandler?
References: <200005310346.NAA31890@mbuna.arbhome.com.au>
Message-ID: <3933C461.47F8665B@prescod.net>

Now he wants a manual. The nerve.

The data should come in a CharacterData event.

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
- Brooke Ross, Miss Canada International

From larsga@garshol.priv.no Wed May 31 09:02:25 2000
From: larsga@garshol.priv.no (Lars Marius Garshol)
Date: 31 May 2000 10:02:25 +0200
Subject: [XML-SIG] 0.5.4 and documentType
In-Reply-To: <00c601bfca8c$5d2548e0$010a0a0a@nonia>
References: <00c601bfca8c$5d2548e0$010a0a0a@nonia>
Message-ID:

* Jarek Wilkiewicz
|
| I tried creating a DOM tree from an xml file, and the
| document.documentType returns a None. Is implementation of the
| DocumentType missing from the current PyXML release, or am I doing
| something wrong?

The problem is that SAX, which PyDOM uses to read in the document,
does not report the DOCTYPE declaration, and so PyDOM can't include it
in the document. 4DOM has the same problem.

SAX 2.0 will include this, and it's possible that 4DOM takes advantage
of that in its Sax2 reader, but I haven't been able to look at that
yet.

--Lars M.

From uogbuji@fourthought.com Wed May 31 17:29:16 2000
From: uogbuji@fourthought.com (Uche Ogbuji)
Date: Wed, 31 May 2000 10:29:16 -0600
Subject: [XML-SIG] 0.5.4 and documentType
In-Reply-To: Message from Lars Marius Garshol of "31 May 2000 10:02:25 +0200."
Message-ID: <200005311629.KAA04201@localhost.localdomain>

> * Jarek Wilkiewicz
> |
> | I tried creating a DOM tree from an xml file, and the
> | document.documentType returns a None. Is implementation of the
> | DocumentType missing from the current PyXML release, or am I doing
> | something wrong?
>
> The problem is that SAX, which PyDOM uses to read in the document,
> does not report the DOCTYPE declaration, and so PyDOM can't include it
> in the document. 4DOM has the same problem.
>
> SAX 2.0 will include this, and it's possible that 4DOM takes advantage
> of that in its Sax2 reader, but I haven't been able to look at that
> yet.

This is exactly right. Use xml.dom.ext.readers.Sax2 and you'll get your
doctype just fine. This is not upgraded to the latest XML-SIG SAX2 yet,
but will be soon.

--
Uche Ogbuji Principal Consultant
uche.ogbuji@fourthought.com +01 303 583 9900 x 101
Fourthought, Inc. http://Fourthought.com
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python

From gwillis@mail.com Wed May 31 17:33:33 2000
From: gwillis@mail.com (george willis)
Date: Wed, 31 May 2000 12:33:33 -0400 (EDT)
Subject: [XML-SIG] XML serialization / marshalling via DTD
Message-ID: <381953605.959790813787.JavaMail.root@web135-mc.mail.com>

Goal: To provide object serialization/deserialization mechanism for python (similar to what is provided in java) using XML, a DTD, and a generic XML API and parser. Ultimate use will be inside the Zope environment. See http://iceberg.sourceforge.net/ for similar concept using java.

Background: I have found several leads, pieces of code, etc. In fact, it seems too many people have taken a stab at this without realizing the generic functionality that would solve so many problems. There are importers without exporters, importers to screen widgets, importers that do their own parsing, etc.

Question: Has anyone embraced this problem to develop an architecturally sound solution to object serialization via XML using XML-RD, XML-Schemas, or other refinements of the DTD specification?
George Willis
gwillis@mail.com
voice: (706)206-0091
fax: (240)337-8593
______________________________________________
FREE Personalized Email at Mail.com
Sign up at http://www.mail.com/?sr=signup

From Fredrik Lundh" Message-ID: <023301bfcb1f$134e8180$f2a6b5d4@hagrid>

george willis wrote:
> Goal: To provide object serialization/deserialization mechanism for python
> (similar to what is provided in java) using XML, a DTD, and a generic XML
> API and parser. Ultimate use will be inside the Zope environment. See
> http://iceberg.sourceforge.net/ for similar concept using java.

what's wrong with XML-RPC and/or SOAP?

From paul@prescod.net Wed May 31 18:02:04 2000
From: paul@prescod.net (Paul Prescod)
Date: Wed, 31 May 2000 12:02:04 -0500
Subject: [XML-SIG] XML serialization / marshalling via DTD
References: <381953605.959790813787.JavaMail.root@web135-mc.mail.com>
Message-ID: <3935458C.7ACB09BE@prescod.net>

You have to be very careful with what these iceberg people are trying to do. Let's imagine that we make a technology that can read a Java class and make a perfect DTD for the data in the class. Now you want to change the implementation of the class for performance reasons...do you change the DTD? If not, then you need to introduce some level of abstraction between the raw XML and the objects you are creating.

If you DO change the DTD automatically then what value is the DTD providing anyhow? You might as well just serialize the object with a fixed DTD like SOAP or XML-RPC because the real "schema" is the Java code.

I think you need to separately maintain your DTD or schema and your Python or Java implementation. Then you need to implement a mapping layer. That's what people use the various XML APIs for.

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
"I want to give beauty pageants the respectability they deserve."
- Brooke Ross, Miss Canada International

From gwillis@mail.com Wed May 31 19:32:12 2000
From: gwillis@mail.com (george willis)
Date: Wed, 31 May 2000 14:32:12 -0400 (EDT)
Subject: [XML-SIG] XML serialization / marshalling via DTD
Message-ID: <383806599.959797932195.JavaMail.root@web135-mc.mail.com>

Nothing is "wrong" with SOAP, or its predecessor XML-RPC -- they are just not object serialization mechanisms, but rather RPC mechanisms that use XML to transport data.

To get reusable code, we must think abstract. XML is nothing more and nothing less than a universal standard for the representation of serialized object models for transport between systems. What is needed is a generic serializer/deserializer like that found in java.

If you have ever "marshalled" objects the Microsoft way, you know what a royal pain it is to write your own marshalling code. If you have ever "serialized" objects the java way, you wonder why this wasn't automated sooner.

A mechanism which automates the serialization of objects via introspection of the object model, and that automates deserialization of objects via parsing the stream and creating the necessary classes/types (if not present) and then instantiating objects that conform to the stream, has many uses -- persistence, rpc, distributed objects, etc.

The key to good architectures lies in the building blocks. We have SOAP for rpc - and it will utilize some sort of XML-Schema to refine parameters in the rpc call. But how does the XML-Schema get turned into objects that the receiving system can then use to perform its function utilizing OOP? How does the resulting output remain independent and loosely coupled from the plethora of standards like HTML, WML, XML, XML-Schema, XML-RD, etc., if not through a mechanism that takes the object model that results after the rpc call, and then serializes this to the appropriate dialect?
The process of ser/deser exists embedded and constrained in several places such as XMLDocument, XMLWidgets, XMLObjects, etc.

WHAT WE NEED IS THIS MECHANISM, JUST AS IT IS IN JAVA XML SERIALIZATION CODE, TO PROVIDE AUTOMATED SER/DESER USING THE EXISTING FOUNDATIONAL WORK OF SAX, DOM, AND PLUGGABLE PARSERS, SO THAT WE MAY LEVERAGE ALL THE BENEFITS OF OO REUSE THROUGH THE CONSUMING TECHNOLOGIES OF RPC, PERSISTENCE, ETC.

I hope this clarifies my previous post. I thank you for taking the time to respond and welcome any additional insights you may have.

------Original Message------
From: "Fredrik Lundh"
To: "george willis" ,
Sent: May 31, 2000 4:41:30 PM GMT
Subject: Re: [XML-SIG] XML serialization / marshalling via DTD

what's wrong with XML-RPC and/or SOAP?

george willis wrote:

Goal: To provide object serialization/deserialization mechanism for python (similar to what is provided in java) using XML, a DTD, and a generic XML API and parser. Ultimate use will be inside the Zope environment. See http://iceberg.sourceforge.net/ for similar concept using java.

Background: I have found several leads, pieces of code, etc. In fact, it seems too many people have taken a stab at this without realizing the generic functionality that would solve so many problems. There are importers without exporters, importers to screen widgets, importers that do their own parsing, etc.

Question: Has anyone embraced this problem to develop an architecturally sound solution to object serialization via XML using XML-RD, XML-Schemas, or other refinements of the DTD specification?
George Willis
gwillis@mail.com
voice: (706)206-0091
fax: (240)337-8593

From Fredrik Lundh" Message-ID: <009b01bfcb30$487bb9c0$f2a6b5d4@hagrid>

> WHAT WE NEED IS THIS MECHANISM, JUST AS IT IS IN JAVA XML SERIALIZATION
> CODE, TO PROVIDE AUTOMATED SER/DESER USING THE EXISTING FOUNDATIONAL WORK OF
> SAX, DOM, AND PLUGGABLE PARSERS, SO THAT WE MAY LEVERAGE ALL THE BENEFITS OF
> OO REUSE THROUGH THE CONSUMING TECHNOLOGIES OF RPC, PERSISTENCE, ETC.

if you had bothered to look before you started shouting at me, you might have noticed that xmlrpclib.py and soaplib.py provide generic marshalling code for python data structures.

but since "reuse" obviously means "inventing yet another wheel" in your dictionary, I can only wish you good luck.

over and out /F

From ken@bitsko.slc.ut.us Wed May 31 19:29:05 2000
From: ken@bitsko.slc.ut.us (Ken MacLeod)
Date: 31 May 2000 13:29:05 -0500
Subject: [XML-SIG] XML serialization / marshalling via DTD
In-Reply-To: "Fredrik Lundh"'s message of "Wed, 31 May 2000 20:44:40 +0200"
References: <383806599.959797932195.JavaMail.root@web135-mc.mail.com> <009b01bfcb30$487bb9c0$f2a6b5d4@hagrid>
Message-ID:

"Fredrik Lundh" writes:

> xmlrpclib.py and soaplib.py provide generic marshalling code for
> python data structures.

I don't see soaplib.py in CVS, is it available elsewhere?

-- Ken

From ken@bitsko.slc.ut.us Wed May 31 19:39:50 2000
From: ken@bitsko.slc.ut.us (Ken MacLeod)
Date: 31 May 2000 13:39:50 -0500
Subject: [XML-SIG] XML serialization / marshalling via DTD
In-Reply-To: george willis's message of "Wed, 31 May 2000 14:32:12 -0400 (EDT)"
References: <383806599.959797932195.JavaMail.root@web135-mc.mail.com>
Message-ID:

george willis writes:

> Nothing is "wrong" with SOAP, or its predecessor XML-RPC -- they are
> just not object serialization mechanisms, but rather RPC mechanisms
> that use XML to transport data.
>
> To get reusable code, we must think abstract. XML is nothing more
> and nothing less than a universal standard for the representation of
> serialized object models for transport between systems. What is
> needed is a generic serializer/deserializer like that found in java.

You may not have been following SOAP recently, SOAP 1.1 clearly separates envelope, object encoding, RPC, and HTTP binding. The part you are possibly looking for is object encoding (section 5). Most implementations will have a generic serializer/deserializer for that encoding.

The W3C is working on defining what their role will be in XML protocols, you can follow or participate by joining the xml-dist-app mailing list. Most of what you're talking about is exactly what has been discussed on that mailing list:

-- Ken

From gwillis@mail.com Wed May 31 20:21:42 2000
From: gwillis@mail.com (george willis)
Date: Wed, 31 May 2000 15:21:42 -0400 (EDT)
Subject: [XML-SIG] XML serialization / marshalling via DTD
Message-ID: <383748928.959800902912.JavaMail.root@web313-mc.mail.com>

I believe that such flame wars are counterproductive, and have no place in a professional environment. They waste time and threaten to dismiss key topics of discussion as "shortsighted".

I am aware that many use "CAPITAL CRITICISM" to shout derision at one another. My use of capitals was to highlight my main thought in a rather large message. As my first message is proof that capitals can be used without derision, your response is proof that derision does not require capitals.

Your comments that try to dismiss this issue as "shortsighted" actually give credence to my concerns. With even more ser/deser code designed for a specific application, which among the heaps of choices is a good starting point for an autonomous tool to provide this service to all of the desired consuming tools? Which is the best code base? (see, it's just hard to highlight without caps.)
Save your fight for the java XML products that are threatening to take the market share. If you took offense, you have my sincere apologies. If you can and will cooperate in this discussion, I welcome your help. If not, go with God and my blessing.

NOW, UNTO THE BUSINESS AT HAND (used to signify a new section)
------------------------------

Fredrik has made me aware that yet more code exists in even more modules. Thank you Fredrik. Does anyone have experience with this code? Does it ser/deser an entire containment tree? What DTD or XML Schema standards are supported? Does it make use of the existing SAX/DOM/Parser tools? How does it compare with the code from other approaches found in XML Objects, XML Document, etc.? Has anyone else considered an architecture where ser/deser was a separate interface?

------Original Message------
From: "Fredrik Lundh"
To: "george willis" , ,
Sent: May 31, 2000 6:44:40 PM GMT
Subject: Re: [XML-SIG] XML serialization / marshalling via DTD

> WHAT WE NEED IS THIS MECHANISM, JUST AS IT IS IN JAVA XML SERIALIZATION
> CODE, TO PROVIDE AUTOMATED SER/DESER USING THE EXISTING FOUNDATIONAL WORK OF
> SAX, DOM, AND PLUGGABLE PARSERS, SO THAT WE MAY LEVERAGE ALL THE BENEFITS OF
> OO REUSE THROUGH THE CONSUMING TECHNOLOGIES OF RPC, PERSISTENCE, ETC.

if you had bothered to look before you started shouting at me, you might have noticed that xmlrpclib.py and soaplib.py provide generic marshalling code for python data structures.

but since "reuse" obviously means "inventing yet another wheel" in your dictionary, I can only wish you good luck.
over and out /F

George Willis
gwillis@mail.com
voice: (706)206-0091
fax: (240)337-8593

From gwillis@mail.com Wed May 31 20:45:55 2000
From: gwillis@mail.com (george willis)
Date: Wed, 31 May 2000 15:45:55 -0400 (EDT)
Subject: [XML-SIG] XML serialization / marshalling via DTD
Message-ID: <383153385.959802355030.JavaMail.root@web135-mc.mail.com>

First off, thanks for your help Ken. I have been following SOAP, with the recent IBM involvement and inclusion of multiple transports such as SMTP and MOM protocols (i.e. MQSeries). SOAP is a good architecture from the scant view I have taken of it. I am glad to see a layered approach, and it seems like a place to look for good code, but before I do, I thought I might ask those who have dealt with this issue before, and gain from their experience.

Does the code in the python SOAP package perform serialization and deserialization? How does this code compare to other ser/deser code used in XMLDocument, XMLWidgets, and ZODB? It would seem to me that since this is needed for ZODB, we might have some good code there? Has anyone compared these codebases that must perform the ser/deser to see which might have the best code? Shouldn't we study them and put forth a "best-of-breed" ser/deser mechanism that can then be used by all these consumers?

--- In zope-xml@egroups.com, Ken MacLeod wrote:
> george willis writes:
>
> > Nothing is "wrong" with SOAP, or its predecessor XML-RPC -- they are
> > just not object serialization mechanisms, but rather RPC mechanisms
> > that use XML to transport data.
> >
> > To get reusable code, we must think abstract. XML is nothing more
> > and nothing less than a universal standard for the representation of
> > serialized object models for transport between systems. What is
> > needed is a generic serializer/deserializer like that found in java.
>
> You may not have been following SOAP recently, SOAP 1.1 clearly
> separates envelope, object encoding, RPC, and HTTP binding. The part
> you are possibly looking for is object encoding (section 5). Most
> implementations will have a generic serializer/deserializer for that
> encoding.
>
> The W3C is working on defining what their role will be in XML
> protocols, you can follow or participate by joining the xml-dist-app
> mailing list. Most of what you're talking about is exactly what has
> been discussed on that mailing list:
>
>
>
>
> -- Ken

George Willis
gwillis@mail.com
voice: (706)206-0091
fax: (240)337-8593

From gwillis@mail.com Wed May 31 21:34:12 2000
From: gwillis@mail.com (george willis)
Date: Wed, 31 May 2000 16:34:12 -0400 (EDT)
Subject: [XML-SIG] XML serialization / marshalling via DTD
Message-ID: <383438520.959805252510.JavaMail.root@web303-mc.mail.com>

First off, thanks for taking time to respond Paul. I have many of your articles (groves) and it is an honor to converse with you.

Your points are well taken. The concerns you express are the same as those experienced in java serialization whether XML or not, and for that matter, serialization in general. Transport protocols must become standard when utilized for inter-enterprise communications where the protocol becomes the serialized interface or API. APIs must be well thought out and become stable quickly if independent code is to be written that supports the interface.

But there are many cases where these concerns can be relaxed. Cases that come to mind are where the interface is not ready for widespread use. Serialization gives us a quick way to "sculpt" the transport protocol during development. As the functionality of the system and its visionary uses fall into place, the resulting protocol can be frozen, providing an interface where necessary.
Also, where code can be shared, the API to the classes becomes the interface, and the transport becomes ubiquitous for a season. Again, as you have warned us, be prepared to freeze it if independent code will tap into the transport protocol.

My main goal is to take XML-Schema for invoices, P.O.s etc. and deserialize them into an object model. Since there is no industry agreement on what a P.O. looks like in XML, (i.e. no standard interface), a CASE tool to render the object classes and instances would be of great aid. Also, where no XML-Schema existed, one could be developed and refined during the system prototype, and then submitted as a good starting point. It is always better to have a protocol from a working system to start with -- it shows a certain due diligence.

Again, your points are well taken, and very appropriate. Thank you for all the work you have done that others are building on.

George Willis
gwillis@mail.com
voice: (706)206-0091
fax: (240)337-8593
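
The mapping layer Paul recommends, applied to the purchase-order case George mentions, might look like the sketch below. It is only an illustration under stated assumptions: it uses the standard-library xml.etree.ElementTree and dataclasses, which postdate this thread, and every element name, attribute name, class, and function in it (purchaseOrder, item, sku, qty, PurchaseOrder, LineItem, po_from_xml, po_to_xml) is hypothetical, not taken from any agreed P.O. schema.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class LineItem:
    sku: str
    quantity: int


@dataclass
class PurchaseOrder:
    number: str
    items: list


def po_from_xml(xml_text):
    """Map the (hypothetical) purchaseOrder XML vocabulary onto Python objects."""
    root = ET.fromstring(xml_text)
    items = [
        LineItem(sku=elem.get("sku"), quantity=int(elem.get("qty")))
        for elem in root.findall("item")
    ]
    return PurchaseOrder(number=root.get("number"), items=items)


def po_to_xml(po):
    """Map the Python objects back onto the same XML vocabulary."""
    root = ET.Element("purchaseOrder", {"number": po.number})
    for item in po.items:
        ET.SubElement(root, "item", {"sku": item.sku, "qty": str(item.quantity)})
    return ET.tostring(root, encoding="unicode")


order = PurchaseOrder("PO-1", [LineItem("A1", 2), LineItem("B7", 10)])
print(po_to_xml(order))
print(po_from_xml(po_to_xml(order)) == order)
```

Because the XML names live only in the two mapping functions, the classes can be reorganised without changing the document format, which is the decoupling Paul argues for.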