From tismer@tismer.com Mon May 1 13:03:03 2000
From: tismer@tismer.com (Christian Tismer)
Date: Mon, 01 May 2000 14:03:03 +0200
Subject: [XML-SIG] Windows install problems
References: <01BFB306.0859C5E0@JOHAN>
Message-ID: <390D7277.1D60B44B@tismer.com>

Johan De Smedt wrote:
>
> Hi,
>
> I've had the following problem while trying to install the python xml package on windows NT:

I'd recommend using my Windows installer.
Please report any problems to me.

http://www.tismer.com/xml/
contains my build from April 18.

ciao - chris

--
Christian Tismer :^)
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaunstr. 26 : *Starship* http://starship.python.net
14163 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
where do you want to jump today? http://www.stackless.com

From paul@prescod.net Mon May 1 21:38:29 2000
From: paul@prescod.net (Paul Prescod)
Date: Mon, 01 May 2000 15:38:29 -0500
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us>
Message-ID: <390DEB45.D8D12337@prescod.net>

Uche asked for a summary so I cc:ed the xml-sig.

Guido van Rossum wrote:
>
> ...
>
> OK. I really meant recoding in UTF-8 -- I maintain that there are
> lots of forces that prevent recoding most ISO-2022-JP documents in
> UTF-8.

Absolutely agree.

> Are you sure you understand what we are arguing about?

Here's what I thought we were arguing about:

If you put a bunch of "funny characters" into a Python string literal, and then compare that string literal against a Unicode object, should those funny characters be treated as logical units of text (characters) or as bytes?
And if bytes, should some transformation be automatically performed to have those bytes be reinterpreted as characters according to some particular encoding scheme (probably UTF-8). I claim that we should *as far as possible* treat strings as character lists and not add any new functionality that depends on them being byte list. Ideally, we could add a byte array type and start deprecating the use of strings in that manner. Yes, it will take a long time to fix this bug but that's what happens when good software lives a long time and the world changes around it. > Earlier, you quoted some reference documentation that defines 8-bit > strings as containing characters. That's taken out of context -- this > was written in a time when there was (for most people anyway) no > difference between characters and bytes, and I really meant bytes. Actually, I think that that was Fredrik. Anyhow, you wrote the documentation that way because it was the most intuitive way of thinking about strings. It remains the most intuitive way. I think that that was the point Fredrik was trying to make. We can't make "byte-list" strings go away soon but we can start moving people towards the "character-list" model. In concrete terms I would suggest that old fashioned lists be automatically coerced to Unicode by interpreting each byte as a Unicode character. Trying to go the other way could cause the moral equivalent of an OverflowError but that's not a problem. >>> a=1000000000000000000000000000000000000L >>> int(a) Traceback (innermost last): File "", line 1, in ? OverflowError: long int too long to convert And just as with ints and longs, we would expect to eventually unify strings and unicode strings (but not byte arrays). -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. 
- http://www.cs.yale.edu/~perlis-alan/quotes.html From guido@python.org Mon May 1 22:32:38 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 17:32:38 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Mon, 01 May 2000 15:38:29 CDT." <390DEB45.D8D12337@prescod.net> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> Message-ID: <200005012132.RAA23319@eric.cnri.reston.va.us> > > Are you sure you understand what we are arguing about? > > Here's what I thought we were arguing about: > > If you put a bunch of "funny characters" into a Python string literal, > and then compare that string literal against a Unicode object, should > those funny characters be treated as logical units of text (characters) > or as bytes? And if bytes, should some transformation be automatically > performed to have those bytes be reinterpreted as characters according > to some particular encoding scheme (probably UTF-8). > > I claim that we should *as far as possible* treat strings as character > lists and not add any new functionality that depends on them being byte > list. Ideally, we could add a byte array type and start deprecating the > use of strings in that manner. Yes, it will take a long time to fix this > bug but that's what happens when good software lives a long time and the > world changes around it. > > > Earlier, you quoted some reference documentation that defines 8-bit > > strings as containing characters. That's taken out of context -- this > > was written in a time when there was (for most people anyway) no > > difference between characters and bytes, and I really meant bytes. > > Actually, I think that that was Fredrik. Yes, I came across the post again later. Sorry. 
> Anyhow, you wrote the documentation that way because it was the most > intuitive way of thinking about strings. It remains the most intuitive > way. I think that that was the point Fredrik was trying to make. I just wish he made the point more eloquently. The eff-bot seems to be in a crunchy mood lately... > We can't make "byte-list" strings go away soon but we can start moving > people towards the "character-list" model. In concrete terms I would > suggest that old fashioned lists be automatically coerced to Unicode by > interpreting each byte as a Unicode character. Trying to go the other > way could cause the moral equivalent of an OverflowError but that's not > a problem. > > >>> a=1000000000000000000000000000000000000L > >>> int(a) > Traceback (innermost last): > File "", line 1, in ? > OverflowError: long int too long to convert > > And just as with ints and longs, we would expect to eventually unify > strings and unicode strings (but not byte arrays). OK, you've made your claim -- like Fredrik, you want to interpret 8-bit strings as Latin-1 when converting (not just comparing!) them to Unicode. I don't think I've heard a good *argument* for this rule though. "A character is a character is a character" sounds like an axiom to me -- something you can't prove or disprove rationally. I have a bunch of good reasons (I think) for liking UTF-8: it allows you to convert between Unicode and 8-bit strings without losses, Tcl uses it (so displaying Unicode in Tkinter *just* *works*...), it is not Western-language-centric. Another reason: while you may claim that your (and /F's, and Just's) preferred solution doesn't enter into the encodings issue, I claim it does: Latin-1 is just as much an encoding as any other one. I claim that as long as we're using an encoding we might as well use the most accepted 8-bit encoding of Unicode as the default encoding. 
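As a rough illustration of the two defaults being weighed here, the sketch below decodes the same 8-bit data both ways. It assumes a present-day Python 3 interpreter, where `bytes` and `str` stand in for the 8-bit and Unicode strings of the 1.6 alphas; the variable names are illustrative and do not come from the thread.

```python
# Illustrative only: Python 3 bytes/str stand in for the 8-bit and
# Unicode strings of the 1.6 alphas discussed in this thread.
raw = b"\xa4"  # one byte, written "\244" in octal notation

# Latin-1: every byte maps to the code point with the same number,
# so decoding can never fail and ord() is preserved.
as_latin1 = raw.decode("latin-1")
latin1_preserves_ord = len(as_latin1) == 1 and ord(as_latin1) == 0xA4

# UTF-8: 0xA4 is a continuation byte, so on its own it is not valid.
try:
    raw.decode("utf-8")
    utf8_accepts_raw = True
except UnicodeDecodeError:
    utf8_accepts_raw = False

# UTF-8 can encode any code point and decode it back unchanged;
# Latin-1 cannot encode anything above U+00FF.
sample = "\u20ac\u65e5\u672c"  # euro sign plus two CJK ideographs
utf8_round_trips = sample.encode("utf-8").decode("utf-8") == sample
try:
    sample.encode("latin-1")
    latin1_encodes_sample = True
except UnicodeEncodeError:
    latin1_encodes_sample = False
```

Read this way, the disagreement is about which failure is preferable: Latin-1 never rejects 8-bit input but covers only 256 characters, while UTF-8 covers everything but can refuse arbitrary 8-bit data.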
I also think that the issue is blown out of proportions: this ONLY happens when you use Unicode objects, and it ONLY matters when some other part of the program uses 8-bit string objects containing non-ASCII characters. Given the long tradition of using different encodings in 8-bit strings, at that point it is anybody's guess what encoding is used, and UTF-8 is a better guess than Latin-1. --Guido van Rossum (home page: http://www.python.org/~guido/) From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> Message-ID: <017d01bfb3bc$c3734c00$34aab5d4@hagrid> Guido van Rossum wrote: > I just wish he made the point more eloquently. The eff-bot seems to > be in a crunchy mood lately... I've posted a few thousand messages on this topic, most of which seem to have been ignored. if you'd read all my messages, and seen all the replies, you'd be cranky too... > I don't think I've heard a good *argument* for this rule though. "A > character is a character is a character" sounds like an axiom to me -- > something you can't prove or disprove rationally. maybe, but it's a darn good axiom, and it's used by everyone else. Perl uses it, Tcl uses it, XML uses it, etc. see: http://www.python.org/pipermail/python-dev/2000-April/005218.html > I have a bunch of good reasons (I think) for liking UTF-8: it allows > you to convert between Unicode and 8-bit strings without losses, Tcl > uses it (so displaying Unicode in Tkinter *just* *works*...), it is > not Western-language-centric. the "Tcl uses it" is a red herring -- their internal implementation uses 16-bit integers, and the external interface works very hard to keep the "strings are character sequences" illusion. 
in other words, the length of a string is *always* the number of characters, the character at index i is *always* the i'th character in the string, etc.

that's not true in Python 1.6a2.

(as for Tkinter, you only have to add 2-3 lines of code to make it use 16-bit strings instead...)

> Another reason: while you may claim that your (and /F's, and Just's)
> preferred solution doesn't enter into the encodings issue, I claim it
> does: Latin-1 is just as much an encoding as any other one.

this is another red herring: my argument is that 8-bit strings should contain unicode characters, using unicode character codes. there should be only one character repertoire, and that repertoire is unicode. for a definition of these terms, see:

http://www.python.org/pipermail/python-dev/2000-April/005225.html

obviously, you can only store 256 different values in a single 8-bit character (just like you can only store 4294967296 different values in a single 32-bit int).

to store larger values, use unicode strings (or long integers).

conversion from a small type to a large type always works, conversion from a large type to a small one may result in an OverflowError.

it has nothing to do with encodings.

> I claim that as long as we're using an encoding we might as well use
> the most accepted 8-bit encoding of Unicode as the default encoding.

yeah, and I claim that it won't fly, as long as it breaks the "strings are character sequences" rule used by all other contemporary (and competing) systems.

(if you like, I can post more "fun with unicode" messages ;-)

and as I've mentioned before, there are (at least) two ways to solve this:

1. teach 8-bit strings about UTF-8 (this is how it's done in Tcl and Perl). make sure len(s) returns the number of characters in the string, make sure s[i] returns the i'th character (not necessarily starting at the i'th byte, and not necessarily one byte), etc.
to make this run reasonably fast, use as many implementation tricks as you can come up with (I've described three ways to implement this in an earlier post).

2. define 8-bit strings as holding an 8-bit subset of unicode: ord(s[i]) is a unicode character code, whether s is an 8-bit string or a unicode string.

for alternative 1 to work, you need to add some way to explicitly work with binary strings (like it's done in Perl and Tcl).

alternative 2 doesn't need that; 8-bit strings can still be used to hold any kind of binary data, as in 1.5.2. just keep in mind you cannot use all methods on such an object...

> I also think that the issue is blown out of proportions: this ONLY
> happens when you use Unicode objects, and it ONLY matters when some
> other part of the program uses 8-bit string objects containing
> non-ASCII characters. Given the long tradition of using different
> encodings in 8-bit strings, at that point it is anybody's guess what
> encoding is used, and UTF-8 is a better guess than Latin-1.

I still think it's very unfortunate that you think that unicode strings are a special kind of strings. Perl and Tcl don't, so why should we?

From paul@prescod.net Tue May 2 01:19:20 2000
From: paul@prescod.net (Paul Prescod)
Date: Mon, 01 May 2000 19:19:20 -0500
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us>
Message-ID: <390E1F08.EA91599E@prescod.net>

Sorry for the long message. Of course you need only respond to that which is interesting to you. I don't think that most of it is redundant.

Guido van Rossum wrote:
>
> ...
>
> OK, you've made your claim -- like Fredrik, you want to interpret
> 8-bit strings as Latin-1 when converting (not just comparing!)
them to
> Unicode.

If the user provides an explicit conversion function (e.g. UTF-8-decode) then of course we should use that function. Under my character is a character is a character model, this "conversion" is morally equivalent to ROT-13, strupr or some other text->text translation. So you could apply UTF-8-decode even to a Unicode string as long as each character in the string has ord()<256 (so that it could be interpreted as a character representation for a byte).

> I don't think I've heard a good *argument* for this rule though. "A
> character is a character is a character" sounds like an axiom to me --
> something you can't prove or disprove rationally.

I don't see it as an axiom, but rather as a design decision you make to keep your language simple. Along the lines of "all values are objects" and (now) all integer values are representable with a single type. Are you happy with this?

a="\244"
b=u"\244"
assert len(a)==len(b)
assert ord(a[0])==ord(b[0])

# same thing, right?

print b==a
# Traceback (most recent call last):
# File "", line 1, in ?
# UnicodeError: UTF-8 decoding error: unexpected code byte

If I type "\244" it means I want character 244, not the first half of a UTF-8 escape sequence. "\244" is a string with one character. It has no encoding. It is not latin-1. It is not UTF-8. It is a string with one character and should compare as equal with another string with the same character.

I would laugh my ass off if I was using Perl and it did something weird like this to me (as long as it didn't take a month to track down the bug!). Now it isn't so funny.

> I have a bunch of good reasons (I think) for liking UTF-8:

I'm not against UTF-8. It could be an internal representation for some Unicode objects.

> it allows
> you to convert between Unicode and 8-bit strings without losses,

Here's the heart of our disagreement:

******
I don't want, in Py3K, to think about "converting between Unicode and 8-bit strings."
I want strings and I want byte-arrays and I want to worry about converting between *them*. There should be only one string type, its characters should all live in the Unicode character repertoire and the character numbers should all come from Unicode. "Special" characters can be assigned to the Unicode Private Use Area. Byte arrays would be entirely separate and would be converted to Unicode strings with explicit conversion functions.
*****

In the meantime I'm just trying to get other people thinking in this mode so that the transition is easier. If I see people embedding UTF-8 escape sequences in literal strings today, I'm going to hit them.

I recognize that we can't design the universe right now but we could agree on this direction and use it to guide our decision-making.

By the way, if we DID think of 8-bit strings as essentially "byte arrays" then let's use that terminology and imagine some future documentation:

"Python's string type is equivalent to a list of bytes. For clarity, we will call this type a byte list from now on. In contexts where a Unicode character-string is desired, Python automatically converts byte lists to character strings by doing a UTF-8 decode on them."

What would you think if Java had a default (I say "magical") conversion from byte arrays to character strings?

The only reason we are discussing this is because Python strings have a dual personality which was useful in the past but will (IMHO, of course) become increasingly confusing in the future. We want the best of both worlds without confusing anybody and I don't think that we can have it.

If you want 8-bit strings to be really byte arrays in perpetuity then let's be consistent in that view. We can compare them to Unicode as we would two completely separate types. "U" comes after "S" so unicode strings always compare greater than 8-bit strings. The use of the word "string" for both objects can be considered just a historical accident.
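The split proposed above (one Unicode string type, a separate byte type, explicit conversions between them) can be sketched with the `str` and `bytes` types of a present-day Python 3 interpreter. That choice of interpreter is an assumption of this sketch, not something available to the 2000-era thread, and the variable names are illustrative.

```python
# Assumes Python 3, where str is a sequence of Unicode characters
# and bytes is a separate sequence of small integers.
text = "\244"   # one character; octal 244 is code point U+00A4
data = b"\244"  # one byte with the same numeric value

same_character = text == "\u00a4"
same_number = ord(text) == data[0]

# The two types never compare equal, even when the numbers match.
compares_equal = text == data

# Mixing them is an error rather than a silent, guessed conversion.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# Conversion is always explicit and names the encoding.
explicit_match = data.decode("latin-1") == text
```

Under this model the question of a "default encoding" for comparisons does not arise, because nothing is decoded unless the program asks for it.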
> Tcl uses it (so displaying Unicode in Tkinter *just* *works*...), Don't follow this entirely. Shouldn't the next version of TKinter accept and return Unicode strings? It would be rather ugly for two Unicode-aware systems (Python and TK) to talk to each other in 8-bit strings. I mean I don't care what you do at the C level but at the Python level arguments should be "just strings." Consider that len() on the TKinter side would return a different value than on the Python side. What about integral indexes into buffers? I'm totally ignorant about TKinter but let me ask wouldn't Tkinter say (e.g.) that the cursor is between the 5th and 6th character when in an 8-bit string the equivalent index might be the 11th or 12th byte? > it is not Western-language-centric. If you look at encoding efficiency it is. > Another reason: while you may claim that your (and /F's, and Just's) > preferred solution doesn't enter into the encodings issue, I claim it > does: Latin-1 is just as much an encoding as any other one. The fact that my proposal has the same effect as making Latin-1 the "default encoding" is a near-term side effect of the definition of Unicode. My long term proposal is to do away with the concept of 8-bit strings (and thus, conversions from 8-bit to Unicode) altogether. One string to rule them all! Is Unicode going to be the canonical Py3K character set or will we have different objects for different character sets/encodings with different default (I say "magical") conversions between them. Such a design would not be entirely insane though it would be a PITA to implement and maintain. If we aren't ready to establish Unicode as the one true character set then we should probably make no special concessions for Unicode at all. Let a thousand string objects bloom! Even if we agreed to allow many string objects, byte==character should not be the default string object. Unicode should be the default. 
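The index mismatch and the encoding-efficiency point raised above can be made concrete with a small sketch. It assumes a present-day Python 3 interpreter; `utf8_byte_offset` is a hypothetical helper written for this illustration, and the sample strings are not from the thread.

```python
def utf8_byte_offset(text, char_index):
    """Byte offset, in the UTF-8 encoding of text, at which the
    character with index char_index begins (hypothetical helper)."""
    return len(text[:char_index].encode("utf-8"))


# Mixed Latin and CJK sample: 5 + 1 + 3 = 9 characters.
sample = "na\u00efve \u65e5\u672c\u8a9e"
char_count = len(sample)
byte_count = len(sample.encode("utf-8"))

# The first CJK character has character index 6 but starts at byte 7,
# because the i-with-diaeresis before it takes two bytes.
cjk_char_index = 6
cjk_byte_offset = utf8_byte_offset(sample, cjk_char_index)

# Encoding efficiency for text made only of CJK ideographs.
japanese = "\u65e5\u672c\u8a9e"
size_utf8 = len(japanese.encode("utf-8"))       # 3 bytes per character
size_utf16 = len(japanese.encode("utf-16-le"))  # 2 bytes per character
```

A widget that reports a cursor position in characters and a caller that indexes UTF-8 bytes would disagree on every string containing a non-ASCII character before the cursor.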
> I also think that the issue is blown out of proportions: this ONLY > happens when you use Unicode objects, and it ONLY matters when some > other part of the program uses 8-bit string objects containing > non-ASCII characters. Won't this be totally common? Most people are going to use 8-bit literals in their program text but work with Unicode data from XML parsers, COM, WebDAV, Tkinter, etc? > Given the long tradition of using different > encodings in 8-bit strings, at that point it is anybody's guess what > encoding is used, and UTF-8 is a better guess than Latin-1. If we are guessing then we are doing something wrong. My answer to the question of "default encoding" falls out naturally from a certain way of looking at text, popularized in various other languages and increasingly "the norm" on the Web. If you accept the model (a character is a character is a character), the right behavior is obvious. "\244"==u"\244" Nobody is ever going to have trouble understanding how this works. Choose simplicity! -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html From guido@python.org Tue May 2 01:53:26 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 20:53:26 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Mon, 01 May 2000 19:19:20 CDT." <390E1F08.EA91599E@prescod.net> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> Message-ID: <200005020053.UAA23665@eric.cnri.reston.va.us> Paul, we're both just saying the same thing over and over without convincing each other. 
I'll wait till someone who wasn't in this debate before chimes in.

Have you tried using this?

--Guido van Rossum (home page: http://www.python.org/~guido/)

From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net>
Message-ID: <002301bfb3d5$8fd57440$34aab5d4@hagrid>

Paul Prescod wrote:
> I would laugh my ass off if I was using Perl and it did something weird
> like this to me.

you don't have to -- in Perl 5.6, a character is a character...

does anyone on this list follow the perl-porters list? was this as controversial over in Perl land as it appears to be over here?

From Fredrik Lundh"

reading the XML namespace specification makes my brain hurt, so I thought I'd ask here before it explodes...

given the following XML snippet, what's the correct namespace for the "attribute" attribute?

I'm not smart enough to figure that out from the specification, my intuition says "no namespace", and so does James Clark's namespace note (http://www.jclark.com/xml/xmlns.htm) where

Check Status

is mapped to:

<{http://www.w3.org/TR/REC-html40}A HREF='/cgi-bin/ResStatus'
>Check Status

(slightly edited -- see the note for the full example).

but 1.5.2's xmllib doesn't agree with this:

import xmllib

class Parser(xmllib.XMLParser):

    def unknown_starttag(self, tag, attr):
        print "S", repr(tag), attr

    def unknown_endtag(self, tag):
        print "E", repr(tag)

p = Parser()
p.feed(""" """)
p.close()

gives the following output:

S 'namespace: body' {}
S 'namespace: member' {'namespace: attribute': 'value'}
E 'namespace: member'
E 'namespace: body'

instead of

S 'namespace: body' {}
S 'namespace: member' {'attribute': 'value'}
E 'namespace: member'
E 'namespace: body'

can anyone sort this out for me?
(and no, I really have to be able to use xmllib, to make sure soaplib.py works under an off-the-shelf Python distribution...) From ludvig.svenonius@excosoft.se Tue May 2 15:52:03 2000 From: ludvig.svenonius@excosoft.se (Ludvig Svenonius) Date: Tue, 2 May 2000 16:52:03 +0200 Subject: [XML-SIG] namespace headache In-Reply-To: <004201bfb441$670f67c0$34aab5d4@hagrid> Message-ID: I'm pretty sure an unprefixed attribute will default to the same namespace URI as its host element, so in the snippet: 'attribute' would have the namespace URI 'namespace:', as its host element, whereas in: it would have no namespace URI. I think I read about this somewhere in the namespace specification at W3C. This also explains why XSL-specific attributes in XSLT elements needn't be prefixed (they will conveniently default to the same namespaces as their host elements, i.e. the XSLT namespace). The same goes for XHTML, I guess. I think xmllib has it right. I have no explanation for James Clark's note however. The alternative of forcing the XML author to explicitly prefix every attribute in elements that belong to a certain namespace just to declare that they belong to the same namespace seems pretty inconvenient. -- Ludvig Svenonius Excosoft AB ludvig@excosoft.se -----Original Message----- From: xml-sig-admin@python.org [mailto:xml-sig-admin@python.org]On Behalf Of Fredrik Lundh Sent: Tuesday, May 02, 2000 4:19 PM To: xml-sig@python.org Subject: [XML-SIG] namespace headache reading the XML namespace specification makes my brain hurt, so I thought I'd ask here before it explodes... given the following XML snippet, what's the correct namespace for the "attribute" attribute? 
I'm not smart enough to figure that out from the specification, my intuition says "no namespace", and so does James Clark's namespace note (http://www.jclark.com/xml/xmlns.htm) where Check Status is mapped to: <{http://www.w3.org/TR/REC-html40}A HREF='/cgi-bin/ResStatus' >Check Status (slightly edited -- see the note for the full example). but 1.5.2's xmllib doesn't agree with this: import xmllib class Parser(xmllib.XMLParser): def unknown_starttag(self, tag, attr): print "S", repr(tag), attr def unknown_endtag(self, tag): print "E", repr(tag) p = Parser() p.feed(""" """) p.close() gives the following output: S 'namespace: body' {} S 'namespace: member' {'namespace: attribute': 'value'} E 'namespace: member' E 'namespace: body' instead of S 'namespace: body' {} S 'namespace: member' {'attribute': 'value'} E 'namespace: member' E 'namespace: body' can anyone sort this out for me? (and no, I really have to be able to use xmllib, to make sure soaplib.py works under an off-the-shelf Python distribution...) _______________________________________________ XML-SIG maillist - XML-SIG@python.org http://www.python.org/mailman/listinfo/xml-sig From Troy.Nordine@westgroup.com Tue May 2 16:55:54 2000 From: Troy.Nordine@westgroup.com (Nordine, Troy) Date: Tue, 2 May 2000 10:55:54 -0500 Subject: [XML-SIG] namespace headache Message-ID: <9DDF5FF45501D211BC22006094238FB00276C05E@elfie.int.westgroup.com> > > given the following XML snippet, what's the correct namespace > for the "attribute" attribute? > > > > > > This topic came up on XML-DEV back in early Feb. under the topic "XML Schemas Question: default namespace misses attributes". The essence of the discussion (as I understood it) was that attributes that don't have prefixes are in the namespace defined by the element, not the namespace that the element is in. See Henry Thompson's reply to the original post for a better explanation : http://www.xml.org/archives/xml-dev/2000/02/0097.html. 
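A namespace-aware parser can be used to check how unprefixed attributes are reported. The sketch below assumes a present-day Python 3 interpreter and uses `xml.etree.ElementTree` from its standard library, not the `xmllib` module discussed in this thread. The XML snippet from the original post did not survive in this archive, so both documents here are hypothetical stand-ins built around the `body`, `member` and `attribute` names visible in the quoted output; the `qualified` attribute is added purely for contrast.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in document with a prefixed namespace declaration.
PREFIXED_DOC = (
    "<ns:body xmlns:ns='namespace:'>"
    "<ns:member attribute='value' ns:qualified='other'/>"
    "</ns:body>"
)

# Hypothetical stand-in using a default namespace instead of a prefix.
DEFAULT_DOC = (
    "<body xmlns='namespace:'>"
    "<member attribute='value'/>"
    "</body>"
)

# ElementTree reports qualified names as "{uri}local", the same
# notation used in James Clark's note.
member = ET.fromstring(PREFIXED_DOC)[0]
member_tag = member.tag
member_attrs = dict(member.attrib)

default_member = ET.fromstring(DEFAULT_DOC)[0]
default_tag = default_member.tag
default_attrs = dict(default_member.attrib)
```

In both documents the element name carries the namespace URI, while the unprefixed `attribute` key is reported with no namespace; only the explicitly prefixed attribute is qualified.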
> I'm not smart enough to figure that out from the specification, > my intuition says "no namespace", and so does James Clark's > namespace note (http://www.jclark.com/xml/xmlns.htm) where > > > Check Status > > > is mapped to: > > > <{http://www.w3.org/TR/REC-html40}A HREF='/cgi-bin/ResStatus' > >Check Status > > > (slightly edited -- see the note for the full example). > > but 1.5.2's xmllib doesn't agree with this: > So as far as I can tell, James is right and xmllib is wrong. Troy From paul@prescod.net Tue May 2 17:51:34 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 02 May 2000 11:51:34 -0500 Subject: [XML-SIG] namespace headache References: <004201bfb441$670f67c0$34aab5d4@hagrid> Message-ID: <390F0796.3515445C@prescod.net> Fredrik Lundh wrote: > > ... > > I'm not smart enough to figure that out from the specification, > my intuition says "no namespace", and so does James Clark's > namespace note (http://www.jclark.com/xml/xmlns.htm) where Your intuition is right. > (and no, I really have to be able to use xmllib, to make sure > soaplib.py works under an off-the-shelf Python distribution...) You'll have to fix xmllib then. Attributes without prefixes have no namespace. They neither inherit their namespace nor use the default namespace. -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html From paul@prescod.net Tue May 2 18:21:04 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 02 May 2000 12:21:04 -0500 Subject: [XML-SIG] namespace headache References: Message-ID: <390F0E80.FD3620F8@prescod.net> It is a reasonable convention, built *on top of* the XML namespaces specification, to treat "href" on an "html:a" element as equivalent to "html:href". You could imagine this as another layer termed "Simplified Attribute-Inherited Namespaces". 
But such a document doesn't exist...some particular XML vocabularies just "work that way." -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html From tgraham@mulberrytech.com Tue May 2 19:18:13 2000 From: tgraham@mulberrytech.com (Tony Graham) Date: Tue, 2 May 2000 14:18:13 -0400 (EST) Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate (Guido van Rossum) In-Reply-To: <20000502155608.6D41C1CD6E@dinsdale.python.org> References: <20000502155608.6D41C1CD6E@dinsdale.python.org> Message-ID: <14607.7141.80000.709929@menteith.com> I subscribe to the Digest, so I'm a bit behind... At 2 May 2000 11:56 -0400, xml-sig-request@python.org wrote: > From: Guido van Rossum > Date: Mon, 01 May 2000 17:32:38 -0400 > Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate ... > I have a bunch of good reasons (I think) for liking UTF-8: it allows > you to convert between Unicode and 8-bit strings without losses, Tcl UTF-8 is variable-length 8-bit encoding of Unicode characters. The only characters that cleanly convert between UTF-8 and fixed-length 8-bit strings are the ASCII characters. > uses it (so displaying Unicode in Tkinter *just* *works*...), it is > not Western-language-centric. UTF-8 is Western-language-centric. In fact, it's practically English-centric since only the ASCII characters are 1 byte per character, the characters for writing most European languages plus Arabic and Hebrew are 2 bytes per character, and the rest -- including Hangul and the CJK ideographs -- are 3 bytes per character. Japanese text files, for example, are 50% larger as UTF-8 text than as UTF-16 text. > Another reason: while you may claim that your (and /F's, and Just's) > preferred solution doesn't enter into the encodings issue, I claim it > does: Latin-1 is just as much an encoding as any other one. 
> > I claim that as long as we're using an encoding we might as well use > the most accepted 8-bit encoding of Unicode as the default encoding. There have been other proposals for variable-length 8-bit transformation formats of Unicode characters, but UTF-8 is the only one that is specified in the Unicode Standard and ISO/IEC 10646. There is less hassle with characters outside the 16-bit Basic Multilingual Plane (BMP) with UTF-8 than with, for example, UTF-16. When working with UTF-8, you have to consider that all characters are encoded as varying numbers of bytes. When working with UTF-16, it's easy to assume that all characters are 16-bit and write your code accordingly, but there will shortly be characters defined outside of the BMP -- including math characters used in MathML and new but essential CJK ideographs -- so you have to work with UTF-16 data as being "16-bit except when it isn't". It shouldn't matter what encoding or transformation format is used for the internal representation of strings. Python should be able to read and write files in a number of encodings so that it plays well with others. I compared eight languages in the "Programming Language Support" chapter of "Unicode: A Primer" (ISBN: 0-7645-4625-2) and found that there was no Unicode encoding that all eight languages could read and write. Playing well with others also means reading and writing whatever non-Unicode encoding a user keeps his data in. Python should also be able to read Python programs in a number of encodings, including UTF-8 and UTF-16, plus it should include a mechanism for referencing Unicode characters by number (or name) within strings. > I also think that the issue is blown out of proportions: this ONLY > happens when you use Unicode objects, and it ONLY matters when some > other part of the program uses 8-bit string objects containing > non-ASCII characters. 
Given the long tradition of using different > encodings in 8-bit strings, at that point it is anybody's guess what > encoding is used, and UTF-8 is a better guess than Latin-1. Given the long tradition of using different encodings in 8-bit strings, surely there's no safe assumption about the encoding in any 8-bit string? ISO 8859-1 (Latin-1) is being superseded by ISO 8859-15 (which shuffled a few things and added the euro); Windows' CP 1252 isn't really ISO 8859-1 despite how some mailers and HTML editors label it; and even I've processed multi-byte Japanese, Chinese, and Korean text using 8-bit scripting languages. Perl, for example, has "byte" and "utf8" pragmata for controlling whether strings are treated as fixed-length 1-byte characters or as variable-length UTF-8 characters, with the current default being "byte". Tcl, to use another example, can read and write files in a number of encodings, but it defaults to using the system encoding, or ISO 8859-1 if it can't determine the system encoding. Python, similarly, should not make assumptions about the encoding used in strings in existing programs and should be flexible in supporting the encodings that people do use. Regards, Tony Graham ====================================================================== Tony Graham mailto:tgraham@mulberrytech.com Mulberry Technologies, Inc. 
http://www.mulberrytech.com 17 West Jefferson Street Direct Phone: 301/315-9632 Suite 207 Phone: 301/315-9631 Rockville, MD 20850 Fax: 301/315-8285 ---------------------------------------------------------------------- Mulberry Technologies: A Consultancy Specializing in SGML and XML ====================================================================== From paul@prescod.net Tue May 2 19:23:24 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 02 May 2000 13:23:24 -0500 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> Message-ID: <390F1D1C.6EAF7EAD@prescod.net> Guido van Rossum wrote: > > .... > > Have you tried using this? Yes. I haven't had large problems with it. As long as you know what is going on, it doesn't usually hurt anything because you can just explicitly set up the decoding you want. It's like the int division problem. You get bitten a few times and then get careful. It's the naive user who will be surprised by these random UTF-8 decoding errors. That's why this is NOT a convenience issue (are you listening MAL???). It's a short and long term simplicity issue. There are lots of languages where it is de rigeur to discover and work around inconvenient and confusing default behaviors. I just don't think that we should be ADDING such behaviors. -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. 
- http://www.cs.yale.edu/~perlis-alan/quotes.html From guido@python.org Tue May 2 19:56:34 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 14:56:34 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 13:23:24 CDT." <390F1D1C.6EAF7EAD@prescod.net> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> Message-ID: <200005021856.OAA26104@eric.cnri.reston.va.us> > It's the naive user who will be surprised by these random UTF-8 decoding > errors. > > That's why this is NOT a convenience issue (are you listening MAL???). > It's a short and long term simplicity issue. There are lots of languages > where it is de rigeur to discover and work around inconvenient and > confusing default behaviors. I just don't think that we should be ADDING > such behaviors. So what do you think of my new proposal of using ASCII as the default "encoding"? It takes care of "a character is a character" but also (almost) guarantees an error message when mixing encoded 8-bit strings with Unicode strings without specifying an explicit conversion -- *any* 8-bit byte with the top bit set is rejected by the default conversion to Unicode. I think this is less confusing than Latin-1: when an unsuspecting user is reading encoded text from a file into 8-bit strings and attempts to use it in a Unicode context, an error is raised instead of producing garbage Unicode characters. 
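The behaviour proposed here can be sketched in a few lines. This is only an illustration in present-day Python 3 (an assumption; the thread concerns Python 1.6), and the helper name to_unicode is hypothetical. It shows a strict ASCII default rejecting any byte with the top bit set, while a permissive Latin-1 guess silently turns UTF-8 data into garbage characters.

```python
# Illustrative sketch in modern Python 3; to_unicode is a hypothetical helper,
# not an API from the Python 1.6 discussion.
def to_unicode(raw: bytes, encoding: str = "ascii") -> str:
    """Decode 8-bit data, defaulting to strict ASCII."""
    return raw.decode(encoding)

ascii_ok = to_unicode(b"abc")              # pure ASCII converts silently

try:
    to_unicode(b"fran\xe7ais")             # 0xE7 has the top bit set
    rejected = False
except UnicodeDecodeError:
    rejected = True                        # the ASCII default refuses to guess

# An explicit conversion still works when the caller knows the encoding.
explicit = to_unicode(b"fran\xe7ais", "latin-1")

# A Latin-1 guess applied to UTF-8 data never fails; it just yields garbage.
garbage = to_unicode("\u00e7".encode("utf-8"), "latin-1")
```

With an ASCII default the mistake surfaces at the point of conversion, whereas a Latin-1 default would pass the two-character garbage string along unnoticed.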
It encourages the use of Unicode strings for everything beyond ASCII -- there's no way around ASCII since that's the source encoding etc., but Latin-1 is an inconvenient default in most parts of the world. ASCII is accepted everywhere as the base character set (e.g. for email and for text-based protocols like FTP and HTTP), just like English is the one natural language that we can all sue to communicate (to some extent). --Guido van Rossum (home page: http://www.python.org/~guido/) From dieter@handshake.de Tue May 2 19:44:41 2000 From: dieter@handshake.de (Dieter Maurer) Date: Tue, 2 May 2000 20:44:41 +0200 (CEST) Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <390E1F08.EA91599E@prescod.net> References: <390E1F08.EA91599E@prescod.net> Message-ID: <14607.7798.510723.419556@lindm.dm> Paul Prescod writes: > The fact that my proposal has the same effect as making Latin-1 the > "default encoding" is a near-term side effect of the definition of > Unicode. My long term proposal is to do away with the concept of 8-bit > strings (and thus, conversions from 8-bit to Unicode) altogether. One > string to rule them all! Why must this be a long term proposal? I would find it quite attractive if * the old string type became an immutable list of bytes * automatic conversion between byte lists and unicode strings were performed via user customizable conversion functions (a la __import__). Dieter From jkraai@murlmail.com Tue May 2 20:46:49 2000 From: jkraai@murlmail.com (jkraai@murlmail.com) Date: Tue, 2 May 2000 14:46:49 -0500 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate Message-ID: <200005021946.OAA03609@www.polytopic.com> The ever quotable Guido: > English is the one natural language that we can all sue to communicate
From paul@prescod.net Tue May 2 20:23:27 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 02 May 2000 14:23:27 -0500 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> Message-ID: <390F2B2F.2953C72D@prescod.net> Guido van Rossum wrote: > > ... > > So what do you think of my new proposal of using ASCII as the default > "encoding"? I can live with it. I am mildly uncomfortable with the idea that I could write a whole bunch of software that works great until some European inserts one of their name characters. Nevertheless, being hard-assed is better than being permissive because we can loosen up later. What do we do about str( my_unicode_string )? Perhaps escape the Unicode characters with backslashed numbers? -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html From guido@python.org Tue May 2 20:58:20 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 15:58:20 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 14:23:27 CDT."
<390F2B2F.2953C72D@prescod.net> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> Message-ID: <200005021958.PAA26760@eric.cnri.reston.va.us> [me] > > So what do you think of my new proposal of using ASCII as the default > > "encoding"? [Paul] > I can live with it. I am mildly uncomfortable with the idea that I could > write a whole bunch of software that works great until some European > inserts one of their name characters. Better than that when some Japanese insert *their* name characters and it produces gibberish instead. > Nevertheless, being hard-assed is > better than being permissive because we can loosen up later. Exactly -- just as nobody should *count* on 10**10 raising OverflowError, nobody (except maybe parts of the standard library :-) should *count* on unicode("\347") raising ValueError. I think that's fine. > What do we do about str( my_unicode_string )? Perhaps escape the Unicode > characters with backslashed numbers? Hm, good question. Tcl displays unknown characters as \x or \u escapes. I think this may make more sense than raising an error. But there must be a way to turn on Unicode-awareness on e.g. stdout and then printing a Unicode object should not use str() (as it currently does). --Guido van Rossum (home page: http://www.python.org/~guido/) From uogbuji@fourthought.com Tue May 2 19:33:18 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 02 May 2000 12:33:18 -0600 Subject: [XML-SIG] namespace headache In-Reply-To: Message from "Fredrik Lundh" of "Tue, 02 May 2000 16:19:12 +0200." 
<004201bfb441$670f67c0$34aab5d4@hagrid> Message-ID: <200005021833.MAA02854@localhost.localdomain> > reading the XML namespace specification makes my brain > hurt, so I thought I'd ask here before it explodes... > > given the following XML snippet, what's the correct namespace > for the "attribute" attribute? > > > > > > > I'm not smart enough to figure that out from the specification, > my intuition says "no namespace", and so does James Clark's > namespace note (http://www.jclark.com/xml/xmlns.htm) where [snip] > class Parser(xmllib.XMLParser): > def unknown_starttag(self, tag, attr): > print "S", repr(tag), attr > def unknown_endtag(self, tag): > print "E", repr(tag) > > p = Parser() > p.feed(""" > > > > > """) > p.close() > > gives the following output: > > S 'namespace: body' {} > S 'namespace: member' {'namespace: attribute': 'value'} > E 'namespace: member' > E 'namespace: body' > > instead of > > S 'namespace: body' {} > S 'namespace: member' {'attribute': 'value'} > E 'namespace: member' > E 'namespace: body' > > can anyone sort this out for me? Easy enough. Your instinct (and Mr. Clark) are right and xmllib is wrong. > (and no, I really have to be able to use xmllib, to make sure > soaplib.py works under an off-the-shelf Python distribution...) So the next part of my comment is of no use to you but I'll make it anyway: 4DOM does get it right: [uogbuji@borgia uogbuji]$ python Python 1.5.2 (#1, Mar 21 2000, 18:17:19) [GCC 2.95.3 19991030 (prerelease)] on linux-i386 Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam >>> from Ft.Dom.Ext.Reader import Sax2 >>> source = """ ... ... ... """ >>> doc = Sax2.FromXml(source) >>> member = doc.documentElement.childNodes[1] >>> member >>> attr = member.attributes[0] >>> attr >>> att.namespaceURI >>> attr.localName 'attribute' >>> attr.nodeName 'attribute' >>> import Ft.Dom.Ext >>> Ft.Dom.Ext.GetAllNs(attr) {'ns': 'namespace:', 'xml': 'http://www.w3.org/XML/1998/namespace'} >>> Hmm. 
I just noticed that "1 attributes and 1 children". Silliness. I'll sort that out... -- Uche Ogbuji Senior Software Engineer uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-9036, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From uogbuji@fourthought.com Tue May 2 19:41:40 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 02 May 2000 12:41:40 -0600 Subject: [XML-SIG] namespace headache In-Reply-To: Message from "Ludvig Svenonius" of "Tue, 02 May 2000 16:52:03 +0200." Message-ID: <200005021841.MAA02895@localhost.localdomain> > I'm pretty sure an unprefixed attribute will default to the same namespace > URI as its host element, so in the snippet: No! A million times "no"! There is _no_ defaulting for attributes. None whatsoever. This is a major XML FAQ and I refer everyone to James Clark's excellent note, which /F already mentioned. http://www.jclark.com/xml/xmlns.htm > > > > > > 'attribute' would have the namespace URI 'namespace:', as its host element, > whereas in: > > > > > > > it would have no namespace URI. I think I read about this somewhere in the > namespace specification at W3C. This also explains why XSL-specific > attributes in XSLT elements needn't be prefixed (they will conveniently > default to the same namespaces as their host elements, i.e. the XSLT > namespace). The same goes for XHTML, I guess. I think xmllib has it right. I > have no explanation for James Clark's note however. The alternative of > forcing the XML author to explicitly prefix every attribute in elements that > belong to a certain namespace just to declare that they belong to the same > namespace seems pretty inconvenient. I'll grant that you pretty much explained the reason for much of the confusion. However, the fact that XSLT uses unprefixed attributes has no bearing on their namespace status. 
XSLT does this as a convenience: after all, XSLT processors are element-oriented (if that means anything), and will recognize the standard attributes of an XSL instruction without help from namespaces. Since there is no attr namespace defaulting, my guess is that unprefixed instruction attributes is a way to make XSLT a tad less verbose, but no more than that. -- Uche Ogbuji Senior Software Engineer uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-9036, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From uogbuji@fourthought.com Tue May 2 23:01:49 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 02 May 2000 16:01:49 -0600 Subject: [XML-SIG] namespace headache In-Reply-To: Message from Uche Ogbuji of "Tue, 02 May 2000 12:33:18 MDT." <200005021833.MAA02854@localhost.localdomain> Message-ID: <200005022201.QAA03705@localhost.localdomain> > So the next part of my comment is of no use to you but I'll make it anyway: > 4DOM does get it right: > > [uogbuji@borgia uogbuji]$ python > Python 1.5.2 (#1, Mar 21 2000, 18:17:19) [GCC 2.95.3 19991030 (prerelease)] > on linux-i386 > Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam > >>> from Ft.Dom.Ext.Reader import Sax2 > >>> source = """ > ... > ... > ... """ > >>> doc = Sax2.FromXml(source) > >>> member = doc.documentElement.childNodes[1] > >>> member > children> > >>> attr = member.attributes[0] > >>> attr > > >>> att.namespaceURI > >>> attr.localName > 'attribute' > >>> attr.nodeName > 'attribute' > >>> import Ft.Dom.Ext > >>> Ft.Dom.Ext.GetAllNs(attr) > {'ns': 'namespace:', 'xml': 'http://www.w3.org/XML/1998/namespace'} > >>> What I get for C n P-ing from a terminal screen chunks at a time. Looks like not only did I swipe a typo I'd meant to leave out, but I left out 4DOM's answer to "attr.namespaceURI". For the record, it is '', but of course you don't have to take my word for it. 
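For readers without 4DOM, the same check can be sketched with xml.etree.ElementTree from the present-day standard library (an assumption: that module did not ship with the Python 1.5.2 discussed here). The two documents below are hypothetical stand-ins that reuse the 'namespace:' URI and the element names from the output quoted in this thread; the original snippets did not survive in the archive.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in document: prefixed elements, unprefixed attribute.
prefixed = "<ns:body xmlns:ns='namespace:'><ns:member attribute='value'/></ns:body>"
member = ET.fromstring(prefixed)[0]

# ElementTree spells a namespaced name as '{uri}local'.
member_tag = member.tag            # the element is in 'namespace:'
member_attrs = dict(member.attrib) # the attribute key carries no '{uri}' part

# A default namespace declaration applies to elements, not to attributes.
defaulted = "<body xmlns='namespace:'><member attribute='value'/></body>"
member2 = ET.fromstring(defaulted)[0]
member2_tag = member2.tag
member2_attrs = dict(member2.attrib)
```

In both cases the element name is qualified and the attribute name is not, which is the behaviour James Clark's note describes.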
Give it a try. -- Uche Ogbuji Senior Software Engineer uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-9036, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From mal@lemburg.com Wed May 3 00:11:37 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 03 May 2000 01:11:37 +0200 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> Message-ID: <390F60A9.A3AA53A9@lemburg.com> Guido van Rossum wrote: > > > > So what do you think of my new proposal of using ASCII as the default > > > "encoding"? How about using unicode-escape or raw-unicode-escape as default encoding ? (They would have to be adapted to disallow Latin-1 char input, though.) The advantage would be that they are compatible with ASCII while still providing loss-less conversion and since they use escape characters, you can even read them using an ASCII based editor. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paul@prescod.net Tue May 2 23:54:41 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 02 May 2000 17:54:41 -0500 Subject: [XML-SIG] RAX Message-ID: <390F5CB1.FBE70A92@prescod.net> RAX has been getting many good reviews. I propose it for inclusion in the xml-sig as another "higher level" API. 
-- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html From guido@python.org Wed May 3 03:31:21 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 22:31:21 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Wed, 03 May 2000 01:11:37 +0200." <390F60A9.A3AA53A9@lemburg.com> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com> Message-ID: <200005030231.WAA02678@eric.cnri.reston.va.us> > Guido van Rossum wrote: > > > > So what do you think of my new proposal of using ASCII as the default > > > > "encoding"? [MAL] > How about using unicode-escape or raw-unicode-escape as > default encoding ? (They would have to be adapted to disallow > Latin-1 char input, though.) > > The advantage would be that they are compatible with ASCII > while still providing loss-less conversion and since they > use escape characters, you can even read them using an > ASCII based editor. No, the backslash should mean itself when encoding from ASCII to Unicode. 
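The objection can be made concrete with a small sketch, written in present-day Python 3 as an assumption (the codec names "ascii" and "unicode_escape" are the ones that exist today). Under a plain ASCII conversion a backslash stays a backslash, while an escape-based codec reinterprets it.

```python
# Six ASCII bytes: a backslash followed by 'u00e7'.
data = b"\\u00e7"

# Plain ASCII conversion: every byte maps to the same character.
plain = data.decode("ascii")

# Escape-based conversion: the backslash sequence becomes one character.
escaped = data.decode("unicode_escape")

lengths = (len(plain), len(escaped))
```

Existing 8-bit strings that contain literal backslashes, such as Windows paths or regular expressions, would therefore change meaning under an escape-based default.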
--Guido van Rossum (home page: http://www.python.org/~guido/) From tim_one@email.msn.com Wed May 3 06:19:28 2000 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 3 May 2000 01:19:28 -0400 Subject: [XML-SIG] RE: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <017d01bfb3bc$c3734c00$34aab5d4@hagrid> Message-ID: <000401bfb4bf$27ec1600$622d153f@tim> [Fredrik Lundh] > ... > (if you like, I can post more "fun with unicode" messages ;-) By all means! Exposing a gotcha to ridicule does more good than a dozen abstract arguments. But next time stoop to explaining what it is that's surprising . From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com> Message-ID: <01ed01bfb4df$8feddb60$34aab5d4@hagrid> M.-A. Lemburg wrote: > Guido van Rossum wrote: > > > > > > So what do you think of my new proposal of using ASCII as the default > > > > "encoding"? > > How about using unicode-escape or raw-unicode-escape as > default encoding ? (They would have to be adapted to disallow > Latin-1 char input, though.) > > The advantage would be that they are compatible with ASCII > while still providing loss-less conversion and since they > use escape characters, you can even read them using an > ASCII based editor. umm. if you disallow latin-1 characters, how can you call this one loss-less? looks like political correctness taken to an entirely new level... From ht@cogsci.ed.ac.uk Wed May 3 10:59:28 2000 From: ht@cogsci.ed.ac.uk (Henry S.
Thompson) Date: 03 May 2000 10:59:28 +0100 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Guido van Rossum's message of "Mon, 01 May 2000 20:53:26 -0400" References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> Message-ID: Guido van Rossum writes: > Paul, we're both just saying the same thing over and over without > convincing each other. I'll wait till someone who wasn't in this > debate before chimes in. OK, I've never contributed to this discussion, but I have a long history of shipping widely used Python/Tkinter/XML tools (see my homepage). I care _very_ much that heretofore I have been unable to support full XML because of the lack of Unicode support in Python. I've already started playing with 1.6a2 for this reason. I notice one apparent mis-communication between the various contributors: Treating narrow-strings as consisting of UNICODE code points <= 255 is not necessarily the same thing as making Latin-1 the default encoding. I don't think on Paul and Fredrik's account encoding are relevant to narrow-strings at all. I'd rather go right away to the coherent position of byte-arrays, narrow-strings and wide-strings. Encodings are only relevant to conversion between byte-arrays and strings. Decoding a byte-array with a UTF-8 encoding into a narrow string might cause overflow/truncation, just as decoding a byte-array with a UTF-8 encoding into a wide-string might. The fact that decoding a byte-array with a Latin-1 encoding into a narrow-string is a memcopy is just a side-effect of the courtesy of the UNICODE designers wrt the code points between 128 and 255. 
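A short sketch of the two claims in the paragraph above, in present-day Python 3 (an assumption; Python 3's bytes and str stand in for the byte-array and string types discussed here, and the 255 limit models a hypothetical narrow string). Latin-1 decoding maps each byte to the code point with the same number, while UTF-8 decoding can produce code points that would not fit in a narrow string.

```python
# Every possible byte value, decoded as Latin-1.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")

# Each byte value equals the resulting code point: effectively a copy.
is_identity = [ord(ch) for ch in as_text] == list(all_bytes)

# Three UTF-8 bytes that decode to a single code point above 255 (U+20AC).
euro = b"\xe2\x82\xac".decode("utf-8")
code_point = ord(euro)

# A narrow (8-bit) string could not hold this character without
# overflow or truncation.
fits_in_narrow = code_point <= 255
```

The identity mapping is a property of how the first 256 Unicode code points were assigned, not of any default encoding chosen by the language.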
This is effectively the way our C-based XML toolset (which we embed in Python) works today -- we build an 8-bit version which uses char* strings, and a 16-bit version which uses unsigned short* strings, and convert from/to byte-streams in any supported encoding at the margins. I'd like to keep byte-arrays at the margins in Python as well, for all the reasons advanced by Paul and Fredrik. I think treating existing strings as a sort of pun between narrow-strings and byte-arrays is a recipe for ongoing confusion. ht -- Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From guido@python.org Wed May 3 13:16:56 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 08:16:56 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "03 May 2000 10:59:28 BST." References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> Message-ID: <200005031216.IAA03274@eric.cnri.reston.va.us> [Henry S. Thompson] > OK, I've never contributed to this discussion, but I have a long > history of shipping widely used Python/Tkinter/XML tools (see my > homepage). I care _very_ much that heretofore I have been unable to > support full XML because of the lack of Unicode support in Python. > I've already started playing with 1.6a2 for this reason. Thanks for chiming in! 
> I notice one apparent mis-communication between the various > contributors: > > Treating narrow-strings as consisting of UNICODE code points <= 255 is > not necessarily the same thing as making Latin-1 the default encoding. > I don't think on Paul and Fredrik's account encoding are relevant to > narrow-strings at all. I agree that's what they are trying to tell me. > I'd rather go right away to the coherent position of byte-arrays, > narrow-strings and wide-strings. Encodings are only relevant to > conversion between byte-arrays and strings. Decoding a byte-array > with a UTF-8 encoding into a narrow string might cause > overflow/truncation, just as decoding a byte-array with a UTF-8 > encoding into a wide-string might. The fact that decoding a > byte-array with a Latin-1 encoding into a narrow-string is a memcopy > is just a side-effect of the courtesy of the UNICODE designers wrt the > code points between 128 and 255. > > This is effectively the way our C-based XML toolset (which we embed in > Python) works today -- we build an 8-bit version which uses char* > strings, and a 16-bit version which uses unsigned short* strings, and > convert from/to byte-streams in any supported encoding at the margins. > > I'd like to keep byte-arrays at the margins in Python as well, for all > the reasons advanced by Paul and Fredrik. > > I think treating existing strings as a sort of pun between > narrow-strings and byte-arrays is a recipe for ongoing confusion. Very good analysis. Unfortunately this is where we're stuck, until we have a chance to redesign this kind of thing from scratch. Python 1.5.2 programs use strings for byte arrays probably as much as they use them for character strings. This is because way back in 1990, when I was designing Python, I wanted to have the smallest set of basic types, but I also wanted to be able to manipulate byte arrays somewhat.
Influenced by K&R C, I chose to make strings and string I/O 8-bit clean so that you could read a binary "string" from a file, manipulate it, and write it back to a file, regardless of whether it was character or binary data. This model has never been challenged until now. I agree that the Java model (byte arrays and strings) or perhaps your proposed model (byte arrays, narrow and wide strings) looks better. But, although Python has had rudimentary support for byte arrays for a while (the array module, introduced in 1993), the majority of Python code manipulating binary data still uses string objects. My ASCII proposal is a compromise that tries to be fair to both uses for strings. Introducing byte arrays as a more fundamental type has been on the wish list for a long time -- I see no way to introduce this into Python 1.6 without totally botching the release schedule (June 1st is very close already!). I'd like to be able to move on, there are other important things still to be added to 1.6 (Vladimir's malloc patches, Neil's GC, Fredrik's completed sre...). For 1.7 (which should happen later this year) I promise I'll reopen the discussion on byte arrays. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Wed May 3 14:06:27 2000 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Wed, 03 May 2000 15:06:27 +0200 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com> <01ed01bfb4df$8feddb60$34aab5d4@hagrid> Message-ID: <39102453.6923B10@lemburg.com> Fredrik Lundh wrote: > > M.-A. Lemburg wrote: > > Guido van Rossum wrote: > > > > > > > > So what do you think of my new proposal of using ASCII as the default > > > > > "encoding"? > > > > How about using unicode-escape or raw-unicode-escape as > > default encoding ? (They would have to be adapted to disallow > > Latin-1 char input, though.) > > > > The advantage would be that they are compatible with ASCII > > while still providing loss-less conversion and since they > > use escape characters, you can even read them using an > > ASCII based editor. > > umm. if you disallow latin-1 characters, how can you call this > one loss-less? [Guido didn't like this one, so its probably moot investing any more time on this...] I meant that the unicode-escape codec should only take ASCII characters as input and disallow non-escaped Latin-1 characters. Anyway, I'm out of this discussion... I'll wait a week or so until things have been sorted out. 
Have fun, -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From dick.wall@bigfoot.com Wed May 3 14:21:44 2000 From: dick.wall@bigfoot.com (Richard Wall) Date: Wed, 03 May 2000 09:21:44 -0400 Subject: [XML-SIG] Guidance sought Message-ID: <391027E8.BCC39E4F@bigfoot.com> Hello all, I know that this is partially my own doing, but I have not had cause to really get into XML in python up until now. Suddenly I find myself with a need to do it, and everything I know about python tells me it ought to be the natural choice with XML. However I am finding the information about the 5 or 6 different approaches to XML confusing when looking for a place to start. Probably the easiest thing is to describe what I am trying to do, and maybe you guys can point me in the right direction of what to learn. We have an XML java interface that we use at my company to publish and subscribe data from one of our systems. The output is created directly from the contents of Java objects, and represents exactly the structure of the object model in Java. All objects in this data are wholly contained within parent objects, that is to say that there are no references to other objects to deal with. The basic objects are things like Case and Account, and these have Attributes like Name and value and the like. An example output from a simple class might look something like this: - Allowance For Funds Used During Construction Account 1.1 - _AccountCategory AFUDC value - value ImpactDDItem 1.1 - Inputs _AccountType 1 Label Allowance For Funds Used During Construction I have been able to get the whole thing loaded in to python very easily using the DOM xml stuff (it was very easy actually) but I have a feeling that writing my own document handler would be a better way to do this.
There are a large number of different classes in the system and I want to make classes responsible for their own xml importing and exporting, having them recognize and fish out attributes for themselves, and create child objects as necessary to handle embedded objects in the XML. Is this the right approach, and if so, are there any tutorials covering this? The majority of stuff I have seen so far tends to deal with batch processing of XML with python, and what I really want to do in this case is to import this XML document directly into an equivalent python object model to the java one from which it came. I understand that I will have to convert Java types to python, but that should be pretty easy (java.util.Hashtables to dictionaries, java.lang.String to string). In fact I am thinking that for the hashtable in this case, the __dict__ for the class can be set directly (so that the attribute key/value pairs simply become python attributes). Any pointers would be greatly appreciated, even if it is of the form of "It's right here in this tutorial, moron!". Thanks Dick -- dick.wall@bigfoot.com - Home dwall@newenergyassoc.com - Work QuaintRcky - AIM From ken@bitsko.slc.ut.us Wed May 3 15:58:16 2000 From: ken@bitsko.slc.ut.us (Ken MacLeod) Date: 03 May 2000 09:58:16 -0500 Subject: [XML-SIG] RAX In-Reply-To: Paul Prescod's message of "Tue, 02 May 2000 17:54:41 -0500" References: <390F5CB1.FBE70A92@prescod.net> Message-ID: Paul Prescod writes: > RAX has been getting many good reviews. I propose it for inclusion > in the xml-sig as another "higher level" API. One of the main features of RAX is that it implements a "pull" style event interface rather than the "push" style interface that SAX implements currently. Since that's good for a reason, it may also be good if there were a version of SAX that _was_ pull-style so that it could be used in applications like RAX.
In this way, one could stack SAX modules and filters in a pull-style chain as an alternative to a push-style chain. -- Ken From larsga@garshol.priv.no Wed May 3 18:15:28 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 03 May 2000 19:15:28 +0200 Subject: [XML-SIG] RAX In-Reply-To: References: <390F5CB1.FBE70A92@prescod.net> Message-ID: * Ken MacLeod | | Since that's good for a reason, it may also be good if there were a | version of SAX that _was_ pull-style so that it could be used in | applications like RAX. In this way, one could stack SAX modules and | filters in a pull-style chain as an alternative to a push-style chain. I agree with this and have been thinking about this for a while, but I'm not sure how we would actually implement this. The only XML parser we have that supports a pull-style interface is RXP, and I'm not sure if we can convert the other interfaces to pull-style interfaces in a sensible way (at least not on a level as low as SAX) without storing the entire sequence of events. Good ideas are welcome... --Lars M. From ken@bitsko.slc.ut.us Wed May 3 19:19:26 2000 From: ken@bitsko.slc.ut.us (Ken MacLeod) Date: 03 May 2000 13:19:26 -0500 Subject: [XML-SIG] RAX In-Reply-To: Lars Marius Garshol's message of "03 May 2000 19:15:28 +0200" References: <390F5CB1.FBE70A92@prescod.net> Message-ID: Lars Marius Garshol writes: > * Ken MacLeod > | > | Since that's good for a reason, it may also be good if there were a > | version of SAX that _was_ pull-style so that it could be used in > | applications like RAX. In this way, one could stack SAX modules and > | filters in a pull-style chain as an alternative to a push-style chain. > > I agree with this and have been thinking about this for a while, but > I'm not sure how we would actually implement this. 
The only XML parser > we have that supports a pull-style interface is RXP, and I'm not sure > if we can convert the other interfaces to pull-style interfaces in a > sensible way (at least not on a level as low as SAX) without storing > the entire sequence of events. > > Good ideas are welcome... I don't think existing push-style parsers need to be converted, or implied that they could be used in a pull-style chain. I was thinking more of creating the interface definition of a pull-style SAX parser and allowing for new parsers to be developed rather than a wholesale conversion of push-style parsers. RXP and PYX are both good candidates for pull-style parsing. I think an EasySAX-like approach would work best, where next_event() returns a DOM/mini-DOM node:

    node, is_end = pull_parser.next()
    while node != None:
        if node.nodeType == ELEMENT:
            if is_end:
                """ do end element processing """
            else:
                """ do start element processing """
        elif node.nodeType == TEXT:
            """ do text processing """
        node, is_end = pull_parser.next()

Most of the rest of the SAX interface (sources, creating parsers, exceptions, locators) could probably be used without change. If two threads are used, any push-style parser can be used to queue events to be read by a pull-style adapter in the other thread. -- Ken From jday@csihq.com Wed May 3 19:11:15 2000 From: jday@csihq.com (John Day) Date: Wed, 03 May 2000 14:11:15 -0400 Subject: [XML-SIG] RAX In-Reply-To: References: <390F5CB1.FBE70A92@prescod.net> Message-ID: <3.0.6.32.20000503141115.0091ec00@mail.csihq.com> What is RAX? I did a search for "rax" at the Cover XML site and came up with 0 hits. Evidently some SAX-like API for XML? Could someone provide some links please. Tnx, John Day At 07:15 PM 5/3/00 +0200, Lars Marius Garshol wrote: > >* Ken MacLeod >| >| Since that's good for a reason, it may also be good if there were a >| version of SAX that _was_ pull-style so that it could be used in >| applications like RAX.
In this way, one could stack SAX modules and >| filters in a pull-style chain as an alternative to a push-style chain. > >I agree with this and have been thinking about this for a while, but >I'm not sure how we would actually implement this. The only XML parser >we have that supports a pull-style interface is RXP, and I'm not sure >if we can convert the other interfaces to pull-style interfaces in a >sensible way (at least not on a level as low as SAX) without storing >the entire sequence of events. > >Good ideas are welcome... > >--Lars M. > > >_______________________________________________ >XML-SIG maillist - XML-SIG@python.org >http://www.python.org/mailman/listinfo/xml-sig > From robin@alldunn.com Wed May 3 19:37:49 2000 From: robin@alldunn.com (Robin Dunn) Date: Wed, 3 May 2000 11:37:49 -0700 Subject: [XML-SIG] RAX References: <390F5CB1.FBE70A92@prescod.net> <3.0.6.32.20000503141115.0091ec00@mail.csihq.com> Message-ID: <016501bfb52e$b0265b10$3225d2d1@ARES> > What is RAX? I did a search for "rax" at the Cover XML site > and came up with 0 hits. Evidently some SAX-like API for XML? > Could someone provide some links please. > http://xml.com/pub/2000/04/26/rax/index.html -- Robin Dunn Software Craftsman robin@AllDunn.com http://wxpython.org Java give you jitters? http://wxpros.com Relax with wxPython! From paul@prescod.net Wed May 3 23:23:25 2000 From: paul@prescod.net (Paul Prescod) Date: Wed, 03 May 2000 15:23:25 -0700 Subject: [XML-SIG] RAX References: <390F5CB1.FBE70A92@prescod.net> Message-ID: <3910A6DD.642B1963@prescod.net> Lars Marius Garshol wrote: > > ... > > I agree with this and have been thinking about this for a while, but > I'm not sure how we would actually implement this. The only XML parser > we have that supports a pull-style interface is RXP, and I'm not sure > if we can convert the other interfaces to pull-style interfaces in a > sensible way (at least not on a level as low as SAX) without storing > the entire sequence of events. 
Sean's pyx does that. Threads are another solution, but not a very efficient one. I think that the more performant solution for converting push-style parsers into pull-style parsers is Stackless Python. It seems to be the solution to a lot of problems. -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html From Fredrik Lundh" Message-ID: <011001bfb555$bdc64b00$34aab5d4@hagrid> Lars Marius Garshol wrote: > I'm not sure how we would actually implement this. The only XML parser > we have that supports a pull-style interface is RXP, and I'm not sure > if we can convert the other interfaces to pull-style interfaces in a > sensible way (at least not on a level as low as SAX) without storing > the entire sequence of events. assuming that a pull-style parser is what I think it is, here's how to convert any incremental parser (xmllib, sgmlop, expat, etc) to a pull-style parser:

    import xmllib

    START, DATA, END = "start", "data", "end"

    class XMLPuller(xmllib.XMLParser):

        def __init__(self, stream):
            xmllib.XMLParser.__init__(self)
            self.__stream = stream
            self.__tokens = []

        def get(self):
            # feed the parser until it has produced at least one token
            while not self.__tokens:
                data = self.__stream.read(10000)
                if not data:
                    self.close()
                    break
                self.feed(data)
            if self.__tokens:
                return self.__tokens.pop(0)
            return None # end of stream

        # each event is queued as a single tuple
        def unknown_starttag(self, tag, attr):
            self.__tokens.append((START, tag, attr))

        def handle_data(self, data):
            self.__tokens.append((DATA, data))

        def unknown_endtag(self, tag):
            self.__tokens.append((END, tag))

    puller = XMLPuller(open("myfile.xml"))

    while 1:
        next = puller.get()
        if not next:
            break
        print next

From ht@cogsci.ed.ac.uk Thu May 4 09:51:39 2000 From: ht@cogsci.ed.ac.uk (Henry S.
Thompson) Date: 04 May 2000 09:51:39 +0100 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Guido van Rossum's message of "Wed, 03 May 2000 08:16:56 -0400" References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: Guido van Rossum writes: > My ASCII proposal is a compromise that tries to be fair to both uses > for strings. Introducing byte arrays as a more fundamental type has > been on the wish list for a long time -- I see no way to introduce > this into Python 1.6 without totally botching the release schedule > (June 1st is very close already!). I'd like to be able to move on, > there are other important things still to be added to 1.6 (Vladimir's > malloc patches, Neil's GC, Fredrik's completed sre...). > > For 1.7 (which should happen later this year) I promise I'll reopen > the discussion on byte arrays. I think I hear a moderate consensus developing that the 'ASCII proposal' is a reasonable compromise given the time constraints. But let's not fail to come back to this ASAP -- it _really_ narcs me that every time I load XML into my Python-based editor I'm going to convert large amounts of wide-string data into UTF-8 just so Tk can convert it back to wide-strings in order to display it! ht -- Henry S. 
Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From guido@python.org Thu May 4 13:40:35 2000 From: guido@python.org (Guido van Rossum) Date: Thu, 04 May 2000 08:40:35 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "04 May 2000 09:51:39 BST." References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: <200005041240.IAA08277@eric.cnri.reston.va.us> > I think I hear a moderate consensus developing that the 'ASCII > proposal' is a reasonable compromise given the time constraints. But > let's not fail to come back to this ASAP -- it _really_ narcs me that > every time I load XML into my Python-based editor I'm going to convert > large amounts of wide-string data into UTF-8 just so Tk can convert it > back to wide-strings in order to display it! Thanks -- but that's really Tcl's fault, since the only way to get character data *into* Tcl (or out of it) is through the UTF-8 encoding. And is your XML really stored on disk in its 16-bit format? 
--Guido van Rossum (home page: http://www.python.org/~guido/) From fredrik@pythonware.com Thu May 4 14:21:25 2000 From: fredrik@pythonware.com (Fredrik Lundh) Date: Thu, 4 May 2000 15:21:25 +0200 Subject: [XML-SIG] Re: Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> <200005041240.IAA08277@eric.cnri.reston.va.us> Message-ID: <00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com> Guido van Rossum wrote: > Thanks -- but that's really Tcl's fault, since the only way to get > character data *into* Tcl (or out of it) is through the UTF-8 > encoding. from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars) Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new object or modify an existing object to hold a copy of the Unicode string given by unicode and numChars. 
(Tcl_UniChar* is currently the same thing as Py_UNICODE*) From m.favas@per.dem.csiro.au Thu May 4 20:11:58 2000 From: m.favas@per.dem.csiro.au (Mark Favas) Date: Fri, 05 May 2000 03:11:58 +0800 Subject: [XML-SIG] PyXML 0.5.4 installation glitches Message-ID: <3911CB7E.72CA7564@per.dem.csiro.au> Platform: DEC Alpha, Tru64 Unix V4.0F, Compaq C V6.1-110, Python 1.6a2 (#91, May 5 2000, 01:57:36) (from CVS) Running "python setup.py build" produces the following error:

building 'xml.parsers.pyexpat' extension
cc -c -Iextensions/expat/xmltok -Iextensions/expat/xmlparse -I/usr/local/include/python1.6 -O -Olimit 1500 extensions/pyexpat.c -o build/temp.osf1V-alpha/extensions/pyexpat.o
cc: Error: extensions/pyexpat.c, line 82: The static declaration of "handler_info" is a tentative definition and specifies an incomplete type. (incompstat)
static struct HandlerInfo handler_info[];
--------------------------^
error: command 'cc' failed with exit status 1

Changing the indicated line to

static struct HandlerInfo handler_info[64];

allows the compilation to proceed with the following warnings:

cc: Warning: extensions/pyexpat.c, line 821: In the initializer for handler_info[0].handler, the referenced type of the pointer value "my_StartElementHandler" is "function (pointer to void, pointer to const char, pointer to pointer to const char) returning void", which is not compatible with "void". (ptrmismatch)
        my_StartElementHandler},
--------^
cc: Warning: extensions/pyexpat.c, line 824: In the initializer for handler_info[1].handler, the referenced type of the pointer value "my_EndElementHandler" is "function (pointer to void, pointer to const char) returning void", which is not compatible with "void". (ptrmismatch)
        my_EndElementHandler},
--------^
cc: Warning: extensions/pyexpat.c, line 827: In the initializer for handler_info[2].handler, the referenced type of the pointer value "my_ProcessingInstructionHandler" is "function (pointer to void, pointer to const char, pointer to const char) returning void", which is not compatible with "void". (ptrmismatch)
        my_ProcessingInstructionHandler},
--------^
cc: Warning: extensions/pyexpat.c, line 830: In the initializer for handler_info[3].handler, the referenced type of the pointer value "my_CharacterDataHandler" is "function (pointer to void, pointer to const char, int) returning void", which is not compatible with "void". (ptrmismatch)
        my_CharacterDataHandler},
--------^
cc: Warning: extensions/pyexpat.c, line 833: In the initializer for handler_info[4].handler, the referenced type of the pointer value "my_UnparsedEntityDeclHandler" is "function (pointer to void, pointer to const char, pointer to const char, pointer to const char, pointer to const char, pointer to const char) returning void", which is not compatible with "void". (ptrmismatch)
        my_UnparsedEntityDeclHandler },
--------^
cc: Warning: extensions/pyexpat.c, line 836: In the initializer for handler_info[5].handler, the referenced type of the pointer value "my_NotationDeclHandler" is "function (pointer to void, pointer to const char, pointer to const char, pointer to const char, pointer to const char) returning void", which is not compatible with "void". (ptrmismatch)
        my_NotationDeclHandler },
--------^
cc: Warning: extensions/pyexpat.c, line 839: In the initializer for handler_info[6].handler, the referenced type of the pointer value "my_StartNamespaceDeclHandler" is "function (pointer to void, pointer to const char, pointer to const char) returning void", which is not compatible with "void". (ptrmismatch)
        my_StartNamespaceDeclHandler },
--------^
cc: Warning: extensions/pyexpat.c, line 842: In the initializer for handler_info[7].handler, the referenced type of the pointer value "my_EndNamespaceDeclHandler" is "function (pointer to void, pointer to const char) returning void", which is not compatible with "void". (ptrmismatch)
        my_EndNamespaceDeclHandler },
--------^
cc: Warning: extensions/pyexpat.c, line 845: In the initializer for handler_info[8].handler, the referenced type of the pointer value "my_CommentHandler" is "function (pointer to void, pointer to const char) returning void", which is not compatible with "void". (ptrmismatch)
        my_CommentHandler},
--------^
cc: Warning: extensions/pyexpat.c, line 848: In the initializer for handler_info[9].handler, the referenced type of the pointer value "my_StartCdataSectionHandler" is "function (pointer to void) returning void", which is not compatible with "void". (ptrmismatch)
        my_StartCdataSectionHandler},
--------^
cc: Warning: extensions/pyexpat.c, line 851: In the initializer for handler_info[10].handler, the referenced type of the pointer value "my_EndCdataSectionHandler" is "function (pointer to void) returning void", which is not compatible with "void". (ptrmismatch)
        my_EndCdataSectionHandler},
--------^
cc: Warning: extensions/pyexpat.c, line 854: In the initializer for handler_info[11].handler, the referenced type of the pointer value "my_DefaultHandler" is "function (pointer to void, pointer to const char, int) returning void", which is not compatible with "void". (ptrmismatch)
        my_DefaultHandler},
--------^
cc: Warning: extensions/pyexpat.c, line 857: In the initializer for handler_info[12].handler, the referenced type of the pointer value "my_DefaultHandlerExpandHandler" is "function (pointer to void, pointer to const char, int) returning void", which is not compatible with "void". (ptrmismatch)
        my_DefaultHandlerExpandHandler},
--------^
cc: Warning: extensions/pyexpat.c, line 860: In the initializer for handler_info[13].handler, the referenced type of the pointer value "my_NotStandaloneHandler" is "function (pointer to void) returning int", which is not compatible with "void". (ptrmismatch)
        my_NotStandaloneHandler},
--------^
cc: Warning: extensions/pyexpat.c, line 863: In the initializer for handler_info[14].handler, the referenced type of the pointer value "my_ExternalEntityRefHandler" is "function (pointer to void, pointer to const char, pointer to const char, pointer to const char, pointer to const char) returning int", which is not compatible with "void". (ptrmismatch)
        my_ExternalEntityRefHandler },
--------^

The link step also appears to have a wildcard quoting problem. The ld command used is:

ld -shared -expect_unresolved "*" build/temp.osf1V-alpha/extensions/pyexpat.o build/temp.osf1V-alpha/extensions/expat/xmltok/xmltok.o build/temp.osf1V-alpha/extensions/expat/xmltok/xmlrole.o build/temp.osf1V-alpha/extensions/expat/xmlwf/xmlfile.o build/temp.osf1V-alpha/extensions/expat/xmlwf/xmlwf.o build/temp.osf1V-alpha/extensions/expat/xmlwf/codepage.o build/temp.osf1V-alpha/extensions/expat/xmlparse/xmlparse.o build/temp.osf1V-alpha/extensions/expat/xmlparse/hashtable.o build/temp.osf1V-alpha/extensions/expat/xmlwf/unixfilemap.o -o build/lib.osf1V-alpha/xml/parsers/pyexpat.so

which works correctly if put into a /bin/sh script: it produces pyexpat.so without warnings of unresolved externals (the -expect_unresolved "*" pattern matches all).
However, when run by Python via the "python setup.py build" command, ld complains about all the unresolved externals: ld: Warning: Unresolved: fread strlen strncpy strcmp free malloc PyType_Type PyObject_GetAttrString _Py_NoneStruct PyObject_Init as if the pattern that ld is trying to match is literally "*" instead of * -- Email - m.favas@per.dem.csiro.au Mark C Favas Phone - +61 8 9333 6268, 0418 926 074 CSIRO Exploration & Mining Fax - +61 8 9383 9891 Private Bag Post Office Wembley GPS - 31.97 S, 115.81 E Western Australia 6014 From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@prescod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@prescod.net><200005011802.OAA21612@eric.cnri.reston.va.us><390DEB45.D8D12337@prescod.net><200005012132.RAA23319@eric.cnri.reston.va.us><390E1F08.EA91599E@prescod.net><200005020053.UAA23665@eric.cnri.reston.va.us><200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: <007701bfb60c$1543f060$34aab5d4@hagrid> Henry S. Thompson wrote: > I think I hear a moderate consensus developing that the 'ASCII > proposal' is a reasonable compromise given the time constraints. agreed. (but even if we settle for "7-bit unicode" in 1.6, there are still a few issues left to sort out before 1.6 final. but it might be best to get back to that after we've added SRE and GC to 1.6a3. we might all need a short break...) > But let's not fail to come back to this ASAP first week in june, promise ;-) From ht@cogsci.ed.ac.uk Fri May 5 09:19:07 2000 From: ht@cogsci.ed.ac.uk (Henry S. 
Thompson) Date: 05 May 2000 09:19:07 +0100 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Guido van Rossum's message of "Thu, 04 May 2000 08:40:35 -0400" References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> <200005041240.IAA08277@eric.cnri.reston.va.us> Message-ID: Guido van Rossum writes: > > I think I hear a moderate consensus developing that the 'ASCII > > proposal' is a reasonable compromise given the time constraints. But > > let's not fail to come back to this ASAP -- it _really_ narcs me that > > every time I load XML into my Python-based editor I'm going to convert > > large amounts of wide-string data into UTF-8 just so Tk can convert it > > back to wide-strings in order to display it! > > Thanks -- but that's really Tcl's fault, since the only way to get > character data *into* Tcl (or out of it) is through the UTF-8 > encoding. > > And is your XML really stored on disk in its 16-bit format? No, I have no idea what encoding it's in, my XML parser supports over a dozen encodings, and quite sensibly always delivers the content, as per the XML REC, as wide-strings. ht -- Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From ht@cogsci.ed.ac.uk Fri May 5 09:21:41 2000 From: ht@cogsci.ed.ac.uk (Henry S. 
Thompson) Date: 05 May 2000 09:21:41 +0100 Subject: [XML-SIG] Re: Unicode debate In-Reply-To: "Fredrik Lundh"'s message of "Thu, 4 May 2000 15:21:25 +0200" References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> <200005041240.IAA08277@eric.cnri.reston.va.us> <00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com> Message-ID: "Fredrik Lundh" writes: > Guido van Rossum wrote: > > Thanks -- but that's really Tcl's fault, since the only way to get > > character data *into* Tcl (or out of it) is through the UTF-8 > > encoding. > > from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm > > Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars) > > Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new > object or modify an existing object to hold a copy of the > Unicode string given by unicode and numChars. > > (Tcl_UniChar* is currently the same thing as Py_UNICODE*) > Any way this can be exploited in Tkinter? ht -- Henry S. 
Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From fredrik@pythonware.com Fri May 5 10:08:41 2000 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 5 May 2000 11:08:41 +0200 Subject: [Python-Dev] Re: [XML-SIG] Re: Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@prescod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@prescod.net><200005011802.OAA21612@eric.cnri.reston.va.us><390DEB45.D8D12337@prescod.net><200005012132.RAA23319@eric.cnri.reston.va.us><390E1F08.EA91599E@prescod.net><200005020053.UAA23665@eric.cnri.reston.va.us><200005031216.IAA03274@eric.cnri.reston.va.us><200005041240.IAA08277@eric.cnri.reston.va.us><00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com> Message-ID: <010401bfb671$82bc6e50$0500a8c0@secret.pythonware.com> Henry S. Thompson wrote: > > from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm > > > > Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars) > > Any way this can be exploited in Tkinter? fixes for this were checked into CVS last night, so they'll be in the next alpha. From guido@python.org Fri May 5 16:07:48 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 05 May 2000 11:07:48 -0400 Subject: [XML-SIG] Moving Unicode debate to i18n-sig@python.org Message-ID: <200005051507.LAA14262@eric.cnri.reston.va.us> I've moved all my responses to the Unicode debate to the i18n-sig mailing list, where it belongs. Please don't cross-post any more. If you're interested in this issue but aren't subscribed to the i18n-sig list, please subscribe at http://www.python.org/mailman/listinfo/i18n-sig/. To view the archives, go to http://www.python.org/pipermail/i18n-sig/. See you there!
--Guido van Rossum (home page: http://www.python.org/~guido/) From pwolff@mgfairfax.rr.com Sun May 7 02:49:28 2000 From: pwolff@mgfairfax.rr.com (Greg Wolff) Date: Sat, 06 May 2000 21:49:28 -0400 Subject: [XML-SIG] how to obtain Byte offset from the Locator... Message-ID: <3914CBA8.B54C5189@mgfairfax.rr.com> I've a question for this list about obtaining location information during an event call back to the document handler. I'm writing my first Python xml script and having a good time with it. (This C++ dude thinks Python is great...) But, I can't see how to obtain the byte offset from the locator. In expat's C/C++ interface there is a routine long XMLPARSEAPI XML_GetCurrentByteIndex(XML_Parser parser); that allows me to acquire the current byte offset during an event call back. In the Python interface in SAX there is no equivalent routine. Of course, the Java documentation of SAX does not document any way to get the byte offset either. My question is: Is there any way to acquire the byte offset of the current Line and Column that the Locator is pointing to during an event call back? I need the information for search indices that I'm building and would rather build the code in Python than C++. Thanks for the help! Greg Wolff pwolff@cox.rr.com From mwh21@cam.ac.uk Sun May 7 11:58:11 2000 From: mwh21@cam.ac.uk (Michael Hudson) Date: 07 May 2000 11:58:11 +0100 Subject: [XML-SIG] whither www.w3.org? Message-ID: Vaguely on topic ... I was just starting to learn about things XML-ish when www.w3.org fell off the 'net, which makes reading specifications a bit difficult. Yesterday I got "connection refused", today I get "host not found". Does anybody know (a) what's going on (b) if there is a web mirror anywhere? TIA, Michael From larsga@garshol.priv.no Sun May 7 12:41:59 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 07 May 2000 13:41:59 +0200 Subject: [XML-SIG] how to obtain Byte offset from the Locator...
In-Reply-To: <3914CBA8.B54C5189@mgfairfax.rr.com> References: <3914CBA8.B54C5189@mgfairfax.rr.com> Message-ID: * Greg Wolff | | I've a question for this list about obtaining location information | during an event call back to the document handler. I'm writing my first | Python xml script and having a good time with it. (This C++ dude thinks | Python is great...) But, I can't see how to obtain the byte offset from | the locator. There is no way to do that with the Locator. I plan to add SAX 2.0 properties for the byte offset to the expat and xmlproc drivers, since both support this functionality, but at the moment there is no standard way to do this. For speed of access the value of the property should probably be a function (really a method tied to an object). BTW, I've been wondering what namespace to use for this. Should we define common properties/features in the http://www.python.org/ namespace, or should I use my own garshol.priv.no? | I need the information for search indices that I'm building and would | rather build the code in Python than C++. If you _know_ that you are using the expat driver you can look at the drv_pyexpat.py code and see how to find a reference to the expat Parser object and try to get the information from there. Not really the recommended way to do it, but it should work. --Lars M. From larsga@garshol.priv.no Sun May 7 12:43:04 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 07 May 2000 13:43:04 +0200 Subject: [XML-SIG] whither www.w3.org? In-Reply-To: References: Message-ID: * Michael Hudson | | Vaguely on topic ... I was just starting to learn about things XML-ish | when www.w3.org fell off the 'net, which makes reading specifications a | bit difficult. Yesterday I got "connection refused", today I get "host | not found". This may be a local problem. FWIW I can access w3.org with no problems from Norway right now. --Lars M.
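On the byte-offset question earlier in this thread ("how to obtain Byte offset from the Locator..."), here is a minimal sketch of the "go to the expat parser directly" route, bypassing the SAX driver layer. It assumes the `CurrentByteIndex` attribute that parser objects from the standard library's `xml.parsers.expat` module expose in current Python; whether the pyexpat shipped with PyXML 0.5.4 offers the same attribute is not confirmed here. The names `element_offsets` and `on_start` are hypothetical, chosen only for illustration.

```python
import xml.parsers.expat


def element_offsets(data):
    """Return (tag, byte_offset) pairs for each start tag in data (bytes)."""
    parser = xml.parsers.expat.ParserCreate()
    offsets = []

    def on_start(name, attrs):
        # Inside a callback, CurrentByteIndex is the offset of the first
        # byte of the event being reported (here, the '<' of the start tag).
        offsets.append((name, parser.CurrentByteIndex))

    parser.StartElementHandler = on_start
    parser.Parse(data, True)  # True marks this as the final chunk of input
    return offsets


if __name__ == "__main__":
    for tag, offset in element_offsets(b"<doc><item/><item/></doc>"):
        print(tag, offset)
```

The offset counts bytes of the input as fed to the parser, so it can be used to seek into the original file when building a search index.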
From tpassin@home.com Sun May 7 15:48:22 2000 From: tpassin@home.com (tpassin@home.com) Date: Sun, 7 May 2000 10:48:22 -0400 Subject: [XML-SIG] whither www.w3.org? Message-ID: <004d01bfb833$4b16ede0$7cac1218@reston1.va.home.com> Michael Hudson asked > Vaguely on topic ... I was just starting to learn about things XML-ish > when www.w3.org fell off the 'net, which makes reading specifications a > bit difficult. Yesterday I got "connection refused", today I get "host > not found". > > Does anybody know (a) what's going on (b) if there is a web mirror > anywhere? > I connected fine today, Sunday 7 May. Maybe it was a transient result of the I-LOVE-U virus. I stopped getting postings from xml-dev the same day it hit, and I still am not getting them. Tom Passin From ht@cogsci.ed.ac.uk Mon May 8 10:03:39 2000 From: ht@cogsci.ed.ac.uk (Henry S. Thompson) Date: 08 May 2000 10:03:39 +0100 Subject: [XML-SIG] whither www.w3.org? In-Reply-To: Michael Hudson's message of "07 May 2000 11:58:11 +0100" References: Message-ID: Michael Hudson writes: > Vaguely on topic ... I was just starting to learn about things XML-ish > when www.w3.org fell off the 'net, which makes reading specifications a > bit difficult. Yesterday I got "connection refused", today I get "host > not found". > > Does anybody know (a) what's going on (b) if there is a web mirror > anywhere? It's the Rutherford mirror that's fallen on its face, not www.w3.org. Maybe when those guys in Oxfordshire come back from their weekend things will get better. ht -- Henry S.
Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From mwh21@cam.ac.uk Mon May 8 20:22:20 2000 From: mwh21@cam.ac.uk (Michael Hudson) Date: 08 May 2000 20:22:20 +0100 Subject: [XML-SIG] How to get 4DOM to output empty Message-ID: I'm currently using 4DOM to generate XHTML (in a very crufty way that I will probably ask for more help on soon), and I'm finding that 4DOM produces stuff like

which I don't *think* is valid XHTML; certainly validator.w3.org doesn't like it. Currently I produce the HTML by doing: p = PrettyPrintVisitor(" ",80,["img"]) open("books.html","w").write(p.visit(newdoc)) Is this normal/sane? I await your wisdom... Cheers, Michael From akuchlin@mems-exchange.org Mon May 8 23:19:33 2000 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Mon, 8 May 2000 18:19:33 -0400 (EDT) Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: References: Message-ID: <14615.15733.667016.982985@amarok.cnri.reston.va.us> Michael Hudson writes: >

>which I don't *think* is valid XHTML; certainly validator.w3.org >doesn't like it. Then the validator is broken; the XML 1.0 spec says "If an element is empty, it must be represented either by a start-tag immediately followed by an end-tag or by an empty-element tag." (Unless XHTML specifies that only the empty-element tag is legal. In which the XHTML spec is what's broken.) Can't say off-hand if there's a way to make 4DOM produce empty-element tags; don't have the source code here at work... --amk From mwh21@cam.ac.uk Mon May 8 23:42:16 2000 From: mwh21@cam.ac.uk (Michael Hudson) Date: 08 May 2000 23:42:16 +0100 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: "Andrew M. Kuchling"'s message of "Mon, 8 May 2000 18:19:33 -0400 (EDT)" References: <14615.15733.667016.982985@amarok.cnri.reston.va.us> Message-ID: "Andrew M. Kuchling" writes: > Michael Hudson writes: > >

> >which I don't *think* is valid XHTML; certainly validator.w3.org > >doesn't like it. > > Then the validator is broken; the XML 1.0 spec says "If an element is > empty, it must be represented either by a start-tag immediately > followed by an end-tag or by an empty-element tag." (Unless XHTML > specifies that only the empty-element tag is legal. In which the > XHTML spec is what's broken.) That's what I thought. And in fact the XHTML recommendation says: Empty elements must either have an end tag or the start tag must end with />. But it also says (in the "informative" appendix C): Also, use the minimized tag syntax for empty elements, e.g.
<br />, as the alternative syntax <br></br>
allowed by XML gives uncertain results in many existing user agents. ... > Can't say off-hand if there's a way to make 4DOM produce empty-element > tags; don't have the source code here at work... ... so I'd still like to know the answer to this question. Plus the empty-element style just looks better to my eyes. Cheers, Michael -- 6. Symmetry is a complexity-reducing concept (co-routines include subroutines); seek it everywhere. -- Alan Perlis, http://www.cs.yale.edu/~perlis-alan/quotes.html From Norman Walsh Mon May 8 23:52:20 2000 From: Norman Walsh (Norman Walsh) Date: 08 May 2000 18:52:20 -0400 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Michael Hudson's message of "08 May 2000 20:22:20 +0100" References: Message-ID: <873dnspsd7.fsf@eris.nwalsh.com> / Michael Hudson was heard to say: | I'm currently using 4DOM to generate XHTML (in a very crufty way that | I will probably ask for more help on soon), and I'm finding that 4DOM | produces stuff like | |

Perfectly legit. In XML, there is no distinction between and . Be seeing you, norm -- Norman Walsh | Science is a way of talking about the http://nwalsh.com/ | universe in words that bind it to a | common reality. Magic is a method of | talking to the universe in words that | it cannot ignore. The two are rarely | compatible.--Neil Gaiman From Norman Walsh Mon May 8 23:54:19 2000 From: Norman Walsh (Norman Walsh) Date: 08 May 2000 18:54:19 -0400 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Michael Hudson's message of "08 May 2000 23:42:16 +0100" References: <14615.15733.667016.982985@amarok.cnri.reston.va.us> Message-ID: <87ya5kodpg.fsf@eris.nwalsh.com> / Michael Hudson was heard to say: | But it also says (in the "informative" appendix C): | | Also, use the minimized tag syntax for empty elements, e.g.
<br />, as the alternative syntax <br></br>
allowed by XML gives | uncertain results in many existing user agents. Broken (or not properly XML-aware) user agents. Be seeing you, norm -- Norman Walsh | Blessed is he who expects nothing, for http://nwalsh.com/ | he shall never be disappointed.--Pope From mwh21@cam.ac.uk Tue May 9 00:06:33 2000 From: mwh21@cam.ac.uk (Michael Hudson) Date: 09 May 2000 00:06:33 +0100 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Norman Walsh's message of "08 May 2000 18:54:19 -0400" References: <14615.15733.667016.982985@amarok.cnri.reston.va.us> <87ya5kodpg.fsf@eris.nwalsh.com> Message-ID: Norman Walsh writes: > / Michael Hudson was heard to say: > | But it also says (in the "informative" appendix C): > | > | Also, use the minimized tag syntax for empty elements, e.g.
<br />, as the alternative syntax <br></br>
allowed by XML gives > | uncertain results in many existing user agents. > > Broken (or not properly XML-aware) user agents. Or just old. M. -- incidentally, asking why things are "left out of the language" is a good sign that the asker is fairly clueless. -- Erik Naggum, comp.lang.lisp From hannu@tm.ee Mon May 8 23:08:01 2000 From: hannu@tm.ee (Hannu Krosing) Date: Tue, 09 May 2000 01:08:01 +0300 Subject: [XML-SIG] How to get 4DOM to output empty References: <873dnspsd7.fsf@eris.nwalsh.com> Message-ID: <39173AC1.E1D03550@tm.ee> Norman Walsh wrote: > > / Michael Hudson was heard to say: > | I'm currently using 4DOM to generate XHTML (in a very crufty way that > | I will probably ask for more help on soon), and I'm finding that 4DOM > | produces stuff like > | > |

> > Perfectly legit. In XML, there is no distinction between > and . He said XHTML not XML, a standard supposed to be bacwards compatible. ------- Hannu From jsydik@virtualparadigm.com Tue May 9 02:42:59 2000 From: jsydik@virtualparadigm.com (Jeremy J. Sydik) Date: Mon, 08 May 2000 20:42:59 -0500 Subject: [XML-SIG] How to get 4DOM to output empty References: <14615.15733.667016.982985@amarok.cnri.reston.va.us> <87ya5kodpg.fsf@eris.nwalsh.com> Message-ID: <39176D23.8FA5B72D@virtualparadigm.com> The entire Reference is: XHTML 1.0: The Extensible HyperText Markup Language A Reformulation of HTML 4 in XML 1.0 W3C Recommendation 26 January 2000 . . Appendix C. HTML Compatibility Guidelines . . C.2 Empty Elements Include a space before the trailing / and > of empty elements, e.g.
<br />, <hr /> and <img src="karen.jpg" alt="Karen" />. Also, use the minimized tag syntax for empty elements, e.g. <br />, as the alternative syntax <br></br>
allowed by XML gives uncertain results in many existing user agents. As I'm reading this, the point is EXACTLY that we're working with non-aware agents, in particular, those browsers not capable of handling XML (So, really most of the current market last I knew). As far as the original question, output of

IS valid XHMTL, but not compatible with the current HTML browser base for the most part, hence
. That aside, I think this might work as a workaround for you until an answer from the FT crew shows up:

Make the following changes to Ft/Dom/Ext/PrettyPrintVisitor:

Change __init__ to be:

    def __init__(self, indent, width, plainElements, singleElements=[]):
        self.__indent = indent
        self.__depth = 0
        self.__width = width
        self.__plainElements = plainElements
        self.__singleElements = singleElements
        self.__printPlain = 0
        self.__plainPrinter = PrintVisitor()
        self.__prevNodeIsText = 0
        self.__emptyReturn = 0
        self.__namespaces = [{}]

In visitElement:

Replace:

    st = string.rstrip(st) + '>'

With:

    if node.tagName in self.__singleElements:
        st = string.rstrip(st) + ' />'
    else:
        st = string.rstrip(st) + '>'

Replace:

    if node.ownerDocument.isXml() or node.hasChildNodes() or node.tagName not in HTML_SINGLE_TAGS:

With:

    if node.ownerDocument.isXml() or node.hasChildNodes() or node.tagName not in HTML_SINGLE_TAGS or node.tagName not in self.__singleElements:

At which time your code example would look like:

    p = PrettyPrintVisitor(" ",80,[""],["IMG"])
    open("books.html","w").write(p.visit(newdoc))

Not seeing your full code example, I don't know if this will actually work or not.

Have Fun,
Jeremy

Norman Walsh wrote: > > / Michael Hudson was heard to say: > | But it also says (in the "informative" appendix C): > | > | Also, use the minimized tag syntax for empty elements, e.g.
<br />, as the alternative syntax <br></br>
allowed by XML gives > | uncertain results in many existing user agents. > > Broken (or not properly XML-aware) user agents. > > Be seeing you, > norm > > -- > Norman Walsh | Blessed is he who expects nothing, for > http://nwalsh.com/ | he shall never be disappointed.--Pope > > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://www.python.org/mailman/listinfo/xml-sig From mickael.remond@IDEALX.com Tue May 9 08:44:12 2000 From: mickael.remond@IDEALX.com (Mickael Remond) Date: 09 May 2000 09:44:12 +0200 Subject: [XML-SIG] Bug report in DOM: ' instead of " in attribs Message-ID: <7od7mwyxpv.fsf@snake.ird.idealx.com> Hello to all, I think I have found a bug in the DOM source code (PyXML 0.5.1). This bug prevents me from reading back the XML I have written. I was using the xmlproc SAX driver. This bug does not seem to be corrected in the PyXML 0.5.4 release candidate. The toxml method in the class Element writes the attributes with two single quotes instead of using two double quotes, as is usually done in XML. Example: doc.toxml writes : and should write: The diff is the following on dom/core.py:

803c803
< s = s + " %s='" % (attr,)
---
> s = s + " %s=\"" % (attr,)
810c810
< s = s + "'"
---
> s = s + "\""

Has this bug been identified before? I hope this bug will be fixed in the PyXML 0.5.4 final release. Thank you in advance. -- Mickaël Rémond - mickael.remond@IDEALX.com - http://IDEALX.com From larsga@garshol.priv.no Tue May 9 08:48:34 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 09 May 2000 09:48:34 +0200 Subject: [XML-SIG] Bug report in DOM: ' instead of " in attribs In-Reply-To: <7od7mwyxpv.fsf@snake.ird.idealx.com> References: <7od7mwyxpv.fsf@snake.ird.idealx.com> Message-ID: * Mickael Remond | | I think I have found a bug in the DOM source code (PyXML | 0.5.1). This bug prevents me from reading back the XML I have | written. I was using the xmlproc SAX driver.
This bug does not seem | to be corrected in PyXML 0.5.4 release candidate. | | The toxml method in the class Element writes the attributes with two | single quotes instead of using two double quotes, as is | usually done in XML. XML allows both single and double quotes, so this should be perfectly OK. Any parser which does not support single quotes is simply broken. Which XML parser does not allow you to read the document back? And can we see the XML that fails? --Lars M. From Fredrik Lundh" Message-ID: <00e001bfb98b$c674cf80$34aab5d4@hagrid> Mickael Remond wrote: > The toxml method in the class Element writes the attributes with two single > quotes instead of using two double quotes, as is usually done > in XML. XML allows you to use either double quotes or single quotes for attribute values (see the AttValue production). > doc.toxml writes : > that's perfectly valid XML. > Has this bug been identified before ? it's not a bug -- at least not where you think it is. since I strongly doubt that xmlproc messes up on this one, maybe the real bug is that the DOM writer doesn't look for quotes in the attribute content? the following is *not* a valid tag: it should be written as: or From Norman Walsh Tue May 9 11:37:07 2000 From: Norman Walsh (Norman Walsh) Date: 09 May 2000 06:37:07 -0400 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Hannu Krosing's message of "Tue, 09 May 2000 01:08:01 +0300" References: <873dnspsd7.fsf@eris.nwalsh.com> <39173AC1.E1D03550@tm.ee> Message-ID: <87u2g8nh64.fsf@eris.nwalsh.com> / Hannu Krosing was heard to say: | > Perfectly legit. In XML, there is no distinction between | > and . | | He said XHTML not XML, a standard supposed to be backwards compatible. I understand that, but it's also supposed to be XML. The most emphatic thing that the XHTML spec could say is that one form or the other is preferred. XHTML has to obey the rules of XML.
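The point that XML makes no distinction between the two forms of an empty element can be checked with a short sketch. It assumes the present-day standard-library module xml.etree.ElementTree, not 4DOM or its PrettyPrintVisitor, and the helper name shape is hypothetical. Both spellings parse to the same tree, and the writer can be told which spelling to emit.

```python
import xml.etree.ElementTree as ET

# The same paragraph, once with a start-tag/end-tag pair and once
# with an empty-element tag.
long_form = ET.fromstring("<p>line<br></br>break</p>")
short_form = ET.fromstring("<p>line<br/>break</p>")


def shape(elem):
    """Reduce an element to plain data so two trees can be compared."""
    return (elem.tag, dict(elem.attrib), elem.text, elem.tail,
            [shape(child) for child in elem])


same_tree = shape(long_form) == shape(short_form)

# short_empty_elements selects the output spelling for empty elements.
as_short = ET.tostring(long_form, encoding="unicode",
                       short_empty_elements=True)
as_long = ET.tostring(long_form, encoding="unicode",
                      short_empty_elements=False)

print(same_tree)
print(as_short)
print(as_long)
```

The short form written here includes a space before the slash, which happens to match the XHTML appendix C guideline quoted earlier in the thread.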
Be seeing you, norm -- Norman Walsh | If you settle for what they're giving http://nwalsh.com/ | you, you deserve what you get. From hannu@tm.ee Tue May 9 12:46:29 2000 From: hannu@tm.ee (Hannu Krosing) Date: Tue, 09 May 2000 14:46:29 +0300 Subject: [XML-SIG] How to get 4DOM to output empty References: <873dnspsd7.fsf@eris.nwalsh.com> <39173AC1.E1D03550@tm.ee> <87u2g8nh64.fsf@eris.nwalsh.com> Message-ID: <3917FA95.17F41A79@tm.ee> Norman Walsh wrote: > > / Hannu Krosing was heard to say: > | > Perfectly legit. In XML, there is no distinction between > | > and . > | > | He said XHTML not XML, a standard supposed to be bacwards compatible. > > I understand that, but it's also supposed to be XML. The most emphatic > thing that the XHTML spec could say is that one form or the other is > preferred. XHTML has to obey the rules of XML. My understanding was that XHTML was supposed to define a subset of XML that is also HTML (and actually accepted and rendered more-or-less ok). It is always hard to tell what a recommendation in a "standard" means. For example, if you follow just the requirements and not the recommendations when programming java applets, they usually won't work in the same way on different browsers (or don't work at all). ---------------- Hannu From Norman Walsh Tue May 9 14:10:29 2000 From: Norman Walsh (Norman Walsh) Date: 09 May 2000 09:10:29 -0400 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Hannu Krosing's message of "Tue, 09 May 2000 14:46:29 +0300" References: <873dnspsd7.fsf@eris.nwalsh.com> <39173AC1.E1D03550@tm.ee> <87u2g8nh64.fsf@eris.nwalsh.com> <3917FA95.17F41A79@tm.ee> Message-ID: <87bt2fna2i.fsf@eris.nwalsh.com> / Hannu Krosing was heard to say: | Norman Walsh wrote: | > / Hannu Krosing was heard to say: | > | He said XHTML not XML, a standard supposed to be bacwards compatible. | > | > I understand that, but it's also supposed to be XML. 
The most emphatic | > thing that the XHTML spec could say is that one form or the other is | > preferred. XHTML has to obey the rules of XML. | | My understanding was that XHTML was supposed to define a subset of XML | that is also HTML (and actually accepted and rendered more-or-less ok). | | It is always hard to tell what a recommendation in a "standard" means. Yep. FWIW, if my concern is for presentation rather than compliance with the standard, I usually just add a bogus attribute:
That works just as well as "
" and is often easier to get tools to render. Be seeing you, norm -- Norman Walsh | Our years, our debts, and our enemies http://nwalsh.com/ | are always more numerous than we | imagine.--Charles Nodier From pwolff@mgfairfax.rr.com Tue May 9 19:52:31 2000 From: pwolff@mgfairfax.rr.com (Greg Wolff) Date: Tue, 09 May 2000 14:52:31 -0400 Subject: [XML-SIG] how to obtain Byte offset from the Locator... References: <3914CBA8.B54C5189@mgfairfax.rr.com> Message-ID: <39185E6F.F794E2D3@mgfairfax.rr.com> I have a copy of expat for my C++ code but I have found that I don't have a copy of the driver for expat for the Python code. I have the xmlproc code and it works just fine, but it doesn't have byte offset as far as I can tell. (My first cursory look at the code suggests that it would be better to ask you'all for help rather than try to hack it...) Which file on the xml-sig download page has the Python Expat code in it? I have tried to download a pyexpat file but the link was broken last night when I tried it. If I can get the pyexpat code I'll hack it as Lars M. suggests below (Thanks!). Also, is there any chance of trying to work with the SAX 2.0 Python code? Thanks for the help. /pgw Greg Wolff Lars Marius Garshol wrote: > > * Greg Wolff > | > | But, I can't see how to obtain the byte offset from > | the locator. > > There is no way to do that with the Locator. > > I plan to add SAX 2.0 properties for the byte offset to the expat and > xmlproc drivers, since both support this functionality, but at the > moment there is no standard way to do this. > > ..... > > | I need the information for search indices that I'm building and would > | rather build the code in Python than C++. > > If you _know_ that you are using the expat driver you can look at the > drv_pyexpat.py code and see how to find a reference to the expat > Parser object and try to get the information from there. Not really > the recommended way to do it, but it should work. > > --Lars M. 
> > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://www.python.org/mailman/listinfo/xml-sig From larsga@garshol.priv.no Tue May 9 20:07:13 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 09 May 2000 21:07:13 +0200 Subject: [XML-SIG] how to obtain Byte offset from the Locator... In-Reply-To: <39185E6F.F794E2D3@mgfairfax.rr.com> References: <3914CBA8.B54C5189@mgfairfax.rr.com> <39185E6F.F794E2D3@mgfairfax.rr.com> Message-ID: * Greg Wolff | | I have a copy of expat for my C++ code but I have found that I don't | have a copy of the driver for expat for the Python code. If you download either the XML-SIG package or the saxlib 1.0 package you will get it. | I have the xmlproc code and it works just fine, but it doesn't have | byte offset as far as I can tell. Actually, it does. The get_offset method on the Parser interface will give you what you want. | Which file on the xml-sig download page has the Python Expat code in | it? This is the one you want. | Also, is there any chance of trying to work with the SAX 2.0 Python | code? Well, you can download the SAX 2.0 release for Python and try to use it with xmlproc, but I wouldn't really recommend it. Also, you'd have to add the property yourself. I am working on SAX 2.0 right now, and will try to get this feature in as soon as I can. --Lars M. From larsga@garshol.priv.no Tue May 9 20:10:46 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 09 May 2000 21:10:46 +0200 Subject: [XML-SIG] Updated pyexpat and sgmlop for Windows? Message-ID: I currently don't have access to MSVC++ and as my home machine is Win32 (for the time being) I have problems developing SAX 2.0 drivers for pyexpat and sgmlop. I noticed that the versions in the 0.5.4 release are out of date (at least they don't seem to fit with the pyexpat.c source). 
If anyone could email me the binaries for these or make them available for download somewhere that would make me very happy as I really need this to be able to write the drivers. Thanks! --Lars M. From uogbuji@fourthought.com Wed May 10 03:55:20 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 09 May 2000 20:55:20 -0600 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Message from Michael Hudson of "08 May 2000 23:42:16 BST." Message-ID: <200005100255.UAA04394@localhost.localdomain> > > Can't say off-hand if there's a way to make 4DOM produce empty-element > > tags; don't have the source code here at work... > > ... so I'd still like to know the answer to this question. Plus the > empty-element style just looks better to my eyes. There isn't and there should be, if only for readibility. What do users think? There are two issues here: A) Should the default for printing empty XML elements be the short or long form? Any different for HTML (note that the CVS 4DOM fixes bugs with HTML 4.0 elements forbidden to have an end tag). B) Should the printers accept optional argumuments that are lists of elements to always shorten if empty (or vice-versa if the answer to A is "yes) I say "yes" to both of the above. -- Uche Ogbuji Senior Software Engineer uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-9036, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From uogbuji@fourthought.com Wed May 10 04:01:16 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Tue, 09 May 2000 21:01:16 -0600 Subject: [XML-SIG] How to get 4DOM to output empty In-Reply-To: Message from Hannu Krosing of "Tue, 09 May 2000 01:08:01 +0300." 
<39173AC1.E1D03550@tm.ee> Message-ID: <200005100301.VAA04423@localhost.localdomain> > Norman Walsh wrote: > > > > / Michael Hudson was heard to say: > > | I'm currently using 4DOM to generate XHTML (in a very crufty way that > > | I will probably ask for more help on soon), and I'm finding that 4DOM > > | produces stuff like > > | > > |

From mclay@nist.gov Sat May 20 02:24:42 2000 From: mclay@nist.gov (Michael McLay) Date: Fri, 19 May 2000 21:24:42 -0400 (EDT) Subject: [XML-SIG] XML Schema validator? Message-ID: <14629.59739.66134.318367@fermi.eeel.nist.gov> I'm looking for a validator for XML Schema instance files. The XML Schema validator at http://www.ltg.ed.ac.uk/~ht/xsv-status.html is close to what I need, but it has a GPLed copyright on the validator and a non-commercial restriction on the PyXML module that it is dependent on. The ideal mechanism for checking XML files against an XML Schema would be built on top of the standard Python 1.6 XML package and it would have a Python-like copyright. From ht@cogsci.ed.ac.uk Mon May 22 10:31:43 2000 From: ht@cogsci.ed.ac.uk (Henry S. Thompson) Date: 22 May 2000 10:31:43 +0100 Subject: [XML-SIG] XML Schema validator? In-Reply-To: Michael McLay's message of "Fri, 19 May 2000 21:24:42 -0400 (EDT)" References: <14629.59739.66134.318367@fermi.eeel.nist.gov> Message-ID: Michael McLay writes: > I'm looking for a validator for XML Schema instance files. > The XML Schema validator at http://www.ltg.ed.ac.uk/~ht/xsv-status.html > is close to what I need, but it has a GPLed copyright on the validator > and a non-commercial restriction on the PyXML module that it is > dependent on. The ideal mechanism for checking XML files against an > XML Schema would be built on top of the standard Python 1.6 XML > package and it would have a Python-like copyright. Our PyXML module will shortly be re-released for Python 1.6 with a GPL license, and a new name so as not to conflict with the existing PyXML module. We will happily discuss with you less-restrictive licensing terms for [whatever PyXML becomes], and we would also be happy if someone ported XSV to run on top of some other validating Python XML interface -- the hooks are there (in layer.py). ht -- Henry S.
Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From masjober@ra.abo.fi Mon May 22 13:27:55 2000 From: masjober@ra.abo.fi (Mats Sjoberg IB) Date: Mon, 22 May 2000 15:27:55 +0300 Subject: [XML-SIG] The PyXML distribution Message-ID: <200005221227.PAA05511@rafael.ABO.RA> Is there any possibility to install this package for one user only? I do not have root access so I cannot install the package globally. Mats Sjöberg (mats.sjoberg@abo.fi) Turku Centre for Computer Science Finland From larsga@garshol.priv.no Mon May 22 13:48:42 2000 From: larsga@garshol.priv.no (Lars Marius Garshol) Date: 22 May 2000 14:48:42 +0200 Subject: [XML-SIG] The PyXML distribution In-Reply-To: <200005221227.PAA05511@rafael.ABO.RA> References: <200005221227.PAA05511@rafael.ABO.RA> Message-ID: * Mats Sjoberg | | Is there any possibility to install this package for one user only? | I do not have root access so I cannot install the package globally. What platform are you on? And what installer are you using? --Lars M. From mikl@club-internet.fr Wed May 24 12:45:02 2000 From: mikl@club-internet.fr (mikl@club-internet.fr) Date: 24 May 2000 13:45:02 +0200 Subject: [XML-SIG] XML status, historical data and hierarchy Message-ID: <87d7mcuq81.fsf@western.ird.idealx.com> Hi, I have a question on the way I could structure my XML document to solve this problem: I need to manipulate actions. Actions are created as a prevision (a forecast), with a forecast duration. If someone is interested in achieving this action, he can propose to be responsible for this action. He can even propose a different estimated duration. The status then changes from prevision to proposition. If this action is accepted, the status changes to currently being processed, and the estimated duration can be renegotiated.
Then, the person can finish his duration with one or several achieved actions, each with a definitive duration. This is the simplest case, because the person in charge of this action can divide it into several other forecast actions and ask for a volunteer. The same process can be deeply nested. My problem is that I can hardly figure out how you would model this thing in XML. Thank you in advance for your help. This problem is not typically Python related, but this group is usually very helpful... -- Mickaël From tpassin@home.com Wed May 24 13:05:44 2000 From: tpassin@home.com (tpassin@home.com) Date: Wed, 24 May 2000 08:05:44 -0400 Subject: [XML-SIG] XML status, historical data and hierarchy References: <87d7mcuq81.fsf@western.ird.idealx.com> Message-ID: <000b01bfc578$63c893a0$7cac1218@reston1.va.home.com> asked > > Hi, > > I have a question on the way I could structure my XML document to solve this > problem: > > I need to manipulate actions. Actions are created as a prevision (a forecast), with a > forecast duration. > If someone is interested in achieving this action, he can propose to be > responsible for this action. He can even propose a different estimated > duration. The status then changes from prevision to proposition. > > If this action is accepted, the status changes to currently being processed, > and the estimated duration can be renegotiated. > > Then, the person can finish his duration with one or several achieved > actions, each with a definitive duration. > > This is the simplest case, because the person in charge of this action can > divide it into several other forecast actions and ask for a volunteer. The > same process can be deeply nested. > > My problem is that I can hardly figure out how you would model this thing in > XML. > The problem is not in the XML, but how you model this as an abstract data model. Once you know that, you can translate it into XML.
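One way to follow this advice, sketched under assumptions: the class name Action, its field names, and the status strings "prevision", "proposition", "in-progress" and "achieved" are all hypothetical, chosen only to mirror the description in the question. The model is built from plain Python objects first, and the XML form is then derived from it by letting an action element contain further action elements.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Action:
    """Hypothetical model of one action; names are illustrative only."""
    title: str
    status: str               # e.g. "prevision", "proposition", "in-progress", "achieved"
    duration_days: float      # estimated or definitive duration
    owner: Optional[str] = None
    subactions: List["Action"] = field(default_factory=list)

    def to_xml(self):
        """Translate this action and its nested actions into an element."""
        elem = ET.Element("action")
        elem.set("status", self.status)
        elem.set("duration", str(self.duration_days))
        if self.owner is not None:
            elem.set("owner", self.owner)
        ET.SubElement(elem, "title").text = self.title
        for sub in self.subactions:
            elem.append(sub.to_xml())  # nesting mirrors the object nesting
        return elem


report = Action(
    "Write report", "in-progress", 5.0, owner="alice",
    subactions=[
        Action("Collect data", "prevision", 2.0),
        Action("Draft text", "proposition", 3.0, owner="bob"),
    ],
)
report_xml = report.to_xml()
print(ET.tostring(report_xml, encoding="unicode"))
```

Whether status and duration belong in attributes or child elements, and how the history of earlier estimates should be kept, are design decisions this sketch does not settle.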
From your description, it sounds like the model would be recursive, each project action possibly containing other project actions subject to certain constraints. Get your data model designed, then the XML will probably be apparent. Tom Passin From andy@reportlab.com Wed May 24 13:05:06 2000 From: andy@reportlab.com (Andy Robinson) Date: Wed, 24 May 2000 13:05:06 +0100 Subject: [XML-SIG] XML status, historical data and hierarchy In-Reply-To: <87d7mcuq81.fsf@western.ird.idealx.com> Message-ID: > My problem is that I hardly can figure out how you would model > this thing in > XML. > I'd start with an easier problem, and try to model it with Python objects. Then when you have a model you like, start to think of coding it in XML. The w3c hypes XML for lots of things, but not yet as a RAD tool :-) - Andy Robinson From bjorn@roguewave.com Wed May 24 17:11:30 2000 From: bjorn@roguewave.com (Bjorn Pettersen) Date: Wed, 24 May 2000 10:11:30 -0600 Subject: [XML-SIG] speed question re DOM parsing Message-ID: <392BFF32.5C0AECE4@roguewave.com> I'm just starting to work with XML, so be gentle The problem is that I'm reading in a 280K xml file using the sample code from the XML howto: def getXmlDomDocument(name): p = saxexts.make_parser() dh = SaxBuilder() p.setDocumentHandler(dh) p.parseFile(open(name)) p.close() doc = dh.document xml.dom.utils.strip_whitespace(doc) return doc it takes about five seconds to read and parse the file... Is there a better way to read the file (or is there updated code that is faster)? 
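Before looking for a faster reader, it can help to measure where the time goes. The following is a rough sketch only: it uses the present-day standard-library modules xml.dom.minidom and xml.etree.ElementTree as stand-ins for a full DOM and a lighter tree, not the saxexts and SaxBuilder code from the HOWTO, and the helper names make_sample and timed are hypothetical. Timings depend on the machine and the document, so none are quoted here.

```python
import time
import xml.dom.minidom
import xml.etree.ElementTree as ET


def make_sample(rows):
    """Build a synthetic document with `rows` <row> elements."""
    body = "".join(
        '<row id="%d"><name>item %d</name></row>' % (i, i)
        for i in range(rows)
    )
    return "<table>%s</table>" % body


def timed(func, arg):
    """Call func(arg) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(arg)
    return result, time.perf_counter() - start


sample = make_sample(2000)

# A full DOM tree versus a lighter element tree for the same input.
dom, dom_seconds = timed(xml.dom.minidom.parseString, sample)
tree, tree_seconds = timed(ET.fromstring, sample)

dom_rows = len(dom.getElementsByTagName("row"))
tree_rows = len(tree.findall("row"))

print("DOM:  %d rows in %.4f s" % (dom_rows, dom_seconds))
print("tree: %d rows in %.4f s" % (tree_rows, tree_seconds))
```

Running the same harness on the real 280K file, rather than on synthetic data, would give the numbers that matter for this case.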
-- bjorn From gstein@lyra.org Wed May 24 22:01:27 2000 From: gstein@lyra.org (Greg Stein) Date: Wed, 24 May 2000 14:01:27 -0700 (PDT) Subject: [XML-SIG] speed question re DOM parsing In-Reply-To: <392BFF32.5C0AECE4@roguewave.com> Message-ID: On Wed, 24 May 2000, Bjorn Pettersen wrote: > I'm just starting to work with XML, so be gentle > > The problem is that I'm reading in a 280K xml file using the sample code > from the XML howto: > > def getXmlDomDocument(name): > p = saxexts.make_parser() > dh = SaxBuilder() > p.setDocumentHandler(dh) > p.parseFile(open(name)) > p.close() > doc = dh.document > xml.dom.utils.strip_whitespace(doc) > return doc > > it takes about five seconds to read and parse the file... > > Is there a better way to read the file (or is there updated code that is > faster)? If you want a DOM for the output, then no... you'll have to deal with the speed. If you have simple requirements for the Python representation of the XML, then take a look at xml.utils.qp_xml. Cheers, -g -- Greg Stein, http://www.lyra.org/ From bjorn@roguewave.com Thu May 25 01:49:39 2000 From: bjorn@roguewave.com (Bjorn Pettersen) Date: Wed, 24 May 2000 18:49:39 -0600 Subject: [XML-SIG] speed question re DOM parsing References: Message-ID: <392C78A3.C4176635@roguewave.com> Greg Stein wrote: > > On Wed, 24 May 2000, Bjorn Pettersen wrote: > > I'm just starting to work with XML, so be gentle > > > > The problem is that I'm reading in a 280K xml file using the sample code > > from the XML howto: > > > > def getXmlDomDocument(name): > > p = saxexts.make_parser() > > dh = SaxBuilder() > > p.setDocumentHandler(dh) > > p.parseFile(open(name)) > > p.close() > > doc = dh.document > > xml.dom.utils.strip_whitespace(doc) > > return doc > > > > it takes about five seconds to read and parse the file... > > > > Is there a better way to read the file (or is there updated code that is > > faster)? > > If you want a DOM for the output, then no... you'll have to deal with the > speed. 
If you have simple requirements for the Python representation of > the XML, then take a look at xml.utils.qp_xml. Hey, that works great! (down to ~0.5 seconds, and it doesn't have problems with installer either -- life is good ;-) -- bjorn From uche.ogbuji@fourthought.com Thu May 25 05:37:32 2000 From: uche.ogbuji@fourthought.com (uche.ogbuji@fourthought.com) Date: Wed, 24 May 2000 22:37:32 -0600 Subject: [XML-SIG] ANN: 4DOM 0.10.0 Message-ID: <200005250437.WAA02614@localhost.localdomain> From uche.ogbuji@fourthought.com Thu May 25 05:37:53 2000 From: uche.ogbuji@fourthought.com (uche.ogbuji@fourthought.com) Date: Wed, 24 May 2000 22:37:53 -0600 Subject: [XML-SIG] ANN: 4XPath 0.9.0 and 4XSLT 0.9.0 Message-ID: <200005250437.WAA02657@localhost.localdomain> From uche.ogbuji@fourthought.com Thu May 25 05:45:45 2000 From: uche.ogbuji@fourthought.com (uche.ogbuji@fourthought.com) Date: Wed, 24 May 2000 22:45:45 -0600 Subject: [XML-SIG] ANN: 4DOM 0.10.0 Message-ID: <200005250445.WAA02711@localhost.localdomain> Fourthought, Inc. (http://Fourthought.com) announces the release of 4DOM 0.10.0 ----------------------- An XML/HTML Python library using the Document Object Model interface 4DOM is a Python library for XML and HTML processing and manipulation using the W3C's Document Object Model for interface. 4DOM implements DOM Core level 2, HTML level 2 and Level 2 Document Traversal. 4DOM should work on all platforms supported by Python. If you have any problems with a particular platform, please e-mail the authors. 4DOM is designed to allow developers rapidly design applications that read, write or manipulate HTML and XML. 
News
----
- Moved all static variables to class variables
- Fixed printing to work with empty elements
- Removed all tabs from files
- Change package to xml.dom
- Major change to the internals to use Node as a Python attribute manager; this improves efficiency (cutting down on __g/setattrs__) and simplifies some things

More info and Obtaining 4DOM
----------------------------
Please see http://Fourthought.com/4Suite/4DOM

Or you can download 4DOM from ftp://Fourthought.com/pub/4Suite

There are Linux RPMs available at ftp://Fourthought.com/pub/mirrors/python4linux/redhat/

4DOM is distributed under a license similar to that of Python.

-- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python

From uche.ogbuji@fourthought.com Thu May 25 05:45:53 2000 From: uche.ogbuji@fourthought.com (uche.ogbuji@fourthought.com) Date: Wed, 24 May 2000 22:45:53 -0600 Subject: [XML-SIG] ANN: 4XPath 0.9.0 and 4XSLT 0.9.0 Message-ID: <200005250445.WAA02755@localhost.localdomain>

Fourthought, Inc. (http://Fourthought.com) announces the release of

4XSLT and 4XPath 0.9.0
----------------------
A Python implementation of the W3C's XSLT language

4XSLT is an XML transformation processor based on the W3C's specification for the XSLT transform language. 4XPath implements the W3C XPath language for indicating and selecting XML document components. http://www.w3.org/TR/xslt

4XPath implements the full XPath recommendation except for the 'lang' core function. 4XSLT implements all of the XSLT 1.0 Recommendation, except for extension elements and fallback.

Note: 4XSLT and 4XPath cannot work with JPython.
News
----
- Moved some parsing functionality to C for performance increase
- Fixed bugs for Windows build
- Converted to BisonGen for performance increase
- Fix namespace axis
- Change package name to xml.xpath / xml.xslt
- Implemented node-set and match proprietary ft extensions
- Cleaned up extension function code and simplified use of user ext functions
- Changed xml output method to use short form for empty elements
- Fixed automatic detection of html output method
- Fixed xsl:apply-templates to support with-param
- Split Processor from output Writer classes (improved coupling/cohesion) and implemented the core writer as a plain text outputter to avoid messing with SAX output unless necessary
- Implemented xsl:attribute-set
- Implemented xsl:decimal-format
- Implemented disable-output-escaping on xsl:text and xsl:value-of
- Implemented number-format extension function
- Add proper support for qualified names in vars, params, functions, etc.
- Fixed bug with xsl:element and namespaces
- Fixed performance bugs
- Other bug-fixes

More info and Obtaining 4XPath and 4XSLT
----------------------------------------
Please see
http://Fourthought.com/4Suite/4XPath
http://Fourthought.com/4Suite/4XSLT

Or you can download 4XSLT from ftp://Fourthought.com/pub/4Suite/

Source files with "-all" in the name include 4DOM and 4XPath.

There are Linux RPMs available at ftp://Fourthought.com/pub/mirrors/python4linux/redhat/

And Windows binaries at ftp://Fourthought.com/pub/4Suite/windows

4XPath and 4XSLT are distributed under a license similar to that of Python.

-- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste.
C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python

From tpassin@home.com Thu May 25 12:52:57 2000 From: tpassin@home.com (tpassin@home.com) Date: Thu, 25 May 2000 07:52:57 -0400 Subject: [XML-SIG] ANN: 4XPath 0.9.0 and 4XSLT 0.9.0 References: <200005250445.WAA02755@localhost.localdomain> Message-ID: <002601bfc63f$c4e14480$7cac1218@reston1.va.home.com>

At last! Windows binaries for these little babies! Thanks, guys. Looks like they are really located at

ftp://fourthought.com/pub/4Suite/binaries/windows/

Tom

announced:
> Fourthought, Inc. (http://Fourthought.com) announces the release of
>
> 4XSLT and 4XPath 0.9.0
> ----------------------
> A python implementation
> of the W3C's XSLT language
>
>
> 4XSLT is an XML transformation processor based on the W3C's specification
> for the XSLT transform language. 4XPath implements the W3C XPath language
> for indicating and selecting XML document components.
>
>
> And Windows binaries at
>
> ftp://Fourthought.com/pub/4Suite/windows
>

From uogbuji@fourthought.com Thu May 25 19:17:42 2000 From: uogbuji@fourthought.com (Uche Ogbuji) Date: Thu, 25 May 2000 12:17:42 -0600 Subject: [XML-SIG] ANN: 4XPath 0.9.0 and 4XSLT 0.9.0 In-Reply-To: Message from of "Thu, 25 May 2000 07:52:57 EDT." <002601bfc63f$c4e14480$7cac1218@reston1.va.home.com> Message-ID: <200005251817.MAA03789@localhost.localdomain>

> At last! Windows binaries for these little babies! Thanks, guys.
> Looks like they are really located at
>
> ftp://fourthought.com/pub/4Suite/binaries/windows/

Oops. Yes. Getting the Windows binaries out was a royal pain, but we'll keep them coming.

-- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +01 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste.
C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python

From walter@bnbt.de Fri May 26 15:27:02 2000 From: walter@bnbt.de (Walter Dörwald) Date: Fri, 26 May 2000 16:27:02 +0200 Subject: [XML-SIG] Bug in sgmlop? Message-ID: <4.3.1.0.20000526162422.00ac8de0@mail.bnbt.de>

Hello all!

I'm having a little problem with sgmlop from the 0.5.4 release. sgmlop seems to drop the last character in the string passed to parse:

    import sgmlop

    class handler:
        def handle_data(self,data):
            print repr(data)

    parser = sgmlop.SGMLParser()
    parser.register(handler())
    parser.parse("gurk")
    parser.close()

This script outputs 'gur' instead of 'gurk'.

Bye, Walter Dörwald

From Fredrik Lundh" Message-ID: <002a01bfc8ae$395994a0$f2a6b5d4@hagrid>

Walter Dörwald wrote:
> I'm having a little problem with sgmlop from
> the 0.5.4 release. sgmlop seems to drop the
> last character in the string passed to parse:

I've verified this in the 1990620 release. here's a tentative patch:

    --- sgmlop.c.old Sun Jun 20 13:43:17 1999
    +++ sgmlop.c Sun May 28 16:02:55 2000
    @@ -1080,8 +1080,10 @@
             } else {
     
                 /* raw data */
    -            if (++p >= end)
    +            if (++p >= end) {
    +                q = p;
                     goto eol;
    +            }
                 continue;
     
             }

From Fredrik Lundh" Message-ID: <003801bfc8ae$f25b5600$f2a6b5d4@hagrid>

Walter Dörwald wrote:
> parser.register(handler())
> parser.parse("gurk")
> parser.close()

footnote: the correct way to use the parser is to either call "feed" a couple of times, and call "close" when you don't have more data, or to call "parse" just once, with all the data you have.

not that it matters much in the current release...

From walter@bnbt.de Sun May 28 18:52:03 2000 From: walter@bnbt.de (Walter Dörwald) Date: Sun, 28 May 2000 19:52:03 +0200 Subject: [XML-SIG] Bug in sgmlop?
In-Reply-To: <003801bfc8ae$f25b5600$f2a6b5d4@hagrid> References: <4.3.1.0.20000526162422.00ac8de0@mail.bnbt.de> Message-ID: <4.3.1.0.20000528194212.00ae4de0@mail.bnbt.de>

At 16:13 28.05.00, you wrote:
>Walter Dörwald wrote:
> > parser.register(handler())
> > parser.parse("gurk")
> > parser.close()
>
>footnote: the correct way to use the parser is to
>either call "feed" a couple of times, and call "close"
>when you don't have more data, or to call "parse"
>just once, with all the data you have.

Thanks for the tips, I'm doing a feed/close loop

    self.lineno = 1
    for line in lines:
        parser.feed(line)
        self.lineno = self.lineno + 1
    parser.close()

but only because I need line number information so I'm splitting the source into lines. I suppose parsing the string in one go would be faster.

Are there any plans to provide line and column number information to the sgmlop user? E.g. the function finish_starttag(self,name,attrs) could be changed to finish_starttag(self,name,attrs,row,col) and should be passed the position in the string where the tag started (and similarly for the other functions). This would greatly simplify finding "bugs" in an XML file and could be used by an XML editor to highlight the position of the error.

Bye, Walter Dörwald

>not that it matters much in the current release...

From Fredrik Lundh" <4.3.1.0.20000528194212.00ae4de0@mail.bnbt.de> Message-ID: <000c01bfc8d4$846e3420$f2a6b5d4@hagrid>

I just posted an updated version of sgmlop to the "eff-bot staging site": http://w1.132.telia.com/~u13208596/sgmlop.htm

if I don't hear anything negative, I'll move it over to the pythonware site later this week.

enjoy /F

From Fredrik Lundh" <4.3.1.0.20000528194212.00ae4de0@mail.bnbt.de> Message-ID: <002301bfc8d5$0c543e20$f2a6b5d4@hagrid>

(oops. pilot error.
please ignore my last mail)

I just posted an updated version of sgmlop to the "staging area" at: http://w1.132.telia.com/~u13208596/sgmlop.htm

This release addresses the following issues:

SGMLOP1: SGML files containing only text weren't properly handled. The parser never consumed the last character, not even if the 'close' method was called (reported by Walter Dörwald)

SGMLOP2: Unicode strings (under 1.6) were treated as binary buffers. In this release, the parser can properly parse 16-bit strings, but the callbacks get 8-bit UTF-8 strings, not true Unicode strings. This will be fixed in a future release.

SGMLOP3: The 'close' method no longer accepts an optional argument. Use a separate 'feed' call instead.

SGMLOP4: Recursive calls to 'feed' or 'close' (from within a callback) could lead to all sorts of weird problems. This version checks for this condition, and raises an AssertionError instead.

I'll move it over to the pythonware site later this week. Please wait for that announcement before linking to this library.

enjoy /F

From info@pythonware.com Mon May 29 14:11:44 2000 From: info@pythonware.com (PythonWare) Date: Mon, 29 May 2000 15:11:44 +0200 Subject: [XML-SIG] Re: new sgmlop release (may 28, 2000) Message-ID: <000901bfc96f$722fa0a0$0500a8c0@secret.pythonware.com>

(same, but with the official link)

It's release week at the labs, and we'll start with something small but tasty:

Secret Labs' sgmlop module is a fast replacement for the regular expression-based parsers used in Python's sgmllib, htmllib, and xmllib modules.

A new version is now available from: http://www.pythonware.com/products/xml

Changes since the last release include:

- if a file ends with cdata, make sure all characters are sent to the callback
- Unicode strings (under 1.6) are now translated to UTF-8 on the fly (future versions will be fully unicode-aware)
- the 'close' method no longer accepts an optional argument.
- recursive calls to 'feed' or 'close' now raise an exception.
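
The feed/close protocol and the row/column question raised earlier in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python 3 and the standard library's xml.parsers.expat binding, not sgmlop, and it says nothing about what sgmlop's callbacks receive. The PositionRecorder class and its method names are made up for the example.

```python
import xml.parsers.expat


class PositionRecorder:
    """Hypothetical helper: records where each start tag begins."""

    def __init__(self):
        self.parser = xml.parsers.expat.ParserCreate()
        self.parser.StartElementHandler = self.start
        self.starts = []

    def start(self, name, attrs):
        # Inside a handler, expat exposes the position of the event
        # being reported, so no manual line counting is needed.
        self.starts.append((name,
                            self.parser.CurrentLineNumber,
                            self.parser.CurrentColumnNumber))

    def feed(self, chunk):
        # Incremental input: more data may still follow.
        self.parser.Parse(chunk, False)

    def close(self):
        # Final call: tell expat that the input is complete.
        self.parser.Parse("", True)


recorder = PositionRecorder()
for line in ["<doc>\n", "  <item/>\n", "</doc>\n"]:
    recorder.feed(line)
recorder.close()

for name, row, col in recorder.starts:
    print(name, row, col)
```

Because the parser tracks positions across successive feed calls, the input can be handed over in chunks of any size (or all at once) without changing the recorded line numbers.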
enjoy,

the pythonware team
"Secret Labs -- makers of fine pythonware since 1997."

From Juergen Hermann"

The following changes are necessary to extensions/pyexpat.c in order to get it to compile with VC5:

    --- pyexpat.c.orig Fri Mar 31 03:44:28 2000
    +++ pyexpat.c Mon May 29 13:44:10 2000
    @@ -79,7 +79,7 @@
         xmlhandler handler;
     };
     
    -static struct HandlerInfo handler_info[];
    +staticforward struct HandlerInfo handler_info[];
     
     static PyObject *conv_atts( XML_Char **atts){
         PyObject *attrs_obj=NULL;
    @@ -148,7 +148,7 @@
     }
     
     #define VOID_HANDLER( NAME, PARAMS, PARAM_FORMAT ) \
    -    RC_HANDLER( void, NAME, PARAMS, , PARAM_FORMAT, , ,\
    +    RC_HANDLER( void, NAME, PARAMS, ; , PARAM_FORMAT, ; , ; ,\
         (xmlparseobject *)userData )
     
     #define INT_HANDLER( NAME, PARAMS, PARAM_FORMAT )\

Ciao, Jürgen

-- Jürgen Hermann (jhe@webde-ag.de) WEB.DE AG, Amalienbadstr.41, D-76227 Karlsruhe Tel.: 0721/94329-0, Fax: 0721/94329-22

From jarek@sonic.net Wed May 31 00:11:18 2000 From: jarek@sonic.net (Jarek Wilkiewicz) Date: Tue, 30 May 2000 16:11:18 -0700 Subject: [XML-SIG] 0.5.4 and documentType Message-ID: <00c601bfca8c$5d2548e0$010a0a0a@nonia>

Hello,

I tried creating a DOM tree from an xml file, and the document.documentType returns a None. Is implementation of the DocumentType missing from the current PyXML release, or am I doing something wrong?

Thanks, Jarek

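
For anyone trying the same lookup today, here is a minimal sketch. It assumes Python 3's standard-library xml.dom.minidom rather than the PyXML 0.5.4 builder asked about here, so it does not establish how that release behaved. In DOM Core the attribute on a Document node is named doctype, and it is None when the document carries no DOCTYPE declaration. The helper name doctype_name and the sample documents are made up for the example.

```python
from xml.dom import minidom

# Sample input with an internal DTD subset, so nothing external is needed.
WITH_DOCTYPE = """<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting>hello</greeting>"""

WITHOUT_DOCTYPE = "<greeting>hello</greeting>"


def doctype_name(xml_text):
    """Return the DOCTYPE name of a document, or None if it has none."""
    document = minidom.parseString(xml_text)
    # DOM Core names this attribute `doctype` (a DocumentType node or None).
    doctype = document.doctype
    return None if doctype is None else doctype.name


print(doctype_name(WITH_DOCTYPE))
print(doctype_name(WITHOUT_DOCTYPE))
```

A None result therefore has two possible readings in this API: the document has no DOCTYPE, or the builder in use does not record one.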