From tim_one@email.msn.com Mon May 1 07:31:05 2000 From: tim_one@email.msn.com (Tim Peters) Date: Mon, 1 May 2000 02:31:05 -0400 Subject: [Python-Dev] issues with int/long on 64bit platforms - eg stringobject (PR#306) In-Reply-To: Message-ID: <000001bfb336$d4f512a0$0f2d153f@tim> [Guido] > The email below is a serious bug report. A quick analysis > shows that UserString.count() calls the count() method on a string > object, which calls PyArg_ParseTuple() with the format string "O|ii". > The 'i' format code truncates integers. For people unfamiliar w/ the details, let's be explicit: the "i" code implicitly converts a Python int (== a C long) to a C int (which has no visible Python counterpart). Overflow is not detected, so this is broken on the face of it. > It probably should raise an overflow exception instead. Definitely. > But that would still cause the test to fail -- just in a different > way (more explicit). Then the string methods should be fixed to > use long ints instead -- and then something else would probably break... Yup. Seems inevitable. [MAL] > Since strings and Unicode objects use integers to describe the > length of the object (as well as most if not all other > builtin sequence types), the correct default value should > thus be something like sys.maxlen which then gets set to > INT_MAX. > > I'd suggest adding sys.maxlen and then modifying UserString.py, > re.py and sre_parse.py accordingly. I understand this, but hate it. I'd rather get rid of the user-visible distinction between the two int types already there, not add yet a third artificial int limit. [Guido] > Hm, I'm not so sure. It would be much better if passing sys.maxint > would just WORK... Since that's what people have been doing so far. [Trent Mick] > Possible solutions (I give 4 of them): > > 1. The 'i' format code could raise an overflow exception and the > PyArg_ParseTuple() call in string_count() could catch it and truncate to > INT_MAX (reasoning that any overflow of the end position of a > string can be bound to INT_MAX because that is the limit for any string > in Python). There's stronger reason than that: string_count's "start" and "end" arguments are documented as "interpreted as in slice notation", and slice notation with out-of-range indices is well defined in all cases: The semantics for a simple slicing are as follows. The primary must evaluate to a sequence object. The lower and upper bound expressions, if present, must evaluate to plain integers; defaults are zero and the sequence's length, respectively. If either bound is negative, the sequence's length is added to it. The slicing now selects all items with index k such that i <= k < j where i and j are the specified lower and upper bounds. This may be an empty sequence. It is not an error if i or j lie outside the range of valid indexes (such items don't exist so they aren't selected). (From the Ref Man's section "Slicings") That is, what string_count should do is perfectly clear already (or will be, when you read that two more times). Note that you need to distinguish between positive and negative overflow, though! > Pros: > - This "would just WORK" for usage of sys.maxint. > > Cons: > - This overflow exception catching should then reasonably be > propagated to other similar functions (like string.endswith(), etc). Absolutely, but they *all* follow from what "sequence slicing" is *always* supposed to do in case of out-of-bounds indices.
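[Editorial aside, not part of the message: a minimal demonstration of the slice rule Tim quotes, assuming the 1.6-era string methods. Out-of-range bounds are silently clamped, so by the same rule an oversized start/end argument should simply work (10**9 stands in for an arbitrarily huge bound):

    >>> s = "banana"
    >>> s[2:10**9]              # upper bound clamped to len(s)
    'nana'
    >>> s[-10**9:3]             # very negative bound clamped to 0
    'ban'
    >>> s.count("a", 0, 10**9)  # should follow the same clamping rule
    3
]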
> - We have to assume that the exception raised in the > PyArg_ParseTuple(args, "O|ii:count", &subobj, &i, &last) call is for > the second integer (i.e. 'last'). This is subtle and ugly. Luckily, it's not that simple: exactly the same treatment needs to be given to both the optional "start" and "end" arguments, and in every function that accepts optional slice indices. So you write one utility function to deal with all that, called in case PyArg_ParseTuple raises an overflow error. > Pro or Con: > - Do we want to start raising overflow exceptions for other conversion > formats (i.e. 'b' and 'h' and 'l', the latter *can* overflow on > Win64 where sizeof(long) < sizeof(void*))? I think this is a good idea > in principle but may break code (even if it *does* identify bugs in that > code). The code this would break is already broken <0.1 wink>. > 2. Just change the definitions of the UserString methods to pass > a variable length argument list instead of default value parameters. > For example change UserString.count() from: >
> def count(self, sub, start=0, end=sys.maxint):
>     return self.data.count(sub, start, end)
>
> to:
>
> def count(self, *args):
>     return self.data.count(*args)
>
> The result is that the default value for 'end' is now set by > string_count() rather than by the UserString implementation: > ... This doesn't solve anything -- users can (& do) pass sys.maxint explicitly. That's part of what Guido means by "since that's what people have been doing so far". > ... > Cons: > - Does not fix the general problem of the (common?) usage of sys.maxint to > mean INT_MAX rather than the actual LONG_MAX (this matters on 64-bit > Unices). Anyone using sys.maxint to mean INT_MAX is fatally confused; passing sys.maxint as a slice index is not an instance of that confusion, it's just relying on the documented behavior of out-of-bounds slice indices. > 3. As MAL suggested: add something like sys.maxlen (set to INT_MAX) which > breaks the logical difference with sys.maxint (set to LONG_MAX): > ... I hate this (see above). > ... > 4. Add something like sys.maxlen, but set it to SIZET_MAX (c.f. > ANSI size_t type). It is probably not a biggie, but Python currently > makes the assumption that strings never exceed INT_MAX in length. It's not an assumption, it's an explicit part of the design: PyObject_VAR_HEAD declares ob_size to be an int. This leads to strain for sure, partly because the *natural* limit on sizes is derived from malloc (which indeed takes neither int nor long, but size_t), and partly because Python exposes no corresponding integer type. I vaguely recall that this was deliberate, with the *intent* being to save space in object headers on the upcoming 128-bit KSR machines. > While this assumption is not likely to be proven false it technically > could be on 64-bit systems. Well, Guido once said he would take away Python's recursion overflow checks just as soon as machines came with infinite memory -- 2Gb is a reasonable limit for string length, and especially if it's a tradeoff against increasing the header size for all string objects (it's almost certainly more important to cater to oodles of small strings on smaller machines than to one or two gigantic strings on huge machines). > As well, when you start compiling on Win64 (where sizeof(int) == > sizeof(long) < sizeof(size_t)) then you are going to be annoyed > by hundreds of warnings about implicit casts from size_t (64-bits) to > int (32-bits) for every strlen, str*, fwrite, and sizeof call that > you make.
Every place the code implicitly downcasts from size_t to int is plainly broken today, so we *should* get warnings. Python has been sloppy about this! In large part it's because Python was written before ANSI C, and size_t simply wasn't supported at the time. But as with all software, people rarely go back to clean up; it's overdue (just be thankful you're not working on the Perl source <0.9 wink>). > Pros: > - IMHO logically more correct. > - Might clean up some subtle bugs. > - Cleans up annoying and disconcerting warnings. > - Will probably mean less pain down the road as 64-bit systems > (esp. Win64) become more prevalent. > > Cons: > - Lots of coding changes. > - As Guido said: "and then something else would probably break". > (Though, on current 32-bit systems, there should be no effective > change). Only 64-bit systems should be affected and, I would hope, > the effect would be a clean up. I support this as a long-term solution, perhaps for P3K. Note that ob_refcnt should also be declared size_t (no overflow-checking is done on refcounts today; the theory is that a refcount can't possibly get bigger than the total # of pointers in the system, and so if you declare ob_refcnt to be large enough to hold that, refcount overflow is impossible; but, in theory, this argument has been broken on every machine where sizeof(int) < sizeof(void*)). > I apologize for not being succinct. Humbug -- it was a wonderfully concise posting, Trent! The issues are simply messy. > Note that I am volunteering here. Opinions and guidance please. Alas, the first four letters in "guidance" spell out four-fifths of the only one able to give you that. opinions-are-fun-but-don't-count-ly y'rs - tim From mal@lemburg.com Mon May 1 11:55:52 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 01 May 2000 12:55:52 +0200 Subject: [Python-Dev] issues with int/long on 64bit platforms - eg stringobject (PR#306) References: <000001bfb336$d4f512a0$0f2d153f@tim> Message-ID: <390D62B8.15331407@lemburg.com> I've just posted a simple patch to the patches list which implements the idea I posted earlier: Silent truncation still takes place, but in a somewhat more natural way ;-) ...

    /* Silently truncate to INT_MAX/INT_MIN to make passing sys.maxint
       to 'i' parser markers on 64-bit platforms work just like on
       32-bit platforms. Overflow errors are not raised. */
    else if (ival > INT_MAX)
            ival = INT_MAX;
    else if (ival < INT_MIN)
            ival = INT_MIN;
    *p = ival;

-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fdrake@acm.org Mon May 1 15:04:08 2000 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Mon, 1 May 2000 10:04:08 -0400 (EDT) Subject: [Python-Dev] At the interactive port In-Reply-To: References: Message-ID: <14605.36568.455646.598506@seahag.cnri.reston.va.us> Moshe Zadka writes: > 1. I'm not sure what to call this function. Currently, I call it > __print_expr__, but I'm not sure it's a good name It's not. ;) How about printresult? Another thing to think about is interface; formatting a result and "printing" it may be different, and you may want to overload them separately in an environment like IDLE. Some people may want to just say:

    import sys
    sys.formatresult = str

I'm inclined to think that level of control may be better left to the application; if one hook is provided as you've described, the application can build different layers as appropriate. > 2.
I haven't yet supplied a default in __builtin__, so the user *must* > override this. This is unacceptable, of course. You're right! But a default is easy enough to add. I'd put it in sys instead of __builtin__ though. -Fred -- Fred L. Drake, Jr. Corporation for National Research Initiatives From Moshe Zadka Mon May 1 15:19:46 2000 From: Moshe Zadka (Moshe Zadka) Date: Mon, 1 May 2000 17:19:46 +0300 (IDT) Subject: [Python-Dev] At the interactive port In-Reply-To: <14605.36568.455646.598506@seahag.cnri.reston.va.us> Message-ID: On Mon, 1 May 2000, Fred L. Drake, Jr. wrote: > It's not. ;) How about printresult? Hmmmm... better than mine at least. > import sys > sys.formatresult = str And where does the "don't print if it's None" enter? I doubt if there is a really good way to divide functionality. Of course, specific IDEs may provide their own hooks. > You're right! But a default is easy enough to add. I agree. It was more to spur discussion -- with the advantage that there is already a way to include Python sessions. > I'd put it in > sys instead of __builtin__ though. Hmmm.. that's a Guido Issue(TM). Guido? -- Moshe Zadka http://www.oreilly.com/news/prescod_0300.html http://www.linux.org.il -- we put the penguin in .com From fdrake@acm.org Mon May 1 16:19:10 2000 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Mon, 1 May 2000 11:19:10 -0400 (EDT) Subject: [Python-Dev] documentation for new modules Message-ID: <14605.41070.290137.787832@seahag.cnri.reston.va.us> The "winreg" module needs some documentation; is anyone here up to the task? I don't think I know enough about the registry to write something reasonable. -Fred -- Fred L. Drake, Jr. Corporation for National Research Initiatives From fdrake@acm.org Mon May 1 16:23:06 2000 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Mon, 1 May 2000 11:23:06 -0400 (EDT) Subject: [Python-Dev] documentation for new modules In-Reply-To: <14605.41070.290137.787832@seahag.cnri.reston.va.us> References: <14605.41070.290137.787832@seahag.cnri.reston.va.us> Message-ID: <14605.41306.146320.597637@seahag.cnri.reston.va.us> I wrote: > The "winreg" module needs some documentation; is anyone here up to > the task? I don't think I know enough about the registry to write > something reasonable. Of course, as soon as I sent this message I remembered that there's also the linuxaudiodev module; that needs documentation as well! (I guess I'll need to add a Linux-specific chapter; ugh.) If anyone wants to document audiodev, perhaps I could avoid the Linux chapter (with one module) by adding documentation for the portable interface. There's also the pyexpat module; Andrew/Paul, did one of you want to contribute something for that? -Fred -- Fred L. Drake, Jr. Corporation for National Research Initiatives From guido@python.org Mon May 1 16:26:44 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 11:26:44 -0400 Subject: [Python-Dev] documentation for new modules In-Reply-To: Your message of "Mon, 01 May 2000 11:23:06 EDT." <14605.41306.146320.597637@seahag.cnri.reston.va.us> References: <14605.41070.290137.787832@seahag.cnri.reston.va.us> <14605.41306.146320.597637@seahag.cnri.reston.va.us> Message-ID: <200005011526.LAA20332@eric.cnri.reston.va.us> > > The "winreg" module needs some documentation; is anyone here up to > > the task? I don't think I know enough about the registry to write > > something reasonable. Maybe you could adapt the documentation for the registry functions in Mark Hammond's win32all?
Not all the APIs are the same but they should mostly do the same thing... > Of course, as soon as I sent this message I remembered that there's > also the linuxaudiodev module; that needs documentation as well! (I > guess I'll need to add a Linux-specific chapter; ugh.) If anyone > wants to document audiodev, perhaps I could avoid the Linux chapter > (with one module) by adding documentation for the portable interface. There's also sunaudiodev. Is it documented? linuxaudiodev should be mostly the same. > There's also the pyexpat module; Andrew/Paul, did one of you want to > contribute something for that? I would hope so! --Guido van Rossum (home page: http://www.python.org/~guido/) From fdrake@acm.org Mon May 1 17:17:06 2000 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Mon, 1 May 2000 12:17:06 -0400 (EDT) Subject: [Python-Dev] documentation for new modules In-Reply-To: <200005011526.LAA20332@eric.cnri.reston.va.us> References: <14605.41070.290137.787832@seahag.cnri.reston.va.us> <14605.41306.146320.597637@seahag.cnri.reston.va.us> <200005011526.LAA20332@eric.cnri.reston.va.us> Message-ID: <14605.44546.568978.296426@seahag.cnri.reston.va.us> Guido van Rossum writes: > Maybe you could adapt the documentation for the registry functions in > Mark Hammond's win32all? Not all the APIs are the same but they should > mostly do the same thing... I'll take a look at it when I have time, unless anyone beats me to it. > There's also sunaudiodev. Is it documented? linuxaudiodev should be > mostly the same. It's been documented for a long time. -Fred -- Fred L. Drake, Jr. Corporation for National Research Initiatives From guido@python.org Mon May 1 19:02:32 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 14:02:32 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Sat, 29 Apr 2000 09:18:05 CDT." <390AEF1D.253B93EF@prescod.net> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> Message-ID: <200005011802.OAA21612@eric.cnri.reston.va.us> [Guido] > > And this is exactly why encodings will remain important: entities > > encoded in ISO-2022-JP have no compelling reason to be recoded > > permanently into ISO10646, and there are lots of forces that make it > > convenient to keep it encoded in ISO-2022-JP (like existing tools). [Paul] > You cannot recode an ISO-2022-JP document into ISO10646 because 10646 is > a character *set* and not an encoding. ISO-2022-JP says how you should > represent characters in terms of bits and bytes. ISO10646 defines a > mapping from integers to characters. OK. I really meant recoding in UTF-8 -- I maintain that there are lots of forces that prevent recoding most ISO-2022-JP documents in UTF-8. > They are both important, but separate. I think that this automagical > re-encoding conflates them. Who is proposing any automagical re-encoding? Are you sure you understand what we are arguing about? *I* am not even sure what we are arguing about. I am simply saying that 8-bit strings (literals or otherwise) in Python have always been able to contain encoded strings. Earlier, you quoted some reference documentation that defines 8-bit strings as containing characters. That's taken out of context -- this was written in a time when there was (for most people anyway) no difference between characters and bytes, and I really meant bytes.
There's plenty of use of 8-bit Python strings for non-character uses so your "proof" that 8-bit strings should contain "characters" according to your definition is invalid. --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Mon May 1 19:05:33 2000 From: tree@basistech.com (Tom Emerson) Date: Mon, 1 May 2000 14:05:33 -0400 (EDT) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <200005011802.OAA21612@eric.cnri.reston.va.us> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> Message-ID: <14605.51053.369016.283239@cymru.basistech.com> Guido van Rossum writes: > OK. I really meant recoding in UTF-8 -- I maintain that there are > lots of forces that prevent recoding most ISO-2022-JP documents in > UTF-8. Such as? -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From: Fredrik Lundh References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <14605.51053.369016.283239@cymru.basistech.com> Message-ID: <009f01bfb39c$a603cc00$34aab5d4@hagrid> Tom Emerson wrote: > Guido van Rossum writes: > > OK. I really meant recoding in UTF-8 -- I maintain that there are > > lots of forces that prevent recoding most ISO-2022-JP documents in > > UTF-8. > > Such as? ISO-2022-JP includes language/locale information, UTF-8 doesn't. if you just recode the character codes, you'll lose important information. From tree@basistech.com Mon May 1 19:42:40 2000 From: tree@basistech.com (Tom Emerson) Date: Mon, 1 May 2000 14:42:40 -0400 (EDT) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <009f01bfb39c$a603cc00$34aab5d4@hagrid> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <14605.51053.369016.283239@cymru.basistech.com> <009f01bfb39c$a603cc00$34aab5d4@hagrid> Message-ID: <14605.53280.55595.335112@cymru.basistech.com> Fredrik Lundh writes: > ISO-2022-JP includes language/locale information, UTF-8 doesn't. if > you just recode the character codes, you'll lose important information. So encode them using the Plane 14 language tags. I won't start with whether language/locale should be encoded in a character encoding... 8-) -tree -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From guido@python.org Mon May 1 19:52:04 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 14:52:04 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Mon, 01 May 2000 14:05:33 EDT." <14605.51053.369016.283239@cymru.basistech.com> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <14605.51053.369016.283239@cymru.basistech.com> Message-ID: <200005011852.OAA21973@eric.cnri.reston.va.us> > Guido van Rossum writes: > > OK.
I really meant recoding in UTF-8 -- I maintain that there are > > lots of forces that prevent recoding most ISO-2022-JP documents in > > UTF-8. [Tom Emerson] > Such as? The standard forces that work against all change -- existing tools, user habits, compatibility, etc. --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Mon May 1 19:46:04 2000 From: tree@basistech.com (Tom Emerson) Date: Mon, 1 May 2000 14:46:04 -0400 (EDT) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <200005011852.OAA21973@eric.cnri.reston.va.us> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <14605.51053.369016.283239@cymru.basistech.com> <200005011852.OAA21973@eric.cnri.reston.va.us> Message-ID: <14605.53484.225980.235301@cymru.basistech.com> Guido van Rossum writes: > The standard forces that work against all change -- existing tools, > user habits, compatibility, etc. Ah... I misread your original statement, which I took to be a technical reason why one couldn't convert ISO-2022-JP to UTF-8. Of course one cannot expect everyone to switch en masse to a new encoding, pulling their existing documents with them. I'm in full agreement there. -tree -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From paul@prescod.net Mon May 1 21:38:29 2000 From: paul@prescod.net (Paul Prescod) Date: Mon, 01 May 2000 15:38:29 -0500 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> Message-ID: <390DEB45.D8D12337@prescod.net> Uche asked for a summary so I cc:ed the xml-sig. Guido van Rossum wrote: > > ... > > OK. I really meant recoding in UTF-8 -- I maintain that there are > lots of forces that prevent recoding most ISO-2022-JP documents in > UTF-8. Absolutely agree. > Are you sure you understand what we are arguing about? Here's what I thought we were arguing about: If you put a bunch of "funny characters" into a Python string literal, and then compare that string literal against a Unicode object, should those funny characters be treated as logical units of text (characters) or as bytes? And if bytes, should some transformation be automatically performed to have those bytes be reinterpreted as characters according to some particular encoding scheme (probably UTF-8). I claim that we should *as far as possible* treat strings as character lists and not add any new functionality that depends on them being byte lists. Ideally, we could add a byte array type and start deprecating the use of strings in that manner. Yes, it will take a long time to fix this bug but that's what happens when good software lives a long time and the world changes around it. > Earlier, you quoted some reference documentation that defines 8-bit > strings as containing characters. That's taken out of context -- this > was written in a time when there was (for most people anyway) no > difference between characters and bytes, and I really meant bytes. Actually, I think that that was Fredrik. Anyhow, you wrote the documentation that way because it was the most intuitive way of thinking about strings. It remains the most intuitive way.
I think that that was the point Fredrik was trying to make. We can't make "byte-list" strings go away soon but we can start moving people towards the "character-list" model. In concrete terms I would suggest that old fashioned strings be automatically coerced to Unicode by interpreting each byte as a Unicode character. Trying to go the other way could cause the moral equivalent of an OverflowError but that's not a problem.

    >>> a=1000000000000000000000000000000000000L
    >>> int(a)
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    OverflowError: long int too long to convert

And just as with ints and longs, we would expect to eventually unify strings and unicode strings (but not byte arrays). -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html From guido@python.org Mon May 1 22:32:38 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 17:32:38 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Mon, 01 May 2000 15:38:29 CDT." <390DEB45.D8D12337@prescod.net> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> Message-ID: <200005012132.RAA23319@eric.cnri.reston.va.us> > > Are you sure you understand what we are arguing about? > > Here's what I thought we were arguing about: > > If you put a bunch of "funny characters" into a Python string literal, > and then compare that string literal against a Unicode object, should > those funny characters be treated as logical units of text (characters) > or as bytes? And if bytes, should some transformation be automatically > performed to have those bytes be reinterpreted as characters according > to some particular encoding scheme (probably UTF-8). > > I claim that we should *as far as possible* treat strings as character > lists and not add any new functionality that depends on them being byte > lists. Ideally, we could add a byte array type and start deprecating the > use of strings in that manner. Yes, it will take a long time to fix this > bug but that's what happens when good software lives a long time and the > world changes around it. > > > Earlier, you quoted some reference documentation that defines 8-bit > > strings as containing characters. That's taken out of context -- this > > was written in a time when there was (for most people anyway) no > > difference between characters and bytes, and I really meant bytes. > > Actually, I think that that was Fredrik. Yes, I came across the post again later. Sorry. > Anyhow, you wrote the documentation that way because it was the most > intuitive way of thinking about strings. It remains the most intuitive > way. I think that that was the point Fredrik was trying to make. I just wish he made the point more eloquently. The eff-bot seems to be in a crunchy mood lately... > We can't make "byte-list" strings go away soon but we can start moving > people towards the "character-list" model. In concrete terms I would > suggest that old fashioned strings be automatically coerced to Unicode by > interpreting each byte as a Unicode character. Trying to go the other > way could cause the moral equivalent of an OverflowError but that's not > a problem.
> > >>> a=1000000000000000000000000000000000000L
> > >>> int(a)
> > Traceback (innermost last):
> >   File "<stdin>", line 1, in ?
> > OverflowError: long int too long to convert
> >
> > And just as with ints and longs, we would expect to eventually unify
> > strings and unicode strings (but not byte arrays).

OK, you've made your claim -- like Fredrik, you want to interpret 8-bit strings as Latin-1 when converting (not just comparing!) them to Unicode. I don't think I've heard a good *argument* for this rule though. "A character is a character is a character" sounds like an axiom to me -- something you can't prove or disprove rationally. I have a bunch of good reasons (I think) for liking UTF-8: it allows you to convert between Unicode and 8-bit strings without losses, Tcl uses it (so displaying Unicode in Tkinter *just* *works*...), it is not Western-language-centric. Another reason: while you may claim that your (and /F's, and Just's) preferred solution doesn't enter into the encodings issue, I claim it does: Latin-1 is just as much an encoding as any other one. I claim that as long as we're using an encoding we might as well use the most accepted 8-bit encoding of Unicode as the default encoding. I also think that the issue is blown out of proportions: this ONLY happens when you use Unicode objects, and it ONLY matters when some other part of the program uses 8-bit string objects containing non-ASCII characters. Given the long tradition of using different encodings in 8-bit strings, at that point it is anybody's guess what encoding is used, and UTF-8 is a better guess than Latin-1. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Mon May 1 23:17:17 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 18:17:17 -0400 Subject: [Python-Dev] At the interactive port In-Reply-To: Your message of "Sat, 29 Apr 2000 21:09:40 +0300." References: Message-ID: <200005012217.SAA23503@eric.cnri.reston.va.us> > Continuing the recent debate about what is appropriate to the interactive > prompt printing, and the wide agreement that whatever we decide, users > might think otherwise, I've written up a patch to have the user control > via a function in __builtin__ the way things are printed at the prompt. > This is not patches@python level stuff for two reasons: > > 1. I'm not sure what to call this function. Currently, I call it > __print_expr__, but I'm not sure it's a good name > > 2. I haven't yet supplied a default in __builtin__, so the user *must* > override this. This is unacceptable, of course. > > I'd just like people to tell me if they think this is worth while, and if > there is anything I missed. Thanks for bringing this up again. I think it should be called sys.displayhook. The default could be something like

    import __builtin__

    def displayhook(obj):
        if obj is None:
            return
        __builtin__._ = obj
        sys.stdout.write("%s\n" % repr(obj))

to be nearly 100% compatible with current practice; or use str(obj) to do what most people would probably prefer. (Note that you couldn't do "%s\n" % obj because obj might be a tuple.)
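[Editorial aside illustrating Guido's closing parenthetical, not part of the message: the right operand of "%" is unpacked when it is a tuple, so the hook has to go through repr() or wrap the object in a 1-tuple; the exact TypeError wording varies by version:

    >>> t = (1, 2)
    >>> "%s\n" % t            # t is consumed as two format arguments
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: not all arguments converted
    >>> "%s" % (t,)           # a 1-tuple formats the tuple itself
    '(1, 2)'
]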
--Guido van Rossum (home page: http://www.python.org/~guido/) From: Fredrik Lundh References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> Message-ID: <017d01bfb3bc$c3734c00$34aab5d4@hagrid> Guido van Rossum wrote: > I just wish he made the point more eloquently. The eff-bot seems to > be in a crunchy mood lately... I've posted a few thousand messages on this topic, most of which seem to have been ignored. if you'd read all my messages, and seen all the replies, you'd be cranky too... > I don't think I've heard a good *argument* for this rule though. "A > character is a character is a character" sounds like an axiom to me -- > something you can't prove or disprove rationally. maybe, but it's a darn good axiom, and it's used by everyone else. Perl uses it, Tcl uses it, XML uses it, etc. see: http://www.python.org/pipermail/python-dev/2000-April/005218.html > I have a bunch of good reasons (I think) for liking UTF-8: it allows > you to convert between Unicode and 8-bit strings without losses, Tcl > uses it (so displaying Unicode in Tkinter *just* *works*...), it is > not Western-language-centric. the "Tcl uses it" is a red herring -- their internal implementation uses 16-bit integers, and the external interface works very hard to keep the "strings are character sequences" illusion. in other words, the length of a string is *always* the number of characters, the character at index i is *always* the i'th character in the string, etc. that's not true in Python 1.6a2. (as for Tkinter, you only have to add 2-3 lines of code to make it use 16-bit strings instead...) > Another reason: while you may claim that your (and /F's, and Just's) > preferred solution doesn't enter into the encodings issue, I claim it > does: Latin-1 is just as much an encoding as any other one. this is another red herring: my argument is that 8-bit strings should contain unicode characters, using unicode character codes. there should be only one character repertoire, and that repertoire is unicode. for a definition of these terms, see: http://www.python.org/pipermail/python-dev/2000-April/005225.html obviously, you can only store 256 different values in a single 8-bit character (just like you can only store 4294967296 different values in a single 32-bit int). to store larger values, use unicode strings (or long integers). conversion from a small type to a large type always works, conversion from a large type to a small one may result in an OverflowError. it has nothing to do with encodings. > I claim that as long as we're using an encoding we might as well use > the most accepted 8-bit encoding of Unicode as the default encoding. yeah, and I claim that it won't fly, as long as it breaks the "strings are character sequences" rule used by all other contemporary (and competing) systems. (if you like, I can post more "fun with unicode" messages ;-) and as I've mentioned before, there are (at least) two ways to solve this: 1. teach 8-bit strings about UTF-8 (this is how it's done in Tcl and Perl). make sure len(s) returns the number of characters in the string, make sure s[i] returns the i'th character (not necessarily starting at the i'th byte, and not necessarily one byte), etc.
to make this run reasonably fast, use as many implementation tricks as you can come up with (I've described three ways to implement this in an earlier post). 2. define 8-bit strings as holding an 8-bit subset of unicode: ord(s[i]) is a unicode character code, whether s is an 8-bit string or a unicode string. for alternative 1 to work, you need to add some way to explicitly work with binary strings (like it's done in Perl and Tcl). alternative 2 doesn't need that; 8-bit strings can still be used to hold any kind of binary data, as in 1.5.2. just keep in mind you cannot use all methods on such an object... > I also think that the issue is blown out of proportions: this ONLY > happens when you use Unicode objects, and it ONLY matters when some > other part of the program uses 8-bit string objects containing > non-ASCII characters. Given the long tradition of using different > encodings in 8-bit strings, at that point it is anybody's guess what > encoding is used, and UTF-8 is a better guess than Latin-1. I still think it's very unfortunate that you think that unicode strings are a special kind of strings. Perl and Tcl don't, so why should we? From gward@mems-exchange.org Mon May 1 23:40:18 2000 From: gward@mems-exchange.org (Greg Ward) Date: Mon, 1 May 2000 18:40:18 -0400 Subject: [Python-Dev] Comparison inconsistency with ExtensionClass Message-ID: <20000501184017.A1171@mems-exchange.org> Hi all -- I seem to have discovered an inconsistency in the semantics of object comparison between plain old Python instances and ExtensionClass instances. (I've cc'd python-dev because it looks as though one *could* blame Python for the inconsistency, but I don't really understand the guts of either Python or ExtensionClass enough to know.) Here's a simple script that shows the difference:

    class Simple:
        def __init__ (self, data):
            self.data = data

        def __repr__ (self):
            return "<%s at %x: %s>" % (self.__class__.__name__,
                                       id(self), `self.data`)

        def __cmp__ (self, other):
            print "Simple.__cmp__: self=%s, other=%s" % (`self`, `other`)
            return cmp (self.data, other)

    if __name__ == "__main__":
        v1 = 36
        v2 = Simple (36)
        print "v1 == v2?", (v1 == v2 and "yes" or "no")
        print "v2 == v1?", (v2 == v1 and "yes" or "no")
        print "v1 == v2.data?", (v1 == v2.data and "yes" or "no")
        print "v2.data == v1?", (v2.data == v1 and "yes" or "no")

If I run this under Python 1.5.2, then all the comparisons come out true and my '__cmp__()' method is called twice:

    v1 == v2? Simple.__cmp__: self=<Simple at ...: 36>, other=36
    yes
    v2 == v1? Simple.__cmp__: self=<Simple at ...: 36>, other=36
    yes
    v1 == v2.data? yes
    v2.data == v1? yes

The first one and the last two are obvious, but the second one only works thanks to a trick in PyObject_Compare():

    if (PyInstance_Check(v) || PyInstance_Check(w)) {
            ...
            if (!PyInstance_Check(v))
                    return -PyObject_Compare(w, v);
            ...
    }

However, if I make Simple an ExtensionClass:

    from ExtensionClass import Base

    class Simple (Base):

Then the "swap v and w and use w's comparison method" no longer works. Here's the output of the script with Simple as an ExtensionClass:

    v1 == v2? no
    v2 == v1? Simple.__cmp__: self=<Simple at ...: 36>, other=36
    yes
    v1 == v2.data? yes
    v2.data == v1? yes

It looks as though ExtensionClass would have to duplicate the trick in PyObject_Compare() that I quoted, since Python has no idea that ExtensionClass instances really should act like instances. This smells to me like a bug in ExtensionClass. Comments? BTW, I'm using the ExtensionClass provided with Zope 2.1.4.
Mostly tested with Python 1.5.2, but also under the latest CVS Python and we observed the same behaviour. Greg From mhammond@skippinet.com.au Tue May 2 00:45:02 2000 From: mhammond@skippinet.com.au (Mark Hammond) Date: Tue, 2 May 2000 09:45:02 +1000 Subject: [Python-Dev] documentation for new modules In-Reply-To: <14605.44546.568978.296426@seahag.cnri.reston.va.us> Message-ID: > Guido van Rossum writes: > > Maybe you could adapt the documentation for the > registry functions in > > Mark Hammond's win32all? Not all the APIs are the > same but they should > > mostly do the same thing... > > I'll take a look at it when I have time, unless anyone > beats me to > it. I wonder if that anyone could be me? :-) Note that all the win32api docs for the registry made it into docstrings - so winreg has OK documentation as it is... But I will try and put something together. It will need to be plain text or HTML, but I assume that is better than nothing! Give me a few days... Mark. From paul@prescod.net Tue May 2 01:19:20 2000 From: paul@prescod.net (Paul Prescod) Date: Mon, 01 May 2000 19:19:20 -0500 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> Message-ID: <390E1F08.EA91599E@prescod.net> Sorry for the long message. Of course you need only respond to that which is interesting to you. I don't think that most of it is redundant. Guido van Rossum wrote: > > ... > > OK, you've made your claim -- like Fredrik, you want to interpret > 8-bit strings as Latin-1 when converting (not just comparing!) them to > Unicode. If the user provides an explicit conversion function (e.g. UTF-8-decode) then of course we should use that function. Under my character is a character is a character model, this "conversion" is morally equivalent to ROT-13, strupr or some other text->text translation. So you could apply UTF-8-decode even to a Unicode string as long as each character in the string has ord()<256 (so that it could be interpreted as a character representation for a byte). > I don't think I've heard a good *argument* for this rule though. "A > character is a character is a character" sounds like an axiom to me -- > something you can't prove or disprove rationally. I don't see it as an axiom, but rather as a design decision you make to keep your language simple. Along the lines of "all values are objects" and (now) all integer values are representable with a single type. Are you happy with this?

    a="\244"
    b=u"\244"
    assert len(a)==len(b)
    assert ord(a[0])==ord(b[0])  # same thing, right?
    print b==a
    # Traceback (most recent call last):
    #   File "<stdin>", line 1, in ?
    # UnicodeError: UTF-8 decoding error: unexpected code byte

If I type "\244" it means I want character 244, not the first half of a UTF-8 escape sequence. "\244" is a string with one character. It has no encoding. It is not latin-1. It is not UTF-8. It is a string with one character and should compare as equal with another string with the same character. I would laugh my ass off if I was using Perl and it did something weird like this to me (as long as it didn't take a month to track down the bug!). Now it isn't so funny. > I have a bunch of good reasons (I think) for liking UTF-8: I'm not against UTF-8. It could be an internal representation for some Unicode objects.
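[Editorial aside, not from the thread: the two candidate default conversions spelled out with the explicit unicode() API of 1.6a2, applied to Paul's byte. Latin-1 maps every byte to the character with the same ordinal and always succeeds; UTF-8 rejects a lone high byte (output shown as in Paul's example; exact repr may vary by version):

    >>> unicode("\244", "latin-1")
    u'\xa4'
    >>> unicode("\244", "utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: UTF-8 decoding error: unexpected code byte
]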
> it allows > you to convert between Unicode and 8-bit strings without losses, Here's the heart of our disagreement: ****** I don't want, in Py3K, to think about "converting between Unicode and 8-bit strings." I want strings and I want byte-arrays and I want to worry about converting between *them*. There should be only one string type, its characters should all live in the Unicode character repertoire and the character numbers should all come from Unicode. "Special" characters can be assigned to the Unicode Private Use Area. Byte arrays would be entirely separate and would be converted to Unicode strings with explicit conversion functions. ***** In the meantime I'm just trying to get other people thinking in this mode so that the transition is easier. If I see people embedding UTF-8 escape sequences in literal strings today, I'm going to hit them. I recognize that we can't design the universe right now but we could agree on this direction and use it to guide our decision-making. By the way, if we DID think of 8-bit strings as essentially "byte arrays" then let's use that terminology and imagine some future documentation: "Python's string type is equivalent to a list of bytes. For clarity, we will call this type a byte list from now on. In contexts where a Unicode character-string is desired, Python automatically converts byte lists to character strings by doing a UTF-8 decode on them." What would you think if Java had a default (I say "magical") conversion from byte arrays to character strings? The only reason we are discussing this is because Python strings have a dual personality which was useful in the past but will (IMHO, of course) become increasingly confusing in the future. We want the best of both worlds without confusing anybody and I don't think that we can have it. If you want 8-bit strings to be really byte arrays in perpetuity then let's be consistent in that view. We can compare them to Unicode as we would two completely separate types. "U" comes after "S" so unicode strings always compare greater than 8-bit strings. The use of the word "string" for both objects can be considered just a historical accident. > Tcl uses it (so displaying Unicode in Tkinter *just* *works*...), Don't follow this entirely. Shouldn't the next version of TKinter accept and return Unicode strings? It would be rather ugly for two Unicode-aware systems (Python and TK) to talk to each other in 8-bit strings. I mean I don't care what you do at the C level but at the Python level arguments should be "just strings." Consider that len() on the TKinter side would return a different value than on the Python side. What about integral indexes into buffers? I'm totally ignorant about TKinter but let me ask wouldn't Tkinter say (e.g.) that the cursor is between the 5th and 6th character when in an 8-bit string the equivalent index might be the 11th or 12th byte? > it is not Western-language-centric. If you look at encoding efficiency it is. > Another reason: while you may claim that your (and /F's, and Just's) > preferred solution doesn't enter into the encodings issue, I claim it > does: Latin-1 is just as much an encoding as any other one. The fact that my proposal has the same effect as making Latin-1 the "default encoding" is a near-term side effect of the definition of Unicode. My long term proposal is to do away with the concept of 8-bit strings (and thus, conversions from 8-bit to Unicode) altogether. One string to rule them all!
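[Editorial aside: a rough sketch, not from the thread, of the separation Paul proposes, approximated with pieces that already exist; the names are illustrative only. Bytes live in an explicit byte container and become text only through a named decoding step:

    import array
    raw = array.array('b', "\244\245")         # bytes, carrying no character meaning
    text = unicode(raw.tostring(), "latin-1")  # characters, by an explicit choice
]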
Is Unicode going to be the canonical Py3K character set or will we have different objects for different character sets/encodings with different default (I say "magical") conversions between them? Such a design would not be entirely insane though it would be a PITA to implement and maintain. If we aren't ready to establish Unicode as the one true character set then we should probably make no special concessions for Unicode at all. Let a thousand string objects bloom! Even if we agreed to allow many string objects, byte==character should not be the default string object. Unicode should be the default. > I also think that the issue is blown out of proportions: this ONLY > happens when you use Unicode objects, and it ONLY matters when some > other part of the program uses 8-bit string objects containing > non-ASCII characters. Won't this be totally common? Most people are going to use 8-bit literals in their program text but work with Unicode data from XML parsers, COM, WebDAV, Tkinter, etc? > Given the long tradition of using different > encodings in 8-bit strings, at that point it is anybody's guess what > encoding is used, and UTF-8 is a better guess than Latin-1. If we are guessing then we are doing something wrong. My answer to the question of "default encoding" falls out naturally from a certain way of looking at text, popularized in various other languages and increasingly "the norm" on the Web. If you accept the model (a character is a character is a character), the right behavior is obvious. "\244"==u"\244" Nobody is ever going to have trouble understanding how this works. Choose simplicity! -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html From mhammond@skippinet.com.au Tue May 2 01:34:16 2000 From: mhammond@skippinet.com.au (Mark Hammond) Date: Tue, 2 May 2000 10:34:16 +1000 Subject: [Python-Dev] Neil Hodgson on python-dev? Message-ID: I'd like to propose that we invite Neil Hodgson to join the python-dev family. Neil is the author of the Scintilla editor control, now used by wxPython and Pythonwin... Smart guy, and very experienced with Python (scintilla was originally written because he had trouble converting Pythonwin to be a color'd editor :-) But most relevant at the moment is his Unicode experience. He worked for a long time with Fujitsu, working with Japanese and all the encoding issues there. I have heard him echo the exact sentiments of Andy. He is also in the process of polishing the recent Unicode support in Scintilla. As this Unicode debate seems to be going nowhere fast, and appears to simply need more people with _experience_, I think he would be valuable. Further, he is a pretty quiet guy - you won't find him offering his opinion on every post that moves through here :-) Thoughts? Mark. From guido@python.org Tue May 2 01:41:43 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 20:41:43 -0400 Subject: [Python-Dev] Neil Hodgson on python-dev? In-Reply-To: Your message of "Tue, 02 May 2000 10:34:16 +1000." References: Message-ID: <200005020041.UAA23648@eric.cnri.reston.va.us> > I'd like to propose that we invite Neil Hodgson to join the > python-dev family. Excellent! > As this Unicode debate seems to be going nowhere fast, and appears > to simply need more people with _experience_, I think he would be > valuable.
Further, he is a pretty quiet guy - you won't find him > offering his opinion on every post that moves through here :-) As long as he isn't too quiet on the Unicode thing ;-) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Tue May 2 01:53:26 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 20:53:26 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Mon, 01 May 2000 19:19:20 CDT." <390E1F08.EA91599E@prescod.net> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> Message-ID: <200005020053.UAA23665@eric.cnri.reston.va.us> Paul, we're both just saying the same thing over and over without convincing each other. I'll wait till someone who wasn't in this debate before chimes in. Have you tried using this? --Guido van Rossum (home page: http://www.python.org/~guido/) From: Fredrik Lundh References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> Message-ID: <002301bfb3d5$8fd57440$34aab5d4@hagrid> Paul Prescod wrote: > I would laugh my ass off if I was using Perl and it did something weird > like this to me. you don't have to -- in Perl 5.6, a character is a character... does anyone on this list follow the perl-porters list? was this as controversial over in Perl land as it appears to be over here? From tpassin@home.com Tue May 2 02:55:25 2000 From: tpassin@home.com (tpassin@home.com) Date: Mon, 1 May 2000 21:55:25 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate Message-ID: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> Guido van Rossum wrote, about how to represent strings: > Paul, we're both just saying the same thing over and over without > convincing each other. I'll wait till someone who wasn't in this > debate before chimes in. I'm with Paul and Fredrik on this one - at least about characters being the atoms of a string. We **have** to be able to refer to **characters** in a string, and without guessing. Otherwise, how could you ever construct a test, like theString[3]==[a particular japanese ideograph]? If we do it by having a "string" datatype, which is really a byte list, and a "unicodeString" datatype which is a list of abstract characters, I'd say everyone could get used to working with them. We'd have to supply conversion functions, of course. This route might be the easiest to understand for users. We'd have to be very clear about what file.read() would return, for example, and all those similar read and write functions. And we'd have to work out how real 8-bit calls (like writing to a socket?) would play with the new types. For extra clarity, we could leave string the way it is, introduce stringU (unicode string) **and** string8 (Latin-1 or byte list, whichever seems to be the best equivalent to the current string). Then we would deprecate string in favor of string8. Then if tcl and perl go to unicode strings we pass them a stringU, and if they go some other way, we pass them something else.
Come to think of it, we need some data type that will continue to work with C and C++. Would that be string8 or would we keep string for that purpose? Clarity and ease of use for the user should be primary, fast implementations next. If we didn't care about ease of use and clarity, we could all use Scheme or C; don't lose sight of it. I'd suggest we could create some use cases or scenarios for this area - needs input from those who know encodings and low level Python stuff better than I. Then we could examine more systematically how well various approaches would work out. Regards, Tom Passin From mhammond@skippinet.com.au Tue May 2 03:17:09 2000 From: mhammond@skippinet.com.au (Mark Hammond) Date: Tue, 2 May 2000 12:17:09 +1000 Subject: RE: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> Message-ID: > Guido van Rossum wrote, about how to represent strings: > > > Paul, we're both just saying the same thing over and > over without > > convincing each other. I'll wait till someone who > wasn't in this > > debate before chimes in. I've chimed in a little, but I'll chime in again :-) > I'm with Paul and Fredrik on this one - at least about > characters being the > atoms of a string. We **have** to be able to refer to > **characters** in a > string, and without guessing. Otherwise, how could you I see the point, and agree 100% with the intent. However, reality does bite. As far as I can see, the following are immutable:

* There will be 2 types - a string type and a Unicode type.
* History dictates that the string type may hold binary data.

Thus, it is clear that Python simply can not treat characters as the smallest atoms of strings. If I understand things correctly, this is key to Guido's point, and a bit of a communication block. The issue, to my mind, is how we handle these facts to produce "the principle of least surprise". We simply need to accept that Python 1.x will never be able to treat string objects as sequences of "characters" - only bytes. However, with my limited understanding of the full issues, it does appear that the proposal championed by Fredrik, Just and Paul is the best solution - not because it magically causes Python to treat strings as characters in all cases, but because it offers the principle of least surprise. As I said, I don't really have a deep enough understanding of the issues, so this is probably (hopefully!?) my last word on the matter - but that doesn't mean I don't share the concerns raised here... Mark. From guido@python.org Tue May 2 04:31:54 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 23:31:54 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> Message-ID: <200005020331.XAA23818@eric.cnri.reston.va.us> Tom Passin: > I'm with Paul and Fredrik on this one - at least about characters being the > atoms of a string. We **have** to be able to refer to **characters** in a > string, and without guessing. Otherwise, how could you ever construct a > test, like theString[3]==[a particular japanese ideograph]? If we do it by > having a "string" datatype, which is really a byte list, and a > "unicodeString" datatype which is a list of abstract characters, I'd say > everyone could get used to working with them. We'd have to supply > conversion functions, of course.
You seem unfamiliar with the details of the implementation we're proposing? We already have two datatypes, 8-bit string (call it byte array) and Unicode string. There are conversions between them: explicit conversions such as u.encode("utf-8") or unicode(s, "latin-1") and implicit conversions used in situations like u+s or u==s. The whole discussion is *only* about what the default conversion in the latter cases should be -- the rest of the implementation is rock solid and works well. Users can accomplish what you are proposing by simply ensuring that theString is a Unicode string. > This route might be the easiest to understand for users. We'd have to be > very clear about what file.read() would return, for example, and all those > similar read and write functions. And we'd have to work out how real 8-bit > calls (like writing to a socket?) would play with the new types. These are all well defined -- they all deal in 8-bit strings internally, and all use the default conversions when given Unicode strings. Programs that only deal in 8-bit strings don't need to change. Programs that want to deal with Unicode and sockets, for example, must know what encoding to use on the socket, and if it's not the default encoding, must use explicit conversions. > For extra clarity, we could leave string the way it is, introduce stringU > (unicode string) **and** string8 (Latin-1 or byte list, whichever seems to > be the best equivalent to the current string). Then we would deprecate > string in favor of string8. Then if tcl and perl go to unicode strings we > pass them a stringU, and if they go some other way, we pass them something > else. Come to think of it, we need some data type that will continue > to work with C and C++. Would that be string8 or would we keep string for > that purpose? What would be the difference between string and string8? > Clarity and ease of use for the user should be primary, fast implementations > next. If we didn't care about ease of use and clarity, we could all use > Scheme or C; don't lose sight of it. > > I'd suggest we could create some use cases or scenarios for this area - > needs input from those who know encodings and low level Python stuff better > than I. Then we could examine more systematically how well various > approaches would work out. Very good. Here's one usage scenario. A Japanese user is reading lines from a file encoded in ISO-2022-JP. The readline() method returns 8-bit strings in that encoding (the file object doesn't do any decoding). She realizes that she wants to do some character-level processing on the file so she decides to convert the strings to Unicode. I believe that whether the default encoding is UTF-8 or Latin-1 doesn't matter here -- both are wrong, she needs to write explicit unicode(line, "iso-2022-jp") code anyway. I would argue that UTF-8 is "better", because interpreting ISO-2022-JP data as UTF-8 will most likely give an exception (when a \300 range byte isn't followed by a \200 range byte) -- while interpreting it as Latin-1 will silently do the wrong thing. (An explicit error is always better than silent failure.) I'd love to discuss other scenarios. --Guido van Rossum (home page: http://www.python.org/~guido/) From Moshe Zadka Tue May 2 05:39:12 2000 From: Moshe Zadka (Moshe Zadka) Date: Tue, 2 May 2000 07:39:12 +0300 (IDT) Subject: [Python-Dev] At the interactive port In-Reply-To: <200005012217.SAA23503@eric.cnri.reston.va.us> Message-ID: > Thanks for bringing this up again. I think it should be called > sys.displayhook.
From Moshe Zadka Tue May 2 05:39:12 2000 From: Moshe Zadka (Moshe Zadka) Date: Tue, 2 May 2000 07:39:12 +0300 (IDT) Subject: [Python-Dev] At the interactive port In-Reply-To: <200005012217.SAA23503@eric.cnri.reston.va.us> Message-ID:

> Thanks for bringing this up again. I think it should be called > sys.displayhook.

That should be the easy part -- I'll do it as soon as I'm home.

> The default could be something like
>
> import __builtin__
> import sys  # Sorry, I couldn't resist
> def displayhook(obj):
>     if obj is None:
>         return
>     __builtin__._ = obj
>     sys.stdout.write("%s\n" % repr(obj))

This brings up a painful point -- the reason I haven't written the default is because it was much easier to write it in Python. Of course, I shouldn't be preaching Python-is-easier-to-write-than-C here, but it pains me that Python cannot be written with more Python and less C. A while ago we started talking about the mini-interpreter idea, which would then freeze Python code into itself, and then it sort of died out. What has become of it?

-- Moshe Zadka http://www.oreilly.com/news/prescod_0300.html http://www.linux.org.il -- we put the penguin in .com

From just@letterror.com Tue May 2 06:47:35 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 06:47:35 +0100 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <200005020331.XAA23818@eric.cnri.reston.va.us> References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> Message-ID:

At 11:31 PM -0400 01-05-2000, Guido van Rossum wrote: >Here's one usage scenario. > >A Japanese user is reading lines from a file encoded in ISO-2022-JP. >The readline() method returns 8-bit strings in that encoding (the file >object doesn't do any decoding). She realizes that she wants to do >some character-level processing on the file so she decides to convert >the strings to Unicode. > >I believe that whether the default encoding is UTF-8 or Latin-1 >doesn't matter here -- both are wrong, she needs to write explicit >unicode(line, "iso-2022-jp") code anyway. I would argue that UTF-8 is >"better", because interpreting ISO-2022-JP data as UTF-8 will most >likely give an exception (when a \300 range byte isn't followed by a >\200 range byte) -- while interpreting it as Latin-1 will silently do >the wrong thing. (An explicit error is always better than silent >failure.)

But then it's even better to *always* raise an exception, since it's entirely possible a string contains valid utf-8 while not *being* utf-8. I really think the exception argument is moot, since there can *always* be situations that will pass silently. Encoding issues are silent by nature -- e.g. there's no way any system can tell that interpreting MacRoman data as Latin-1 is wrong, maybe even fatal -- the user will just have to deal with it. You can argue what you want, but *any* multi-byte encoding stored in an 8-bit string is a buffer, not a string, for all the reasons Fredrik and Paul have thrown at you, and right they are. Choosing such an encoding as a default conversion to Unicode makes no sense at all.

Recap of the main arguments:

    pro UTF-8:   always reversible when going from Unicode to 8-bit
    con UTF-8:   not a string: confusing semantics
    pro Latin-1: simpler semantics
    con Latin-1: non-reversible, western-centric

Given the fact that very often *both* will be wrong, I'd go for the simpler semantics.

Just
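The reversibility lines of that recap, made concrete with a character that has no Latin-1 equivalent (a sketch, not taken from the original mail):

    u = u"\u20ac"                       # EURO SIGN
    s = u.encode("utf-8")               # always succeeds: '\xe2\x82\xac'
    assert unicode(s, "utf-8") == u     # UTF-8 round-trips any Unicode string
    u.encode("latin-1")                 # raises: no Latin-1 byte for u'\u20ac'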
From guido@python.org Tue May 2 05:51:45 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 00:51:45 -0400 Subject: [Python-Dev] At the interactive port In-Reply-To: Your message of "Tue, 02 May 2000 07:39:12 +0300." References: Message-ID: <200005020451.AAA23940@eric.cnri.reston.va.us>

> > import __builtin__
> > import sys  # Sorry, I couldn't resist
> > def displayhook(obj):
> >     if obj is None:
> >         return
> >     __builtin__._ = obj
> >     sys.stdout.write("%s\n" % repr(obj))
>
> This brings up a painful point -- the reason I haven't written the default > is because it was much easier to write it in Python. Of course, I > shouldn't be preaching Python-is-easier-to-write-than-C here, but it > pains me that Python cannot be written with more Python and less C.

But the C code on how to do it was present in the code you deleted from ceval.c!

> A while ago we started talking about the mini-interpreter idea, > which would then freeze Python code into itself, and then it sort of > died out. What has become of it?

Nobody sent me a patch :-(

--Guido van Rossum (home page: http://www.python.org/~guido/)

From nhodgson@bigpond.net.au Tue May 2 06:04:12 2000 From: nhodgson@bigpond.net.au (Neil Hodgson) Date: Tue, 2 May 2000 15:04:12 +1000 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com><002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> Message-ID: <035501bfb3f3$db87fb10$e3cb8490@neil>

I'm dropping in a bit late in this thread but can the current problem be summarised in an example as "how is 'literal' interpreted here"?

    s = aUnicodeStringFromSomewhere
    DoSomething(s + "")

The two options being that literal is either assumed to be encoded in Latin-1 or UTF-8. I can see some arguments for both sides. Latin-1: more current code was written in a European locale with an implicit assumption that all string handling was Latin-1. Current editors are more likely to be displaying literal as it is meant to be interpreted. UTF-8: all languages can be written in UTF-8 and more recent editors can display this correctly. Thus people using non-Roman alphabets can write code which is interpreted as is seen with no need to remember to call conversion functions.

Neil

From tpassin@home.com Tue May 2 06:07:07 2000 From: tpassin@home.com (tpassin@home.com) Date: Tue, 2 May 2000 01:07:07 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <200005020331.XAA23818@eric.cnri.reston.va.us> Message-ID: <006101bfb3f4$454f99e0$7cac1218@reston1.va.home.com>

Guido van Rossum said > What would be the difference between string and string8?

Probably none, except to alert people that string8 might have different behavior than the present-day string, perhaps when interacting with unicode - probably its behavior would be specified more tightly (i.e., is it strictly a list of bytes or does it have some assumption about encoding?) or changed in some way from what we have now. Or if it turned out that a lot of programmers in other languages (perl, tcl, perhaps?) expected "string" to behave in particular ways, the use of a term like "string8" might reduce confusion. Possibly none of these apply - no need for "string8" then.

> > > Clarity and ease of use for the user should be primary, fast implementations > > next. If we didn't care about ease of use and clarity, we could all use > > Scheme or c, don't lose sight of it. > > > > I'd suggest we could create some use cases or scenarios for this area - > > needs input from those who know encodings and low level Python stuff better > > than I.
> > Then we could examine more systematically how well various > > approaches would work out. > > Very good. >

Tom Passin

From Fredrik Lundh" <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <035501bfb3f3$db87fb10$e3cb8490@neil> Message-ID: <003b01bfb404$03cd0560$34aab5d4@hagrid>

Neil Hodgson wrote: > I'm dropping in a bit late in this thread but can the current problem be > summarised in an example as "how is 'literal' interpreted here"? >
> s = aUnicodeStringFromSomewhere
> DoSomething(s + "")

nope. the whole discussion centers around what happens if you type:

    # example 1
    u = aUnicodeStringFromSomewhere
    s = an8bitStringFromSomewhere

    DoSomething(s + u)

and

    # example 2
    u = aUnicodeStringFromSomewhere
    s = an8bitStringFromSomewhere

    if len(u) + len(s) == len(u + s):
        print "true"
    else:
        print "not true"

in Guido's design, the first example may or may not result in an "UTF-8 decoding error: UTF-8 decoding error: unexpected code byte" exception. the second example may result in a similar error, print "true", or print "not true", depending on the contents of the 8-bit string. (under the counter proposal, the first example will never raise an exception, and the second will always print "true") ... the string literal issue is a slightly different problem.

> The two options being that literal is either assumed to be encoded in > Latin-1 or UTF-8. I can see some arguments for both sides.

better make that "two options", not "the two options" ;-)

a more flexible scheme would be to borrow the design from XML (see http://www.w3.org/TR/1998/REC-xml-19980210). for those who haven't looked closer at XML, it basically treats the source file as an encoded unicode character stream, and does all processing on the decoded side. replace "entity" with "script file" in the following excerpts, and you get close:

    section 2.2: A parsed entity contains text, a sequence of characters,
    which may represent markup or character data. A character is an atomic
    unit of text as specified by ISO/IEC 10646.

    section 4.3.3: Each external parsed entity in an XML document may use
    a different encoding for its characters. All XML processors must be
    able to read entities in either UTF-8 or UTF-16. Entities encoded in
    UTF-16 must begin with the Byte Order Mark /.../ XML processors must
    be able to use this character to differentiate between UTF-8 and
    UTF-16 encoded documents. Parsed entities which are stored in an
    encoding other than UTF-8 or UTF-16 must begin with a text declaration
    containing an encoding declaration.

(also see appendix F: Autodetection of Character Encodings)

I propose that we adopt a similar scheme for Python -- but not in 1.6. the current "dunno, so we just copy the characters" is good enough for now...
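What the two defaults mean for example 2, sketched with a byte value that cannot stand alone in UTF-8 (the values are invented for illustration):

    u = u"ab"
    s = "\xe9"
    # Latin-1 default: s widens byte-for-byte to u"\xe9", so
    # len(u) + len(s) == len(u + s) always holds
    len(u) + len(s) == len(u + s)
    # UTF-8 default: u + s must first decode s, and a lone '\xe9'
    # raises a UTF-8 decoding error; other byte contents may instead
    # decode silently to a different length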
From tim_one@email.msn.com Tue May 2 08:20:52 2000 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 2 May 2000 03:20:52 -0400 Subject: [Python-Dev] fun with unicode, part 1 In-Reply-To: <200004271523.LAA13614@eric.cnri.reston.va.us> Message-ID: <000201bfb406$f2f35520$df2d153f@tim>

[Guido asks good questions about how Windows deals w/ Unicode filenames, last Thursday, but gets no answers]

> ... > I'd like to solve this problem, but I have some questions: what *IS* > the encoding used for filenames on Windows? This may differ per > Windows version; perhaps it can differ per drive letter? Or per > application or per thread? On Windows NT, filenames are supposed to > be Unicode. (I suppose also on Windows 2000?) How do I open a file > with a given Unicode string for its name, in a C program? I suppose > there's a Win32 API call for that which has a Unicode variant. > > On Windows 95/98, the Unicode variants of the Win32 API calls don't > exist. So what is the poor Python runtime to do there? > > Can Japanese people use Japanese characters in filenames on Windows > 95/98? Let's assume they can. Since the filesystem isn't Unicode > aware, the filenames must be encoded. Which encoding is used? Let's > assume they use Microsoft's multibyte encoding. If they put such a > file on a floppy and ship it to Linköping, what will Fredrik see as > the filename? (I.e., is the encoding fixed by the disk volume, or by > the operating system?) > > Once we have a few answers here, we can solve the problem. Note that > sometimes we'll have to refuse a Unicode filename because there's no > mapping for some of the characters it contains in the filename > encoding used.

I just thought I'd repeat the questions. However, I don't think you'll really want the answers -- Windows is a legacy-encrusted mess, and there are always many ways to get a thing done in the end. For example ...

> Question: how does Fredrik create a file with a Euro > character (u'\u20ac') in its name?

This particular one is shallower than you were hoping: in many of the TrueType fonts (e.g., Courier New but not Courier), Windows extended its Latin-1 encoding by mapping the Euro symbol to the "control character" 0x80. So I can get a Euro symbol into a file name just by typing Alt+0+1+2+8. This is true even on US Win98 (which has no visible Unicode support) -- but was not supported in US Win95.

i've-been-tracking-down-what-appears-to-be-a-hw-bug-on-a-japanese-laptop- at-work-so-can-verify-ms-sure-got-japanese-characters-into-the- filenames-somehow-but-doubt-it's-via-unicode-ly y'rs - tim

From Fredrik Lundh" Message-ID: <007d01bfb40b$d7693720$34aab5d4@hagrid>

Tim Peters wrote: > [Guido asks good questions about how Windows deals w/ Unicode filenames, > last Thursday, but gets no answers]

you missed Finn Bock's post on how Java does it. here's another data point: Tcl uses a system encoding to convert from unicode to a suitable system API encoding, and uses the following approach to figure out what that one is:

    windows NT/2000: unicode (use wide api)
    windows 95/98:   "cp%d" % GetACP() (note that this is "cp1252" in
                     us and western europe, not "iso-8859-1")
    macintosh:       determine encoding for fontId 0 based on (script,
                     smScriptLanguage) tuple. if that fails, assume
                     "macroman"
    unix:            figure out the locale from LC_ALL, LC_CTYPE, or
                     LANG. use heuristics to map from the locale to an
                     encoding (see unix/tclUnixInit). if that fails,
                     assume "iso-8859-1"

I propose adding a similar mechanism to Python, along these lines: sys.getdefaultencoding() returns the right thing for windows and macintosh, "iso-8859-1" for other platforms. sys.setencoding(codec) changes the system encoding. it's used from site.py to set things up properly on unix and other non-unicode platforms.
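A sketch of the site.py half of that proposal. sys.setencoding and sys.getdefaultencoding are only the names proposed above, not a released API, and the locale-to-codec mapping here is a deliberately crude stand-in for Tcl's heuristics:

    # hypothetical site.py fragment for unix and other non-unicode platforms
    import os, sys
    if hasattr(sys, "setencoding"):
        lang = os.environ.get("LC_ALL") or os.environ.get("LC_CTYPE") \
               or os.environ.get("LANG") or ""
        if lang[-5:] in (".utf8", "UTF-8"):
            sys.setencoding("utf-8")
        else:
            sys.setencoding("iso-8859-1")    # the documented fallback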
From nhodgson@bigpond.net.au Tue May 2 09:22:36 2000 From: nhodgson@bigpond.net.au (Neil Hodgson) Date: Tue, 2 May 2000 18:22:36 +1000 Subject: [Python-Dev] fun with unicode, part 1 References: <000201bfb406$f2f35520$df2d153f@tim> Message-ID: <004501bfb40f$92ff0980$e3cb8490@neil>

> > I'd like to solve this problem, but I have some questions: what *IS* > > the encoding used for filenames on Windows? This may differ per > > Windows version; perhaps it can differ per drive letter? Or per > > application or per thread? On Windows NT, filenames are supposed to > > be Unicode. (I suppose also on Windows 2000?) How do I open a file > > with a given Unicode string for its name, in a C program? I suppose > > there's a Win32 API call for that which has a Unicode variant.

It's decided by each file system. For FAT file systems, the OEM code page is used. The OEM code page generally used in the United States is code page 437 which is different from the code page Windows uses for display. I had to deal with this in a system where people used fractions (1/4, 1/2 and 3/4) as part of names which had to be converted into valid file names. For example 1/4 is 0xBC for display but 0xAC when used in a file name. In Japan, I think different manufacturers used different encodings with NEC trying to maintain market control with their own encoding. VFAT stores both Unicode long file names and shortened aliases. However, the Unicode variant is hard to get to from Windows 95/98. NTFS stores Unicode.

> > On Windows 95/98, the Unicode variants of the Win32 API calls don't > > exist. So what is the poor Python runtime to do there?

Fail the call. All existing files can be opened because they have short non-Unicode aliases. If a file with a Unicode name cannot be created because the OS doesn't support it then you should give up. Just as you should give up if you try to save a file with a name that includes a character not allowed by the file system.

> > Can Japanese people use Japanese characters in filenames on Windows > > 95/98?

Yes.

> > Let's assume they can. Since the filesystem isn't Unicode > > aware, the filenames must be encoded. Which encoding is used? Let's > > assume they use Microsoft's multibyte encoding. If they put such a > > file on a floppy and ship it to Linköping, what will Fredrik see as > > the filename? (I.e., is the encoding fixed by the disk volume, or by > > the operating system?)

If Fredrik is running a non-Japanese version of Windows 9x, he will see some 'random' western characters replacing the Japanese.

Neil

From Fredrik Lundh" <004501bfb40f$92ff0980$e3cb8490@neil> Message-ID: <008501bfb411$8e0502c0$34aab5d4@hagrid>

Neil Hodgson wrote: > It's decided by each file system.

...but the system API translates from the active code page to the encoding used by the file system, right? on my w95 box, GetACP() returns 1252, and GetOEMCP() returns 850. if I create a file with a name containing latin-1 characters, on a FAT drive, it shows up correctly in the file browser (cp1252), and also shows up correctly in the MS-DOS window (under cp850). if I print the same filename to stdout in the same DOS window, I get gibberish.

> > > On Windows 95/98, the Unicode variants of the Win32 API calls don't > > > exist. So what is the poor Python runtime to do there? > > Fail the call.

...if you fail to convert from unicode to the local code page.

From mal@lemburg.com Tue May 2 09:36:43 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 02 May 2000 10:36:43 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> Message-ID: <390E939B.11B99B71@lemburg.com>

Just a small note on the subject of a character being atomic which seems to have been forgotten by the discussing parties:

Unicode itself can be understood as multi-word character encoding, just like UTF-8. The reason is that Unicode entities can be combined to produce single display characters (e.g. u"e"+u"\u0301" will print "é" in a Unicode aware renderer). Slicing such a combined Unicode string will have the same effect as slicing UTF-8 data.

It seems that most Latin-1 proponents seem to have single display characters in mind. While the same is true for many Unicode entities, there are quite a few cases of combining characters in Unicode 3.0 and the Unicode normalization algorithm uses these as the basis for its work.

So in the end the "UTF-8 doesn't slice" argument holds for Unicode itself too, just as it also does for many Asian multi-byte variable length character encodings, image formats, audio formats, database formats, etc. You can't really expect slicing to always "just work" without some knowledge about the data you are slicing.

-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
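MAL's combining-character point, at the prompt (a sketch; whether the two code points render as one 'é' is up to the display layer, not Python):

    u = u"e" + u"\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
    len(u)                  # 2 code points, one intended display character
    u[:1]                   # u"e": the accent is silently sliced off,
                            # unlike a mid-sequence UTF-8 slice, which at
                            # least fails loudly when decoded later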
u"e"+u"\u0301" will print "é" in a Unicode aware renderer). Slicing such a combined Unicode string will have the same effect as slicing UTF-8 data. It seems that most Latin-1 proponents seem to have single display characters in mind. While the same is true for many Unicode entities, there are quite a few cases of combining characters in Unicode 3.0 and the Unicode nomarization algorithm uses these as basis for its work. So in the end the "UTF-8 doesn't slice" argument holds for Unicode itself too, just as it also does for many Asian multi-byte variable length character encodings, image formats, audio formats, database formats, etc. You can't really expect slicing to always "just work" without some knowledge about the data you are slicing. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From ping@lfw.org Tue May 2 09:42:51 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Tue, 2 May 2000 01:42:51 -0700 (PDT) Subject: [Python-Dev] Unicode debate In-Reply-To: Message-ID: I'll warn you that i'm not much experienced or well-informed, but i suppose i might as well toss in my naive opinion. At 11:31 PM -0400 01-05-2000, Guido van Rossum wrote: > > I believe that whether the default encoding is UTF-8 or Latin-1 > doesn't matter for here -- both are wrong, she needs to write explicit > unicode(line, "iso-2022-jp") code anyway. I would argue that UTF-8 is > "better", because [this] will most likely give an exception... On Tue, 2 May 2000, Just van Rossum wrote: > But then it's even better to *always* raise an exception, since it's > entirely possible a string contains valid utf-8 while not *being* utf-8. I believe it is time for me to make a truly radical proposal: No automatic conversions between 8-bit "strings" and Unicode strings. If you want to turn UTF-8 into a Unicode string, say so. If you want to turn Latin-1 into a Unicode string, say so. If you want to turn ISO-2022-JP into a Unicode string, say so. Adding a Unicode string and an 8-bit "string" gives an exception. I know this sounds tedious, but at least it stands the least possible chance of confusing anyone -- and given all i've seen here and in other i18n and l10n discussions, there's plenty enough confusion to go around already. If it turns out automatic conversions *are* absolutely necessary, then i vote in favour of the simple, direct method promoted by Paul and Fredrik: just copy the numerical values of the bytes. The fact that this happens to correspond to Latin-1 is not really the point; the main reason is that it satisfies the Principle of Least Surprise. Okay. Feel free to yell at me now. -- ?!ng P. S. The scare-quotes when i talk about 8-bit "strings" expose my sense of them as byte-buffers -- since that *is* all you get when you read in some bytes from a file. If you manipulate an 8-bit "string" as a character string, you are implicitly making the assumption that the byte values correspond to the character encoding of the character repertoire you want to work with, and that's your responsibility. P. P. S. If always having to specify encodings is really too much, i'd probably be willing to consider a default-encoding state on the Unicode class, but it would have to be a stack of values, not a single value. From Fredrik Lundh" <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> Message-ID: <009701bfb414$d35d0ea0$34aab5d4@hagrid> M.-A. 
From Fredrik Lundh" <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> Message-ID: <009701bfb414$d35d0ea0$34aab5d4@hagrid>

M.-A. Lemburg wrote: > Just a small note on the subject of a character being atomic > which seems to have been forgotten by the discussing parties: > > Unicode itself can be understood as multi-word character > encoding, just like UTF-8. The reason is that Unicode entities > can be combined to produce single display characters (e.g. > u"e"+u"\u0301" will print "é" in a Unicode aware renderer). > Slicing such a combined Unicode string will have the same > effect as slicing UTF-8 data.

really? does it result in a decoder error? or does it just result in a rendering error, just as if you slice off any trailing character without looking...

> It seems that most Latin-1 proponents seem to have single > display characters in mind. While the same is true for > many Unicode entities, there are quite a few cases of > combining characters in Unicode 3.0 and the Unicode > normalization algorithm uses these as the basis for its > work.

do we support automatic normalization in 1.6?

From ping@lfw.org Tue May 2 10:46:40 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Tue, 2 May 2000 02:46:40 -0700 (PDT) Subject: [Python-Dev] At the interactive port In-Reply-To: Message-ID:

On Tue, 2 May 2000, Moshe Zadka wrote: > > > Thanks for bringing this up again. I think it should be called > > sys.displayhook.

I apologize profusely for dropping the ball on this. I was going to do it; i have been having a tough time lately figuring out a Big Life Decision. (Hate those BLDs.) I was partway through hacking the patch and didn't get back to it, but i wanted to at least air the plan i had in mind. I hope you'll allow me this indulgence.

I was planning to submit a patch that adds the built-in routines

    sys.display
    sys.displaytb
    sys.__display__
    sys.__displaytb__

sys.display(obj) would be implemented as 'print repr(obj)' and sys.displaytb(tb, exc) would call the same built-in traceback printer we all know and love.

I assumed that sys.__stdin__ was added to make it easier to restore sys.stdin to its original value. In the same vein, sys.__display__ and sys.__displaytb__ would be saved references to the original sys.display and sys.displaytb.

I hate to contradict Guido, but i'll gently suggest why i like "display" better than "displayhook": "display" is a verb, and i prefer function names to be verbs rather than nouns describing what the functions are (e.g. "read" rather than "reader", etc.)

-- ?!ng
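The save-and-restore pattern Ping describes, using his proposed names (sys.display and sys.__display__ exist only in this plan, and the hook body is invented):

    import sys
    def terse(obj):
        if obj is not None:
            print repr(obj)[:40]        # keep interactive output short
    sys.display = terse
    # ... later, undo the customization the same way sys.__stdin__
    # lets you restore sys.stdin:
    sys.display = sys.__display__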
File "", line 3, in eggs AttributeError: ham With the suggested changes, this would print as Traceback (innermost last): Line 1 of Line 3 of , in Spam.eggs AttributeError: ham -- ?!ng "In the sciences, we are now uniquely privileged to sit side by side with the giants on whose shoulders we stand." -- Gerald Holton From ping@lfw.org Tue May 2 10:53:01 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Tue, 2 May 2000 02:53:01 -0700 (PDT) Subject: [Python-Dev] Traceback behaviour in exceptional cases Message-ID: Here is how i was planning to take care of exceptions in sys.displaytb... 1. When the 'sys' module does not contain a 'stderr' attribute, Python currently prints 'lost sys.stderr' to the original stderr instead of printing the traceback. I propose that it proceed to try to print the traceback to the real stderr in this case. 2. If 'sys.stderr' is buffered, the traceback does not appear in the file. I propose that Python flush 'sys.stderr' immediately after printing a traceback. 3. Tracebacks get printed to whatever object happens to be in 'sys.stderr'. If the object is not a file (or other problems occur during printing), nothing gets printed anywhere. I propose that Python warn about this on stderr, then try to print the traceback to the real stderr as above. 4. Similarly, 'sys.displaytb' may cause an exception. I propose that when this happens, Python invoke its default traceback printer to print the exception from 'sys.displaytb' as well as the original exception. #4 may seem a little convoluted, so here is the exact logic i suggest (described here in Python but to be implemented in C), where 'handle_exception()' is the routine the interpreter uses to handle an exception, 'print_exception' is the built-in exception printer currently implemented in PyErr_PrintEx and PyTraceBack_Print, and 'err' is the actual, original stderr. def print_double_exception(tb, exc, disptb, dispexc, file): file.write("Exception occured during traceback display:\n") print_exception(disptb, dispexc, file) file.write("\n") file.write("Original exception passed to display routine:\n") print_exception(tb, exc, file) def handle_double_exception(tb, exc, disptb, dispexc): if hasattr(sys, 'stderr'): err.write("Missing sys.stderr; printing exception to stderr.\n") print_double_exception(tb, exc, disptb, dispexc, err) return try: print_double_exception(tb, exc, disptb, dispexc, sys.stderr) except: err.write("Error on sys.stderr; printing exception to stderr.\n") print_double_exception(tb, exc, disptb, dispexc, err) def handle_exception(): tb, exc = sys.exc_traceback, sys.exc_value try: sys.displaytb(tb, exc) except: disptb, dispexc = sys.exc_traceback, sys.exc_value try: handle_double_exception(tb, exc, disptb, dispexc) except: pass def default_displaytb(tb, exc): if hasattr(sys, 'stderr'): print_exception(tb, exc, sys.stderr) else: print "Missing sys.stderr; printing exception to stderr." print_exception(tb, exc, err) sys.displaytb = sys.__displaytb__ = default_displaytb -- ?!ng "In the sciences, we are now uniquely privileged to sit side by side with the giants on whose shoulders we stand." -- Gerald Holton From mal@lemburg.com Tue May 2 10:56:21 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 02 May 2000 11:56:21 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: Your message of "Mon, 01 May 2000 21:55:25 EDT." 
From mal@lemburg.com Tue May 2 10:56:21 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 02 May 2000 11:56:21 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid> Message-ID: <390EA645.89E3B22A@lemburg.com>

Fredrik Lundh wrote: > > M.-A. Lemburg wrote: > > Just a small note on the subject of a character being atomic > > which seems to have been forgotten by the discussing parties: > > > > Unicode itself can be understood as multi-word character > > encoding, just like UTF-8. The reason is that Unicode entities > > can be combined to produce single display characters (e.g. > > u"e"+u"\u0301" will print "é" in a Unicode aware renderer). > > Slicing such a combined Unicode string will have the same > > effect as slicing UTF-8 data. > > really? does it result in a decoder error? or does it just result > in a rendering error, just as if you slice off any trailing character > without looking...

In the example, if you cut off the u"\u0301", the "e" would appear without the acute accent, cutting off the u"e" would probably result in a rendering error or worse put the accent over the next character to the left.

UTF-8 is better in this respect: it warns you about the error by raising an exception when being converted to Unicode.

> > It seems that most Latin-1 proponents seem to have single > > display characters in mind. While the same is true for > > many Unicode entities, there are quite a few cases of > > combining characters in Unicode 3.0 and the Unicode > > normalization algorithm uses these as the basis for its > > work. > > do we support automatic normalization in 1.6?

No, but it is likely to appear in 1.7... not sure about the "automatic" though.

FYI: Normalization is needed to make comparing Unicode strings robust, e.g. u"é" should compare equal to u"e\u0301".

-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
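MAL's FYI in code, once a normalization step exists (a sketch; the normalize() function stands in for whatever 1.7 would provide and is not a 1.6 API):

    u1 = u"\u00e9"      # LATIN SMALL LETTER E WITH ACUTE: one code point
    u2 = u"e\u0301"     # e + COMBINING ACUTE ACCENT: two code points
    u1 == u2            # false today, although both mean 'é'
    # normalize(u1) == normalize(u2) would be true, even though
    # len(u1) == 1 and len(u2) == 2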
From esr@thyrsus.com Tue May 2 11:16:55 2000 From: esr@thyrsus.com (Eric S. Raymond) Date: Tue, 2 May 2000 06:16:55 -0400 Subject: [Python-Dev] Traceback style In-Reply-To: ; from ping@lfw.org on Tue, May 02, 2000 at 02:47:34AM -0700 References: Message-ID: <20000502061655.A16999@thyrsus.com>

Ka-Ping Yee : > I propose the following stylistic changes to traceback > printing: > > 1. If there is no function name for a given level > in the traceback, just omit the ", in ?" at the > end of the line. > > 2. If a given level of the traceback is in a method, > instead of just printing the method name, print > the class and the method name. > > 3. Instead of beginning each line with: > > File "foo.py", line 5 > > print the line first and drop the quotes: > > Line 5 of foo.py > > In the common interactive case that the file > is a typed-in string, the current printout is > > File "<stdin>", line 1 > > and the following is easier to read in my opinion: > > Line 1 of <stdin> > > Here is an example: > > >>> class Spam: > ... def eggs(self): > ... return self.ham > ... > >>> s = Spam() > >>> s.eggs() > Traceback (innermost last): > File "<stdin>", line 1, in ? > File "<stdin>", line 3, in eggs > AttributeError: ham > > With the suggested changes, this would print as > > Traceback (innermost last): > Line 1 of <stdin> > Line 3 of <stdin>, in Spam.eggs > AttributeError: ham

IMHO, this is not a good idea. Emacs users like me want traceback labels to be *more* like C compiler error messages, not less.

-- Eric S. Raymond

The United States is in no way founded upon the Christian religion -- George Washington & John Adams, in a diplomatic message to Malta.

From Moshe Zadka Tue May 2 11:12:14 2000 From: Moshe Zadka (Moshe Zadka) Date: Tue, 2 May 2000 13:12:14 +0300 (IDT) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <200005020053.UAA23665@eric.cnri.reston.va.us> Message-ID:

On Mon, 1 May 2000, Guido van Rossum wrote: > Paul, we're both just saying the same thing over and over without > convincing each other. I'll wait till someone who wasn't in this > debate before chimes in.

Well, I'm guessing you had someone specific in mind (Neil?), but I want to say something too, as the only one here (I think) using ISO-8859-8 natively. I much prefer the Fredrik-Paul position, also known as the "a character is a character" position, to the UTF-8 as default encoding. Unicode is western-centered -- the first 256 characters are Latin 1. UTF-8 is even more horribly western-centered (or I should say USA centered) -- ASCII documents are the same. I'd much prefer Python to reflect a fundamental truth about Unicode, which at least makes sure binary-goop can pass through Unicode and remain unharmed, than to reflect a nasty problem with UTF-8 (not everything is legal). If I'm using Hebrew characters in my source (which I won't for a long while), I'll use them in Unicode strings only, and make sure I use Unicode. If I'm reading Hebrew from an ISO-8859-8 file, I'll set a conversion to Unicode on the fly anyway, since most bidi libraries work on Unicode. So having UTF-8 conversions magically happen won't help me at all, and will only cause problems when I use "sort-for-uniqueness" on a list with mixed binary-goop and Unicode strings. In short, this sounds like a recipe for disaster.

internationally y'rs, Z. -- Moshe Zadka http://www.oreilly.com/news/prescod_0300.html http://www.linux.org.il -- we put the penguin in .com
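Moshe's "sort-for-uniqueness" worry, spelled out (a sketch of the behaviour under discussion, with invented data):

    items = [u"shalom", "\xff\xfe binary goop"]
    items.sort()        # sorting compares 8-bit and Unicode strings,
                        # forcing a default conversion: with UTF-8 the
                        # goop is illegal and comparison blows up; with
                        # Latin-1 it widens byte-for-byte and the sort
                        # quietly succeeds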
From pf@artcom-gmbh.de Tue May 2 11:12:26 2000 From: pf@artcom-gmbh.de (Peter Funk) Date: Tue, 2 May 2000 12:12:26 +0200 (MEST) Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39 In-Reply-To: <20000501161825.9F3AE6616D@anthem.cnri.reston.va.us> from "Barry A. Warsaw" at "May 1, 2000 12:18:25 pm" Message-ID:

Barry A. Warsaw: > Update of /projects/cvsroot/python/dist/src/Doc/lib [...] > libos.tex [...]
> Availability: Macintosh, \UNIX{}, Windows.
> \end{funcdesc}
> --- 703,712 ----
> \end{funcdesc}
>
> ! \begin{funcdesc}{utime}{path, times}
> ! Set the access and modified times of the file specified by \var{path}.
> ! If \var{times} is \code{None}, then the file's access and modified
> ! times are set to the current time. Otherwise, \var{times} must be a
> ! 2-tuple of numbers, of the form \var{(atime, mtime)} which is used to
> ! set the access and modified times, respectively.
> Availability: Macintosh, \UNIX{}, Windows.
> \end{funcdesc}

I may have missed something, but I haven't seen a patch to the WinXX and MacOS implementations of the 'utime' function. So either the documentation should explicitly point out that the new additional signature is only available on Unices, or even better it should be implemented on all platforms so that programmers intending to write portable Python do not have to worry about this. I suggest an additional note saying that this signature has been added in Python 1.6. There used to be several such notes all over the documentation saying for example: "New in version 1.5.2." which I found very useful in the past!

Regards, Peter

From nhodgson@bigpond.net.au Tue May 2 11:22:00 2000 From: nhodgson@bigpond.net.au (Neil Hodgson) Date: Tue, 2 May 2000 20:22:00 +1000 Subject: [Python-Dev] fun with unicode, part 1 References: <000201bfb406$f2f35520$df2d153f@tim> <004501bfb40f$92ff0980$e3cb8490@neil> <008501bfb411$8e0502c0$34aab5d4@hagrid> Message-ID: <00d101bfb420$4197e510$e3cb8490@neil>

> ...but the system API translates from the active code page to the > encoding used by the file system, right?

Yes, although I think that wasn't the case with Win16 and there are still some situations in which you have to deal with the differences. Copying a file from the console on Windows 95 to a FAT volume appears to allow use of the OEM character set with no conversion.

> if I create a file with a name containing latin-1 characters, on a > FAT drive, it shows up correctly in the file browser (cp1252), and > also shows up correctly in the MS-DOS window (under cp850).

Do you have a FAT drive or a VFAT drive? If you format as FAT on 9x or NT you will get a VFAT volume.

Neil

From ping@lfw.org Tue May 2 11:23:26 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Tue, 2 May 2000 03:23:26 -0700 (PDT) Subject: [Python-Dev] Traceback style In-Reply-To: <20000502061655.A16999@thyrsus.com> Message-ID:

On Tue, 2 May 2000, Eric S. Raymond wrote: > > Ka-Ping Yee : > > > > With the suggested changes, this would print as > > > > Traceback (innermost last): > > Line 1 of <stdin> > > Line 3 of <stdin>, in Spam.eggs > > AttributeError: ham > > IMHO, this is not a good idea. Emacs users like me want traceback > labels to be *more* like C compiler error messages, not less.

I suppose Python could go all the way and say things like

    Traceback (innermost last):
      <stdin>:3
      foo.py:25: in Spam.eggs
    AttributeError: ham

but that might be more intimidating for a beginner. Besides, you Emacs guys have plenty of programmability anyway :) You would have to do a little parsing to get the file name and line number from the current format; it's no more work to get it from the suggested format.

(What i would really like, by the way, is to see the values of the function arguments on the stack -- but that's a lot of work to do in C, so implementing this with the help of repr.repr will probably be the first thing i do with sys.displaytb.)

-- ?!ng

From mal@lemburg.com Tue May 2 11:46:06 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 02 May 2000 12:46:06 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: Message-ID: <390EB1EE.EA557CA9@lemburg.com>

Moshe Zadka wrote: > > I'd much prefer Python to reflect a > fundamental truth about Unicode, which at least makes sure binary-goop can > pass through Unicode and remain unharmed, than to reflect a nasty problem > with UTF-8 (not everything is legal).

Let's not make the same mistake again: Unicode objects should *not* be used to hold binary data. Please use buffers instead.

BTW, I think that this behaviour should be changed:

    >>> buffer('binary') + 'data'
    'binarydata'

while:

    >>> 'data' + buffer('binary')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: illegal argument type for built-in operation

IMHO, buffer objects should never coerce to strings, but instead return a buffer object holding the combined contents. The same applies to slicing buffer objects:

    >>> buffer('binary')[2:5]
    'nar'

should preferably be buffer('nar').

--

Hmm, perhaps we need something like a data string object to get this 100% right ?!
>>> print type(d) >>> 'string' + d d"string...data..." >>> u'string' + d d"s\000t\000r\000i\000n\000g\000...data..." >>> d[:5] d"...da" etc. Ideally, string and Unicode objects would then be subclasses of this type in Py3K. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From pf@artcom-gmbh.de Tue May 2 11:59:55 2000 From: pf@artcom-gmbh.de (Peter Funk) Date: Tue, 2 May 2000 12:59:55 +0200 (MEST) Subject: [Python-Dev] Traceback style In-Reply-To: from Ka-Ping Yee at "May 2, 2000 3:23:26 am" Message-ID: > > Ka-Ping Yee : > > > > > > With the suggested changes, this would print as > > > > > > Traceback (innermost last): > > > Line 1 of > > > Line 3 of , in Spam.eggs > > > AttributeError: ham > On Tue, 2 May 2000, Eric S. Raymond wrote: > > IMHO, this is not a good idea. Emacs users like me want traceback > > labels to be *more* like C compiler error messages, not less. > Ka-Ping Yee : [...] > Besides, you Emacs guys have plenty of programmability anyway :) > You would have to do a little parsing to get the file name and > line number from the current format; it's no more work to get > it from the suggested format. I like pings proposed traceback output. But beside existing Elisp code there might be other software relying on a particular format. As a long time vim user I have absolutely no idea about other IDEs. So before changing the default format this should be carefully checked. > (What i would really like, by the way, is to see the values of > the function arguments on the stack -- but that's a lot of work > to do in C, so implementing this with the help of repr.repr > will probably be the first thing i do with sys.displaytb.) I'm eagerly waiting to see this. ;-) Regards, Peter From just@letterror.com Tue May 2 13:34:57 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 13:34:57 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <390E939B.11B99B71@lemburg.com> References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> Message-ID: At 10:36 AM +0200 02-05-2000, M.-A. Lemburg wrote: >Just a small note on the subject of a character being atomic >which seems to have been forgotten by the discussing parties: > >Unicode itself can be understood as multi-word character >encoding, just like UTF-8. The reason is that Unicode entities >can be combined to produce single display characters (e.g. >u"e"+u"\u0301" will print "=E9" in a Unicode aware renderer). Erm, are you sure Unicode prescribes this behavior, for this example? I know similar behaviors are specified for certain languages/scripts, but I didn't know it did that for latin. >Slicing such a combined Unicode string will have the same >effect as slicing UTF-8 data. Not true. As Fredrik noted: no exception will be raised. [ Speaking of exceptions, after I sent off my previous post I realized Guido's non-utf8-strings-interpreted-as-utf8-will-often-raise-an-exception argument can easily be turned around, backfiring at utf-8: Defaulting to utf-8 when going from Unicode to 8-bit and back only gives the *illusion* things "just work", since it will *silently* "work", even if utf-8 is *not* the desired 8-bit encoding -- as shown by Fredrik's excellent "fun with Unicode, part 1" example. 
Defaulting to Latin-1 will warn the user *much* earlier, since it'll barf when converting a Unicode string that contains any character code > 255. So there. ] >It seems that most Latin-1 proponents seem to have single >display characters in mind. While the same is true for >many Unicode entities, there are quite a few cases of >combining characters in Unicode 3.0 and the Unicode >nomarization algorithm uses these as basis for its >work. Still, two combining characters are still two input characters for the renderer! They may result in one *glyph*, but trust me, that's an entirly different can of worms. However, if you'd be talking about Unicode surrogates, you'd definitely have a point. How do Java/Perl/Tcl deal with surrogates? Just From nhodgson@bigpond.net.au Tue May 2 12:40:44 2000 From: nhodgson@bigpond.net.au (Neil Hodgson) Date: Tue, 2 May 2000 21:40:44 +1000 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com><002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <035501bfb3f3$db87fb10$e3cb8490@neil> <003b01bfb404$03cd0560$34aab5d4@hagrid> Message-ID: <013e01bfb42b$41a3f200$e3cb8490@neil> > u = aUnicodeStringFromSomewhere > s = an8bitStringFromSomewhere > > DoSomething(s + u) > in Guido's design, the first example may or may not result in > an "UTF-8 decoding error: UTF-8 decoding error: unexpected > code byte" exception. I would say it is less surprising for most people for this to follow the silent-widening of each byte - the Fredrik-Paul position. With the current scarcity of UTF-8 code, very few people will expect an automatic UTF-8 to UTF-16 conversion. While complete prohibition of automatic conversion has some appeal, it will just be more noise to many. > u = aUnicodeStringFromSomewhere > s = an8bitStringFromSomewhere > > if len(u) + len(s) == len(u + s): > print "true" > else: > print "not true" > the second example may result in a > similar error, print "true", or print "not true", depending on the > contents of the 8-bit string. I don't see this as important as its trying to take the Unicode strings are equivalent to 8 bit strings too far. How much further before you have to break? I always thought of len measuring the number of bytes rather than characters when applied to strings. The same as strlen in C when you have a DBCS string. I should correct some of the stuff Mark wrote about me. At Fujitsu we did a lot more DBCS work than Unicode because that's what Japanese code uses. Even with Java most storage is still DBCS. I was more involved with Unicode architecture at Reuters 6 or so years ago. Neil From guido@python.org Tue May 2 12:53:10 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 07:53:10 -0400 Subject: [Python-Dev] At the interactive port In-Reply-To: Your message of "Tue, 02 May 2000 02:46:40 PDT." References: Message-ID: <200005021153.HAA24134@eric.cnri.reston.va.us> > I was planning to submit a patch that adds the built-in routines > > sys.display > sys.displaytb > > sys.__display__ > sys.__displaytb__ > > sys.display(obj) would be implemented as 'print repr(obj)' > and sys.displaytb(tb, exc) would call the same built-in > traceback printer we all know and love. Sure. Though I would recommend to separate the patch in two parts, because their implementation is totally unrelated. > I assumed that sys.__stdin__ was added to make it easier to > restore sys.stdin to its original value. 
In the same vein, > sys.__display__ and sys.__displaytb__ would be saved references > to the original sys.display and sys.displaytb. Good idea. > I hate to contradict Guido, but i'll gently suggest why i > like "display" better than "displayhook": "display" is a verb, > and i prefer function names to be verbs rather than nouns > describing what the functions are (e.g. "read" rather than > "reader", etc.) Good idea. But I hate the "displaytb" name (when I read your message I had no idea what the "tb" stood for until you explained it). Hm, perhaps we could do showvalue and showtraceback? ("displaytraceback" is a bit long.) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Tue May 2 13:15:28 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 08:15:28 -0400 Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39 In-Reply-To: Your message of "Tue, 02 May 2000 12:12:26 +0200." References: Message-ID: <200005021215.IAA24169@eric.cnri.reston.va.us> > > ! \begin{funcdesc}{utime}{path, times} > > ! Set the access and modified times of the file specified by \var{path}. > > ! If \var{times} is \code{None}, then the file's access and modified > > ! times are set to the current time. Otherwise, \var{times} must be a > > ! 2-tuple of numbers, of the form \var{(atime, mtime)} which is used to > > ! set the access and modified times, respectively. > > Availability: Macintosh, \UNIX{}, Windows. > > \end{funcdesc} > > I may have missed something, but I haven't seen a patch to the WinXX > and MacOS implementation of the 'utime' function. So either the > documentation should explicitly point out, that the new additional > signature is only available on Unices or even better it should be > implemented on all platforms so that programmers intending to write > portable Python have not to worry about this. Actually, it works on WinXX (tested on 98). The utime() implementation there is the same file as on Unix, so the patch fixed both platforms. The MS C library only seems to set the mtime, but that's okay. On Mac, I hope that the utime() function in GUSI 2 does this, in which case Jack Jansen needs to copy Barry's patch. > I suggest an additional note saying that this signature has been > added in Python 1.6. There used to be several such notes all over > the documentation saying for example: "New in version 1.5.2." which > I found very useful in the past! Thanks, you're right! --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Tue May 2 13:19:38 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 08:19:38 -0400 Subject: [Python-Dev] fun with unicode, part 1 In-Reply-To: Your message of "Tue, 02 May 2000 20:22:00 +1000." <00d101bfb420$4197e510$e3cb8490@neil> References: <000201bfb406$f2f35520$df2d153f@tim> <004501bfb40f$92ff0980$e3cb8490@neil> <008501bfb411$8e0502c0$34aab5d4@hagrid> <00d101bfb420$4197e510$e3cb8490@neil> Message-ID: <200005021219.IAA24181@eric.cnri.reston.va.us> > Yes, although I think that wasn't the case with Win16 and there are still > some situations in which you have to deal with the differences. Copying a > file from the console on Windows 95 to a FAT volume appears to allow use of > the OEM character set with no conversion. BTW, MS's use of code pages is full of shit. Yesterday I was spell-checking a document that had the name Andre in it (the accent was missing). The popup menu suggested Andr* where the * was an upper case slashed O. 
I first thought this was because the menu character set might be using a different code page, but no -- it must have been bad in the database, because selecting that entry from the menu actually inserted the slashed O character. So they must have been maintaining their database with a different code page. Just to indicate that when we sort out the rest of the Unicode debate (which I'm sure we will :-) there will still be surprises on Windows...

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org Tue May 2 13:22:24 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 08:22:24 -0400 Subject: [Python-Dev] Traceback style In-Reply-To: Your message of "Tue, 02 May 2000 03:23:26 PDT." References: Message-ID: <200005021222.IAA24192@eric.cnri.reston.va.us>

> > Ka-Ping Yee : > > > With the suggested changes, this would print as > > > > > > Traceback (innermost last): > > > Line 1 of <stdin> > > > Line 3 of <stdin>, in Spam.eggs > > > AttributeError: ham

ESR: > > IMHO, this is not a good idea. Emacs users like me want traceback > > labels to be *more* like C compiler error messages, not less.

Ping: > I suppose Python could go all the way and say things like > > Traceback (innermost last): > <stdin>:3 > foo.py:25: in Spam.eggs > AttributeError: ham > > but that might be more intimidating for a beginner. > > Besides, you Emacs guys have plenty of programmability anyway :) > You would have to do a little parsing to get the file name and > line number from the current format; it's no more work to get > it from the suggested format.

Not sure -- I think I carefully designed the old format to be one of the formats that Emacs parses *by default*:

    File "...", line ...

Your change breaks this.

> (What i would really like, by the way, is to see the values of > the function arguments on the stack -- but that's a lot of work > to do in C, so implementing this with the help of repr.repr > will probably be the first thing i do with sys.displaytb.)

Yes, this is much easier in Python. Watch out for values that are uncomfortably big or recursive or that cause additional exceptions on displaying.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org Tue May 2 13:26:50 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 08:26:50 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 12:46:06 +0200." <390EB1EE.EA557CA9@lemburg.com> References: <390EB1EE.EA557CA9@lemburg.com> Message-ID: <200005021226.IAA24203@eric.cnri.reston.va.us>

[MAL] > Let's not make the same mistake again: Unicode objects should *not* > be used to hold binary data. Please use buffers instead.

Easier said than done -- Python doesn't really have a buffer data type. Or do you mean the array module? It's not trivial to read a file into an array (although it's possible, there are even two ways). Fact is, most of Python's standard library and built-in objects use (8-bit) strings as buffers. I agree there's no reason to extend this to Unicode strings.

> BTW, I think that this behaviour should be changed:
>
>     >>> buffer('binary') + 'data'
>     'binarydata'
>
> while:
>
>     >>> 'data' + buffer('binary')
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     TypeError: illegal argument type for built-in operation
>
> IMHO, buffer objects should never coerce to strings, but instead
> return a buffer object holding the combined contents.
> The same applies to slicing buffer objects:
>
>     >>> buffer('binary')[2:5]
>     'nar'
>
> should preferably be buffer('nar').

Note that a buffer object doesn't hold data! It's only a pointer to data. I can't off-hand explain the asymmetry though.

> -- > > Hmm, perhaps we need something like a data string object > to get this 100% right ?! > > >>> d = data("...data...") > or > >>> d = d"...data..." > >>> print type(d) > <type 'data'> > > >>> 'string' + d > d"string...data..." > >>> u'string' + d > d"s\000t\000r\000i\000n\000g\000...data..." > > >>> d[:5] > d"...da" > > etc. > > Ideally, string and Unicode objects would then be subclasses > of this type in Py3K.

Not clear. I'd rather do the equivalent of byte arrays in Java, for which no "string literal" notations exist.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From gward@mems-exchange.org Tue May 2 13:27:51 2000 From: gward@mems-exchange.org (Greg Ward) Date: Tue, 2 May 2000 08:27:51 -0400 Subject: [Python-Dev] Traceback style In-Reply-To: ; from ping@lfw.org on Tue, May 02, 2000 at 02:47:34AM -0700 References: Message-ID: <20000502082751.A1504@mems-exchange.org>

On 02 May 2000, Ka-Ping Yee said: > I propose the following stylistic changes to traceback > printing: > > 1. If there is no function name for a given level > in the traceback, just omit the ", in ?" at the > end of the line.

+0 on this: it doesn't really add anything, but it does neaten things up.

> 2. If a given level of the traceback is in a method, > instead of just printing the method name, print > the class and the method name.

+1 here too: this definitely adds utility.

> 3. Instead of beginning each line with: > > File "foo.py", line 5 > > print the line first and drop the quotes: > > Line 5 of foo.py

-0: adds nothing, cleans nothing up, and just generally breaks things for no good reason.

> In the common interactive case that the file > is a typed-in string, the current printout is > > File "<stdin>", line 1 > > and the following is easier to read in my opinion: > > Line 1 of <stdin>

OK, that's a good reason. Maybe you could special-case the "<stdin>" case? How about

    <stdin>, line 1

?

Greg

From guido@python.org Tue May 2 13:30:02 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 08:30:02 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 11:56:21 +0200." <390EA645.89E3B22A@lemburg.com> References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid> <390EA645.89E3B22A@lemburg.com> Message-ID: <200005021230.IAA24232@eric.cnri.reston.va.us>

[MAL] > > > Unicode itself can be understood as multi-word character > > > encoding, just like UTF-8. The reason is that Unicode entities > > > can be combined to produce single display characters (e.g. > > > u"e"+u"\u0301" will print "é" in a Unicode aware renderer). > > > Slicing such a combined Unicode string will have the same > > > effect as slicing UTF-8 data.

[/F] > > really? does it result in a decoder error? or does it just result > > in a rendering error, just as if you slice off any trailing character > > without looking...

[MAL] > In the example, if you cut off the u"\u0301", the "e" would > appear without the acute accent, cutting off the u"e" would > probably result in a rendering error or worse put the accent > over the next character to the left.
> > UTF-8 is better in this respect: it warns you about > the error by raising an exception when being converted to > Unicode.

I think /F's point was that the Unicode standard prescribes different behavior here: for UTF-8, a missing or lone continuation byte is an error; for Unicode, accents are separate characters that may be inserted and deleted in a string but whose display is undefined under certain conditions.

(I just noticed that this doesn't work in Tkinter but it does work in wish. Strange.)

> FYI: Normalization is needed to make comparing Unicode > strings robust, e.g. u"é" should compare equal to u"e\u0301".

Aha, then we'll see u == v even though type(u) is type(v) and len(u) != len(v). /F's world will collapse. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org Tue May 2 13:31:55 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 08:31:55 -0400 Subject: [Python-Dev] Re: Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 01:42:51 PDT." References: Message-ID: <200005021231.IAA24249@eric.cnri.reston.va.us>

> No automatic conversions between 8-bit "strings" and Unicode strings. > > If you want to turn UTF-8 into a Unicode string, say so. > If you want to turn Latin-1 into a Unicode string, say so. > If you want to turn ISO-2022-JP into a Unicode string, say so. > Adding a Unicode string and an 8-bit "string" gives an exception.

I'd accept this, with one change: mixing Unicode and 8-bit strings is okay when the 8-bit strings contain only ASCII (byte values 0 through 127). That does the right thing when the program is combining ASCII data (e.g. literals or data files) with Unicode and warns you when you are using characters for which the encoding matters. I believe that this is important because much existing code dealing with strings can in fact deal with Unicode just fine under these assumptions. (E.g. I needed only 4 changes to htmllib/sgmllib to make it deal with Unicode strings -- those changes were all getattr() and setattr() calls.)

When *comparing* 8-bit and Unicode strings, the presence of non-ASCII bytes in either should make the comparison fail; when ordering is important, we can make an arbitrary choice e.g. "\377" < u"\200".

Why not Latin-1? Because it gives us Western-alphabet users a false sense that our code works, where in fact it is broken as soon as you change the encoding.

> P. S. The scare-quotes when i talk about 8-bit "strings" expose my > sense of them as byte-buffers -- since that *is* all you get when you > read in some bytes from a file. If you manipulate an 8-bit "string" > as a character string, you are implicitly making the assumption that > the byte values correspond to the character encoding of the character > repertoire you want to work with, and that's your responsibility.

This is how I think of them too.

> P. P. S. If always having to specify encodings is really too much, > i'd probably be willing to consider a default-encoding state on the > Unicode class, but it would have to be a stack of values, not a > single value.

Please elaborate?

--Guido van Rossum (home page: http://www.python.org/~guido/)
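Guido's amended rule, sketched as prompt behaviour (proposed semantics only, not those of any released Python):

    u"spam & " + "eggs"        # 8-bit operand is pure ASCII: allowed
    u"spam & " + "\xe9ggs"     # non-ASCII byte: raises instead of guessing
    u"\200" == "\377"          # comparison just fails (false), and for
                               # ordering an arbitrary rule such as
                               # "\377" < u"\200" is good enough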
<002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid> <390EA645.89E3B22A@lemburg.com>
Message-ID:

At 8:30 AM -0400 02-05-2000, Guido van Rossum wrote:
>I think /F's point was that the Unicode standard prescribes different
>behavior here: for UTF-8, a missing or lone continuation byte is an
>error; for Unicode, accents are separate characters that may be
>inserted and deleted in a string but whose display is undefined under
>certain conditions.
>
>(I just noticed that this doesn't work in Tkinter but it does work in
>wish. Strange.)
>
>> FYI: Normalization is needed to make comparing Unicode
>> strings robust, e.g. u"È" should compare equal to u"e\u0301".
>
>Aha, then we'll see u == v even though type(u) is type(v) and len(u)
>!= len(v). /F's world will collapse. :-)

Does the Unicode spec *really* specify u should compare equal to v? This behavior would be the responsibility of a layout engine, a role which is way beyond the scope of Unicode support in Python, as it is language- and script-dependent.

Just

From just@letterror.com Tue May 2 14:39:24 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 14:39:24 +0100
Subject: [Python-Dev] Re: [I18n-sig] Unicode debate
In-Reply-To:
References:
Message-ID:

At 1:42 AM -0700 02-05-2000, Ka-Ping Yee wrote:
>If it turns out automatic conversions *are* absolutely necessary,
>then i vote in favour of the simple, direct method promoted by Paul
>and Fredrik: just copy the numerical values of the bytes. The fact
>that this happens to correspond to Latin-1 is not really the point;
>the main reason is that it satisfies the Principle of Least Surprise.

Exactly. I'm not sure if automatic conversions are absolutely necessary, but seeing 8-bit strings as Latin-1 encoded Unicode strings seems most natural to me. Heck, even 8-bit strings should have an s.encode() method, that would behave *just* like u.encode(), and unicode(blah) could even *return* an 8-bit string if it turns out the string has no character codes > 255! Conceptually, this gets *very* close to the ideal of "there is only one string type", and at the same time leaves room for 8-bit strings doubling as byte arrays for backward compatibility reasons. (Unicode strings and 8-bit strings could even be the same type, which only uses wide chars when necessary!)

Just

From just@letterror.com Tue May 2 14:55:31 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 14:55:31 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us>
References: Your message of "Tue, 02 May 2000 01:42:51 PDT."
Message-ID:

At 8:31 AM -0400 02-05-2000, Guido van Rossum wrote:
>When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
>bytes in either should make the comparison fail; when ordering is
>important, we can make an arbitrary choice e.g. "\377" < u"\200".

Blech. Just document 8-bit strings *are* Latin-1 unless converted explicitly, and you're done. It's really much simpler this way. For you as well as the users.

>Why not Latin-1? Because it gives us Western-alphabet users a false
>sense that our code works, where in fact it is broken as soon as you
>change the encoding.

Yeah, and? At least it'll *show* it's broken instead of *silently* doing the wrong thing with utf-8.
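(To make that concrete, a sketch using the codecs as they exist today -- the exact exception wording is approximate:)

    >>> u = u"a\u2022b"          # contains a character > 255
    >>> u.encode('utf-8')        # silently "works", whatever the consumer expected
    'a\xe2\x80\xa2b'
    >>> u.encode('latin-1')      # barfs right away
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    UnicodeError: Latin-1 encoding error: ordinal not in range(256)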
It's like using Python ints all over the place, and suddenly a user of the application enters data that causes an integer overflow. Boom. Program needs to be fixed. What's the big deal?

Just

From Fredrik Lundh" <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid> <390EA645.89E3B22A@lemburg.com> <200005021230.IAA24232@eric.cnri.reston.va.us>
Message-ID: <00f301bfb437$227bc180$34aab5d4@hagrid>

Guido van Rossum wrote:
> > FYI: Normalization is needed to make comparing Unicode
> > strings robust, e.g. u"é" should compare equal to u"e\u0301".
>
> Aha, then we'll see u == v even though type(u) is type(v) and len(u)
> != len(v). /F's world will collapse. :-)

you're gonna do automatic normalization? that's interesting. will this make Python the first language to define strings as a "sequence of graphemes"? or was this just the cheap shot it appeared to be?

From skip@mojam.com (Skip Montanaro) Tue May 2 14:10:22 2000
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Tue, 2 May 2000 08:10:22 -0500 (CDT)
Subject: [Python-Dev] Traceback style
In-Reply-To:
References:
Message-ID: <14606.54206.559407.213584@beluga.mojam.com>

[... completely eliding Ping's note and stealing his subject ...]

On a not-quite unrelated tack, I wonder if traceback printing can be enhanced in the case where Python code calls a function or method written in C (possibly calling multiple C functions), which in turn calls a Python function that raises an exception. Currently, the Python functions on either side of the C functions are printed, but no hint of the C function's existence is displayed. Any way to get some indication there's another function in the middle?

Thanks,

--
Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth. We did not ask for this role... We may not be suited to it, but here we are." - Stephen Jay Gould

From tdickenson@geminidataloggers.com Tue May 2 14:46:44 2000
From: tdickenson@geminidataloggers.com (Toby Dickenson)
Date: Tue, 02 May 2000 14:46:44 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us>
References: <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID:

On Tue, 02 May 2000 08:31:55 -0400, Guido van Rossum wrote:

>> No automatic conversions between 8-bit "strings" and Unicode strings.
>>
>> If you want to turn UTF-8 into a Unicode string, say so.
>> If you want to turn Latin-1 into a Unicode string, say so.
>> If you want to turn ISO-2022-JP into a Unicode string, say so.
>> Adding a Unicode string and an 8-bit "string" gives an exception.
>
>I'd accept this, with one change: mixing Unicode and 8-bit strings is
>okay when the 8-bit strings contain only ASCII (byte values 0 through
>127). That does the right thing when the program is combining
>ASCII data (e.g. literals or data files) with Unicode and warns you
>when you are using characters for which the encoding matters. I
>believe that this is important because much existing code dealing with
>strings can in fact deal with Unicode just fine under these
>assumptions. (E.g. I needed only 4 changes to htmllib/sgmllib to make
>it deal with Unicode strings -- those changes were all getattr() and
>setattr() calls.)
> >When *comparing* 8-bit and Unicode strings, the presence of non-ASCII >bytes in either should make the comparison fail; when ordering is >important, we can make an arbitrary choice e.g. "\377" < u"\200". I assume 'fail' means 'non-equal', rather than 'raises an exception'? Toby Dickenson tdickenson@geminidataloggers.com From guido@python.org Tue May 2 14:58:51 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 09:58:51 -0400 Subject: [Python-Dev] Traceback style In-Reply-To: Your message of "Tue, 02 May 2000 08:10:22 CDT." <14606.54206.559407.213584@beluga.mojam.com> References: <14606.54206.559407.213584@beluga.mojam.com> Message-ID: <200005021358.JAA24443@eric.cnri.reston.va.us> [Skip] > On a not-quite unrelated tack, I wonder if traceback printing can be > enhanced in the case where Python code calls a function or method written in > C (possibly calling multiple C functions), which in turn calls a Python > function that raises an exception. Currently, the Python functions on > either side of the C functions are printed, but no hint of the C function's > existence is displayed. Any way to get some indication there's another > function in the middle? In some cases, that's a good thing -- in others, it's not. There should probably be an API that a C function can call to add an entry onto the stack. It's not going to be a trivial fix though -- you'd have to manufacture a frame object. I can see two options: you can do this "on the way out" when you catch an exception, or you can do this "on the way in" when you are called. The latter would require you to explicitly get rid of the frame too -- probably both on normal returns and on exception returns. That seems hairier than only having to make a call on exception returns; but it means that the C function is invisible to the Python debugger unless it fails. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Tue May 2 15:00:14 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 10:00:14 -0400 Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 14:46:44 BST." References: <200005021231.IAA24249@eric.cnri.reston.va.us> Message-ID: <200005021400.KAA24464@eric.cnri.reston.va.us> [me] > >When *comparing* 8-bit and Unicode strings, the presence of non-ASCII > >bytes in either should make the comparison fail; when ordering is > >important, we can make an arbitrary choice e.g. "\377" < u"\200". [Toby] > I assume 'fail' means 'non-equal', rather than 'raises an exception'? Yes, sorry for the ambiguity. --Guido van Rossum (home page: http://www.python.org/~guido/) From fdrake@acm.org Tue May 2 15:04:17 2000 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 2 May 2000 10:04:17 -0400 (EDT) Subject: [Python-Dev] documentation for new modules In-Reply-To: References: <14605.44546.568978.296426@seahag.cnri.reston.va.us> Message-ID: <14606.57441.97184.499435@seahag.cnri.reston.va.us> Mark Hammond writes: > I wonder if that anyone could be me? :-) I certainly wouldn't object! ;) > But I will try and put something together. It will need to be plain > text or HTML, but I assume that is better than nothing! Plain text would be better than HTML. -Fred -- Fred L. Drake, Jr. 
Corporation for National Research Initiatives

From just@letterror.com Tue May 2 16:11:39 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 16:11:39 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021400.KAA24464@eric.cnri.reston.va.us>
References: Your message of "Tue, 02 May 2000 14:46:44 BST." <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID:

At 10:00 AM -0400 02-05-2000, Guido van Rossum wrote:
>[me]
>> >When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
>> >bytes in either should make the comparison fail; when ordering is
>> >important, we can make an arbitrary choice e.g. "\377" < u"\200".
>
>[Toby]
>> I assume 'fail' means 'non-equal', rather than 'raises an exception'?
>
>Yes, sorry for the ambiguity.

You're going to have a hard time explaining that "\377" != u"\377". Again, if you define that "all strings are unicode" and that 8-bit strings contain Unicode characters up to 255, you're all set. Clear semantics, few surprises, simple implementation, etc. etc.

Just

From guido@python.org Tue May 2 15:21:28 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 10:21:28 -0400
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 16:11:39 BST."
References: Your message of "Tue, 02 May 2000 14:46:44 BST." <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <200005021421.KAA24526@eric.cnri.reston.va.us>

[Just]
> You're going to have a hard time explaining that "\377" != u"\377".

I agree. You are an example of how hard it is to explain: you still don't understand that for a person using CJK encodings this is in fact the truth.

> Again, if you define that "all strings are unicode" and that 8-bit strings
> contain Unicode characters up to 255, you're all set. Clear semantics, few
> surprises, simple implementation, etc. etc.

But not all 8-bit strings occurring in programs are Unicode. Ask Moshe.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From just@letterror.com Tue May 2 16:42:24 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 16:42:24 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <200005021421.KAA24526@eric.cnri.reston.va.us>
References: Your message of "Tue, 02 May 2000 16:11:39 BST." Your message of "Tue, 02 May 2000 14:46:44 BST." <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID:

>[Just]
>> You're going to have a hard time explaining that "\377" != u"\377".
>
[GvR]
>I agree. You are an example of how hard it is to explain: you still
>don't understand that for a person using CJK encodings this is in fact
>the truth.

That depends on the definition of truth: if you document that 8-bit strings are Latin-1, the above is the truth. Conceptually classifying all other 8-bit encodings as binary goop makes the semantics crystal clear.

>> Again, if you define that "all strings are unicode" and that 8-bit strings
>> contain Unicode characters up to 255, you're all set. Clear semantics, few
>> surprises, simple implementation, etc. etc.
>
>But not all 8-bit strings occurring in programs are Unicode. Ask
>Moshe.

I know. They can be anything, even binary goop. But that's *only* an artifact of the fact that 8-bit strings need to double as buffer objects.
Just

From just@letterror.com Tue May 2 16:45:01 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 16:45:01 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To:
References: <200005021421.KAA24526@eric.cnri.reston.va.us> Your message of "Tue, 02 May 2000 16:11:39 BST." Your message of "Tue, 02 May 2000 14:46:44 BST." <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID:

I wrote:
>That depends on the definition of truth: if you document that 8-bit strings
>are Latin-1, the above is the truth.

Oops, I meant of course that "\377" == u"\377" is then the truth...

Sorry,

Just

From mal@lemburg.com Tue May 2 16:18:21 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 17:18:21 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Tue, 02 May 2000 11:56:21 +0200." <390EA645.89E3B22A@lemburg.com> Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid> <390EA645.89E3B22A@lemburg.com>
Message-ID: <390EF1BD.E6C7AF74@lemburg.com>

Just van Rossum wrote:
>
> At 8:30 AM -0400 02-05-2000, Guido van Rossum wrote:
> >I think /F's point was that the Unicode standard prescribes different
> >behavior here: for UTF-8, a missing or lone continuation byte is an
> >error; for Unicode, accents are separate characters that may be
> >inserted and deleted in a string but whose display is undefined under
> >certain conditions.
> >
> >(I just noticed that this doesn't work in Tkinter but it does work in
> >wish. Strange.)
> >
> >> FYI: Normalization is needed to make comparing Unicode
> >> strings robust, e.g. u"È" should compare equal to u"e\u0301".
                                ^
                                |
Here's a good example of what encoding errors can do: the above character was an "e" with acute accent (u"é"). Looks like some mailer converted this to some other code page and yet another back to Latin-1 again and this even though the message header for Content-Type clearly states that the document uses ISO-8859-1.

> >
> >Aha, then we'll see u == v even though type(u) is type(v) and len(u)
> >!= len(v). /F's world will collapse. :-)
>
> Does the Unicode spec *really* specify u should compare equal to v?

The behaviour is needed in order to implement sorting Unicode. See the www.unicode.org site for more information and the tech reports describing this.

Note that I haven't mentioned anything about "automatic" normalization. This should be a method on Unicode strings and could then be used in sorting compare callbacks.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From mal@lemburg.com Tue May 2 16:55:40 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 17:55:40 +0200
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
References: <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <390EFA7B.F6B622F0@lemburg.com>

[Guido going ASCII]

Do you mean going ASCII all the way (using it for all aspects where Unicode gets converted to a string and cases where strings get converted to Unicode), or just for some aspect of conversion, e.g. just for the silent conversions from strings to Unicode ?
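(To spell out the two directions under the current UTF-8 default -- "all the way" would change both of these, the narrower reading only the first:)

    >>> u"a" + "b"           # silent conversion: string -> Unicode
    u'ab'
    >>> str(u"a\u1234")      # the other direction: str(), print, "s" parser markers
    'a\xe1\x88\xb4'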
[BTW, I'm pretty sure that the Latin-1 folks won't like ASCII for the same reason they don't like UTF-8: it's simply an inconvenient way to write strings in their favorite encoding directly in Python source code. My feeling in this whole discussion is that it's more about convenience than anything else. Still, it's very amusing ;-) ]

FYI, here's the conversion table of (potentially) all conversions done by the implementation:

Python:
-------
string + unicode:        unicode(string,'utf-8') + unicode
string.method(unicode):  unicode(string,'utf-8').method(unicode)
print unicode:           print unicode.encode('utf-8'); with stdout
                         redirection this can be changed to any
                         other encoding
str(unicode):            unicode.encode('utf-8')
repr(unicode):           repr(unicode.encode('unicode-escape'))

C (PyArg_ParseTuple):
---------------------
"s" + unicode:           same as "s" + unicode.encode('utf-8')
"s#" + unicode:          same as "s#" + unicode.encode('unicode-internal')
"t" + unicode:           same as "t" + unicode.encode('utf-8')
"t#" + unicode:          same as "t#" + unicode.encode('utf-8')

This affects all C modules and builtins. In case a C module wants to receive a certain predefined encoding, it can use the new "es" and "es#" parser markers.

Ways to enter Unicode:
----------------------
u'' + string                 same as unicode(string,'utf-8')
unicode(string,encname)      any supported encoding
u'...unicode-escape...'      unicode-escape currently accepts Latin-1
                             chars as single-char input; using escape
                             sequences any Unicode char can be
                             entered (*)
codecs.open(filename,mode,encname)
                             opens an encoded file for reading and
                             writing Unicode directly
raw_input() + stdin redirection (see one of my earlier posts for code)
                             returns UTF-8 strings based on the
                             input encoding

IO:
---
open(file,'w').write(unicode)
    same as open(file,'w').write(unicode.encode('utf-8'))
open(file,'wb').write(unicode)
    same as open(file,'wb').write(unicode.encode('unicode-internal'))
codecs.open(file,'wb',encname).write(unicode)
    same as open(file,'wb').write(unicode.encode(encname))
codecs.open(file,'rb',encname).read()
    same as unicode(open(file,'rb').read(),encname)

stdin + stdout can be redirected using StreamRecoders to handle any of the supported encodings

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From mal@lemburg.com Tue May 2 16:27:39 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 17:27:39 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <390EB1EE.EA557CA9@lemburg.com> <200005021226.IAA24203@eric.cnri.reston.va.us>
Message-ID: <390EF3EB.5BCE9EC3@lemburg.com>

Guido van Rossum wrote:
>
> [MAL]
> > Let's not do the same mistake again: Unicode objects should *not*
> > be used to hold binary data. Please use buffers instead.
>
> Easier said than done -- Python doesn't really have a buffer data
> type. Or do you mean the array module? It's not trivial to read a
> file into an array (although it's possible, there are even two ways).
> Fact is, most of Python's standard library and built-in objects use
> (8-bit) strings as buffers.
>
> I agree there's no reason to extend this to Unicode strings.
>
> > BTW, I think that this behaviour should be changed:
> >
> >   >>> buffer('binary') + 'data'
> >   'binarydata'
> >
> > while:
> >
> >   >>> 'data' + buffer('binary')
> >   Traceback (most recent call last):
> >     File "<stdin>", line 1, in ?
> >   TypeError: illegal argument type for built-in operation
> >
> > IMHO, buffer objects should never coerce to strings, but instead
> > return a buffer object holding the combined contents. The
> > same applies to slicing buffer objects:
> >
> >   >>> buffer('binary')[2:5]
> >   'nar'
> >
> > should preferably be buffer('nar').
>
> Note that a buffer object doesn't hold data! It's only a pointer to
> data. I can't off-hand explain the asymmetry though.

Dang, you're right...

> > --
> >
> > Hmm, perhaps we need something like a data string object
> > to get this 100% right ?!
> >
> >   >>> d = data("...data...")
> >   or
> >   >>> d = d"...data..."
> >   >>> print type(d)
> >   <type 'data'>
> >
> >   >>> 'string' + d
> >   d"string...data..."
> >   >>> u'string' + d
> >   d"s\000t\000r\000i\000n\000g\000...data..."
> >
> >   >>> d[:5]
> >   d"...da"
> >
> >   etc.
> >
> > Ideally, string and Unicode objects would then be subclasses
> > of this type in Py3K.
>
> Not clear. I'd rather do the equivalent of byte arrays in Java, for
> which no "string literal" notations exist.

Anyway, one way or another I think we should make it clear to users that they should start using some other type for storing binary data.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From mal@lemburg.com Tue May 2 16:24:24 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 02 May 2000 17:24:24 +0200
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
Message-ID: <390EF327.86D8C3D8@lemburg.com>

Just van Rossum wrote:
>
> At 10:36 AM +0200 02-05-2000, M.-A. Lemburg wrote:
> >Just a small note on the subject of a character being atomic
> >which seems to have been forgotten by the discussing parties:
> >
> >Unicode itself can be understood as multi-word character
> >encoding, just like UTF-8. The reason is that Unicode entities
> >can be combined to produce single display characters (e.g.
> >u"e"+u"\u0301" will print "é" in a Unicode aware renderer).
>
> Erm, are you sure Unicode prescribes this behavior, for this
> example? I know similar behaviors are specified for certain
> languages/scripts, but I didn't know it did that for latin.

The details are on the www.unicode.org web-site buried in some of the tech reports on normalization and collation.

> >Slicing such a combined Unicode string will have the same
> >effect as slicing UTF-8 data.
>
> Not true. As Fredrik noted: no exception will be raised.

Huh ? You will always get an exception when you convert a broken UTF-8 sequence to Unicode. This is per design of UTF-8 itself which uses the top bit to identify multi-byte character encodings. Or can you give an example (perhaps you've found a bug that needs fixing) ?

> [ Speaking of exceptions,
>
>   after I sent off my previous post I realized Guido's
>   non-utf8-strings-interpreted-as-utf8-will-often-raise-an-exception
>   argument can easily be turned around, backfiring at utf-8:
>
>   Defaulting to utf-8 when going from Unicode to 8-bit and
>   back only gives the *illusion* things "just work", since it
>   will *silently* "work", even if utf-8 is *not* the desired
>   8-bit encoding -- as shown by Fredrik's excellent "fun with
>   Unicode, part 1" example.
>   Defaulting to Latin-1 will warn the user *much* earlier,
>   since it'll barf when converting a Unicode string that
>   contains any character code > 255. So there.
> ]
>
> >It seems that most Latin-1 proponents seem to have single
> >display characters in mind. While the same is true for
> >many Unicode entities, there are quite a few cases of
> >combining characters in Unicode 3.0 and the Unicode
> >normalization algorithm uses these as basis for its
> >work.
>
> Still, two combining characters are still two input characters for
> the renderer! They may result in one *glyph*, but trust me,
> that's an entirely different can of worms.

No. Please see my other post on the subject...

> However, if you'd be talking about Unicode surrogates,
> you'd definitely have a point. How do Java/Perl/Tcl deal with
> surrogates?

Good question... anybody know the answers ?

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From paul@prescod.net Tue May 2 17:05:20 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:05:20 -0500
Subject: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com><002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <035501bfb3f3$db87fb10$e3cb8490@neil>
Message-ID: <390EFCC0.240BC56B@prescod.net>

Neil,

I sincerely appreciate your informed input. I want to emphasize one ideological difference though. :)

Neil Hodgson wrote:
>
> ...
>
> The two options being that literal is either assumed to be encoded in
> Latin-1 or UTF-8.

I reject that characterization. I claim that both strings contain Unicode characters but one can contain Unicode characters with higher digits. UTF-8 versus latin-1 does not enter into it.

Python strings should not be documented in terms of encodings any more than Python ints are documented in terms of their two's complement representation. Then we could describe the default conversion from integers to floats in terms of their bit-representation. Ugh!

I accept that the effect is similar to calling Latin-1 the "default"; that's a side effect of the simple logical model that we are proposing.

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
- http://www.cs.yale.edu/~perlis-alan/quotes.html

From just@letterror.com Tue May 2 18:33:56 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 18:33:56 +0100
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
In-Reply-To: <390EFA7B.F6B622F0@lemburg.com>
References: <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID:

At 5:55 PM +0200 02-05-2000, M.-A. Lemburg wrote:
>[BTW, I'm pretty sure that the Latin-1 folks won't like
>ASCII for the same reason they don't like UTF-8: it's
>simply an inconvenient way to write strings in their favorite
>encoding directly in Python source code. My feeling in this
>whole discussion is that it's more about convenience than
>anything else. Still, it's very amusing ;-) ]

For the record, I don't want Latin-1 because it's my favorite encoding. It isn't. Guido's right: I can't even *use* it directly on my platform. I want it *only* because it's the most logical 8-bit subset of Unicode -- as we have stated over and over and over and over again. What's so hard to understand about this?
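(Concretely, and using only codecs that exist today: Latin-1 is the one 8-bit encoding whose byte values coincide with the corresponding Unicode ordinals, which is the whole argument:)

    >>> unicode('\xe9', 'latin-1')     # byte 0xE9 -> code point 0x00E9
    u'\xe9'
    >>> ord('\xe9') == ord(u'\xe9')
    1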
Just

From paul@prescod.net Tue May 2 17:11:13 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:11:13 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com>
Message-ID: <390EFE21.DAD7749B@prescod.net>

Combining characters are a whole 'nother level of complexity. Character sets are hard. I don't accept the argument that "Unicode itself has complexities so that gives us license to introduce even more complexities at the character representation level."

> FYI: Normalization is needed to make comparing Unicode
> strings robust, e.g. u"é" should compare equal to u"e\u0301".

That's a whole 'nother debate at a whole 'nother level of abstraction. I think we need to get the bytes/characters level right and then we can worry about display-equivalent characters (or leave that to the Python programmer to figure out...).

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
- http://www.cs.yale.edu/~perlis-alan/quotes.html

From paul@prescod.net Tue May 2 17:13:00 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:13:00 -0500
Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate
References: Your message of "Tue, 02 May 2000 14:46:44 BST." <200005021231.IAA24249@eric.cnri.reston.va.us> <200005021421.KAA24526@eric.cnri.reston.va.us>
Message-ID: <390EFE8C.4C10473C@prescod.net>

Guido van Rossum wrote:
>
> ...
>
> But not all 8-bit strings occurring in programs are Unicode. Ask
> Moshe.

Where are we going? What's our long-range vision? Three years from now where will we be?

1. How will we handle characters?
2. How will we handle bytes?
3. What will unadorned literal strings "do"?
4. Will literal strings be the same type as byte arrays?

I don't see how we can make decisions today without a vision for the future. I think that this is the central point in our disagreement. Some of us are aiming for as much compatibility as possible with where we think we should be going and others are aiming for as much compatibility as possible with where we came from.

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
- http://www.cs.yale.edu/~perlis-alan/quotes.html

From just@letterror.com Tue May 2 18:37:09 2000
From: just@letterror.com (Just van Rossum)
Date: Tue, 2 May 2000 18:37:09 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390EF327.86D8C3D8@lemburg.com>
References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com>
Message-ID:

At 5:24 PM +0200 02-05-2000, M.-A. Lemburg wrote:
>> Still, two combining characters are still two input characters for
>> the renderer! They may result in one *glyph*, but trust me,
>> that's an entirely different can of worms.
>
>No. Please see my other post on the subject...

It would help if you'd post some actual doco.

Just

From paul@prescod.net Tue May 2 17:25:33 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 11:25:33 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: Your message of "Mon, 01 May 2000 21:55:25 EDT."
<002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid> <390EA645.89E3B22A@lemburg.com> <200005021230.IAA24232@eric.cnri.reston.va.us>
Message-ID: <390F017C.91C7A8A0@prescod.net>

Guido van Rossum wrote:
>
> Aha, then we'll see u == v even though type(u) is type(v) and len(u)
> != len(v). /F's world will collapse. :-)

There are many levels of equality that are interesting. I don't think we would move to grapheme equivalence until "the rest of the world" (XML, Java, W3C, SQL) did. If we were going to move to grapheme equivalence (some day), the right way would be to normalize characters in the construction of the Unicode string. This is known as "Early normalization":

http://www.w3.org/TR/charmod/#NormalizationApplication

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
- http://www.cs.yale.edu/~perlis-alan/quotes.html

From ping@lfw.org Tue May 2 17:43:25 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Tue, 2 May 2000 09:43:25 -0700 (PDT)
Subject: [Python-Dev] Traceback style
In-Reply-To: <20000502082751.A1504@mems-exchange.org>
Message-ID:

On Tue, 2 May 2000, Greg Ward wrote:
> > In the common interactive case that the file
> > is a typed-in string, the current printout is
> >
> >   File "<string>", line 1
> >
> > and the following is easier to read in my opinion:
> >
> >   Line 1 of <string>
>
> OK, that's a good reason. Maybe you could special-case the "<string>"
> case?

...and "<stdin>", and "<console>", and perhaps others... ?

    File "<stdin>", line 3

just looks downright clumsy the first time you see it. (Well, it still looks kinda clumsy to me or i wouldn't be proposing the change.)

Can someone verify the already-parseable-by-Emacs claim, and describe how you get Emacs to do something useful with bits of traceback? (Alas, i'm not an Emacs user, so understanding just how the current format is useful would help.)

-- ?!ng

From bwarsaw@python.org Tue May 2 18:13:03 2000
From: bwarsaw@python.org (bwarsaw@python.org)
Date: Tue, 2 May 2000 13:13:03 -0400 (EDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
References: <20000501161825.9F3AE6616D@anthem.cnri.reston.va.us>
Message-ID: <14607.3231.115841.262068@anthem.cnri.reston.va.us>

>>>>> "PF" == Peter Funk writes:

    PF> I suggest an additional note saying that this signature has
    PF> been added in Python 1.6. There used to be several such notes
    PF> all over the documentation saying for example: "New in version
    PF> 1.5.2." which I found very useful in the past!

Good point. Fred, what is the Right Way to do this?
-Barry

From bwarsaw@python.org Tue May 2 18:16:22 2000
From: bwarsaw@python.org (bwarsaw@python.org)
Date: Tue, 2 May 2000 13:16:22 -0400 (EDT)
Subject: [Python-Dev] Traceback style
References: <20000502082751.A1504@mems-exchange.org>
Message-ID: <14607.3430.941026.496225@anthem.cnri.reston.va.us>

I concur with Greg's scores.

From guido@python.org Tue May 2 18:22:02 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 13:22:02 -0400
Subject: [Python-Dev] Traceback style
In-Reply-To: Your message of "Tue, 02 May 2000 08:27:51 EDT."
<20000502082751.A1504@mems-exchange.org>
References: <20000502082751.A1504@mems-exchange.org>
Message-ID: <200005021722.NAA25854@eric.cnri.reston.va.us>

> On 02 May 2000, Ka-Ping Yee said:
> > I propose the following stylistic changes to traceback
> > printing:
> >
> > 1. If there is no function name for a given level
> >    in the traceback, just omit the ", in ?" at the
> >    end of the line.

Greg Ward expresses my sentiments:

> +0 on this: it doesn't really add anything, but it does neaten things
> up.
>
> > 2. If a given level of the traceback is in a method,
> >    instead of just printing the method name, print
> >    the class and the method name.
>
> +1 here too: this definitely adds utility.
>
> > 3. Instead of beginning each line with:
> >
> >      File "foo.py", line 5
> >
> >    print the line first and drop the quotes:
> >
> >      Line 5 of foo.py
>
> -0: adds nothing, cleans nothing up, and just generally breaks things
> for no good reason.
>
> > In the common interactive case that the file
> > is a typed-in string, the current printout is
> >
> >   File "<string>", line 1
> >
> > and the following is easier to read in my opinion:
> >
> >   Line 1 of <string>
>
> OK, that's a good reason. Maybe you could special-case the "<string>"
> case? How about
>
>     <string>, line 1
>
> ?

I'd special-case any filename that starts with < and ends with > -- those are all made-up names like <string> or <stdin>. You can display them however you like, perhaps

    In "<string>", line 3

For regular files I'd leave the formatting alone -- there are tools out there that parse these. (E.g. Emacs' Python mode jumps to the line with the error if you run a file and it begets an exception.)

--Guido van Rossum (home page: http://www.python.org/~guido/)

From tree@basistech.com Tue May 2 18:14:24 2000
From: tree@basistech.com (Tom Emerson)
Date: Tue, 2 May 2000 13:14:24 -0400 (EDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390EF327.86D8C3D8@lemburg.com>
References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390EF327.86D8C3D8@lemburg.com>
Message-ID: <14607.3312.660077.42872@cymru.basistech.com>

M.-A. Lemburg writes:
> The details are on the www.unicode.org web-site buried
> in some of the tech reports on normalization and
> collation.

This is described in the Unicode standard itself, and in UTR #15 and UTR #10.

Normalization is an issue with wider implications than just handling glyph variants: indeed, it's irrelevant. The question is this: should

    U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS

compare equal to

    U+0055 LATIN CAPITAL LETTER U
    U+0308 COMBINING DIAERESIS

or not? It depends on the application. Certainly in a database system I would want these to compare equal. Perhaps normalization form needs to be an option of the string comparator?

-tree

--
Tom Emerson    Basis Technology Corp.    Language Hacker    http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"

From bwarsaw@python.org Tue May 2 18:51:17 2000
From: bwarsaw@python.org (bwarsaw@python.org)
Date: Tue, 2 May 2000 13:51:17 -0400 (EDT)
Subject: [Python-Dev] Traceback style
References: <20000502082751.A1504@mems-exchange.org> <200005021722.NAA25854@eric.cnri.reston.va.us>
Message-ID: <14607.5525.160379.760452@anthem.cnri.reston.va.us>

>>>>> "GvR" == Guido van Rossum writes:

    GvR> For regular files I'd leave the formatting alone -- there are
    GvR> tools out there that parse these. (E.g. Emacs' Python mode
    GvR> jumps to the line with the error if you run a file and it
    GvR> begets an exception.)

py-traceback-line-re is what matches those lines.
Its current definition is

    (defconst py-traceback-line-re
      "[ \t]+File \"\\([^\"]+\\)\", line \\([0-9]+\\)"
      "Regular expression that describes tracebacks.")

There are probably also gud.el (and maybe compile.el) regexps that need to be changed too. I'd rather see something that outputs the same regardless of whether it's a real file, or something "fake". Something like

    Line 1 of <string>
    Line 12 of foo.py

should be fine. I'm not crazy about something like

    File "foo.py", line 12
    In <string>, line 1

-Barry

From fdrake@acm.org Tue May 2 18:59:43 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Tue, 2 May 2000 13:59:43 -0400 (EDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
In-Reply-To: <14607.3231.115841.262068@anthem.cnri.reston.va.us>
References: <20000501161825.9F3AE6616D@anthem.cnri.reston.va.us> <14607.3231.115841.262068@anthem.cnri.reston.va.us>
Message-ID: <14607.6031.770981.424012@seahag.cnri.reston.va.us>

bwarsaw@python.org writes:
> Good point. Fred, what is the Right Way to do this?

Pester me night and day until it gets done (email only!). Unless of course you've already seen the check-in messages. ;)

-Fred

--
Fred L. Drake, Jr.
Corporation for National Research Initiatives

From bwarsaw@python.org Tue May 2 19:05:00 2000
From: bwarsaw@python.org (bwarsaw@python.org)
Date: Tue, 2 May 2000 14:05:00 -0400 (EDT)
Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Doc/lib libos.tex,1.38,1.39
References: <20000501161825.9F3AE6616D@anthem.cnri.reston.va.us> <14607.3231.115841.262068@anthem.cnri.reston.va.us> <14607.6031.770981.424012@seahag.cnri.reston.va.us>
Message-ID: <14607.6348.453682.219847@anthem.cnri.reston.va.us>

>>>>> "Fred" == Fred L Drake, Jr writes:

    Fred> Pester me night and day until it gets done (email only!).

Okay, I'll cancel the daily delivery of angry rabid velcro monkeys.

    Fred> Unless of course you've already seen the check-in messages.
    Fred> ;)

Saw 'em. Thanks.
-Barry

Date: Tue, 2 May 2000 14:04:56 -0400 (EDT)
From:
To: sc-discuss@software-carpentry.com, sc-announce@software-carpentry.com, sc-publicity@software-carpentry.com
Message-ID:
Subject: [Python-Dev] Software Carpentry Design Competition Finalists
Sender: python-dev-admin@python.org
List-Id: Python core developers

Software Carpentry Design Competition
First-Round Results
http://www.software-carpentry.com
May 2, 2000

The Software Carpentry Project is pleased to announce the selection of finalists in its first Open Source Design Competition. There were many strong entries, and we would like to thank everyone who took the time to participate. We would also like to invite everyone who has been involved to contact the teams listed below, and see if there is any way to collaborate in the second round.
Many of you had excellent ideas that deserve to be in the final tools, and the more involved you are in discussions over the next two months, the easier it will be for you to take part in the ensuing implementation effort.

The 12 entries that are going forward in the "Configuration", "Build", and "Track" categories are listed below (in alphabetical order). The four prize-winning entries in the "Test" category are also listed, but as is explained there, we are putting this section of the competition on hold for a couple of months while we try to refine the requirements.

You can inspect these entries on-line at:

http://www.software-carpentry.com/first-round.html

And so, without further ado...

== Configuration

The final four entries in the "Configuration" category are:

* BuildConf     Vassilis Virvilis
* ConfBase      Stefan Knappmann
* SapCat        Lindsay Todd
* Tan           David Ascher

== Build

The finalists in the "Build" category are:

* Black         David Ascher and Trent Mick
* PyMake        Rich Miller
* ScCons        Steven Knight
* Tromey        Tom Tromey

Honorable mentions in this category go to:

* Forge         Bill Bitner, Justin Patterson, and Gilbert Ramirez
* Quilt         David Lamb

== Track

The four entries to go forward in the "Track" category are:

* Egad          John Martin
* K2            David Belfer-Shevett
* Roundup       Ka-Ping Yee
* Tracker       Ken Manheimer

There is also an honorable mention for:

* TotalTrack    Alex Samuel, Mark Mitchell

== Test

This category was the most difficult one for the judges. First-round prizes are being awarded to

* AppTest       Linda Timberlake
* TestTalk      Chang Liu
* Thomas        Patrick Campbell-Preston
* TotalQuality  Alex Samuel, Mark Mitchell

However, the judges did not feel that any of these tools would have an impact on Open Source software development in general, or scientific and numerical programming in particular. This is due in large part to the vagueness of the posted requirements, for which the project coordinator (Greg Wilson) accepts full responsibility. We will therefore not be going forward with this category at the present time. Instead, the judges and others will develop narrower, more specific requirements, guidelines, and expectations. The category will be re-opened in July 2000.

== Contact

The aim of the Software Carpentry project is to create a new generation of easy-to-use software engineering tools, and to document both those tools and the working practices they are meant to support. The Advanced Computing Laboratory at Los Alamos National Laboratory is providing $860,000 of funding for Software Carpentry, which is being administered by Code Sourcery, LLC. For more information, contact the project coordinator, Dr. Gregory V. Wilson, at 'gvwilson@software-carpentry.com', or on +1 (416) 504 2325 ext. 229.

== Footnote: Entries from CodeSourcery, LLC

Two entries (TotalTrack and TotalQuality) were received from employees of CodeSourcery, LLC, the company which is hosting the Software Carpentry web site. We discussed this matter with Dr. Rod Oldehoeft, Deputy Director of the Advanced Computing Laboratory at Los Alamos National Laboratory. His response was:

    John Reynders [Director of the ACL] and I have discussed this
    matter. We agree that since the judges who make decisions are
    not affiliated with Code Sourcery, there is no conflict of
    interest. Code Sourcery gains no advantage by hosting the
    Software Carpentry web pages. Please continue evaluating all
    the entries on their merits, and choose the best for further
    eligibility.
Note that the project coordinator, Greg Wilson, is neither employed by CodeSourcery, nor a judge in the competition.

From paul@prescod.net Tue May 2 19:23:24 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 13:23:24 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us>
Message-ID: <390F1D1C.6EAF7EAD@prescod.net>

Guido van Rossum wrote:
>
> ....
>
> Have you tried using this?

Yes. I haven't had large problems with it. As long as you know what is going on, it doesn't usually hurt anything because you can just explicitly set up the decoding you want. It's like the int division problem. You get bitten a few times and then get careful.

It's the naive user who will be surprised by these random UTF-8 decoding errors.

That's why this is NOT a convenience issue (are you listening MAL???). It's a short and long term simplicity issue. There are lots of languages where it is de rigueur to discover and work around inconvenient and confusing default behaviors. I just don't think that we should be ADDING such behaviors.

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
- http://www.cs.yale.edu/~perlis-alan/quotes.html

From guido@python.org Tue May 2 19:56:34 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 14:56:34 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 13:23:24 CDT." <390F1D1C.6EAF7EAD@prescod.net>
References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net>
Message-ID: <200005021856.OAA26104@eric.cnri.reston.va.us>

> It's the naive user who will be surprised by these random UTF-8 decoding
> errors.
>
> That's why this is NOT a convenience issue (are you listening MAL???).
> It's a short and long term simplicity issue. There are lots of languages
> where it is de rigueur to discover and work around inconvenient and
> confusing default behaviors. I just don't think that we should be ADDING
> such behaviors.

So what do you think of my new proposal of using ASCII as the default "encoding"? It takes care of "a character is a character" but also (almost) guarantees an error message when mixing encoded 8-bit strings with Unicode strings without specifying an explicit conversion -- *any* 8-bit byte with the top bit set is rejected by the default conversion to Unicode.

I think this is less confusing than Latin-1: when an unsuspecting user is reading encoded text from a file into 8-bit strings and attempts to use it in a Unicode context, an error is raised instead of producing garbage Unicode characters.
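(A sketch of how that would look at the prompt -- the exception name and message here are illustrative only, not part of the proposal:)

    >>> u"head" + "tail"           # pure ASCII bytes: mix silently
    u'headtail'
    >>> u"head" + "ta\xefl"        # top bit set: refuse to guess
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    UnicodeError: ASCII decoding error: ordinal not in range(128)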
It encourages the use of Unicode strings for everything beyond ASCII -- there's no way around ASCII since that's the source encoding etc., but Latin-1 is an inconvenient default in most parts of the world. ASCII is accepted everywhere as the base character set (e.g. for email and for text-based protocols like FTP and HTTP), just like English is the one natural language that we can all sue to communicate (to some extent).

--Guido van Rossum (home page: http://www.python.org/~guido/)

From dieter@handshake.de Tue May 2 19:44:41 2000
From: dieter@handshake.de (Dieter Maurer)
Date: Tue, 2 May 2000 20:44:41 +0200 (CEST)
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <390E1F08.EA91599E@prescod.net>
References: <390E1F08.EA91599E@prescod.net>
Message-ID: <14607.7798.510723.419556@lindm.dm>

Paul Prescod writes:
> The fact that my proposal has the same effect as making Latin-1 the
> "default encoding" is a near-term side effect of the definition of
> Unicode. My long term proposal is to do away with the concept of 8-bit
> strings (and thus, conversions from 8-bit to Unicode) altogether. One
> string to rule them all!

Why must this be a long term proposal? I would find it quite attractive if

* the old string type became an immutable list of bytes

* automatic conversion between byte lists and unicode strings were performed via user customizable conversion functions (a la __import__).

Dieter

From paul@prescod.net Tue May 2 20:01:32 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 14:01:32 -0500
Subject: [Python-Dev] Unicode compromise?
References: <200005021231.IAA24249@eric.cnri.reston.va.us>
Message-ID: <390F260C.2314F97E@prescod.net>

Guido van Rossum wrote:
>
> > No automatic conversions between 8-bit "strings" and Unicode strings.
> >
> > If you want to turn UTF-8 into a Unicode string, say so.
> > If you want to turn Latin-1 into a Unicode string, say so.
> > If you want to turn ISO-2022-JP into a Unicode string, say so.
> > Adding a Unicode string and an 8-bit "string" gives an exception.
>
> I'd accept this, with one change: mixing Unicode and 8-bit strings is
> okay when the 8-bit strings contain only ASCII (byte values 0 through
> 127).

I could live with this compromise as long as we document that a future version may use the "character is a character" model. I just don't want people to start depending on a catchable exception being thrown because that would stop us from ever unifying unmarked literal strings and Unicode strings.

--

Are there any steps we could take to make a future divorce of strings and byte arrays easier? What if we added a

    binary_read()

function that returns some form of byte array. The byte array type could be just like today's string type except that its type object would be distinct, it wouldn't have as many string-ish methods and it wouldn't have any auto-conversion to Unicode at all.

People could start to transition code that reads non-ASCII data to the new function. We could put big warning labels on read() to state that it might not always be able to read data that is not in some small set of recognized encodings (probably UTF-8 and UTF-16).

Or perhaps binary_open(). Or perhaps both.

I do not suggest just using the text/binary flag on the existing open function because we cannot immediately change its behavior without breaking code.

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
- http://www.cs.yale.edu/~perlis-alan/quotes.html

From jkraai@murlmail.com Tue May 2 20:46:49 2000
From: jkraai@murlmail.com (jkraai@murlmail.com)
Date: Tue, 2 May 2000 14:46:49 -0500
Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
Message-ID: <200005021946.OAA03609@www.polytopic.com>

The ever quotable Guido:

> English is the one natural language that we can all sue to communicate

From paul@prescod.net Tue May 2 20:23:27 2000
From: paul@prescod.net (Paul Prescod)
Date: Tue, 02 May 2000 14:23:27 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us>
Message-ID: <390F2B2F.2953C72D@prescod.net>

Guido van Rossum wrote:
>
> ...
>
> So what do you think of my new proposal of using ASCII as the default
> "encoding"?

I can live with it. I am mildly uncomfortable with the idea that I could write a whole bunch of software that works great until some European inserts one of their name characters. Nevertheless, being hard-assed is better than being permissive because we can loosen up later.

What do we do about str( my_unicode_string )? Perhaps escape the Unicode characters with backslashed numbers?

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
- http://www.cs.yale.edu/~perlis-alan/quotes.html

From guido@python.org Tue May 2 20:58:20 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 15:58:20 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Tue, 02 May 2000 14:23:27 CDT." <390F2B2F.2953C72D@prescod.net>
References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net>
Message-ID: <200005021958.PAA26760@eric.cnri.reston.va.us>

[me]
> > So what do you think of my new proposal of using ASCII as the default
> > "encoding"?

[Paul]
> I can live with it. I am mildly uncomfortable with the idea that I could
> write a whole bunch of software that works great until some European
> inserts one of their name characters.

Better than that when some Japanese insert *their* name characters and it produces gibberish instead.

> Nevertheless, being hard-assed is
> better than being permissive because we can loosen up later.

Exactly -- just as nobody should *count* on 10**10 raising OverflowError, nobody (except maybe parts of the standard library :-) should *count* on unicode("\347") raising ValueError. I think that's fine.
> What do we do about str( my_unicode_string )? Perhaps escape the Unicode
> characters with backslashed numbers?

Hm, good question. Tcl displays unknown characters as \x or \u escapes. I think this may make more sense than raising an error. But there must be a way to turn on Unicode-awareness on e.g. stdout and then printing a Unicode object should not use str() (as it currently does).

--Guido van Rossum (home page: http://www.python.org/~guido/)

From trentm@activestate.com Tue May 2 21:47:17 2000
From: trentm@activestate.com (Trent Mick)
Date: Tue, 2 May 2000 13:47:17 -0700
Subject: [Python-Dev] Cannot declare the largest integer literal.
Message-ID: <20000502134717.A16825@activestate.com>

    >>> i = -2147483648
    OverflowError: integer literal too large
    >>> i = -2147483648L
    >>> int(i)  # it *is* a valid integer literal
    -2147483648

As far as I traced back:

    Python/compile.c::com_atom() calls
    Python/compile.c::parsenumber(s = "2147483648") calls
    Python/mystrtoul.c::PyOS_strtol()

which returns the ERANGE errno because it is given 2147483648 (which *is* out of range) rather than -2147483648.

My question: Why is the minus sign not considered part of the "atom", i.e. the integer literal? Should it be? PyOS_strtol() can properly parse this integer literal if it is given the whole number with the minus sign. Otherwise the special case largest negative number will always erroneously be considered out of range.

I don't know how the tokenizer works in Python. Was there a design decision to separate the integer literal and the leading sign? And was the effect on functions like PyOS_strtol() down the pipe missed?

Trent

--
Trent Mick
trentm@activestate.com

From guido@python.org Tue May 2 21:47:30 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 02 May 2000 16:47:30 -0400
Subject: [Python-Dev] Unicode compromise?
In-Reply-To: Your message of "Tue, 02 May 2000 14:01:32 CDT." <390F260C.2314F97E@prescod.net>
References: <200005021231.IAA24249@eric.cnri.reston.va.us> <390F260C.2314F97E@prescod.net>
Message-ID: <200005022047.QAA26828@eric.cnri.reston.va.us>

> I could live with this compromise as long as we document that a future
> version may use the "character is a character" model. I just don't want
> people to start depending on a catchable exception being thrown because
> that would stop us from ever unifying unmarked literal strings and
> Unicode strings.

Agreed (as I've said before).

> --
>
> Are there any steps we could take to make a future divorce of strings
> and byte arrays easier? What if we added a
>
>     binary_read()
>
> function that returns some form of byte array. The byte array type could
> be just like today's string type except that its type object would be
> distinct, it wouldn't have as many string-ish methods and it wouldn't
> have any auto-conversion to Unicode at all.

You can do this now with the array module, although clumsily:

    >>> import array
    >>> f = open("/core", "rb")
    >>> a = array.array('B', [0]) * 1000
    >>> f.readinto(a)
    1000
    >>>

Or if you wanted to read raw Unicode (UTF-16):

    >>> a = array.array('H', [0]) * 1000
    >>> f.readinto(a)
    2000
    >>> u = unicode(a, "utf-16")
    >>>

There are some performance issues, e.g. you have to initialize the buffer somehow and that seems a bit wasteful.

> People could start to transition code that reads non-ASCII data to the
> new function. We could put big warning labels on read() to state that it
> might not always be able to read data that is not in some small set of
> recognized encodings (probably UTF-8 and UTF-16).
>
> Or perhaps binary_open().
Or perhaps both. > > I do not suggest just using the text/binary flag on the existing open > function because we cannot immediately change its behavior without > breaking code. A new method makes most sense -- there are definitely situations where you want to read in text mode for a while and then switch to binary mode (e.g. HTTP). I'd like to put this off until after Python 1.6 -- but it deserves attention. --Guido van Rossum (home page: http://www.python.org/~guido/) From trentm@activestate.com Wed May 3 00:03:22 2000 From: trentm@activestate.com (Trent Mick) Date: Tue, 2 May 2000 16:03:22 -0700 Subject: [Python-Dev] PROPOSAL: exposure of values in limits.h and float.h Message-ID: <20000502160322.A19101@activestate.com> I apologize if I am hitting covered ground. What about a module (called limits or something like that) that would expose some appropriate #define's in limits.h and float.h? For example:

    limits.FLT_EPSILON could expose the C DBL_EPSILON
    limits.FLT_MAX could expose the C DBL_MAX
    limits.INT_MAX could expose the C LONG_MAX (although that particular
        name would cause confusion with the actual C INT_MAX)

- Does this kind of thing already exist somewhere? Maybe in NumPy. - If we ever (perhaps in Py3K) turn the basic types into classes then these could turn into constant attributes of those classes, i.e.:

    f = 3.14159
    f.EPSILON =

- I thought of these values being useful when I thought of comparing two floats for equality. Doing a straight comparison of floats is dangerous/wrong but is it not okay to consider two floats reasonably equal iff:

    -EPSILON < float2 - float1 < EPSILON

Or maybe that should be two or three EPSILONs. It has been a while since I've done any numerical analysis stuff. I suppose the answer to my question is: "It depends on the situation." Could this algorithm for float comparison be a better default than the status quo? I know that Mark H. and others have suggested that Python should maybe not provide a float comparison operator at all to beginners. Trent -- Trent Mick trentm@activestate.com From mal@lemburg.com Wed May 3 00:11:37 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 03 May 2000 01:11:37 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> Message-ID: <390F60A9.A3AA53A9@lemburg.com> Guido van Rossum wrote: > > > > So what do you think of my new proposal of using ASCII as the default > > > "encoding"? How about using unicode-escape or raw-unicode-escape as default encoding ? (They would have to be adapted to disallow Latin-1 char input, though.) The advantage would be that they are compatible with ASCII while still providing loss-less conversion and since they use escape characters, you can even read them using an ASCII based editor.
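For concreteness, here is roughly what that round trip would look like, using the codec names from the current 1.6 alphas; that the escaping codecs produce exactly this output format is an assumption, not something MAL spelled out:

>>> u = u"ab\u1234\xe9"
>>> s = u.encode("unicode-escape")      # plain ASCII bytes
>>> s
'ab\\u1234\\xe9'
>>> unicode(s, "unicode-escape") == u   # loss-less round trip
1

Note the trade-off: the byte string survives any ASCII-only channel, but a literal backslash in the source data must itself be escaped, which is exactly the objection Guido raises below.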
-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mhammond@skippinet.com.au Wed May 3 00:12:18 2000 From: mhammond@skippinet.com.au (Mark Hammond) Date: Wed, 3 May 2000 09:12:18 +1000 Subject: [Python-Dev] Cannot declare the largest integer literal. In-Reply-To: <20000502134717.A16825@activestate.com> Message-ID:

> >>> i = -2147483648
> OverflowError: integer literal too large
> >>> i = -2147483648L
> >>> int(i)  # it *is* a valid integer literal
> -2147483648

I struck this years ago! At the time, the answer was "yes, it's an implementation flaw that's not worth fixing". Interestingly, it _does_ work as a hex literal:

>>> 0x80000000
-2147483648
>>> -2147483648
Traceback (OverflowError: integer literal too large
>>>

Mark. From mal@lemburg.com Wed May 3 00:05:28 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 03 May 2000 01:05:28 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <390EFE21.DAD7749B@prescod.net> Message-ID: <390F5F38.DD76CAF4@lemburg.com> Paul Prescod wrote: > > Combining characters are a whole 'nother level of complexity. Character > sets are hard. I don't accept the argument that "Unicode itself has > complexities so that gives us license to introduce even more > complexities at the character representation level." > > > FYI: Normalization is needed to make comparing Unicode > > strings robust, e.g. u"é" should compare equal to u"e\u0301". > > That's a whole 'nother debate at a whole 'nother level of abstraction. I > think we need to get the bytes/characters level right and then we can > worry about display-equivalent characters (or leave that to the Python > programmer to figure out...). I just wanted to point out that the argument "slicing doesn't work with UTF-8" is moot. I do see a point against UTF-8 auto-conversion given the example that Guido mailed me:

"""
s = 'ab\341\210\264def'   # == str(u"ab\u1234def")
s.find(u"def")

This prints 3 -- the wrong result since "def" is found at s[5:8], not at s[3:6].
"""

-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tim_one@email.msn.com Wed May 3 03:20:20 2000 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 2 May 2000 22:20:20 -0400 Subject: [Python-Dev] Cannot declare the largest integer literal. In-Reply-To: <20000502134717.A16825@activestate.com> Message-ID: <000001bfb4a6$21da7900$922d153f@tim> [Trent Mick]

> >>> i = -2147483648
> OverflowError: integer literal too large
> >>> i = -2147483648L
> >>> int(i)  # it *is* a valid integer literal
> -2147483648

Python's grammar is such that negative integer literals don't exist; what you actually have there is the unary minus operator applied to positive integer literals; indeed,

>>> def f():
...     return -42
...
>>> import dis
>>> dis.dis(f)
          0 SET_LINENO          1
          3 SET_LINENO          2
          6 LOAD_CONST          1 (42)
          9 UNARY_NEGATIVE
         10 RETURN_VALUE
         11 LOAD_CONST          0 (None)
         14 RETURN_VALUE
>>>

Note that, at runtime, the example loads +42, then negates it: this wart has deep roots! > ... > And was the effect on functions like PyOS_strtol() down the pipe > missed? More that it was considered an inconsequential endcase.
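To make the endcase concrete, here are the usual workarounds; this is a sketch, not part of the original thread, but both lines should behave the same in any 1.5.2/1.6 interpreter:

>>> i = int(-2147483648L)    # detour through a long literal
>>> i
-2147483648
>>> -2147483647 - 1          # or build INT_MIN at runtime, C-style
-2147483648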
It's sure not worth changing the grammar for. I'd rather see Python erase the visible distinction between ints and longs. From guido@python.org Wed May 3 03:31:21 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 22:31:21 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Wed, 03 May 2000 01:11:37 +0200." <390F60A9.A3AA53A9@lemburg.com> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com> Message-ID: <200005030231.WAA02678@eric.cnri.reston.va.us> > Guido van Rossum wrote: > > > > So what do you think of my new proposal of using ASCII as the default > > > > "encoding"? [MAL] > How about using unicode-escape or raw-unicode-escape as > default encoding ? (They would have to be adapted to disallow > Latin-1 char input, though.) > > The advantage would be that they are compatible with ASCII > while still providing loss-less conversion and since they > use escape characters, you can even read them using an > ASCII based editor. No, the backslash should mean itself when encoding from ASCII to Unicode. --Guido van Rossum (home page: http://www.python.org/~guido/) From esr@thyrsus.com Wed May 3 04:22:20 2000 From: esr@thyrsus.com (Eric S. Raymond) Date: Tue, 2 May 2000 23:22:20 -0400 Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate In-Reply-To: <390EFE8C.4C10473C@prescod.net>; from paul@prescod.net on Tue, May 02, 2000 at 11:13:00AM -0500 References: <200005021231.IAA24249@eric.cnri.reston.va.us> <200005021421.KAA24526@eric.cnri.reston.va.us> <390EFE8C.4C10473C@prescod.net> Message-ID: <20000502232220.B18638@thyrsus.com> Paul Prescod : > Where are we going? What's our long-range vision? > > Three years from now where will we be? > > 1. How will we handle characters? > 2. How will we handle bytes? > 3. What will unadorned literal strings "do"? > 4. Will literal strings be the same type as byte arrays? > > I don't see how we can make decisions today without a vision for the > future. I think that this is the central point in our disagreement. Some > of us are aiming for as much compatibility with where we think we should > be going and others are aiming for as much compatibility as possible > with where we came from. And *that* is the most insightful statement I have seen in this entire foofaraw (which I have carefully been staying right the hell out of). Everybody meditate on the above, please. Then declare your objectives *at this level* so our Fearless Leader can make an informed decision *at this level*. Only then will it make sense to argue encoding theology... -- Eric S. Raymond "Extremism in the defense of liberty is no vice; moderation in the pursuit of justice is no virtue."
-- Barry Goldwater (actually written by Karl Hess) From tim_one@email.msn.com Wed May 3 06:05:59 2000 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 3 May 2000 01:05:59 -0400 Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate In-Reply-To: <200005021400.KAA24464@eric.cnri.reston.va.us> Message-ID: <000301bfb4bd$463ec280$622d153f@tim> [Guido] > When *comparing* 8-bit and Unicode strings, the presence of non-ASCII > bytes in either should make the comparison fail; when ordering is > important, we can make an arbitrary choice e.g. "\377" < u"\200". [Toby] > I assume 'fail' means 'non-equal', rather than 'raises an exception'? [Guido] > Yes, sorry for the ambiguity. Huh! You sure about that? If we're setting up a case where meaningful comparison is impossible, isn't an exception more appropriate? The current

>>> 83479278 < "42"
1
>>>

probably traps more people than it helps. From tim_one@email.msn.com Wed May 3 06:19:28 2000 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 3 May 2000 01:19:28 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <017d01bfb3bc$c3734c00$34aab5d4@hagrid> Message-ID: <000401bfb4bf$27ec1600$622d153f@tim> [Fredrik Lundh] > ... > (if you like, I can post more "fun with unicode" messages ;-) By all means! Exposing a gotcha to ridicule does more good than a dozen abstract arguments. But next time stoop to explaining what it is that's surprising. From just@letterror.com Wed May 3 07:47:07 2000 From: just@letterror.com (Just van Rossum) Date: Wed, 3 May 2000 07:47:07 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <390F5F38.DD76CAF4@lemburg.com> References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <390EFE21.DAD7749B@prescod.net> Message-ID: [MAL vs. PP] >> > FYI: Normalization is needed to make comparing Unicode >> > strings robust, e.g. u"é" should compare equal to u"e\u0301". >> >> That's a whole 'nother debate at a whole 'nother level of abstraction. I >> think we need to get the bytes/characters level right and then we can >> worry about display-equivalent characters (or leave that to the Python >> programmer to figure out...). > >I just wanted to point out that the argument "slicing doesn't >work with UTF-8" is moot. And failed... I asked two Unicode gurus I happen to know about the normalization issue (which is indeed not relevant to the current discussion, but it's fascinating nevertheless!). (Sorry about the possibly wrong email encoding... "è" is u"\350", "ö" is u"\366") John Jenkins replied: """ Well, I'm not sure you want to hear the answer -- but it really depends on what the language is attempting to do. By and large, Unicode takes the position that "e`" should always be treated the same as "è". This is a *semantic* equivalence -- that is, they *mean* the same thing -- and doesn't depend on the display engine to be true. Unicode also provides a default collation algorithm (http://www.unicode.org/unicode/reports/tr10/). At the same time, the standard acknowledges that in real life, string comparison and collation are complicated, language-specific problems requiring a lot of work and interaction with the user to do right. From the perspective of a programming language, it would best be served IMHO by implementing the contents of TR10 for string comparison and collation. That would make "e`" and "è" come out as equivalent.
""" Dave Opstad replied: """ Unicode talks about "canonical decomposition" in order to make it easier to answer questions like yours. Specifically, in the Unicode 3.0 standard, rule D24 in section 3.6 (page 44) states that: "Two character sequences are said to be canonical equivalents if their full canonical decompositions are identical. For example, the sequences and <=F6> are canonical equivalents. Canonical equivalence is a Unicode propert. It should not be confused with language-specific collation or matching, which may add additional equivalencies." So they still have language-specific differences, even if Unicode sees them as canonically equivalent. You might want to check this out: http://www.unicode.org/unicode/reports/tr15/tr15-18.html It's the latest technical report on these issues, which may help clarify things further. """ It's very deep stuff, which seems more appropriate for an extension than for builtin comparisons to me. Just From tim_one@email.msn.com Wed May 3 06:47:37 2000 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 3 May 2000 01:47:37 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Message-ID: <000501bfb4c3$16743480$622d153f@tim> [Moshe Zadka] > ... > I'd much prefer Python to reflect a fundamental truth about Unicode, > which at least makes sure binary-goop can pass through Unicode and > remain unharmed, then to reflect a nasty problem with UTF-8 (not > everything is legal). Then you don't want Unicode at all, Moshe. All the official encoding schemes for Unicode 3.0 suffer illegal byte sequences (for example, 0xffff is illegal in UTF-16 (whether BE or LE); this isn't merely a matter of Unicode not yet having assigned a character to this position, it's that the standard explicitly makes this sequence illegal and guarantees it will always be illegal! the other place this comes up is with surrogates, where what's legal depends on both parts of a character pair; and, again, the illegalities here are guaranteed illegal for all time). UCS-4 is the closest thing to binary-transparent Unicode encodings get, but even there the length of a thing is contrained to be a multiple of 4 bytes. Unicode and binary goop will never coexist peacefully. From ping@lfw.org Wed May 3 06:56:12 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Tue, 2 May 2000 22:56:12 -0700 (PDT) Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate In-Reply-To: <000301bfb4bd$463ec280$622d153f@tim> Message-ID: On Wed, 3 May 2000, Tim Peters wrote: > [Toby] > > I assume 'fail' means 'non-equal', rather than 'raises an exception'? > > [Guido] > > Yes, sorry for the ambiguity. > > Huh! You sure about that? If we're setting up a case where meaningful > comparison is impossible, isn't an exception more appropriate? The current > > >>> 83479278 < "42" > 1 > > probably traps more people than it helps. Yeah, when i said No automatic conversions between Unicode strings and 8-bit "strings". i was about to say Raise an exception on any operation attempting to combine or compare Unicode strings and 8-bit "strings". ...and then i thought, oh crap, but everything in Python is supposed to be comparable. What happens when you have some lists with arbitrary objects in them and you want to sort them for printing, or to canonicalize them so you can compare? It might be too troublesome for list.sort() to throw an exception because e.g. strings and ints were incomparable, or 8-bit "strings" and Unicode strings were incomparable... So -- what's the philosophy, Guido? 
Are we committed to "everything is comparable" (well, "all built-in types are comparable") or not? -- ?!ng From tim_one@email.msn.com Wed May 3 07:40:54 2000 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 3 May 2000 02:40:54 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Message-ID: <000701bfb4ca$87b765c0$622d153f@tim> [MAL] > I just wanted to point out that the argument "slicing doesn't > work with UTF-8" is moot. [Just] > And failed... He succeeded for me. Blind slicing doesn't always "work right" no matter what encoding you use, because "work right" depends on semantics beyond the level of encoding. UTF-8 is no worse than anything else in this respect. From just@letterror.com Wed May 3 08:50:11 2000 From: just@letterror.com (Just van Rossum) Date: Wed, 3 May 2000 08:50:11 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <000701bfb4ca$87b765c0$622d153f@tim> References: Message-ID: [MAL] > I just wanted to point out that the argument "slicing doesn't > work with UTF-8" is moot. [Just] > And failed... [Tim] >He succeeded for me. Blind slicing doesn't always "work right" no matter >what encoding you use, because "work right" depends on semantics beyond the >level of encoding. UTF-8 is no worse than anything else in this respect. But the discussion *was* at the level of encoding! Still it is worse, since an arbitrary utf-8 slice may result in two illegal strings -- slicing "e`" results in two perfectly legal strings, at the encoding level. Had he used surrogates as an example, he would've been right... (But even that is an encoding issue.) Just From tim_one@email.msn.com Wed May 3 08:11:12 2000 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 3 May 2000 03:11:12 -0400 Subject: [Python-Dev] PROPOSAL: exposure of values in limits.h and float.h In-Reply-To: <20000502160322.A19101@activestate.com> Message-ID: <000801bfb4ce$c361ea60$622d153f@tim> [Trent Mick] > I apologize if I am hitting covered ground. What about a module (called > limits or something like that) that would expose some appropriate > #define's in limits.h and float.h? I personally have little use for these. > For example: > > limits.FLT_EPSILON could expose the C DBL_EPSILON > limits.FLT_MAX could expose the C DBL_MAX Hmm -- all evidence suggests that your "O" and "A" keys work fine, so where did the absurdly abbreviated FLT come from? > limits.INT_MAX could expose the C LONG_MAX (although that particular name > would cause confusion with the actual C INT_MAX) That one is available as sys.maxint. > - Does this kind of thing already exist somewhere? Maybe in NumPy. Dunno. I compute the floating-point limits when needed with Python code, and observing what the hardware actually does is a heck of a lot more trustworthy than platform C header files (and especially when cross-compiling). > - If we ever (perhaps in Py3K) turn the basic types into classes > then these could turn into constant attributes of those classes, i.e.: > f = 3.14159 > f.EPSILON = That sounds better. > - I thought of these values being useful when I thought of comparing > two floats for equality. Doing a straight comparison of floats is > dangerous/wrong This is a myth whose only claim to veracity is the frequency and intensity with which it's mechanically repeated <0.6 wink>. It's no more dangerous than adding two floats: you're potentially screwed if you don't know what you're doing in either case, but you're in no trouble at all if you do.
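For reference, a sketch of the two tests being contrasted in this exchange: the absolute-error check Trent proposes (quoted next), and a relative-error check in the spirit of the Knuth definition Tim points to. The function names and the choice of eps are made up for illustration:

def abs_close(x, y, eps=1e-9):
    # Trent's proposal: absolute error. Breaks down when x and y
    # are both very large (never "close") or both tiny (always "close").
    return -eps < y - x < eps

def rel_close(x, y, eps=1e-9):
    # relative error: scale the tolerance by the operands' magnitudes
    return abs(y - x) <= eps * max(abs(x), abs(y))

Note that rel_close(0.0, y) is true only for y == 0.0, which illustrates why 0.0 is an irksome comparand and why no single default suits every situation.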
> but is it not okay to consider two floats reasonably equal iff:
>     -EPSILON < float2 - float1 < EPSILON

Knuth (Vol 2) gives a reasonable defn of approximate float equality. Yours is measuring absolute error, which is almost never reasonable; relative error is the measure of interest, but then 0.0 is an especially irksome comparand. > ... > I suppose the answer to my question is: "It depends on the situation." Yes. > Could this algorithm for float comparison be a better default than the > status quo? No. > I know that Mark H. and others have suggested that Python should maybe > not provide a float comparison operator at all to beginners. There's a good case to be made for not exposing *anything* about fp to beginners, but comparisons aren't especially surprising. This usually gets suggested when a newbie is surprised that e.g. 1./49*49 != 1. Telling them they *are* equal is simply a lie, and they'll pay for that false comfort twice over a little bit later down the fp road. For example, int(1./49*49) is 0 on IEEE-754 platforms, which is awfully surprising for an expression that "equals" 1(!). The next suggestion is then to fudge int() too, and so on and so on. It's like the arcade Whack-A-Mole game: each mole you knock into its hole pops up two more where you weren't looking. Before you know it, not even a bona fide expert can guess what code will actually do anymore. the-754-committee-probably-did-the-best-job-of-fixing-binary-fp-that-can-be-done-ly y'rs - tim From Fredrik Lundh Message-ID: <00b201bfb4d3$07a95420$34aab5d4@hagrid> Ka-Ping Yee wrote: > So -- what's the philosophy, Guido? Are we committed to "everything > is comparable" (well, "all built-in types are comparable") or not? in 1.6a2, obviously not:

>>> aUnicodeString < an8bitString
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: unexpected code byte

in 1.6a3, maybe. From Fredrik Lundh Message-ID: <00ce01bfb4d4$0a7d1820$34aab5d4@hagrid> Tim Peters wrote: > [Moshe Zadka] > > ... > > I'd much prefer Python to reflect a fundamental truth about Unicode, > > which at least makes sure binary-goop can pass through Unicode and > > remain unharmed, then to reflect a nasty problem with UTF-8 (not > > everything is legal). > > Then you don't want Unicode at all, Moshe. All the official encoding > schemes for Unicode 3.0 suffer illegal byte sequences (for example, 0xffff > is illegal in UTF-16 (whether BE or LE); this isn't merely a matter of > Unicode not yet having assigned a character to this position, it's that the > standard explicitly makes this sequence illegal and guarantees it will > always be illegal! in context, I think what Moshe meant was that with a straight character code mapping, any 8-bit string can always be mapped to a unicode string and back again. given a byte array "b":

    u = unicode(b, "default")
    assert map(ord, u) == map(ord, b)

again, this is no different from casting an integer to a long integer and back again. (imagine having to do that on the bits and bytes level!). and again, the internal unicode encoding used by the unicode string type itself, or when serializing that string type, has nothing to do with that. From jack@oratrix.nl Wed May 3 08:58:31 2000 From: jack@oratrix.nl (Jack Jansen) Date: Wed, 03 May 2000 09:58:31 +0200 Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Python bltinmodule.c,2.154,2.155 In-Reply-To: Message by bwarsaw@cnri.reston.va.us (Barry A.
Warsaw) , Tue, 2 May 2000 15:24:09 -0400 (EDT) , <20000502192409.8C44E6636B@anthem.cnri.reston.va.us> Message-ID: <20000503075832.18574370CF2@snelboot.oratrix.nl> > _PyBuiltin_Init_2(): Don't test Py_UseClassExceptionsFlag, just go > ahead and initialize the class-based standard exceptions. If this > fails, we throw a Py_FatalError. Isn't a Py_FatalError overkill? Or will not having the class-based standard exceptions lead to so much havoc later on that it is better than limping on? -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From just@letterror.com Wed May 3 10:03:16 2000 From: just@letterror.com (Just van Rossum) Date: Wed, 3 May 2000 10:03:16 +0100 Subject: [Python-Dev] Unicode comparisons & normalization Message-ID: After quickly browsing through the unicode.org URLs I posted earlier, I reach the following (possibly wrong) conclusions:

- there is a script and language independent canonical form (but automatic normalization is indeed a bad idea)
- ideally, unicode comparisons should follow the rules from http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic for 1.6, if at all...)
- this would indeed mean that it's possible for u == v even though type(u) is type(v) and len(u) != len(v). However, I don't see how this would collapse /F's world, as the two strings are at most semantically equivalent. Their physical difference is real, and still follows the a-string-is-a-sequence-of-characters rule (!).
- there may be additional customized language-specific sorting rules. I currently don't see how to implement that without some global variable.
- the sorting rules are very complicated, and should be implemented by calculating "sort keys". If I understood it correctly, these can take up to 4 bytes per character in its most compact form. Still, for it to be somewhat speed-efficient, they need to be cached...
- u.find() may need an alternative API, which returns a (begin, end) tuple, since the match may not have the same length as the search string... (This is tricky, since you need the begin and end indices in the non-canonical form...)

Just From Fredrik Lundh <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> Message-ID: <013c01bfb4d6$da19fb00$34aab5d4@hagrid> Guido van Rossum wrote: > > What do we do about str( my_unicode_string )? Perhaps escape the Unicode > > characters with backslashed numbers? > > Hm, good question. Tcl displays unknown characters as \x or \u > escapes. I think this may make more sense than raising an error. but that's on the display side of things, right? similar to repr, in other words. > But there must be a way to turn on Unicode-awareness on e.g. stdout > and then printing a Unicode object should not use str() (as it > currently does). to throw some extra gasoline on this, how about allowing str() to return unicode strings?
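For context, what str() does to a Unicode object in the current alphas -- assuming the default encoding there is still UTF-8 -- matches the example Guido mailed MAL earlier in the thread:

>>> str(u"ab\u1234def")    # str() must deliver an 8-bit string...
'ab\341\210\264def'        # ...so you get the UTF-8 bytes

Fredrik's question is whether that forced 8-bit conversion should be relaxed.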
(extra questions: how about renaming "unicode" to "string", and getting rid of "unichr"?) count to ten before replying, please. From ping@lfw.org Wed May 3 09:30:02 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Wed, 3 May 2000 01:30:02 -0700 (PDT) Subject: [Python-Dev] Unicode comparisons & normalization In-Reply-To: Message-ID: On Wed, 3 May 2000, Just van Rossum wrote: > After quickly browsing through the unicode.org URLs I posted earlier, I > reach the following (possibly wrong) conclusions: > > - there is a script and language independent canonical form (but automatic > normalization is indeed a bad idea) > - ideally, unicode comparisons should follow the rules from > http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic > for 1.6, if at all...) I just looked through this document. Indeed, there's a lot of work to be done if we want to compare strings this way. I thought the most striking feature was that this comparison method does *not* satisfy the common assumption

    a > b implies a + c > b + d        (+ is concatenation)

-- in fact, it is specifically designed to allow for cases where differences in the *later* part of a string can have greater influence than differences in an earlier part of a string. It *does* still guarantee that

    a + b > a

and of course we can still rely on the most basic rules such as

    a > b and b > c implies a > c

There are sufficiently many significant transformations described in the UTR 10 document that i'm pretty sure it is possible for two things to collate equally but not be equivalent. (Even after Unicode normalization, there is still the possibility of rearrangement in step 1.2.) This would be another motivation for Python to carefully separate the three types of equality:

    is   identity-equal
    ==   value-equal
    <=>  magnitude-equal

We currently don't distinguish between the last two; the operator "<=>" is my proposal for how to spell "magnitude-equal", and in terms of outward behaviour you can consider (a <=> b) to be (a <= b and a >= b). I suspect we will find ourselves needing it if we do rich comparisons anyway. (I don't know of any other useful kinds of equality, but if you've run into this before, do pipe up...) -- ?!ng From mal@lemburg.com Wed May 3 09:15:29 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 03 May 2000 10:15:29 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <390EFE21.DAD7749B@prescod.net> Message-ID: <390FE021.6F15C1C8@lemburg.com> Just van Rossum wrote: > > [MAL vs. PP] > >> > FYI: Normalization is needed to make comparing Unicode > >> > strings robust, e.g. u"é" should compare equal to u"e\u0301". > >> > >> That's a whole 'nother debate at a whole 'nother level of abstraction. I > >> think we need to get the bytes/characters level right and then we can > >> worry about display-equivalent characters (or leave that to the Python > >> programmer to figure out...). > > > >I just wanted to point out that the argument "slicing doesn't > >work with UTF-8" is moot. > > And failed... Huh ? The pure fact that you can have two (or more) Unicode characters to represent a single character makes Unicode itself have the same problems as e.g. UTF-8. > [Refs about collation and decomposition] > > It's very deep stuff, which seems more appropriate for an extension than > for builtin comparisons to me.
That's what I think too; I never argued for making this builtin and automatic (don't know where people got this idea from). -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From Fredrik Lundh Message-ID: <018a01bfb4de$7744cc00$34aab5d4@hagrid> Just van Rossum wrote: > After quickly browsing through the unicode.org URLs I posted earlier, I > reach the following (possibly wrong) conclusions: here's another good paper that covers this, the universe, and everything: Character Model for the World Wide Web http://www.w3.org/TR/charmod among many other things, it argues that normalization should be done at the source, and that it should be sufficient to do binary matching to tell if two strings are identical. ... another very interesting thing from that paper is where they identify four layers of character support:

Layer 1: Physical representation. This is necessary for APIs that expose a physical representation of string data. /.../ To avoid problems with duplicates, it is assumed that the data is normalized /.../

Layer 2: Indexing based on abstract codepoints. /.../ This is the highest layer of abstraction that ensures interoperability with very low implementation effort. To avoid problems with duplicates, it is assumed that the data is normalized /.../

Layer 3: Combining sequences, user-relevant. /.../ While we think that an exact definition of this layer should be possible, such a definition does not currently exist.

Layer 4: Depending on language and operation. This layer is least suited for interoperability, but is necessary for certain operations, e.g. sorting.

until now, this discussion has focussed on the boundary between layer 1 and 2. that as many Python strings as possible should be on the second layer has always been obvious to me ("a very low implementation effort" is exactly my style ;-), and leave the rest for the app. ...while Guido and MAL have argued that we should stay on level 1 (apparently because "we've already implemented it" is less effort than "let's change a little bit") no wonder they never understand what I'm talking about... it's also interesting to see that MAL's using layer 3 and 4 issues as an argument to keep Python's string support at layer 1. in contrast, the W3 paper thinks that normalization is a non-issue also on the layer 1 level. go figure. ... btw, how about adopting this paper as the "Character Model for Python"? yes, I'm serious. PS. here's my take on Just's normalization points: > - there is a script and language independent canonical form (but automatic > normalization is indeed a bad idea) > - ideally, unicode comparisons should follow the rules from > http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic > for 1.6, if at all...) note that the W3 paper recommends early normalization, and binary comparison (assuming the same internal representation of the unicode character codes, of course). > - this would indeed mean that it's possible for u == v even though type(u) > is type(v) and len(u) != len(v). However, I don't see how this would > collapse /F's world, as the two strings are at most semantically > equivalent. Their physical difference is real, and still follows the > a-string-is-a-sequence-of-characters rule (!). yes, but on layer 3 instead of layer 2. > - there may be additional customized language-specific sorting rules. I
> currently don't see how to implement that without some global variable. layer 4. > - the sorting rules are very complicated, and should be implemented by > calculating "sort keys". If I understood it correctly, these can take up to > 4 bytes per character in its most compact form. Still, for it to be > somewhat speed-efficient, they need to be cached... layer 4. > - u.find() may need an alternative API, which returns a (begin, end) tuple, > since the match may not have the same length as the search string... (This > is tricky, since you need the begin and end indices in the non-canonical > form...) layer 3. From Fredrik Lundh <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> Message-ID: <01ed01bfb4df$8feddb60$34aab5d4@hagrid> M.-A. Lemburg wrote: > Guido van Rossum wrote: > > > > > > So what do you think of my new proposal of using ASCII as the default > > > > "encoding"? > > How about using unicode-escape or raw-unicode-escape as > default encoding ? (They would have to be adapted to disallow > Latin-1 char input, though.) > > The advantage would be that they are compatible with ASCII > while still providing loss-less conversion and since they > use escape characters, you can even read them using an > ASCII based editor. umm. if you disallow latin-1 characters, how can you call this one loss-less? looks like political correctness taken to an entirely new level... From ping@lfw.org Wed May 3 09:50:30 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Wed, 3 May 2000 01:50:30 -0700 (PDT) Subject: [Python-Dev] Unicode debate In-Reply-To: <013c01bfb4d6$da19fb00$34aab5d4@hagrid> Message-ID: On Wed, 3 May 2000, Fredrik Lundh wrote: > Guido van Rossum wrote: > > But there must be a way to turn on Unicode-awareness on e.g. stdout > > and then printing a Unicode object should not use str() (as it > > currently does). > > to throw some extra gasoline on this, how about allowing > str() to return unicode strings? You still need to *print* them somehow. One way or another, stdout is still just a stream with bytes on it, unless we augment file objects to understand encodings. stdout sends bytes to something -- and that something will interpret the stream of bytes in some encoding (could be Latin-1, UTF-8, ISO-2022-JP, whatever). So either:

1. You explicitly downconvert to bytes, and specify the encoding each time you do. Then write the bytes to stdout (or your file object).

2. The file object is smart and can be told what encoding to use, and Unicode strings written to the file are automatically converted to bytes.

Another thread mentioned having separate read/write and binary_read/binary_write methods on files. I suggest doing it the other way, actually: since read/write operate on byte streams now, *they* are the binary operations; the new methods should be the ones that do the extra encoding/decoding work, and could be called uniread/uniwrite, uread/uwrite, textread/textwrite, etc. > (extra questions: how about renaming "unicode" to "string", > and getting rid of "unichr"?)
Would you expect chr(x) to return an 8-bit string when x < 128, and a Unicode string when x >= 128? -- ?!ng From ping@lfw.org Wed May 3 10:32:31 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Wed, 3 May 2000 02:32:31 -0700 (PDT) Subject: [Python-Dev] Re: Unicode debate In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us> Message-ID: On Tue, 2 May 2000, Guido van Rossum wrote: > > P. P. S. If always having to specify encodings is really too much, > > i'd probably be willing to consider a default-encoding state on the > > Unicode class, but it would have to be a stack of values, not a > > single value. > > Please elaborate? On general principle, it seems bad to just have a "set" method that encourages people to set static state in a way that irretrievably loses the current state. For something like this, you want a "push" method and a "pop" method with which to bracket a series of operations, so that you can easily write code which politely leaves other code unaffected. For example:

>>> x = unicode("d\351but")   # assume Guido-ASCII wins
UnicodeError: ASCII encoding error: value out of range
>>> x = unicode("d\351but", "latin-1")
>>> x
u'd\351but'
>>> print x.encode("latin-1")   # on my xterm with Latin-1 fonts
début
>>> x.encode("utf-8")
'd\303\251but'

Now:

>>> u"".pushenc("latin-1")   # need a better interface to this?
>>> x = unicode("d\351but")  # okay now
>>> x
u'd\351but'
>>> u"".pushenc("utf-8")
>>> x = unicode("d\351but")
UnicodeError: UTF-8 decoding error: invalid data
>>> x = unicode("d\303\251but")
>>> print x.encode("latin-1")
début
>>> str(x)
'd\303\251but'
>>> u"".popenc()   # back to the Latin-1 encoding
>>> str(x)
'd\351but'
. . .
>>> u"".popenc()   # back to the ASCII encoding

Similarly, imagine:

>>> x = u""
>>> file = open("foo.jis", "w")
>>> file.pushenc("iso-2022-jp")
>>> file.uniwrite(x)
. . .
>>> file.popenc()
>>> import sys
>>> sys.stdout.write(x)   # bad! x contains chars > 127
UnicodeError: ASCII decoding error: value out of range
>>> sys.stdout.pushenc("iso-2022-jp")
>>> sys.stdout.write(x)   # on a kterm with kanji fonts
. . .
>>> sys.stdout.popenc()

The above examples incorporate the Guido-ASCII proposal, which makes a fair amount of sense to me now. How do they look to y'all? This illustrates the remaining wart:

>>> sys.stdout.pushenc("iso-2022-jp")
>>> print x   # still bad! str is still doing ASCII
UnicodeError: ASCII decoding error: value out of range
>>> u"".pushenc("iso-2022-jp")
>>> print x   # on a kterm with kanji fonts

Writing to files asks the file object to convert from Unicode to bytes, then write the bytes. Printing converts the Unicode to bytes first with str(), then hands the bytes to the file object to write. This wart is really a larger printing issue. If we want to solve it, files have to know what to do with objects, i.e.

    print x

doesn't mean

    sys.stdout.write(str(x) + "\n")

instead it means

    sys.stdout.printout(x)

Hmm. I think this might deserve a separate subject line. -- ?!ng From ping@lfw.org Wed May 3 10:41:20 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Wed, 3 May 2000 02:41:20 -0700 (PDT) Subject: [Python-Dev] Printing objects on files In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us> Message-ID: The following is all stolen from E: see http://www.erights.org/. As i mentioned in the previous message, there are reasons that we might want to enable files to know what it means to print things on them.
print x would mean sys.stdout.printout(x) where sys.stdout is defined something like

    def __init__(self):
        self.encs = ["ASCII"]

    def pushenc(self, enc):
        self.encs.append(enc)

    def popenc(self):
        self.encs.pop()
        if not self.encs:
            self.encs = ["ASCII"]

    def printout(self, x):
        if type(x) is type(u""):
            self.write(x.encode(self.encs[-1]))
        else:
            x.__print__(self)
        self.write("\n")

and each object would have a __print__ method; for lists, e.g.:

    def __print__(self, file):
        file.write("[")
        if len(self):
            file.printout(self[0])
            for item in self[1:]:
                file.write(", ")
                file.printout(item)
        file.write("]")

for floats, e.g.:

    def __print__(self, file):
        if hasattr(file, "floatprec"):
            prec = file.floatprec
        else:
            prec = 17
        file.write("%%.%df" % prec % self)

The passing of control between the file and the objects to be printed enables us to make Tim happy:

>>> l = [1/2, 1/3, 1/4]   # I can dream, can't i?
>>> print l
[0.5, 0.33333333333333331, 0.25]
>>> sys.stdout.floatprec = 6
>>> print l
[0.5, 0.333333, 0.25]

Fantasizing about other useful kinds of state beyond "encs" and "floatprec" ("listmax"? "ratprec"?) and managing this namespace is left as an exercise to the reader. -- ?!ng From ht@cogsci.ed.ac.uk Wed May 3 10:59:28 2000 From: ht@cogsci.ed.ac.uk (Henry S. Thompson) Date: 03 May 2000 10:59:28 +0100 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Guido van Rossum's message of "Mon, 01 May 2000 20:53:26 -0400" References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> Message-ID: Guido van Rossum writes: > Paul, we're both just saying the same thing over and over without > convincing each other. I'll wait till someone who wasn't in this > debate before chimes in. OK, I've never contributed to this discussion, but I have a long history of shipping widely used Python/Tkinter/XML tools (see my homepage). I care _very_ much that heretofore I have been unable to support full XML because of the lack of Unicode support in Python. I've already started playing with 1.6a2 for this reason. I notice one apparent mis-communication between the various contributors: Treating narrow-strings as consisting of UNICODE code points <= 255 is not necessarily the same thing as making Latin-1 the default encoding. I don't think on Paul and Fredrik's account encodings are relevant to narrow-strings at all. I'd rather go right away to the coherent position of byte-arrays, narrow-strings and wide-strings. Encodings are only relevant to conversion between byte-arrays and strings. Decoding a byte-array with a UTF-8 encoding into a narrow string might cause overflow/truncation, just as decoding a byte-array with a UTF-8 encoding into a wide-string might. The fact that decoding a byte-array with a Latin-1 encoding into a narrow-string is a memcopy is just a side-effect of the courtesy of the UNICODE designers wrt the code points between 128 and 255. This is effectively the way our C-based XML toolset (which we embed in Python) works today -- we build an 8-bit version which uses char* strings, and a 16-bit version which uses unsigned short* strings, and convert from/to byte-streams in any supported encoding at the margins.
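In Python terms, the keep-bytes-at-the-margins model Henry describes might look something like this; a sketch only, with the file names and the choice of UTF-8 as placeholders:

    data = open("doc.xml", "rb").read()     # bytes at the input margin
    text = unicode(data, "utf-8")           # decode once, at the edge
    # ... all processing happens on characters ...
    out = open("copy.xml", "wb")
    out.write(text.encode("utf-8"))         # encode once, at the output margin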
I'd like to keep byte-arrays at the margins in Python as well, for all the reasons advanced by Paul and Fredrik. I think treating existing strings as a sort of pun between narrow-strings and byte-arrays is a recipe for ongoing confusion. ht -- Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From ping@lfw.org Wed May 3 10:51:30 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Wed, 3 May 2000 02:51:30 -0700 (PDT) Subject: [Python-Dev] Re: Printing objects on files In-Reply-To: Message-ID: On Wed, 3 May 2000, Ka-Ping Yee wrote: > > Fantasizing about other useful kinds of state beyond "encs" > and "floatprec" ("listmax"? "ratprec"?) and managing this > namespace is left as an exercise to the reader. Okay, i lied. Shortly after writing this i realized that it is probably advisable for all such bits of state to be stored in stacks, so an interface such as this might do:

    def push(self, key, value):
        if not self.state.has_key(key):
            self.state[key] = []
        self.state[key].append(value)

    def pop(self, key):
        if self.state.has_key(key):
            if len(self.state[key]):
                self.state[key].pop()

    def get(self, key):
        if self.state.has_key(key):
            stack = self.state[key]
            if stack:
                return stack[-1]
        return None

Thus:

>>> print 1/3
0.33333333333333331
>>> sys.stdout.push("float.prec", 6)
>>> print 1/3
0.333333
>>> sys.stdout.pop("float.prec")
>>> print 1/3
0.33333333333333331

And once we allow arbitrary strings as keys to the bits of state, the period is a natural separator we can use for managing the namespace. Take the special case for Unicode out of the file object:

    def printout(self, x):
        x.__print__(self)
        self.write("\n")

and have the Unicode string do the work:

    def __print__(self, file):
        file.write(self.encode(file.get("unicode.enc")))

This behaves just right if an encoding of None means ASCII. If mucking with encodings is sufficiently common, you could imagine conveniences on file objects such as

    def __init__(self, filename, mode, encoding=None):
        ...
        if encoding:
            self.push("unicode.enc", encoding)

    def pushenc(self, encoding):
        self.push("unicode.enc", encoding)

    def popenc(self):
        self.pop("unicode.enc")

From Fredrik Lundh Message-ID: <030a01bfb4ea$c2741e40$34aab5d4@hagrid> Ka-Ping Yee wrote: > > to throw some extra gasoline on this, how about allowing > > str() to return unicode strings? > > You still need to *print* them somehow. One way or another, > stdout is still just a stream with bytes on it, unless we > augment file objects to understand encodings. > > stdout sends bytes to something -- and that something will > interpret the stream of bytes in some encoding (could be > Latin-1, UTF-8, ISO-2022-JP, whatever). So either: > > 1. You explicitly downconvert to bytes, and specify > the encoding each time you do. Then write the > bytes to stdout (or your file object). > > 2. The file object is smart and can be told what > encoding to use, and Unicode strings written to > the file are automatically converted to bytes. which one's more convenient? (no, I won't tell you what I prefer. guido doesn't want more arguments from the old "characters are characters" proponents, so I gotta trick someone else to spell them out ;-) > > (extra questions: how about renaming "unicode" to "string", > > and getting rid of "unichr"?)
> Would you expect chr(x) to return an 8-bit string when x < 128, > and a Unicode string when x >= 128? that will break too much existing code, I think. but what about replacing 128 with 256? From just@letterror.com Wed May 3 12:41:27 2000 From: just@letterror.com (Just van Rossum) Date: Wed, 3 May 2000 12:41:27 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <390FE021.6F15C1C8@lemburg.com> References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <390EFE21.DAD7749B@prescod.net> Message-ID: At 10:15 AM +0200 03-05-2000, M.-A. Lemburg wrote: >Huh ? The pure fact that you can have two (or more) >Unicode characters to represent a single character makes >Unicode itself have the same problems as e.g. UTF-8. It's the different level of abstraction that makes it different. Even if "e`" is _equivalent_ to the combined character, that doesn't mean that it _is_ the combined character, on the level of abstraction we are talking about: it's still 2 characters, and those can be sliced apart without a problem. Slicing utf-8 doesn't work because it yields invalid strings, slicing "e`" does work since both halves are valid strings. The fact that "e`" is semantically equivalent to the combined character doesn't change that. Just From guido@python.org Wed May 3 12:12:44 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 07:12:44 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode comparisons & normalization In-Reply-To: Your message of "Wed, 03 May 2000 01:30:02 PDT." References: Message-ID: <200005031112.HAA03138@eric.cnri.reston.va.us> [Ping] > This would be another motivation for Python to carefully > separate the three types of equality: > > is identity-equal > == value-equal > <=> magnitude-equal > > We currently don't distinguish between the last two; > the operator "<=>" is my proposal for how to spell > "magnitude-equal", and in terms of outward behaviour > you can consider (a <=> b) to be (a <= b and a >= b). > I suspect we will find ourselves needing it if we do > rich comparisons anyway. I don't think that this form of equality deserves its own operator. The Unicode comparison rules are sufficiently hairy that it seems better to implement them separately, either in a separate module or at least as a Unicode-object-specific method, and let the == operator do what it does best: compare the representations. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Wed May 3 12:14:54 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 07:14:54 -0400 Subject: [Python-Dev] Unicode comparisons & normalization In-Reply-To: Your message of "Wed, 03 May 2000 11:02:09 +0200." <018a01bfb4de$7744cc00$34aab5d4@hagrid> References: <018a01bfb4de$7744cc00$34aab5d4@hagrid> Message-ID: <200005031114.HAA03152@eric.cnri.reston.va.us> > here's another good paper that covers this, the universe, and everything: There's a lot of useful pointers being flung around. Could someone with more spare cycles than I currently have perhaps collect these and produce a little write-up "further reading on Unicode comparison and normalization" (or perhaps a more comprehensive title if warranted) to be added to the i18n-sig's home page?
--Guido van Rossum (home page: http://www.python.org/~guido/) From just@letterror.com Wed May 3 13:26:50 2000 From: just@letterror.com (Just van Rossum) Date: Wed, 3 May 2000 13:26:50 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <030a01bfb4ea$c2741e40$34aab5d4@hagrid> References: Message-ID: [Ka-Ping Yee] > Would you expect chr(x) to return an 8-bit string when x < 128, > and a Unicode string when x >= 128? [Fredrik Lundh] > that will break too much existing code, I think. but what > about replacing 128 with 256? Hihi... and *poof* -- we're back to Latin-1 for narrow strings ;-) Just From guido@python.org Wed May 3 13:04:29 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 08:04:29 -0400 Subject: [Python-Dev] Unicode debate In-Reply-To: Your message of "Wed, 03 May 2000 12:31:34 +0200." <030a01bfb4ea$c2741e40$34aab5d4@hagrid> References: <030a01bfb4ea$c2741e40$34aab5d4@hagrid> Message-ID: <200005031204.IAA03252@eric.cnri.reston.va.us> [Ping] > > stdout sends bytes to something -- and that something will > > interpret the stream of bytes in some encoding (could be > > Latin-1, UTF-8, ISO-2022-JP, whatever). So either: > > > > 1. You explicitly downconvert to bytes, and specify > > the encoding each time you do. Then write the > > bytes to stdout (or your file object). > > > > 2. The file object is smart and can be told what > > encoding to use, and Unicode strings written to > > the file are automatically converted to bytes. [Fredrik] > which one's more convenient? Marc-Andre's codec module contains file-like objects that support this (or could easily be made to). However the problem is that print *always* first converts the object using str(), and str() enforces that the result is an 8-bit string. I'm afraid that loosening this will break too much code. (This all really happens at the C level.) I'm also afraid that this means that str(unicode) may have to be defined to yield UTF-8. My argument goes as follows:

1. We want to be able to set things up so that print u"..." does the right thing. (What "the right thing" is, is not defined here, as long as the user sees the glyphs implied by u"...".)

2. print u is equivalent to sys.stdout.write(str(u)).

3. str() must always return an 8-bit string.

4. So the solution must involve assigning an object to sys.stdout that does the right thing given an 8-bit encoding of u.

5. So we need str(u) to produce a lossless 8-bit encoding of Unicode.

6. UTF-8 is the only sensible candidate.

Note that (apart from print) str() is never implicitly invoked -- all implicit conversions when Unicode and 8-bit strings are combined go from 8-bit to Unicode. (There might be an alternative, but it would depend on having yet another hook (similar to Ping's sys.display) that gets invoked when printing an object (as opposed to displaying it at the interactive prompt). I'm not too keen on this because it would break code that temporarily sets sys.stdout to a file of its own choosing and then invokes print -- a common idiom to capture printed output in a string, for example, which could be embedded deep inside a module. If the main program were to install a naive print hook that always sent output to a designated place, this strategy might fail.) > > > (extra questions: how about renaming "unicode" to "string", > > > and getting rid of "unichr"?) > > > > Would you expect chr(x) to return an 8-bit string when x < 128, > > and a Unicode string when x >= 128? > > that will break too much existing code, I think.
> but what > about replacing 128 with 256? If the 8-bit Unicode proposal were accepted, this would make sense. In my "only ASCII is implicitly convertible" proposal, this would be a mistake, because chr(128) == "\x80" != u"\x80" == unichr(128). I agree with everyone that things would be much simpler if we had separate data types for byte arrays and 8-bit character strings. But we don't have this distinction yet, and I don't see a quick way to add it in 1.6 without majorly upsetting the release schedule. So all of my proposals are to be considered hacks to maintain as much b/w compatibility as possible while still supporting some form of Unicode. The fact that half the time 8-bit strings are really being used as byte arrays, while Python can't tell the difference, means (to me) that the default encoding is an important thing to argue about. I don't know if I want to push it out all the way to Py3k, but I just don't see a way to implement "a character is a character" in 1.6 given all the current constraints. (BTW I promise that 1.7 will be speedy once 1.6 is out of the door -- there's a lot else that was put off to 1.7.) Fredrik, I believe I haven't seen your response to my ASCII proposal. Is it just as bad as UTF-8 to you, or could you live with it? On a scale of 0-9 (0: UTF-8, 9: 8-bit Unicode), where is ASCII for you? Where's my sre snapshot? --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Wed May 3 13:16:56 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 08:16:56 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "03 May 2000 10:59:28 BST." References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> Message-ID: <200005031216.IAA03274@eric.cnri.reston.va.us> [Henry S. Thompson] > OK, I've never contributed to this discussion, but I have a long > history of shipping widely used Python/Tkinter/XML tools (see my > homepage). I care _very_ much that heretofore I have been unable to > support full XML because of the lack of Unicode support in Python. > I've already started playing with 1.6a2 for this reason. Thanks for chiming in! > I notice one apparent mis-communication between the various > contributors: > > Treating narrow-strings as consisting of UNICODE code points <= 255 is > not necessarily the same thing as making Latin-1 the default encoding. > I don't think on Paul and Fredrik's account encodings are relevant to > narrow-strings at all. I agree that's what they are trying to tell me. > I'd rather go right away to the coherent position of byte-arrays, > narrow-strings and wide-strings. Encodings are only relevant to > conversion between byte-arrays and strings. Decoding a byte-array > with a UTF-8 encoding into a narrow string might cause > overflow/truncation, just as decoding a byte-array with a UTF-8 > encoding into a wide-string might. The fact that decoding a > byte-array with a Latin-1 encoding into a narrow-string is a memcopy > is just a side-effect of the courtesy of the UNICODE designers wrt the > code points between 128 and 255.
> > This is effectively the way our C-based XML toolset (which we embed in > Python) works today -- we build an 8-bit version which uses char* > strings, and a 16-bit version which uses unsigned short* strings, and > convert from/to byte-streams in any supported encoding at the margins. > > I'd like to keep byte-arrays at the margins in Python as well, for all > the reasons advanced by Paul and Fredrik. > > I think treating existing strings as a sort of pun between > narrow-strings and byte-arrays is a recipe for ongoing confusion. Very good analysis. Unfortunately this is where we're stuck, until we have a chance to redesign this kind of thing from scratch. Python 1.5.2 programs use strings for byte arrays probably as much as they use them for character strings. This is because way back in 1990, when I was designing Python, I wanted to have the smallest set of basic types, but I also wanted to be able to manipulate byte arrays somewhat. Influenced by K&R C, I chose to make strings and string I/O 8-bit clean so that you could read a binary "string" from a file, manipulate it, and write it back to a file, regardless of whether it was character or binary data. This model has never been challenged until now. I agree that the Java model (byte arrays and strings) or perhaps your proposed model (byte arrays, narrow and wide strings) looks better. But, although Python has had rudimentary support for byte arrays for a while (the array module, introduced in 1993), the majority of Python code manipulating binary data still uses string objects. My ASCII proposal is a compromise that tries to be fair to both uses for strings. Introducing byte arrays as a more fundamental type has been on the wish list for a long time -- I see no way to introduce this into Python 1.6 without totally botching the release schedule (June 1st is very close already!). I'd like to be able to move on, there are other important things still to be added to 1.6 (Vladimir's malloc patches, Neil's GC, Fredrik's completed sre...). For 1.7 (which should happen later this year) I promise I'll reopen the discussion on byte arrays. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Wed May 3 13:18:39 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 08:18:39 -0400 Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Python bltinmodule.c,2.154,2.155 In-Reply-To: Your message of "Wed, 03 May 2000 09:58:31 +0200." <20000503075832.18574370CF2@snelboot.oratrix.nl> References: <20000503075832.18574370CF2@snelboot.oratrix.nl> Message-ID: <200005031218.IAA03288@eric.cnri.reston.va.us> > > _PyBuiltin_Init_2(): Don't test Py_UseClassExceptionsFlag, just go > > ahead and initialize the class-based standard exceptions. If this > > fails, we throw a Py_FatalError. > > Isn't a Py_FatalError overkill? Or will not having the class-based standard > exceptions lead to so much havoc later on that it is better than limping on? There will be *no* exception objects -- they will all be NULL pointers. It's not clear that you will be able to limp very far, and it's better to have a clear diagnostic at the source of the problem. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Wed May 3 13:22:57 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 08:22:57 -0400 Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate In-Reply-To: Your message of "Wed, 03 May 2000 01:05:59 EDT."
<000301bfb4bd$463ec280$622d153f@tim> References: <000301bfb4bd$463ec280$622d153f@tim> Message-ID: <200005031222.IAA03300@eric.cnri.reston.va.us> > [Guido] > > When *comparing* 8-bit and Unicode strings, the presence of non-ASCII > > bytes in either should make the comparison fail; when ordering is > > important, we can make an arbitrary choice e.g. "\377" < u"\200". > > [Toby] > > I assume 'fail' means 'non-equal', rather than 'raises an exception'? > > [Guido] > > Yes, sorry for the ambiguity. [Tim] > Huh! You sure about that? If we're setting up a case where meaningful > comparison is impossible, isn't an exception more appropriate? The current > > >>> 83479278 < "42" > 1 > >>> > > probably traps more people than it helps. Agreed, but that's the rule we all currently live by, and changing it is something for Python 3000. I'm not real strong on this though -- I was willing to live with exceptions from the UTF-8-to-Unicode conversion. If we all agree that it's better for u"\377" == "\377" to raise an precedent-setting exception than to return false, that's fine with me too. I do want u"a" == "a" to be true though (and I believe we all already agree on that one). Note that it's not the first precedent -- you can already define classes whose instances can raise exceptions during comparisons. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Wed May 3 09:56:08 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 03 May 2000 10:56:08 +0200 Subject: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <013c01bfb4d6$da19fb00$34aab5d4@hagrid> Message-ID: <390FE9A7.DE5545DA@lemburg.com> Fredrik Lundh wrote: > > Guido van Rossum wrote: > > > What do we do about str( my_unicode_string )? Perhaps escape the Unicode > > > characters with backslashed numbers? > > > > Hm, good question. Tcl displays unknown characters as \x or \u > > escapes. I think this may make more sense than raising an error. > > but that's on the display side of things, right? similar to > repr, in other words. > > > But there must be a way to turn on Unicode-awareness on e.g. stdout > > and then printing a Unicode object should not use str() (as it > > currently does). > > to throw some extra gasoline on this, how about allowing > str() to return unicode strings? > > (extra questions: how about renaming "unicode" to "string", > and getting rid of "unichr"?) > > count to ten before replying, please. 1 2 3 4 5 6 7 8 9 10 ... ok ;-) Guido's problem with printing Unicode can easily be solved using the standard codecs.StreamRecoder class as I've done in the example I posted some days ago. Basically, what the stdout wrapper would do is take strings as input, converting them to Unicode and then writing them encoded to the original stdout. For Unicode objects the conversion can be skipped and the encoded output written directly to stdout. This can be done for any encoding supported by Python; e.g. 
you could do the indirection in site.py and then have Unicode printed as Latin-1 or UTF-8 or one of the many code pages supported through the mapping codec. About having str() return Unicode objects: I see str() as constructor for string objects and under that assumption str() will always have to return string objects. unicode() does the same for Unicode objects, so renaming it to something else doesn't really help all that much. BTW, __str__() has to return strings too. Perhaps we need __unicode__() and a corresponding slot function too ?! -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed May 3 14:06:27 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 03 May 2000 15:06:27 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com> <01ed01bfb4df$8feddb60$34aab5d4@hagrid> Message-ID: <39102453.6923B10@lemburg.com> Fredrik Lundh wrote: > > M.-A. Lemburg wrote: > > Guido van Rossum wrote: > > > > > > > > So what do you think of my new proposal of using ASCII as the default > > > > > "encoding"? > > > > How about using unicode-escape or raw-unicode-escape as > > default encoding ? (They would have to be adapted to disallow > > Latin-1 char input, though.) > > > > The advantage would be that they are compatible with ASCII > > while still providing loss-less conversion and since they > > use escape characters, you can even read them using an > > ASCII based editor. > > umm. if you disallow latin-1 characters, how can you call this > one loss-less? [Guido didn't like this one, so its probably moot investing any more time on this...] I meant that the unicode-escape codec should only take ASCII characters as input and disallow non-escaped Latin-1 characters. Anyway, I'm out of this discussion... I'll wait a week or so until things have been sorted out. Have fun, -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From ping@lfw.org Wed May 3 14:09:59 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Wed, 3 May 2000 06:09:59 -0700 (PDT) Subject: [Python-Dev] Unicode debate In-Reply-To: <200005031204.IAA03252@eric.cnri.reston.va.us> Message-ID: On Wed, 3 May 2000, Guido van Rossum wrote: > (There might be an alternative, but it would depend on having yet > another hook (similar to Ping's sys.display) that gets invoked when > printing an object (as opposed to displaying it at the interactive > prompt). I'm not too keen on this because it would break code that > temporarily sets sys.stdout to a file of its own choosing and then > invokes print -- a common idiom to capture printed output in a string, > for example, which could be embedded deep inside a module. If the > main program were to install a naive print hook that always sent > output to a designated place, this strategy might fail.) 
I know this is not a small change, but i'm pretty convinced the right answer here is that the print hook should call a *method* on sys.stdout, whatever sys.stdout happens to be. The details are described in the other long message i wrote ("Printing objects on files"). Here is an addendum that might actually make that proposal feasible enough (compatibility-wise) to fly in the short term: print x does, conceptually: try: sys.stdout.printout(x) except AttributeError: sys.stdout.write(str(x)) sys.stdout.write("\n") The rest can then be added, and the change in 'print x' will work nicely for any file objects, but will not break on file-like substitutes that don't define a 'printout' method. Any reactions to the other benefit of this proposal -- namely, the ability to control the printing parameters of object components as they're being traversed for printing? That was actually the original motivation for doing the file.printout thing: it gives you some of the effect of "passing down str-ness" that we were discussing so heatedly a little while ago. The other thing that just might justify this much of a change is that, as you reasoned clearly in your other message, without adequate resolution to the printing problem we may have painted ourselves into a corner with regard to str(u"") conversion, and i don't like the look of that corner much. *Even* if we were to get people to agree that it's okay for str(u"") to produce UTF-8, it still seems pretty hackish to me that we're forced to choose this encoding as a way of working around the fact that we can't simply give the file the thing we want to print. -- ?!ng From Moshe Zadka Wed May 3 14:55:37 2000 From: Moshe Zadka (Moshe Zadka) Date: Wed, 3 May 2000 16:55:37 +0300 (IDT) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <000501bfb4c3$16743480$622d153f@tim> Message-ID: On Wed, 3 May 2000, Tim Peters wrote: [Moshe Zadka] > ... > I'd much prefer Python to reflect a fundamental truth about Unicode, > which at least makes sure binary-goop can pass through Unicode and > remain unharmed, than to reflect a nasty problem with UTF-8 (not > everything is legal). [Tim Peters] > Then you don't want Unicode at all, Moshe. All the official encoding > schemes for Unicode 3.0 suffer illegal byte sequences Of course I don't, and of course you're right. But what I do want is for my binary goop to pass unharmed through the evil Unicode forest. Which is why I don't want it to interpret my goop as a sequence of bytes it tries to decode, but I want the numeric values of my bytes to pass through to Unicode unharmed -- that means Latin-1 because of the second design decision of the horribly western-specific Unicode - the first 256 characters are the same as Latin-1. If it were up to me, I'd use Latin-3, but it wasn't, so it's not. > (for example, 0xffff > is illegal in UTF-16 (whether BE or LE) Tim, one of us must have cracked a chip. 0xffff is the same in BE and LE -- isn't it. -- Moshe Zadka http://www.oreilly.com/news/prescod_0300.html http://www.linux.org.il -- we put the penguin in .com From akuchlin@mems-exchange.org Wed May 3 15:12:06 2000 From: akuchlin@mems-exchange.org (Andrew M.
Kuchling) Date: Wed, 3 May 2000 10:12:06 -0400 (EDT) Subject: [Python-Dev] Unicode debate In-Reply-To: <200005031216.IAA03274@eric.cnri.reston.va.us> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: <14608.13238.339572.202494@amarok.cnri.reston.va.us> Guido van Rossum writes: >been on the wish list for a long time -- I see no way to introduce >this into Python 1.6 without totally botching the release schedule >(June 1st is very close already!). I'd like to be able to move on, My suggested criterion is that 1.6 not screw things up in a way that we'll regret when 1.7 rolls around. UTF-8 probably does back us into a corner that (And can we choose a mailing list for discussing this and stick to it? This is being cross-posted to three lists: python-dev, i18-sig, and xml-sig! i18-sig only, maybe? Or string-sig?) -- A.M. Kuchling http://starship.python.net/crew/amk/ Chess! I'm tormented by thoughts of strip chess. Pure mind just isn't enough, Mallah. I long for a body. -- The Brain, in DOOM PATROL #34 From akuchlin@mems-exchange.org Wed May 3 15:15:18 2000 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Wed, 3 May 2000 10:15:18 -0400 (EDT) Subject: [Python-Dev] Unicode debate In-Reply-To: <14608.13238.339572.202494@amarok.cnri.reston.va.us> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> <14608.13238.339572.202494@amarok.cnri.reston.va.us> Message-ID: <14608.13430.92985.717058@amarok.cnri.reston.va.us> Andrew M. Kuchling writes: >Guido van Rossum writes: >My suggested criterion is that 1.6 not screw things up in a way that >we'll regret when 1.7 rolls around. UTF-8 probably does back us into >a corner that Doh! To complete that paragraph: Magic conversions assuming UTF-8 does back us into a corner that is hard to get out of later. Magic conversions assuming Latin1 or ASCII are a bit better, but I'd lean toward the draconian solution: we don't know what we're doing, so do nothing and require the user to explicitly convert between Unicode and 8-bit strings in a user-selected encoding. --amk From guido@python.org Wed May 3 16:48:32 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 11:48:32 -0400 Subject: [Python-Dev] Unicode debate In-Reply-To: Your message of "Wed, 03 May 2000 10:15:18 EDT." 
<14608.13430.92985.717058@amarok.cnri.reston.va.us> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> <14608.13238.339572.202494@amarok.cnri.reston.va.us> <14608.13430.92985.717058@amarok.cnri.reston.va.us> Message-ID: <200005031548.LAA03595@eric.cnri.reston.va.us> > >Guido van Rossum writes: > >My suggested criterion is that 1.6 not screw things up in a way that > >we'll regret when 1.7 rolls around. UTF-8 probably does back us into > >a corner that > Andrew M. Kuchling writes: > Doh! To complete that paragraph: Magic conversions assuming UTF-8 > does back us into a corner that is hard to get out of later. Magic > conversions assuming Latin1 or ASCII are a bit better, but I'd lean > toward the draconian solution: we don't know what we're doing, so do > nothing and require the user to explicitly convert between Unicode and > 8-bit strings in a user-selected encoding. GvR responds: That's what Ping suggested. My reason for proposing default conversions from ASCII is that there is much code that deals with character strings in a fairly abstract sense and that would work out of the box (or after very small changes) with Unicode strings. This code often uses some string literals containing ASCII characters. An arbitrary example: code to reformat a text paragraph; another: an XML parser. These look for certain ASCII characters given as literals in the code (" ", "<" and so on) but the algorithm is essentially independent of what encoding is used for non-ASCII characters. (I realize that the text reformatting example doesn't work for all Unicode characters because its assumption that all characters have equal width is broken -- but at the very least it should work with Latin-1 or Greek or Cyrillic stored in Unicode strings.) It's the same as for ints: a function to calculate the GCD works with ints as well as long ints without change, even though it references the int constant 0. In other words, we want string-processing code to be just as polymorphic as int-processing code. --Guido van Rossum (home page: http://www.python.org/~guido/) From just@letterror.com Wed May 3 20:55:24 2000 From: just@letterror.com (Just van Rossum) Date: Wed, 3 May 2000 20:55:24 +0100 Subject: [Python-Dev] Unicode strings: an alternative Message-ID: Today I had a relatively simple idea that unites wide strings and narrow strings in a way that is more backward compatible at the C level. It's quite possible this has already been considered and rejected for reasons that are not yet obvious to me, but I'll give it a shot anyway. The main concept is not to provide a new string type but to extend the existing string object like so: - wide strings are stored as if they were narrow strings, simply using two bytes for each Unicode character. - there's a flag that specifies whether the string is narrow or wide. - the ob_size field is the _physical_ length of the data; if the string is wide, len(s) will return ob_size/2, all other string operations will have to do similar things. - there can possibly be an encoding attribute which may specify the used encoding, if known.
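A minimal Python model of the scheme just sketched may help make the bookkeeping concrete. This is only an illustration under the proposal's assumptions -- the ExtString name, the struct-based packing, and the little-endian layout are invented here, not anything in the Python source:

    import struct

    class ExtString:
        """Toy model of the extended string object: one buffer of
        physical bytes plus a width flag saying how to read it."""

        _FMT = {1: "B", 2: "H"}

        def __init__(self, codes, width):
            if width not in self._FMT:
                raise ValueError("width must be 1 or 2")
            for c in codes:
                if c >= 1 << (8 * width):
                    raise OverflowError("code %d does not fit width %d" % (c, width))
            self.width = width
            # the ob_size analogue: the *physical* length of the data
            self.data = struct.pack("<%d%s" % (len(codes), self._FMT[width]), *codes)

        def __len__(self):
            # len(s) is ob_size divided by the character width
            return len(self.data) // self.width

        def __getitem__(self, i):
            # non-negative indices only, to keep the sketch short
            chunk = self.data[i * self.width:(i + 1) * self.width]
            if len(chunk) < self.width:
                raise IndexError(i)
            return struct.unpack("<" + self._FMT[self.width], chunk)[0]

    s = ExtString([0x41, 0x20AC], 2)   # "A" plus one 2-byte character
    assert len(s) == 2 and len(s.data) == 4 and s[1] == 0x20AC

The point of the model is the last line: C code that only looks at the raw buffer sees 4 bytes, while width-aware code sees 2 characters.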
Admittedly, this is tricky and involves quite a bit of effort to implement, since all string methods need to have narrow/wide switch. To make it worse, it hardly offers anything the current solution doesn't. However, it offers one IMHO _big_ advantage: C code that just passes strings along does not need to change: wide strings can be seen as narrow strings without any loss. This allows for __str__() & str() and friends to work with unicode strings without any change. Any thoughts? Just From tree@basistech.com Wed May 3 21:19:05 2000 From: tree@basistech.com (Tom Emerson) Date: Wed, 3 May 2000 16:19:05 -0400 (EDT) Subject: [Python-Dev] [I18n-sig] Unicode strings: an alternative In-Reply-To: References: Message-ID: <14608.35257.729641.178724@cymru.basistech.com> Just van Rossum writes: > The main concept is not to provide a new string type but to extend the > existing string object like so: This is the most logical thing to do. > - wide strings are stored as if they were narrow strings, simply using two > bytes for each Unicode character. I disagree with you here... store them as UTF-8. > - there's a flag that specifies whether the string is narrow or wide. Yup. > - the ob_size field is the _physical_ length of the data; if the string is > wide, len(s) will return ob_size/2, all other string operations will have > to do similar things. Is it possible to add a logical length field too? I presume it is too expensive to recalculate the logical (character) length of a string each time len(s) is called? Doing this is only slightly more time consuming than a normal strlen: really just O(n) + c, where 'c' is the constant time needed for table lookup (to get the number of bytes in the UTF-8 sequence given the start character) and the pointer manipulation (to add that length to your span pointer). > - there can possibly be an encoding attribute which may specify the used > encoding, if known. So is this used to handle the case where you have a legacy encoding (ShiftJIS, say) used in your existing strings, so you flag that 8-bit ("narrow" in a way) string as ShiftJIS? If wide strings are always Unicode, why do you need the encoding? > Admittedly, this is tricky and involves quite a bit of effort to implement, > since all string methods need to have narrow/wide switch. To make it worse, > it hardly offers anything the current solution doesn't. However, it offers > one IMHO _big_ advantage: C code that just passes strings along does not > need to change: wide strings can be seen as narrow strings without any > loss. This allows for __str__() & str() and friends to work with unicode > strings without any change. If you store wide strings as UCS2 then people using the C interface lose: strlen() stops working, or will return incorrect results. Indeed, any of the str*() routines in the C runtime will break. This is the advantage of using UTF-8 here --- you can still use strcpy and the like on the C side and have things work. > Any thoughts? I'm doing essentially what you suggest in my Unicode enablement of MySQL. -tree -- Tom Emerson Basis Technology Corp. 
Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From skip@mojam.com (Skip Montanaro) Wed May 3 21:51:49 2000 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Wed, 3 May 2000 15:51:49 -0500 (CDT) Subject: [Python-Dev] [I18n-sig] Unicode strings: an alternative In-Reply-To: <14608.35257.729641.178724@cymru.basistech.com> References: <14608.35257.729641.178724@cymru.basistech.com> Message-ID: <14608.37223.787291.236623@beluga.mojam.com> Tom> Is it possible to add a logical length field too? I presume it is Tom> too expensive to recalculate the logical (character) length of a Tom> string each time len(s) is called? Doing this is only slightly more Tom> time consuming than a normal strlen: ... Note that currently the len() method doesn't call strlen() at all. It just returns the ob_size field. Presumably, with Just's proposal len() would simply return ob_size/width. If you used a variable width encoding, Just's plan wouldn't work. (I don't know anything about string encodings - is UTF-8 variable width?) From guido@python.org Wed May 3 22:22:59 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 17:22:59 -0400 Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative In-Reply-To: Your message of "Wed, 03 May 2000 20:55:24 BST." References: Message-ID: <200005032122.RAA05150@eric.cnri.reston.va.us> > Today I had a relatively simple idea that unites wide strings and narrow > strings in a way that is more backward comatible at the C level. It's quite > possible this has already been considered and rejected for reasons that are > not yet obvious to me, but I'll give it a shot anyway. > > The main concept is not to provide a new string type but to extend the > existing string object like so: > - wide strings are stored as if they were narrow strings, simply using two > bytes for each Unicode character. > - there's a flag that specifies whether the string is narrow or wide. > - the ob_size field is the _physical_ length of the data; if the string is > wide, len(s) will return ob_size/2, all other string operations will have > to do similar things. > - there can possibly be an encoding attribute which may specify the used > encoding, if known. > > Admittedly, this is tricky and involves quite a bit of effort to implement, > since all string methods need to have narrow/wide switch. To make it worse, > it hardly offers anything the current solution doesn't. However, it offers > one IMHO _big_ advantage: C code that just passes strings along does not > need to change: wide strings can be seen as narrow strings without any > loss. This allows for __str__() & str() and friends to work with unicode > strings without any change. This seems to have some nice properties, but I think it would cause problems for existing C code that tries to *interpret* the bytes of a string: it could very well do the wrong thing for wide strings (since old C code doesn't check for the "wide" flag). I'm not sure how much C code there is that merely passes strings along... Most C code using strings makes use of the strings (e.g. open() falls in this category in my eyes). 
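The len() arithmetic Skip and Tom are discussing above fits in a few lines. A hedged sketch, not library code -- the function names are made up, and the byte values are little-endian two-byte characters purely for illustration:

    def fixed_width_len(buf, width):
        # Just's scheme: ob_size is the physical byte count, so the
        # logical length is one constant-time division, no scan.
        assert len(buf) % width == 0
        return len(buf) // width

    def utf8_len(buf):
        # A variable-width encoding forces an O(n) scan: count lead
        # bytes only, skipping 0b10xxxxxx continuation bytes.
        return sum(1 for b in buf if b & 0xC0 != 0x80)

    assert fixed_width_len(b"A\x00\xac\x20", 2) == 2
    assert utf8_len(u"A\u20ac".encode("utf-8")) == 2   # 4 bytes, 2 chars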
--Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Wed May 3 23:05:39 2000 From: tree@basistech.com (Tom Emerson) Date: Wed, 3 May 2000 18:05:39 -0400 (EDT) Subject: [Python-Dev] [I18n-sig] Unicode strings: an alternative In-Reply-To: <14608.37223.787291.236623@beluga.mojam.com> References: <14608.35257.729641.178724@cymru.basistech.com> <14608.37223.787291.236623@beluga.mojam.com> Message-ID: <14608.41651.781464.747522@cymru.basistech.com> Skip Montanaro writes: > Note that currently the len() method doesn't call strlen() at all. It just > returns the ob_size field. Presumably, with Just's proposal len() would > simply return ob_size/width. If you used a variable width encoding, Just's > plan wouldn't work. (I don't know anything about string encodings - is > UTF-8 variable width?) Yes, technically from 1 - 6 bytes per character, though in practice for Unicode it's 1 - 3. -tree -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From guido@python.org Thu May 4 01:52:39 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 20:52:39 -0400 Subject: [Python-Dev] weird bug in test_winreg Message-ID: <200005040052.UAA07874@eric.cnri.reston.va.us> I just noticed a weird traceback in test_winreg. When I import test.autotest on Windows, I get a "test failed" notice for test_winreg. When I run it by itself the test succeeds. But when I first import test.autotest and then import test.test_winreg (which should rerun the latter, since test.regrtest unloads all test modules after they have run), I get an AttributeError telling me that 'None' object has no attribute 'get'. This is in encodings.__init__.py in the first call to _cache.get() in search_function. Somehow this is called by SetValueEx() in WriteTestData() in test/test_winreg.py. But inspection of the encodings module shows that _cache is {}, not None, and the source shows no evidence of how this could have happened. Any suggestions? --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Thu May 4 01:57:50 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 20:57:50 -0400 Subject: [Python-Dev] weird bug in test_winreg In-Reply-To: Your message of "Wed, 03 May 2000 20:52:39 EDT." <200005040052.UAA07874@eric.cnri.reston.va.us> References: <200005040052.UAA07874@eric.cnri.reston.va.us> Message-ID: <200005040057.UAA07966@eric.cnri.reston.va.us> > I just noticed a weird traceback in test_winreg. When I import > test.autotest on Windows, I get a "test failed" notice for > test_winreg. When I run it by itself the test succeeds. But when I > first import test.autotest and then import test.test_winreg (which > should rerun the latter, since test.regrtest unloads all test modules > after they have run), I get an AttributeError telling me that 'None' > object has no attribute 'get'. This is in encodings.__init__.py in > the first call to _cache.get() in search_function. Somehow this is > called by SetValueEx() in WriteTestData() in test/test_winreg.py. But > inspection of the encodings module shows that _cache is {}, not None, > and the source shows no evidence of how this could have happened. I may have sounded confused: the problem is not caused by the reload(). The test fails the first time around when run by test.autotest. My suspicion is that another test somehow overwrites encodings._cache? 
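One way to see where a None like this can come from: in this era's package import machinery, a failed relative lookup (e.g. "test.os" when test/os.py does not exist) leaves a None placeholder in sys.modules, as explained later in the thread. A small diagnostic sketch, assuming nothing beyond the standard sys module:

    import sys

    # List the None placeholders left behind by failed
    # package-relative imports; clearing entries like these is the
    # kind of thing that can confuse a later import.
    for name in sorted(sys.modules):
        if sys.modules[name] is None:
            print("relative-import placeholder:", name)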
--Guido van Rossum (home page: http://www.python.org/~guido/) From mhammond@skippinet.com.au Thu May 4 02:20:24 2000 From: mhammond@skippinet.com.au (Mark Hammond) Date: Thu, 4 May 2000 11:20:24 +1000 Subject: [Python-Dev] FW: weird bug in test_winreg Message-ID: Oops - I didnt notice the CC - a copy of what I sent to Guido: -----Original Message----- From: Mark Hammond [mailto:mhammond@skippinet.com.au] Sent: Thursday, 4 May 2000 11:13 AM To: Guido van Rossum Subject: RE: weird bug in test_winreg Hah - I was just thinking about this this myself. If I wasnt waiting 24 hours, I would have beaten you to the test_fork1 patch :-) However, there is something bad going on. If you remove your test_fork1 patch, and run it from regrtest (_not_ stand alone) you will see the children threads die with: File "L:\src\Python-cvs\Lib\test\test_fork1.py", line 30, in f alive[id] = os.getpid() AttributeError: 'None' object has no attribute 'getpid' Note the error - os is None! [The reason is only happens as part of the test is because the children are created before the main thread fails with the attribute error] Similarly, I get spurious: Traceback (most recent call last): File ".\test_thread.py", line 103, in task2 mutex.release() AttributeError: 'None' object has no attribute 'release' (Only rarely, and never when run stand-alone - the test_fork1 exception happens 100% of the time from the test suite) And of course the test_winreg one. test_winreg, I guessed, may be caused by the import lock (but its certainly not obvious how or why!?). However, that doesnt explain the others. I also saw these _before_ I applied the threading patches (and after!) So I think the problem may be a little deeper? Mark. From just@letterror.com Thu May 4 08:42:00 2000 From: just@letterror.com (Just van Rossum) Date: Thu, 4 May 2000 08:42:00 +0100 Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative In-Reply-To: <200005032122.RAA05150@eric.cnri.reston.va.us> References: Your message of "Wed, 03 May 2000 20:55:24 BST." Message-ID: (Thanks for all the comments. I'll condense my replies into one post.) [JvR] > - wide strings are stored as if they were narrow strings, simply using two > bytes for each Unicode character. [Tom Emerson wrote] >I disagree with you here... store them as UTF-8. Erm, utf-8 in a wide string? This makes no sense... [Skip Montanaro] >Presumably, with Just's proposal len() would >simply return ob_size/width. Right. And if you would allow values for width other than 1 and 2, it opens the way for UCS-4. Wouldn't that be nice? It's hardly more effort, and "only" width==1 needs to be special-cased for speed. >If you used a variable width encoding, Just's plan wouldn't work. Correct, but nor does the current unicode object. Variable width encodings are too messy to see as strings at all: they are only useful as byte arrays. [GvR] >This seems to have some nice properties, but I think it would cause >problems for existing C code that tries to *interpret* the bytes of a >string: it could very well do the wrong thing for wide strings (since >old C code doesn't check for the "wide" flag). I'm not sure how much >C code there is that merely passes strings along... Most C code using >strings makes use of the strings (e.g. open() falls in this category >in my eyes). There are probably many cases that fall into this category. But then again, these cases, especially those that potentially can deal with other encodings than ascii, are not much helped by a default encoding, as /F showed. 
My idea arose after yesterday's discussions. Some quotes, plus comments: [GvR] >However the problem is that print *always* first converts the object >using str(), and str() enforces that the result is an 8-bit string. >I'm afraid that loosening this will break too much code. (This all >really happens at the C level.) Guido goes on to explain that this means utf-8 is the only sensible default in this case. Good reasoning, but I think it's backwards: - str(unicodestring) should just return unicodestring - it is important that stdout receives the original unicode object. [MAL] >BTW, __str__() has to return strings too. Perhaps we >need __unicode__() and a corresponding slot function too ?! This also seems backwards. If it's really too hard to change Python so that __str__ can return unicode objects, my solution may help. [Ka-Ping Yee] >Here is an addendum that might actually make that proposal >feasible enough (compatibility-wise) to fly in the short term: > > print x > >does, conceptually: > > try: > sys.stdout.printout(x) > except AttributeError: > sys.stdout.write(str(x)) > sys.stdout.write("\n") That stuff like this is even being *proposed* (not that it's not smart or anything...) means there's a terrible bottleneck somewhere which needs fixing. My proposal seems to do that nicely. Of course, there's no such thing as a free lunch, and I'm sure there are other corners that'll need fixing, but it appears having to write if (!PyString_Check(doc) && !PyUnicode_Check(doc)) ... in all places that may accept unicode strings is no fun either. Yes, some code will break if you throw a wide string at it, but I think that code is easier repaired with my proposal than with the current implementation. It's a big advantage to have only one string type; it makes many problems we've been discussing easier to talk about. Just From Fredrik Lundh Message-ID: <002d01bfb59c$cf482280$34aab5d4@hagrid> Ka-Ping Yee wrote: > I know this is not a small change, but i'm pretty convinced the > right answer here is that the print hook should call a *method* > on sys.stdout, whatever sys.stdout happens to be. The details > are described in the other long message i wrote ("Printing objects > on files"). > > Here is an addendum that might actually make that proposal > feasible enough (compatibility-wise) to fly in the short term: > > print x > > does, conceptually: > > try: > sys.stdout.printout(x) > except AttributeError: > sys.stdout.write(str(x)) > sys.stdout.write("\n") > > The rest can then be added, and the change in 'print x' will > work nicely for any file objects, but will not break on file-like > substitutes that don't define a 'printout' method. another approach is (simplified): try: sys.stdout.write(x.encode(sys.stdout.encoding)) except AttributeError: sys.stdout.write(str(x)) or, if str is changed to return any kind of string: x = str(x) try: x = x.encode(sys.stdout.encoding) except AttributeError: pass sys.stdout.write(x)
Thompson) Date: 04 May 2000 09:51:39 +0100 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Guido van Rossum's message of "Wed, 03 May 2000 08:16:56 -0400" References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: Guido van Rossum writes: > My ASCII proposal is a compromise that tries to be fair to both uses > for strings. Introducing byte arrays as a more fundamental type has > been on the wish list for a long time -- I see no way to introduce > this into Python 1.6 without totally botching the release schedule > (June 1st is very close already!). I'd like to be able to move on, > there are other important things still to be added to 1.6 (Vladimir's > malloc patches, Neil's GC, Fredrik's completed sre...). > > For 1.7 (which should happen later this year) I promise I'll reopen > the discussion on byte arrays. I think I hear a moderate consensus developing that the 'ASCII proposal' is a reasonable compromise given the time constraints. But let's not fail to come back to this ASAP -- it _really_ narcs me that every time I load XML into my Python-based editor I'm going to convert large amounts of wide-string data into UTF-8 just so Tk can convert it back to wide-strings in order to display it! ht -- Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From just@letterror.com Thu May 4 12:27:45 2000 From: just@letterror.com (Just van Rossum) Date: Thu, 4 May 2000 12:27:45 +0100 Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative In-Reply-To: References: <200005032122.RAA05150@eric.cnri.reston.va.us> Your message of "Wed, 03 May 2000 20:55:24 BST." Message-ID: I wrote: >It's a big advantage to have only one string type; it makes many problems >we've been discussing easier to talk about. I think I should've been more explicit about what I meant here. I'll try to phrase it as an addendum to my proposal -- which suddenly is no longer just a narrow/wide string unification but narrow/wide/ultrawide, to really be ready for the future... As someone else suggested in the discussion, I think it's good if we separate the encoding from the data type. Meaning that wide strings are no longer tied to Unicode. This allows for double-byte encodings other than UCS-2 as well as for safe passing-through of binary goop, but that's not the main point. The main point is that this will make the behavior of (wide) strings more understandable and consistent. The extended string type is simply a sequence of code points, allowing for 0-0xFF for narrow strings, 0-0xFFFF for wide strings, and 0-0xFFFFFFFF for ultra-wide strings. Upcasting is always safe, downcasting may raise OverflowError. Depending on the used encoding, this comes as close as possible to the sequence-of-characters model. The default character set should of course be Unicode -- and it should be obvious that this implies Latin-1 for narrow strings. 
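The promotion rule in the preceding paragraph can be stated in a few lines of Python. A sketch under the addendum's assumptions -- the helper name and the list-of-code-points representation are inventions for illustration:

    LIMIT = {1: 0xFF, 2: 0xFFFF, 4: 0xFFFFFFFF}

    def convert_width(codes, new_width):
        # Upcasting is always safe; downcasting must check every
        # code point and raise OverflowError, per the proposal.
        limit = LIMIT[new_width]
        for c in codes:
            if c > limit:
                raise OverflowError("code point 0x%X exceeds width %d" % (c, new_width))
        return list(codes)

    wide = [0x41, 0x20AC]                  # fits in two bytes per character
    assert convert_width(wide, 4) == wide  # upcast: always succeeds
    try:
        convert_width(wide, 1)             # downcast: 0x20AC > 0xFF
    except OverflowError:
        pass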
(Additionally: an encoding attribute suddenly makes a whole lot of sense again.) Ok, y'all can shoot me now ;-) Just From guido@python.org Thu May 4 13:40:35 2000 From: guido@python.org (Guido van Rossum) Date: Thu, 04 May 2000 08:40:35 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "04 May 2000 09:51:39 BST." References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: <200005041240.IAA08277@eric.cnri.reston.va.us> > I think I hear a moderate consensus developing that the 'ASCII > proposal' is a reasonable compromise given the time constraints. But > let's not fail to come back to this ASAP -- it _really_ narcs me that > every time I load XML into my Python-based editor I'm going to convert > large amounts of wide-string data into UTF-8 just so Tk can convert it > back to wide-strings in order to display it! Thanks -- but that's really Tcl's fault, since the only way to get character data *into* Tcl (or out of it) is through the UTF-8 encoding. And is your XML really stored on disk in its 16-bit format? --Guido van Rossum (home page: http://www.python.org/~guido/) From fredrik@pythonware.com Thu May 4 14:21:25 2000 From: fredrik@pythonware.com (Fredrik Lundh) Date: Thu, 4 May 2000 15:21:25 +0200 Subject: [Python-Dev] Re: Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> <200005041240.IAA08277@eric.cnri.reston.va.us> Message-ID: <00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com> Guido van Rossum wrote: > Thanks -- but that's really Tcl's fault, since the only way to get > character data *into* Tcl (or out of it) is through the UTF-8 > encoding. from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars) Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new object or modify an existing object to hold a copy of the Unicode string given by unicode and numChars. (Tcl_UniChar* is currently the same thing as Py_UNICODE*) From guido@python.org Thu May 4 18:03:58 2000 From: guido@python.org (Guido van Rossum) Date: Thu, 04 May 2000 13:03:58 -0400 Subject: [Python-Dev] FW: weird bug in test_winreg In-Reply-To: Your message of "Thu, 04 May 2000 11:20:24 +1000." References: Message-ID: <200005041703.NAA13471@eric.cnri.reston.va.us> Mark Hammond: > However, there is something bad going on. If you remove your test_fork1 > patch, and run it from regrtest (_not_ stand alone) you will see the > children threads die with: > > File "L:\src\Python-cvs\Lib\test\test_fork1.py", line 30, in f > alive[id] = os.getpid() > AttributeError: 'None' object has no attribute 'getpid' > > Note the error - os is None! 
> > [The reason is only happens as part of the test is because the children are > created before the main thread fails with the attribute error] I don't get this one -- maybe my machine is too slow. (130 MHz Pentium.) > Similarly, I get spurious: > > Traceback (most recent call last): > File ".\test_thread.py", line 103, in task2 > mutex.release() > AttributeError: 'None' object has no attribute 'release' > > (Only rarely, and never when run stand-alone - the test_fork1 exception > happens 100% of the time from the test suite) > > And of course the test_winreg one. > > test_winreg, I guessed, may be caused by the import lock (but its certainly > not obvious how or why!?). However, that doesnt explain the others. > > I also saw these _before_ I applied the threading patches (and after!) > > So I think the problem may be a little deeper? It's Vladimir's patch which, after each tests, unloads all modules that were loaded by that test. If I change this to only unload modules whose name starts with "test.", the test_winreg problem goes away, and I bet yours go away too. The real reason must be deeper -- there's also the import lock and the fact that if a submodule of package "test" tries to import "os", a search for "test.os" is made and if it doesn't find it it sticks None in sys.modules['test.os']. but I don't have time to research this further. I'm tempted to apply the following change to regrtest.py. This should still unload the test modules (so you can rerun an individual test) but it doesn't touch other modules. I'll wait 24 hours. :-) *** regrtest.py 2000/04/21 21:35:06 1.15 --- regrtest.py 2000/05/04 16:56:26 *************** *** 121,127 **** skipped.append(test) # Unload the newly imported modules (best effort finalization) for module in sys.modules.keys(): ! if module not in save_modules: test_support.unload(module) if good and not quiet: if not bad and not skipped and len(good) > 1: --- 121,127 ---- skipped.append(test) # Unload the newly imported modules (best effort finalization) for module in sys.modules.keys(): ! if module not in save_modules and module.startswith("test."): test_support.unload(module) if good and not quiet: if not bad and not skipped and len(good) > 1: --Guido van Rossum (home page: http://www.python.org/~guido/) From gvwilson@nevex.com Thu May 4 20:03:54 2000 From: gvwilson@nevex.com (gvwilson@nevex.com) Date: Thu, 4 May 2000 15:03:54 -0400 (EDT) Subject: [Python-Dev] Minimal (single-file) Python? Message-ID: Hi. Has anyone ever built, or thought about building, a single-file Python, in which all the "basic" capabilities are included in a single executable (where "basic" means "can do as much as the Bourne shell")? Some of the entries in the Software Carpentry competition would like to be able to bootstrap from as small a starting point as possible. Thanks, Greg p.s. I don't think this is the same problem as moving built-in features of Python into optionally-loaded libraries, as some of the things in the 'sys', 'string', and 'os' modules would have to move in the other direction to ensure Bourne shell equivalence. From just@letterror.com Thu May 4 22:22:38 2000 From: just@letterror.com (Just van Rossum) Date: Thu, 4 May 2000 22:22:38 +0100 Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative Message-ID: (Boy, is it quiet here all of a sudden ;-) Sorry for the duplication of stuff, but I'd like to reiterate my points, to separate them from my implementation proposal, as that's just what it is: an implementation detail. 
These things are important to me: - get rid of the Unicode-ness of wide strings, in order to - make narrow and wide strings as similar as possible - implicit conversion between narrow and wide strings should happen purely on the basis of the character codes; no assumption at all should be made about the encoding, ie. what the character code _means_. - downcasting from wide to narrow may raise OverflowError if there are characters in the wide string that are > 255 - str(s) should always return s if s is a string, whether narrow or wide - file objects need to be responsible for handling wide strings - the above two points should make it possible for - if no encoding is known, Unicode is the default, whether narrow or wide The above points seem to have the following consequences: - the 'u' in \uXXXX notation no longer makes much sense, since it is not necessary for the character to be a Unicode code point: it's just a 2-byte int. \wXXXX might be an option. - the u"" notation is no longer necessary: if a string literal contains a character > 255 the string should automatically become a wide string. - narrow strings should also have an encode() method. - the builtin unicode() function might be redundant if: - it is possible to specify a source encoding. I'm not sure if this is best done through an extra argument for encode() or that it should be a new method, eg. transcode(). - s.encode() or s.transcode() are allowed to output a wide string, as in aNarrowString.encode("UCS-2") and s.transcode("Mac-Roman", "UCS-2"). My proposal to extend the "old" string type to be able to contain wide strings is of course largely unrelated to all this. Yet it may provide some additional C compatibility (especially now that silent conversion to utf-8 is out) as well as a workaround for the str()-having-to-return-a-narrow-string bottleneck. Just From skip@mojam.com (Skip Montanaro) Thu May 4 21:43:42 2000 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Thu, 4 May 2000 15:43:42 -0500 (CDT) Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative In-Reply-To: References: Message-ID: <14609.57598.738381.250872@beluga.mojam.com> Just> Sorry for the duplication of stuff, but I'd like to reiterate my Just> points, to separate them from my implementation proposal, as Just> that's just what it is: an implementation detail. Just> These things are important to me: ... For the encoding-challenged like me, does it make sense to explicitly state that you can't mix character widths within a single string, or is that just so obvious that I deserve a head slap just for mentioning it? -- Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/ "We have become ... the stewards of life's continuity on earth. We did not ask for this role... We may not be suited to it, but here we are." - Stephen Jay Gould From Fredrik Lundh References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: <007701bfb60c$1543f060$34aab5d4@hagrid> Henry S. Thompson wrote: > I think I hear a moderate consensus developing that the 'ASCII > proposal' is a reasonable compromise given the time constraints. agreed.
(but even if we settle for "7-bit unicode" in 1.6, there are still a few issues left to sort out before 1.6 final. but it might be best to get back to that after we've added SRE and GC to 1.6a3. we might all need a short break...) > But let's not fail to come back to this ASAP first week in june, promise ;-) From mhammond@skippinet.com.au Fri May 5 00:55:15 2000 From: mhammond@skippinet.com.au (Mark Hammond) Date: Fri, 5 May 2000 09:55:15 +1000 Subject: [Python-Dev] FW: weird bug in test_winreg In-Reply-To: <200005041703.NAA13471@eric.cnri.reston.va.us> Message-ID: > It's Vladimir's patch which, after each tests, unloads all modules > that were loaded by that test. If I change this to only unload > modules whose name starts with "test.", the test_winreg problem goes > away, and I bet yours go away too. They do indeed! > The real reason must be deeper -- there's also the import lock and the > fact that if a submodule of package "test" tries to import "os", a > search for "test.os" is made and if it doesn't find it it sticks None > in sys.modules['test.os']. > > but I don't have time to research this further. I started to think about this. The issue is simply that code which blithely wipes sys.modules[] may cause unexpected results. While the end result is a bug, the symptoms are caused by extreme hackiness. Seeing as my time is also limited, I say we forget it! > I'm tempted to apply the following change to regrtest.py. This should > still unload the test modules (so you can rerun an individual test) > but it doesn't touch other modules. I'll wait 24 hours. :-) The 24 hour time limit is only supposed to apply to _my_ patches - you can check yours straight in (and if anyone asks, just tell them I said it was OK) :-) Mark. From ht@cogsci.ed.ac.uk Fri May 5 09:19:07 2000 From: ht@cogsci.ed.ac.uk (Henry S. Thompson) Date: 05 May 2000 09:19:07 +0100 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Guido van Rossum's message of "Thu, 04 May 2000 08:40:35 -0400" References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> <200005041240.IAA08277@eric.cnri.reston.va.us> Message-ID: Guido van Rossum writes: > > I think I hear a moderate consensus developing that the 'ASCII > > proposal' is a reasonable compromise given the time constraints. But > > let's not fail to come back to this ASAP -- it _really_ narcs me that > > every time I load XML into my Python-based editor I'm going to convert > > large amounts of wide-string data into UTF-8 just so Tk can convert it > > back to wide-strings in order to display it! > > Thanks -- but that's really Tcl's fault, since the only way to get > character data *into* Tcl (or out of it) is through the UTF-8 > encoding. > > And is your XML really stored on disk in its 16-bit format? No, I have no idea what encoding it's in, my XML parser supports over a dozen encodings, and quite sensibly always delivers the content, as per the XML REC, as wide-strings. ht -- Henry S. 
Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From ht@cogsci.ed.ac.uk Fri May 5 09:21:41 2000 From: ht@cogsci.ed.ac.uk (Henry S. Thompson) Date: 05 May 2000 09:21:41 +0100 Subject: [Python-Dev] Re: [XML-SIG] Re: Unicode debate In-Reply-To: "Fredrik Lundh"'s message of "Thu, 4 May 2000 15:21:25 +0200" References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> <200005041240.IAA08277@eric.cnri.reston.va.us> <00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com> Message-ID: "Fredrik Lundh" writes: > Guido van Rossum wrote: > > Thanks -- but that's really Tcl's fault, since the only way to get > > character data *into* Tcl (or out of it) is through the UTF-8 > > encoding. > > from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm > > Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars) > > Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new > object or modify an existing object to hold a copy of the > Unicode string given by unicode and numChars. > > (Tcl_UniChar* is currently the same thing as Py_UNICODE*) > Any way this can be exploited in Tkinter? ht -- Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From just@letterror.com Fri May 5 10:25:37 2000 From: just@letterror.com (Just van Rossum) Date: Fri, 5 May 2000 10:25:37 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <007701bfb60c$1543f060$34aab5d4@hagrid> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: At 11:02 PM +0200 04-05-2000, Fredrik Lundh wrote: >Henry S. Thompson wrote: >> I think I hear a moderate consensus developing that the 'ASCII >> proposal' is a reasonable compromise given the time constraints. > >agreed. This makes no sense: implementing the 7-bit proposal takes more or less the same time as implementing 8-bit downcasting. Or is it just the bickering that's too time consuming? ;-) I worry that if the current implementation goes into 1.6 more or less as it is now there's no way we can ever go back (before P3K). Or will Unicode support be marked "experimental" in 1.6? This is not so much about the 7-bit/8-bit proposal but about the dubious unicode() and unichr() functions and the u"" notation: - unicode() only takes strings, so is effectively a method of the string type.
- if narrow and wide strings are meant to be as similar as possible, chr(256) should just return a wide char - similarly, why is the u"" notation at all needed? The current design is more complex than needed, and still offers plenty of surprises. Making it simpler (without integrating the two string types) is not a huge effort. Seeing the wide string type as independent of Unicode takes no physical effort at all, as it's just in our heads. Fixing str() so it can return wide strings might be harder, and can wait until later. Would be too bad, though. Just From ping@lfw.org Fri May 5 10:21:20 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Fri, 5 May 2000 02:21:20 -0700 (PDT) Subject: [Python-Dev] Unicode debate In-Reply-To: <002d01bfb59c$cf482280$34aab5d4@hagrid> Message-ID: On Thu, 4 May 2000, Fredrik Lundh wrote: > > another approach is (simplified): > > try: > sys.stdout.write(x.encode(sys.stdout.encoding)) > except AttributeError: > sys.stdout.write(str(x)) Indeed, that would work to solve just this specific Unicode issue -- but there is a lot of flexibility and power to be gained from the general solution of putting a method on the stream object, as the example with the formatted list items showed. I think it is a good idea, for instance, to leave decisions about how to print Unicode up to the Unicode object, and not hardcode bits of it into print. Guido, have you digested my earlier 'printout' suggestions? -- ?!ng "Old code doesn't die -- it just smells that way." -- Bill Frantz From tdickenson@geminidataloggers.com Fri May 5 10:07:46 2000 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Fri, 05 May 2000 10:07:46 +0100 Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative In-Reply-To: References: Message-ID: On Thu, 4 May 2000 22:22:38 +0100, Just van Rossum wrote: >(Boy, is it quiet here all of a sudden ;-) > >Sorry for the duplication of stuff, but I'd like to reiterate my points, to >separate them from my implementation proposal, as that's just what it is: >an implementation detail. > >These things are important to me: >- get rid of the Unicode-ness of wide strings, in order to >- make narrow and wide strings as similar as possible >- implicit conversion between narrow and wide strings should > happen purely on the basis of the character codes; no > assumption at all should be made about the encoding, ie. > what the character code _means_. >- downcasting from wide to narrow may raise OverflowError if > there are characters in the wide string that are > 255 >- str(s) should always return s if s is a string, whether narrow > or wide >- file objects need to be responsible for handling wide strings >- the above two points should make it possible for >- if no encoding is known, Unicode is the default, whether > narrow or wide > >The above points seem to have the following consequences: >- the 'u' in \uXXXX notation no longer makes much sense, > since it is not necessary for the character to be a Unicode > code point: it's just a 2-byte int. \wXXXX might be an option. >- the u"" notation is no longer necessary: if a string literal > contains a character > 255 the string should automatically > become a wide string. >- narrow strings should also have an encode() method. >- the builtin unicode() function might be redundant if: > - it is possible to specify a source encoding. I'm not sure if > this is best done through an extra argument for encode() > or that it should be a new method, eg. transcode().
> - s.encode() or s.transcode() are allowed to output a wide > string, as in aNarrowString.encode("UCS-2") and > s.transcode("Mac-Roman", "UCS-2"). One other pleasant consequence: - String comparisons work character by character, even if the representations of those characters have different widths. >My proposal to extend the "old" string type to be able to contain wide >strings is of course largely unrelated to all this. Yet it may provide some >additional C compatibility (especially now that silent conversion to utf-8 >is out) as well as a workaround for the >str()-having-to-return-a-narrow-string bottleneck. Toby Dickenson tdickenson@geminidataloggers.com From just@letterror.com Fri May 5 12:40:49 2000 From: just@letterror.com (Just van Rossum) Date: Fri, 5 May 2000 12:40:49 +0100 Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative In-Reply-To: References: Message-ID: At 10:07 AM +0100 05-05-2000, Toby Dickenson wrote: >One other pleasant consequence: > >- String comparisons work character by character, even if the > representations of those characters have different widths. Exactly. By saying "(wide) strings are not tied to Unicode" the question whether wide strings should or should not be sorted according to the Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's too hard anyway"... Just From tree@basistech.com Fri May 5 12:46:41 2000 From: tree@basistech.com (Tom Emerson) Date: Fri, 5 May 2000 07:46:41 -0400 (EDT) Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative In-Reply-To: References: Message-ID: <14610.46241.129977.642796@cymru.basistech.com> Just van Rossum writes: > At 10:07 AM +0100 05-05-2000, Toby Dickenson wrote: > >One other pleasant consequence: > > > >- String comparisons work character by character, even if the > > representations of those characters have different widths. > > Exactly. By saying "(wide) strings are not tied to Unicode" the question > whether wide strings should or should not be sorted according to the > Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's > too hard anyway"... Wait a second. There is nothing about Unicode that would prevent you from defining string equality as byte-level equality. This strikes me as the wrong way to deal with the complex collation issues of Unicode. It seems to me that by default wide-strings compare at the byte-level (i.e., '=' is a byte level comparison). If you want a normalized comparison, then you make an explicit function call for that. This is no different from comparing strings in a case sensitive vs. case insensitive manner. -tree -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From just@letterror.com Fri May 5 14:17:31 2000 From: just@letterror.com (Just van Rossum) Date: Fri, 5 May 2000 14:17:31 +0100 Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative In-Reply-To: <14610.46241.129977.642796@cymru.basistech.com> References: Message-ID: [Me] > Exactly. By saying "(wide) strings are not tied to Unicode" the question > whether wide strings should or should not be sorted according to the > Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's > too hard anyway"... [Tom Emerson] >Wait a second. > >There is nothing about Unicode that would prevent you from defining >string equality as byte-level equality. Agreed.
>This strikes me as the wrong way to deal with the complex collation >issues of Unicode. All I was trying to say was that by looking at it this way, it is even more obvious that the builtin comparison should not deal with Unicode sorting & collation issues. It seems you're saying the exact same thing: >It seems to me that by default wide-strings compare at the byte-level >(i.e., '=' is a byte level comparison). If you want a normalized >comparison, then you make an explicit function call for that. Exactly. >This is no different from comparing strings in a case sensitive >vs. case insensitive manner. Good point. All this taken together still means to me that comparisons between wide and narrow strings should take place at the character level, which implies that coercion from narrow to wide is done at the character level, without looking at the encoding. (Which in my book in turn still implies that as long as we're talking about Unicode, narrow strings are effectively Latin-1.) Just From tree@basistech.com Fri May 5 13:34:35 2000 From: tree@basistech.com (Tom Emerson) Date: Fri, 5 May 2000 08:34:35 -0400 (EDT) Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative In-Reply-To: References: Message-ID: <14610.49115.820599.172598@cymru.basistech.com> Just van Rossum writes: > Good point. All this taken together still means to me that comparisons > between wide and narrow strings should take place at the character level, > which implies that coercion from narrow to wide is done at the character > level, without looking at the encoding. (Which in my book in turn still > implies that as long as we're talking about Unicode, narrow strings are > effectively Latin-1.) Only true if "wide" strings are encoded in UCS-2 or UCS-4. If "wide characters" are Unicode, but stored in UTF-8 encoding, then you lose. Hmmmm... how often do you expect to compare narrow vs. wide strings, using default comparison (i.e. = or !=)? What if I'm using Latin 3 and use the byte comparison? I may very well have two strings (one narrow, one wide) that compare equal, even though they're not. Not exactly what I would expect. -tree [I'm flying from Seattle to Boston today, so eventually I will disappear for a while] -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From pf@artcom-gmbh.de Fri May 5 14:13:05 2000 From: pf@artcom-gmbh.de (Peter Funk) Date: Fri, 5 May 2000 15:13:05 +0200 (MEST) Subject: [Python-Dev] wide strings vs. Unicode point of view (was Re: [I18n-sig] Unicode st.... alternative) In-Reply-To: from Just van Rossum at "May 5, 2000 12:40:49 pm" Message-ID: Just van Rossum: > Exactly. By saying "(wide) strings are not tied to Unicode" the question > whether wide strings should or should not be sorted according to the > Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's > too hard anyway"... I personally like the idea of speaking of "wide strings" containing wide character codes instead of Unicode objects. Unfortunately there are many methods which need to interpret the content of strings according to some encoding knowledge: for example 'upper()', 'lower()', 'swapcase()', 'lstrip()' and so on need to know to which class certain characters belong. This problem was already somewhat visible in 1.5.2, since these methods were available as library functions from the string module and they did work with a global state maintained by the 'setlocale()' C-library function.
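For instance -- a sketch of that 1.5.2-era behaviour; the "de_DE" locale name is an assumption, and the exact result depends on the platform's C library:

    >>> import string, locale
    >>> string.upper("\344")                      # Latin-1 'ä' in the default "C" locale
    '\344'
    >>> locale.setlocale(locale.LC_ALL, "de_DE")  # assuming such a locale is installed
    'de_DE'
    >>> string.upper("\344")                      # now the C library's tables kick in
    '\304'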
Quoting from the C library man pages: """ The details of what constitutes an uppercase or lowercase letter depend on the current locale. For example, the default "C" locale does not know about umlauts, so no conversion is done for them. In some non-English locales, there are lowercase letters with no corresponding uppercase equivalent; the German sharp s is one example. """ I guess applying 'upper' to a Chinese char will not make much sense. Now these former string module functions were moved into the Python object core. So the current Python string and Unicode object API is somewhat "western centric". ;-) At least Marc's implementation in 'unicodectype.c' contains the hard-coded assumption that wide strings really contain Unicode characters. print u"äöü".upper().encode("latin1") shows "ÄÖÜ" independent of the locale setting. This makes sense. The output from print u"äöü".upper().encode() however looks ugly here on my screen... UTF-8 ... blech: Ã„Ã–Ãœ Regards and have a nice weekend, Peter -- Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260 office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen) From guido@python.org Fri May 5 15:49:52 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 05 May 2000 10:49:52 -0400 Subject: [Python-Dev] Unicode debate In-Reply-To: Your message of "Fri, 05 May 2000 02:21:20 PDT." References: Message-ID: <200005051449.KAA14138@eric.cnri.reston.va.us> > Guido, have you digested my earlier 'printout' suggestions? Not quite, except to the point that they require more thought than to rush them into 1.6. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Fri May 5 15:54:16 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 05 May 2000 10:54:16 -0400 Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative In-Reply-To: Your message of "Thu, 04 May 2000 22:22:38 BST." References: Message-ID: <200005051454.KAA14168@eric.cnri.reston.va.us> > (Boy, is it quiet here all of a sudden ;-) Maybe because (according to one report on NPR here) 80% of the world's email systems are victimized by the ILOVEYOU virus? You & I are not affected because it's Windows specific (a visual basic script, I got a copy mailed to me so I could have a good look :-). Note that there are already mutations, one of which pretends to be a joke. > Sorry for the duplication of stuff, but I'd like to reiterate my points, to > separate them from my implementation proposal, as that's just what it is: > an implementation detail. > > These things are important to me: > - get rid of the Unicode-ness of wide strings, in order to > - make narrow and wide strings as similar as possible > - implicit conversion between narrow and wide strings should > happen purely on the basis of the character codes; no > assumption at all should be made about the encoding, ie. > what the character code _means_. > - downcasting from wide to narrow may raise OverflowError if > there are characters in the wide string that are > 255 > - str(s) should always return s if s is a string, whether narrow > or wide > - file objects need to be responsible for handling wide strings > - the above two points should make it possible for > - if no encoding is known, Unicode is the default, whether > narrow or wide > > The above points seem to have the following consequences: > - the 'u' in \uXXXX notation no longer makes much sense, > since it is not necessary for the character to be a Unicode > code point: it's just a 2-byte int.
\wXXXX might be an option. > - the u"" notation is no longer necessary: if a string literal > contains a character > 255 the string should automatically > become a wide string. > - narrow strings should also have an encode() method. > - the builtin unicode() function might be redundant if: > - it is possible to specify a source encoding. I'm not sure if > this is best done through an extra argument for encode() > or that it should be a new method, eg. transcode(). > - s.encode() or s.transcode() are allowed to output a wide > string, as in aNarrowString.encode("UCS-2") and > s.transcode("Mac-Roman", "UCS-2"). > > My proposal to extend the "old" string type to be able to contain wide > strings is of course largely unrelated to all this. Yet it may provide some > additional C compatibility (especially now that silent conversion to utf-8 > is out) as well as a workaround for the > str()-having-to-return-a-narrow-string bottleneck. I'm not so sure that this is enough. You seem to propose wide strings as vehicles for 16-bit values (and maybe later 32-bit values) apart from their encoding. We already have a data type for that (the array module). The Unicode type does a lot more than storing 16-bit values: it knows lots of encodings to and from Unicode, and it knows things like which characters are upper or lower or title case and how to map between them, which characters are word characters, and so on. All this is highly Unicode specific and is part of what people ask for when they request Unicode support. (Example: Unicode has 405 characters classified as numeric, according to the isnumeric() method.) And by the way, don't worry about the comparison. I'm not changing the default comparison (==, cmp()) for Unicode strings to be anything other than per 16-bit quantity. However a Unicode object might in addition have a method to do normalization or whatever, as long as it's language independent and strictly defined by the Unicode standard. Language-specific operations belong in separate modules. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Fri May 5 16:07:48 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 05 May 2000 11:07:48 -0400 Subject: [Python-Dev] Moving Unicode debate to i18n-sig@python.org Message-ID: <200005051507.LAA14262@eric.cnri.reston.va.us> I've moved all my responses to the Unicode debate to the i18n-sig mailing list, where it belongs. Please don't cross-post any more. If you're interested in this issue but aren't subscribed to the i18n-sig list, please subscribe at http://www.python.org/mailman/listinfo/i18n-sig/. To view the archives, go to http://www.python.org/pipermail/i18n-sig/. See you there! --Guido van Rossum (home page: http://www.python.org/~guido/) From jim@digicool.com Fri May 5 18:09:34 2000 From: jim@digicool.com (Jim Fulton) Date: Fri, 05 May 2000 13:09:34 -0400 Subject: [Python-Dev] Pickle diffs anyone? Message-ID: <3913004E.6CC69857@digicool.com> Someone recently made a cool proposal for utilizing diffs to save space taken by old versions in the Zope object database: http://www.zope.org/Members/jim/ZODB/ReverseDiffVersioning To make this work, we need a good way of diffing pickles. I thought maybe someone here would have some good suggestions. I do think that the topic is sort of interesting (for some definition of "interesting" ;). The page above is a Wiki page. (Wiki is awesome. If you haven't seen it before, check out http://joyful.com/zwiki/ZWiki.)
If you are a member of zope.org, you can edit the page directly, which would be fine with me. :) Jim -- Jim Fulton mailto:jim@digicool.com Python Powered! Technical Director (888) 344-4332 http://www.python.org Digital Creations http://www.digicool.com http://www.zope.org Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email address may not be added to any commercial mail list without my permission. Violation of my privacy with advertising or SPAM will result in a suit for a MINIMUM of $500 damages/incident, $1500 for repeats. From fdrake@acm.org Fri May 5 18:14:16 2000 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Fri, 5 May 2000 13:14:16 -0400 (EDT) Subject: [Python-Dev] Pickle diffs anyone? In-Reply-To: <3913004E.6CC69857@digicool.com> References: <3913004E.6CC69857@digicool.com> Message-ID: <14611.360.166536.866583@seahag.cnri.reston.va.us> Jim Fulton writes: > To make this work, we need a good way of diffing pickles. Jim, If the basic requirement is for a binary diff facility, perhaps you should look into XDelta; I think that's available as a C library as well as a command line tool, so you should be able to hook it in fairly easily. -Fred -- Fred L. Drake, Jr. Corporation for National Research Initiatives From trentm@activestate.com Fri May 5 18:25:48 2000 From: trentm@activestate.com (Trent Mick) Date: Fri, 5 May 2000 10:25:48 -0700 Subject: [Python-Dev] issues with int/long on 64bit platforms - eg stringobject (PR#306) In-Reply-To: <000001bfb336$d4f512a0$0f2d153f@tim> References: <000001bfb336$d4f512a0$0f2d153f@tim> Message-ID: <20000505102548.B25914@activestate.com> I posted a couple of patches a couple of days ago to correct the string methods implementing slice-like optional parameters (count, find, index, rfind, rindex) to properly clamp slice index values to the proper range (any PyInt or PyLong value is acceptable now). In fact the slice_index() function that was being used in ceval.c was reused (renamed to _PyEval_SliceIndex). As well, the other patch changes PyArg_ParseTuple's 'b', 'h', and 'i' formatters to raise an OverflowError if they overflow. Trent p.s. I thought I would whine here for some more attention. Who needs that Unicode stuff anyway. ;-) From fw@deneb.cygnus.argh.org Fri May 5 17:13:42 2000 From: fw@deneb.cygnus.argh.org (Florian Weimer) Date: 05 May 2000 18:13:42 +0200 Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative In-Reply-To: Just van Rossum's message of "Fri, 5 May 2000 14:17:31 +0100" References: Message-ID: <8766st5615.fsf@deneb.cygnus.argh.org> Just van Rossum writes: > Good point. All this taken together still means to me that comparisons > between wide and narrow strings should take place at the character level, > which implies that coercion from narrow to wide is done at the character > level, without looking at the encoding. (Which in my book in turn still > implies that as long as we're talking about Unicode, narrow strings are > effectively Latin-1.) Sorry for jumping in, I've only recently discovered this list. :-/ At the moment, most of the computing world is not Latin-1 but Windows-12??. That's why I don't think this is a good idea at all. From skip@mojam.com (Skip Montanaro) Fri May 5 20:10:24 2000 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Fri, 5 May 2000 14:10:24 -0500 (CDT) Subject: [Python-Dev] Pickle diffs anyone?
In-Reply-To: <3913004E.6CC69857@digicool.com> References: <3913004E.6CC69857@digicool.com> Message-ID: <14611.7328.869011.109768@beluga.mojam.com> Jim> Someone recently made a cool proposal for utilizing diffs to save Jim> space taken by old versions in the Zope object database: Jim> http://www.zope.org/Members/jim/ZODB/ReverseDiffVersioning Jim> To make this work, we need a good way of diffing pickles. Fred already mentioned a candidate library to do diffs. If that works, the only other thing I think you'd need to do is guarantee that dicts are pickled in a consistent fashion, probably by sorting the keys before enumerating them. -- Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/ "We have become ... the stewards of life's continuity on earth. We did not ask for this role... We may not be suited to it, but here we are." - Stephen Jay Gould From trentm@activestate.com Fri May 5 22:34:48 2000 From: trentm@activestate.com (Trent Mick) Date: Fri, 5 May 2000 14:34:48 -0700 Subject: [Python-Dev] should a float overflow or just equal 'inf' Message-ID: <20000505143448.A10731@activestate.com> Hi all, I submitted a patch a couple of days ago to have the 'b', 'i', and 'h' formatters for PyArg_ParseTuple raise an Overflow exception if they overflow (currently they just silently overflow). Presuming that this is considered a good idea, should this be carried over to floats? Floats don't really overflow, they just equal 'inf'. Would it be more desirable to raise an Overflow exception for this? I am inclined to think that this would *not* be desirable based on the following quote: """ the-754-committee-probably-did-the-best-job-of-fixing-binary-fp- that-can-be-done-ly y'rs - tim """ In any case, the question stands. I don't really have an idea of the potential pains that this could cause to (1) efficiency, (2) external code that expects to deal with 'inf's itself. The reason I ask is because I am looking at related issues in the Python code these days. Trent -- Trent Mick trentm@activestate.com From tismer@tismer.com Sat May 6 15:29:07 2000 From: tismer@tismer.com (Christian Tismer) Date: Sat, 06 May 2000 16:29:07 +0200 Subject: [Python-Dev] Cannot declare the largest integer literal. References: <000001bfb4a6$21da7900$922d153f@tim> Message-ID: <39142C33.507025B5@tismer.com> Tim Peters wrote: > > [Trent Mick] > > >>> i = -2147483648 > > OverflowError: integer literal too large > > >>> i = -2147483648L > > >>> int(i) # it *is* a valid integer literal > > -2147483648 > > Python's grammar is such that negative integer literals don't exist; what > you actually have there is the unary minus operator applied to positive > integer literals; indeed, Well, knowing that there are more negatives than positives and then coding it this way appears in fact as a design flaw to me. A simple solution could be to do the opposite: Always store a negative number and negate it for positive numbers. A real negative number would then end up with two UNARY_NEGATIVE opcodes in sequence. If we had a simple postprocessor to remove such sequences at the end, we're done. As another step, it could also adjust all such consts and remove those opcodes. This could be a task for Skip's peephole optimizer. Why did it never go into the core? ciao - chris -- Christian Tismer :^) Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr.
26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com From tim_one@email.msn.com Sat May 6 20:13:46 2000 From: tim_one@email.msn.com (Tim Peters) Date: Sat, 6 May 2000 15:13:46 -0400 Subject: [Python-Dev] Cannot declare the largest integer literal. In-Reply-To: <39142C33.507025B5@tismer.com> Message-ID: <000301bfb78f$33e33d80$452d153f@tim> [Tim] > Python's grammar is such that negative integer literals don't > exist; what you actually have there is the unary minus operator > applied to positive integer literals; ... [Christian Tismer] > Well, knowing that there are more negatives than positives > and then coding it this way appears in fact as a design flaw to me. Don't know what you're saying here. Python's grammar has nothing to do with the relative number of positive vs negative entities; indeed, in a 2's-complement machine it's not even true that there are more negatives than positives. Python generates the unary minus for "negative literals" because, again, negative literals *don't exist* in the grammar. > A simple solution could be to do the opposite: > Always store a negative number and negate it > for positive numbers. ... So long as negative literals don't exist in the grammar, "-2147483648" makes no sense on a 2's-complement machine with 32-bit C longs. There isn't "a problem" here worth fixing, although if there is, it will get fixed by magic as soon as Python ints and longs are unified. From tim_one@email.msn.com Sat May 6 20:47:25 2000 From: tim_one@email.msn.com (Tim Peters) Date: Sat, 6 May 2000 15:47:25 -0400 Subject: [Python-Dev] should a float overflow or just equal 'inf' In-Reply-To: <20000505143448.A10731@activestate.com> Message-ID: <000801bfb793$e70c9420$452d153f@tim> [Trent Mick] > I submitted a patch a coupld of days ago to have the 'b', 'i', and 'h' > formatter for PyArg_ParseTuple raise an Overflow exception if > they overflow (currently they just silently overflow). Presuming that > this is considered a good idea, should this be carried to floats. > > Floats don't really overflow, they just equal 'inf'. Would it be more > desireable to raise an Overflow exception for this? I am inclined to think > that this would *not* be desireable based on the following quote: > > """ > the-754-committee-probably-did-the-best-job-of-fixing-binary-fp- > that-can-be-done-ly y'rs - tim > """ > > In any case, the question stands. I don't really have an idea of the > potential pains that this could cause to (1) efficiecy, (2) external code > that expects to deal with 'inf's itself. The reason I ask is because I am > looking at related issues in the Python code these days. Alas, this is the tip of a very large project: while (I believe) *every* platform Python runs on now is 754-conformant, Python itself has no idea what it's doing wrt 754 semantics. In part this is because ISO/ANSI C has no idea what it's doing either. C9X (the next C std) is supposed to supply portable spellings of ways to get at 754 features, but before then there's simply nothing portable that can be done. Guido & I already agreed in principle that Python will eventually follow 754 rules, but with the overflow, divide-by-0, and invalid operation exceptions *enabled* by default (and the underflow and inexact exceptions disabled by default). It does this by accident <0.9 wink> already for, e.g., >>> 1. / 0. Traceback (innermost last): File "<stdin>", line 1, in ?
1. / 0. ZeroDivisionError: float division >>> Under the 754 defaults, that should silently return a NaN instead. But neither Guido nor I think the latter is reasonable default behavior, and having done so before in a previous life I can formally justify changing the defaults a language exposes. Anyway, once all that is done, float overflow *will* raise an exception (by default; there will also be a way to turn that off), unlike what happens today. Before then, I guess continuing the current policy of benign neglect (i.e., let it overflow silently) is best for consistency. Without access to all the 754 features in C, it's not even easy to detect overflow now! "if (x == x * 0.5) overflow();" isn't quite good enough, as it can trigger a spurious underflow error -- there's really no reasonable way to spell this stuff in portable C now! From gstein@lyra.org Sun May 7 11:25:29 2000 From: gstein@lyra.org (Greg Stein) Date: Sun, 7 May 2000 03:25:29 -0700 (PDT) Subject: [Python-Dev] buffer object (was: Unicode debate) In-Reply-To: <390EF3EB.5BCE9EC3@lemburg.com> Message-ID: [ damn, I wish people would pay more attention to changing the subject line to reflect the contents of the email ... I could not figure out if there were any further responses to this without opening most of those dang "Unicode debate" emails. sheesh... ] On Tue, 2 May 2000, M.-A. Lemburg wrote: > Guido van Rossum wrote: > > > > [MAL] > > > Let's not do the same mistake again: Unicode objects should *not* > > > be used to hold binary data. Please use buffers instead. > > > > Easier said than done -- Python doesn't really have a buffer data > > type. The buffer object. We *do* have the type. > > Or do you mean the array module? It's not trivial to read a > > file into an array (although it's possible, there are even two ways). > > Fact is, most of Python's standard library and built-in objects use > > (8-bit) strings as buffers. For historical reasons only. It would be very easy to change these to use buffer objects, except for the simple fact that callers might expect a *string* rather than something with string-like behavior. >... > > > BTW, I think that this behaviour should be changed: > > > > > > >>> buffer('binary') + 'data' > > > 'binarydata' In several places, bufferobject.c uses PyString_FromStringAndSize(). It wouldn't be hard at all to use PyBuffer_New() to allocate the memory, then copy the data in. A new API could also help out here: PyBuffer_CopyMemory(void *ptr, int size) > > > while: > > > > > > >>> 'data' + buffer('binary') > > > Traceback (most recent call last): > > > File "<stdin>", line 1, in ? > > > TypeError: illegal argument type for built-in operation The string object can't handle the buffer on the right side. Buffer objects use the buffer interface, so they can deal with strings on the right. Therefore: asymmetry :-( > > > IMHO, buffer objects should never coerce to strings, but instead > > > return a buffer object holding the combined contents. The > > > same applies to slicing buffer objects: > > > > > > >>> buffer('binary')[2:5] > > > 'nar' > > > > > > should prefereably be buffer('nar'). Sure. Wouldn't be a problem. The FromStringAndSize() thing. > > Note that a buffer object doesn't hold data! It's only a pointer to > > data. I can't off-hand explain the asymmetry though. > > Dang, you're right... Untrue.
There is an API call which will construct a buffer object with its own memory: PyObject * PyBuffer_New(int size) The resulting buffer object will be read/write, and you can stuff values into it using the slice notation. > > > Hmm, perhaps we need something like a data string object > > > to get this 100% right ?! Nope. The buffer object is intended to be exactly this. >... > > Not clear. I'd rather do the equivalent of byte arrays in Java, for > > which no "string literal" notations exist. > > Anyway, one way or another I think we should make it clear > to users that they should start using some other type for > storing binary data. Buffer objects. There are a couple changes to make this a bit easier for people: 1) buffer(ob [,offset [,size]]) should be changed to allow buffer(size) to create a read/write buffer of a particular size. buffer() should create a zero-length read/write buffer. 2) if slice assignment is updated to allow changes to the length (for example: buf[1:2] = 'abcdefgh'), then the buffer object definition must change. Specifically: when the buffer object owns the memory, it does this by appending the memory after the PyObject_HEAD and setting its internal pointer to it; when the dealloc() occurs, the target memory goes with the object. A flag would need to be added to tell the buffer object to do a second free() for the case where a realloc has returned a new pointer. [ I'm not sure that I would agree with this change, however; but it does make them a bit easier to work with; on the other hand, people have been working with immutable strings for a long time, so they're okay with concatenation, so I'm okay with saying length-altering operations must simply be done thru concatenation. ] IMO, extensions should be using the buffer object for raw bytes. I know that Mark has been updating some of the Win32 extensions to do this. Python programs could use the objects if the buffer() builtin is tweaked to allow a bit more flexibility in the arguments. Cheers, -g -- Greg Stein, http://www.lyra.org/ From gstein@lyra.org Sun May 7 12:09:45 2000 From: gstein@lyra.org (Greg Stein) Date: Sun, 7 May 2000 04:09:45 -0700 (PDT) Subject: [Python-Dev] introducing byte arrays in 1.6 (was: Unicode debate) In-Reply-To: <200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: On Wed, 3 May 2000, Guido van Rossum wrote: >... > My ASCII proposal is a compromise that tries to be fair to both uses > for strings. Introducing byte arrays as a more fundamental type has > been on the wish list for a long time -- I see no way to introduce > this into Python 1.6 without totally botching the release schedule > (June 1st is very close already!). I'd like to be able to move on, > there are other important things still to be added to 1.6 (Vladimir's > malloc patches, Neil's GC, Fredrik's completed sre...). > > For 1.7 (which should happen later this year) I promise I'll reopen > the discussion on byte arrays. See my other note. I think a simple change to the buffer() builtin would allow read/write byte arrays to be simply constructed. There are a couple API changes that could be made to bufferobject.[ch] which could simplify some operations for C code and returning buffer objects. But changes like that would be preconditioned on accepting the change in return type from those extensions. For example, the doc may say something returns a string; while buffer objects are similar to strings in operation, they are not the *same*. 
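A rough interactive sketch of that similar-but-not-the-*same* point (the slicing behaviour is per the examples earlier in this thread):

    >>> b = buffer("binary")
    >>> b[2:5]                # behaves like a string here
    'nar'
    >>> type(b) == type("")   # but a type check sees the difference
    0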
IMO, Python 1.7 would be a good time to alter return types to buffer objects as appropriate. (but I'm not averse to doing it today! (to get people used to the difference in purposes)) Cheers, -g -- Greg Stein, http://www.lyra.org/ From bckfnn@worldonline.dk Sun May 7 14:37:21 2000 From: bckfnn@worldonline.dk (Finn Bock) Date: Sun, 07 May 2000 13:37:21 GMT Subject: [Python-Dev] buffer object In-Reply-To: References: Message-ID: <39156208.13412015@smtp.worldonline.dk> [Greg Stein] >IMO, extensions should be using the buffer object for raw bytes. I know >that Mark has been updating some of the Win32 extensions to do this. >Python programs could use the objects if the buffer() builtin is tweaked >to allow a bit more flexibility in the arguments. Forgive me for rewinding this to the very beginning. But what is a buffer object useful for? I'm trying to think about buffer objects in terms of jpython, so my primary interest is the user experience of buffer objects. Please correct my misunderstandings. - There is not a buffer protocol exposed to python objects (in the way the sequence protocol __getitem__ & friends are exposed). - A buffer object typically gives access to the raw bytes which underlie the backing object. Regardless of the structure of the bytes. - It is only intended for objects which have a natural byte storage to implement the buffer interface. - Of the builtin objects only string, unicode and array support the buffer interface. - When slicing a buffer object, the result is always a string regardless of the buffer object base. In jpython, only byte arrays like jarray.array([0,1,2], 'b') can be said to have some natural byte storage. The jpython string type doesn't. It would take some awful bit shifting to present a jpython string as an array of bytes. Would it make any sense to have a buffer object which only accepts a byte array as base? So that jpython would say: >>> buffer("abc") Traceback (most recent call last): File "<stdin>", line 1, in ? TypeError: buffer object expected Would it make sense to tell python users that they cannot depend on the portability of using strings (both 8bit and 16bit) as buffer object base? Because it is so difficult to look at java storage as a sequence of bytes, I think I'm all for keeping the buffer() builtin and buffer object as obscure and unknown as possible. regards, finn From guido@python.org Sun May 7 22:29:43 2000 From: guido@python.org (Guido van Rossum) Date: Sun, 07 May 2000 17:29:43 -0400 Subject: [Python-Dev] buffer object In-Reply-To: Your message of "Sun, 07 May 2000 13:37:21 GMT." <39156208.13412015@smtp.worldonline.dk> References: <39156208.13412015@smtp.worldonline.dk> Message-ID: <200005072129.RAA15850@eric.cnri.reston.va.us> [Finn Bock] > Forgive me for rewinding this to the very beginning. But what is a > buffer object useful for? I'm trying to think about buffer objects in terms > of jpython, so my primary interest is the user experience of buffer > objects. > > Please correct my misunderstandings. > > - There is not a buffer protocol exposed to python objects (in the way > the sequence protocol __getitem__ & friends are exposed). > - A buffer object typically gives access to the raw bytes which > underlie the backing object. Regardless of the structure of the > bytes. > - It is only intended for objects which have a natural byte storage to > implement the buffer interface. All true. > - Of the builtin objects only string, unicode and array support the > buffer interface. And the new mmap module.
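A quick sketch of those, as script code since the exact reprs vary (the unicode case assumes the current CVS behaviour):

    import array
    b1 = buffer("an 8-bit string")            # 8-bit string
    b2 = buffer(u"a wide string")              # Unicode object
    b3 = buffer(array.array('b', [1, 2, 3]))   # array module
    # mmap objects qualify too, once created via the platform-specific mmap.mmap()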
> - When slicing a buffer object, the result is always a string regardless > of the buffer object base. > > In jpython, only byte arrays like jarray.array([0,1,2], 'b') can be > said to have some natural byte storage. The jpython string type doesn't. > It would take some awful bit shifting to present a jpython string as an > array of bytes. I don't recall why JPython has jarray instead of array -- how do they differ? I think it's a shame that similar functionality is embodied in different APIs. > Would it make any sense to have a buffer object which only accepts a byte > array as base? So that jpython would say: > > >>> buffer("abc") > Traceback (most recent call last): > File "<stdin>", line 1, in ? > TypeError: buffer object expected > > > Would it make sense to tell python users that they cannot depend on the > portability of using strings (both 8bit and 16bit) as buffer object > base? I think that the portability of many string properties is in danger with the Unicode proposal. Supporting this in the next version of JPython will be a bit tricky. > Because it is so difficult to look at java storage as a sequence of > bytes, I think I'm all for keeping the buffer() builtin and buffer > object as obscure and unknown as possible. I basically agree, and in a private email to Greg Stein I've told him this. I think that the array module should be promoted to a built-in function/type, and should be the recommended solution for data storage. The buffer API should remain a C-level API, and the buffer() built-in should be labeled with "for experts only". --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Mon May 8 09:33:01 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 08 May 2000 10:33:01 +0200 Subject: [Python-Dev] buffer object (was: Unicode debate) References: Message-ID: <39167BBD.88EB2C64@lemburg.com> Greg Stein wrote: > > [ damn, I wish people would pay more attention to changing the subject > line to reflect the contents of the email ... I could not figure out if > there were any further responses to this without opening most of those > dang "Unicode debate" emails. sheesh... ] > > On Tue, 2 May 2000, M.-A. Lemburg wrote: > > Guido van Rossum wrote: > > > > > > [MAL] > > > > Let's not make the same mistake again: Unicode objects should *not* > > > > be used to hold binary data. Please use buffers instead. > > > > > > Easier said than done -- Python doesn't really have a buffer data > > > type. > > The buffer object. We *do* have the type. > > > > Or do you mean the array module? It's not trivial to read a > > > file into an array (although it's possible, there are even two ways). > > > Fact is, most of Python's standard library and built-in objects use > > > (8-bit) strings as buffers. > > For historical reasons only. It would be very easy to change these to use > buffer objects, except for the simple fact that callers might expect a > *string* rather than something with string-like behavior. Would this be too drastic a change, then? I think that we should at least make use of buffers in the standard lib. > > >... > > > > BTW, I think that this behaviour should be changed: > > > > > > > > >>> buffer('binary') + 'data' > > > > 'binarydata' > > In several places, bufferobject.c uses PyString_FromStringAndSize(). It > wouldn't be hard at all to use PyBuffer_New() to allocate the memory, then > copy the data in.
A new API could also help out here: > > PyBuffer_CopyMemory(void *ptr, int size) > > > > > while: > > > > > > > > >>> 'data' + buffer('binary') > > > > Traceback (most recent call last): > > > > File "<stdin>", line 1, in ? > > > > TypeError: illegal argument type for built-in operation > > The string object can't handle the buffer on the right side. Buffer > objects use the buffer interface, so they can deal with strings on the > right. Therefore: asymmetry :-( > > > > > IMHO, buffer objects should never coerce to strings, but instead > > > > return a buffer object holding the combined contents. The > > > > same applies to slicing buffer objects: > > > > > > > > >>> buffer('binary')[2:5] > > > > 'nar' > > > > > > > > should preferably be buffer('nar'). > > Sure. Wouldn't be a problem. The FromStringAndSize() thing. Right. Before digging deeper into this, I think we should hear Guido's opinion on this again: he said that he wanted to use Java's binary arrays for binary data... perhaps we need to tweak the array type and make it more directly accessible (from C and Python) instead. > > > Note that a buffer object doesn't hold data! It's only a pointer to > > > data. I can't off-hand explain the asymmetry though. > > > > Dang, you're right... > > Untrue. There is an API call which will construct a buffer object with its > own memory: > > PyObject * PyBuffer_New(int size) > > The resulting buffer object will be read/write, and you can stuff values > into it using the slice notation. Yes, but that API is not reachable from within Python, AFAIK. > > > > Hmm, perhaps we need something like a data string object > > > > to get this 100% right ?! > > Nope. The buffer object is intended to be exactly this. > > >... > > > Not clear. I'd rather do the equivalent of byte arrays in Java, for > > > which no "string literal" notations exist. > > > > Anyway, one way or another I think we should make it clear > > to users that they should start using some other type for > > storing binary data. > > Buffer objects. There are a couple changes to make this a bit easier for > people: > > 1) buffer(ob [,offset [,size]]) should be changed to allow buffer(size) to > create a read/write buffer of a particular size. buffer() should create > a zero-length read/write buffer. This looks a lot like function overloading... I don't think we should get into this: how about having the buffer() API take keywords instead ?! buffer(size=1024,mode='rw') - 1K of owned read write memory buffer(obj) - read-only referenced memory from obj buffer(obj,mode='rw') - read-write referenced memory in obj etc. Or we could allow passing None as object to obtain an owned read-write memory block (much like passing NULL to the C functions). > 2) if slice assignment is updated to allow changes to the length (for > example: buf[1:2] = 'abcdefgh'), then the buffer object definition must > change. Specifically: when the buffer object owns the memory, it does > this by appending the memory after the PyObject_HEAD and setting its > internal pointer to it; when the dealloc() occurs, the target memory > goes with the object. A flag would need to be added to tell the buffer > object to do a second free() for the case where a realloc has returned > a new pointer.
> [ I'm not sure that I would agree with this change, however; but it > does make them a bit easier to work with; on the other hand, people > have been working with immutable strings for a long time, so they're > okay with concatenation, so I'm okay with saying length-altering > operations must simply be done thru concatenation. ] I don't think I like this either: what happens when the buffer doesn't own the memory ? > IMO, extensions should be using the buffer object for raw bytes. I know > that Mark has been updating some of the Win32 extensions to do this. > Python programs could use the objects if the buffer() builtin is tweaked > to allow a bit more flexibility in the arguments. Right. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From bckfnn@worldonline.dk Mon May 8 20:44:27 2000 From: bckfnn@worldonline.dk (Finn Bock) Date: Mon, 08 May 2000 19:44:27 GMT Subject: [Python-Dev] buffer object In-Reply-To: <200005072129.RAA15850@eric.cnri.reston.va.us> References: <39156208.13412015@smtp.worldonline.dk> <200005072129.RAA15850@eric.cnri.reston.va.us> Message-ID: <3917074c.8837607@smtp.worldonline.dk> [Guido] >I don't recall why JPython has jarray instead of array -- how do they >differ? I think it's a shame that similar functionality is embodied >in different APIs. The jarray module is a paper-thin factory for the PyArray type which is primarily (I believe) a wrapper around any existing java array instance. It exists to make arrays returned from java code useful for jpython. Since a PyArray must always wrap the original java array, it cannot resize the array. In contrast an array instance would own the memory and can resize it as necessary. Due to the different purposes I agree with Jim's decision of making the two modules incompatible. And they are truly incompatible. jarray.array has reversed the (typecode, seq) arguments. OTOH creating a mostly compatible array module for jpython should not be too hard. regards, finn From guido@python.org Mon May 8 20:55:50 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 08 May 2000 15:55:50 -0400 Subject: [Python-Dev] buffer object In-Reply-To: Your message of "Mon, 08 May 2000 19:44:27 GMT." <3917074c.8837607@smtp.worldonline.dk> References: <39156208.13412015@smtp.worldonline.dk> <200005072129.RAA15850@eric.cnri.reston.va.us> <3917074c.8837607@smtp.worldonline.dk> Message-ID: <200005081955.PAA21928@eric.cnri.reston.va.us> > >I don't recall why JPython has jarray instead of array -- how do they > >differ? I think it's a shame that similar functionality is embodied > >in different APIs. > > The jarray module is a paper-thin factory for the PyArray type which is > primarily (I believe) a wrapper around any existing java array instance. > It exists to make arrays returned from java code useful for jpython. > Since a PyArray must always wrap the original java array, it cannot > resize the array. Understood. This is a bit like the buffer API in CPython then (except for Greg's vision where the buffer object manages storage as well :-). > In contrast an array instance would own the memory and can resize it as > necessary. OK, this makes sense. > Due to the different purposes I agree with Jim's decision of making the > two modules incompatible. And they are truly incompatible. jarray.array > has reversed the (typecode, seq) arguments. This I'm not so sure of. Why be different just to be different?
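Side by side, as a sketch -- jarray's argument order is taken from the description above:

    from array import array         # CPython
    a = array('b', [0, 1, 2])       # (typecode, sequence)

    from jarray import array        # JPython
    a = array([0, 1, 2], 'b')       # (sequence, typecode) -- reversed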
> OTOH creating a mostly compatible array module for jpython should not be > too hard. OK, when we make array() a built-in, this should be done for Java too. --Guido van Rossum (home page: http://www.python.org/~guido/) From trentm@activestate.com Mon May 8 21:29:21 2000 From: trentm@activestate.com (Trent Mick) Date: Mon, 8 May 2000 13:29:21 -0700 Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception In-Reply-To: <200005081400.KAA19889@eric.cnri.reston.va.us> References: <20000503161656.A20275@activestate.com> <200005081400.KAA19889@eric.cnri.reston.va.us> Message-ID: <20000508132921.A31981@activestate.com> On Mon, May 08, 2000 at 10:00:30AM -0400, Guido van Rossum wrote: > > Changes the 'b', 'h', and 'i' formatters in PyArg_ParseTuple to raise an > > Overflow exception if they overflow (previously they just silently > > overflowed). > > Trent, > > There's one issue with this: I believe the 'b' format is mostly used > with unsigned character arguments in practice. > However on systems > with default signed characters, CHAR_MAX is 127 and values 128-255 are > rejected. I'll change the overflow test to: > > else if (ival > CHAR_MAX && ival >= 256) { > > if that's okay with you. > Okay, I guess. Two things: 1. In a way this defeats the main purpose of the checks. Now a silent overflow could happen for a signed byte value over CHAR_MAX. The only way to automatically do the bounds checking is if the exact type is known, i.e. different formatters for signed and unsigned integral values. I don't know if this is desired (is it?). The obvious choice of 'u' prefixes to specify unsigned is obviously not an option. Another option might be to document 'b' as for unsigned chars and 'h', 'i', 'l' as signed integral values and then set the bounds checks ([0, UCHAR_MAX] for 'b') appropriately. Can we clamp these formatters so? I.e. we would be limiting the user to unsigned or signed depending on the formatter. (Which, again, means that it would be nice to have different formatters for signed and unsigned.) I think that the bounds checking is false security unless these restrictions are made. 2. The above aside, I would be more inclined to change the line in question to: else if (ival > UCHAR_MAX) { as this is more explicit about what is being done. > Another issue however is that there are probably cases where an 'i' > format is used (which can't overflow on 32-bit architectures) but > where the int value is then copied into a short field without an > additional check... I'm not sure how to fix this except by a complete > inspection of all code... Not clear if it's worth it. Yes, a complete code inspection seems to be the only way. That is some of what I am doing. Again, I have two questions: 1. There are a fairly large number of downcasting cases in the Python code (not necessarily tied to PyArg_ParseTuple results). I was wondering if you think a generalized check on each such downcast would be advisable. This would take the form of some macro that would do a bounds check before doing the cast. For example (a common one is the cast of strlen's size_t return value to int, because Python strings use int for their length, this is a downcast on 64-bit systems): size_t len = strlen(s); obj = PyString_FromStringAndSize(s, len); would become size_t len = strlen(s); obj = PyString_FromStringAndSize(s, CAST_TO_INT(len)); CAST_TO_INT would ensure that 'len' did not overflow and would raise an exception otherwise.
Pros: - should never have to worry about overflows again - easy to find (given MSC warnings) and easy to code (straightforward) Cons: - more code, more time to execute - looks ugly - have to check PyErr_Occurred every time a cast is done I would like other people's opinion on this kind of change. There are three possible answers: -1 this is a bad idea because... +1 this is a good idea, go for it +0 (most likely) This is probably a good idea for some cases where the overflow *could* happen, however the strlen example that you gave is *not* such a situation. As Tim Peters said: 2GB limit on string lengths is a good assumption/limitation. 2. Microsoft's compiler gives good warnings for casts where information loss is possible. However, I cannot find a way to get similar warnings from gcc. Does anyone know if that is possible? I.e. int i = 123456; short s = i; // should warn about possible loss of information should give a compiler warning. Thanks, Trent -- Trent Mick trentm@activestate.com From trentm@activestate.com Mon May 8 22:26:51 2000 From: trentm@activestate.com (Trent Mick) Date: Mon, 8 May 2000 14:26:51 -0700 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 In-Reply-To: <200005081416.KAA20158@eric.cnri.reston.va.us> References: <20000505135817.A9859@activestate.com> <200005081416.KAA20158@eric.cnri.reston.va.us> Message-ID: <20000508142651.C8000@activestate.com> On Mon, May 08, 2000 at 10:16:42AM -0400, Guido van Rossum wrote: > > The patch to config.h looks big but it really is not. These are the effective > > changes: > > - MS_WINxx are keyed off _WINxx > > - SIZEOF_VOID_P is set to 8 for Win64 > > - COMPILER string is changed appropriately for Win64 > > One thing worries me: if COMPILER is changed, that changes > sys.platform to "win64", right? I'm sure that will break plenty of > code which currently tests for sys.platform=="win32" but really wants > to test for any form of Windows. Maybe sys.platform should remain > win32? > No, but yes. :( Actually I forgot to mention that my config.h patch changes the PLATFORM #define from win32 to win64. So yes, you are correct. And, yes (Sigh) you are right that this will break tests for sys.platform == "win32". So I guess the simplest thing to do is to leave it as win32 following the same reasoning for defining MS_WIN32 on Win64: > The idea is that the common case is > that code specific to Win32 will also work on Win64 rather than being > specific to Win32 (i.e. there is more the same than different in Win32 and > Win64). What if someone needs to do something in Python code for either Win32 or Win64 but not both? Or should this never be necessary (not likely)? I would like Mark H's opinion on this stuff. Trent -- Trent Mick trentm@activestate.com From tismer@tismer.com Mon May 8 22:52:54 2000 From: tismer@tismer.com (Christian Tismer) Date: Mon, 08 May 2000 23:52:54 +0200 Subject: [Python-Dev] Cannot declare the largest integer literal. References: <000301bfb78f$33e33d80$452d153f@tim> Message-ID: <39173736.2A776348@tismer.com> Tim Peters wrote: > > [Tim] > > Python's grammar is such that negative integer literals don't > > exist; what you actually have there is the unary minus operator > > applied to positive integer literals; ... > > [Christian Tismer] > > Well, knowing that there are more negatives than positives > > and then coding it this way appears in fact as a design flaw to me. > > Don't know what you're saying here.
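Concretely, on a 32-bit box (this is the same behaviour as in Trent's original report):

    >>> i = 2147483647               # INT_MAX: fine
    >>> i = -2147483647 - 1          # INT_MIN: representable, but only by computing it
    >>> i = -2147483648              # the literal itself is rejected
    OverflowError: integer literal too large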
On a 2's-complement machine, there are 2**(n-1) negatives, zero, and 2**(n-1)-1 positives. The most negative number cannot be inverted. Most machines today use the 2's complement. > Python's grammar has nothing to do with > the relative number of positive vs negative entities; indeed, in a > 2's-complement machine it's not even true that there are more negatives than > positives. If I read this as a 1's-complement machine, then I believe it. But we don't need to split hairs on known stuff :-) > Python generates the unary minus for "negative literals" > because, again, negative literals *don't exist* in the grammar. Yes. If I know the facts and don't build negative literals into the grammar, then I call it an oversight. Not too bad but not nice. > > A simple solution could be to do the opposite: > > Always store a negative number and negate it > > for positive numbers. ... > > So long as negative literals don't exist in the grammar, "-2147483648" makes > no sense on a 2's-complement machine with 32-bit C longs. There isn't "a > problem" here worth fixing, although if there is, it will get fixed > by magic as soon as Python ints and longs are unified. I'd change the grammar. ciao - chris -- Christian Tismer :^) Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com From gstein@lyra.org Mon May 8 22:54:31 2000 From: gstein@lyra.org (Greg Stein) Date: Mon, 8 May 2000 14:54:31 -0700 (PDT) Subject: [Python-Dev] Cannot declare the largest integer literal. In-Reply-To: <39173736.2A776348@tismer.com> Message-ID: On Mon, 8 May 2000, Christian Tismer wrote: >... > > So long as negative literals don't exist in the grammar, "-2147483648" makes > > no sense on a 2's-complement machine with 32-bit C longs. There isn't "a > > problem" here worth fixing, although if there is, it will get fixed > > by magic as soon as Python ints and longs are unified. > > I'd change the grammar. That would be very difficult, with very little positive benefit. As Mark said, use 0x80000000 if you want that number. Consider that the grammar would probably want to deal with things like - 1234 or -0xA Instead, the grammar sees two parts: "-" and "NUMBER" without needing to complicate the syntax for NUMBER. Cheers, -g -- Greg Stein, http://www.lyra.org/ From tismer@tismer.com Mon May 8 23:09:43 2000 From: tismer@tismer.com (Christian Tismer) Date: Tue, 09 May 2000 00:09:43 +0200 Subject: [Python-Dev] Cannot declare the largest integer literal. References: Message-ID: <39173B27.4B3BEB40@tismer.com> Greg Stein wrote: > > On Mon, 8 May 2000, Christian Tismer wrote: > >... > > > So long as negative literals don't exist in the grammar, "-2147483648" makes > > > no sense on a 2's-complement machine with 32-bit C longs. There isn't "a > > > problem" here worth fixing, although if there is, it will get fixed > > > by magic as soon as Python ints and longs are unified. > > > > I'd change the grammar. > > That would be very difficult, with very little positive benefit. As Mark > said, use 0x80000000 if you want that number. > > Consider that the grammar would probably want to deal with things like > - 1234 > or > -0xA > > Instead, the grammar sees two parts: "-" and "NUMBER" without needing to > complicate the syntax for NUMBER. Right.
That was the reason for my first, dumb, proposal: Always interpret a number as negative and negate it once more. That makes it positive. In a post process, remove double-negates. This leaves negations always where they are allowed: on negatives. ciao - chris -- Christian Tismer :^) Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com From gstein@lyra.org Mon May 8 23:11:00 2000 From: gstein@lyra.org (Greg Stein) Date: Mon, 8 May 2000 15:11:00 -0700 (PDT) Subject: [Python-Dev] Cannot declare the largest integer literal. In-Reply-To: <39173B27.4B3BEB40@tismer.com> Message-ID: On Tue, 9 May 2000, Christian Tismer wrote: >... > Right. That was the reason for my first, dumb, proposal: > Always interpret a number as negative and negate it once more. > That makes it positive. In a post process, remove double-negates. > This leaves negations always where they are allowed: on negatives. IMO, that is a non-intuitive hack. It would increase the complexity of Python's parsing internals. Again, with little measurable benefit. I do not believe that I've run into a case of needing -2147483648 in the source of one of my programs. If I had, then I'd simply switch to 0x80000000 and/or assign it to INT_MIN. -1 on making Python more complex to support this single integer value. Users should be pointed to 0x80000000 to represent it. (a FAQ entry and/or comment in the language reference would be a Good Thing) Cheers, -g -- Greg Stein, http://www.lyra.org/ From mhammond@skippinet.com.au Mon May 8 23:15:17 2000 From: mhammond@skippinet.com.au (Mark Hammond) Date: Tue, 9 May 2000 08:15:17 +1000 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 In-Reply-To: <20000508142651.C8000@activestate.com> Message-ID: [Trent] > What if someone needs to do something in Python code for either Win32 or > Win64 but not both? Or should this never be necessary (not > likely). I would > like Mark H's opinion on this stuff. OK :-) I have always thought that it _would_ move to "win64", and the official way of checking for "Windows" will be sys.platform[:3]=="win". In fact, I've noticed Guido use this idiom (both stand-alone, and as: if sys.platform[:3] in ["win", "mac"]) It will no doubt cause a bit of pain, but IMO it is cleaner... Mark. From guido@python.org Tue May 9 03:14:07 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 08 May 2000 22:14:07 -0400 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 In-Reply-To: Your message of "Tue, 09 May 2000 08:15:17 +1000." References: Message-ID: <200005090214.WAA22419@eric.cnri.reston.va.us> > [Trent] > > What if someone needs to do something in Python code for either Win32 or > > Win64 but not both? Or should this never be necessary (not > > likely). I would > > like Mark H's opinion on this stuff. [Mark] > OK :-) > > I have always thought that it _would_ move to "win64", and the official way > of checking for "Windows" will be sys.platform[:3]=="win". > > In fact, I've noticed Guido use this idiom (both stand-alone, and as: if > sys.platform[:3] in ["win", "mac"]) > > It will no doubt cause a bit of pain, but IMO it is cleaner... Hmm... I'm not sure I agree. I read in the comments that the _WIN32 symbol is defined even on Win64 systems -- to test for Win64, you must test the _WIN64 symbol.
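(For concreteness -- on Microsoft's compilers both symbols are defined when targeting 64-bit Windows, so a C-level test has to check _WIN64 before falling back to _WIN32. A minimal sketch, not taken from the patch itself:

    #if defined(_WIN64)
        /* 64-bit Windows: note that _WIN32 is *also* defined here */
    #elif defined(_WIN32)
        /* 32-bit Windows only */
    #endif

Testing _WIN32 first would never reach a 64-bit branch.)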
The two variants are more similar than they are different. While testing sys.platform isn't quite the same thing, I think that the same reasoning goes: a win64 system is everything that a win32 system is, and then some. So I'd vote for leaving sys.platform alone (i.e. "win32" in both cases), and providing another way to test for win64-ness. I wish we had had the foresight to set sys.platform to 'windows', but since we hadn't, I think we'll have to live with the consequences. The changes that Trent had to make in the standard library are only the tip of the iceberg... --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Tue May 9 03:24:50 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 08 May 2000 22:24:50 -0400 Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception In-Reply-To: Your message of "Mon, 08 May 2000 13:29:21 PDT." <20000508132921.A31981@activestate.com> References: <20000503161656.A20275@activestate.com> <200005081400.KAA19889@eric.cnri.reston.va.us> <20000508132921.A31981@activestate.com> Message-ID: <200005090224.WAA22457@eric.cnri.reston.va.us> [Trent] > > > Changes the 'b', 'h', and 'i' formatters in PyArg_ParseTuple to raise an > > > Overflow exception if they overflow (previously they just silently > > > overflowed). [Guido] > > There's one issue with this: I believe the 'b' format is mostly used > > with unsigned character arguments in practice. > > However on systems > > with default signed characters, CHAR_MAX is 127 and values 128-255 are > > rejected. I'll change the overflow test to: > > > > else if (ival > CHAR_MAX && ival >= 256) { > > > > if that's okay with you. [Trent] > Okay, I guess. Two things: > > 1. In a way this defeats the main purpose of the checks. Now a silent overflow > could happen for a signed byte value over CHAR_MAX. The only way to > automatically do the bounds checking is if the exact type is known, i.e. > different formatters for signed and unsigned integral values. I don't know if > this is desired (is it?). The obvious choice of 'u' prefixes to specify > unsigned is obviously not an option. The struct module uses upper case for unsigned. I think this is overkill here, and would add a lot of code (if applied systematically) that would rarely be used. > Another option might be to document 'b' as for unsigned chars and 'h', 'i', > 'l' as signed integral values and then set the bounds checks ([0, UCHAR_MAX] > for 'b') appropriately. Can we clamp these formatters so? I.e. we would be > limiting the user to unsigned or signed depending on the formatter. (Which > again, means that it would be nice to have different formatters for signed > and unsigned.) I think that the bounds checking is false security unless > these restrictions are made. I like this: 'b' is unsigned, the others are signed. > 2. The above aside, I would be more inclined to change the line in question to: > > else if (ival > UCHAR_MAX) { > > as this is more explicit about what is being done. Agreed. > > Another issue however is that there are probably cases where an 'i' > > format is used (which can't overflow on 32-bit architectures) but > > where the int value is then copied into a short field without an > > additional check... I'm not sure how to fix this except by a complete > > inspection of all code... Not clear if it's worth it. > > Yes, a complete code inspection seems to be the only way. That is some of > what I am doing. Again, I have two questions: > > 1. 
There are a fairly large number of downcasting cases in the Python code > (not necessarily tied to PyArg_ParseTuple results). I was wondering if you > think a generalized check on each such downcast would be advisable. This > would take the form of some macro that would do a bounds check before doing > the cast. For example (a common one is the cast of strlen's size_t return > value to int, because Python strings use int for their length, this is a > downcast on 64-bit systems): > > size_t len = strlen(s); > obj = PyString_FromStringAndSize(s, len); > > would become > > size_t len = strlen(s); > obj = PyString_FromStringAndSize(s, CAST_TO_INT(len)); > > CAST_TO_INT would ensure that 'len' did not overflow and would raise an > exception otherwise. > > Pros: > > - should never have to worry about overflows again > - easy to find (given MSC warnings) and easy to code in (straightforward) > > Cons: > > - more code, more time to execute > - looks ugly > - have to check PyErr_Occurred every time a cast is done How would the CAST_TO_INT macro signal an error? C doesn't have exceptions. If we have to add checks, I'd prefer to write

    size_t len = strlen(s);
    if (INT_OVERFLOW(len))
        return NULL; /* Or whatever is appropriate in this context */
    obj = PyString_FromStringAndSize(s, len);

> I would like other people's opinion on this kind of change. There are three > possible answers: > > +1 this is a bad change idea because... > -1 this is a good idea, go for it > +0 (most likely) This is probably a good idea for some case where the > overflow *could* happen, however the strlen example that you gave is > *not* such a situation. As Tim Peters said: 2GB limit on string lengths > is a good assumption/limitation. -0 > 2. Microsoft's compiler gives good warnings for casts where information loss > is possible. However, I cannot find a way to get similar warnings from gcc. > Does anyone know if that is possible? I.e. > > int i = 123456; > short s = i; // should warn about possible loss of information > > should give a compiler warning. Beats me :-( --Guido van Rossum (home page: http://www.python.org/~guido/) From mhammond@skippinet.com.au Tue May 9 03:29:50 2000 From: mhammond@skippinet.com.au (Mark Hammond) Date: Tue, 9 May 2000 12:29:50 +1000 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 In-Reply-To: <200005090214.WAA22419@eric.cnri.reston.va.us> Message-ID: > > It will no doubt cause a bit of pain, but IMO it is cleaner... > > Hmm... I'm not sure I agree. I read in the comments that the _WIN32 > symbol is defined even on Win64 systems -- to test for Win64, you must > test the _WIN64 symbol. The two variants are more similar than they > are different. Yes, but still, one day, (if MS have their way :-) win32 will be "legacy". E.g., imagine we were having the same debate about 5 years ago, but there was a more established Windows 3.1 port available. If we believed the hype, we probably _would_ have gone with "windows" for both platforms, in the hope that they are more similar than different (after all, that _was_ the story back then). > The changes that Trent had to make in the standard library are only > the tip of the iceberg... Yes, but OTOH, the fact we explicitly use "win32" means people shouldn't really expect code to work on Win64. If nothing else, it will be a good opportunity to examine the situation as each occurrence is found.
It will be quite some time before many people play with the Win64 port seriously (just like the first NT ports when I first came on the scene :-) So, I remain a +0 on this - i.e., I don't really care personally, but think "win64" is the right thing. In any case, I'm happy to rely on Guido's time machine... Mark. From mhammond@skippinet.com.au Tue May 9 03:36:59 2000 From: mhammond@skippinet.com.au (Mark Hammond) Date: Tue, 9 May 2000 12:36:59 +1000 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 In-Reply-To: <200005090214.WAA22419@eric.cnri.reston.va.us> Message-ID: One more data point: Windows CE uses "wince", and I certainly don't believe this should be "win32" (although if you read the CE marketing stuff, they would have you believe it is close enough that we should :-). So to be _truly_ "windows portable", you will still need [:3]=="win" anyway :-) Mark. From guido@python.org Tue May 9 04:16:34 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 08 May 2000 23:16:34 -0400 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 In-Reply-To: Your message of "Tue, 09 May 2000 12:29:50 +1000." References: Message-ID: <200005090316.XAA22614@eric.cnri.reston.va.us> To help me understand the significance of win64 vs. win32, can you list the major differences? I thought that the main thing was that pointers are 64 bits, and that otherwise the APIs are the same. In fact, I don't know if WIN64 refers to Windows running on 64-bit machines (e.g. Alphas) only, or that it is possible to have win64 on a 32-bit machine (e.g. Pentium). If it's mostly a matter of pointer size, this is almost completely hidden at the Python level, and I don't think it's worth changing the platform name. All of the changes that Trent found were really tests for the presence of Windows APIs like the registry... I could defend calling it Windows in comments but having sys.platform be "win32". Like uname on Solaris 2.7 returns SunOS 5.7 -- there's too much old code that doesn't deserve to be broken. (And it's not like we have an excuse that it was always documented this way -- this wasn't documented very clearly at all...) It's-spelt-Raymond-Luxury-Yach-t-but-it's-pronounced-Throatwobbler-Mangrove, --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Tue May 9 04:19:19 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 08 May 2000 23:19:19 -0400 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 In-Reply-To: Your message of "Tue, 09 May 2000 12:36:59 +1000." References: Message-ID: <200005090319.XAA22627@eric.cnri.reston.va.us> > Windows CE uses "wince", and I certainly don't believe this should be > "win32" (although if you read the CE marketing stuff, they would have you > believe it is close enough that we should :-). > > So to be _truly_ "windows portable", you will still need [:3]=="win" anyway > :-) That's a feature :-). Too many things we think we know are true on Windows don't hold on Win/CE, so it's worth being more precise. I don't believe this is the case for Win64, but I have to admit I speak from a position of ignorance -- I am clueless as to what defines Win64.
--Guido van Rossum (home page: http://www.python.org/~guido/) From nhodgson@bigpond.net.au Tue May 9 04:35:16 2000 From: nhodgson@bigpond.net.au (Neil Hodgson) Date: Tue, 9 May 2000 13:35:16 +1000 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 References: <200005090316.XAA22614@eric.cnri.reston.va.us> Message-ID: <035e01bfb968$9ad8cca0$e3cb8490@neil> > To help me understand the significance of win64 vs. win32, can you > list the major differences? I thought that the main thing was that > pointers are 64 bits, and that otherwise the APIs are the same. In > fact, I don't know if WIN64 refers to Windows running on 64-bit > machines (e.g. Alphas) only, or that it is possible to have win64 on a > 32-bit machine (e.g. Pentium). The 64 bit pointer change propagates to related types like size_t and window procedure parameters. Running the 64 bit checker over Scintilla found one real problem and a large number of strlen returning 64 bit size_ts where only ints were expected. 64 bit machines will continue to run Win32 code but it is unlikely that 32 bit machines will be taught to run Win64 code. Mixed operations, calling between 32 bit and 64 bit code and vice-versa will be fun. Microsoft (unlike IBM with OS/2) never really did the right thing for the 16->32 bit conversion. Is there any information yet on mixed size applications? Neil From mhammond@skippinet.com.au Tue May 9 05:06:25 2000 From: mhammond@skippinet.com.au (Mark Hammond) Date: Tue, 9 May 2000 14:06:25 +1000 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 In-Reply-To: <200005090316.XAA22614@eric.cnri.reston.va.us> Message-ID: > To help me understand the significance of win64 vs. win32, can you > list the major differences? I thought that the main thing was that I just saw Neil's, and Trent may have other input. However, the point I was making is that 5 years ago, MS were telling us that the Win32 API was almost identical to the Win16 API, except for the size of pointers, and dropping of the "memory model" abominations. The Windows CE department is telling us that CE is, or will be, basically the same as Win32, except it is a Unicode only platform. Again, with 1.6, this should be hidden from the Python programmer. Now all we need is "win64s" - it will respond to Neil's criticism that mixed mode programs are a pain, and MS will tell us that "win64s" will solve all our problems, and allow win32 to run 64 bit programs well into the future. Until everyone in the world realizes it sucks, and MS promptly says it was only ever a hack in the first place, and everyone should be on Win64 by now anyway :-) Its-times-like-this-we-really-need-that-time-machine-ly, Mark. From tim_one@email.msn.com Tue May 9 07:54:51 2000 From: tim_one@email.msn.com (Tim Peters) Date: Tue, 9 May 2000 02:54:51 -0400 Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception In-Reply-To: <200005090224.WAA22457@eric.cnri.reston.va.us> Message-ID: <000101bfb983$7a34d3c0$592d153f@tim> [Trent] > 1. There are a fairly large number of downcasting cases in the > Python code (not necessarily tied to PyArg_ParseTuple results). I > was wondering if you think a generalized check on each such > downcast would be advisable. This would take the form of some macro > that would do a bounds check before doing the cast.
For example (a > common one is the cast of strlen's size_t return value to int, > because Python strings use int for their length, this is a downcast > on 64-bit systems): > > size_t len = strlen(s); > obj = PyString_FromStringAndSize(s, len); > > would become > > size_t len = strlen(s); > obj = PyString_FromStringAndSize(s, CAST_TO_INT(len)); > > CAST_TO_INT would ensure that 'len' did not overflow and would raise an > exception otherwise. [Guido] > How would the CAST_TO_INT macro signal an error? C doesn't have > exceptions. If we have to add checks, I'd prefer to write > > size_t len = strlen(s); > if (INT_OVERFLOW(len)) > return NULL; /* Or whatever is appropriate in this context */ > obj = PyString_FromStringAndSize(s, len); Of course we have to add checks -- strlen doesn't return an int! It hasn't since about a year after Python was first written (ANSI C changed the rules, and Python is long overdue in catching up -- if you want people to stop passing multiple args to append, set a good example in our use of C <0.5 wink>). [Trent] > I would like other people's opinion on this kind of change. > There are three possible answers: Please don't change the rating scheme we've been using: -1 is a veto, +1 is a hurrah, -0 and +0 are obvious. > +1 this is a bad change idea because... > -1 this is a good idea, go for it That one, except spelled +1. > +0 (most likely) This is probably a good idea for some case > where the overflow *could* happen, however the strlen example that > you gave is *not* such a situation. As Tim Peters said: 2GB limit on > string lengths is a good assumption/limitation. No, it's a defensible limitation, but it's *never* a valid assumption. The check isn't needed anywhere we can prove a priori that it could never fail (in which case we're not assuming anything), but it's always needed when we can't so prove (in which case skipping the check would be a bad assumption). In the absence of any context, your strlen example above definitely needs the check. An alternative would be to promote the size member from int to size_t; that's no actual change on the 32-bit machines Guido generally assumes without realizing it, and removes an arbitrary (albeit defensible) limitation on some 64-bit machines at the cost of (just possibly, due to alignment vagaries) boosting var objects' header size on the latter. correctness-doesn't-happen-by-accident-ly y'rs - tim From guido@python.org Tue May 9 11:48:16 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 09 May 2000 06:48:16 -0400 Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception In-Reply-To: Your message of "Tue, 09 May 2000 02:54:51 EDT." <000101bfb983$7a34d3c0$592d153f@tim> References: <000101bfb983$7a34d3c0$592d153f@tim> Message-ID: <200005091048.GAA22912@eric.cnri.reston.va.us> > An alternative would be to promote the size member from int to size_t; > that's no actual change on the 32-bit machines Guido generally assumes > without realizing it, and removes an arbitrary (albeit defensible) > limitation on some 64-bit machines at the cost of (just possibly, due to > alignment vagaries) boosting var objects' header size on the latter. Then the signatures of many, many functions would have to be changed to take or return size_t, too -- almost anything in the Python/C API that *conceptually* is a size_t is declared as int; the ob_size field is only the tip of the iceberg.
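(For reference, the declaration at issue -- as it stood in the 1.5/2.0-era object.h, reproduced from memory, so treat the exact spelling as approximate:

    #define PyObject_VAR_HEAD \
        PyObject_HEAD \
        int ob_size; /* number of items in variable part */

Promoting that int to size_t is the change whose ripple effects are being weighed here.)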
We'd also have to change the size of Python ints (currently long) to an integral type that can hold a size_t; on Windows (and I believe *only* on Windows) this is a long long, or however they spell it (except size_t is typically unsigned). This all is a major reworking -- not good for 1.6, even though I agree it needs to be done eventually. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Tue May 9 12:08:25 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 09 May 2000 07:08:25 -0400 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 In-Reply-To: Your message of "Tue, 09 May 2000 14:06:25 +1000." References: Message-ID: <200005091108.HAA22983@eric.cnri.reston.va.us> > > To help me understand the significance of win64 vs. win32, can you > > list the major differences? I thought that the main thing was that > > I just saw Neil's, and Trent may have other input. > > However, the point I was making is that 5 years ago, MS were telling us > that the Win32 API was almost identical to the Win16 API, except for the > size of pointers, and dropping of the "memory model" abominations. > > The Windows CE department is telling us that CE is, or will be, basically > the same as Win32, except it is a Unicode only platform. Again, with 1.6, > this should be hidden from the Python programmer. > > Now all we need is "win64s" - it will respond to Neil's criticism that > mixed mode programs are a pain, and MS will tell us that "win64s" will > solve all our problems, and allow win32 to run 64 bit programs well into > the future. Until everyone in the world realizes it sucks, and MS promptly > says it was only ever a hack in the first place, and everyone should be on > Win64 by now anyway :-) OK, I am beginning to get the picture. The win16-win32-win64 distinction mostly affects the C API. I agree that the win16/win32 distinction was huge -- while they provided backwards compatible APIs, most of these were quickly deprecated. The user experience was also completely different. And huge amounts of functionality were only available in the win32 version (e.g. the registry), win32s notwithstanding. I don't see the same difference for the win32/win64 API. Yes, all the APIs have changed -- but only in a way you would *expect* them to change in a 64-bit world. From the descriptions of differences, the user experience and the sets of APIs available are basically the same, but the APIs are tweaked to allow 64-bit values where this makes sense. This is a big deal for MS developers because of MS's insistence on fixing the sizes of all datatypes -- POSIX developers are used to typedefs that have platform-dependent widths, but MS in its wisdom has decided that it should be okay to know that a long is exactly 32 bits. Again, the Windows/CE user experience is quite different, so I agree on making the user-visible platform different there. But I still don't see that the user experience for win64 will be any different than for win32. Another view: win32 was my way of saying the union of Windows 95, Windows NT, and Windows 98, contrasted to Windows 3.1 and non-Windows platforms. If Windows 2000 is sufficiently different to the user, it deserves a different platform id (win2000?). Is there a connection between Windows 2000 and _WIN64? --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Tue May 9 10:09:40 2000 From: mal@lemburg.com (M.-A.
Lemburg) Date: Tue, 09 May 2000 11:09:40 +0200 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 References: <200005090214.WAA22419@eric.cnri.reston.va.us> Message-ID: <3917D5D3.A8CD1B3E@lemburg.com> Guido van Rossum wrote: > > > [Trent] > > > What if someone needs to do something in Python code for either Win32 or > > > Win64 but not both? Or should this never be necessary (not > > > likely). I would > > > like Mark H's opinion on this stuff. > > [Mark] > > OK :-) > > > > I have always thought that it _would_ move to "win64", and the official way > > of checking for "Windows" will be sys.platform[:3]=="win". > > > > In fact, I've noticed Guido use this idiom (both stand-alone, and as: if > > sys.platform[:3] in ["win", "mac"]) > > > > It will no doubt cause a bit of pain, but IMO it is cleaner... > > Hmm... I'm not sure I agree. I read in the comments that the _WIN32 > symbol is defined even on Win64 systems -- to test for Win64, you must > test the _WIN64 symbol. The two variants are more similar than they > are different. > > While testing sys.platform isn't quite the same thing, I think that > the same reasoning goes: a win64 system is everything that a win32 > system is, and then some. > > So I'd vote for leaving sys.platform alone (i.e. "win32" in both > cases), and providing another way to test for win64-ness. Just curious, what's the output of platform.py on Win64? (You can download platform.py from my Python Pages.) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fdrake@acm.org Tue May 9 19:53:37 2000 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 9 May 2000 14:53:37 -0400 (EDT) Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 In-Reply-To: <200005091108.HAA22983@eric.cnri.reston.va.us> References: <200005091108.HAA22983@eric.cnri.reston.va.us> Message-ID: <14616.24241.26240.247048@seahag.cnri.reston.va.us> Guido van Rossum writes: > Another view: win32 was my way of saying the union of Windows 95, > Windows NT, and Windows 98, contrasted to Windows 3.1 and non-Windows > platforms. If Windows 2000 is sufficiently different to the user, it > deserves a different platform id (win2000?). > > Is there a connection between Windows 2000 and _WIN64? Since no one else has responded, here's some stuff from MS on the topic of Win64: http://www.microsoft.com/windows2000/guide/platform/strategic/64bit.asp This document talks only of the Itanium (IA64) processor, and doesn't mention the Alpha at all. I know the NT shipping on Alpha machines is Win32, though the actual application code can be 64-bit (think "32-bit Solaris on an Ultra"); just the system APIs are 32 bits. The last link on the page links to some more detailed technical information on moving application code to Win64. -Fred -- Fred L. Drake, Jr. Corporation for National Research Initiatives From guido@python.org Tue May 9 19:57:21 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 09 May 2000 14:57:21 -0400 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 In-Reply-To: Your message of "Tue, 09 May 2000 14:53:37 EDT."
<14616.24241.26240.247048@seahag.cnri.reston.va.us> References: <200005091108.HAA22983@eric.cnri.reston.va.us> <14616.24241.26240.247048@seahag.cnri.reston.va.us> Message-ID: <200005091857.OAA24731@eric.cnri.reston.va.us> > Since no one else has responded, here's some stuff from MS on the > topic of Win64: > > http://www.microsoft.com/windows2000/guide/platform/strategic/64bit.asp Thanks, this makes more sense. I guess that Trent's interest in Win64 has to do with an early shipment of Itaniums that ActiveState might have received. :-) The document confirms my feeling that WIN64 vs WIN32, unlike WIN32 vs WIN16, is mostly a compiler issue, and not a user experience or OS functionality issue. The table lists increased limits, not new software subsystems. So I still think that sys.platform should be 'win32', to avoid breaking existing apps. --Guido van Rossum (home page: http://www.python.org/~guido/) From gstein@lyra.org Tue May 9 19:56:34 2000 From: gstein@lyra.org (Greg Stein) Date: Tue, 9 May 2000 11:56:34 -0700 (PDT) Subject: [Python-Dev] win64 (was: [Patches] PC\config.[hc] changes for Win64) In-Reply-To: <14616.24241.26240.247048@seahag.cnri.reston.va.us> Message-ID: On Tue, 9 May 2000, Fred L. Drake, Jr. wrote: > Guido van Rossum writes: > > Another view: win32 was my way of saying the union of Windows 95, > > Windows NT, and Windows 98, contrasted to Windows 3.1 and non-Windows > > platforms. If Windows 2000 is sufficiently different to the user, it > > deserves a different platform id (win2000?). > > > > Is there a connection between Windows 2000 and _WIN64? > > Since no one else has responded, here's some stuff from MS on the > topic of Win64: > > http://www.microsoft.com/windows2000/guide/platform/strategic/64bit.asp > > This document talks only of the Itanium (IA64) processor, and doesn't > mention the Alpha at all. I know the NT shipping on Alpha machines is > Win32, though the actual application code can be 64-bit (think "32-bit > Solaris on an Ultra"); just the system APIs are 32 bits. Windows is no longer made/sold for the Alpha processor. That was canned in August of '99, I believe. Possibly August 98. Basically, Windows is just the x86 family, and Win/CE for various embedded processors. Cheers, -g -- Greg Stein, http://www.lyra.org/ From fdrake@acm.org Tue May 9 20:06:49 2000 From: fdrake@acm.org (Fred L. Drake, Jr.) Date: Tue, 9 May 2000 15:06:49 -0400 (EDT) Subject: [Python-Dev] Re: win64 (was: [Patches] PC\config.[hc] changes for Win64) In-Reply-To: References: <14616.24241.26240.247048@seahag.cnri.reston.va.us> Message-ID: <14616.25033.883165.800216@seahag.cnri.reston.va.us> Greg Stein writes: > Windows is no longer made/sold for the Alpha processor. That was canned in > August of '99, I believe. Possibly August 98. -Fred -- Fred L. Drake, Jr. Corporation for National Research Initiatives From trentm@activestate.com Tue May 9 20:49:57 2000 From: trentm@activestate.com (Trent Mick) Date: Tue, 9 May 2000 12:49:57 -0700 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 In-Reply-To: <200005091857.OAA24731@eric.cnri.reston.va.us> References: <200005091108.HAA22983@eric.cnri.reston.va.us> <14616.24241.26240.247048@seahag.cnri.reston.va.us> <200005091857.OAA24731@eric.cnri.reston.va.us> Message-ID: <20000509124957.A21838@activestate.com> > Thanks, this makes more sense. I guess that Trent's interest in Win64 > has to do with an early shipment of Itaniums that ActiveState might > have received. :-) Could be.... Or maybe we don't have any Itanium boxes. 
:) Here is a good link on MSDN: Getting Ready for 64-bit Windows http://msdn.microsoft.com/library/psdk/buildapp/64bitwin_410z.htm More specifically this (presuming it is being kept up to date) documents the changes to the Win32 API for 64-bit Windows: http://msdn.microsoft.com/library/psdk/buildapp/64bitwin_9xo3.htm I am not a Windows programmer, but the changes are pretty minimal. Summary:

Points for sys.platform == "win32" on Win64:

Pros:
- will not break existing sys.platform checks
- it would be nicer for the casual Python programmer to have platform issues hidden, therefore one symbol for the common Windows OSes is more of the Pythonic ideal than "the first three characters of the platform string are 'win'".

Cons:
- may need to add some other mechanism to differentiate Win32 and Win64 in Python code
- "win32" is a little misleading in that it refers to an API supported on Win32 and Win64 ("windows" would be more accurate, but too late for that)

Points for sys.platform == "win64" on Win64:

Pros:
- seems logically cleaner, given that the Win64 API may diverge from the Win32 API and there is no other current mechanism to differentiate Win32 and Win64 in Python code

Cons:
- may break existing sys.platform checks when run on Win64

Opinion: I see the two choices ("win32" or "win64") as a trade-off between:

- Use "win32" because a common user experience should translate to a common way to check for that environment, i.e. one value for sys.platform. Unfortunately we are stuck with "win32" instead of something like "windows".
- Use "win64" because it is not a big deal for the user to check for sys.platform[:3]=="win" and this way a mechanism exists to differentiate between Win32 and Win64 should it be necessary.

I am inclined to pick "win32" because:

1. While it may be confusing to the Python scriptor on Win64 that he has to check for win*32*, that is something that he will learn the first time. It is better than the alternative of the scriptor happily using "win64" and then that code not running on Win32 for no good reason.
2. The main question is: is Win64 so much more like Win32 than different from it that the common-case general Python programmer should not ever have to make the differentiation in his Python code? Or, at least, enough so that such differentiation by the Python scriptor is rare enough that some other provided mechanism is sufficient (even preferable).
3. Guido has expressed that he favours this option. :) Then change "win32" to "windows" in Py3K.

Trent -- Trent Mick trentm@activestate.com From trentm@activestate.com Tue May 9 21:05:53 2000 From: trentm@activestate.com (Trent Mick) Date: Tue, 9 May 2000 13:05:53 -0700 Subject: [Python-Dev] Re: [Patches] make 'b','h','i' raise overflow exception In-Reply-To: <000101bfb983$7a34d3c0$592d153f@tim> References: <200005090224.WAA22457@eric.cnri.reston.va.us> <000101bfb983$7a34d3c0$592d153f@tim> Message-ID: <20000509130553.D21443@activestate.com> [Trent] > > Another option might be to document 'b' as for unsigned chars and 'h', 'i', > > 'l' as signed integral values and then set the bounds checks ([0, > > UCHAR_MAX] > > for 'b') appropriately. Can we clamp these formatters so? I.e. we would be > > limiting the user to unsigned or signed depending on the formatter. (Which > > again, means that it would be nice to have different formatters for signed > > and unsigned.) I think that the bounds checking is false security unless > > these restrictions are made. [guido] > > I like this: 'b' is unsigned, the others are signed.
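(A sketch of what that agreement implies for the 'b' format -- the helper name here is hypothetical and this is not the patch Trent actually submitted; the real check lives inside PyArg_ParseTuple's converter:

    #include <limits.h>
    #include "Python.h"

    /* Hypothetical helper: accept [0, UCHAR_MAX], raise OverflowError
       otherwise. Returns 1 on success, 0 with an exception set on failure. */
    static int
    convert_ubyte(long ival, unsigned char *p)
    {
        if (ival < 0 || ival > UCHAR_MAX) {
            PyErr_SetString(PyExc_OverflowError,
                            "unsigned byte integer is out of range");
            return 0;
        }
        *p = (unsigned char)ival;
        return 1;
    }

)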
Okay, I will submit a patch for this, then. 'b' formatter will limit values to [0, UCHAR_MAX]. > [Trent] > > 1. There are a fairly large number of downcasting cases in the > > Python code (not necessarily tied to PyArg_ParseTuple results). I > > was wondering if you think a generalized check on each such > > downcast would be advisable. This would take the form of some macro > > that would do a bounds check before doing the cast. For example (a > > common one is the cast of strlen's size_t return value to int, > > because Python strings use int for their length, this is a downcast > > on 64-bit systems): > > > > size_t len = strlen(s); > > obj = PyString_FromStringAndSize(s, len); > > > > would become > > > > size_t len = strlen(s); > > obj = PyString_FromStringAndSize(s, CAST_TO_INT(len)); > > > > CAST_TO_INT would ensure that 'len' did not overflow and would raise an > > exception otherwise. > > [Guido] > > How would the CAST_TO_INT macro signal an error? C doesn't have > > exceptions. If we have to add checks, I'd prefer to write > > > > size_t len = strlen(s); > > if (INT_OVERFLOW(len)) > > return NULL; /* Or whatever is appropriate in this context */ > > obj = PyString_FromStringAndSize(s, len); > [Tim] > Of course we have to add checks -- strlen doesn't return an int! It hasn't > since about a year after Python was first written (ANSI C changed the rules, > and Python is long overdue in catching up -- if you want people to stop > passing multiple args to append, set a good example in our use of C <0.5 > wink>). > > The > check isn't needed anywhere we can prove a priori that it could never fail > (in which case we're not assuming anything), but it's always needed when we > can't so prove (in which case skipping the check would be a bad > assumption). > In the absence of any context, your strlen example above definitely needs > the check. > Okay, I just wanted a go-ahead that this kind of thing was desired. I will try to find the points where these overflows *can* happen and then I'll add checks in a manner closer to Guido's syntax above. > > [Trent] > > I would like other people's opinion on this kind of change. > > There are three possible answers: > > Please don't change the rating scheme we've been using: -1 is a veto, +1 is > a hurrah, -0 and +0 are obvious. > > > +1 this is a bad change idea because... > > -1 this is a good idea, go for it > Whoa, sorry Tim. I mixed up the +/- there. I did not intend to change the voting system. [Tim] > An alternative would be to promote the size member from int to size_t; > that's no actual change on the 32-bit machines Guido generally assumes > without realizing it, and removes an arbitrary (albeit defensible) > limitation on some 64-bit machines at the cost of (just possibly, due to > alignment vagaries) boosting var objects' header size on the latter. > I agree with Guido that this is too big an immediate change. I'll just try to find and catch the possible overflows. Thanks, Trent -- Trent Mick trentm@activestate.com From gstein@lyra.org Tue May 9 21:14:19 2000 From: gstein@lyra.org (Greg Stein) Date: Tue, 9 May 2000 13:14:19 -0700 (PDT) Subject: [Python-Dev] global encoding?!? (was: [Python-checkins] ... unicodeobject.c) In-Reply-To: <200005091953.PAA28201@seahag.cnri.reston.va.us> Message-ID: On Tue, 9 May 2000, Fred Drake wrote: > Update of /projects/cvsroot/python/dist/src/Objects > In directory seahag.cnri.reston.va.us:/home/fdrake/projects/python/Objects > > Modified Files: > unicodeobject.c > Log Message: > > M.-A.
Lemburg: > Added support for user settable default encodings. The > current implementation uses a per-process global which > defines the value of the encoding parameter in case it > is set to NULL (meaning: use the default encoding). Umm... maybe I missed something, but I thought there were pretty broad feelings *against* having a global like this. This kind of thing is just nasty.

1) Python modules can't change it, nor can they rely on it being a particular value
2) a mutable, global variable is just plain wrong. The InterpreterState and ThreadState structures were created *specifically* to avoid adding crap variables like this.
3) allowing a default other than utf-8 is sure to cause gotchas and surprises. Some code is going to rightly assume that the default is just that, but be horribly broken when an application changes it.

Somebody please say this is hugely experimental. And then say why it isn't just a private patch, rather than sitting in CVS. :-( -g -- Greg Stein, http://www.lyra.org/ From guido@python.org Tue May 9 21:24:05 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 09 May 2000 16:24:05 -0400 Subject: [Python-Dev] global encoding?!? (was: [Python-checkins] ... unicodeobject.c) In-Reply-To: Your message of "Tue, 09 May 2000 13:14:19 PDT." References: Message-ID: <200005092024.QAA25835@eric.cnri.reston.va.us> > Umm... maybe I missed something, but I thought there were pretty broad > feelings *against* having a global like this. This kind of thing is just > nasty. > > 1) Python modules can't change it, nor can they rely on it being a > particular value > 2) a mutable, global variable is just plain wrong. The InterpreterState > and ThreadState structures were created *specifically* to avoid adding > crap variables like this. > 3) allowing a default other than utf-8 is sure to cause gotchas and > surprises. Some code is going to rightly assume that the default is > just that, but be horribly broken when an application changes it. > > Somebody please say this is hugely experimental. And then say why it isn't > just a private patch, rather than sitting in CVS. Watch your language. Marc did this at my request. It is my intention that the encoding be hardcoded at compile time. But while there's a discussion going about what the hardcoded encoding should *be*, it would seem handy to have a quick way to experiment. --Guido van Rossum (home page: http://www.python.org/~guido/) From gstein@lyra.org Tue May 9 21:33:40 2000 From: gstein@lyra.org (Greg Stein) Date: Tue, 9 May 2000 13:33:40 -0700 (PDT) Subject: [Python-Dev] global encoding?!? (was: [Python-checkins] ... unicodeobject.c) In-Reply-To: <200005092024.QAA25835@eric.cnri.reston.va.us> Message-ID: On Tue, 9 May 2000, Guido van Rossum wrote: >... > Watch your language. Yes, Dad :-) Sorry... > Marc did this at my request. It is my intention that the encoding be > hardcoded at compile time. But while there's a discussion going about > what the hardcoded encoding should *be*, it would seem handy to have a > quick way to experiment. Okee dokee... That was one of my questions: is this experimental or not? It is still a bit frightening, though, if it might get left in there, for the reasons I listed (to name a few) ... :-( Cheers, -g -- Greg Stein, http://www.lyra.org/ From mal@lemburg.com Tue May 9 22:35:16 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 09 May 2000 23:35:16 +0200 Subject: [Python-Dev] global encoding?!? (was: [Python-checkins] ...
unicodeobject.c) References: <200005092024.QAA25835@eric.cnri.reston.va.us> Message-ID: <39188494.61424A7@lemburg.com> Guido van Rossum wrote: > > > Umm... maybe I missed something, but I thought there were pretty broad > > feelings *against* having a global like this. This kind of thing is just > > nasty. > > > > 1) Python modules can't change it, nor can they rely on it being a > > particular value > > 2) a mutable, global variable is just plain wrong. The InterpreterState > > and ThreadState structures were created *specifically* to avoid adding > > crap variables like this. > > 3) allowing a default other than utf-8 is sure to cause gotchas and > > surprises. Some code is going to rightly assume that the default is > > just that, but be horribly broken when an application changes it. Hmm, the patch notice says it all I guess: This patch fixes a few bugglets and adds an experimental feature which allows setting the string encoding assumed by the Unicode implementation at run-time. The current implementation uses a process global for the string encoding. This should subsequently be changed to a thread state variable, so that the setting can be done on a per thread basis. Note that only the coercions from strings to Unicode are affected by the encoding parameter. The "s" parser marker still returns UTF-8. (str(unicode) also returns the string encoding -- unlike what I wrote in the original patch notice.) The main intent of this patch is to provide a test bed for the ongoing Unicode debate, e.g. to have the implementation use 'latin-1' as default string encoding, put

    import sys
    sys.set_string_encoding('latin-1')

in your site.py file. > > Somebody please say this is hugely experimental. And then say why it isn't > > just a private patch, rather than sitting in CVS. > > Watch your language. > > Marc did this at my request. It is my intention that the encoding be > hardcoded at compile time. But while there's a discussion going about > what the hardcoded encoding should *be*, it would seem handy to have a > quick way to experiment. Right, and that's what the intent was behind adding a global and some APIs to change it first... there are a few ways this could one day get finalized:

1. hardcode the encoding (UTF-8 was previously hard-coded)
2. make the encoding a compile time option
3. make the encoding a per-process option
4. make the encoding a per-thread option
5. make the encoding a per-process setting which is deduced from env. vars such as LC_ALL, LC_CTYPE, LANG or system APIs which can be used to get at the currently active local encoding

Note that I have named the APIs sys.get/set_string_encoding()... I've done that on purpose, because I have a feeling that changing the conversion from Unicode to strings from UTF-8 to an encoding not capable of representing all Unicode characters won't get us very far. Also, changing this is rather tricky due to the way the buffer API works. The other way around needs some experimenting though and this is what the patch implements: it allows you to change the string encoding assumption to test various possibilities, e.g. ascii, latin-1, unicode-escape, etc. without having to recompile the interpreter every time.
Have fun with it :-) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mhammond@skippinet.com.au Tue May 9 23:58:19 2000 From: mhammond@skippinet.com.au (Mark Hammond) Date: Wed, 10 May 2000 08:58:19 +1000 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 In-Reply-To: <20000509124957.A21838@activestate.com> Message-ID: Geez - Fred is posting links to the MS site, and I'm battling ipchains and DHCP on my newly installed Debian box - what is this world coming to!?!?! > I am inclined to pick "win32" because: OK - I'm sold. Mark. From nhodgson@bigpond.net.au Wed May 10 00:17:27 2000 From: nhodgson@bigpond.net.au (Neil Hodgson) Date: Wed, 10 May 2000 09:17:27 +1000 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 References: Message-ID: <009a01bfba0c$bdf13a20$e3cb8490@neil> > Now all we need is "win64s" - it will respond to Neil's criticism that > mixed mode programs are a pain, and MS will tell us that "win64s" will > solve all our problems, and allow win32 to run 64 bit programs well into > the future. Until everyone in the world realizes it sucks, and MS promptly > says it was only ever a hack in the first place, and everyone should be on > Win64 by now anyway :-) Maybe someone has made noise about this before I joined the discussion, but I see the absence of a mixed mode being a big problem for users. I don't think that there will be the "quick, clean" migration from 32 to 64 that there was for 16 to 32. It doesn't offer that much for most applications. So there will need to be both 32 bit and 64 bit versions of Python present on machines. With duplicated libraries. Each DLL should be available in both 32 and 64 bit form. The IDEs will have to be available in both forms as they are loading, running and debugging code of either width. Users will have to remember to run a different Python if they are using libraries of the non-default width. Neil From czupancic@beopen.com Wed May 10 00:44:20 2000 From: czupancic@beopen.com (Christian Zupancic) Date: Tue, 09 May 2000 16:44:20 -0700 Subject: [Python-Dev] Python Query Message-ID: <3918A2D4.B0FE7DDF@beopen.com> ====================================================================== Greetings Python Developers, Please participate in a small survey about Python for BeOpen.com that we are conducting with the guidance of our advisor, and the creator of Python, Guido van Rossum. In return for answering just five short questions, I will mail you up to three (3) BeOpen T-shirts -- highly esteemed by select trade-show attendees as "really cool". In addition, three lucky survey participants will receive a Life-Size Inflatable Penguin (as they say, "very cool").

- Why do you prefer Python over other languages, e.g. Perl?
- What do you consider to be (a) competitor(s) to Python?
- What are Python's strong points and weaknesses?
- What other languages do you program in?
- If you had one wish about Python, what would it be?
- For Monty Python fans only: What is the average airspeed of a swallow (European, non-migratory)?

THANKS! That wasn't so bad, was it? Make sure you've attached a business card or address of some sort so I know where to send your prizes.
Best Regards, Christian Zupancic Market Analyst, BeOpen.com From trentm@activestate.com Wed May 10 00:45:36 2000 From: trentm@activestate.com (Trent Mick) Date: Tue, 9 May 2000 16:45:36 -0700 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 In-Reply-To: <3917D5D3.A8CD1B3E@lemburg.com> References: <200005090214.WAA22419@eric.cnri.reston.va.us> <3917D5D3.A8CD1B3E@lemburg.com> Message-ID: <20000509164536.A31366@activestate.com> On Tue, May 09, 2000 at 11:09:40AM +0200, M.-A. Lemburg wrote: > Just curious, what's the output of platform.py on Win64? > (You can download platform.py from my Python Pages.) I get the following:

    """
    The system cannot find the path specified
    win64-32bit
    """

Sorry, I did not hunt down the "path" error message. Trent -- Trent Mick trentm@activestate.com From tim_one@email.msn.com Wed May 10 05:53:20 2000 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 10 May 2000 00:53:20 -0400 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 In-Reply-To: <009a01bfba0c$bdf13a20$e3cb8490@neil> Message-ID: <000301bfba3b$a9e11300$022d153f@tim> [Neil Hodgson] > Maybe someone has made noise about this before I joined the > discussion, but I see the absence of a mixed mode being a big > problem for users. ... Intel doesn't -- they're not positioning Itanium for the consumer market. They're going after the high-performance server market with this, and most signs are that MS is too. > ... > It doesn't offer that much for most applications. Bingo. plenty-of-time-to-panic-later-if-end-users-ever-care-ly y'rs - tim From mal@lemburg.com Wed May 10 08:47:43 2000 From: mal@lemburg.com (M.-A.
Lemburg) Date: Wed, 10 May 2000 09:47:43 +0200 Subject: [Python-Dev] Re: [Patches] PC\config.[hc] changes for Win64 References: <200005090214.WAA22419@eric.cnri.reston.va.us> <3917D5D3.A8CD1B3E@lemburg.com> <20000509164536.A31366@activestate.com> Message-ID: <3919141F.89DC215E@lemburg.com> Trent Mick wrote: > > On Tue, May 09, 2000 at 11:09:40AM +0200, M.-A. Lemburg wrote: > > Just curious, what's the output of platform.py on Win64? > > (You can download platform.py from my Python Pages.) > > I get the following: > > """ > The system cannot find the path specified Hmm, this probably originates from platform.py trying to find the "file" command which is used on Unix. > win64-32bit Now this looks interesting ... 32-bit Win64 ;-) > """ > > Sorry, I did not hunt down the "path" error message. > > Trent -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido@python.org Wed May 10 17:52:49 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 10 May 2000 12:52:49 -0400 Subject: [Python-Dev] Re: [Python-checkins] CVS: python/dist/src/Tools/idle browser.py,NONE,1.1 In-Reply-To: Your message of "Wed, 10 May 2000 12:47:30 EDT." <200005101647.MAA30408@seahag.cnri.reston.va.us> References: <200005101647.MAA30408@seahag.cnri.reston.va.us> Message-ID: <200005101652.MAA28936@eric.cnri.reston.va.us> Fred, "browser" is a particularly non-descriptive name for this module. Perhaps it's not too late to rename it to e.g. "BrowserControl"? --Guido van Rossum (home page: http://www.python.org/~guido/) From trentm@activestate.com Wed May 10 21:14:46 2000 From: trentm@activestate.com (Trent Mick) Date: Wed, 10 May 2000 13:14:46 -0700 Subject: [Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix In-Reply-To: <000201bfba3b$a74ad7c0$022d153f@tim> References: <20000509162504.A31192@activestate.com> <000201bfba3b$a74ad7c0$022d153f@tim> Message-ID: <20000510131446.A25926@activestate.com> On Wed, May 10, 2000 at 12:53:16AM -0400, Tim Peters wrote: > [Trent Mick] > > Discussion: > > > > Okay, it is debatable to call float_hash and complex_hash broken, > > but their code presumed that sizeof(long) was 32 bits. As a result > > the hashed values for floats and complex values were not the same > > on a 64-bit *nix system as on a 32-bit *nix system. With this > > patch they are. > > The goal is laudable but the analysis seems flawed. For example, this new > comment: Firstly, I should have admitted my ignorance with regards to hash functions. > Looks to me like the real problem in the original was here: > > x = hipart + (long)fractpart + (long)intpart + (expo << 15);
>                                ^^^^^^^^^^^^^
> > The difficulty is that intpart may *not* fit in 32 bits, so the cast of > intpart to long is ill-defined when sizeof(long) == 4. > > That is, the hash function truly is broken for "large" values with a > fractional part, and I expect your after-patch code suffers the same > problem: Yes it did. > The > solution to this is to break intpart in this branch into pieces no larger > than 32 bits too Okay here is another try (only for floatobject.c) for discussion. If it looks good then I will submit a patch for float and complex objects. So do the same for 'intpart' as was done for 'fractpart'.
static long
float_hash(v)
    PyFloatObject *v;
{
    double intpart, fractpart;
    long x;

    fractpart = modf(v->ob_fval, &intpart);

    if (fractpart == 0.0) {
        // ... snip ...
    }
    else {
        int expo;
        long hipart;

        fractpart = frexp(fractpart, &expo);
        fractpart = fractpart * 2147483648.0;
        hipart = (long)fractpart;
        fractpart = (fractpart - (double)hipart) * 2147483648.0;

        x = hipart + (long)fractpart + (expo << 15); /* combine the fract parts */

        intpart = frexp(intpart, &expo);
        intpart = intpart * 2147483648.0;
        hipart = (long)intpart;
        intpart = (intpart - (double)hipart) * 2147483648.0;

        x += hipart + (long)intpart + (expo << 15); /* add in the int parts */
    }
    if (x == -1)
        x = -2;
    return x;
}

> Note this consequence under the Win32 Python: With this change, on Linux32:

>>> base = 2.**40 + 0.5
>>> base
1099511627776.5
>>> for i in range(32, 45):
...     x = base + 2.**i
...     print x, hash(x)
...
1.10380659507e+12 -2141945856
1.10810156237e+12 -2137751552
1.11669149696e+12 -2129362944
1.13387136614e+12 -2112585728
1.16823110451e+12 -2079031296
1.23695058125e+12 -2011922432
1.37438953472e+12 -1877704704
1.64926744166e+12 -1609269248
2.19902325555e+12 -2146107392
3.29853488333e+12 -1609236480
5.49755813888e+12 -1877639168
9.89560464998e+12 -2011824128
1.86916976722e+13 -2078900224

On Linux64:

>>> base = 2.**40 + 0.5
>>> base
1099511627776.5
>>> for i in range(32, 45):
...     x = base + 2.**i
...     print x, hash(x)
...
1.10380659507e+12 2153021440
1.10810156237e+12 2157215744
1.11669149696e+12 2165604352
1.13387136614e+12 2182381568
1.16823110451e+12 2215936000
1.23695058125e+12 2283044864
1.37438953472e+12 2417262592
1.64926744166e+12 2685698048
2.19902325555e+12 2148859904
3.29853488333e+12 2685730816
5.49755813888e+12 2417328128
9.89560464998e+12 2283143168
1.86916976722e+13 2216067072

> -- and that should also fix your 64-bit woes "by magic". As you can see it did not, but for another reason. The summation of the parts overflows 'x'. Is this a problem? I.e., does it matter if a hash function returns an overflowed integral value (my hash function ignorance is showing)? And if this does not matter, does it matter that a hash returns different values on different platforms? > a hash function should never ignore any bit in its input. Which brings up a question regarding instance_hash(), func_hash(), meth_hash(), HKEY_hash() [or whatever it is called], and others which cast a pointer to a long (discarding the upper half of the pointer on Win64). Do these really need to be fixed? Am I nitpicking too much on this whole thing? Thanks, Trent -- Trent Mick trentm@activestate.com From tim_one@email.msn.com Thu May 11 05:13:29 2000 From: tim_one@email.msn.com (Tim Peters) Date: Thu, 11 May 2000 00:13:29 -0400 Subject: [Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix In-Reply-To: <20000510131446.A25926@activestate.com> Message-ID: <000b01bfbaff$43d320c0$2aa0143f@tim> [Trent Mick] > ... > Okay here is another try (only for floatobject.c) for discussion. > If it looks good then I will submit a patch for float and complex > objects. So do the same for 'intpart' as was done for 'fractpart'. > > > static long > float_hash(v) > PyFloatObject *v; > { > double intpart, fractpart; > long x; > > fractpart = modf(v->ob_fval, &intpart); > > if (fractpart == 0.0) { > // ... snip ... > } > else { > int expo; > long hipart; > > fractpart = frexp(fractpart, &expo); > fractpart = fractpart * 2147483648.0; It's OK to use "*=" in C. Would like a comment that this is 2**31 (which makes the code obvious instead of mysterious).
Another approach would be to play with the bits directly, via casting
tricks. But then you have to wrestle with platform crap like endianness.

>         hipart = (long)fractpart;
>         fractpart = (fractpart - (double)hipart) * 2147483648.0;
>
>         x = hipart + (long)fractpart + (expo << 15); /* combine
>                                                         the fract parts */
>
>         intpart = frexp(intpart, &expo);
>         intpart = intpart * 2147483648.0;
>         hipart = (long)intpart;
>         intpart = (intpart - (double)hipart) * 2147483648.0;
>
>         x += hipart + (long)intpart + (expo << 15); /* add in the
>                                                        int parts */

There's no point adding in (expo << 15) a second time.

> With this change, on Linux32:
> ...
> >>> base = 2.**40 + 0.5
> >>> base
> 1099511627776.5
> >>> for i in range(32, 45):
> ...     x = base + 2.**i
> ...     print x, hash(x)
> ...
> 1.10380659507e+12 -2141945856
> 1.10810156237e+12 -2137751552
> 1.11669149696e+12 -2129362944
> 1.13387136614e+12 -2112585728
> 1.16823110451e+12 -2079031296
> 1.23695058125e+12 -2011922432
> 1.37438953472e+12 -1877704704
> 1.64926744166e+12 -1609269248
> 2.19902325555e+12 -2146107392
> 3.29853488333e+12 -1609236480
> 5.49755813888e+12 -1877639168
> 9.89560464998e+12 -2011824128
> 1.86916976722e+13 -2078900224
>
>
> On Linux64:
>
> >>> base = 2.**40 + 0.5
> >>> base
> 1099511627776.5
> >>> for i in range(32, 45):
> ...     x = base + 2.**i
> ...     print x, hash(x)
> ...
> 1.10380659507e+12 2153021440
> 1.10810156237e+12 2157215744
> 1.11669149696e+12 2165604352
> 1.13387136614e+12 2182381568
> 1.16823110451e+12 2215936000
> 1.23695058125e+12 2283044864
> 1.37438953472e+12 2417262592
> 1.64926744166e+12 2685698048
> 2.19902325555e+12 2148859904
> 3.29853488333e+12 2685730816
> 5.49755813888e+12 2417328128
> 9.89560464998e+12 2283143168
> 1.86916976722e+13 2216067072

> >-- and that should also fix your 64-bit woes "by magic".

> As you can see it did not, but for another reason.

I read your original complaint as that hash(double) yielded different
results between two *64* bit platforms (Linux64 vs Win64), but what you
showed above appears to be a comparison between a 64-bit platform and a
32-bit platform, and where presumably sizeof(long) is 8 on the former
but 4 on the latter. If so, of *course* results may be different: hash
returns a C long, and they're different sizes across these platforms.

In any case, the results above aren't really different!

>>> hex(-2141945856)   # 1st result from Linux32
'0x80548000'
>>> hex(2153021440L)   # 1st result from Linux64
'0x80548000L'
>>>

That is, the bits are the same. How much more do you want from me <wink>?

> The summation of the parts overflows 'x'. Is this a problem? I.e., does
> it matter if a hash function returns an overflowed integral value (my
> hash function ignorance is showing)?

Overflow generally doesn't matter.
In fact, it's usual <wink>; e.g., the hash for strings iterates over

    x = (1000003*x) ^ *p++;

and overflows madly. The saving grace is that C defines integer overflow
in such a way that losing the high bits on every operation yields the
same result as if the entire result were computed to infinite precision
and the high bits tossed only at the end. So overflow doesn't prevent
this from being as reproducible as possible, given that Python's int
size is different. Overflow can be avoided by using xor instead of
addition, but addition is generally preferred because it helps to
"scramble" the bits a little more.

> And if this does not matter, does it matter that a hash returns different
> values on different platforms?

No, and it doesn't always stay the same from release to release on a
single platform. For example, your patch above will change hash(double)
on Win32!

>> a hash function should never ignore any bit in its input.

> Which brings up a question regarding instance_hash(), func_hash(),
> meth_hash(), HKEY_hash() [or whatever it is called], and others which
> cast a pointer to a long (discarding the upper half of the pointer on
> Win64). Do these really need to be fixed? Am I nitpicking too much on
> this whole thing?

I have to apologize (although only semi-sincerely) for not being meaner
about this when I did the first 64-bit port. I did that for my own use,
and avoided the problem areas rather than fix them. But unless a
language dies, you end up paying for every hole in the end, and the
sooner they're plugged the less it costs.

That is, no, you're not nitpicking too much! Everyone else probably
thinks you are <wink>, *but*, they're not running on 64-bit platforms
yet so these issues are still invisible to their gut radar. I'll bet
your life that every hole remaining will trip up an end user eventually
-- and they're the ones least able to deal with the "mysterious
problems".

From guido@python.org Thu May 11 14:01:10 2000
From: guido@python.org (Guido van Rossum)
Date: Thu, 11 May 2000 09:01:10 -0400
Subject: [Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix
In-Reply-To: Your message of "Thu, 11 May 2000 00:13:29 EDT." <000b01bfbaff$43d320c0$2aa0143f@tim>
References: <000b01bfbaff$43d320c0$2aa0143f@tim>
Message-ID: <200005111301.JAA00512@eric.cnri.reston.va.us>

I have to admit I have no clue about the details of this debate any
more, and I'm cowardly awaiting a patch submission that Tim approves of.
(I'm hoping a day will come when Tim can check it in himself. :-)

In the mean time, I'd like to emphasize the key invariant here: we must
ensure that (a==b) => (hash(a)==hash(b)).

One quick way to deal with this could be the following pseudo C:

    PyObject *double_hash(double x)
    {
        long l = (long)x;
        if ((double)l == x)
            return long_hash(l);
        ...double-specific code...
    }

This code makes one assumption: that if there exists a long l equal to
a double x, the cast (long)x should yield l...

--Guido van Rossum (home page: http://www.python.org/~guido/)
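To make the invariant concrete, a small check (an illustration, not part
of the thread) that -- per Tim's earlier remark -- already passes under
the then-current code for exact integral values:

    # (a == b) must imply hash(a) == hash(b), across int, long and float.
    for a, b in [(1, 1.0), (1L, 1.0), (2L ** 40, 2. ** 40)]:
        assert a == b
        assert hash(a) == hash(b)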
From trentm@activestate.com Thu May 11 23:14:45 2000
From: trentm@activestate.com (Trent Mick)
Date: Thu, 11 May 2000 15:14:45 -0700
Subject: [Python-Dev] testing the C API in the test suite (was: bug in PyLong_FromLongLong (PR#324))
In-Reply-To: <200005111323.JAA00637@eric.cnri.reston.va.us>
References: <200005111323.JAA00637@eric.cnri.reston.va.us>
Message-ID: <20000511151445.B15936@activestate.com>

> Date: Wed, 10 May 2000 15:37:30 -0400
> From: Thomas.Malik@t-online.de
> To: python-bugs-list@python.org
> cc: bugs-py@python.org
> Subject: [Python-bugs-list] bug in PyLong_FromLongLong (PR#324)
>
> Full_Name: Thomas Malik
> Version: 1.5.2
> OS: all
> Submission from: p3e9ed447.dip.t-dialin.net (62.158.212.71)
>
>
> there's a bug in PyLong_FromLongLong, resulting in truncation of
> negative 64 bit integers. PyLong_FromLongLong starts with:
>
>     if( ival <= (LONG_LONG)LONG_MAX ) {
>         return PyLong_FromLong( (long)ival );
>     }
>     else if( ival <= (unsigned LONG_LONG)ULONG_MAX ) {
>         return PyLong_FromUnsignedLong( (unsigned long)ival );
>     }
>     else {
>     ....
>
> Now, if ival is smaller than -LONG_MAX, it falls outside the long
> integer range (being a 64 bit negative integer), but gets handled by
> the first if-then case in above code ('cause it is, of course, smaller
> than LONG_MAX). This results in truncation of the 64 bit negative
> integer to a more or less arbitrary 32 bit number. The way to fix it
> is to compare the absolute value of ival against LONG_MAX in the first
> condition. The second condition (ULONG_MAX) must, at least, check
> whether ival is positive.

To test this error I found the easiest way was to make a C extension
module to Python that called the C API functions under test directly. I
can't quickly think of a way I could have shown this error *clearly* at
the Python level without a specialized extension module. This has been
true for other things that I have been testing.

Would it make sense to create a standard extension module (called
'__test' or something like that) in which direct tests on the C API
could be made? This would be hooked into the standard test suite via a
test_capi.py that would:

- import __test
- run every exported function in __test (or every one starting with
  'test_', or whatever)
- the ImportError could continue to be used to signify skipping, etc.
  (although I think that a new, more explicit TestSuiteError class
  would be more appropriate and clear)

Does something like this already exist that I am missing?

This would make testing some things a lot easier, and clearer. Where
some interface is exposed to the Python programmer it is appropriate to
test it at the Python level. Python also provides a C API and it would
be appropriate to test that at the C level. I would like to hear some
people's thoughts before I go off and put anything together.

Thanks,
Trent

--
Trent Mick
trentm@activestate.com
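For concreteness, a sketch of the kind of '__test' module Trent
describes (all names here are hypothetical; Python's real C API test
machinery came later). Each exported function exercises one C API call
directly and raises on failure:

    #include "Python.h"

    static PyObject *
    test_longlong_roundtrip(PyObject *self, PyObject *args)
    {
    #ifdef HAVE_LONG_LONG
        LONG_LONG v = -42;
        PyObject *o;

        if (!PyArg_ParseTuple(args, ""))
            return NULL;
        o = PyLong_FromLongLong(v);
        if (o == NULL)
            return NULL;
        if (PyLong_AsLongLong(o) != v) {
            Py_DECREF(o);
            PyErr_SetString(PyExc_AssertionError,
                        "PyLong_FromLongLong truncated a negative value");
            return NULL;
        }
        Py_DECREF(o);
    #endif
        Py_INCREF(Py_None);
        return Py_None;
    }

    static PyMethodDef test_methods[] = {
        {"test_longlong_roundtrip", test_longlong_roundtrip, 1},
        {NULL, NULL}
    };

    void
    init__test()
    {
        Py_InitModule("__test", test_methods);
    }

A test_capi.py driver would then just import __test and call every
exported 'test_' function, turning ImportError into a skip.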
From DavidA@ActiveState.com Thu May 11 23:16:43 2000
From: DavidA@ActiveState.com (David Ascher)
Date: Thu, 11 May 2000 15:16:43 -0700
Subject: [Python-Dev] c.l.p.announce

What's the status of comp.lang.python.announce and the 'reviving'
thereof?

--david

From tim_one@email.msn.com Fri May 12 03:58:35 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Thu, 11 May 2000 22:58:35 -0400
Subject: [Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix
In-Reply-To: <200005111301.JAA00512@eric.cnri.reston.va.us>
Message-ID: <000001bfbbbd$f74572c0$9ca2143f@tim>

[Guido]
> I have to admit I have no clue about the details of this debate any
> more,

Na, there's no debate here. I believe I confused things by
misunderstanding what Trent's original claim was (sorry, Trent!), but we
bumped into real flaws in the current hash anyway (even on 32-bit
machines). I don't think there's any actual disagreement about anything
here.

> and I'm cowardly awaiting a patch submission that Tim approves
> of.

As am I <wink>.

> (I'm hoping a day will come when Tim can check it in himself. :-)

Well, all you have to do to make that happen is get a real job and then
hire me <wink>.

> In the mean time, I'd like to emphasize the key invariant here: we
> must ensure that (a==b) => (hash(a)==hash(b)).

Absolutely. That's already true, and is so non-controversial that Trent
elided ("...") the code for that in his last post.

> One quick way to deal with this could be the following pseudo C:
>
>     PyObject *double_hash(double x)
>     {
>         long l = (long)x;
>         if ((double)l == x)
>             return long_hash(l);
>         ...double-specific code...
>     }
>
> This code makes one assumption: that if there exists a long l equal to
> a double x, the cast (long)x should yield l...

No, that fails on two counts:

1. If x is "too big" to fit in a long (and a great many doubles are),
the cast to long is undefined. Don't know about all current platforms,
but on the KSR platform such casts raised a fatal hardware exception.
The current code already accomplishes this part in a safe way (which
Trent's patch improves by using a symbol instead of the current
hard-coded hex constant).

2. The key invariant needs to be preserved also when x is an exact
integral value that happens to be (possibly very!) much bigger than a C
long; e.g.,

>>> long(1.23e300)   # 1.23e300 is an integer! albeit not the one you think
12299999999999999456195024356787918820614965027709909500456844293279
60298864608335541984218516600989160291306221939122973741400364055485
57167627474369519296563706976894811817595986395177079943535811102573
51951343133141138298152217970719263233891682157645730823560232757272
73837119288529943287157489664L
>>> hash(1.23e300) == hash(_)
1
>>>

The current code already handles that correctly too. All the problems
occur when the double has a non-zero fractional part, and Trent knows
how to fix that now.

hash(x) may differ across platforms because sizeof(long) differs across
platforms, but that's just as true of strings as floats (i.e., Python
has never computed platform-independent hashes -- if that bothers *you*
(doesn't bother me), that's the part you should chime in on).

From guido@python.org Fri May 12 13:24:25 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 12 May 2000 08:24:25 -0400
Subject: [Python-Dev] c.l.p.announce
In-Reply-To: Your message of "Thu, 11 May 2000 15:16:43 PDT."
Message-ID: <200005121224.IAA06063@eric.cnri.reston.va.us>

> What's the status of comp.lang.python.announce and the 'reviving' thereof?

Good question. Several of us here at CNRI have volunteered to become
moderators. I think we may have to start faking Approved: headers in
the mean time...
(I wonder if we can make posts to python-announce@python.com be
forwarded to c.l.py.a with such a header automatically tacked on?)

--Guido van Rossum (home page: http://www.python.org/~guido/)

From mal@lemburg.com Fri May 12 14:43:37 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 May 2000 15:43:37 +0200
Subject: [Python-Dev] Unicode and its partners...
Message-ID: <391C0A89.819A33EA@lemburg.com>

It got a little silent around the 7-bit vs. 8-bit vs. UTF-8 discussion.

Not that I would like it to restart (I think everybody has made their
point), but it kind of surprised me that now with the ability to
actually set the default string encoding at run-time, noone seems to
have played around with it...

>>> import sys
>>> sys.set_string_encoding('unicode-escape')
>>> "abcäöü" + u"abc"
u'abc\344\366\374abc'
>>> "abcäöü\u1234" + u"abc"
u'abc\344\366\374\u1234abc'
>>> print "abcäöü\u1234" + u"abc"
abc\344\366\374\u1234abc

Any takers ?

BTW, has anyone tried to use the codec design for other tasks than
converting text ? It should also be usable for e.g.
compressing/decompressing or other data oriented content.

--
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
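On that codec question, a sketch of a data-oriented codec (this is not
from MAL's message, and assumes the 1.6 registry API in which the search
function returns an (encoder, decoder, streamreader, streamwriter)
tuple):

    import codecs, zlib

    def zlib_encode(input, errors='strict'):
        return zlib.compress(input), len(input)

    def zlib_decode(input, errors='strict'):
        return zlib.decompress(input), len(input)

    def search(name):
        if name == 'zlib':
            return (zlib_encode, zlib_decode, None, None)
        return None

    codecs.register(search)
    encode, decode = codecs.lookup('zlib')[:2]
    packed = encode('spam ' * 100)[0]
    print decode(packed)[0] == 'spam ' * 100    # prints 1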
From Fredrik Lundh
Message-ID: <026901bfbc1d$efe06fc0$34aab5d4@hagrid>

M.-A. Lemburg wrote:
> It got a little silent around the 7-bit vs. 8-bit vs. UTF-8
> discussion.

that's only because I've promised Guido to prepare SRE for the
next alpha, before spending more time trying to get this one
done right ;-)

and as usual, the last 10% takes 90% of the effort :-(

From akuchlin@mems-exchange.org Fri May 12 15:27:21 2000
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Fri, 12 May 2000 10:27:21 -0400 (EDT)
Subject: [Python-Dev] c.l.p.announce
In-Reply-To: <200005121224.IAA06063@eric.cnri.reston.va.us>
References: <200005121224.IAA06063@eric.cnri.reston.va.us>
Message-ID: <14620.5321.510321.341870@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>(I wonder if we can make posts to python-announce@python.com be
>forwarded to c.l.py.a with such a header automatically tacked on?)

Probably not a good idea; if the e-mail address is on the Web site, it
probably gets a certain amount of spam that would need to be filtered
out.

--amk

From guido@python.org Fri May 12 15:31:55 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 12 May 2000 10:31:55 -0400
Subject: [Python-Dev] c.l.p.announce
In-Reply-To: Your message of "Fri, 12 May 2000 10:27:21 EDT." <14620.5321.510321.341870@amarok.cnri.reston.va.us>
References: <200005121224.IAA06063@eric.cnri.reston.va.us> <14620.5321.510321.341870@amarok.cnri.reston.va.us>
Message-ID: <200005121431.KAA06538@eric.cnri.reston.va.us>

> Guido van Rossum writes:
> >(I wonder if we can make posts to python-announce@python.com be
> >forwarded to c.l.py.a with such a header automatically tacked on?)
>
> Probably not a good idea; if the e-mail address is on the Web site, it
> probably gets a certain amount of spam that would need to be filtered
> out.

OK, let's make it a moderated mailman mailing list; we can make everyone
on python-dev (who wants to) a moderator. Barry, is there an easy way
to add additional headers to messages posted by mailman to the news
gateway?

--Guido van Rossum (home page: http://www.python.org/~guido/)

From jcollins@pacificnet.net Fri May 12 16:39:28 2000
From: jcollins@pacificnet.net (Jeffery D. Collins)
Date: Fri, 12 May 2000 08:39:28 -0700
Subject: [Python-Dev] c.l.p.announce
References: <200005121224.IAA06063@eric.cnri.reston.va.us> <14620.5321.510321.341870@amarok.cnri.reston.va.us> <200005121431.KAA06538@eric.cnri.reston.va.us>
Message-ID: <391C25B0.EC327BCF@pacificnet.net>

I volunteer to moderate.

Jeff

Guido van Rossum wrote:
>
> > Guido van Rossum writes:
> > >(I wonder if we can make posts to python-announce@python.com be
> > >forwarded to c.l.py.a with such a header automatically tacked on?)
> >
> > Probably not a good idea; if the e-mail address is on the Web site, it
> > probably gets a certain amount of spam that would need to be filtered
> > out.
>
> OK, let's make it a moderated mailman mailing list; we can make
> everyone on python-dev (who wants to) a moderator. Barry, is there an
> easy way to add additional headers to messages posted by mailman to
> the news gateway?
>
> --Guido van Rossum (home page: http://www.python.org/~guido/)

From bwarsaw@python.org Fri May 12 16:41:01 2000
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Fri, 12 May 2000 11:41:01 -0400 (EDT)
Subject: [Python-Dev] c.l.p.announce
References: <200005121224.IAA06063@eric.cnri.reston.va.us> <14620.5321.510321.341870@amarok.cnri.reston.va.us> <200005121431.KAA06538@eric.cnri.reston.va.us>
Message-ID: <14620.9741.164735.998570@anthem.cnri.reston.va.us>

>>>>> "GvR" == Guido van Rossum writes:

    GvR> OK, let's make it a moderated mailman mailing list; we can
    GvR> make everyone on python-dev (who wants to) a moderator.
    GvR> Barry, is there an easy way to add additional headers to
    GvR> messages posted by mailman to the news gateway?

No, but I'll add that. It might be a little while before I push the
changes out to python.org; I've got a bunch of things I need to test
first.

-Barry

From mal@lemburg.com Fri May 12 16:47:55 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 May 2000 17:47:55 +0200
Subject: [Python-Dev] Landmark
Message-ID: <391C27AB.2F5339D6@lemburg.com>

While trying to configure an in-package Python interpreter I found that
the interpreter still uses 'string.py' as landmark for finding the
standard library.

Since string.py is being deprecated, I think we should consider a new
landmark (such as os.py) or maybe even a whole new strategy for finding
the standard lib location.

--
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/

From guido@python.org Fri May 12 20:04:50 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 12 May 2000 15:04:50 -0400
Subject: [Python-Dev] Landmark
In-Reply-To: Your message of "Fri, 12 May 2000 17:47:55 +0200." <391C27AB.2F5339D6@lemburg.com>
References: <391C27AB.2F5339D6@lemburg.com>
Message-ID: <200005121904.PAA08166@eric.cnri.reston.va.us>

> While trying to configure an in-package Python interpreter
> I found that the interpreter still uses 'string.py' as
> landmark for finding the standard library.

Oops.

> Since string.py is being deprecated, I think we should
> consider a new landmark (such as os.py) or maybe even a
> whole new strategy for finding the standard lib location.

I don't see a need for a new strategy, but I'll gladly accept patches
that look for os.py. Note that there are several versions of that code:
Modules/getpath.c, PC/getpathp.c, PC/os2vacpp/getpathp.c.

--Guido van Rossum (home page: http://www.python.org/~guido/)
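To see how small the change is, here is the landmark test in conceptual
form (a sketch only -- not the actual getpath.c code):

    #include <stdio.h>

    static char *landmark = "os.py";    /* was "string.py" */

    /* Does <dir>/lib/python<version>/<landmark> exist? */
    static int
    looks_like_prefix(char *dir)
    {
        char path[1024];        /* MAXPATHLEN in the real code */
        FILE *fp;

        sprintf(path, "%s/lib/python1.6/%s", dir, landmark);
        fp = fopen(path, "r");
        if (fp == NULL)
            return 0;
        fclose(fp);
        return 1;
    }

MAL's .pyo-only installation, mentioned later in the thread, suggests
any such check should probably accept os.pyc/os.pyo as well.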
From gmcm@hypernet.com Fri May 12 20:50:56 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Fri, 12 May 2000 15:50:56 -0400
Subject: [Python-Dev] Landmark
In-Reply-To: <200005121904.PAA08166@eric.cnri.reston.va.us>
References: Your message of "Fri, 12 May 2000 17:47:55 +0200." <391C27AB.2F5339D6@lemburg.com>
Message-ID: <1253961418-52039567@hypernet.com>

[MAL]
> > Since string.py is being deprecated, I think we should
> > consider a new landmark (such as os.py) or maybe even a
> > whole new strategy for finding the standard lib location.

[GvR]
> I don't see a need for a new strategy

I'll argue for (a choice of) new strategy. The getpath & friends code
spends a whole lot of time and energy trying to reverse engineer things
like developer builds and strange sys-admin pranks. I agree that code
shouldn't die. But it creates painful startup times when Python is
being used for something like CGI.

How about something on the command line that says (pick one or come up
with another choice):

- PYTHONPATH is *it*
- use PYTHONPATH and .pth files found
- start in /lib/python and add PYTHONPATH
- there's a .pth file with the whole list
- pretty much any permutation of the above elements

The idea being to avoid a few hundred system calls when a dozen or so
will suffice. Default behavior should still be to magically get it
right.

- Gordon

From guido@python.org Fri May 12 21:29:05 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 12 May 2000 16:29:05 -0400
Subject: [Python-Dev] Landmark
In-Reply-To: Your message of "Fri, 12 May 2000 15:50:56 EDT." <1253961418-52039567@hypernet.com>
References: Your message of "Fri, 12 May 2000 17:47:55 +0200." <391C27AB.2F5339D6@lemburg.com> <1253961418-52039567@hypernet.com>
Message-ID: <200005122029.QAA08252@eric.cnri.reston.va.us>

> [MAL]
> > > Since string.py is being deprecated, I think we should
> > > consider a new landmark (such as os.py) or maybe even a
> > > whole new strategy for finding the standard lib location.
> [GvR]
> > I don't see a need for a new strategy
>
> I'll argue for (a choice of) new strategy. The getpath & friends
> code spends a whole lot of time and energy trying to reverse
> engineer things like developer builds and strange sys-admin
> pranks. I agree that code shouldn't die. But it creates painful
> startup times when Python is being used for something like
> CGI.
>
> How about something on the command line that says (pick
> one or come up with another choice):
> - PYTHONPATH is *it*
> - use PYTHONPATH and .pth files found
> - start in /lib/python and add
> PYTHONPATH
> - there's a .pth file with the whole list
> - pretty much any permutation of the above elements
>
> The idea being to avoid a few hundred system calls when a
> dozen or so will suffice. Default behavior should still be to
> magically get it right.

I'm not keen on changing the meaning of PYTHONPATH, but if you're
willing and able to set an environment variable, you can set PYTHONHOME
and it will abandon the search. If you want a command line option for
CGI, an option to set PYTHONHOME makes sense.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From weeks@golden.dtc.hp.com Fri May 12 21:29:52 2000
From: weeks@golden.dtc.hp.com ( (Greg Weeks))
Date: Fri, 12 May 2000 13:29:52 -0700
Subject: [Python-Dev] "is", "==", and sameness
Message-ID: <200005122029.AA126653392@golden.dtc.hp.com>

>From the Python Reference Manual [emphasis added]:

    Types affect almost all aspects of object behavior.
    Even the importance of object IDENTITY is affected in some sense:
    for immutable types, operations that compute new values may
    actually return a reference to any existing object with the same
    type and value, while for mutable objects this is not allowed.

This seems to be saying that two immutable objects are (in some sense)
the same iff they have the same type and value, while two mutable
objects are the same iff they have the same id(). I heartily agree, and
I think that this notion of sameness is the single most useful variant
of the "equals" relation.

Indeed, I think it worthwhile to consider modifying the "is" operator
to compute this notion of sameness. (This would break only exceedingly
strange user code.) "is" would then be the natural comparator of
dictionary keys, which could then be any object.

The usefulness of this idea is limited by the absence of user-definable
immutable instances. It might be nice to be able to declare a class --
eg, Point -- to have immutable instances. This declaration would
promise that:

1. When the expression Point(3.0,4.0) is evaluated, its reference count
   will be zero.

2. After Point(3.0,4.0) is evaluated, its attributes will not be
   changed.

I sent the above thoughts to Guido, who graciously and politely
responded that they struck him as somewhere between bad and poorly
presented. (Which surprised me. I would have guessed that the ideas
were already in his head.) Nevertheless, he mentioned passing them
along to you, so I have.

Regards,
Greg

From gmcm@hypernet.com Fri May 12 23:05:46 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Fri, 12 May 2000 18:05:46 -0400
Subject: [Python-Dev] "is", "==", and sameness
In-Reply-To: <200005122029.AA126653392@golden.dtc.hp.com>
Message-ID: <1253953328-52526193@hypernet.com>

Greg Weeks wrote:
> >From the Python Reference Manual [emphasis added]:
>
>     Types affect almost all aspects of object behavior. Even the
>     importance of object IDENTITY is affected in some sense: for
>     immutable types, operations that compute new values may
>     actually return a reference to any existing object with the
>     same type and value, while for mutable objects this is not
>     allowed.
>
> This seems to be saying that two immutable objects are (in some
> sense) the same iff they have the same type and value, while two
> mutable objects are the same iff they have the same id(). I
> heartily agree, and I think that this notion of sameness is the
> single most useful variant of the "equals" relation.

Notice the "may" in the reference text.

>>> 88 + 11 is 98 + 1
1
>>> 100 + 3 is 101 + 2
0
>>>

Python goes to the effort of keeping singleton instances of the
integers less than 100. In certain situations, a similar effort is
invested in strings. But it is by no means the general case, and
(unless you've got a solution) it would be expensive to make it so.

> Indeed, I think it worthwhile to consider modifying the "is"
> operator to compute this notion of sameness. (This would break
> only exceedingly strange user code.) "is" would then be the
> natural comparator of dictionary keys, which could then be any
> object.

The implications don't follow. The restriction that dictionary keys be
immutable is not because of the comparison method. It's the principle
of "least surprise". Use a mutable object as a dict key. Now mutate the
object. Now the key / value pair in the dictionary is inaccessible.
That is, there is some pair (k,v) in dict.items() where dict[k] does
not yield v.
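An illustration of the trap Gordon describes (the example is not from
his message): give a mutable class a value-based hash, and the dict
entry becomes unreachable as soon as the key mutates.

    class Point:
        def __init__(self, x, y):
            self.x = x
            self.y = y
        def __hash__(self):
            return hash((self.x, self.y))   # hash follows mutable state
        def __cmp__(self, other):
            return cmp((self.x, self.y), (other.x, other.y))

    k = Point(1, 2)
    d = {k: 'value'}
    print d[Point(1, 2)]  # 'value' -- an equal key finds the entry
    k.x = 99              # now mutate the key in place
    print d.items()       # the (k, 'value') pair is still in there, but:
    d[Point(1, 2)]        # KeyError -- nothing compares equal any more
    d[Point(99, 2)]       # KeyError -- the stored hash no longer matches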
> The usefulness of this idea is limited by the absence of
> user-definable immutable instances. It might be nice to be able
> to declare a class -- eg, Point -- to have immutable
> instances. This declaration would promise that:
>
> 1. When the expression Point(3.0,4.0) is evaluated, its
>    reference count will be zero.

That's a big change from the way Python works:

>>> sys.getrefcount(None)
167
>>>

> 2. After Point(3.0,4.0) is evaluated, its attributes will not be
>    changed.

You can make an instance effectively immutable (by messing with
__setattr__). You can override __hash__ to return something suitable
(eg, hash(id(self))), and then use an instance as a dict key. You don't
even need to do the first to do the latter.

- Gordon
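Both tricks in one place (again an illustration, not Gordon's code):
__setattr__ blocks rebinding, and an identity-based __hash__ makes
instances usable as dict keys regardless of their attribute values.

    class Frozen:
        def __init__(self, x, y):
            self.__dict__['x'] = x    # sneak past our own __setattr__
            self.__dict__['y'] = y
        def __setattr__(self, name, value):
            raise TypeError, 'Frozen instances are immutable'
        def __hash__(self):
            return hash(id(self))

    p = Frozen(3.0, 4.0)
    d = {p: 'ok'}       # fine as a dict key
    p.x = 1.0           # raises TypeError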
From mal@lemburg.com Fri May 12 22:25:02 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 May 2000 23:25:02 +0200
Subject: [Python-Dev] Landmark
References: Your message of "Fri, 12 May 2000 17:47:55 +0200." <391C27AB.2F5339D6@lemburg.com> <1253961418-52039567@hypernet.com> <200005122029.QAA08252@eric.cnri.reston.va.us>
Message-ID: <391C76AE.A3118AF1@lemburg.com>

Guido van Rossum wrote:
> [Gordon]
> > [MAL]
> > > > Since string.py is being deprecated, I think we should
> > > > consider a new landmark (such as os.py) or maybe even a
> > > > whole new strategy for finding the standard lib location.
> > [GvR]
> > > I don't see a need for a new strategy
> >
> > I'll argue for (a choice of) new strategy.
>
> I'm not keen on changing the meaning of PYTHONPATH, but if you're
> willing and able to set an environment variable, you can set
> PYTHONHOME and it will abandon the search. If you want a command line
> option for CGI, an option to set PYTHONHOME makes sense.

The routines will still look for the landmark though (which is what
surprised me and made me look deeper -- setting PYTHONHOME didn't work
for me because I had only .pyo files in the lib/python1.5 dir).

Perhaps Python should put more trust into the setting of
PYTHONHOME ?!

[And of course the landmark should change to something like os.py --
I'll try to submit a patch for this.]

--
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/

From guido@python.org Sat May 13 01:53:27 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 12 May 2000 20:53:27 -0400
Subject: [Python-Dev] Landmark
In-Reply-To: Your message of "Fri, 12 May 2000 23:25:02 +0200." <391C76AE.A3118AF1@lemburg.com>
References: Your message of "Fri, 12 May 2000 17:47:55 +0200." <391C27AB.2F5339D6@lemburg.com> <1253961418-52039567@hypernet.com> <200005122029.QAA08252@eric.cnri.reston.va.us> <391C76AE.A3118AF1@lemburg.com>
Message-ID: <200005130053.UAA08687@eric.cnri.reston.va.us>

[me]
> > I'm not keen on changing the meaning of PYTHONPATH, but if you're
> > willing and able to set an environment variable, you can set
> > PYTHONHOME and it will abandon the search. If you want a command line
> > option for CGI, an option to set PYTHONHOME makes sense.

[MAL]
> The routines will still look for the landmark though (which
> is what surprised me and made me look deeper -- setting
> PYTHONHOME didn't work for me because I had only .pyo files
> in the lib/python1.5 dir).
>
> Perhaps Python should put more trust into the setting of
> PYTHONHOME ?!

Yes! Note that PC/getpathp.c already trusts PYTHONHOME 100% --
Modules/getpath.c should follow suit.

> [And of course the landmark should change to something like
> os.py -- I'll try to submit a patch for this.]

Maybe you can combine the two?

--Guido van Rossum (home page: http://www.python.org/~guido/)

From Fredrik Lundh
Date: Sat, 13 May 2000 14:56:41 +0200
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
Message-ID: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>

in the current 're' engine, a newline is chr(10) and nothing
else.

however, in the new unicode aware engine, I used the new
LINEBREAK predicate instead, but it turned out to break one
of the tests in the current test suite:

    sre.match('a\rb', 'a.b') => None

(unicode adds chr(13), chr(28), chr(29), chr(30), and also
unichr(133), unichr(8232), and unichr(8233) to the list of
line breaking codes)

what's the best way to deal with this? I see three alter-
natives:

a) stick to the old definition, and use chr(10) also for
   unicode strings

b) use different definitions for 8-bit strings and unicode
   strings; if given an 8-bit string, use chr(10); if given
   a 16-bit string, use the LINEBREAK predicate.

c) use LINEBREAK in either case.

I think (c) is the "right thing", but it's the only one that
may break existing code...

From bckfnn@worldonline.dk Sat May 13 14:47:10 2000
From: bckfnn@worldonline.dk (Finn Bock)
Date: Sat, 13 May 2000 13:47:10 GMT
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
In-Reply-To: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>
References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>
Message-ID: <391d5b7f.3713359@smtp.worldonline.dk>

On Sat, 13 May 2000 14:56:41 +0200, you wrote:

>in the current 're' engine, a newline is chr(10) and nothing
>else.
>
>however, in the new unicode aware engine, I used the new
>LINEBREAK predicate instead, but it turned out to break one
>of the tests in the current test suite:
>
>    sre.match('a\rb', 'a.b') => None
>
>(unicode adds chr(13), chr(28), chr(29), chr(30), and also
>unichr(133), unichr(8232), and unichr(8233) to the list of
>line breaking codes)
>
>what's the best way to deal with this? I see three alter-
>natives:
>
>a) stick to the old definition, and use chr(10) also for
>   unicode strings

In the ORO matcher that comes with jpython, the dot matches all but
chr(10). But that is bad IMO. Unicode should use the LINEBREAK
predicate.

regards,
finn

From Fredrik Lundh
Subject: [Python-Dev] for the todo list: cStringIO uses string.joinfields
Message-ID: <00a101bfbce5$91dbd860$34aab5d4@hagrid>

the O_writelines function in Modules/cStringIO contains the
following code:

    if (!string_joinfields) {
        UNLESS(string_module = PyImport_ImportModule("string")) {
            return NULL;
        }
        UNLESS(string_joinfields=
               PyObject_GetAttrString(string_module, "joinfields")) {
            return NULL;
        }
        Py_DECREF(string_module);
    }

I suppose someone should fix this some day...

(btw, the C API reference implies that ImportModule doesn't
use import hooks. does that mean that cStringIO doesn't work
under e.g. Gordon's installer?)

From Fredrik Lundh
Subject: [Python-Dev] cvs for dummies
Message-ID: <000d01bfbce8$a3466f40$34aab5d4@hagrid>

what's the best way to make sure that a "cvs update" really brings
everything up to date, even if you've accidentally changed some-
thing in your local workspace?

From Moshe Zadka Sat May 13 15:58:17 2000
From: Moshe Zadka (Moshe Zadka)
Date: Sat, 13 May 2000 17:58:17 +0300 (IDT)
Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?
In-Reply-To: <002301bfbcda$afbdb0c0$34aab5d4@hagrid>

On Sat, 13 May 2000, Fredrik Lundh wrote:

> what's the best way to deal with this? I see three alter-
> natives:
>
> a) stick to the old definition, and use chr(10) also for
>    unicode strings

If we also supply a \something (is \l taken?) for LINEBREAK, people can
then use [^\l] if they need a Unicode line break.
Just a point for a way to do a thing close to rightness and still not
break code.

--
Moshe Zadka
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com

From fdrake@acm.org Sat May 13 16:22:12 2000
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Sat, 13 May 2000 11:22:12 -0400 (EDT)
Subject: [Python-Dev] cvs for dummies
In-Reply-To: <000d01bfbce8$a3466f40$34aab5d4@hagrid>
References: <000d01bfbce8$a3466f40$34aab5d4@hagrid>
Message-ID: <14621.29476.390092.610442@newcnri.cnri.reston.va.us>

Fredrik Lundh writes:
> what's the best way to make sure that a "cvs update" really brings
> everything up to date, even if you've accidentally changed some-
> thing in your local workspace?

Delete the file(s) that got changed and cvs update again.

-Fred

--
Fred L. Drake, Jr.
Corporation for National Research Initiatives

From Fredrik Lundh
Subject: [Python-Dev] cvs for dummies
References: <000d01bfbce8$a3466f40$34aab5d4@hagrid> <14621.29476.390092.610442@newcnri.cnri.reston.va.us>
Message-ID: <001901bfbcef$d4672b80$34aab5d4@hagrid>

Fred L. Drake, Jr. wrote:
> Fredrik Lundh writes:
> > what's the best way to make sure that a "cvs update" really brings
> > everything up to date, even if you've accidentally changed some-
> > thing in your local workspace?
>
> Delete the file(s) that got changed and cvs update again.

okay, what's the best way to get a list of locally changed files?

(in this case, one file ended up with neat little <<<<<<< and >>>>>>
marks in it... several weeks and about a dozen CVS updates after I'd
touched it...)

From gmcm@hypernet.com Sat May 13 17:25:42 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Sat, 13 May 2000 12:25:42 -0400
Subject: ImportModule (was Re: [Python-Dev] for the todo list: cStringIO uses string.joinfields)
In-Reply-To: <00a101bfbce5$91dbd860$34aab5d4@hagrid>
Message-ID: <1253887332-56495837@hypernet.com>

Fredrik wrote:

> (btw, the C API reference implies that ImportModule doesn't
> use import hooks. does that mean that cStringIO doesn't work
> under e.g. Gordon's installer?)

You have to fool C code that uses ImportModule by doing an import first
in your Python code. It's the same for freeze. It's tiresome tracking
this stuff down. For example, to use shelve:

    # this is needed because of the use of __import__ in anydbm
    # (modulefinder does not follow __import__)
    import dbhash
    # the next 2 are needed because cPickle won't use our import
    # hook so we need them already in sys.modules when
    # cPickle starts
    import string
    import copy_reg
    # now it will work
    import shelve

Imagine the C preprocessor letting you do

    #define snarf #include

and then trying to use a dependency tracker.

- Gordon

From Fredrik Lundh

sigh. never resync the CVS repository until you've fixed all bugs in
your *own* code ;-)

in 1.5.2:

>>> array.array("h", [65535])
array('h', [-1])
>>> array.array("H", [65535])
array('H', [65535])

in the current CVS version:

>>> array.array("h", [65535])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: signed short integer is greater than maximum

okay, this might break some existing code -- but one can always argue
that such code was already broken. on the other hand:

>>> array.array("H", [65535])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: signed short integer is greater than maximum

oops. dunno if the right thing would be to add support for various
kinds of unsigned integers to Python/getargs.c, or to hack around this
in the array module...
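For the getargs.c route, the needed check is small. A sketch (the
function name is hypothetical; this is not the fix that was eventually
committed):

    #include <limits.h>

    /* Accept the full unsigned short range for the 'H' code instead of
     * funnelling it through the signed check.
     */
    static int
    get_ushort(long x, unsigned short *p)
    {
        if (x < 0 || x > (long)USHRT_MAX)
            return -1;              /* caller raises OverflowError */
        *p = (unsigned short)x;
        return 0;
    }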
From mhammond@skippinet.com.au Sat May 13 20:19:44 2000
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sun, 14 May 2000 05:19:44 +1000
Subject: [Python-Dev] cvs for dummies
In-Reply-To: <001901bfbcef$d4672b80$34aab5d4@hagrid>

> > Delete the file(s) that got changed and cvs update again.
>
> okay, what's the best way to get a list of locally changed files?

Diff the directory. Or better still, use WinCVS - nice little red icons
for the changed files.

> (in this case, one file ended up with neat little <<<<<<< and
> >>>>>> marks in it... several weeks and about a dozen CVS
> updates after I'd touched it...)

This happens when CVS can't manage to perform a successful merge. Your
original is still there, but with a funky name (in the same directory -
it should be obvious). WinCVS also makes this a little more obvious -
the icon has a special "conflict" indicator, and the console messages
also reflect the conflict in red.

Mark.

From tismer@tismer.com Sat May 13 21:32:45 2000
From: tismer@tismer.com (Christian Tismer)
Date: Sat, 13 May 2000 22:32:45 +0200
Subject: [Python-Dev] Re: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)
References: <391A3FD4.25C87CB4@san.rr.com> <8fe76b$684$1@newshost.accu.uu.nl> <8fh9ki$51h$1@slb3.atl.mindspring.net> <8fk4mh$i4$1@kopp.stud.ntnu.no>
Message-ID: <391DBBED.B252E597@tismer.com>

Magnus Lie Hetland wrote:
>
> Aahz Maruch wrote in message
> news:8fh9ki$51h$1@slb3.atl.mindspring.net...
> > In article, Ben Wolfson wrote:
> > >
> > > >', '.join(['foo', 'bar', 'baz'])
> >
> > This only works in Python 1.6, which is only released as an alpha at
> > this point. I suggest rather strongly that we avoid 1.6-specific idioms
> > until 1.6 gets released, particularly in relation to FAQ-type questions.
>
> This is indeed a bit strange IMO... If I were to join the elements of a
> list I would rather ask the list to do it than some string... I.e.
>
> ['foo', 'bar', 'baz'].join(', ')
>
> (...although it is the string that joins the elements in the resulting
> string...)

I believe the notion of "everything is an object, and objects provide
all their functionality" is a bit stressed in Python 1.6. The above
example touches the limits where I'd just say "OO isn't always the
right thing, and always OO is the wrong thing".

A clear advantage of 1.6's string methods is that much code becomes
shorter and easier to read, since the nesting level of braces is
reduced quite a bit. The notation also appears to be more in the order
in which actions are actually processed.
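As a concrete illustration of that nesting point (the example is not
from Christian's message):

    import string
    s = "  SOME  Raw  Text  "

    # 1.5.2 style: nested calls, read inside-out
    print string.join(string.split(string.lower(s)), "-")

    # 1.6 string methods: read left to right
    print "-".join(s.lower().split())

    # both print 'some-raw-text'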
Already a little strange is that the most string methods return new objects all the time, since strings are immutable. join is of really extreme design, and compared with other string functions which became more readable, I think it is counter-intuitive and not the way people are thinking. The think "I want to join this list by this string". Furthermore, you still have to import string, in order to use its constants. Instead of using a module with constants and functions, we now always have to refer to instances and use their methods. It has some benefits in simple cases. But if there are a number of different objects handled by a function, I think enforcing it to be a method of one of the objects is the wrong way, OO overdone. doing-OO-only-if-it-looks-natural-ly y'rs - chris -- Christian Tismer :^) Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com From guido@python.org Sat May 13 21:39:19 2000 From: guido@python.org (Guido van Rossum) Date: Sat, 13 May 2000 16:39:19 -0400 Subject: ImportModule (was Re: [Python-Dev] for the todo list: cStringIO uses string.joinfields) In-Reply-To: Your message of "Sat, 13 May 2000 12:25:42 EDT." <1253887332-56495837@hypernet.com> References: <1253887332-56495837@hypernet.com> Message-ID: <200005132039.QAA09114@eric.cnri.reston.va.us> > Fredrik wrote: > > > (btw, the C API reference implies that ImportModule doesn't > > use import hooks. does that mean that cStringIO doesn't work > > under e.g. Gordon's installer?) > > You have to fool C code that uses ImportModule by doing an > import first in your Python code. It's the same for freeze. It's > tiresome tracking this stuff down. For example, to use shelve: > > # this is needed because of the use of __import__ in anydbm > # (modulefinder does not follow __import__) > import dbhash > # the next 2 are needed because cPickle won't use our import > # hook so we need them already in sys.modules when > # cPickle starts > import string > import copy_reg > # now it will work > import shelve Hm, the way I read the code (but I didn't write it!) it calls PyImport_Import, which is a higher level function that *does* use the __import__ hook. Maybe this wasn't always the case? --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Sat May 13 21:43:32 2000 From: guido@python.org (Guido van Rossum) Date: Sat, 13 May 2000 16:43:32 -0400 Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak? In-Reply-To: Your message of "Sat, 13 May 2000 13:47:10 GMT." <391d5b7f.3713359@smtp.worldonline.dk> References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <391d5b7f.3713359@smtp.worldonline.dk> Message-ID: <200005132043.QAA09151@eric.cnri.reston.va.us> [Swede] > >in the current 're' engine, a newline is chr(10) and nothing > >else. > > > >however, in the new unicode aware engine, I used the new > >LINEBREAK predicate instead, but it turned out to break one > >of the tests in the current test suite: > > > > sre.match('a\rb', 'a.b') => None > > > >(unicode adds chr(13), chr(28), chr(29), chr(30), and also > >unichr(133), unichr(8232), and unichr(8233) to the list of > >line breaking codes) > > > >what's the best way to deal with this? 
> >natives:
> >
> >a) stick to the old definition, and use chr(10) also for
> >   unicode strings

[Finn]
> In the ORO matcher that comes with jpython, the dot matches all but
> chr(10). But that is bad IMO. Unicode should use the LINEBREAK
> predicate.

There's no need for invention. We're supposed to be as close to Perl as
reasonable. What does Perl do?

--Guido van Rossum (home page: http://www.python.org/~guido/)

From gmcm@hypernet.com Sat May 13 21:54:09 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Sat, 13 May 2000 16:54:09 -0400
Subject: [Python-Dev] Re: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)
In-Reply-To: <391DBBED.B252E597@tismer.com>
Message-ID: <1253871224-57464726@hypernet.com>

Christian wrote:

> The split/join issue is really on the edge where I begin to not
> like it. It is clear that the join method *must* be performed as
> a method of the joining character, since the method expects a
> list as its argument.

We've been through this a number of times on c.l.py.

"What is this trash - I want list.join(sep)!"

After some head banging (often quite violent - ie, 4 or 5 exchanges),
they get that list.join(sep) sucks. But they still swear they'll never
use sep.join(list).

So you end up saying "Well, string.join still works".

We'll need a pre-emptive FAQ entry with the link bound to a key stroke.
Or a big increase in the PSU budget...

- Gordon

From gmcm@hypernet.com Sat May 13 21:54:09 2000
From: gmcm@hypernet.com (Gordon McMillan)
Date: Sat, 13 May 2000 16:54:09 -0400
Subject: ImportModule (was Re: [Python-Dev] for the todo list: cStringIO uses string.joinfields)
In-Reply-To: <200005132039.QAA09114@eric.cnri.reston.va.us>
References: Your message of "Sat, 13 May 2000 12:25:42 EDT." <1253887332-56495837@hypernet.com>
Message-ID: <1253871222-57464840@hypernet.com>

[Fredrik]
> > > (btw, the C API reference implies that ImportModule doesn't
> > > use import hooks. does that mean that cStringIO doesn't work
> > > under e.g. Gordon's installer?)

[Guido]
> Hm, the way I read the code (but I didn't write it!) it calls
> PyImport_Import, which is a higher level function that *does* use
> the __import__ hook. Maybe this wasn't always the case?

In stock 1.5.2 it's PyImport_ImportModule. Same in cPickle. I'm
delighted to see them moving towards PyImport_Import.

- Gordon

From Fredrik Lundh
Message-ID: <001501bfbd23$cc45e160$34aab5d4@hagrid>

MAL wrote:
> Note: Python will dump core if it cannot find the exceptions
> module. Perhaps we should add a builtin _exceptions module
> (basically a frozen exceptions.py) which is then used as
> fallback solution ?!

or use this one:

http://w1.132.telia.com/~u13208596/exceptions.htm

From bwarsaw@python.org Sat May 13 22:40:47 2000
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Sat, 13 May 2000 17:40:47 -0400 (EDT)
Subject: [Python-Dev] Re: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)
References: <391A3FD4.25C87CB4@san.rr.com> <8fe76b$684$1@newshost.accu.uu.nl> <8fh9ki$51h$1@slb3.atl.mindspring.net> <8fk4mh$i4$1@kopp.stud.ntnu.no> <391DBBED.B252E597@tismer.com>
Message-ID: <14621.52191.448037.799287@anthem.cnri.reston.va.us>

>>>>> "CT" == Christian Tismer writes:

    CT> If it came to the point where the string module had some extra
    CT> methods which operate on two lists of string perhaps, we would
    CT> have been totally lost, and enforcing some OO method to
    CT> support it would be completely off the road.
The new .join() method reads a bit better if you first name the glue
string:

    space = ' '
    name = space.join(['Barry', 'Aloisius', 'Warsaw'])

But yes, it does look odd when used like

    ' '.join(['Christian', 'Aloisius', 'Tismer'])

I still think it's nice not to have to import string "just" to get the
join functionality, but remember of course that string.join() isn't
going away, so you can still use this if you like it better.
Alternatively, there has been talk about moving join() into the
built-ins, but I'm not sure if the semantics of that have been nailed
down.

-Barry

From tismer@tismer.com Sat May 13 22:48:37 2000
From: tismer@tismer.com (Christian Tismer)
Date: Sat, 13 May 2000 23:48:37 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
References: <1253871224-57464726@hypernet.com>
Message-ID: <391DCDB5.4FCAB97F@tismer.com>

Gordon McMillan wrote:
>
> Christian wrote:
>
> > The split/join issue is really on the edge where I begin to not
> > like it. It is clear that the join method *must* be performed as
> > a method of the joining character, since the method expects a
> > list as its argument.
>
> We've been through this a number of times on c.l.py.

I know. It just came up when I really used it, when I read through this
huge patch from Fred Gansevles, and when I see people wondering about
it. After all, it is no surprise. They are right. If we have to change
their mind in order to understand a basic operation, then we are wrong,
not they.

> "What is this trash - I want list.join(sep)!"
>
> After some head banging (often quite violent - ie, 4 or 5 exchanges),
> they get that list.join(sep) sucks. But they still
> swear they'll never use sep.join(list).
>
> So you end up saying "Well, string.join still works".

And it is the cleanest possible way to go, IMHO. Unless we had some
compound object methods, like

    (somelist, somestring).join()

> We'll need a pre-emptive FAQ entry with the link bound to a
> key stroke. Or a big increase in the PSU budget...

We should reconsider the OO pattern. The user's complaining is natural.
" ".join() is not. We might have gone too far.

Python isn't just OO, it is better. Joining lists of strings is joining
lists of strings. This is not a method of a string in the first place.
And not a method of a sequence in the first place. Making it a method
of the joining string now appears to be a hack to me. (Sorry, Tim, the
idea was great in the first place)

I am now
+1 on leaving join() to the string module
-1 on making some filler.join() to be the preferred joining way.

this-was-my-most-conservative-day-since-years-ly y'rs - chris

--
Christian Tismer             :^)
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com

From tismer@tismer.com Sat May 13 22:55:43 2000
From: tismer@tismer.com (Christian Tismer)
Date: Sat, 13 May 2000 23:55:43 +0200
Subject: [Python-Dev] Re: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)
References: <391A3FD4.25C87CB4@san.rr.com> <8fe76b$684$1@newshost.accu.uu.nl> <8fh9ki$51h$1@slb3.atl.mindspring.net> <8fk4mh$i4$1@kopp.stud.ntnu.no> <391DBBED.B252E597@tismer.com> <14621.52191.448037.799287@anthem.cnri.reston.va.us>
Message-ID: <391DCF5F.BA981607@tismer.com>
Warsaw" wrote: > > >>>>> "CT" == Christian Tismer writes: > > CT> If it came to the point where the string module had some extra > CT> methods which operate on two lists of string perhaps, we would > CT> have been totally lost, and enforcing some OO method to > CT> support it would be completely off the road. > > The new .join() method reads a bit better if you first name the > glue string: > > space = ' ' > name = space.join(['Barry', 'Aloisius', 'Warsaw']) Agreed. > But yes, it does look odd when used like > > ' '.join(['Christian', 'Aloisius', 'Tismer']) I'd love that Aloisius, really. I'll ask my parents for a renaming :-) > I still think it's nice not to have to import string "just" to get the > join functionality, but remember of course that string.join() isn't > going away, so you can still use this if you like it better. Sure, and I'm glad to be able to use string methods without ugly imports. It just came to me when my former colleague Axel met me last time, and I showed him the 1.6 alpha with its string methods (just looking over Fred's huge patch) that he said "Well, quite nice. So they now go the same wrong way as Java did? The OO pattern is dead. This example shows why." > Alternatively, there has been talk about moving join() into the > built-ins, but I'm not sure if the semantics of tha have been nailed > down. Sounds like a good alternative. -- Christian Tismer :^) Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com From martin@loewis.home.cs.tu-berlin.de Sun May 14 22:39:52 2000 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sun, 14 May 2000 23:39:52 +0200 Subject: [Python-Dev] Unicode Message-ID: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> > comments? (for obvious reasons, I'm especially interested in comments > from people using non-ASCII characters on a daily basis...) > nobody? Hi Frederik, I think the problem you try to see is not real. My guideline for using Unicode in Python 1.6 will be that people should be very careful to *not* mix byte strings and Unicode strings. If you are processing text data, obtained from a narrow-string source, you'll always have to make an explicit decision what the encoding is. If you follow this guideline, I think the Unicode type of Python 1.6 will work just fine. If you use Unicode text *a lot*, you may find the need to combine them with plain byte text in a more convenient way. This is the time you should look at the implicit conversion stuff, and see which of the functionality is useful. You then don't need to memorize *all* the rules where implicit conversion would work - just the cases you care about. That may all look difficult - it probably is. But then, it is not more difficult than tuples vs. lists: why does >>> [a,b,c] = (1,2,3) work, and >>> [1,2]+(3,4) Traceback (most recent call last): File "", line 1, in ? TypeError: illegal argument type for built-in operation does not? 
From tim_one@email.msn.com Mon May 15 00:51:41 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Sun, 14 May 2000 19:51:41 -0400
Subject: [Python-Dev] Memory woes under Windows
Message-ID: <000001bfbdff$5bcdfe40$192d153f@tim>

[Noah, I'm wondering whether this is related to our W98 NatSpeak woes --
Python grows its lists much like a certain product we both work on grows
its arrays ...]

Here's a simple test case:

    from time import clock

    def run():
        n = 1
        while n < 4000000:
            a = []
            push = a.append
            start = clock()
            for i in xrange(n):
                push(1)
            finish = clock()
            print "%10d push %10.3f" % (n, round(finish - start, 3))
            n = n + n

    for i in (1, 2, 3):
        try:
            run()
        except MemoryError:
            print "Got a memory error"

So run() builds a number of power-of-2 sized lists, each by appending
one element at a time. It prints the list length and elapsed time to
build each one (on Windows, this is basically wall-clock time, and is
derived from the Pentium's high-resolution cycle timer). The driver
simply runs this 3 times, reporting any MemoryError that pops up. The
largest array constructed has 2M elements, so consumes about 8Mb -- no
big deal on most machines these days.

Here's what happens on my new laptop (damn, this thing is fast! --
usually):
This particular test case doesn't run any better under my Win95 (original) P5-166 with 32Mb RAM using Python 1.5.2. But at work, we've got a (unfortunately huge, and C++) program that runs much slower on a large-memory W98 machine than a small-memory W95 one, due to disk thrashing. It's a mystery! If anyone has a clue about any of this, spit it out .

[Noah, I watched the disk cache size while running the above, and it's not the problem -- while W98 had allocated about 100Mb for disk cache at the start, it gracefully gave that up as the program's memory demands increased]

just-another-day-with-windows-ly y'rs - tim

From mhammond@skippinet.com.au Mon May 15 01:28:05 2000 From: mhammond@skippinet.com.au (Mark Hammond) Date: Mon, 15 May 2000 10:28:05 +1000 Subject: [Python-Dev] Memory woes under Windows In-Reply-To: <000001bfbdff$5bcdfe40$192d153f@tim> Message-ID:

This is definitely weird! As you only mentioned Win9x, I thought I would give it a go on Win2k. This is from a CVS update of only a few days ago, but it is a non-debug build.

PII266 with 196MB ram:

         1 push      0.001
         2 push      0.000
         4 push      0.000
         8 push      0.000
        16 push      0.000
        32 push      0.000
        64 push      0.000
       128 push      0.001
       256 push      0.001
       512 push      0.003
      1024 push      0.006
      2048 push      0.011
      4096 push      0.040
      8192 push      0.043
     16384 push      0.103
     32768 push      0.203
     65536 push      0.583

Things are looking OK to here - the behaviour Tim expected. But then things seem to start going a little wrong:

    131072 push      1.456
    262144 push      4.763
    524288 push     16.119
   1048576 push     60.765

All of a sudden we seem to hit N*N behaviour? I gave up waiting for the next one. Performance monitor was showing CPU at 100%, but the Python process was only sitting on around 15MB of RAM (and growing _very_ slowly - at the rate you would expect). Machine had tons of ram showing as available, and the disk was not thrashing - ie, Windows definitely had lots of mem available, and I have no reason to believe that a malloc() would fail here - but certainly no one would ever want to wait and see :-)

This was all definitely built with MSVC6, SP3.

no-room-should-ever-have-more-than-one-windows-ly y'rs

Mark.

From gstein@lyra.org Mon May 15 05:08:33 2000 From: gstein@lyra.org (Greg Stein) Date: Sun, 14 May 2000 21:08:33 -0700 (PDT) Subject: [Python-Dev] cvs for dummies In-Reply-To: <001901bfbcef$d4672b80$34aab5d4@hagrid> Message-ID:

On Sat, 13 May 2000, Fredrik Lundh wrote:
> Fred L. Drake, Jr. wrote:
> > Fredrik Lundh writes:
> > > what's the best way to make sure that a "cvs update" really brings
> > > everything up to date, even if you've accidentally changed some-
> > > thing in your local workspace?
> >
> > Delete the file(s) that got changed and cvs update again.
>
> okay, what's the best way to get a list of locally changed files?

I use the following:

% cvs stat | fgrep Local

Cheers, -g

-- Greg Stein, http://www.lyra.org/

From tim_one@email.msn.com Mon May 15 08:34:39 2000 From: tim_one@email.msn.com (Tim Peters) Date: Mon, 15 May 2000 03:34:39 -0400 Subject: [Python-Dev] Memory woes under Windows In-Reply-To: Message-ID: <000001bfbe40$07f14520$b82d153f@tim>

[Mark Hammond]
> This is definitely weird! As you only mentioned Win9x, I thought I would
> give it a go on Win2k.

Thanks, Mark! I've only got W9X machines at home.

> This is from a CVS update of only a few days ago, but it is a non-debug
> build.
PII266 with 196MB ram: > > 1 push 0.001 > 2 push 0.000 > 4 push 0.000 > 8 push 0.000 > 16 push 0.000 > 32 push 0.000 > 64 push 0.000 > 128 push 0.001 > 256 push 0.001 > 512 push 0.003 > 1024 push 0.006 > 2048 push 0.011 > 4096 push 0.040 > 8192 push 0.043 > 16384 push 0.103 > 32768 push 0.203 > 65536 push 0.583 > > Things are looking OK to here - the behaviour Tim expected. But then > things seem to start going a little wrong: > > 131072 push 1.456 > 262144 push 4.763 > 524288 push 16.119 > 1048576 push 60.765 So that acts like my Win95 (which I didn't show), and somewhat like my 2nd & 3rd Win98 runs. > All of a sudden we seem to hit N*N behaviour? *That* part really isn't too surprising. Python "overallocates", but by a fixed amount independent of the current size. This leads to quadratic-time behavior "in theory" once a vector gets large enough. Guido's cultural myth for why that theory shouldn't matter is that if you keep appending to the same vector, the OS will eventually move it to the end of the address space, whereupon further growth simply boosts the VM high-water mark without actually moving anything. I call that "a cultural myth" because some flavors of Unix did used to work that way, and some may still -- I doubt it's ever been a valid argument under Windows, though. (you, of all people, know how much Python's internal strategies were informed by machines nobody uses ). So I was more surprised up to this point by the supernatural linearity of my first W98 run (which is reproducible, btw). But my 2nd & 3rd W98 runs (also reproducible), and unlike your W2K run, show *worse* than quadratic behavior. > I gave up waiting for the next one. Under both W98 and W95, the next one does eventually hit the MemoryError for me, but it does take a long time. If I thought it would help, I'd measure it. And *this* one is surprising, because, as you say: > Performance monitor was showing CPU at 100%, but the Python process > was only sitting on around 15MB of RAM (and growing _very_ slowly - > at the rate you would expect). Machine had tons of ram showing as > available, and the disk was not thrashing - ie, Windows definately > had lots of mem available, and I have no reason to believe that > a malloc() would fail here - but certainly no one would ever want to wait > and see :-) How long did you wait? If less than 10 minutes, perhaps not long enough. I certainly didn't expect a NULL return either, even on my tiny machine, and certainly not on the box with 20x more RAM than the list needs. > This was all definately built with MSVC6, SP3. Again good to know. I'll chew on this, but don't expect a revelation soon. > no-room-should-ever-have-more-than-one-windows-ly y'rs Hmm. I *did* run these in different rooms . no-accounting-for-windows-ly y'rs - tim From tim_one@email.msn.com Mon May 15 08:34:51 2000 From: tim_one@email.msn.com (Tim Peters) Date: Mon, 15 May 2000 03:34:51 -0400 Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)) In-Reply-To: <391DCDB5.4FCAB97F@tismer.com> Message-ID: <000301bfbe40$0e2a49a0$b82d153f@tim> [Christian Tismer] > ... > After all, it is no surprize. They are right. > If we have to change their mind in order to understand > a basic operation, then we are wrong, not they. Huh! I would not have guessed that you'd give up on Stackless that easily . > ... > Making it a method of the joining string now appears to be > a hack to me. 
(Sorry, Tim, the idea was great in the first place)

Just the opposite here: it looked like a hack the first time I thought of it, but has gotten more charming with each use. space.join(sequence) is so pretty it aches.

redefining-truth-all-over-the-place-ly y'rs - tim

From gward@mems-exchange.org Mon May 15 14:30:54 2000 From: gward@mems-exchange.org (Greg Ward) Date: Mon, 15 May 2000 09:30:54 -0400 Subject: [Python-Dev] cvs for dummies In-Reply-To: <000d01bfbce8$a3466f40$34aab5d4@hagrid>; from effbot@telia.com on Sat, May 13, 2000 at 04:36:30PM +0200 References: <000d01bfbce8$a3466f40$34aab5d4@hagrid> Message-ID: <20000515093053.A5765@mems-exchange.org>

--KsGdsel6WgEHnImy Content-Type: text/plain; charset=us-ascii

On 13 May 2000, Fredrik Lundh said:
> what's the best way to make sure that a "cvs update" really brings
> everything up to date, even if you've accidentally changed some-
> thing in your local workspace?

Try the attached script -- it's basically the same as Greg Stein's "cvs status | grep Local", but beefed-up and overkilled. Example:

$ cvstatus -l
.cvsignore                     Up-to-date        2000-05-02 14:31:04
Makefile.in                    Locally Modified  2000-05-12 12:25:39
README                         Up-to-date        2000-05-12 12:34:42
acconfig.h                     Up-to-date        2000-05-12 12:25:40
config.h.in                    Up-to-date        2000-05-12 12:25:40
configure                      Up-to-date        2000-05-12 12:25:40
configure.in                   Up-to-date        2000-05-12 12:25:40
install-sh                     Up-to-date        1998-08-13 12:08:45

...so yeah, it generates a lot of output when run on a large working tree, eg. Python's. But not as much as "cvs status" on its own. ;-)

Greg

PS. I just noticed it uses the "#!/usr/bin/env" hack with a command-line option for the interpreter, which doesn't work on Linux. ;-( You may have to hack the shebang line to make it work.

-- Greg Ward - software developer gward@mems-exchange.org MEMS Exchange / CNRI voice: +1-703-262-5376 Reston, Virginia, USA fax: +1-703-262-5367

--KsGdsel6WgEHnImy Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename=cvstatus

#!/usr/bin/env perl -w
#
# cvstatus
#
# runs "cvs status" (with optional file arguments), filtering out
# uninteresting stuff and putting in the last-modification time
# of each file.
#
# Usage: cvstatus [files]
#
# GPW 1999/02/17
#
# $Id: cvstatus,v 1.4 2000/04/14 14:56:14 gward Exp $
#

use strict;
use POSIX 'strftime';

my @files = @ARGV;

# Open a pipe to a forked child process
my $pid = open (CVS, "-|");
die "couldn't open pipe: $!\n" unless defined $pid;

# In the child -- run "cvs status" (with optional list of files
# from command line)
unless ($pid) {
    open (STDERR, ">&STDOUT");      # merge stderr with stdout
    exec 'cvs', 'status', @files;
    die "couldn't exec cvs: $!\n";
}

# In the parent -- read "cvs status" output from the child
else {
    my $dir = '';
    while (<CVS>) {
        my ($filename, $status, $mtime);
        if (/Examining (.*)/) {
            $dir = $1;
            if (! -d $dir) {
                warn "huh? no directory called $dir!";
                $dir = '';
            }
            elsif ($dir eq '.') {
                $dir = '';
            }
            else {
                $dir .= '/' unless $dir =~ m|/$|;
            }
        }
        elsif (($filename, $status) =
               /^File: \s* (\S+) \s* Status: \s* (.*)/x) {
            $filename = $dir . $filename;
            if ($mtime = (stat $filename)[9]) {
                $mtime = strftime ("%Y-%m-%d %H:%M:%S", localtime $mtime);
                printf "%-30.30s %-17s %s\n", $filename, $status, $mtime;
            }
            else {
                #warn "couldn't stat $filename: $!\n";
                printf "%-30.30s %-17s ???\n", $filename, $status;
            }
        }
    }
    close (CVS);
    warn "cvs failed\n" unless $? == 0;
}
--KsGdsel6WgEHnImy--

From trentm@activestate.com Mon May 15 22:09:58 2000 From: trentm@activestate.com (Trent Mick) Date: Mon, 15 May 2000 14:09:58 -0700 Subject: [Python-Dev] hey, who broke the array module? In-Reply-To: <006e01bfbd06$6ba21120$34aab5d4@hagrid> References: <006e01bfbd06$6ba21120$34aab5d4@hagrid> Message-ID: <20000515140958.C20418@activestate.com>

I broke it with my patches to test overflow for some of the PyArg_Parse*() formatting characters. The upshot of testing for overflow is that now those formatting characters ('b', 'h', 'i', 'l') enforce signed-ness or unsigned-ness as appropriate (you have to know if the value is signed or unsigned to know what limits to check against for overflow). Two possibilities presented themselves:

1. Enforce 'b' as unsigned char (the common usage) and the rest as signed values (short, int, and long). If you want a signed char, or an unsigned short, you have to work around it yourself.

2. Add formatting characters or modifiers for signed and unsigned versions of all the integral types to PyArg_Parse*() in getargs.c.

Guido preferred the former because (my own interpretation of the reasons) it covers the common case and keeps the clutter and feature creep down. It is debatable whether or not we really need signed and unsigned for all of them. See the following threads on python-dev and patches:

  make 'b' formatter an *unsigned* char
  issues with int/long on 64bit platforms - eg stringobject (PR#306)
  make 'b','h','i' raise overflow exception

Possible code breakage is the drawback.

[Fredrik Lundh wrote]:
> sigh. never resync the CVS repository until you've fixed all
> bugs in your *own* code ;-)

Sorry, I guess. The test suite did not catch this, so it was hard for me to know that the bug was introduced. My patches add tests for these to the test suite.

>
> in 1.5.2:
>
> >>> array.array("h", [65535])
> array('h', [-1])
> >>> array.array("H", [65535])
> array('H', [65535])
>
> in the current CVS version:
>
> >>> array.array("h", [65535])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> OverflowError: signed short integer is greater than maximum
>
> okay, this might break some existing code -- but one
> can always argue that such code were already broken.

Yes.

> on the other hand:
>
> >>> array.array("H", [65535])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> OverflowError: signed short integer is greater than maximum
>
> oops.
>
oops. See my patch that fixes this for 'H', and 'b', and 'I', and 'L'.

> dunno if the right thing would be to add support for various kinds
> of unsigned integers to Python/getargs.c, or to hack around this
> in the array module...
>
My patch does the latter and that would be my suggestion because:

(1) Guido didn't like the idea of adding more formatters to getargs.c (see above).

(2) Adding support for unsigned and signed versions in getargs.c could be confusing because the formatting characters cannot be the same as in the array module because 'L' is already used for LONG_LONG types in PyArg_Parse*().

(3) KISS and the common case. Keep the number of formatters for PyArg_Parse*() short and simple. I would presume that the common case user does not really need the extra support.

Trent

-- Trent Mick trentm@activestate.com

From mhammond@skippinet.com.au Tue May 16 07:22:53 2000 From: mhammond@skippinet.com.au (Mark Hammond) Date: Tue, 16 May 2000 16:22:53 +1000 Subject: [Python-Dev] Attempt script name with '.py' appended instead of failing?
Message-ID:

For about the 1,000,000th time in my life (no exaggeration :-), I just typed "python.exe foo" - I forgot the .py.

It would seem a simple and useful change to append a ".py" extension and try again, instead of dying the first time around - ie, all we would be changing is that we continue to run where we previously failed.

Is there a good reason why we don't do this?

Mark.

From mal@lemburg.com Mon May 15 23:07:53 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 May 2000 00:07:53 +0200 Subject: [Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak? References: <002301bfbcda$afbdb0c0$34aab5d4@hagrid> <391d5b7f.3713359@smtp.worldonline.dk> Message-ID: <39207539.F1C14A25@lemburg.com>

Finn Bock wrote:
>
> On Sat, 13 May 2000 14:56:41 +0200, you wrote:
>
> >in the current 're' engine, a newline is chr(10) and nothing
> >else.
> >
> >however, in the new unicode aware engine, I used the new
> >LINEBREAK predicate instead, but it turned out to break one
> >of the tests in the current test suite:
> >
> >    sre.match('a\rb', 'a.b') => None
> >
> >(unicode adds chr(13), chr(28), chr(29), chr(30), and also
> >unichr(133), unichr(8232), and unichr(8233) to the list of
> >line breaking codes)
>
> >what's the best way to deal with this? I see three alter-
> >natives:
> >
> >a) stick to the old definition, and use chr(10) also for
> >   unicode strings
>
> In the ORO matcher that comes with jpython, the dot matches all but
> chr(10). But that is bad IMO. Unicode should use the LINEBREAK
> predicate.

+1 on that one... just like \s should use Py_UNICODE_ISSPACE() and \d Py_UNICODE_ISDECIMAL().

BTW, how have you implemented the locale-aware \w and \W for Unicode ? Unicode doesn't have any locales, but quite a lot more alphanumeric characters (or equivalents), and there currently is no Py_UNICODE_ISALPHA() in the core.

-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

From mal@lemburg.com Mon May 15 22:50:39 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 15 May 2000 23:50:39 +0200 Subject: [Python-Dev] join() et al. References: <391A3FD4.25C87CB4@san.rr.com> <8fe76b$684$1@newshost.accu.uu.nl> <8fh9ki$51h$1@slb3.atl.mindspring.net> <8fk4mh$i4$1@kopp.stud.ntnu.no> <391DBBED.B252E597@tismer.com> <14621.52191.448037.799287@anthem.cnri.reston.va.us> Message-ID: <3920712F.1FD0B910@lemburg.com>

"Barry A. Warsaw" wrote:
>
> >>>>> "CT" == Christian Tismer writes:
>
> CT> If it came to the point where the string module had some extra
> CT> methods which operate on two lists of string perhaps, we would
> CT> have been totally lost, and enforcing some OO method to
> CT> support it would be completely off the road.
>
> The new .join() method reads a bit better if you first name the
> glue string:
>
> space = ' '
> name = space.join(['Barry', 'Aloisius', 'Warsaw'])
>
> But yes, it does look odd when used like
>
> ' '.join(['Christian', 'Aloisius', 'Tismer'])
>
> I still think it's nice not to have to import string "just" to get the
> join functionality, but remember of course that string.join() isn't
> going away, so you can still use this if you like it better.

string.py is deprecated, AFAIK (not that it'll go away anytime soon, but using string methods directly is really the better, more readable and faster approach).
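(The two spellings side by side -- a trivial sketch:)

import string

words = ['Barry', 'Aloisius', 'Warsaw']
print string.join(words, ' ')    # the old string-module function, still works
print ' '.join(words)            # the 1.6 string method; no import needed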
> Alternatively, there has been talk about moving join() into the > built-ins, but I'm not sure if the semantics of tha have been nailed > down. This is probably the way to go. Semantics should probably be: join(seq,sep) := reduce(lambda x,y: x + sep + y, seq) and should work with any type providing addition or concat slot methods. Patches anyone ? -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Tue May 16 09:21:46 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 May 2000 10:21:46 +0200 Subject: [Python-Dev] Unicode References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> Message-ID: <3921051A.56C7B63E@lemburg.com> "Martin v. Loewis" wrote: > > > comments? (for obvious reasons, I'm especially interested in comments > > from people using non-ASCII characters on a daily basis...) > > > nobody? > > Hi Frederik, > > I think the problem you try to see is not real. My guideline for using > Unicode in Python 1.6 will be that people should be very careful to > *not* mix byte strings and Unicode strings. If you are processing text > data, obtained from a narrow-string source, you'll always have to make > an explicit decision what the encoding is. Right, that's the way to go :-) > If you follow this guideline, I think the Unicode type of Python 1.6 > will work just fine. > > If you use Unicode text *a lot*, you may find the need to combine them > with plain byte text in a more convenient way. This is the time you > should look at the implicit conversion stuff, and see which of the > functionality is useful. You then don't need to memorize *all* the > rules where implicit conversion would work - just the cases you care > about. One should better not rely on the implicit conversions. These are really only there to ease porting applications to Unicode and perhaps make some existing APIs deal with Unicode without even knowing about it -- of course this will not always work and those places will need some extra porting effort to make them useful w/r to Unicode. open() is one such candidate. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fredrik@pythonware.com Tue May 16 10:30:54 2000 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 16 May 2000 11:30:54 +0200 Subject: [Python-Dev] Unicode References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> Message-ID: <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> Martin v. Loewis wrote: > I think the problem you try to see is not real. it is real. I won't repeat the arguments one more time; please read the W3C character model note and the python-dev archives, and read up on the unicode support in Tcl and Perl. > But then, it is not more difficult than tuples vs. lists your examples always behave the same way, no matter what's in the containers. that's not true for MAL's design. From guido@python.org Tue May 16 11:03:07 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 16 May 2000 06:03:07 -0400 Subject: [Python-Dev] Attempt script name with '.py' appended instead of failing? In-Reply-To: Your message of "Tue, 16 May 2000 16:22:53 +1000." References: Message-ID: <200005161003.GAA12247@eric.cnri.reston.va.us> > For about the 1,000,000th time in my life (no exaggeration :-), I just > typed "python.exe foo" - I forgot the .py. 
>
> It would seem a simple and useful change to append a ".py" extension and
> try again, instead of dying the first time around - ie, all we would be
> changing is that we continue to run where we previously failed.
>
> Is there a good reason why we don't do this?

Just inertia, plus it's "not the Unix way". I agree it's a good idea. (I also found in user testing that IDLE definitely has to supply the ".py" when saving a module if the user didn't.)

--Guido van Rossum (home page: http://www.python.org/~guido/)

From skip@mojam.com (Skip Montanaro) Tue May 16 15:52:59 2000 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Tue, 16 May 2000 09:52:59 -0500 (CDT) Subject: [Python-Dev] join() et al. In-Reply-To: <3920712F.1FD0B910@lemburg.com> References: <391A3FD4.25C87CB4@san.rr.com> <8fe76b$684$1@newshost.accu.uu.nl> <8fh9ki$51h$1@slb3.atl.mindspring.net> <8fk4mh$i4$1@kopp.stud.ntnu.no> <391DBBED.B252E597@tismer.com> <14621.52191.448037.799287@anthem.cnri.reston.va.us> <3920712F.1FD0B910@lemburg.com> Message-ID: <14625.24779.329534.364663@beluga.mojam.com>

>> Alternatively, there has been talk about moving join() into the
>> built-ins, but I'm not sure if the semantics of that have been nailed
>> down.

Marc> This is probably the way to go. Semantics should probably
Marc> be:
Marc> join(seq,sep) := reduce(lambda x,y: x + sep + y, seq)
Marc> and should work with any type providing addition or concat slot
Marc> methods.

Of course, while it will always yield what you ask for, it might not always yield what you expect:

    >>> seq = [1,2,3]
    >>> sep = 5
    >>> reduce(lambda x,y: x + sep + y, seq)
    16

;-)

-- Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/ "We have become ... the stewards of life's continuity on earth. We did not ask for this role... We may not be suited to it, but here we are." - Stephen Jay Gould

From Fredrik Lundh" <8fe76b$684$1@newshost.accu.uu.nl><8fh9ki$51h$1@slb3.atl.mindspring.net><8fk4mh$i4$1@kopp.stud.ntnu.no><391DBBED.B252E597@tismer.com><14621.52191.448037.799287@anthem.cnri.reston.va.us><3920712F.1FD0B910@lemburg.com> <14625.24779.329534.364663@beluga.mojam.com> Message-ID: <000d01bfbf4a$85321400$34aab5d4@hagrid>

> Marc> join(seq,sep) := reduce(lambda x,y: x + sep + y, seq)
>
> Of course, while it will always yield what you ask for, it might not always
> yield what you expect:
>
> >>> seq = [1,2,3]
> >>> sep = 5
> >>> reduce(lambda x,y: x + sep + y, seq)
> 16

not to mention:

>>> print join([], " ")
TypeError: reduce of empty sequence with no initial value

...

From mal@lemburg.com Tue May 16 18:15:05 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 16 May 2000 19:15:05 +0200 Subject: [Python-Dev] join() et al.
References: <391A3FD4.25C87CB4@san.rr.com><8fe76b$684$1@newshost.accu.uu.nl><8fh9ki$51h$1@slb3.atl.mindspring.net><8fk4mh$i4$1@kopp.stud.ntnu.no><391DBBED.B252E597@tismer.com><14621.52191.448037.799287@anthem.cnri.reston.va.us><3920712F.1FD0B910@lemburg.com> <14625.24779.329534.364663@beluga.mojam.com> <000d01bfbf4a$85321400$34aab5d4@hagrid> Message-ID: <39218219.9E8115E2@lemburg.com>

Fredrik Lundh wrote:
>
> > Marc> join(seq,sep) := reduce(lambda x,y: x + sep + y, seq)
> >
> > Of course, while it will always yield what you ask for, it might not always
> > yield what you expect:
> >
> > >>> seq = [1,2,3]
> > >>> sep = 5
> > >>> reduce(lambda x,y: x + sep + y, seq)
> > 16
>
> not to mention:
>
> >>> print join([], " ")
> TypeError: reduce of empty sequence with no initial value

Ok, here's a more readable and semantically useful definition:

def join(sequence,sep=''):
    # Special case: empty sequence
    if len(sequence) == 0:
        try:
            return 0*sep
        except TypeError:
            return sep[0:0]
    # Normal case
    x = None
    for y in sequence:
        if x is None:
            x = y
        elif sep:
            x = x + sep + y
        else:
            x = x + y
    return x

Examples:

>>> join((1,2,3))
6
>>> join(((1,2),(3,4)),('x',))
(1, 2, 'x', 3, 4)
>>> join(('a','b','c'), ' ')
'a b c'
>>> join(())
''
>>> join((),())
()

-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

From paul@prescod.net Tue May 16 18:58:33 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 16 May 2000 12:58:33 -0500 Subject: [Python-Dev] Unicode References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> Message-ID: <39218C49.C66FEEDE@prescod.net>

"Martin v. Loewis" wrote:
>
> ...
>
> I think the problem you try to see is not real. My guideline for using
> Unicode in Python 1.6 will be that people should be very careful to
> *not* mix byte strings and Unicode strings.

I think that as soon as we are adding admonitions to documentation that things "probably don't behave as you expect, so be careful", we have failed. Sometimes failure is unavoidable (e.g. floats do not act rationally -- deal with it). But let's not pretend that failure is success.

> If you are processing text
> data, obtained from a narrow-string source, you'll always have to make
> an explicit decision what the encoding is.

Are Python literals a "narrow string source"? It seems blatantly clear to me that the "encoding" of Python literals should be determined at compile time, not runtime. Byte arrays from a file are different.

> If you use Unicode text *a lot*, you may find the need to combine them
> with plain byte text in a more convenient way.

Unfortunately there will be many people with no interest in Unicode who will be dealing with it merely because that is the way APIs are going: XML APIs, Windows APIs, TK, DCOM, SOAP, WebDAV, even some X/Unix APIs. Unicode is the new ASCII. I want to get a (Unicode) string from an XML document or SOAP request, compare it to a string literal and never think about Unicode.

> ...
> why does
>
> >>> [a,b,c] = (1,2,3)
>
> work, and
>
> >>> [1,2]+(3,4)
> ...
>
> does not?

I dunno. If there is no good reason then it is a bug that should be fixed. The __radd__ operator on lists should iterate over its argument as a sequence. As Fredrik points out, though, this situation is not as dangerous as auto-conversions because a) the latter could be loosened later without breaking code, and b) the operation always fails: it never does the wrong thing silently, and it never succeeds for some inputs and fails for others.
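(Concretely -- a small sketch of the always-fails behaviour being discussed, plus the explicit spelling that works today:)

try:
    [1, 2] + (3, 4)              # a TypeError today, for *any* inputs
except TypeError:
    print 'list + tuple raises TypeError'

print [1, 2] + list((3, 4))      # the explicit conversion that does work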
-- Paul Prescod - ISOGEN Consulting Engineer speaking for himself "Hardly anything more unwelcome can befall a scientific writer than having the foundations of his edifice shaken after the work is finished. I have been placed in this position by a letter from Mr. Bertrand Russell..." - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox) From skip@mojam.com (Skip Montanaro) Tue May 16 19:15:40 2000 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Tue, 16 May 2000 13:15:40 -0500 (CDT) Subject: [Python-Dev] join() et al. In-Reply-To: <39218219.9E8115E2@lemburg.com> References: <391A3FD4.25C87CB4@san.rr.com> <8fe76b$684$1@newshost.accu.uu.nl> <8fh9ki$51h$1@slb3.atl.mindspring.net> <8fk4mh$i4$1@kopp.stud.ntnu.no> <391DBBED.B252E597@tismer.com> <14621.52191.448037.799287@anthem.cnri.reston.va.us> <3920712F.1FD0B910@lemburg.com> <14625.24779.329534.364663@beluga.mojam.com> <000d01bfbf4a$85321400$34aab5d4@hagrid> <39218219.9E8115E2@lemburg.com> Message-ID: <14625.36940.160373.900909@beluga.mojam.com> Marc> Ok, here's a more readable and semantically useful definition: ... >>> join((1,2,3)) 6 My point was that the verb "join" doesn't connote "sum". The idea of "join"ing a sequence suggests (to me) that the individual sequence elements are still identifiable in the result, so "join((1,2,3))" would look something like "123" or "1 2 3" or "10203", not "6". It's not a huge deal to me, but I think it mildly violates the principle of least surprise when you try to apply it to sequences of non-strings. To extend this into the absurd, what should the following code display? class Spam: pass eggs = Spam() bacon = Spam() toast = Spam() print join((eggs,bacon,toast)) If a join builtin is supposed to be applicable to all types, we need to decide what the semantics are going to be for all types. Maybe all that needs to happen is that you stringify any non-string elements before applying the + operator (just one possibility among many, not necessarily one I recommend). If you want to limit join's inputs to (or only make it semantically meaningful for) sequences of strings, then it should probably not be a builtin, no matter how visually annoying you find " ".join(["a","b","c"]) Skip From Fredrik Lundh" http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40 From martin@loewis.home.cs.tu-berlin.de Tue May 16 19:43:34 2000 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 16 May 2000 20:43:34 +0200 Subject: [Python-Dev] Unicode In-Reply-To: <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> (fredrik@pythonware.com) References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> Message-ID: <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> > it is real. I won't repeat the arguments one more time; please read > the W3C character model note and the python-dev archives, and read > up on the unicode support in Tcl and Perl. I did read all that, so there really is no point in repeating the arguments - yet I'm still not convinced. One of the causes may be that all your commentary either - discusses an alternative solution to the existing one, merely pointing out the difference, without any strong selling point - explains small examples that work counter-intuitively I'd like to know whether you have an example of a real-world big-application problem that could not be conveniently implemented using the new Unicode API. 
For all the examples I can think of where Unicode would matter (XML processing, CORBA wstring mapping, internationalized messages and GUIs), it would work just fine. So while it may not be perfect, I think it is good enough. Perhaps my problem is that I'm not a perfectionist :-)

However, one remark from http://www.w3.org/TR/charmod/ reminded me of an earlier proposal by Bill Janssen. The Character Model says

# Because encoded text cannot be interpreted and processed without
# knowing the encoding, it is vitally important that the character
# encoding is known at all times and places where text is exchanged or
# stored.

While they were considering document encodings, I think this applies in general. Bill Janssen's proposal was that each (narrow) string should have an attribute .encoding. If set, you'll know what encoding a string has. If not set, it is a byte string, subject to the default encoding. I'd still like to see that as a feature in Python.

Regards, Martin

From paul@prescod.net Tue May 16 19:49:46 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 16 May 2000 13:49:46 -0500 Subject: [Python-Dev] homer-dev, anyone? References: <009d01bfbf64$b779a260$34aab5d4@hagrid> Message-ID: <3921984A.8CDE8E1D@prescod.net>

I hope that if Python were renamed we would not choose yet another name which turns up hundreds of false hits in web engines. Perhaps Homr or Home_r. Or maybe Pythahn.

Fredrik Lundh wrote:
>
> http://www.segfault.org/story.phtml?mode=2&id=391ae457-08fa7b40
>
>
>
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> http://www.python.org/mailman/listinfo/python-dev

-- Paul Prescod - ISOGEN Consulting Engineer speaking for himself "Hardly anything more unwelcome can befall a scientific writer than having the foundations of his edifice shaken after the work is finished. I have been placed in this position by a letter from Mr. Bertrand Russell..." - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox)

From tismer@tismer.com Tue May 16 20:01:21 2000 From: tismer@tismer.com (Christian Tismer) Date: Tue, 16 May 2000 21:01:21 +0200 Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)) References: <000301bfbe40$0e2a49a0$b82d153f@tim> Message-ID: <39219B01.A4EE0920@tismer.com>

Tim Peters wrote:
>
> [Christian Tismer]
> > ...
> > After all, it is no surprize. They are right.
> > If we have to change their mind in order to understand
> > a basic operation, then we are wrong, not they.
>
> Huh! I would not have guessed that you'd give up on Stackless that easily
> .

Noh, I didn't give up Stackless, but fishing for soles. After Just v. R. has become my most ambitious user, I'm happy enough. (Again, better don't take me too serious :)

> > ...
> > Making it a method of the joining string now appears to be
> > a hack to me. (Sorry, Tim, the idea was great in the first place)
>
> Just the opposite here: it looked like a hack the first time I thought of
> it, but has gotten more charming with each use. space.join(sequence) is so
> pretty it aches.

It is absolutely phantastic. The most uninteresting stuff in the join is the separator, and it has the power to merge thousands of strings together, without asking the sequence at all - give all power to the suppressed, long live the Python anarchy :-)

We now just have to convince the user no longer to think of *what* to join in the first place, but how.
> redefining-truth-all-over-the-place-ly y'rs - tim

" "-is-small-but-sooo-strong---lets-elect-new-users - ly y'rs - chris

p.s.: no this is *no* offense, just kidding. " ".join(":-)", ":^)", " ") * 42

-- Christian Tismer :^) Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com

From tismer@tismer.com Tue May 16 20:10:42 2000 From: tismer@tismer.com (Christian Tismer) Date: Tue, 16 May 2000 21:10:42 +0200 Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)) References: <000301bfbe40$0e2a49a0$b82d153f@tim> <39219B01.A4EE0920@tismer.com> Message-ID: <39219D32.BD82DE83@tismer.com>

Oh, while we are at it...

Christian Tismer wrote:

> " ".join(":-)", ":^)", " ") * 42

is actually wrong, since it needs a sequence, not just the arg tuple. Wouldn't it make sense to allow this? Exactly the opposite as in list.append(), since in this case we are just expecting strings?

While I have to say that

>>> " ".join("123")
'1 2 3'
>>>

is not a feature to me but just annoying ;-)

ciao again - chris

-- Christian Tismer :^) Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com

From Fredrik Lundh" <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> Message-ID: <00ed01bfbf6d$41c2f720$34aab5d4@hagrid>

Martin v. Loewis wrote:
> > it is real. I won't repeat the arguments one more time; please read
> > the W3C character model note and the python-dev archives, and read
> > up on the unicode support in Tcl and Perl.
>
> I did read all that, so there really is no point in repeating the
> arguments - yet I'm still not convinced. One of the causes may be that
> all your commentary either
>
> - discusses an alternative solution to the existing one, merely
>   pointing out the difference, without any strong selling point
> - explains small examples that work counter-intuitively

umm. I could have sworn that getting rid of counter-intuitive behaviour was rather important in python. maybe we're using the language in radically different ways?

> I'd like to know whether you have an example of a real-world
> big-application problem that could not be conveniently implemented
> using the new Unicode API. For all the examples I can think of where
> Unicode would matter (XML processing, CORBA wstring mapping,
> internationalized messages and GUIs), it would work just fine.

of course I can kludge my way around the flaws in MAL's design, but why should I have to do that? it's broken. fixing it is easy.

> Perhaps my problem is that I'm not a perfectionist :-)

perfectionist or not, I only want Python's Unicode support to be as intuitive as anything else in Python. as it stands right now, Perl and Tcl's Unicode support is intuitive. Python's not.

(it also backs us into a corner -- once you mess this one up, you cannot fix it in Py3K without breaking lots of code. that's really bad).

in contrast, Guido's compromise proposal allows us to do this the right way in 1.7/Py3K (i.e. teach python about source code encodings, system api encodings, and stream i/o encodings).
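(Stream i/o encodings already have an explicit spelling via the new codecs module -- a sketch, assuming the codecs.open helper from the 1.6 Unicode work; the file name is made up:)

import codecs

f = codecs.open('data.txt', 'wb', encoding='utf-8')
f.write(u'Hello, world')         # unicode in, utf-8 bytes on disk
f.close()

f = codecs.open('data.txt', 'rb', encoding='utf-8')
text = f.read()                  # comes back as a unicode string
f.close()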
btw, I thought we'd all agreed on GvR's solution for 1.6? what did I miss?

> So while it may not be perfect, I think it is good enough.

so tell me, if "good enough" is what we're aiming at, why isn't my counter-proposal good enough? if nothing else, it's much easier to document...

From skip@mojam.com (Skip Montanaro) Tue May 16 20:30:08 2000 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Tue, 16 May 2000 14:30:08 -0500 (CDT) Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)) In-Reply-To: <39219D32.BD82DE83@tismer.com> References: <000301bfbe40$0e2a49a0$b82d153f@tim> <39219B01.A4EE0920@tismer.com> <39219D32.BD82DE83@tismer.com> Message-ID: <14625.41408.423282.529732@beluga.mojam.com>

Christian> While I have to say that

>>>> " ".join("123")
Christian> '1 2 3'
>>>>

Christian> is not a feature to me but just annoying ;-)

More annoying than

    >>> import string
    >>> string.join("123")
    '1 2 3'

? ;-)

a-sequence-is-a-sequence-ly y'rs, Skip

From tismer@tismer.com Tue May 16 20:43:33 2000 From: tismer@tismer.com (Christian Tismer) Date: Tue, 16 May 2000 21:43:33 +0200 Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)) References: <000301bfbe40$0e2a49a0$b82d153f@tim> <39219B01.A4EE0920@tismer.com> <39219D32.BD82DE83@tismer.com> <14625.41408.423282.529732@beluga.mojam.com> Message-ID: <3921A4E5.9BDEBF49@tismer.com>

Skip Montanaro wrote:
>
> Christian> While I have to say that
>
> >>>> " ".join("123")
> Christian> '1 2 3'
> >>>>
>
> Christian> is not a feature to me but just annoying ;-)
>
> More annoying than
>
> >>> import string
> >>> string.join("123")
> '1 2 3'
>
> ? ;-)

You are right. Equally bad, just in different flavor. *gulp* this is going to be a can of worms since...

> a-sequence-is-a-sequence-ly y'rs,

Then a string should better not be a sequence. The number of places where I really used the string sequence protocol to take advantage of it is outperformed by a factor of ten by cases where I forgot to tupleise and got a bad result. A traceback is better than a sequence here.

oh-what-did-I-say-here--duck--but-isn't-it-so--cover-ly y'rs - chris

p.s.: the Spanish Inquisition can't get me since I'm in Russia until Sunday - omsk

-- Christian Tismer :^) Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com

From guido@python.org Tue May 16 20:49:17 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 16 May 2000 15:49:17 -0400 Subject: [Python-Dev] Unicode In-Reply-To: Your message of "Tue, 16 May 2000 21:30:49 +0200." <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> Message-ID: <200005161949.PAA16607@eric.cnri.reston.va.us>

> in contrast, Guido's compromise proposal allows us to do this
> the right way in 1.7/Py3K (i.e. teach python about source code
> encodings, system api encodings, and stream i/o encodings).
>
> btw, I thought we'd all agreed on GvR's solution for 1.6?
>
> what did I miss?

Nothing. We are going to do that (my "ASCII" proposal). I'm just waiting for the final SRE code first.
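(What the "ASCII" proposal means at the prompt -- a sketch of the intended behaviour, not output from any released build:)

print u'abc' + 'def'             # fine: 'def' is pure 7-bit ASCII
try:
    u'abc' + '\xe4'              # a non-ASCII byte: no silent guessing
except ValueError:               # UnicodeError is a ValueError subclass
    print 'mixing non-ASCII bytes with unicode raises'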
--Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Tue May 16 21:01:46 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 16 May 2000 16:01:46 -0400 Subject: [Python-Dev] homer-dev, anyone? In-Reply-To: Your message of "Tue, 16 May 2000 13:49:46 CDT." <3921984A.8CDE8E1D@prescod.net> References: <009d01bfbf64$b779a260$34aab5d4@hagrid> <3921984A.8CDE8E1D@prescod.net> Message-ID: <200005162001.QAA16657@eric.cnri.reston.va.us> > I hope that if Python were renamed we would not choose yet another name > which turns up hundreds of false hits in web engines. Perhaps Homr or > Home_r. Or maybe Pythahn. Actually, I'd like to call the next version Throatwobbler Mangrove. But you'd have to pronounce it Raymond Luxyry Yach-t. --Guido van Rossum (home page: http://www.python.org/~guido/) From akuchlin@mems-exchange.org Tue May 16 21:10:22 2000 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Tue, 16 May 2000 16:10:22 -0400 (EDT) Subject: [Python-Dev] Unicode In-Reply-To: <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> Message-ID: <14625.43822.773966.59550@amarok.cnri.reston.va.us> Fredrik Lundh writes: >perfectionist or not, I only want Python's Unicode support to >be as intuitive as anything else in Python. as it stands right >now, Perl and Tcl's Unicode support is intuitive. Python's not. I don't know about Tcl, but Perl 5.6's Unicode support is still considered experimental. Consider the following excerpts, for example. (And Fredrik's right; we shouldn't release a 1.6 with broken support, or we'll pay for it for *years*... But if GvR's ASCII proposal is considered OK, then great!) ======================== http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-04/msg00084.html: >Ah, yes. Unicode. But after two years of work, the one thing that users >will want to do - open and read Unicode data - is still not there. >Who cares if stuff's now represented internally in Unicode if they can't >read the files they need to. This is a "big" (as in "huge") disappointment for me as well. I hope we'll do better next time. ======================== http://www.egroups.com/message/perl5-porters/67906: But given that interpretation, I'm amazed at how many operators seem to be broken with UTF8. It certainly supports Ilya's contention of "pre-alpha". Here's another example: DB<1> x (256.255.254 . 257.258.259) eq (256.255.254.257.258.259) 0 '' DB<2> Rummaging with Devel::Peek shows that in this case, it's the fault of the . operator. And eq is broken as well: DB<11> x "\x{100}" eq "\xc4\x80" 0 1 DB<12> Aaaaargh! ======================== http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-03/msg00971.html: A couple problems here...passage through a hash key removes the UTF8 flag (as might be expected). Even if keys were to attempt to restore the UTF8 flag (ala Convert::UTF::decode_utf8) or hash keys were real SVs, what then do you do with $h{"\304\254"} and the like? Suggestions: 1. Leave things as they are, but document UTF8 hash keys as experimental and subject to change. or 2. When under use bytes, leave things as they are. Otherwise, have keys turn on the utf8 flag if appropriate. Also give a warning when using a hash key like "\304\254" since keys will in effect return a different string that just happens to have the same interal encoding. 
======================== From paul@prescod.net Tue May 16 21:36:42 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 16 May 2000 15:36:42 -0500 Subject: [Python-Dev] Unicode References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> Message-ID: <3921B15A.73EF6355@prescod.net> "Martin v. Loewis" wrote: > > ... > > I'd like to know whether you have an example of a real-world > big-application problem that could not be conveniently implemented > using the new Unicode API. For all the examples I can think where > Unicode would matter (XML processing, CORBA wstring mapping, > internationalized messages and GUIs), it would work just fine. Of course an implicit behavior can never get in the way of big-application building. The question is about principle of least surprise, and simplicity of explanation and understanding. I'm-told-that-even-Perl-and-C++-can-be-used-for-big-apps -ly yrs -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself "Hardly anything more unwelcome can befall a scientific writer than having the foundations of his edifice shaken after the work is finished. I have been placed in this position by a letter from Mr. Bertrand Russell..." - Frege, Appendix of Basic Laws of Arithmetic (of Russell's Paradox) From martin@loewis.home.cs.tu-berlin.de Tue May 16 23:02:10 2000 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 17 May 2000 00:02:10 +0200 Subject: [Python-Dev] Unicode In-Reply-To: <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> (effbot@telia.com) References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> Message-ID: <200005162202.AAA02125@loewis.home.cs.tu-berlin.de> > perfectionist or not, I only want Python's Unicode support to > be as intuitive as anything else in Python. as it stands right > now, Perl and Tcl's Unicode support is intuitive. Python's not. I haven't much experience with Perl, but I don't think Tcl is intuitive in this area. I really think that they got it all wrong. They use the string type for "plain bytes", just as we do, but then have the notion of "correct" and "incorrect" UTF-8 (i.e. strings with violations of the encoding rule). For a "plain bytes" string, the following might happen - the string is scanned for non-UTF-8 characters - if any are found, the string is converted into UTF-8, essentially treating the original string as Latin-1. - it then continues to use the UTF-8 "version" of the original string, and converts it back on demand. Maybe I got something wrong, but the Unicode support in Tcl makes me worry very much. > btw, I thought we'd all agreed on GvR's solution for 1.6? > > what did I miss? I like the 'only ASCII is converted' approach very much, so I'm not objecting to that solution - just as I wasn't objecting to the previous one. > so tell me, if "good enough" is what we're aiming at, why isn't > my counter-proposal good enough? Do you mean the one in http://www.python.org/pipermail/python-dev/2000-April/005218.html which I suppose is the same one as the "java-like approach"? AFAICT, all it does is to change the default encoding from UTF-8 to Latin-1. I can't follow why this should be *better*, but it would be certainly as good... 
In comparison, restricting the "character" interpretation of the string type (in terms of your proposal) to 7-bit characters has the advantage that it is less error-prone, as Guido points out. Regards, Martin From mal@lemburg.com Tue May 16 23:59:45 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 17 May 2000 00:59:45 +0200 Subject: [Python-Dev] Unicode References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> Message-ID: <3921D2E1.6282AA8F@lemburg.com> Fredrik Lundh wrote: > > of course I can kludge my way around the flaws in MAL's design, > but why should I have to do that? it's broken. fixing it is easy. Look Fredrik, it's not *my* design. All this was discussed in public and in several rounds late last year. If someone made a mistake and "broke" anything, then we all did... I still don't think so, but that's my personal opinion. -- Now to get back to some non-flammable content: Has anyone played around with the latest sys.set_string_encoding() patches ? I would really like to know what you think. The idea behind it is that you can define what the Unicode implementaion is to expect as encoding when it sees an 8-bit string. The encoding is used for coercion, str(unicode) and printing. It is currently *not* used for the "s" parser marker and hash values (mainly due to internal issues). See my patch comments for details. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tim_one@email.msn.com Wed May 17 07:45:59 2000 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 17 May 2000 02:45:59 -0400 Subject: [Python-Dev] join() et al. In-Reply-To: <14625.36940.160373.900909@beluga.mojam.com> Message-ID: <000701bfbfcb$8f6cc600$b52d153f@tim> [Skip Montanaro] > ... > It's not a huge deal to me, but I think it mildly violates the > principle of least surprise when you try to apply it to sequences > of non-strings. When sep.join(seq) was first discussed, half the debate was whether str() should be magically applied to seq's elements. I still favor doing that, as I have often explained the TypeError in e.g. string.join(some_mixed_list_of_strings_and_numbers) to people and agree with their next complaint: their intent was obvious, since string.join *produces* a string. I've never seen an instance of this error that was appreciated (i.e., it never exposed an error in program logic or concept, it's just an anal gripe about an arbitrary and unnatural restriction). Not at all like "42" + 42 where the intent is unknowable. > To extend this into the absurd, what should the following code display? > > class Spam: pass > > eggs = Spam() > bacon = Spam() > toast = Spam() > > print join((eggs,bacon,toast)) Note that we killed the idea of a new builtin join last time around. It's the kind of muddy & gratuitous hypergeneralization Guido will veto if we don't kill it ourselves. That said, space.join((eggs, bacon, toast)) should produce str(egg) + space + str(bacon) + space + str(toast) although how Unicode should fit into all this was never clear to me. > If a join builtin is supposed to be applicable to all types, we need to > decide what the semantics are going to be for all types. See above. 
> Maybe all that needs to happen is that you stringify any non-string > elements before applying the + operator (just one possibility among > many, not necessarily one I recommend). In my experience, that it *doesn't* do that today is a common source of surprise & mild irritation. But I insist that "stringify" return a string in this context, and that "+" is simply shorthand for "string catenation". Generalizing this would be counterproductive. > If you want to limit join's inputs to (or only make it semantically > meaningful for) sequences of strings, then it should probably > not be a builtin, no matter how visually annoying you find > > " ".join(["a","b","c"]) This is one of those "doctor, doctor, it hurts when I stick an onion up my ass!" things . space.join(etc) reads beautifully, and anyone who doesn't spell it that way but hates the above is picking at a scab they don't *want* to heal <0.3 wink>. having-said-nothing-new-he-signs-off-ly y'rs - tim From tim_one@email.msn.com Wed May 17 08:12:27 2000 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 17 May 2000 03:12:27 -0400 Subject: [Python-Dev] Attempt script name with '.py' appended instead of failing? In-Reply-To: Message-ID: <000801bfbfcf$424029e0$b52d153f@tim> [Mark Hammond] > For about the 1,000,000th time in my life (no exaggeration :-), I just > typed "python.exe foo" - I forgot the .py. Mark, is this an Australian thing? That is, you must be the only person on earth (besides a guy I know from New Zealand -- Australia, New Zealand, same thing to American eyes ) who puts ".exe" at the end of "python"! I'm speculating that you think backwards because you're upside-down down there. throwing-another-extension-on-the-barbie-mate-ly y'rs - tim From Fredrik Lundh" <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> <200005162202.AAA02125@loewis.home.cs.tu-berlin.de> Message-ID: <004f01bfbfd3$0dd17a20$34aab5d4@hagrid> Martin v. Loewis wrote: > > perfectionist or not, I only want Python's Unicode support to > > be as intuitive as anything else in Python. as it stands right > > now, Perl and Tcl's Unicode support is intuitive. Python's not. > > I haven't much experience with Perl, but I don't think Tcl is > intuitive in this area. I really think that they got it all wrong. "all wrong"? Tcl works hard to maintain the characters are characters model (implementation level 2), just like Perl. the length of a string is always the number of characters, slicing works as it should, the internal representation is as efficient as you can make it. but yes, they have a somewhat dubious autoconversion mechanism in there. if something isn't valid UTF-8, it's assumed to be Latin-1. scary, huh? not really, if you step back and look at how UTF-8 was designed. quoting from RFC 2279: "UTF-8 strings can be fairly reliably recognized as such by a simple algorithm, i.e. the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length." besides, their design is based on the plan 9 rune stuff. that code was written by the inventors of UTF-8, who has this to say: "There is little a rune-oriented program can do when given bad data except exit, which is unreasonable, or carry on. Originally the conversion routines, described below, returned errors when given invalid UTF, but we found ourselves repeatedly checking for errors and ignoring them. 
We therefore decided to convert a bad sequence to a valid rune and continue processing. "This technique does have the unfortunate property that con- verting invalid UTF byte strings in and out of runes does not preserve the input, but this circumstance only occurs when non-textual input is given to a textual program." so let's see: they aimed for a high level of unicode support (layer 2, stream encodings, and system api encodings, etc), they've based their design on work by the inventors of UTF-8, they have several years of experience using their implementation in real life, and you seriously claim that they got it "all wrong"? that's weird. > AFAICT, all it does is to change the default encoding from UTF-8 > to Latin-1. now you're using "all" in that strange way again... check the archives for the full story (hint: a conceptual design model isn't the same thing as a C implementation) > I can't follow why this should be *better*, but it would be certainly > as good... In comparison, restricting the "character" interpretation > of the string type (in terms of your proposal) to 7-bit characters > has the advantage that it is less error-prone, as Guido points out. the main reason for that is that Python 1.6 doesn't have any way to specify source encodings. add that, so you no longer have to guess what a string *literal* really is, and that problem goes away. but that's something for 1.7. From mal@lemburg.com Wed May 17 09:56:19 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 17 May 2000 10:56:19 +0200 Subject: [Python-Dev] join() et al. References: <000701bfbfcb$8f6cc600$b52d153f@tim> Message-ID: <39225EB3.8D2C9A26@lemburg.com> Tim Peters wrote: > > [Skip Montanaro] > > ... > > It's not a huge deal to me, but I think it mildly violates the > > principle of least surprise when you try to apply it to sequences > > of non-strings. > > When sep.join(seq) was first discussed, half the debate was whether str() > should be magically applied to seq's elements. I still favor doing that, as > I have often explained the TypeError in e.g. > > string.join(some_mixed_list_of_strings_and_numbers) > > to people and agree with their next complaint: their intent was obvious, > since string.join *produces* a string. I've never seen an instance of this > error that was appreciated (i.e., it never exposed an error in program logic > or concept, it's just an anal gripe about an arbitrary and unnatural > restriction). Not at all like > > "42" + 42 > > where the intent is unknowable. Uhm, aren't we discussing a generic sequence join API here ? For strings, I think that " ".join(seq) is just fine... but it would be nice to have similar functionality for other sequence items as well, e.g. for sequences of sequences. > > To extend this into the absurd, what should the following code display? > > > > class Spam: pass > > > > eggs = Spam() > > bacon = Spam() > > toast = Spam() > > > > print join((eggs,bacon,toast)) > > Note that we killed the idea of a new builtin join last time around. It's > the kind of muddy & gratuitous hypergeneralization Guido will veto if we > don't kill it ourselves. We did ? (I must have been too busy hacking Unicode ;-) Well, in that case I'd still be interested in hearing about your thoughts so that I can intergrate such a beast in mxTools. 
The acceptance level needed for doing that is much lower than for the core builtins ;-)

> That said,
>
>     space.join((eggs, bacon, toast))
>
> should produce
>
>     str(egg) + space + str(bacon) + space + str(toast)
>
> although how Unicode should fit into all this was never clear to me.

But that would mask errors and, even worse, "work around" coercion, which is not a good idea, IMHO. Note that the need to coerce to Unicode was the reason why the implicit str() in " ".join() was removed from Barry's original string methods implementation.

space.join(map(str,seq)) is much clearer in this respect: it forces the user to think about what the join should do with non-string types.

> > If a join builtin is supposed to be applicable to all types, we need to
> > decide what the semantics are going to be for all types.
>
> See above.
>
> > Maybe all that needs to happen is that you stringify any non-string
> > elements before applying the + operator (just one possibility among
> > many, not necessarily one I recommend).
>
> In my experience, that it *doesn't* do that today is a common source of
> surprise & mild irritation. But I insist that "stringify" return a string
> in this context, and that "+" is simply shorthand for "string catenation".
> Generalizing this would be counterproductive.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From fdrake@acm.org Wed May 17 15:12:01 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Wed, 17 May 2000 07:12:01 -0700 (PDT)
Subject: [Python-Dev] Unicode
In-Reply-To: <004f01bfbfd3$0dd17a20$34aab5d4@hagrid>
Message-ID:

On Wed, 17 May 2000, Fredrik Lundh wrote:
> the main reason for that is that Python 1.6 doesn't have any way to
> specify source encodings. add that, so you no longer have to guess
> what a string *literal* really is, and that problem goes away. but

You seem to be familiar with the Tcl work, so I'll ask you this question: Does Tcl have a way to specify source encoding? I'm not aware of it, but I've only had time to follow the Tcl world very lightly these past few years. ;)

-Fred

--
Fred L. Drake, Jr.

From Fredrik Lundh" Message-ID: <018101bfc00c$52be3180$34aab5d4@hagrid>

Fred L. Drake wrote:
> On Wed, 17 May 2000, Fredrik Lundh wrote:
> > the main reason for that is that Python 1.6 doesn't have any way to
> > specify source encodings. add that, so you no longer have to guess
> > what a string *literal* really is, and that problem goes away. but
>
> You seem to be familiar with the Tcl work, so I'll ask you
> this question: Does Tcl have a way to specify source encoding?

Tcl has a system encoding (which is used when passing strings through system APIs), and file/channel-specific encodings. (for info on how they initialize the system encoding, see earlier posts). unfortunately, they're using the system encoding also for source code. for portable code, they recommend sticking to ASCII or using "bootstrap scripts", e.g:

    set fd [open "app.tcl" r]
    fconfigure $fd -encoding euc-jp
    set jpscript [read $fd]
    close $fd
    eval $jpscript

we can surely do better in 1.7...
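a minimal sketch of the autoconversion heuristic discussed earlier in this thread (hypothetical helper, assuming the 1.6 unicode() built-in; not anyone's actual proposal):

    def guess_decode(s):
        # try UTF-8 first: by the RFC 2279 argument quoted above, text
        # in other encodings rarely looks like valid UTF-8
        try:
            return unicode(s, 'utf-8')
        except UnicodeError:
            # Latin-1 accepts any byte sequence, so this cannot fail
            return unicode(s, 'latin-1')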
From jeremy@alum.mit.edu Wed May 17 23:38:20 2000
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Wed, 17 May 2000 15:38:20 -0700 (PDT)
Subject: [Python-Dev] Unicode
In-Reply-To: <3921D2E1.6282AA8F@lemburg.com>
References: <200005142139.XAA09615@loewis.home.cs.tu-berlin.de> <005e01bfbf19$ee138ed0$0500a8c0@secret.pythonware.com> <200005161843.UAA01118@loewis.home.cs.tu-berlin.de> <00ed01bfbf6d$41c2f720$34aab5d4@hagrid> <3921D2E1.6282AA8F@lemburg.com>
Message-ID: <14627.8028.887219.978041@localhost.localdomain>

>>>>> "MAL" == M -A Lemburg writes:

  MAL> Fredrik Lundh wrote:
  >> of course I can kludge my way around the flaws in MAL's design,
  >> but why should I have to do that? it's broken. fixing it is easy.

  MAL> Look Fredrik, it's not *my* design. All this was discussed in
  MAL> public and in several rounds late last year. If someone made a
  MAL> mistake and "broke" anything, then we all did... I still don't
  MAL> think so, but that's my personal opinion.

I find it's best to avoid referring to a design as "so-and-so's design" unless you've got something specifically complimentary to say. Using the person's name in combination with some criticism of the design tends to produce a defensive reaction. Perhaps it would help make this discussion less contentious.

Jeremy

From martin@loewis.home.cs.tu-berlin.de Wed May 17 23:55:21 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Thu, 18 May 2000 00:55:21 +0200
Subject: [Python-Dev] Unicode
In-Reply-To: (fdrake@acm.org)
References:
Message-ID: <200005172255.AAA01245@loewis.home.cs.tu-berlin.de>

> You seem to be familiar with the Tcl work, so I'll ask you
> this question: Does Tcl have a way to specify source encoding?
> I'm not aware of it, but I've only had time to follow the Tcl
> world very lightly these past few years. ;)

To my knowledge, no. Tcl (at least 8.3) supports the \u notation for Unicode escapes, and treats all other source code as Latin-1. encoding(n) says

    # However, because the source command always reads files using the
    # ISO8859-1 encoding, Tcl will treat each byte in the file as a
    # separate character that maps to the 00 page in Unicode.

Regards
Martin

From tim_one@email.msn.com Thu May 18 05:34:13 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Thu, 18 May 2000 00:34:13 -0400
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: <3921A4E5.9BDEBF49@tismer.com>
Message-ID: <000301bfc082$51ce0180$6c2d153f@tim>

[Christian Tismer]
> ...
> Then a string should better not be a sequence.
>
> The number of places where I really used the string sequence
> protocol to take advantage of it is outperformed by a factor
> of ten by cases where I missed to tupleise and got a bad
> result. A traceback is better than a sequence here.

Alas, I think

    for ch in string:
        muck w/ the character ch

is a common idiom.

> oh-what-did-I-say-here--duck--but-isn't-it-so--cover-ly y'rs - chris

The "sequenceness" of strings does get in the way often enough. Strings have the amazing property that, since characters are also strings,

    while 1:
        string = string[0]

never terminates with an error. This often manifests as unbounded recursion in generic functions that crawl over nested sequences (the first time you code one of these, you try to stop the recursion on a "is it a sequence?" test, and then someone passes in something containing a string and it descends forever).
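A minimal sketch of how that blows up (hypothetical flatten(); the explicit string guard is the fix):

    def flatten(seq, result=None):
        if result is None:
            result = []
        for item in seq:
            if type(item) is type(''):
                # without this guard, item[0] of a one-character string
                # is the string itself, so the recursion below would
                # never bottom out
                result.append(item)
            elif type(item) in (type(()), type([])):
                flatten(item, result)
            else:
                result.append(item)
        return result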
And we also have that format % values requires "values" to be specifically a tuple rather than any old sequence, else the current "%s" % some_string could be interpreted the wrong way. There may be some hope in that the "for/in" protocol is now conflated with the __getitem__ protocol, so if Python grows a more general iteration protocol, perhaps we could back away from the sequenceness of strings without harming "for" iteration over the characters ... From tim_one@email.msn.com Thu May 18 05:34:05 2000 From: tim_one@email.msn.com (Tim Peters) Date: Thu, 18 May 2000 00:34:05 -0400 Subject: [Python-Dev] join() et al. In-Reply-To: <39225EB3.8D2C9A26@lemburg.com> Message-ID: <000001bfc082$4d9d5020$6c2d153f@tim> [M.-A. Lemburg] > ... > Uhm, aren't we discussing a generic sequence join API here ? It depends on whether your "we" includes me . > Well, in that case I'd still be interested in hearing about > your thoughts so that I can intergrate such a beast in mxTools. > The acceptance level neede for doing that is much lower than > for the core builtins ;-) Heh heh. Python already has a generic sequence join API, called "reduce". What else do you want beyond that? There's nothing else I want, and I don't even want reduce <0.9 wink>. You can mine any modern Lisp, or any ancient APL, for more of this ilk. NumPy has some use for stuff like this, but effective schemes require dealing with multiple dimensions intelligently, and then you're in the proper domain of matrices rather than sequences. > > That said, > > > > space.join((eggs, bacon, toast)) > > > > should produce > > > > str(egg) + space + str(bacon) + space + str(toast) > > > > although how Unicode should fit into all this was never clear to me. > But that would mask errors and, As I said elsewhere in the msg, I have never seen this "error" do anything except irritate a user whose intent was the utterly obvious one (i.e., convert the object to a string, than catenate it). > even worse, "work around" coercion, which is not a good idea, IMHO. > Note that the need to coerce to Unicode was the reason why the > implicit str() in " ".join() was removed from Barry's original string > methods implementation. I'm hoping that in P3K we have only one string type, and then the ambiguity goes away. In the meantime, it's a good reason to drop Unicode support . > space.join(map(str,seq)) is much clearer in this respect: it > forces the user to think about what the join should do with non- > string types. They're producing a string; they want join to turn the pieces into strings; it's a no-brainer unless join is hypergeneralized into terminal obscurity (like, indeed, Python's "reduce"). simple-tools-for-tedious-little-tasks-ly y'rs - tim From tim_one@email.msn.com Thu May 18 05:34:11 2000 From: tim_one@email.msn.com (Tim Peters) Date: Thu, 18 May 2000 00:34:11 -0400 Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)) In-Reply-To: <39219B01.A4EE0920@tismer.com> Message-ID: <000201bfc082$50909f80$6c2d153f@tim> [Christian Tismer] > ... > After all, it is no surprize. They are right. > If we have to change their mind in order to understand > a basic operation, then we are wrong, not they. [Tim] > Huh! I would not have guessed that you'd give up on Stackless > that easily . [Chris] > Noh, I didn't give up Stackless, but fishing for soles. > After Just v. R. has become my most ambitious user, > I'm happy enough. 
I suspect you missed the point: Stackless is the *ultimate* exercise in "changing their mind in order to understand a basic operation". I was tweaking you, just as you're tweaking me . > It is absolutely phantastic. > The most uninteresting stuff in the join is the separator, > and it has the power to merge thousands of strings > together, without asking the sequence at all > - give all power to the suppressed, long live the Python anarchy :-) Exactly! Just as love has the power to bind thousands of incompatible humans without asking them either: a vote for space.join() is a vote for peace on earth. while-a-generic-join-builtin-is-a-vote-for-war-ly y'rs - tim From tim_one@email.msn.com Thu May 18 05:34:17 2000 From: tim_one@email.msn.com (Tim Peters) Date: Thu, 18 May 2000 00:34:17 -0400 Subject: [Python-Dev] Memory woes under Windows In-Reply-To: <000001bfbe40$07f14520$b82d153f@tim> Message-ID: <000401bfc082$54211940$6c2d153f@tim> Just a brief note on the little list-grower I posted. Upon more digging this doesn't appear to have any relation to Dragon's Win98 headaches, so I haven't looked at it much more. Two data points: 1. Gordon McM and I both tried it under NT 4 systems (thanks, G!), and those are the only Windows platforms under which no MemoryError is raised. But the runtime behavior is very clearly quadratic-time (in the ultimate length of the list) under NT. 2. Win98 comes with very few diagnostic tools useful at this level. The Python process does *not* grow to an unreasonable size. However, using a freeware heap walker I quickly determined that Python quickly sprays data *all over* its entire 2Gb virtual heap space while running this thing, and then the memory error occurs. The dump file for the system heap memory blocks (just listing the start address, length, & status of each block) is about 128Kb and I haven't had time to analyze it. It's clearly terribly fragmented, though. The mystery here is why Win98 isn't coalescing all the gazillions of free areas to come with a big- enough contiguous chunk to satisfy the request (according to me , the program doesn't create any long-lived data other than the list -- it appends "1" each time, and uses xrange). Dragon's Win98 woes appear due to something else: right after a Win98 system w/ 64Mb RAM is booted, about half the memory is already locked (not just committed)! Dragon's product needs more than the remaining 32Mb to avoid thrashing. Even stranger, killing every process after booting releases an insignificant amount of that locked memory. Strange too, on my Win98 w/ 160Mb of RAM, upon booting Win98 a massive 50Mb is locked. This is insane, and we haven't been able to figure out on whose behalf all this memory is being allocated. personally-like-win98-a-lot-but-then-i-bought-a-lot-of-ram-ly y'rs - tim From Moshe Zadka Thu May 18 06:36:09 2000 From: Moshe Zadka (Moshe Zadka) Date: Thu, 18 May 2000 08:36:09 +0300 (IDT) Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)) In-Reply-To: <000301bfc082$51ce0180$6c2d153f@tim> Message-ID: [Tim Peters, on sequenceness of strings] > for ch in string: > muck w/ the character ch > > is a common idiom. Hmmmm...if you add a new method, for ch in string.as_sequence(): muck w/ the character ch You'd solve this. But you won't manage to convince me that you haven't used things like string[3:5]+string[6:] to get all the characters that... 
The real problem (as I see it, from my very strange POV) is that Python uses strings for two distinct uses: 1 -- Symbols 2 -- Arrays of characters "Symbols" are ``run-time representation of identifiers''. For example, getattr's "prototype" "should be" getattr(object, symbol, object=None) While re's search method should be re_object.search(string) Of course, there are symbol->string and string->symbol functions, just as there are list->tuple and tuple->list functions. BTW, this would also solve problems if you want to go case-insensitive in Py3K: == is case-sensitive on strings, but case-insensitive on symbols. i've-got-this-on-my-chest-since-the-python-conference-and-it-was-a- good-opportunity-to-get-it-off-ly y'rs, Z. -- Moshe Zadka http://www.oreilly.com/news/prescod_0300.html http://www.linux.org.il -- we put the penguin in .com From ping@lfw.org Thu May 18 05:37:42 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Wed, 17 May 2000 21:37:42 -0700 (PDT) Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:)) In-Reply-To: <000301bfc082$51ce0180$6c2d153f@tim> Message-ID: On Thu, 18 May 2000, Tim Peters wrote: > There may be some hope in that the "for/in" protocol is now conflated with > the __getitem__ protocol, so if Python grows a more general iteration > protocol, perhaps we could back away from the sequenceness of strings > without harming "for" iteration over the characters ... But there's no way we can back away from spam = eggs[hack:chop] + ham[slice:dice] on strings. It's just too ideal. Perhaps eventually the answer will be a character type? Or perhaps no change at all. I've not had the pleasure of running into these problems with characters-being-strings before, even though your survey of the various gotchas now makes that kind of surprising. -- ?!ng "Happiness isn't something you experience; it's something you remember." -- Oscar Levant From mal@lemburg.com Thu May 18 10:43:57 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 18 May 2000 11:43:57 +0200 Subject: [Python-Dev] join() et al. References: <000001bfc082$4d9d5020$6c2d153f@tim> Message-ID: <3923BB5D.47A28CBE@lemburg.com> Tim Peters wrote: > > [M.-A. Lemburg] > > ... > > Uhm, aren't we discussing a generic sequence join API here ? > > It depends on whether your "we" includes me . > > > Well, in that case I'd still be interested in hearing about > > your thoughts so that I can intergrate such a beast in mxTools. > > The acceptance level neede for doing that is much lower than > > for the core builtins ;-) > > Heh heh. Python already has a generic sequence join API, called "reduce". > What else do you want beyond that? There's nothing else I want, and I don't > even want reduce <0.9 wink>. You can mine any modern Lisp, or any ancient > APL, for more of this ilk. NumPy has some use for stuff like this, but > effective schemes require dealing with multiple dimensions intelligently, > and then you're in the proper domain of matrices rather than sequences. The idea behind a generic join() API was that it could be used to make algorithms dealing with sequences polymorph -- but you're right: this goal is probably too far fetched. > > > That said, > > > > > > space.join((eggs, bacon, toast)) > > > > > > should produce > > > > > > str(egg) + space + str(bacon) + space + str(toast) > > > > > > although how Unicode should fit into all this was never clear to me. 
> > > But that would mask errors and, > > As I said elsewhere in the msg, I have never seen this "error" do anything > except irritate a user whose intent was the utterly obvious one (i.e., > convert the object to a string, than catenate it). > > > even worse, "work around" coercion, which is not a good idea, IMHO. > > Note that the need to coerce to Unicode was the reason why the > > implicit str() in " ".join() was removed from Barry's original string > > methods implementation. > > I'm hoping that in P3K we have only one string type, and then the ambiguity > goes away. In the meantime, it's a good reason to drop Unicode support > . I'm hoping for that too... it should be Unicode everywhere if you'd ask me. In the meantime we can test drive this goal using the -U command line option: it turns "" into u"" without any source code change. The fun part about this is that running python in -U mode reveals quite a few places where the standard lib doesn't handle Unicode properly, so there's a lot of work ahead... > > space.join(map(str,seq)) is much clearer in this respect: it > > forces the user to think about what the join should do with non- > > string types. > > They're producing a string; they want join to turn the pieces into strings; > it's a no-brainer unless join is hypergeneralized into terminal obscurity > (like, indeed, Python's "reduce"). Hmm, the Unicode implementation does these implicit conversions during coercion and you've all seen the success... are you sure you want more of this ? We could have "".join() apply str() for all objects *except* Unicode. 1 + "2" == "12" would also be an option, or maybe 1 + "2" == 3 ? ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From jack@oratrix.nl Thu May 18 11:01:16 2000 From: jack@oratrix.nl (Jack Jansen) Date: Thu, 18 May 2000 12:01:16 +0200 Subject: [Python-Dev] hey, who broke the array module? In-Reply-To: Message by Trent Mick , Mon, 15 May 2000 14:09:58 -0700 , <20000515140958.C20418@activestate.com> Message-ID: <20000518100116.F06AB370CF2@snelboot.oratrix.nl> > I broke it with my patches to test overflow for some of the PyArg_Parse*() > formatting characters. The upshot of testing for overflow is that now those > formatting characters ('b', 'h', 'i', 'l') enforce signed-ness or > unsigned-ness as appropriate (you have to know if the value is signed or > unsigned to know what limits to check against for overflow). Two > possibilities presented themselves: I think this is a _very_ bad idea. I have a few thousand (literally) routines calling to Macintosh system calls that use "h" for 16 bit flag-word values, and the constants are all of the form kDoSomething = 0x0001 kDoSomethingElse = 0x0002 ... kDoSomethingEvenMoreBrilliant = 0x8000 I'm pretty sure other operating systems have lots of calls with similar problems. I would strongly suggest using a new format char if you want overflow-tested integers. -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From trentm@activestate.com Thu May 18 17:56:47 2000 From: trentm@activestate.com (Trent Mick) Date: Thu, 18 May 2000 09:56:47 -0700 Subject: [Python-Dev] hey, who broke the array module? 
In-Reply-To: <20000518100116.F06AB370CF2@snelboot.oratrix.nl> References: <20000518100116.F06AB370CF2@snelboot.oratrix.nl> Message-ID: <20000518095647.D32135@activestate.com> On Thu, May 18, 2000 at 12:01:16PM +0200, Jack Jansen wrote: > > I broke it with my patches to test overflow for some of the PyArg_Parse*() > > formatting characters. The upshot of testing for overflow is that now those > > formatting characters ('b', 'h', 'i', 'l') enforce signed-ness or > > unsigned-ness as appropriate (you have to know if the value is signed or > > unsigned to know what limits to check against for overflow). Two > > possibilities presented themselves: > > I think this is a _very_ bad idea. I have a few thousand (literally) routines > calling to Macintosh system calls that use "h" for 16 bit flag-word values, > and the constants are all of the form > > kDoSomething = 0x0001 > kDoSomethingElse = 0x0002 > ... > kDoSomethingEvenMoreBrilliant = 0x8000 > > I'm pretty sure other operating systems have lots of calls with similar > problems. I would strongly suggest using a new format char if you want > overflow-tested integers. Sigh. What do you think Guido? This is your call. 1. go back to no bounds testing 2. bounds check for [SHRT_MIN, USHRT_MAX] etc (this would allow signed and unsigned values but is sort of false security for bounds checking) 3. keep it the way it is: 'b' is unsigned and the rest are signed 4. add new format characters or a modifying character for signed and unsigned versions of these. Trent -- Trent Mick trentm@activestate.com From guido@python.org Thu May 18 23:05:45 2000 From: guido@python.org (Guido van Rossum) Date: Thu, 18 May 2000 15:05:45 -0700 Subject: [Python-Dev] hey, who broke the array module? In-Reply-To: Your message of "Thu, 18 May 2000 09:56:47 PDT." <20000518095647.D32135@activestate.com> References: <20000518100116.F06AB370CF2@snelboot.oratrix.nl> <20000518095647.D32135@activestate.com> Message-ID: <200005182205.PAA12830@cj20424-a.reston1.va.home.com> > On Thu, May 18, 2000 at 12:01:16PM +0200, Jack Jansen wrote: > > > I broke it with my patches to test overflow for some of the PyArg_Parse*() > > > formatting characters. The upshot of testing for overflow is that now those > > > formatting characters ('b', 'h', 'i', 'l') enforce signed-ness or > > > unsigned-ness as appropriate (you have to know if the value is signed or > > > unsigned to know what limits to check against for overflow). Two > > > possibilities presented themselves: > > > > I think this is a _very_ bad idea. I have a few thousand (literally) routines > > calling to Macintosh system calls that use "h" for 16 bit flag-word values, > > and the constants are all of the form > > > > kDoSomething = 0x0001 > > kDoSomethingElse = 0x0002 > > ... > > kDoSomethingEvenMoreBrilliant = 0x8000 > > > > I'm pretty sure other operating systems have lots of calls with similar > > problems. I would strongly suggest using a new format char if you want > > overflow-tested integers. > > Sigh. What do you think Guido? This is your call. > > 1. go back to no bounds testing > 2. bounds check for [SHRT_MIN, USHRT_MAX] etc (this would allow signed and > unsigned values but is sort of false security for bounds checking) > 3. keep it the way it is: 'b' is unsigned and the rest are signed > 4. add new format characters or a modifying character for signed and unsigned > versions of these. Sigh indeed. Ideally, we'd introduce H for unsigned and then lock Jack in a room with his Macintosh computer for 48 hours to fix all his code... 
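A minimal illustration of the clash, assuming 16-bit C shorts (the numbers here are illustrative, not taken from Trent's patch):

    kDoSomethingEvenMoreBrilliant = 0x8000   # 32768
    SHRT_MAX = 32767     # upper bound enforced by a signed-only "h"
    USHRT_MAX = 65535    # what a 16-bit flag word can actually hold

    print kDoSomethingEvenMoreBrilliant > SHRT_MAX      # 1: signed check balks
    print kDoSomethingEvenMoreBrilliant <= USHRT_MAX    # 1: fits as unsigned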
Jack, what do you think? Is this acceptable? (I don't know if you're still into S&M :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From trentm@activestate.com Thu May 18 21:38:59 2000 From: trentm@activestate.com (Trent Mick) Date: Thu, 18 May 2000 13:38:59 -0700 Subject: [Python-Dev] hey, who broke the array module? In-Reply-To: <200005182249.PAA13020@cj20424-a.reston1.va.home.com> References: <20000518100116.F06AB370CF2@snelboot.oratrix.nl> <20000518095647.D32135@activestate.com> <200005182205.PAA12830@cj20424-a.reston1.va.home.com> <20000518121723.A3252@activestate.com> <200005182225.PAA12950@cj20424-a.reston1.va.home.com> <20000518123029.A3330@activestate.com> <200005182249.PAA13020@cj20424-a.reston1.va.home.com> Message-ID: <20000518133859.A3665@activestate.com> On Thu, May 18, 2000 at 03:49:59PM -0700, Guido van Rossum wrote: > > Maybe we can come up with a modifier for signed or unsigned range > checking? Ha! How about 'u'? :) Or 's'? :) I really can't think of a nice answer for this. Could introduce completely separate formatter characters that do the range checking and remove range checking from the current formatters. That is an ugly kludge. Could introduce a separate PyArg_CheckedParse*() or something like that and slowly migrate to it. This one could use something other than "L" for LONG_LONG. I think the long term solution should be: - have bounds-checked signed and unsigned version of all the integral types - call then i/I, b/B, etc. (a la array module) - use something other than "L" for LONG_LONG (as you said, q/Q maybe) The problem is to find a satisfactory migratory path to that. Sorry, I don't have an answer. Just more questions. Trent p.s. If you were going to check in my associate patch I have a problem in the tab usage in test_array.py which I will resubmit soon (couple of days). -- Trent Mick trentm@activestate.com From guido@python.org Fri May 19 16:06:52 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 19 May 2000 08:06:52 -0700 Subject: [Python-Dev] repr vs. str and locales again Message-ID: <200005191506.IAA00794@cj20424-a.reston1.va.home.com> The email below suggests a simple solution to a problem that e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns all non-ASCII chars into \oct escapes. Jyrki's solution: use isprint(), which makes it locale-dependent. I can live with this. It needs a Py_CHARMASK() call but otherwise seems to be fine. Anybody got an opinion on this? I'm +0. I would even be +0 on a similar patch for unicode strings (once the ASCII proposal is implemented). --Guido van Rossum (home page: http://www.python.org/~guido/) ------- Forwarded Message Date: Fri, 19 May 2000 10:48:29 +0300 From: Jyrki Kuoppala To: guido@python.org Subject: python bug?: python 1.5.2 fails to print printable 8-bit characters in strings I'm not sure if this exactly is a bug, ie. whether python 1.5.2 is supposed to support locales and 8-bit characters. However, on Linux Debian "unstable" distribution the diff below makes python 1.5.2 handle printable 8-bit characters as one would expect. Problem description: python doesn't properly print printable 8-bit characters for the current locale . 
Details:

With no locale set, 8-bit characters in quoted strings print as backslash-escapes, which I guess is OK:

    $ unset LC_ALL
    $ python
    Python 1.5.2 (#0, Apr 3 2000, 14:46:48) [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
    Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
    >>> a=('foo','kääk')
    >>> print a
    ('foo', 'k\344\344k')
    >>>

But with a locale with a printable 'ä' character (octal 344) I get:

    $ export LC_ALL=fi_FI
    $ python
    Python 1.5.2 (#0, Apr 3 2000, 14:46:48) [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
    Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
    >>> a=('foo','kääk')
    >>> print a
    ('foo', 'k\344\344k')
    >>>

I should be getting (output from python patched with the enclosed patch):

    $ export LC_ALL=fi_FI
    $ python
    Python 1.5.2 (#0, May 18 2000, 14:43:46) [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
    Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
    >>> a=('foo','kääk')
    >>> print a
    ('foo', 'kääk')
    >>>

This hits for example when Zope with squishdot weblog (squishdot 0.3.2-3 with zope 2.1.6-1) creates a text index from posted articles - strings with valid Latin1 characters get indexed as backslash-escaped octal codes, and thus become unsearchable.

I am using debian unstable, kernels 2.2.15pre10 and 2.0.36, libc 2.1.3.

I suggest that the test for printability in python-1.5.2 /Objects/stringobject.c be fixed to use isprint() which takes the locale into account:

--- python-1.5.2/Objects/stringobject.c.orig	Thu Oct  8 05:17:48 1998
+++ python-1.5.2/Objects/stringobject.c	Thu May 18 14:36:28 2000
@@ -224,7 +224,7 @@
 		c = op->ob_sval[i];
 		if (c == quote || c == '\\')
 			fprintf(fp, "\\%c", c);
-		else if (c < ' ' || c >= 0177)
+		else if (! isprint (c))
 			fprintf(fp, "\\%03o", c & 0377);
 		else
 			fputc(c, fp);
@@ -260,7 +260,7 @@
 		c = op->ob_sval[i];
 		if (c == quote || c == '\\')
 			*p++ = '\\', *p++ = c;
-		else if (c < ' ' || c >= 0177) {
+		else if (! isprint (c)) {
 			sprintf(p, "\\%03o", c & 0377);
 			while (*p != '\0')
 				p++;

//Jyrki

------- End of Forwarded Message

From guido@python.org Fri May 19 16:13:01 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 19 May 2000 08:13:01 -0700
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: Your message of "Fri, 19 May 2000 11:25:43 +0200." <39250897.6F42@cnet.francetelecom.fr>
References: <92F3F78F2E523B81.794E00EE6EFC8B37.2D5DBFEF2B39A7A2@lp.airnews.net> <39242D1B.78773AA2@python.org> <39250897.6F42@cnet.francetelecom.fr>
Message-ID: <200005191513.IAA00818@cj20424-a.reston1.va.home.com>

[Quoting the entire mail because I've added python-dev to the cc: list]

> Subject: Re: Python multiplexing is too hard (was: Network statistics program)
> From: Alexandre Ferrieux
> To: Guido van Rossum
> Cc: claird@starbase.neosoft.com
> Date: Fri, 19 May 2000 11:25:43 +0200
> Delivery-Date: Fri May 19 05:26:59 2000
>
> Guido van Rossum wrote:
> >
> > Cameron Laird wrote:
> > .
> > > Right. asyncore is nice--but restricted to socket
> > > connections. For many applications, that's not a
> > > restriction at all. However, it'd be nice to have
> > > such a handy interface for communication with
> > > same-host processes; that's why I mentioned popen*().
> > > Does no one else perceive a gap there, in convenient
> > > asynchronous piped IPC? Do folks just fall back on
> > > select() for this case?
> >
> > Hm, really? For same-host processes, threads would
> > do the job nicely I'd say.
>
> Overkill.
>
> > Or you could probably
> > use unix domain sockets (popen only really works on
> > Unix, so that's not much of a restriction).
>
> Overkill.
>
> > Also note that often this is needed in the context
> > of a GUI app; there something integrated in the GUI
> > main loop is recommended. (E.g. the file events that
> > Moshe mentioned.)
>
> Okay so your answer is, The Python Way of doing it is to use Tcl.
> That's pretty disappointing, I'm sorry to say...
>
> Consider:
>
> - In Tcl, as you said, this is nicely integrated with the GUI's
>   event queue:
>   - on unix, by an additional bit on X's fd (socket) in
>     the select()
>   - on 'doze, everything is brought back to messages
>     anyway.
>
> And, in both cases, it works with pipes, sockets, serial or other
> devices. Uniform, clean.
>
> - In python "popen only really works on Unix": are you satisfied with
>   that state of affairs ? I understand (and value) Python's focus on
>   algorithms and data structures, and worming around OS misgivings is a
>   boring, ancillary task. But what about the potential gain ?
>
> I'm an oldtime Tcler, firmly decided to switch to Python, 'cause it is
> just so beautiful inside. But while Tcl is weaker in the algorithms, it
> is stronger in the os-wrapping library, and taught me to love high-level
> abstractions. [fileevent] shines in this respect, and I'll miss it in
> Python.
>
> -Alex

Alex, it's disappointing to me too! There just isn't anything currently in the library to do this, and I haven't written apps that need this often enough to have a good feel for what kind of abstraction is needed. However perhaps we can come up with a design for something better? Do you have a suggestion here?

I agree with your comment that higher-level abstractions around OS stuff are needed -- I learned system programming long ago, in C, and I'm "happy enough" with the current state of affairs, but I agree that for many people this is a problem, and there's no reason why Python couldn't do better...

--Guido van Rossum (home page: http://www.python.org/~guido/)

From fredrik@pythonware.com Fri May 19 13:44:55 2000
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Fri, 19 May 2000 14:44:55 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
Message-ID: <002f01bfc190$09870c00$0500a8c0@secret.pythonware.com>

Guido van Rossum wrote:
> Jyrki's solution: use isprint(), which makes it locale-dependent.
> I can live with this.
>
> It needs a Py_CHARMASK() call but otherwise seems to be fine.
>
> Anybody got an opinion on this? I'm +0. I would even be +0 on a
> similar patch for unicode strings (once the ASCII proposal is
> implemented).

does ctype-related locale stuff really mix well with unicode? if yes, -0. if no, +0.

(intuitively, I'd say no -- deprecate in 1.6, remove in 1.7)

(btw, what about "eval(repr(s)) == s" ?)

From mal@lemburg.com Fri May 19 13:30:08 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 19 May 2000 14:30:08 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>
Message-ID: <392533D0.965E47E4@lemburg.com>

Guido van Rossum wrote:
>
> The email below suggests a simple solution to a problem that
> e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns
> all non-ASCII chars into \oct escapes. Jyrki's solution: use
> isprint(), which makes it locale-dependent. I can live with this.
>
> It needs a Py_CHARMASK() call but otherwise seems to be fine.
> > Anybody got an opinion on this? I'm +0. I would even be +0 on a > similar patch for unicode strings (once the ASCII proposal is > implemented). The subject line is a bit misleading: the patch only touches tp_print, not repr() output. And this is good, IMHO, since otherwise eval(repr(string)) wouldn't necessarily result in string. Unicode objects don't implement a tp_print slot... perhaps they should ? -- About the ASCII proposal: Would you be satisfied with what import sys sys.set_string_encoding('ascii') currently implements ? There are several places where an encoding comes into play with the Unicode implementation. The above API currently changes str(unicode), print unicode and the assumption made by the implementation during coercion of strings to Unicode. It does not change the encoding used to implement the "s" or "t" parser markers and also doesn't change the way the Unicode hash value is computed (these are currently still hard-coded as UTF-8). -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From gward@mems-exchange.org Fri May 19 13:45:12 2000 From: gward@mems-exchange.org (Greg Ward) Date: Fri, 19 May 2000 08:45:12 -0400 Subject: [Python-Dev] repr vs. str and locales again In-Reply-To: <200005191506.IAA00794@cj20424-a.reston1.va.home.com>; from guido@python.org on Fri, May 19, 2000 at 08:06:52AM -0700 References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com> Message-ID: <20000519084511.A14717@mems-exchange.org> On 19 May 2000, Guido van Rossum said: > The email below suggests a simple solution to a problem that > e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns > all non-ASCII chars into \oct escapes. Jyrki's solution: use > isprint(), which makes it locale-dependent. I can live with this. For "ASCII" strings in this day and age -- which are often not necessarily plain ol' 7-bit ASCII -- I'd say that "32 <= c <= 127" is not the right way to determine printability. 'isprint()' seems much more appropriate to me. Are there other areas of Python that should be locale-sensitive but aren't? A minor objection to this patch is that it's a creeping change that brings in a little bit of locale-sensitivity without addressing a (possibly) wider problem. However, I will immediately shoot down my own objection on the grounds that if we try to fix everything all at once, then nothing will ever get fixed. Locale sensitivity strikes me as the sort of thing that *can* be a "creeping" change -- just fix the bits that bug people most, and eventually all the important bits will be fixed. I have no expertise and therefore no opinion on such a change for Unicode strings. Greg From pf@artcom-gmbh.de Fri May 19 13:44:00 2000 From: pf@artcom-gmbh.de (Peter Funk) Date: Fri, 19 May 2000 14:44:00 +0200 (MEST) Subject: [Python-Dev] repr vs. str and locales again In-Reply-To: <200005191506.IAA00794@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 19, 2000 8: 6:52 am" Message-ID: Guido van Rossum asks: > The email below suggests a simple solution to a problem that > e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns > all non-ASCII chars into \oct escapes. Jyrki's solution: use > isprint(), which makes it locale-dependent. I can live with this. How portable is the locale awareness property of 'is_print' among traditional Unix environments, WinXX and MacOS? 
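A concrete version of that worry (a sketch; the behaviour shown assumes an interpreter built with the proposed patch):

    import locale
    locale.setlocale(locale.LC_ALL, '')   # adopt whatever the environment says
    s = 'k\344\344k'
    print (s,)   # whether \344 prints as a character or as an octal
                 # escape now depends on the platform's locale tables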
This works fine on my favorite development platform (Linux), but an accidental use of this new 'feature' might hurt the portability of my Python apps to other platforms. If 'is_print' honors the locale in a similar way on other important platforms I would like this. Otherwise I would prefer the current behaviour so that I can deal with it during the early stages of development on my Linux boxes. Regards, Peter -- Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260 office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen) From bwarsaw@python.org Fri May 19 19:51:23 2000 From: bwarsaw@python.org (Barry A. Warsaw) Date: Fri, 19 May 2000 11:51:23 -0700 (PDT) Subject: [Python-Dev] repr vs. str and locales again References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com> <20000519084511.A14717@mems-exchange.org> Message-ID: <14629.36139.735410.272339@localhost.localdomain> >>>>> "GW" == Greg Ward writes: GW> Locale sensitivity strikes me as the sort of thing that *can* GW> be a "creeping" change -- just fix the bits that bug people GW> most, and eventually all the important bits will be fixed. Another decidedly ignorant Anglophone here, but one problem that I see with localizing stuff is that locale is app- (or at least thread-) global, isn't it? That would suck for applications like Mailman which are (going to be) multilingual in the sense that a single instance of the application will serve up documents in many languages, as opposed to serving up documents in just one of a choice of languages. If it seems I don't know what I'm talking about, you're probably right. I just wanted to point out that there are applications have to deal with many languages at the same time. -Barry From Fredrik Lundh" <20000519084511.A14717@mems-exchange.org> <14629.36139.735410.272339@localhost.localdomain> Message-ID: <00e001bfc1b1$d0c1d7c0$34aab5d4@hagrid> Barry Warsaw wrote: > Another decidedly ignorant Anglophone here, but one problem that I see > with localizing stuff is that locale is app- (or at least thread-) > global, isn't it? That would suck for applications like Mailman which > are (going to be) multilingual in the sense that a single instance of > the application will serve up documents in many languages, as opposed > to serving up documents in just one of a choice of languages. > > If it seems I don't know what I'm talking about, you're probably > right. I just wanted to point out that there are applications have to > deal with many languages at the same time. Applications may also have to deal with output devices (i.e. GUI toolkits, printers, communication links) that don't necessarily have the same restrictions as the "default console". better do it the right way: deal with encodings at the boundaries, not inside the application. From gward@mems-exchange.org Fri May 19 18:03:18 2000 From: gward@mems-exchange.org (Greg Ward) Date: Fri, 19 May 2000 13:03:18 -0400 Subject: [Python-Dev] Dynamic linking problem on Solaris Message-ID: <20000519130317.A16111@mems-exchange.org> Hi all -- interesting problem with building Robin Dunn's extension for BSD DB 2.x as a shared object on Solaris 2.6 for Python 1.5.2 with GCC 2.8.1 and Sun's linker. (Yes, all of those things seem to matter.) DB 2.x (well, at least 2.7.7) contains this line of C code: *mbytesp = sb.st_size / MEGABYTE; where 'sb' is a 'struct stat' -- ie. 'sb.st_size' is a long long, which I believe is 64 bits on Solaris. 
Anyways, GCC compiles this division into a subroutine call -- I guess the SPARC doesn't have a 64-bit divide, or if it does then GCC doesn't know about it. Of course, the subroutine in question -- '__cmpdi2' -- is defined in libgcc.a. So if you write a C application that uses BSD DB 2.x, and compile and link it with GCC, no problem -- everything is controlled by GCC, so libgcc.a gets linked in at the appropriate time, the linker finds '__cmpdi2' and includes it in your binary executable, and everything works. However, if you're building a Python extension that uses BSD DB 2.x, there's a problem: the default command for creating a shared extension on Solaris is "ld -G" -- this is in Python's Makefile, so it affects extension building with either Makefile.pre.in or the Distutils. However, since "ld" is Sun's "ld", it doesn't know anything about libgcc.a. And, since presumably no 64-bit division is done in Python itself, '__cmpdi2' isn't already present in the Python binary. The result: when you attempt to load the extension, you die: $ python -c "import dbc" Traceback (innermost last): File "", line 1, in ? ImportError: ld.so.1: python: fatal: relocation error: file ./dbcmodule.so: symbol __cmpdi2: referenced symbol not found The workaround turns out to be fairly easy, and there are actually two of them. First, add libgcc.a to the link command, ie. instead of ld -G db_wrap.o -L/usr/local/BerkeleyDB/lib -ldb -o dbcmodule.so use ld -G db_wrap.o -L/usr/local/BerkeleyDB/lib -ldb \ /depot/gnu/plat/lib/gcc-lib/sparc-sun-solaris2.6/2.8.1/libgcc.a \ -o dbcmodule.so (where the location of libgcc.a is variable, but invariably hairy). Or, it turns out that you can just use "gcc -G" to create the extension: gcc -G db_wrap.o -ldb -o dbcmodule.so Seems to me that the latter is a no-brainer. So the question arises: why is the default command for building extensions on Solaris "ld -G" instead of "gcc -G"? I'm inclined to go edit my installed Makefile to make this permanent... what will that break? Greg -- Greg Ward - software developer gward@mems-exchange.org MEMS Exchange / CNRI voice: +1-703-262-5376 Reston, Virginia, USA fax: +1-703-262-5367 From bwarsaw@python.org Fri May 19 21:09:09 2000 From: bwarsaw@python.org (bwarsaw@python.org) Date: Fri, 19 May 2000 13:09:09 -0700 (PDT) Subject: [Python-Dev] repr vs. str and locales again References: <200005191506.IAA00794@cj20424-a.reston1.va.home.com> <20000519084511.A14717@mems-exchange.org> <14629.36139.735410.272339@localhost.localdomain> <00e001bfc1b1$d0c1d7c0$34aab5d4@hagrid> Message-ID: <14629.40805.180119.929694@localhost.localdomain> >>>>> "FL" == Fredrik Lundh writes: FL> better do it the right way: deal with encodings at the FL> boundaries, not inside the application. Sounds good to me. :) From ping@lfw.org Fri May 19 18:04:18 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Fri, 19 May 2000 10:04:18 -0700 (PDT) Subject: [Python-Dev] repr vs. str and locales again In-Reply-To: Message-ID: On Fri, 19 May 2000, Ka-Ping Yee wrote: > > Changing the behaviour of repr() (a function that internally > converts data into data) Clarification: what i meant by the above is, repr() is not explicitly an input or an output function. It does "some internal computation". 
Here is one alternative: repr(obj, **kw): options specified in kw dict push each element in kw dict into sys.repr_options now do the normal conversion, referring to whatever options are relevant (such as "locale" if doing strings) for looking up any option, first check kw dict, then look for sys.repr_options[option] restore sys.repr_options This is ugly and i still like printon/printout better, but at least it's a smaller change and won't prevent the implementation of printon/printout later. This suggestion is not thread-safe. -- ?!ng "Simple, yet complex." -- Lenore Snell From ping@lfw.org Fri May 19 17:56:50 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Fri, 19 May 2000 09:56:50 -0700 (PDT) Subject: [Python-Dev] repr vs. str and locales again In-Reply-To: <200005191506.IAA00794@cj20424-a.reston1.va.home.com> Message-ID: On Fri, 19 May 2000, Guido van Rossum wrote: > The email below suggests a simple solution to a problem that > e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns > all non-ASCII chars into \oct escapes. Jyrki's solution: use > isprint(), which makes it locale-dependent. I can live with this. Changing the behaviour of repr() (a function that internally converts data into data) based on a fixed global system parameter makes me uncomfortable. Wouldn't it make more sense for the locale business to be a property of the stream that the string is being printed on? This was the gist of my proposal for files having a printout method a while ago. I understand if that proposal is a bit too much of a change to swallow at once, but i'd like to ensure the door stays open to let it be possible in the future. Surely there are other language systems that deal with the issue of "nicely" printing their own data structures for human interpretation... anyone have any experience to share? The printout/printon thing originally comes from Smalltalk, i believe. (...which reminds me -- i played with Squeak the other day and thought to myself, it would be cool to browse and edit code in Python with a system browser like that.) Note, however: > This hits for example when Zope with squishdot weblog (squishdot > 0.3.2-3 with zope 2.1.6-1) creates a text index from posted articles - > strings with valid Latin1 characters get indexed as backslash-escaped > octal codes, and thus become unsearchable. The above comment in particular strikes me as very fishy. How on earth can the escaping behaviour of repr() affect the indexing of text? Surely when you do a search, you search for exactly what you asked for. And does the above mean that, with Jyrki's proposed fix, the sorting and searching behaviour of Squishdot will suddenly change, and magically differ from locale to locale? Is that something we want? (That last is not a rhetorical question -- my gut says no, but i don't actually have enough experience working with these issues to know the answer.) -- ?!ng "Simple, yet complex." -- Lenore Snell From mal@lemburg.com Fri May 19 20:06:24 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 19 May 2000 21:06:24 +0200 Subject: [Python-Dev] repr vs. str and locales again References: Message-ID: <392590B0.5CA4F31D@lemburg.com> Ka-Ping Yee wrote: > > On Fri, 19 May 2000, Guido van Rossum wrote: > > The email below suggests a simple solution to a problem that > > e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns > > all non-ASCII chars into \oct escapes. Jyrki's solution: use > > isprint(), which makes it locale-dependent. I can live with this. 
> Changing the behaviour of repr() (a function that internally
> converts data into data) based on a fixed global system parameter
> makes me uncomfortable. Wouldn't it make more sense for the
> locale business to be a property of the stream that the string
> is being printed on?

Umm, Jyrki's patch does *not* affect repr(): it's a patch to the string_print API which is used for the tp_print slot, so the only effect to be seen is when printing a string to a real file object (tp_print is only used by PyObject_Print() and that API is only used for writing to real PyFileObjects -- all other streams get the output of str() or repr()).

Perhaps we should drop tp_print for strings altogether and let str() and repr() decide what to do... (this is what Unicode objects do). The only good reason for implementing tp_print is to write huge amounts of data to a stream without creating intermediate objects -- not really needed for strings, since these *are* the intermediate object usually created for just this purpose ;-)

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From jeremy@alum.mit.edu Sat May 20 01:46:11 2000
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Fri, 19 May 2000 17:46:11 -0700 (PDT)
Subject: [Python-Dev] HTTP/1.1 capable httplib module
Message-ID: <14629.57427.9434.623247@localhost.localdomain>

I applied the recent changes to the CVS httplib to Greg's httplib (call it httplib11) this afternoon. The result is included below. I think this is quite close to checking in, but it could use a slightly better test suite.

There are a few outstanding questions.

httplib11 does not implement the debuglevel feature. I don't think it's important, but it is currently documented and may be used. Guido, should we implement it?

httplib w/SSL uses a constructor with this prototype:

    def __init__(self, host='', port=None, **x509):

It looks like the x509 dictionary should contain two variables -- key_file and cert_file. Since we know what the entries are, why not make them explicit?

    def __init__(self, host='', port=None, cert_file=None, key_file=None):

(Or reverse the two arguments if that is clearer.)

The FakeSocket class in CVS has a comment after the makefile def line that says "hopefully, never have to write." It won't do at all the right thing when called with a write mode, so it ought to raise an exception. Any reason it doesn't?

I'd like to add a couple of test cases that use HTTP/1.1 to get some pages from python.org, including one that uses the chunked encoding. Just haven't gotten around to it. Question on that front: Does it make sense to incorporate the test function in the module with the std regression test suite? In general, I would think so. In this particular case, the test could fail because of host networking problems. I think that's okay as long as the error message is clear enough.

Jeremy

"""HTTP/1.1 client library"""

# Written by Greg Stein.
import socket
import string
import mimetools

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO

error = 'httplib.error'

HTTP_PORT = 80
HTTPS_PORT = 443


class HTTPResponse(mimetools.Message):

    __super_init = mimetools.Message.__init__

    def __init__(self, fp, version, errcode):
        self.__super_init(fp, 0)

        if version == 'HTTP/1.0':
            self.version = 10
        elif version[:7] == 'HTTP/1.':
            self.version = 11   # use HTTP/1.1 code for HTTP/1.x where x>=1
        else:
            raise error, 'unknown HTTP protocol'

        # are we using the chunked-style of transfer encoding?
        tr_enc = self.getheader('transfer-encoding')
        if tr_enc:
            if string.lower(tr_enc) != 'chunked':
                raise error, 'unknown transfer-encoding'
            self.chunked = 1
            self.chunk_left = None
        else:
            self.chunked = 0

        # will the connection close at the end of the response?
        conn = self.getheader('connection')
        if conn:
            conn = string.lower(conn)
            # a "Connection: close" will always close the connection. if we
            # don't see that and this is not HTTP/1.1, then the connection
            # will close unless we see a Keep-Alive header.
            self.will_close = string.find(conn, 'close') != -1 or \
                              ( self.version != 11 and \
                                not self.getheader('keep-alive') )
        else:
            # for HTTP/1.1, the connection will always remain open
            # otherwise, it will remain open IFF we see a Keep-Alive header
            self.will_close = self.version != 11 and \
                              not self.getheader('keep-alive')

        # do we have a Content-Length?
        # NOTE: RFC 2616, S4.4, #3 says we ignore this if tr_enc is "chunked"
        length = self.getheader('content-length')
        if length and not self.chunked:
            self.length = int(length)
        else:
            self.length = None

        # does the body have a fixed length? (of zero)
        if (errcode == 204 or           # No Content
            errcode == 304 or           # Not Modified
            100 <= errcode < 200):      # 1xx codes
            self.length = 0

        # if the connection remains open, and we aren't using chunked, and
        # a content-length was not provided, then assume that the connection
        # WILL close.
        if not self.will_close and \
           not self.chunked and \
           self.length is None:
            self.will_close = 1

        # if there is no body, then close NOW. read() may never be
        # called, thus we will never mark self as closed.
        if self.length == 0:
            self.close()

    def close(self):
        if self.fp:
            self.fp.close()
            self.fp = None

    def isclosed(self):
        # NOTE: it is possible that we will not ever call self.close(). This
        #       case occurs when will_close is TRUE, length is None, and we
        #       read up to the last byte, but NOT past it.
        #
        # IMPLIES: if will_close is FALSE, then self.close() will ALWAYS be
        #          called, meaning self.isclosed() is meaningful.
        return self.fp is None

    def read(self, amt=None):
        if self.fp is None:
            return ''

        if self.chunked:
            chunk_left = self.chunk_left
            value = ''
            while 1:
                if chunk_left is None:
                    line = self.fp.readline()
                    i = string.find(line, ';')
                    if i >= 0:
                        line = line[:i]     # strip chunk-extensions
                    chunk_left = string.atoi(line, 16)
                    if chunk_left == 0:
                        break
                if amt is None:
                    value = value + self.fp.read(chunk_left)
                elif amt < chunk_left:
                    value = value + self.fp.read(amt)
                    self.chunk_left = chunk_left - amt
                    return value
                elif amt == chunk_left:
                    value = value + self.fp.read(amt)
                    self.fp.read(2)     # toss the CRLF at the end of the chunk
                    self.chunk_left = None
                    return value
                else:
                    value = value + self.fp.read(chunk_left)
                    amt = amt - chunk_left

                # we read the whole chunk, get another
                self.fp.read(2)     # toss the CRLF at the end of the chunk
                chunk_left = None

            # read and discard trailer up to the CRLF terminator
            ### note: we shouldn't have any trailers!
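            # (Wire format, for reference: each chunk is
            # "<hex-size>[;extensions] CRLF <chunk-data> CRLF"; a zero
            # size marks the last chunk, which may be followed by
            # trailer headers and a final CRLF -- see RFC 2616, S3.6.1.)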
            while 1:
                line = self.fp.readline()
                if line == '\r\n':
                    break

            # we read everything; close the "file"
            self.close()

            return value

        elif amt is None:
            # unbounded read
            if self.will_close:
                s = self.fp.read()
            else:
                s = self.fp.read(self.length)
            self.close()        # we read everything
            return s

        if self.length is not None:
            if amt > self.length:
                # clip the read to the "end of response"
                amt = self.length
            self.length = self.length - amt

        s = self.fp.read(amt)

        # close our "file" if we know we should
        ### I'm not sure about the len(s) < amt part; we should be
        ### safe because we shouldn't be using non-blocking sockets
        if self.length == 0 or len(s) < amt:
            self.close()

        return s


class HTTPConnection:

    _http_vsn = 11
    _http_vsn_str = 'HTTP/1.1'

    response_class = HTTPResponse
    default_port = HTTP_PORT

    def __init__(self, host, port=None):
        self.sock = None
        self.response = None
        self._set_hostport(host, port)

    def _set_hostport(self, host, port):
        if port is None:
            i = string.find(host, ':')
            if i >= 0:
                port = int(host[i+1:])
                host = host[:i]
            else:
                port = self.default_port
        self.host = host
        self.port = port
        self.addr = host, port

    def connect(self):
        """Connect to the host and port specified in __init__."""
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.connect(self.addr)

    def close(self):
        """Close the connection to the HTTP server."""
        if self.sock:
            self.sock.close()   # close it manually... there may be other refs
            self.sock = None
        if self.response:
            self.response.close()
            self.response = None

    def send(self, str):
        """Send `str' to the server."""
        if self.sock is None:
            self.connect()

        # send the data to the server. if we get a broken pipe, then close
        # the socket. we want to reconnect when somebody tries to send again.
        #
        # NOTE: we DO propagate the error, though, because we cannot simply
        #       ignore the error... the caller will know if they can retry.
        try:
            self.sock.send(str)
        except socket.error, v:
            if v[0] == 32:      # Broken pipe
                self.close()
            raise

    def putrequest(self, method, url):
        """Send a request to the server.

        `method' specifies an HTTP request method, e.g. 'GET'.
        `url' specifies the object being requested, e.g. '/index.html'.
        """

        if self.response is not None:
            if not self.response.isclosed():
                ### implies half-duplex!
                raise error, 'prior response has not been fully handled'
            self.response = None

        if not url:
            url = '/'
        str = '%s %s %s\r\n' % (method, url, self._http_vsn_str)

        try:
            self.send(str)
        except socket.error, v:
            if v[0] != 32:      # Broken pipe
                raise
            # try one more time (the socket was closed; this will reopen)
            self.send(str)

        self.putheader('Host', self.host)

        if self._http_vsn == 11:
            # Issue some standard headers for better HTTP/1.1 compliance

            # note: we are assuming that clients will not attempt to set
            #       these headers since *this* library must deal with the
            #       consequences. this also means that when the supporting
            #       libraries are updated to recognize other forms, then this
            #       code should be changed (removed or updated).

            # we only want a Content-Encoding of "identity" since we don't
            # support encodings such as x-gzip or x-deflate.
            self.putheader('Accept-Encoding', 'identity')

            # we can accept "chunked" Transfer-Encodings, but no others
            # NOTE: no TE header implies *only* "chunked"
            #self.putheader('TE', 'chunked')

            # if TE is supplied in the header, then it must appear in a
            # Connection header.
            #self.putheader('Connection', 'TE')

        else:
            # For HTTP/1.0, the server will assume "not chunked"
            pass

    def putheader(self, header, value):
        """Send a request header line to the server.
        For example: h.putheader('Accept', 'text/html')
        """
        str = '%s: %s\r\n' % (header, value)
        self.send(str)

    def endheaders(self):
        """Indicate that the last header line has been sent to the server."""
        self.send('\r\n')

    def request(self, method, url, body=None, headers={}):
        """Send a complete request to the server."""
        try:
            self._send_request(method, url, body, headers)
        except socket.error, v:
            if v[0] != 32:      # Broken pipe
                raise
            # try one more time
            self._send_request(method, url, body, headers)

    def _send_request(self, method, url, body, headers):
        self.putrequest(method, url)

        if body:
            self.putheader('Content-Length', str(len(body)))
        for hdr, value in headers.items():
            self.putheader(hdr, value)

        self.endheaders()

        if body:
            self.send(body)

    def getreply(self):
        """Get a reply from the server.

        Returns a tuple consisting of:
        - server response code (e.g. '200' if all goes well)
        - server response string corresponding to response code
        - any RFC822 headers in the response from the server
        """
        file = self.sock.makefile('rb')
        line = file.readline()
        try:
            [ver, code, msg] = string.split(line, None, 2)
        except ValueError:
            try:
                [ver, code] = string.split(line, None, 1)
                msg = ""
            except ValueError:
                self.close()
                return -1, line, file

        if ver[:5] != 'HTTP/':
            self.close()
            return -1, line, file

        errcode = int(code)
        errmsg = string.strip(msg)

        response = self.response_class(file, ver, errcode)

        if response.will_close:
            # this effectively passes the connection to the response
            self.close()
        else:
            # remember this, so we can tell when it is complete
            self.response = response

        return errcode, errmsg, response


class FakeSocket:

    def __init__(self, sock, ssl):
        self.__sock = sock
        self.__ssl = ssl
        return

    def makefile(self, mode):
        # hopefully, never have to write
        # XXX add assert about mode != w???
        msgbuf = ""
        while 1:
            try:
                msgbuf = msgbuf + self.__ssl.read()
            except socket.sslerror, msg:
                break
        return StringIO(msgbuf)

    def send(self, stuff, flags = 0):
        return self.__ssl.write(stuff)

    def recv(self, len = 1024, flags = 0):
        return self.__ssl.read(len)

    def __getattr__(self, attr):
        return getattr(self.__sock, attr)


class HTTPSConnection(HTTPConnection):
    """This class allows communication via SSL."""

    __super_init = HTTPConnection.__init__

    default_port = HTTPS_PORT

    def __init__(self, host, port=None, **x509):
        self.__super_init(host, port)
        self.key_file = x509.get('key_file')
        self.cert_file = x509.get('cert_file')

    def connect(self):
        """Connect to a host on a given port.

        Note: This method is automatically invoked by __init__, if a host
        is specified during instantiation.
        """
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.connect(self.addr)
        ssl = socket.ssl(sock, self.key_file, self.cert_file)
        self.sock = FakeSocket(sock, ssl)


class HTTPMixin:
    """Mixin for compatibility with httplib.py from 1.5.

    Requires that the inheriting class define the following attributes:
        super_init
        super_connect
        super_putheader
        super_getreply
    """

    _http_vsn = 10
    _http_vsn_str = 'HTTP/1.0'

    def connect(self, host=None, port=None):
        "Accept arguments to set the host/port, since the superclass doesn't."
        if host is not None:
            self._set_hostport(host, port)
        self.super_connect()

    def set_debuglevel(self, debuglevel):
        "The class no longer supports the debuglevel."
        pass

    def getfile(self):
        "Provide a getfile, since the superclass' use of HTTP/1.1 prevents it."
        return self.file

    def putheader(self, header, *values):
        "The superclass allows only one value argument."
        self.super_putheader(header, string.joinfields(values, '\r\n\t'))

    def getreply(self):
        "Compensate for an instance attribute shuffling."
        errcode, errmsg, response = self.super_getreply()

        if errcode == -1:
            self.file = response    # response is the "file" when errcode==-1
            self.headers = None
            return -1, errmsg, None

        self.headers = response
        self.file = response.fp
        return errcode, errmsg, response


class HTTP(HTTPMixin, HTTPConnection):
    super_init = HTTPConnection.__init__
    super_connect = HTTPConnection.connect
    super_putheader = HTTPConnection.putheader
    super_getreply = HTTPConnection.getreply

    _http_vsn = 10
    _http_vsn_str = 'HTTP/1.0'

    def __init__(self, host='', port=None):
        "Provide a default host, since the superclass requires one."
        # Note that we may pass an empty string as the host; this will throw
        # an error when we attempt to connect. Presumably, the client code
        # will call connect before then, with a proper host.
        self.super_init(host, port)


class HTTPS(HTTPMixin, HTTPSConnection):
    super_init = HTTPSConnection.__init__
    super_connect = HTTPSConnection.connect
    super_putheader = HTTPSConnection.putheader
    super_getreply = HTTPSConnection.getreply

    _http_vsn = 10
    _http_vsn_str = 'HTTP/1.0'

    def __init__(self, host='', port=None, **x509):
        "Provide a default host, since the superclass requires one."
        # Note that we may pass an empty string as the host; this will throw
        # an error when we attempt to connect. Presumably, the client code
        # will call connect before then, with a proper host.
        self.super_init(host, port, **x509)


def test():
    """Test this module.

    The test consists of retrieving and displaying the Python home page,
    along with the error code and error string returned by the
    www.python.org server.
    """
    import sys
    import getopt
    opts, args = getopt.getopt(sys.argv[1:], 'd')
    dl = 0
    for o, a in opts:
        if o == '-d':
            dl = dl + 1
    host = 'www.python.org'
    selector = '/'
    if args[0:]:
        host = args[0]
    if args[1:]:
        selector = args[1]
    h = HTTP()
    h.set_debuglevel(dl)
    h.connect(host)
    h.putrequest('GET', selector)
    h.endheaders()
    errcode, errmsg, headers = h.getreply()
    print 'errcode =', errcode
    print 'errmsg =', errmsg
    print
    if headers:
        for header in headers.headers:
            print string.strip(header)
    print
    print h.getfile().read()

    if hasattr(socket, 'ssl'):
        host = 'www.c2.net'
        hs = HTTPS()
        hs.connect(host)
        hs.putrequest('GET', selector)
        hs.endheaders()
        errcode, errmsg, headers = hs.getreply()
        print 'errcode =', errcode
        print 'errmsg =', errmsg
        print
        if headers:
            for header in headers.headers:
                print string.strip(header)
        print
        print hs.getfile().read()


if __name__ == '__main__':
    test()
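A minimal sketch of driving the new-style interface above, assuming the module is saved as httplib.py; the host and path are placeholders, and real code should also be prepared for socket.error:

    import httplib

    h = httplib.HTTPConnection('www.python.org')    # placeholder host
    h.putrequest('GET', '/')
    h.endheaders()
    errcode, errmsg, response = h.getreply()
    print errcode, errmsg
    if errcode == 200:
        print response.read()   # read() also decodes chunked replies
    h.close()
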
From claird@starbase.neosoft.com  Fri May 19 23:02:47 2000
From: claird@starbase.neosoft.com (Cameron Laird)
Date: Fri, 19 May 2000 17:02:47 -0500 (CDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005191513.IAA00818@cj20424-a.reston1.va.home.com>
Message-ID: <200005192202.RAA48753@starbase.neosoft.com>

    From guido@cj20424-a.reston1.va.home.com  Fri May 19 07:26:16 2000
    .
    .
    .
> Consider:
>
> - In Tcl, as you said, this is nicely integrated with the GUI's event queue:
>   - on unix, by an additional bit on X's fd (socket) in the select()
>   - on 'doze, everything is brought back to messages anyway.
>
> And, in both cases, it works with pipes, sockets, serial or other devices. Uniform, clean.
>
> - In Python, "popen only really works on Unix": are you satisfied with that state of affairs? I understand (and value) Python's focus on algorithms and data structures, and worming around OS misgivings is a boring, ancillary task. But what about the potential gain?
>
> I'm an oldtime Tcler, firmly decided to switch to Python, 'cause it is just so beautiful inside. But while Tcl is weaker in the algorithms, it is stronger in the os-wrapping library, and taught me to love high-level abstractions. [fileevent] shines in this respect, and I'll miss it in Python.
>
> -Alex

Alex, it's disappointing to me too! There just isn't anything currently in the library to do this, and I haven't written apps that need this often enough to have a good feel for what kind of abstraction is needed.

However, perhaps we can come up with a design for something better? Do you have a suggestion here?

I agree with your comment that higher-level abstractions around OS stuff are needed -- I learned system programming long ago, in C, and I'm "happy enough" with the current state of affairs, but I agree that for many people this is a problem, and there's no reason why Python couldn't do better...

--Guido van Rossum (home page: http://www.python.org/~guido/)

Great questions! Alex and I are both working on answers, I think; we're definitely not ignoring this. More, in time.

One thing of which I'm certain: I do NOT like documentation entries that say things like "select() doesn't really work except under Unix" (still true? Maybe that's been fixed?). As a user, I just find that intolerable. Sufficiently intolerable that I'll help change the situation? Well, I'm working on that part now ...

From guido@python.org  Sat May 20 02:19:20 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 19 May 2000 18:19:20 -0700
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: Your message of "Fri, 19 May 2000 17:02:47 CDT." <200005192202.RAA48753@starbase.neosoft.com>
References: <200005192202.RAA48753@starbase.neosoft.com>
Message-ID: <200005200119.SAA02183@cj20424-a.reston1.va.home.com>

> One thing of which I'm certain: I do NOT like documentation entries that say things like "select() doesn't really work except under Unix" (still true? Maybe that's been fixed?).

Hm, that's bogus. It works well under Windows -- with the restriction that it only works for sockets, but for sockets it works as well as on Unix. It also works well on the Mac. I wonder where that note came from (it's probably 6 years old :-). Fred...?

> As a user, I just find that intolerable. Sufficiently intolerable that I'll help change the situation? Well, I'm working on that part now ...

--Guido van Rossum (home page: http://www.python.org/~guido/)
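The sockets-only restriction Guido mentions is easy to honor in portable code. A rough sketch (the endpoints are invented for illustration):

    import select, socket

    s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s1.connect(('www.python.org', 80))      # invented endpoints
    s2.connect(('www.python.org', 80))
    s1.send('GET / HTTP/1.0\r\n\r\n')
    s2.send('GET / HTTP/1.0\r\n\r\n')

    open_sockets = [s1, s2]
    while open_sockets:
        # on Windows this works *because* every object here is a socket;
        # a pipe or ordinary file in this list would only work on Unix
        ready, dummy1, dummy2 = select.select(open_sockets, [], [])
        for s in ready:
            data = s.recv(1024)
            if not data:
                open_sockets.remove(s)      # peer closed; stop watching it
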
From claird@starbase.neosoft.com  Fri May 19 23:37:48 2000
From: claird@starbase.neosoft.com (Cameron Laird)
Date: Fri, 19 May 2000 17:37:48 -0500 (CDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005200119.SAA02183@cj20424-a.reston1.va.home.com>
Message-ID: <200005192237.RAA49766@starbase.neosoft.com>

    From guido@cj20424-a.reston1.va.home.com  Fri May 19 17:32:39 2000
    .
    .
    .
> One thing of which I'm certain: I do NOT like documentation entries that say things like "select() doesn't really work except under Unix" (still true? Maybe that's been fixed?).

Hm, that's bogus. It works well under Windows -- with the restriction that it only works for sockets, but for sockets it works as well as on Unix. It also works well on the Mac. I wonder where that note came from (it's probably 6 years old :-). Fred...?
    .
    .
    .

I sure don't mean to propagate misinformation. I'll make it more of a habit to forward such items to Fred as I find them.

From guido@python.org  Sat May 20 02:30:30 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 19 May 2000 18:30:30 -0700
Subject: [Python-Dev] HTTP/1.1 capable httplib module
In-Reply-To: Your message of "Fri, 19 May 2000 17:46:11 PDT." <14629.57427.9434.623247@localhost.localdomain>
References: <14629.57427.9434.623247@localhost.localdomain>
Message-ID: <200005200130.SAA02265@cj20424-a.reston1.va.home.com>

> I applied the recent changes to the CVS httplib to Greg's httplib (call it httplib11) this afternoon. The result is included below. I think this is quite close to checking in, but it could use a slightly better test suite.

Thanks -- but note that I don't have the time to review the code.

> There are a few outstanding questions.
>
> httplib11 does not implement the debuglevel feature. I don't think it's important, but it is currently documented and may be used. Guido, should we implement it?

I think the solution is to provide the API but ignore the call or argument.

> httplib w/SSL uses a constructor with this prototype:
>     def __init__(self, host='', port=None, **x509):
> It looks like the x509 dictionary should contain two variables -- key_file and cert_file. Since we know what the entries are, why not make them explicit?
>     def __init__(self, host='', port=None, cert_file=None, key_file=None):
> (Or reverse the two arguments if that is clearer.)

The reason for the **x509 syntax (I think -- I didn't introduce it) is that it *forces* the user to use keyword args, which is a good thing for such an advanced feature. However, there should be code that checks that no other keyword args are present.

> The FakeSocket class in CVS has a comment after the makefile def line that says "hopefully, never have to write." It won't do at all the right thing when called with a write mode, so it ought to raise an exception. Any reason it doesn't?

Probably laziness of the code. Thanks for this code review (I guess I was in a hurry when I checked that code in :-).

> I'd like to add a couple of test cases that use HTTP/1.1 to get some pages from python.org, including one that uses the chunked encoding. Just haven't gotten around to it. Question on that front: Does it make sense to incorporate the test function in the module with the std regression test suite? In general, I would think so. In this particular case, the test could fail because of host networking problems. I think that's okay as long as the error message is clear enough.

Yes, I agree. Maybe it should raise ImportError when the network is unreachable -- this is the one exception that the regrtest module considers non-fatal.

--Guido van Rossum (home page: http://www.python.org/~guido/)
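The keyword-argument check Guido asks for is only a few lines. One possible shape for it in HTTPSConnection.__init__, sketched here for illustration (this is not the code that was checked in):

    class HTTPSConnection(HTTPConnection):      # HTTPConnection from above

        def __init__(self, host, port=None, **x509):
            # **x509 forces callers to use keyword arguments; reject
            # anything except the two keywords we actually understand
            for key in x509.keys():
                if key not in ('cert_file', 'key_file'):
                    raise ValueError, 'unexpected keyword argument: %s' % key
            HTTPConnection.__init__(self, host, port)
            self.key_file = x509.get('key_file')
            self.cert_file = x509.get('cert_file')
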
From DavidA@ActiveState.com  Fri May 19 23:38:16 2000
From: DavidA@ActiveState.com (David Ascher)
Date: Fri, 19 May 2000 15:38:16 -0700
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005192237.RAA49766@starbase.neosoft.com>
Message-ID:

> > One thing of which I'm certain: I do NOT like documentation entries that say things like "select() doesn't really work except under Unix" (still true? Maybe that's been fixed?).
>
> Hm, that's bogus. It works well under Windows -- with the restriction that it only works for sockets, but for sockets it works as well as on Unix. It also works well on the Mac. I wonder where that note came from (it's probably 6 years old :-).

I'm pretty sure I know where it came from -- it came from Sam Rushing's tutorial on how to use Medusa, which was more or less cut & pasted into the doc, probably at the time that asyncore and asynchat were added to the Python core. IMO, it's not the best part of the Python doc -- it is much too low-to-the-ground, and assumes the reader already understands much about I/O, sync/async issues, and cares mostly about high performance. All of which are true of wonderful Sam, most of which are not true of the average Python user.

While we're complaining about doc, asynchat is not documented, I believe. Alas, I'm unable to find the time to write up said documentation.

--david

PS: I'm not sure that multiplexing can be made _easy_. Issues like blocking/nonblocking communications channels, multithreading etc. are hard to ignore, as much as one might want to.

From gstein@lyra.org  Fri May 19 23:38:59 2000
From: gstein@lyra.org (Greg Stein)
Date: Fri, 19 May 2000 15:38:59 -0700 (PDT)
Subject: [Python-Dev] HTTP/1.1 capable httplib module
In-Reply-To: <200005200130.SAA02265@cj20424-a.reston1.va.home.com>
Message-ID:

On Fri, 19 May 2000, Guido van Rossum wrote:
> > I applied the recent changes to the CVS httplib to Greg's httplib (call it httplib11) this afternoon. The result is included below. I think this is quite close to checking in,

I'll fold the changes into my copy here (at least), until we're ready to check into Python itself.

THANK YOU for doing this work. It is the "heavy lifting" part that I just haven't had a chance to get to myself.

I have a small, local change dealing with the 'Host' header (it shouldn't be sent automatically for HTTP/1.0; some httplib users already send it and having *two* in the output headers will make some servers puke).

> > but it could use a slightly better test suite.
>
> Thanks -- but note that I don't have the time to review the code.

I'm reviewing it, too. Gotta work around the fact that Jeremy re-indented the code, though... :-)

> > There are a few outstanding questions.
> >
> > httplib11 does not implement the debuglevel feature. I don't think it's important, but it is currently documented and may be used. Guido, should we implement it?
>
> I think the solution is to provide the API but ignore the call or argument.

Can do: ignore the debuglevel feature.

> > httplib w/SSL uses a constructor with this prototype:
> >     def __init__(self, host='', port=None, **x509):
> > It looks like the x509 dictionary should contain two variables -- key_file and cert_file. Since we know what the entries are, why not make them explicit?
> >     def __init__(self, host='', port=None, cert_file=None, key_file=None):
> > (Or reverse the two arguments if that is clearer.)
>
> The reason for the **x509 syntax (I think -- I didn't introduce it) is that it *forces* the user to use keyword args, which is a good thing for such an advanced feature. However, there should be code that checks that no other keyword args are present.

Can do: raise an error if other keyword args are present.

> > The FakeSocket class in CVS has a comment after the makefile def line that says "hopefully, never have to write." It won't do at all the right thing when called with a write mode, so it ought to raise an exception. Any reason it doesn't?
>
> Probably laziness of the code. Thanks for this code review (I guess I was in a hurry when I checked that code in :-).

+1 on raising an exception.

> > I'd like to add a couple of test cases that use HTTP/1.1 to get some pages from python.org, including one that uses the chunked encoding. Just haven't gotten around to it. Question on that front: Does it make sense to incorporate the test function in the module with the std regression test suite? In general, I would think so. In this particular case, the test could fail because of host networking problems. I think that's okay as long as the error message is clear enough.
>
> Yes, I agree. Maybe it should raise ImportError when the network is unreachable -- this is the one exception that the regrtest module considers non-fatal.

+1 on shifting to the test modules.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/

From bckfnn@worldonline.dk  Sat May 20 16:19:09 2000
From: bckfnn@worldonline.dk (Finn Bock)
Date: Sat, 20 May 2000 15:19:09 GMT
Subject: [Python-Dev] Heads up: unicode file I/O in JPython.
Message-ID: <392690f3.17235923@smtp.worldonline.dk>

I have recently released errata-07, which improves on JPython's ability to handle unicode characters as well as binary data read from and written to python files.

The conversions can be described as:

- I/O to a file opened in binary mode will read/write the low 8 bits of each char. Writing unicode chars >0xFF will cause silent truncation [*].

- I/O to a file opened in text mode will push the character through the default encoding for the platform (in addition to handling CR/LF issues).

This breaks completely with python1.6a2, but I believe that it is close to the expectations of java users. (The current JPython-1.1 behavior is completely useless for both characters and binary data. It only barely manages to handle 7-bit ASCII.)

In JPython (with the errata) we can do:

    f = open("test207.out", "w")
    f.write("\x20ac")   # On my w2k platform this writes 0x80 to the file.
    f.close()
    f = open("test207.out", "r")
    print hex(ord(f.read()))
    f.close()

    f = open("test207.out", "wb")
    f.write("\x20ac")   # On all platforms this writes 0xAC to the file.
    f.close()
    f = open("test207.out", "rb")
    print hex(ord(f.read()))
    f.close()

With the output of:

    0x20ac
    0xac

I do not expect anything like this in CPython. I just hope that all unicode advice given on c.l.py comes with the modifier that JPython might do it differently.

regards,
finn

http://sourceforge.net/project/filelist.php?group_id=1842

[*] Silent overflow is bad, but it is at least twice as fast as having to check each char for overflow.
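The binary-mode truncation Finn describes amounts to masking each character down to its low byte. In plain CPython terms the effect is roughly this (a sketch, not JPython's actual implementation):

    ch = 0x20AC             # the euro sign code point
    low = chr(ch & 0xFF)    # silent truncation to the low 8 bits
    print hex(ord(low))     # prints 0xac
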
From esr@netaxs.com  Sat May 20 23:36:56 2000
From: esr@netaxs.com (Eric Raymond)
Date: Sat, 20 May 2000 18:36:56 -0400
Subject: [Python-Dev] homer-dev, anyone?
In-Reply-To: <200005162001.QAA16657@eric.cnri.reston.va.us>; from Guido van Rossum on Tue, May 16, 2000 at 04:01:46PM -0400
References: <009d01bfbf64$b779a260$34aab5d4@hagrid> <3921984A.8CDE8E1D@prescod.net> <200005162001.QAA16657@eric.cnri.reston.va.us>
Message-ID: <20000520183656.F7487@unix3.netaxs.com>

On Tue, May 16, 2000 at 04:01:46PM -0400, Guido van Rossum wrote:
> > I hope that if Python were renamed we would not choose yet another name which turns up hundreds of false hits in web engines. Perhaps Homr or Home_r. Or maybe Pythahn.
>
> Actually, I'd like to call the next version Throatwobbler Mangrove. But you'd have to pronounce it Raymond Luxyry Yach-t.

Great. I'll take a J-class kitted for open-ocean sailing, please. Do I get a side of bikini babes with that?
--
Eric S. Raymond

From ping@lfw.org  Sun May 21 11:30:05 2000
From: ping@lfw.org (Ka-Ping Yee)
Date: Sun, 21 May 2000 03:30:05 -0700 (PDT)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: <392590B0.5CA4F31D@lemburg.com>
Message-ID:

On Fri, 19 May 2000, M.-A. Lemburg wrote:
> Umm, Jyrki's patch does *not* affect repr(): it's a patch to the string_print API which is used for the tp_print slot,

Very sorry! I didn't actually look to see where the patch was being applied.

But then how can this have any effect on squishdot's indexing?

-- ?!ng

"All models are wrong; some models are useful."
    -- George Box

From pf@artcom-gmbh.de  Sun May 21 16:54:06 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Sun, 21 May 2000 17:54:06 +0200 (MEST)
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: from Ka-Ping Yee at "May 21, 2000 3:30: 5 am"
Message-ID:

Hi!

Ka-Ping Yee:
> On Fri, 19 May 2000, M.-A. Lemburg wrote:
> > Umm, Jyrki's patch does *not* affect repr(): it's a patch to the string_print API which is used for the tp_print slot,
>
> Very sorry! I didn't actually look to see where the patch was being applied.
>
> But then how can this have any effect on squishdot's indexing?

Sigh. Let me explain this in some detail. What do you see here: äöüÄÖÜß? If all went well, you should see some umlauts, which occur quite often in German words like "Begrüssung", "ätzend" or "Grützkacke", and so on.

During the late 80s we here in Germany spent a lot of our free time patching open source software like 'elm', 'B-News', 'less' and others to make them "8-bit clean", for example on ancient Unices like SCO Xenix, where the implementations of C-library functions like 'is_print' and 'is_lower' were out of reach. After several years everybody seemed to agree on ISO-8859-1 as the new European standard character set, which was also often loosely called 8-bit ASCII, because ASCII is a true subset of ISO Latin-1. At least the German versions of Windows used ISO-8859-1 as well.

As the WWW began to gain popularity, nobody with a sane mind really used those splendid ASCII escapes like for example '&auml;' instead of 'ä'. The same holds true for the TeX users community, where everybody was happy to type real umlauts instead of the ugly backslash escape sequences used before: \"a \"o \"u ...

To make it short: a lot of effort has been spent to make *ALL* programs 8-bit clean, that is, to move the bytes through without translating them from or into a bunch of incompatible multi-byte sequences, which nobody can read or even wants to look at.

Now to get back to your question: there are several nice HTML indexing engines out there. I personally use HTDig. At least on Linux these programs deal fine with HTML files containing 8-bit chars. But if for some reason umlauts end up as octal escapes ('\344' instead of 'ä') due to the use of a Python 'print some_tuple' during the creation of HTML files, a search engine will be unable to find those words with escaped umlauts.

Mit freundlichen Grüßen, Peter

P.S.: Hope you didn't find my explanation boring or off-topic.
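Peter's effect is easy to reproduce at the interpreter prompt. A sketch, with output as a Latin-1 environment of that era would show it (quoting details may vary by version):

    >>> s = "Begrüssung"
    >>> print s         # str() form: the bytes pass through untouched
    Begrüssung
    >>> print (s,)      # a tuple prints its items in repr() form
    ('Begr\374ssung',)
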
From Fredrik Lundh"
Message-ID: <005601bfc341$40eb0d60$34aab5d4@hagrid>

Peter Funk wrote:
> But if for some reason umlauts end up as octal escapes ('\344' instead of 'ä') due to the use of a Python 'print some_tuple' during the creation of HTML files, a search engine will be unable to find those words with escaped umlauts.

umm. why would anyone use "print some_tuple" when generating HTML pages? what if the tuple contains something that results in a "<" character?

From guido@python.org  Sun May 21 22:20:03 2000
From: guido@python.org (Guido van Rossum)
Date: Sun, 21 May 2000 14:20:03 -0700
Subject: [Python-Dev] Is the tempfile module really a security risk?
Message-ID: <200005212120.OAA05258@cj20424-a.reston1.va.home.com>

Every few months I receive patches that purport to make the tempfile module more secure. I've never felt that it is a problem. What is with these people? My feeling about these suggestions has always been that they have read about similar insecurities in C code run by the super-user, and are trying to get the teacher's attention by proposing something clever.

Or is there really a problem? Is anyone in this forum aware of security issues with tempfile? Should I worry? Is the "random-tempfile" patch that the poster below suggested worth applying?

--Guido van Rossum (home page: http://www.python.org/~guido/)

------- Forwarded Message

Date: Sun, 21 May 2000 19:34:43 +0200
From: =?iso-8859-1?Q?Ragnar_Kj=F8rstad?=
To: Guido van Rossum
cc: patches@python.org
Subject: Re: [Patches] Patch to make tempfile return random filenames

On Sun, May 21, 2000 at 12:17:08PM -0700, Guido van Rossum wrote:
> Hm, I don't like this very much. Random sequences have a small but nonzero probability of generating the same number in rapid succession -- probably one in a million or so. It would be very bad if one in a million runs of a particular application crashed for this reason.
>
> A better way to prevent this kind of attack (if you care about it) is to use tempfile.TemporaryFile(), which avoids this vulnerability in a different way.
>
> (Also note the test for os.path.exists() so that an attacker would have to use very precise timing to make this work.)

1. The path.exists part does not solve the problem. It causes a race condition that is not very hard to get around, by having a program creating and deleting the file at maximum speed. It will have a 50% chance of breaking your program.

2. O_EXCL does not always work. E.g. it does not work over NFS -- there are probably other broken implementations too.

3. Even if tempfile.TemporaryFile had been sufficient, providing mktemp in this dangerous way is not good. Many are likely to use it either not thinking about the problem at all, or assuming it's solved in the module.

4. The problems you describe can easily be overcome. I removed the counter and the file-exists check because I figured they were no longer needed. I was wrong. Either a larger number should be used, and/or a counter, and/or a file-exists check. Personally I would want the random part to be large enough not to have to worry about collisions, either by chance, after a fork, or by deliberate attack.

Do you want a new patch that addresses these problems better?

--
Ragnar Kjørstad

_______________________________________________
Patches mailing list
Patches@python.org
http://www.python.org/mailman/listinfo/patches

------- End of Forwarded Message
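For reference, the standard defense against the symlink attack under discussion is an atomic create-and-open. A minimal sketch, not the tempfile module's actual code, and subject to Ragnar's NFS caveat in point 2:

    import os

    def opentemp(filename):
        # O_EXCL makes create-and-open atomic: if the name already exists
        # (for instance as an attacker's symlink), os.open fails instead
        # of following it.
        fd = os.open(filename, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0600)
        return os.fdopen(fd, 'w+b')
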
From guido@python.org  Sun May 21 23:05:58 2000
From: guido@python.org (Guido van Rossum)
Date: Sun, 21 May 2000 15:05:58 -0700
Subject: [Python-Dev] ANNOUNCE: Python CVS tree moved to SourceForge
Message-ID: <200005212205.PAA05512@cj20424-a.reston1.va.home.com>

I'm happy to announce that we've moved the Python CVS tree to SourceForge. SourceForge (www.sourceforge.net) is a free service to Open Source developers run by VA Linux.

The change has two advantages for us: (1) we no longer have to deal with the mirroring of our writable CVS repository to the read-only mirror at cvs.python.org (which will soon be decommissioned); (2) we will be able to add new developers with checkin privileges. In addition, we benefit from the high visibility and availability of SourceForge.

Instructions on how to access the Python SourceForge tree are here:

    http://sourceforge.net/cvs/?group_id=5470

If you have an existing working tree that points to the cvs.python.org repository, you may want to retarget it to the SourceForge tree. This can be done painlessly with Greg Ward's cvs_chroot script:

    http://starship.python.net/~gward/python/

The email notification to python-checkins@python.org still works (although during the transition a few checkin messages may have been lost).

While I've got your attention, please remember that the proper procedure for submitting patches is described here:

    http://www.python.org/patches/

We've accumulated quite a backlog of patches to be processed during the transition; we'll start working on these ASAP.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From mal@lemburg.com  Sun May 21 21:54:23 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sun, 21 May 2000 22:54:23 +0200
Subject: [Python-Dev] repr vs. str and locales again
References:
Message-ID: <39284CFF.5D9C9B13@lemburg.com>

Ka-Ping Yee wrote:
>
> On Fri, 19 May 2000, M.-A. Lemburg wrote:
> > Umm, Jyrki's patch does *not* affect repr(): it's a patch to the string_print API which is used for the tp_print slot,
>
> Very sorry! I didn't actually look to see where the patch was being applied.
>
> But then how can this have any effect on squishdot's indexing?

The only possible reason I can see is that this squishdot application uses 'print' to write the data -- perhaps it pipes it through some other tool?

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From Fredrik Lundh"
References: <39284CFF.5D9C9B13@lemburg.com>
Message-ID: <004f01bfc37b$b1551480$34aab5d4@hagrid>

M.-A. Lemburg wrote:
> > But then how can this have any effect on squishdot's indexing?
>
> The only possible reason I can see is that this squishdot application uses 'print' to write the data -- perhaps it pipes it through some other tool?

but doesn't the patch only affect code that manages to call tp_print without the PRINT_RAW flag? (that is, in "repr" mode rather than "str" mode)

or to put it another way, if they manage to call tp_print without the PRINT_RAW flag, isn't that a bug in their code, rather than in Python?

or am I just totally confused?

From guido@python.org  Mon May 22 04:47:16 2000
From: guido@python.org (Guido van Rossum)
Date: Sun, 21 May 2000 20:47:16 -0700
Subject: [Python-Dev] repr vs. str and locales again
In-Reply-To: Your message of "Mon, 22 May 2000 01:24:02 +0200." <004f01bfc37b$b1551480$34aab5d4@hagrid>
References: <39284CFF.5D9C9B13@lemburg.com> <004f01bfc37b$b1551480$34aab5d4@hagrid>
Message-ID: <200005220347.UAA06235@cj20424-a.reston1.va.home.com>

Let's reboot this thread. Never mind the details of the actual patch, or why it would affect a particular index. Obviously if we're going to patch string_print() we're also going to patch string_repr() (and vice versa) -- the former (without the Py_PRINT_RAW flag) is supposed to be an optimization of the latter.
(I hadn't even read the patch that far to realize that it only did one and not the other.)

The point is simply this. The repr() function for a string turns it into a valid string literal. There's considerable freedom allowed in this conversion, some of which is taken (e.g. it prefers single quotes but will use double quotes when the string contains single quotes). For safety reasons, control characters are replaced by their octal escapes. This is also done for non-ASCII characters.

Lots of people, most of them living in countries where Latin-1 (or another 8-bit ASCII superset) is in actual use, would prefer that non-ASCII characters be left alone rather than changed into octal escapes. I think it's not unreasonable to ask that what they consider printable characters aren't treated as control characters.

I think that using the locale to guide this is reasonable. If the locale is set to imply Latin-1, then we can assume that most output devices are capable of displaying those characters. What good does converting those characters to octal escapes do us then? If the input string was in fact binary goop, then the output will be unreadable goop -- but it won't screw up the output device (as control characters are wont to do, which is the main reason to turn them into octal escapes).

So I don't see how the patch can do much harm, I don't expect that it will break much code, and I see a real value for those who use Latin-1 or other 8-bit supersets of ASCII.

The one objection could be that the locale may be obsolescent -- but I've only heard /F vent an opinion about that; personally, I doubt that we will be able to remove the locale any time soon, even if we invent a better way. Plus, I think that "better way" should address this issue anyway. If the locale eventually disappears, the feature automatically disappears with it, because you *have* to make a locale.setlocale() call before the behavior of repr() changes.

--Guido van Rossum (home page: http://www.python.org/~guido/)
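Concretely, the current rules Guido describes look like this at the prompt (a sketch; the last line shows the behavior the patch would make locale-dependent):

    >>> print repr("don't")     # single quotes preferred, doubles when needed
    "don't"
    >>> print repr('\033[2J')   # control characters become octal escapes
    '\033[2J'
    >>> print repr('läuft')     # today, non-ASCII gets the same treatment
    'l\344uft'
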
From pf@artcom-gmbh.de  Mon May 22 07:18:22 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Mon, 22 May 2000 08:18:22 +0200 (MEST)
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: <200005220347.UAA06235@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 21, 2000 8:47:16 pm"
Message-ID:

Guido van Rossum:
[...]
> The one objection could be that the locale may be obsolescent -- but I've only heard /F vent an opinion about that; personally, I doubt that we will be able to remove the locale any time soon, even if we invent a better way.

AFAIK locale and friends conform to POSIX.1. Calling this obsolescent... hmmm... may offend a *LOT* of people. Try this on comp.os.linux.advocacy ;-)

Although I understand Barry's and Ping's objections against a global state, it used to work very well: on a typical single-user Linux system the user chooses his locale during the first stages of system setup and never has to think about it again. On multi-user systems the locale of individual accounts may be customized using several environment variables, which can override the default locale of the system.

> Plus, I think that "better way" should address this issue anyway. If the locale eventually disappears, the feature automatically disappears with it, because you *have* to make a locale.setlocale() call before the behavior of repr() changes.

The last sentence is at least not the whole truth.

On POSIX systems there are several environment variables used to control the default locale settings for a user's session. For example, on my SuSE Linux system currently running in the German locale, the environment variable LC_CTYPE=de_DE is automatically set by a file /etc/profile during login, which causes the C-library function toupper('ä') to return an 'Ä' ---you should see a lower-case a-umlaut as argument and an upper-case umlaut as return value--- without all applications having to call 'setlocale' explicitly.

So this simply works as intended without having to add calls to 'setlocale' to every application program using these C-library functions.

Regards, Peter.

From tim_one@email.msn.com  Mon May 22 07:59:16 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Mon, 22 May 2000 02:59:16 -0400
Subject: [Python-Dev] Is the tempfile module really a security risk?
In-Reply-To: <200005212120.OAA05258@cj20424-a.reston1.va.home.com>
Message-ID:

[Guido]
> Every few months I receive patches that purport to make the tempfile module more secure. I've never felt that it is a problem. What is with these people?

Doing a google search on

    tempfile security

turns up hundreds of rants. Have fun . There does appear to be a real vulnerability here somewhere (not necessarily Python), but the closest I found to a clear explanation in 10 minutes was an annoyed paragraph saying that if I didn't already understand the problem I should turn in my Unix Security Expert badge immediately. Unfortunately, Bill Gates never issued one of those to me.

> ...
> Is the "random-tempfile" patch that the poster below suggested worth applying?

Certainly not the patch he posted! And for reasons I sketched in my patches-list commentary, I doubt any hack based on pseudo-random numbers *can* solve anything.

assuming-there's-indeed-something-in-need-of-solving-ly y'rs - tim

From Fredrik Lundh"
Message-ID: <008001bfc3be$7e5eae40$34aab5d4@hagrid>

Peter Funk wrote:
> AFAIK locale and friends conform to POSIX.1. Calling this obsolescent... hmmm... may offend a *LOT* of people. Try this on comp.os.linux.advocacy ;-)

you're missing the point -- now that we've added unicode support to Python, the old 8-bit locale *ctype* stuff no longer works. while some platforms implement a wctype interface, it's not widely available, and it's not always unicode.

so in order to provide platform-independent unicode support, Python 1.6 comes with unicode-aware and fully portable replacements for the ctype functions. the code is already in there...

> On POSIX systems there are several environment variables used to control the default locale settings for a user's session. For example, on my SuSE Linux system currently running in the German locale, the environment variable LC_CTYPE=de_DE is automatically set by a file /etc/profile during login, which causes the C-library function toupper('ä') to return an 'Ä' ---you should see a lower-case a-umlaut as argument and an upper-case umlaut as return value--- without all applications having to call 'setlocale' explicitly.
>
> So this simply works as intended without having to add calls to 'setlocale' to every application program using these C-library functions.

note that this leaves us with four string flavours in 1.6:

- 8-bit binary arrays. may contain binary goop, or text in some strange encoding. upper, strip, etc should not be used.

- 8-bit text strings using the system encoding. upper, strip, etc work as long as the locale is properly configured.
- 8-bit unicode text strings. upper, strip, etc may work, as long as the system encoding is a subset of unicode -- which means US-ASCII or ISO Latin-1.

- wide unicode text strings. upper, strip, etc always work.

is this complexity really worth it?

From gstein@lyra.org  Mon May 22 08:47:50 2000
From: gstein@lyra.org (Greg Stein)
Date: Mon, 22 May 2000 00:47:50 -0700 (PDT)
Subject: [Python-Dev] HTTP/1.1 capable httplib module
In-Reply-To:
Message-ID:

I've integrated all of these changes into the httplib.py posted on my pages at:

    http://www.lyra.org/greg/python/

The actual changes are visible thru ViewCVS at:

    http://www.lyra.org/cgi-bin/viewcvs.cgi/gjspy/httplib.py/

The test code is still in there, until a test_httplib can be written. Still missing: doc for the new-style semantics.

Cheers,
-g

On Fri, 19 May 2000, Greg Stein wrote:
> On Fri, 19 May 2000, Guido van Rossum wrote:
> > > I applied the recent changes to the CVS httplib to Greg's httplib (call it httplib11) this afternoon. The result is included below. I think this is quite close to checking in,
>
> I'll fold the changes into my copy here (at least), until we're ready to check into Python itself.
>
> THANK YOU for doing this work. It is the "heavy lifting" part that I just haven't had a chance to get to myself.
>
> I have a small, local change dealing with the 'Host' header (it shouldn't be sent automatically for HTTP/1.0; some httplib users already send it and having *two* in the output headers will make some servers puke).
>
> > > but it could use a slightly better test suite.
> >
> > Thanks -- but note that I don't have the time to review the code.
>
> I'm reviewing it, too. Gotta work around the fact that Jeremy re-indented the code, though... :-)
>
> > > There are a few outstanding questions.
> > >
> > > httplib11 does not implement the debuglevel feature. I don't think it's important, but it is currently documented and may be used. Guido, should we implement it?
> >
> > I think the solution is to provide the API but ignore the call or argument.
>
> Can do: ignore the debuglevel feature.
>
> > > httplib w/SSL uses a constructor with this prototype:
> > >     def __init__(self, host='', port=None, **x509):
> > > It looks like the x509 dictionary should contain two variables -- key_file and cert_file. Since we know what the entries are, why not make them explicit?
> > >     def __init__(self, host='', port=None, cert_file=None, key_file=None):
> > > (Or reverse the two arguments if that is clearer.)
> >
> > The reason for the **x509 syntax (I think -- I didn't introduce it) is that it *forces* the user to use keyword args, which is a good thing for such an advanced feature. However, there should be code that checks that no other keyword args are present.
>
> Can do: raise an error if other keyword args are present.
>
> > > The FakeSocket class in CVS has a comment after the makefile def line that says "hopefully, never have to write." It won't do at all the right thing when called with a write mode, so it ought to raise an exception. Any reason it doesn't?
> >
> > Probably laziness of the code. Thanks for this code review (I guess I was in a hurry when I checked that code in :-).
>
> +1 on raising an exception.
>
> > > I'd like to add a couple of test cases that use HTTP/1.1 to get some pages from python.org, including one that uses the chunked encoding. Just haven't gotten around to it. Question on that front: Does it make sense to incorporate the test function in the module with the std regression test suite? In general, I would think so. In this particular case, the test could fail because of host networking problems. I think that's okay as long as the error message is clear enough.
> >
> > Yes, I agree. Maybe it should raise ImportError when the network is unreachable -- this is the one exception that the regrtest module considers non-fatal.
>
> +1 on shifting to the test modules.
>
> Cheers,
> -g
>
> --
> Greg Stein, http://www.lyra.org/

--
Greg Stein, http://www.lyra.org/

From alexandre.ferrieux@cnet.francetelecom.fr  Mon May 22 09:25:21 2000
From: alexandre.ferrieux@cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Mon, 22 May 2000 10:25:21 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <92F3F78F2E523B81.794E00EE6EFC8B37.2D5DBFEF2B39A7A2@lp.airnews.net> <39242D1B.78773AA2@python.org> <39250897.6F42@cnet.francetelecom.fr> <200005191513.IAA00818@cj20424-a.reston1.va.home.com>
Message-ID: <3928EEF1.693F@cnet.francetelecom.fr>

Guido van Rossum wrote:
>
> > From: Alexandre Ferrieux
> >
> > I'm an oldtime Tcler, firmly decided to switch to Python, 'cause it is just so beautiful inside. But while Tcl is weaker in the algorithms, it is stronger in the os-wrapping library, and taught me to love high-level abstractions. [fileevent] shines in this respect, and I'll miss it in Python.
>
> Alex, it's disappointing to me too! There just isn't anything currently in the library to do this, and I haven't written apps that need this often enough to have a good feel for what kind of abstraction is needed.

Thanks for the empathy. Apologies for my slight overreaction.

> However, perhaps we can come up with a design for something better? Do you have a suggestion here?

Yup. One easy answer is 'just copy from Tcl'... Seriously, I'm really too new to Python to suggest the details or even the *style* of this 'level 2 API to multiplexing'. However, I can sketch the implementation, since select() (from C or Tcl) is the one primitive I most depend on!

Basically, as shortly mentioned before, the key problem is the heterogeneity of seemingly-selectable things in Windoze. On Unix, not only does select() work with all descriptor types on which it makes sense, but also the fd used by Xlib is accessible; hence clean multiplexing even with a GUI package is trivial. Now to the real (rotten) meat, that is M$'s. Facts:

1. 'Handle' types are not equal. Unnamed pipes are (surprise!) not selectable. Why? Ask a relative in Redmond...

2. 'Handle' types are not equal (bis). Socket 'handles' are *not* true handles. They are selectable, but for example you can't use 'em for redirections. Okay, in our case we don't care. I only mention it 'cause it's scary and could pop back into your face some time later.

3. The GUI API doesn't expose a descriptor (handle), but fortunately (though disgustingly) there is a special syscall to wait on both "the message queue" and selectable handles: MsgWaitForMultipleObjects. So it's doable, if not beautiful.

The Tcl solution to (1.), which is the only real issue, is to have a separate thread blockingly read 1 byte from the pipe, and then post a message back to the main thread to awaken it (yes, ugly code to handle that extra byte and integrate it with the buffering scheme).

In summary, why not peruse Tcl's hard-won experience on selecting-on-windoze-pipes?

Then, for the API exposed to the Python programmer, the Tclly exposed one is a starter:

    fileevent $channel readable|writable callback
    ...
    vwait breaker_variable

Explanation for non-Tclers: fileevent hooks the callback, vwait does a loop of select(). The callback(s) is (are) called without breaking the loop, unless $breaker_variable is set, at which time vwait returns.

One note about 'breaker_variable': I'm not sure I like it. I'd prefer something based on exceptions. I don't quite understand why it's not already this way in Tcl (which has (kind of) first-class exceptions), but let's not repeat the mistake: let's suggest that (the equivalent of) vwait loops forever, only to be broken out of by an exception from within one of the callbacks.

HTH,

-Alex
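On Unix, the shape Alex describes can be sketched today with the select module; the names below are invented for illustration, not a proposed API:

    import select

    class EventLoop:
        def __init__(self):
            self.callbacks = {}             # channel -> callback

        def fileevent(self, channel, callback):
            # cf. Tcl's [fileevent $channel readable callback]
            self.callbacks[channel] = callback

        def run(self):
            # cf. [vwait]; per Alex's suggestion, a callback breaks the
            # loop by raising an exception, not by setting a variable
            while 1:
                ready, dummy1, dummy2 = select.select(
                    self.callbacks.keys(), [], [])
                for channel in ready:
                    self.callbacks[channel](channel)
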
From mal@lemburg.com  Mon May 22 09:56:10 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 22 May 2000 10:56:10 +0200
Subject: [Python-Dev] repr vs. str and locales again
References: <39284CFF.5D9C9B13@lemburg.com> <004f01bfc37b$b1551480$34aab5d4@hagrid> <3928F437.D4DB3C25@lemburg.com>
Message-ID: <3928F62A.94980623@lemburg.com>

Fredrik Lundh wrote:
>
> M.-A. Lemburg wrote:
> > > But then how can this have any effect on squishdot's indexing?
> >
> > The only possible reason I can see is that this squishdot application uses 'print' to write the data -- perhaps it pipes it through some other tool?
>
> but doesn't the patch only affect code that manages to call tp_print without the PRINT_RAW flag? (that is, in "repr" mode rather than "str" mode)

Right.

> or to put it another way, if they manage to call tp_print without the PRINT_RAW flag, isn't that a bug in their code, rather than in Python?

Looking at the code, the 'print' statement doesn't set PRINT_RAW -- still the output is written literally to stdout. Don't know where PRINT_RAW gets set... perhaps they use PyFile_WriteObject() directly?!

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From python-dev@python.org  Mon May 22 10:44:14 2000
From: python-dev@python.org (Peter Funk)
Date: Mon, 22 May 2000 11:44:14 +0200 (MEST)
Subject: [Python-Dev] Some more on the 'tempfile' naming security issue
Message-ID:

[Guido]
> Every few months I receive patches that purport to make the tempfile module more secure. I've never felt that it is a problem. What is with these people?

[Tim]
> Doing a google search on
>
>     tempfile security
>
> turns up hundreds of rants. Have fun . There does appear to be a real vulnerability here somewhere (not necessarily Python), but the closest I found to a clear explanation in 10 minutes was an annoyed paragraph, saying that if I didn't already understand the problem I should turn in my Unix Security Expert badge immediately. Unfortunately, Bill Gates never issued one of those to me.

You can find a working example which exploits this vulnerability in older versions of GCC. The basic idea is indeed very simple: since the /tmp directory is writable for any user, the bad guy can create a symbolic link in /tmp pointing to some arbitrary file (e.g. to /etc/passwd). The attacked program will then overwrite this arbitrary file (where the programmer really wanted to write something to his tempfile instead).
Since this will happen with the access permissions of the process running this program, this opens a bunch of vulnerabilities in many programs writing something into temporary files with predictable file names.

www.cert.org is another great place to look for security-related info.

Regards, Peter
--
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)

From claird@starbase.neosoft.com  Mon May 22 12:31:08 2000
From: claird@starbase.neosoft.com (Cameron Laird)
Date: Mon, 22 May 2000 06:31:08 -0500 (CDT)
Subject: [Python-Dev] Re: Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: <3928EEF1.693F@cnet.francetelecom.fr>
Message-ID: <200005221131.GAA39671@starbase.neosoft.com>

    From alexandre.ferrieux@cnet.francetelecom.fr  Mon May 22 03:40:13 2000
    .
    .
    .
> Alex, it's disappointing to me too! There just isn't anything currently in the library to do this, and I haven't written apps that need this often enough to have a good feel for what kind of abstraction is needed.

Thanks for the empathy. Apologies for my slight overreaction.

> However, perhaps we can come up with a design for something better? Do you have a suggestion here?

Yup. One easy answer is 'just copy from Tcl'... Seriously, I'm really too new to Python to suggest the details or even the *style* of this 'level 2 API to multiplexing'. However, I can sketch the implementation, since select() (from C or Tcl) is the one primitive I most depend on!

Basically, as shortly mentioned before, the key problem is the heterogeneity of seemingly-selectable things in Windoze. On Unix, not only does select() work with all descriptor types on which it makes sense, but also the fd used by Xlib is accessible; hence clean multiplexing even with a GUI package is trivial. Now to the real (rotten) meat, that is M$'s. Facts:

1. 'Handle' types are not equal. Unnamed pipes are (surprise!) not selectable. Why? Ask a relative in Redmond...

2. 'Handle' types are not equal (bis). Socket 'handles' are *not* true handles. They are selectable, but for example you can't use 'em for redirections. Okay, in our case we don't care. I only mention it 'cause it's scary and could pop back into your face some time later.

3. The GUI API doesn't expose a descriptor (handle), but fortunately (though disgustingly) there is a special syscall to wait on both "the message queue" and selectable handles: MsgWaitForMultipleObjects. So it's doable, if not beautiful.

The Tcl solution to (1.), which is the only real issue, is to have a separate thread blockingly read 1 byte from the pipe, and then post a message back to the main thread to awaken it (yes, ugly code to handle that extra byte and integrate it with the buffering scheme).

In summary, why not peruse Tcl's hard-won experience on selecting-on-windoze-pipes?

Then, for the API exposed to the Python programmer, the Tclly exposed one is a starter:

    fileevent $channel readable|writable callback
    ...
    vwait breaker_variable

Explanation for non-Tclers: fileevent hooks the callback, vwait does a loop of select(). The callback(s) is (are) called without breaking the loop, unless $breaker_variable is set, at which time vwait returns.

One note about 'breaker_variable': I'm not sure I like it. I'd prefer something based on exceptions.
I don't quite understand why it's not already this way in Tcl (which has (kind of) first-class exceptions), but let's not repeat the mistake: let's suggest that (the equivalent of) vwait loops forever, only to be broken out of by an exception from within one of the callbacks.
    .
    .
    .

I've copied everything Alex wrote, because he writes for me, also. As much as I welcome it, I can't answer Guido's question, "What should the API look like?" I've been mulling this over, and concluded I don't have sufficiently deep knowledge to be trustworthy on this. Instead, I'll just give a bit of personal testimony.

I made the rather coy c.l.p posting, in which I sincerely asked, "How do you expert Pythoneers do it?" (my paraphrase), without disclosing either that Alex and I have been discussing this, or that the Tcl interface we both know is simply a delight to me.

Here's the delight. Guido asked, approximately, "What's the point? Do you need this for more than the keeping-the-GUI-responsive-for-which-there's-already-a-notifier-around case?" The answer is, yes. It's a good question, though. I'll repeat what Alex has said, with my own emphasis: Tcl gives a uniform command API for

* files (including I/O ports, ...)
* subprocesses
* TCP socket connections

and allows the same fcntl()-like configuration of them all as to encodings, blocking, buffering, and character translation. As a programmer, I use this stuff CONSTANTLY, and very happily. It's not just for GUIs; several of my mission-critical delivered products have Tcl-coded daemons to monitor hardware, manage customer transactions, ... It's simply wonderful to be able to evolve a protocol from a socket connection to an fopen() read to ... Tcl is GREAT at "gluing". Python can do it, but Tcl has a couple of years of refinement in regard to portability issues of managing subprocesses.

I really, *really* miss this stuff when I work with a language other than Tcl. I don't often whine, "Language A isn't language B." I'm happy to let individual character come out. This is, for me, an exceptional case. It's not that Python doesn't do it the Tcl way; it's that the Tcl way is wonderful, and moreover that Python doesn't feel to me to have much of an alternative answer. I conclude that there might be something for Python to learn here.

A colleague has also written an even higher-level wrapper in Tcl for asynchronous sockets. I'll likely explain more about it in a follow-up.

Conclusion for now: Alex and I like Python so much that we want you guys to know that better piping-gluing-networking truly is possible, and even worthwhile. This is sort of like the emigrants who've reported, "Yeah, here's the stuff about CPAN that's cool, and how we can have it, too." Through it all, we absolutely want Python to continue to be Python.

From guido@python.org  Mon May 22 16:09:44 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 22 May 2000 08:09:44 -0700
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: Your message of "Mon, 22 May 2000 08:18:22 +0200."
References:
Message-ID: <200005221509.IAA06955@cj20424-a.reston1.va.home.com>

> From: pf@artcom-gmbh.de (Peter Funk)
>
> Guido van Rossum:
> [...]
> > The one objection could be that the locale may be obsolescent -- but I've only heard /F vent an opinion about that; personally, I doubt that we will be able to remove the locale any time soon, even if we invent a better way.
>
> AFAIK locale and friends conform to POSIX.1. Calling this obsolescent...
> hmmm... may offend a *LOT* of people. Try this on comp.os.linux.advocacy ;-)
>
> Although I understand Barry's and Ping's objections against a global state, it used to work very well: on a typical single-user Linux system the user chooses his locale during the first stages of system setup and never has to think about it again. On multi-user systems the locale of individual accounts may be customized using several environment variables, which can override the default locale of the system.
>
> > Plus, I think that "better way" should address this issue anyway. If the locale eventually disappears, the feature automatically disappears with it, because you *have* to make a locale.setlocale() call before the behavior of repr() changes.
>
> The last sentence is at least not the whole truth.
>
> On POSIX systems there are several environment variables used to control the default locale settings for a user's session. For example, on my SuSE Linux system currently running in the German locale, the environment variable LC_CTYPE=de_DE is automatically set by a file /etc/profile during login, which causes the C-library function toupper('ä') to return an 'Ä' ---you should see a lower-case a-umlaut as argument and an upper-case umlaut as return value--- without all applications having to call 'setlocale' explicitly.
>
> So this simply works as intended without having to add calls to 'setlocale' to every application program using these C-library functions.

I don't believe that. According to the ANSI standard, a C program *must* call setlocale(LC_..., "") if it wants the environment variables to be honored; without this call, the locale is always the "C" locale, which should *not* honor the environment variables.

--Guido van Rossum (home page: http://www.python.org/~guido/)
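The distinction Guido draws is observable from Python, since the locale module wraps the same C call. A sketch; the last result assumes a German locale is configured in the environment and that the platform's toupper honors it:

    import locale, string

    print string.upper('\344')              # "C" locale: '\344' unchanged
    locale.setlocale(locale.LC_ALL, "")     # now adopt the user's environment
    print string.upper('\344')              # e.g. de_DE: may now yield '\304'
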
> And we also have that
>
>     format % values
>
> requires "values" to be specifically a tuple rather than any old
> sequence, else the current
>
>     "%s" % some_string
>
> could be interpreted the wrong way.
>
> There may be some hope in that the "for/in" protocol is now conflated
> with the __getitem__ protocol, so if Python grows a more general
> iteration protocol, perhaps we could back away from the sequenceness
> of strings without harming "for" iteration over the characters ...

O-K!  We seem to have a similar conclusion: It would be better if
strings were not sequences, after all.  How to achieve this seems to
be kind of a problem, of course.

Oh, there is another idiom possible!  How about this, after we have
the new string methods :-)

    for ch in string.split():
        muck w/ the character ch

Ok, in the long term, we need to rethink iteration of course.

ciao - chris

--
Christian Tismer             :^)
Applied Biometrics GmbH      :    Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :    PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com

From tismer@tismer.com Mon May 22 13:55:21 2000
From: tismer@tismer.com (Christian Tismer)
Date: Mon, 22 May 2000 14:55:21 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
References: <000201bfc082$50909f80$6c2d153f@tim>
Message-ID: <39292E38.A5A89270@tismer.com>

Tim Peters wrote:
>
> [Christian Tismer]
> > ...
> > After all, it is no surprise.  They are right.
> > If we have to change their mind in order to understand
> > a basic operation, then we are wrong, not they.
>
> [Tim]
> > Huh!  I would not have guessed that you'd give up on Stackless
> > that easily.
>
> [Chris]
> > Noh, I didn't give up Stackless, but fishing for soles.
> > After Just v. R. has become my most ambitious user,
> > I'm happy enough.
>
> I suspect you missed the point: Stackless is the *ultimate* exercise
> in "changing their mind in order to understand a basic operation".  I
> was tweaking you, just as you're tweaking me.

Squeek!  Peace on earth :-)

And you are almost right on Stackless.  Almost, since I know of at
least three new Python users who came to Python *because* it has
Stackless + Continuations.  This is a very new aspect to me.
Things are getting interesting now: Today I got a request from CCP
regarding continuations: They will build a massive parallel
multiplayer game with that.  http://www.ccp.cc/eve

> > It is absolutely fantastic.
> > The most uninteresting stuff in the join is the separator,
> > and it has the power to merge thousands of strings
> > together, without asking the sequence at all
> > - give all power to the suppressed, long live the Python anarchy :-)
>
> Exactly!  Just as love has the power to bind thousands of incompatible
> humans without asking them either: a vote for space.join() is a vote
> for peace on earth.

hmmm - that's so nice...  So let's drop a generic join, and use
string.love() instead.

> while-a-generic-join-builtin-is-a-vote-for-war-ly y'rs - tim

join-is-a-peacemaker-like-a-Winchester-Cathedral-ly y'rs - chris

--
Christian Tismer             :^)
Applied Biometrics GmbH      :    Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :    PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com
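[For readers who haven't tried the 1.6 string methods being joked
about: join really is a method of the separator, so it can merge any
sequence of strings "without asking the sequence at all".  A quick
sketch:]

    words = ['peace', 'on', 'earth']
    print ' '.join(words)    # 'peace on earth' -- space.join(), literally
    print '-'.join(words)    # 'peace-on-earth'
    print ''.join(words)     # plain concatenation: empty separator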
From claird@starbase.neosoft.com Mon May 22 14:09:03 2000
From: claird@starbase.neosoft.com (Cameron Laird)
Date: Mon, 22 May 2000 08:09:03 -0500 (CDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005191513.IAA00818@cj20424-a.reston1.va.home.com>
Message-ID: <200005221309.IAA41866@starbase.neosoft.com>

    From guido@cj20424-a.reston1.va.home.com Fri May 19 07:26:16 2000
    . . .
    Alex, it's disappointing to me too!  There just isn't anything
    currently in the library to do this, and I haven't written apps
    that need this often enough to have a good feel for what kind of
    abstraction is needed.  However, perhaps we can come up with a
    design for something better?  Do you have a suggestion here?

Review: Alex and I have so far presented the Tcl way.  We're still a
bit off-balance at the generosity of spirit that's listening to us so
respectfully.  Still ahead is the hard work of designing an interface
or higher-level abstraction that's right for Python.

The good thing, of course, is that this is absolutely not a language
issue at all.  Python is more than sufficiently expressive for this
matter.  All we're doing is working to insert the right thing in
the (a) library.

    I agree with your comment that higher-level abstractions around
    OS stuff are needed -- I learned system programming long ago, in
    C, and I'm "happy enough" with the current state of affairs, but
    I agree that for many people this is a problem, and there's no
    reason why Python couldn't do better...

I've got a whole list of "higher-level abstractions around OS stuff"
that I've been collecting.  Maybe I'll make it fit for others to see
once we're through this affair ...

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org Mon May 22 17:16:08 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:16:08 -0700
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: Your message of "Mon, 22 May 2000 09:20:50 +0200." <008001bfc3be$7e5eae40$34aab5d4@hagrid>
References: <008001bfc3be$7e5eae40$34aab5d4@hagrid>
Message-ID: <200005221616.JAA07234@cj20424-a.reston1.va.home.com>

> From: "Fredrik Lundh"
>
> Peter Funk wrote:
> > AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
> > hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)
>
> you're missing the point -- now that we've added unicode support to
> Python, the old 8-bit locale *ctype* stuff no longer works.  while some
> platforms implement a wctype interface, it's not widely available, and
> it's not always unicode.

Huh?  We were talking strictly 8-bit strings here.  The locale support
hasn't changed there.

> so in order to provide platform-independent unicode support, Python 1.6
> comes with unicode-aware and fully portable replacements for the ctype
> functions.

For those who only need Latin-1 or another 8-bit ASCII superset, the
Unicode stuff is overkill.

> the code is already in there...
>
> > On POSIX systems there are several environment variables used to
> > control the default locale settings for a user's session.
> > For example,
> > on my SuSE Linux system, currently running in the German locale, the
> > environment variable LC_CTYPE=de_DE is automatically set by a file
> > /etc/profile during login, which automatically causes the C-library
> > function toupper('ä') to return an 'Ä' ---you should see
> > a lower case a-umlaut as argument and an upper case umlaut as return
> > value--- without all applications having to call 'setlocale' explicitly.
> >
> > So this simply works well as intended, without having to add calls
> > to 'setlocale' to all application programs using these C-library
> > functions.
>
> note that this leaves us with four string flavours in 1.6:
>
> - 8-bit binary arrays.  may contain binary goop, or text in some strange
>   encoding.  upper, strip, etc should not be used.

These are not strings.

> - 8-bit text strings using the system encoding.  upper, strip, etc works
>   as long as the locale is properly configured.
>
> - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
>   system encoding is a subset of unicode -- which means US ASCII or
>   ISO Latin 1.

This is a figment of your imagination.  You can use 8-bit text strings
to contain Latin-1, but you have to set your locale to match.

> - wide unicode text strings.  upper, strip, etc always works.
>
> is this complexity really worth it?

From a backwards compatibility point of view, yes.  Basically,
programs that don't use Unicode should see no change in semantics.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From pf@artcom-gmbh.de Mon May 22 14:02:18 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Mon, 22 May 2000 15:02:18 +0200 (MEST)
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: <200005221509.IAA06955@cj20424-a.reston1.va.home.com> from Guido van Rossum at "May 22, 2000 8: 9:44 am"
Message-ID:

Hi!

[...]
[me]:
> > So this simply works well as intended, without having to add calls
> > to 'setlocale' to all application programs using these C-library
> > functions.

[Guido van Rossum]:
> I don't believe that.  According to the ANSI standard, a C program
> *must* call setlocale(LC_..., "") if it wants the environment
> variables to be honored; without this call, the locale is always the
> "C" locale, which should *not* honor the environment variables.

pf@pefunbk> python
Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import string
>>> print string.upper("ä")
Ä
>>>

This was the vanilla Python 1.5.2 as originally delivered by SuSE
Linux.  But yes, you are right. :-(  My memory was confused by this
practical experience.  Now I like to quote from the man pages here:

man toupper:
    [...]
    BUGS
        The details of what constitutes an uppercase or lowercase
        letter depend on the current locale.  For example, the
        default "C" locale does not know about umlauts, so no
        conversion is done for them.
        In some non-English locales, there are lowercase letters
        with no corresponding uppercase equivalent; the German
        sharp s is one example.

man setlocale:
    [...]
        A program may be made portable to all locales by calling
        setlocale(LC_ALL, "") after program initialization, by
        using the values returned from a localeconv() call for
        locale-dependent information and by using strcoll() or
        strxfrm() to compare strings.
    [...]
    CONFORMING TO
        ANSI C, POSIX.1
        Linux (that is, libc) supports the portable locales "C"
        and "POSIX".
        In the good old days there used to be support for the
        European Latin-1 "ISO-8859-1" locale (e.g. in libc-4.5.21
        and libc-4.6.27), and the Russian "KOI-8" (more precisely,
        "koi-8r") locale (e.g. in libc-4.6.27), so that having an
        environment variable LC_CTYPE=ISO-8859-1 sufficed to make
        isprint() return the right answer.  These days non-English
        speaking Europeans have to work a bit harder, and must
        install actual locale files.
    [...]

In recent Linux distributions almost every Linux C-program seems to
contain this obligatory 'setlocale(LC_ALL, "");' line, so it's easy
to forget about it.  However, the core Python interpreter does not.
It seems the Linux C library is not fully ANSI compliant in this
case: it seems to honour the setting of $LANG regardless of whether a
program calls 'setlocale' or not.

Regards, Peter

From guido@python.org Mon May 22 17:31:50 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:31:50 -0700
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
In-Reply-To: Your message of "Mon, 22 May 2000 10:25:21 +0200." <3928EEF1.693F@cnet.francetelecom.fr>
References: <92F3F78F2E523B81.794E00EE6EFC8B37.2D5DBFEF2B39A7A2@lp.airnews.net> <39242D1B.78773AA2@python.org> <39250897.6F42@cnet.francetelecom.fr> <200005191513.IAA00818@cj20424-a.reston1.va.home.com> <3928EEF1.693F@cnet.francetelecom.fr>
Message-ID: <200005221631.JAA07272@cj20424-a.reston1.va.home.com>

> Yup.  One easy answer is 'just copy from Tcl'...

Tcl seems to be your only frame of reference.  I think it's too early
to say that borrowing Tcl's design is right for Python.  Don't forget
that part of Tcl's design was guided by the desire for backwards
compatibility with Tcl's strong (stronger than Python's, I find!)
Unix background.

> Seriously, I'm really too new to Python to suggest the details or even
> the *style* of this 'level 2 API to multiplexing'.  However, I can
> sketch the implementation, since select() (from C or Tcl) is the one
> primitive I most depend on!
>
> Basically, as shortly mentioned before, the key problem is the
> heterogeneity of seemingly-selectable things in Windoze.  On unix, not
> only does select() work with all descriptor types on which it makes
> sense, but also the fd used by Xlib is accessible; hence clean
> multiplexing even with a GUI package is trivial.  Now to the real
> (rotten) meat, that is M$'s.  Facts:

Note that on Windows, select() is part of SOCKLIB, which explains why
it only understands sockets.  Native Windows code uses the
wait-for-event primitives that you are describing, and these are
powerful enough to wait on named pipes, sockets, and GUI events.
Complaining about the select interface on Windows isn't quite fair.

> 1. 'Handle' types are not equal.  Unnamed pipes are (surprise!) not
> selectable.  Why?  Ask a relative in Redmond...

Can we cut the name-calling?

> 2. 'Handle' types are not equal (bis).  Socket 'handles' are *not*
> true handles.  They are selectable, but for example you can't use 'em
> for redirections.  Okay, in our case we don't care.  I only mention it
> 'cause it's scary and could pop back into your face some time later.

Handles are a much more low-level concept than file descriptors.  Get
used to it.

> 3. The GUI API doesn't expose a descriptor (handle), but fortunately
> (though disgustingly) there is a special syscall to wait on both "the
> message queue" and selectable handles: MsgWaitForMultipleObjects.  So
> it's doable, if not beautiful.
>
> The Tcl solution to (1.), which is the only real issue,

Why is (1) the only issue?  Maybe in Tcl-land...

> is to have a
> separate thread blockingly read 1 byte from the pipe, and then post a
> message back to the main thread to awaken it (yes, ugly code to handle
> that extra byte and integrate it with the buffering scheme).

Or the exposed API could deal with this in a different way.

> In summary, why not peruse Tcl's hard-won experience on
> selecting-on-windoze-pipes?

Because it's designed for Tcl.

> Then, for the API exposed to the Python programmer, the Tclly exposed
> one is a starter:
>
>     fileevent $channel readable|writable callback
>     ...
>     vwait breaker_variable
>
> Explanation for non-Tclers: fileevent hooks the callback, vwait does a
> loop of select().  The callback(s) is(are) called without breaking the
> loop, unless $breaker_variable is set, at which time vwait returns.

Sorry, you've lost me here.  Fortunately there's more info at
http://dev.scriptics.com/man/tcl8.3/TclCmd/fileevent.htm.  It looks
very complicated, and I'm not sure why you rejected my earlier
suggestion to use threads outright as "too complicated".  After
reading that man page, threads seem easy compared to the caution one
has to exert when using non-blocking I/O.

> One note about 'breaker_variable': I'm not sure I like it.  I'd prefer
> something based on exceptions.  I don't quite understand why it's not
> already this way in Tcl (which has (kindof) first-class exceptions),
> but let's not repeat the mistake: let's suggest that (the equivalent
> of) vwait loops forever, only to be broken out by an exception from
> within one of the callbacks.

Vwait seems to be part of the Tcl event model.  Maybe we would need
to think about an event model for Python?  On the other hand, Python
is at the mercy of the event model of whatever GUI package it is
using -- which could be Tk, or wxWindows, or Gtk, or native Windows,
or native MacOS, or any of a number of other event models.  Perhaps
this is an issue that each GUI package available to Python will have
to deal with separately...

--Guido van Rossum (home page: http://www.python.org/~guido/)
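[To make the model under discussion concrete: a rough Python sketch of
a fileevent/vwait-like layer over the standard select module.  Class
and method names are invented here, and the loop is wired for the
exception-based exit Alexandre argues for rather than a breaker
variable:]

    import select

    class BreakLoop(Exception):
        pass

    class EventLoop:
        def __init__(self):
            self.readers = {}   # channel (anything select() accepts) -> callback
            self.writers = {}

        def fileevent(self, channel, mode, callback):
            # Register a callback for 'readable' or 'writable' on a channel.
            table = {'readable': self.readers, 'writable': self.writers}[mode]
            table[channel] = callback

        def run(self):
            # Like vwait, but it loops forever; breaking out is done by
            # raising an exception (e.g. BreakLoop) inside a callback.
            while 1:
                r, w, e = select.select(self.readers.keys(),
                                        self.writers.keys(), [])
                for channel in r:
                    self.readers[channel](channel)
                for channel in w:
                    self.writers[channel](channel)

[An exception raised in a callback propagates straight out of run(),
so nested loops unwind for free -- which is the advantage claimed for
exceptions over a breaker variable.]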
From guido@python.org Mon May 22 17:49:24 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:49:24 -0700
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
In-Reply-To: Your message of "Mon, 22 May 2000 14:40:51 +0200." <39292AD2.F5080E35@tismer.com>
References: <000301bfc082$51ce0180$6c2d153f@tim> <39292AD2.F5080E35@tismer.com>
Message-ID: <200005221649.JAA07398@cj20424-a.reston1.va.home.com>

Christian, there was a smiley in your signature, so I can safely
ignore it, right?  It doesn't make sense at all to me to make "abc"[0]
return 97 instead of "a".

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org Mon May 22 17:54:35 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:54:35 -0700
Subject: Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)
In-Reply-To: Your message of "Mon, 22 May 2000 15:02:18 +0200."
References:
Message-ID: <200005221654.JAA07426@cj20424-a.reston1.va.home.com>

> pf@pefunbk> python
> Python 1.5.2 (#1, Jul 23 1999, 06:38:16)  [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
> Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> >>> import string
> >>> print string.upper("ä")
> Ä
> >>>

This threw me off too.  However, try this:

    python -c 'print "ä".upper()'

It will print "ä".  A mystery?  No, the GNU readline library calls
setlocale().  It is wrong, but I can't help it.  But it only affects
interactive use of Python.
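[Peter's experiment can be repeated non-interactively, so GNU readline
never enters the picture, to watch the ANSI rule in action -- a sketch
for an 8-bit Python of this vintage, assuming a German locale such as
de_DE is actually installed:]

    import locale, string

    print string.upper("ä")              # 'ä': the default "C" locale
                                         # knows nothing about umlauts
    locale.setlocale(locale.LC_ALL, "")  # now honor LANG/LC_* settings
    print string.upper("ä")              # 'Ä', e.g. under LC_CTYPE=de_DE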
> In recent Linux distributions almost every Linux C-program seems to
> contain this obligatory 'setlocale(LC_ALL, "");' line, so it's easy
> to forget about it.  However, the core Python interpreter does not.
> It seems the Linux C library is not fully ANSI compliant in this
> case: it seems to honour the setting of $LANG regardless of whether a
> program calls 'setlocale' or not.

No, the explanation is in GNU readline.  Compile this little program
and see for yourself:

    #include <stdio.h>
    #include <ctype.h>

    main()
    {
        printf("toupper(%c) = %c\n", 'ä', toupper('ä'));
    }

--Guido van Rossum (home page: http://www.python.org/~guido/)

From tismer@tismer.com Mon May 22 15:11:37 2000
From: tismer@tismer.com (Christian Tismer)
Date: Mon, 22 May 2000 16:11:37 +0200
Subject: [Python-Dev] OOps (was: No 1.6! (was Re: A REALLY COOL PYTHON FEATURE:))
References: <000301bfc082$51ce0180$6c2d153f@tim> <39292AD2.F5080E35@tismer.com> <200005221649.JAA07398@cj20424-a.reston1.va.home.com>
Message-ID: <39294019.3CB47800@tismer.com>

Guido van Rossum wrote:
>
> Christian, there was a smiley in your signature, so I can safely
> ignore it, right?  It doesn't make sense at all to me to make "abc"[0]
> return 97 instead of "a".

There was a smiley, mostly because I cannot decide what I want.
I'm quite convinced that strings should better not be sequences,
at least not sequences of strings.  "abc"[0:1] would be enough;
"abc"[0] isn't worth the side effects, as listed in Tim's posting.

ciao - chris

--
Christian Tismer             :^)
Applied Biometrics GmbH      :    Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :    PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com

From fdrake@acm.org Mon May 22 15:12:54 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Mon, 22 May 2000 07:12:54 -0700 (PDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To: <200005200119.SAA02183@cj20424-a.reston1.va.home.com>
Message-ID:

On Fri, 19 May 2000, Guido van Rossum wrote:
> Hm, that's bogus.  It works well under Windows -- with the restriction
> that it only works for sockets, but for sockets it works as well as
> on Unix.  It also works well on the Mac.  I wonder where that note
> came from (it's probably 6 years old :-).

Is that still in there?  If I could get a pointer from someone, I'll
be able to track it down.  I didn't see it in the select or socket
module documents, and a quick grep didn't find 'really work'.
It's definitely fixable if we can find it. ;)

-Fred

--
Fred L. Drake, Jr.

From fdrake@acm.org Mon May 22 15:21:48 2000
From: fdrake@acm.org (Fred L. Drake)
Date: Mon, 22 May 2000 07:21:48 -0700 (PDT)
Subject: [Python-Dev] Re: Python multiplexing is too hard (was: Network statistics program)
In-Reply-To:
Message-ID:

On Fri, 19 May 2000, David Ascher wrote:
> I'm pretty sure I know where it came from -- it came from Sam Rushing's
> tutorial on how to use Medusa, which was more or less cut & pasted into
> the doc, probably at the time that asyncore and asynchat were added to
> the Python core.  IMO, it's not the best part of the Python doc -- it is
> much too low-to-the-ground, and assumes the reader already understands
> much about I/O, sync/async issues, and cares mostly about high
> performance.  All of
It's a fairly young section, and I haven't had as much time to review
and edit that or some of the other young sections.  I'll try to pay
particular attention to these as I work on the 1.6 release.

> which are true of wonderful Sam, most of which are not true of the
> average Python user.
>
> While we're complaining about doc, asynchat is not documented, I
> believe.  Alas, I'm unable to find the time to write up said
> documentation.

Should that situation change, I'll gladly accept a section on
asynchat!  Or, if anyone else has time to contribute...??

-Fred

--
Fred L. Drake, Jr.
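[As a yardstick for what that doc section might open with: a minimal
echo server written against asyncore's dispatcher API -- a sketch
only, with an arbitrary port number:]

    import asyncore
    import socket

    class EchoHandler(asyncore.dispatcher):
        # One dispatcher per accepted connection: send back what arrives.
        def handle_read(self):
            data = self.recv(8192)
            if data:
                self.send(data)

    class EchoServer(asyncore.dispatcher):
        def __init__(self, port):
            asyncore.dispatcher.__init__(self)
            self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
            self.set_reuse_addr()
            self.bind(('', port))
            self.listen(5)

        def handle_accept(self):
            conn, addr = self.accept()
            EchoHandler(conn)

    EchoServer(8007)    # port number is arbitrary
    asyncore.loop()     # drive every dispatcher from one select() loop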
From skip@mojam.com Mon May 22 15:25:00 2000
From: skip@mojam.com (Skip Montanaro)
Date: Mon, 22 May 2000 09:25:00 -0500 (CDT)
Subject: [Python-Dev] ANNOUNCE: Python CVS tree moved to SourceForge
In-Reply-To: <200005212205.PAA05512@cj20424-a.reston1.va.home.com>
References: <200005212205.PAA05512@cj20424-a.reston1.va.home.com>
Message-ID: <14633.17212.650090.540777@beluga.mojam.com>

    Guido> If you have an existing working tree that points to the
    Guido> cvs.python.org repository, you may want to retarget it to the
    Guido> SourceForge tree.  This can be done painlessly with Greg Ward's
    Guido> cvs_chroot script:

    Guido>     http://starship.python.net/~gward/python/

I tried this with (so far) no apparent success.  I ran cvs_chroot as

    cvs_chroot :pserver:anonymous@cvs.python.sourceforge.net:/cvsroot/python

It warned me about some directories that didn't match the top level
directory.  "No problem", I thought.  I figured they were for the
nondist portions of the tree.  When I tried a cvs update after logging
in to the SourceForge cvs server, I got tons of messages that looked
like:

    cvs update: move away dist/src/Tools/scripts/untabify.py; it is in the way
    C dist/src/Tools/scripts/untabify.py

It doesn't look like untabify.py has been hosed, but the warnings
worry me.  Anyone else encounter this problem?  If so, what's its
meaning?

--
Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth.  We did
not ask for this role... We may not be suited to it, but here we are."
- Stephen Jay Gould

From alexandre.ferrieux@cnet.francetelecom.fr Mon May 22 15:51:56 2000
From: alexandre.ferrieux@cnet.francetelecom.fr (Alexandre Ferrieux)
Date: Mon, 22 May 2000 16:51:56 +0200
Subject: [Python-Dev] Towards native fileevents in Python (Was Re: Python multiplexing is too hard)
References: <92F3F78F2E523B81.794E00EE6EFC8B37.2D5DBFEF2B39A7A2@lp.airnews.net> <39242D1B.78773AA2@python.org> <39250897.6F42@cnet.francetelecom.fr> <200005191513.IAA00818@cj20424-a.reston1.va.home.com> <3928EEF1.693F@cnet.francetelecom.fr> <200005221631.JAA07272@cj20424-a.reston1.va.home.com>
Message-ID: <3929498C.1941@cnet.francetelecom.fr>

Guido van Rossum wrote:
>
> > Yup.  One easy answer is 'just copy from Tcl'...
>
> Tcl seems to be your only frame of reference.

Nope, but I'll welcome any proof of existence of similar abstractions
(for multiplexing) elsewhere.

> I think it's too early
> to say that borrowing Tcl's design is right for Python.  Don't forget
> that part of Tcl's design was guided by the desire for backwards
> compatibility with Tcl's strong (stronger than Python's, I find!)
> Unix background.

I don't quite get how the 'unix background' comes into play here,
since [fileevent] is now implemented and works correctly on all
platforms.  If you are talking about the API as seen from above, I
don't understand why 'hooking a callback' and 'multiplexing event
sources' are a unix specificity, and/or why it should be avoided
outside unix.

> > Seriously, I'm really too new to Python to suggest the details or
> > even the *style* of this 'level 2 API to multiplexing'.  However, I
> > can sketch the implementation, since select() (from C or Tcl) is the
> > one primitive I most depend on!
> >
> > Basically, as shortly mentioned before, the key problem is the
> > heterogeneity of seemingly-selectable things in Windoze.  On unix,
> > not only does select() work with all descriptor types on which it
> > makes sense, but also the fd used by Xlib is accessible; hence clean
> > multiplexing even with a GUI package is trivial.  Now to the real
> > (rotten) meat, that is M$'s.  Facts:
>
> Note that on Windows, select() is part of SOCKLIB, which explains why
> it only understands sockets.  Native Windows code uses the
> wait-for-event primitives that you are describing, and these are
> powerful enough to wait on named pipes, sockets, and GUI events.
> Complaining about the select interface on Windows isn't quite fair.

Sorry, you missed the point.  Here I used the term 'select()' as a
generic one (I didn't want to pollute a general discussion with
OS-specific names...).  On windows it means MsgWaitForMultipleObjects.
Now as you said "these are powerful enough to wait on named pipes,
sockets, and GUI events"; I won't deny the obvious truth.  However,
again, they don't work on *unnamed pipes* (which are the only ones in
'95).  That's my sole reason for complaining, and I'm afraid it is
fair ;-)

> > 1. 'Handle' types are not equal.  Unnamed pipes are (surprise!) not
> > selectable.  Why?  Ask a relative in Redmond...
>
> Can we cut the name-calling?

Yes we can :^P

> > 2. 'Handle' types are not equal (bis).  Socket 'handles' are *not*
> > true handles.  They are selectable, but for example you can't use
> > 'em for redirections.  Okay, in our case we don't care.  I only
> > mention it 'cause it's scary and could pop back into your face some
> > time later.
>
> Handles are a much more low-level concept than file descriptors.  Get
> used to it.

Take it easy, I meant to help.  Low-level as they may be, can you
explain why *some* can be passed to CreateProcess as redirections,
and *some* can't?  Obviously there *is* some attempt to unify things
in Windows (if only the single name of 'handle'); and just as clearly
it is not completely successful.

> > 3. The GUI API doesn't expose a descriptor (handle), but fortunately
> > (though disgustingly) there is a special syscall to wait on both
> > "the message queue" and selectable handles:
> > MsgWaitForMultipleObjects.  So it's doable, if not beautiful.
> >
> > The Tcl solution to (1.), which is the only real issue,
>
> Why is (1) the only issue?

Because for (2) we don't care (no need for redirections in our case)
and for (3) the judgement is only aesthetic.

> Maybe in Tcl-land...

Come on, I'm emigrating from Tcl to Python with open palms, as Cameron
puts it.  I've already mentioned the outstanding beauty of Python's
internal design, and in comparison Tcl is absolutely awful.  Even at
the (script) API level, some of the early choices in Tcl are
disgusting (and some recent ones too...).  I'm really turning to
Python with the greatest pleasure - please don't interpret my
arguments as yet another Lang1 vs. Lang2 flamewar.
> > is to have a
> > separate thread blockingly read 1 byte from the pipe, and then post
> > a message back to the main thread to awaken it (yes, ugly code to
> > handle that extra byte and integrate it with the buffering scheme).
>
> Or the exposed API could deal with this in a different way.

Please elaborate?

> > In summary, why not peruse Tcl's hard-won experience on
> > selecting-on-windoze-pipes?
>
> Because it's designed for Tcl.

I said 'why not' as a positive suggestion.  I didn't expect you to
actually say why not...  Moreover, I don't understand 'designed for
Tcl'.  What's specific to Tcl in unifying descriptor types?

> > Then, for the API exposed to the Python programmer, the Tclly
> > exposed one is a starter:
> >
> >     fileevent $channel readable|writable callback
> >     ...
> >     vwait breaker_variable
> >
> > Explanation for non-Tclers: fileevent hooks the callback, vwait does
> > a loop of select().  The callback(s) is(are) called without breaking
> > the loop, unless $breaker_variable is set, at which time vwait
> > returns.
>
> Sorry, you've lost me here.  Fortunately there's more info at
> http://dev.scriptics.com/man/tcl8.3/TclCmd/fileevent.htm.  It looks
> very complicated,

Ahem, self-destroying argument: "Fortunately ... very complicated".
While I agree the fileevent manpage is longer than it should be, I
fail to see what's complicated in the model of 'hooking a callback
for a given kind of event'.

> and I'm not sure why you rejected my earlier
> suggestion to use threads outright as "too complicated".

Not on the same level.  You're complaining about the script-level API
(or its documentation, more precisely!).  I dismissed the thread-based
*implementation* as overkill in terms of resource consumption (thread
context + switching + ITC) on platforms which can use select() (for
anon pipes on Windows, as already explained, the thread is
unavoidable).

> After
> reading that man page, threads seem easy compared to the caution one
> has to exert when using non-blocking I/O.

Oh, I get it.  The problem is, *that* manpage unfortunately tries to
explain event-based and non-blocking I/O at the same time (presumably
because the average user will never follow the 'See Also' links).
That's a blatant pedagogic mistake.  Let me try:

    fileevent readable|writable