From brett at python.org Fri Jun 1 00:51:54 2007
From: brett at python.org (Brett Cannon)
Date: Thu, 31 May 2007 15:51:54 -0700
Subject: [Python-3000] Is the --enable-unicode configure arg going anywhere?
Message-ID: 

I vaguely remember a discussion about the str/unicode unification and whether there was going to be standardization on the internal representation of Unicode or not. I don't remember the outcome, but I am curious as to whether it will lead to the removal of --enable-unicode or not.

Reason I ask is that the OS X extension modules do not like it when you compile with UCS-4 (see http://www.python.org/sf/763708). If the option is not going to go away I am going to try to lean on someone to address this as Unicode is obviously going to play a bigger role in Python come 3.0.

-Brett

From guido at python.org Fri Jun 1 00:58:56 2007
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Jun 2007 06:58:56 +0800
Subject: [Python-3000] [Python-Dev] PEP 367: New Super
In-Reply-To: <20070531170734.273393A40AA@sparrow.telecommunity.com>
References: <001101c79aa7$eb26c130$0201a8c0@mshome.net> <002d01c79f6d$ce090de0$0201a8c0@mshome.net> <003f01c79fd9$66948ec0$0201a8c0@mshome.net> <009c01c7a04f$7e348460$0201a8c0@mshome.net> <20070531170734.273393A40AA@sparrow.telecommunity.com>
Message-ID: 

Ouch. You're right. Class methods are broken by this patch. I don't have time right now to look into a fix (thanks for the various suggestions) but if somebody doesn't get to it first I'll look into this in-depth on Monday.

class C:
    @classmethod
    def cm(cls): return cls.__name__

class D(C):
    pass

print(D.cm(), D().cm())

This prints "C C" with the patch, but "D D" without it. Clearly this shouldn't change.

--Guido

On 6/1/07, Phillip J. Eby wrote:
> At 07:48 PM 5/31/2007 +0800, Guido van Rossum wrote:
> >I've updated the patch; the latest version now contains the grammar and compiler changes needed to make super a keyword and to automatically add a required parameter 'super' when super is used. This requires the latest p3yk branch (r55692 or higher).
> >
> >Comments anyone? What do people think of the change of semantics for the im_class field of bound (and unbound) methods?
>
> Please correct me if I'm wrong, but just looking at the patch it seems to me that the descriptor protocol is being changed as well -- i.e., the 'type' argument is now the found-in-type in the case of an instance __get__ as well as class __get__.
>
> It would seem to me that this change would break classmethods both on the instance and class level, since the 'cls' argument is supposed to be the derived class, not the class where the method was defined. There also don't seem to be any tests for the use of super in classmethods.
>
> This would seem to make the change unworkable, unless we are also getting rid of classmethods, or further change the descriptor protocol to add another argument. However, by the time we get to that point, it seems like making 'super' a cell variable might be a better option.
>
> Here's a strategy that I think could resolve your difficulties with the cell variable approach:
>
> First, when a class is encountered during the symbol setup pass, allocate an extra symbol for the class as a cell variable with a generated name (e.g. $1, $2, etc.), and keep a pointer to this name in the class state information.
>
> Second, when generating code for 'super', pull out the generated variable name of the nearest enclosing class, and use it as if it had been written in the code.
>
> Third, change the MAKE_FUNCTION for the BUILD_CLASS to a MAKE_CLOSURE, and add code after BUILD_CLASS to also store a super object in the special variable. Maybe something like:
>
>     ...
>     BUILD_CLASS
>     ... apply decorators ...
>     DUP_TOP
>     STORE_* classname
>     ... generate super object ...
>     STORE_DEREF $n
>
> Fourth, make sure that the frame initialization code can deal with a code object that has a locals dictionary *and* cell variables. For Python 2.5, this constraint is already met as long as CO_OPTIMIZED isn't set, and that should already be true for the relevant cases (module-level code and class bodies), so we really just need to ensure that CO_OPTIMIZED doesn't get set as a side-effect of adding cell variables.
>

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From greg.ewing at canterbury.ac.nz Fri Jun 1 01:27:37 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 01 Jun 2007 11:27:37 +1200
Subject: [Python-3000] Lines breaking
In-Reply-To: <8764691mpq.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <465B814D.2060101@canterbury.ac.nz> <465BB994.9050309@v.loewis.de> <465CD016.7050002@canterbury.ac.nz> <87ps4j0zi6.fsf@uwakimon.sk.tsukuba.ac.jp> <465E37C3.9070407@canterbury.ac.nz> <8764691mpq.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <465F59E9.4030702@canterbury.ac.nz>

Stephen J. Turnbull wrote:

> *Python* does the right thing: it leaves the line break character(s) in place. It's not Python's problem if programmers go around stripping characters just because they happen to be at the end of the line.

But currently you *know* that, e.g. string.strip() will only ever remove whitespace and \n characters, so if those don't matter to you, it's safe to use it. I would be worried if it started removing characters that it didn't remove before, because that could alter the semantics of my code.

> Those characters are mandatory breaks because the expectation is *very* consistent (they say).

I object to being told by the Unicode committee what semantics I should be using for ASCII characters that pre-date unicode by a long way.

-- 
Greg

From guido at python.org Fri Jun 1 01:50:29 2007
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Jun 2007 07:50:29 +0800
Subject: [Python-3000] __debug__
In-Reply-To: 
References: 
Message-ID: 

Making __debug__ another keyword atom sounds great to me.

On 6/1/07, Thomas Heller wrote:
> Brett Cannon schrieb:
> > On 5/31/07, Georg Brandl wrote:
> >>
> >> Guido just fixed a case in the py3k branch where you could assign to "None" in a function call.
> >>
> >> __debug__ has similar problems: it can't be assigned to normally, but via keyword arguments it is possible.
> >>
> >> This should be fixed; or should __debug__ be thrown out anyway?
> >
> > I never use the flag, personally. When I am debugging I have an app-specific flag I set. I am +1 on ditching it.
> >
> > -Brett
>
> I would very much wish that __debug__ stays, because I use it in nearly every larger program that I later wish to freeze and distribute.
>
> "if __debug__: ..." blocks have the advantage that *no* bytecode is generated when run or frozen with -O or -OO, so the modules imported in these blocks are not pulled in by modulefinder. You cannot get this effect (AFAIK) with app-specific flags.
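(To make the mechanism Thomas describes concrete, here is a minimal sketch; the module and helper names are invented for illustration. Because the compiler treats __debug__ as a constant, the whole "if __debug__:" suite produces no bytecode under -O/-OO, so modulefinder never sees the import:)

# debug_support.py -- hypothetical example, not from the thread
if __debug__:
    import pdb  # pulled in (and seen by modulefinder) only in debug runs

    def checkpoint():
        pdb.set_trace()
else:
    def checkpoint():
        pass  # under -O/-OO this is the only definition that gets compiled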
>
> Thanks,
> Thomas
>
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org
>

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org Fri Jun 1 01:55:22 2007
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Jun 2007 07:55:22 +0800
Subject: [Python-3000] Is the --enable-unicode configure arg going anywhere?
In-Reply-To: 
References: 
Message-ID: 

I don't know exactly what that option does; it won't be possible to disable unicode in 3.0, but I fully plan to continue supporting both 2-byte and 4-byte storage. 4-byte storage is broken on OS X; it ought to be fixed (unless it's a platform policy not to support it, as appears to be the case on Windows).

--Guido

On 6/1/07, Brett Cannon wrote:
> I vaguely remember a discussion about the str/unicode unification and whether there was going to be standardization on the internal representation of Unicode or not. I don't remember the outcome, but I am curious as to whether it will lead to the removal of --enable-unicode or not.
>
> Reason I ask is that the OS X extension modules do not like it when you compile with UCS-4 (see http://www.python.org/sf/763708). If the option is not going to go away I am going to try to lean on someone to address this as Unicode is obviously going to play a bigger role in Python come 3.0.
>
> -Brett

From brett at python.org Fri Jun 1 03:59:19 2007
From: brett at python.org (Brett Cannon)
Date: Thu, 31 May 2007 18:59:19 -0700
Subject: [Python-3000] Is the --enable-unicode configure arg going anywhere?
In-Reply-To: 
References: 
Message-ID: 

On 5/31/07, Guido van Rossum wrote:
>
> I don't know exactly what that option does;

It specifies how Unicode is stored internally in the interpreter (I believe).

> it won't be possible to disable unicode in 3.0, but I fully plan to continue supporting both 2-byte and 4-byte storage. 4-byte storage is broken on OS X; it ought to be fixed (unless it's a platform policy not to support it, as appears to be the case on Windows).

It's broken in the Mac extension modules that are auto-generated. Otherwise it's fine.

-Brett

> --Guido
>
> On 6/1/07, Brett Cannon wrote:
> > I vaguely remember a discussion about the str/unicode unification and whether there was going to be standardization on the internal representation of Unicode or not. I don't remember the outcome, but I am curious as to whether it will lead to the removal of --enable-unicode or not.
> >
> > Reason I ask is that the OS X extension modules do not like it when you compile with UCS-4 (see http://www.python.org/sf/763708). If the option is not going to go away I am going to try to lean on someone to address this as Unicode is obviously going to play a bigger role in Python come 3.0.
> >
> > -Brett
> >
> > _______________________________________________
> > Python-3000 mailing list
> > Python-3000 at python.org
> > http://mail.python.org/mailman/listinfo/python-3000
> > Unsubscribe:
> > http://mail.python.org/mailman/options/python-3000/guido%40python.org
> >
>
> -- 
> --Guido van Rossum (home page: http://www.python.org/~guido/)

From stephen at xemacs.org Fri Jun 1 05:23:54 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 01 Jun 2007 12:23:54 +0900
Subject: [Python-3000] Lines breaking
In-Reply-To: <465F59E9.4030702@canterbury.ac.nz>
References: <465B814D.2060101@canterbury.ac.nz> <465BB994.9050309@v.loewis.de> <465CD016.7050002@canterbury.ac.nz> <87ps4j0zi6.fsf@uwakimon.sk.tsukuba.ac.jp> <465E37C3.9070407@canterbury.ac.nz> <8764691mpq.fsf@uwakimon.sk.tsukuba.ac.jp> <465F59E9.4030702@canterbury.ac.nz>
Message-ID: <87y7j4z7at.fsf@uwakimon.sk.tsukuba.ac.jp>

Greg Ewing writes:
> Stephen J. Turnbull wrote:
>
> > *Python* does the right thing: it leaves the line break character(s) in place. It's not Python's problem if programmers go around stripping characters just because they happen to be at the end of the line.
>
> But currently you *know* that, e.g. string.strip() will only ever remove whitespace and \n characters, so if those don't matter to you, it's safe to use it.

Yes. Both FF and VT *are* whitespace, AFAIK that has universal agreement, and in particular they *are* removed by string.strip(). I don't understand what you're worried about; nothing changes with respect to handling of generic whitespace.

The *only* thing that adoption of the Unicode recommendation for line breaking changes is that "\x0c\n" is now two empty lines with well-defined semantics instead of some number of lines with you-won't-know-until-you-ask-the-implementation semantics.

> > Those characters are mandatory breaks because the expectation is *very* consistent (they say).

> I object to being told by the Unicode committee what semantics I should be using for ASCII characters that pre-date unicode by a long way.

The ASCII standard, at least as codified in ISO 646, agrees with Unicode, by referring to ECMA-48/ISO 6429 for the definition of the 32 C0 characters. I suspect that the ANSI standard semantics of FF and VT haven't changed since ANSI_X3.4-1963.

You just object to adopting a standard, period, because it might force you to change your practices. That's reasonable, changing working software is expensive. But interoperability is an important goal too.

From hagenf at CoLi.Uni-SB.DE Fri Jun 1 07:17:48 2007
From: hagenf at CoLi.Uni-SB.DE (Hagen Fürstenau)
Date: Fri, 01 Jun 2007 07:17:48 +0200
Subject: [Python-3000] Is the --enable-unicode configure arg going anywhere?
In-Reply-To: 
References: 
Message-ID: <465FABFC.5040804@coli.uni-saarland.de>

Hi,

I've been reading the list for a couple of weeks, but this is my first post.

Guido van Rossum wrote:
> I don't know exactly what that option does; it won't be possible to disable unicode in 3.0, but I fully plan to continue supporting both 2-byte and 4-byte storage.

Does this still include the possibility of switching between 1-, 2- and 4-byte storage internally? I think you mentioned this in your Google talk and I thought it was a very good compromise - and much better than a compile-time switch.
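(For reference, which storage width a given interpreter was built with is visible at runtime; a small illustrative check using sys.maxunicode, which CPython already exposes:)

import sys

# The width is fixed at build time (./configure --enable-unicode=ucs2 or ucs4).
if sys.maxunicode > 0xFFFF:
    print("4-byte (UCS-4) storage: one code unit per character")
else:
    print("2-byte storage: non-BMP characters become surrogate pairs")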
- Hagen

From martin at v.loewis.de Fri Jun 1 08:05:16 2007
From: martin at v.loewis.de (Martin v. Löwis)
Date: Fri, 01 Jun 2007 08:05:16 +0200
Subject: [Python-3000] Is the --enable-unicode configure arg going anywhere?
In-Reply-To: <465FABFC.5040804@coli.uni-saarland.de>
References: <465FABFC.5040804@coli.uni-saarland.de>
Message-ID: <465FB71C.5060201@v.loewis.de>

> Guido van Rossum wrote:
>> I don't know exactly what that option does; it won't be possible to disable unicode in 3.0, but I fully plan to continue supporting both 2-byte and 4-byte storage.
>
> Does this still include the possibility of switching between 1-, 2- and 4-byte storage internally? I think you mentioned this in your Google talk and I thought it was a very good compromise - and much better than a compile-time switch.

In the current py3k-struni branch, it's still a compile time option. I doubt that will change unless somebody contributes code to make it change. The current compile-time option is between 2-byte and 4-byte representation; 1-byte representation is not supported.

Regards,
Martin

From jason.orendorff at gmail.com Fri Jun 1 18:05:32 2007
From: jason.orendorff at gmail.com (Jason Orendorff)
Date: Fri, 1 Jun 2007 12:05:32 -0400
Subject: [Python-3000] map, filter, reduce
Message-ID: 

PEP 3100 still isn't clear on the fate of these guys, except that reduce() is gone.

How about moving all three to the functools module instead?

-j

From steven.bethard at gmail.com Fri Jun 1 18:24:12 2007
From: steven.bethard at gmail.com (Steven Bethard)
Date: Fri, 1 Jun 2007 10:24:12 -0600
Subject: [Python-3000] map, filter, reduce
In-Reply-To: 
References: 
Message-ID: 

On 6/1/07, Jason Orendorff wrote:
> PEP 3100 still isn't clear on the fate of these guys, except that reduce() is gone.
>
> How about moving all three to the functools module instead?

The itertools module already has imap() and ifilter(). They can just be renamed to map() and filter() and left where they are.

STeVe
-- 
I'm not *in*-sane. Indeed, I am so far *out* of sane that you appear a tiny blip on the distant coast of sanity.
--- Bucky Katt, Get Fuzzy

From tjreedy at udel.edu Fri Jun 1 19:12:00 2007
From: tjreedy at udel.edu (Terry Reedy)
Date: Fri, 1 Jun 2007 13:12:00 -0400
Subject: [Python-3000] map, filter, reduce
References: 
Message-ID: 

"Jason Orendorff" wrote in message news:bb8868b90706010905p2dae12b7qc538cf25190c7127 at mail.gmail.com...
| PEP 3100 still isn't clear on the fate of these guys, except that
| reduce() is gone.
|
| How about moving all three to the functools module instead?

The current reduce is broken due to being a mashing together of two versions of the function (one 2 params, the other 3), with the 3-param one having an ill-formed signature (inconsistent parameter order) to allow the mashing that should not have been done. (The ill-formed signature is hard to remember and is responsible for part of some people's dislike of reduce.) I would like a proper 3-param version in functools, but have not written the exact proposal yet since library changes have been put off.

I am also thinking about an ireduce, but need to make sure it cannot be easily done with current itertools.
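(One shape such an ireduce could take -- a sketch only, with an invented signature: a generator that lazily yields each partial reduction in turn:)

import operator

def ireduce(func, start, iterable):
    # Lazily yield the partial reductions; the last value yielded is
    # exactly what the eager reduce() would have returned.
    acc = start
    for item in iterable:
        acc = func(acc, item)
        yield acc

print(list(ireduce(operator.add, 0, range(5))))  # prints [0, 1, 3, 6, 10]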
Terry Jan Reedy From janssen at parc.com Fri Jun 1 19:51:17 2007 From: janssen at parc.com (Bill Janssen) Date: Fri, 1 Jun 2007 10:51:17 PDT Subject: [Python-3000] Lines breaking In-Reply-To: <873b1c287v.fsf@uwakimon.sk.tsukuba.ac.jp> References: <465B814D.2060101@canterbury.ac.nz> <465BB994.9050309@v.loewis.de> <465CD016.7050002@canterbury.ac.nz> <87ps4j0zi6.fsf@uwakimon.sk.tsukuba.ac.jp> <465E37C3.9070407@canterbury.ac.nz> <07May31.074957pdt."57996"@synergy1.parc.xerox.com> <873b1c287v.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <07Jun1.105118pdt."57996"@synergy1.parc.xerox.com> > I agree that that looks nice in my editor, but it is not Unicode- > conforming practice, and I suspect that if you experiment with any > printer you'll discover that you get an empty line at the top of the > page. This seems to me to be a non-issue; most "text" files are actually data files (think about it), and were never intended to be printed. > I also suspect that any program that currently is used to process > those files' content by lines probably simply treats the FF as > whitespace, and throws away empty lines. Nope. At least, my program doesn't. And I don't think it's an appropriate assumption, either. Many programs are written to ignore empty lines in their input, but many, maybe more, are not. Blank lines convey critical information in many contexts. > If so, it will still work > with FF treated as a hard line break in line-processing mode, since > the trailing NL will now generate a (superfluous) empty line. Nope. The line-breaking is actually used (and this is common in data represented as text files) as part of the parsing process, so by turning it into two lines you've broken the program logic. Bill From janssen at parc.com Fri Jun 1 20:14:32 2007 From: janssen at parc.com (Bill Janssen) Date: Fri, 1 Jun 2007 11:14:32 PDT Subject: [Python-3000] Lines breaking In-Reply-To: <87y7j4z7at.fsf@uwakimon.sk.tsukuba.ac.jp> References: <465B814D.2060101@canterbury.ac.nz> <465BB994.9050309@v.loewis.de> <465CD016.7050002@canterbury.ac.nz> <87ps4j0zi6.fsf@uwakimon.sk.tsukuba.ac.jp> <465E37C3.9070407@canterbury.ac.nz> <8764691mpq.fsf@uwakimon.sk.tsukuba.ac.jp> <465F59E9.4030702@canterbury.ac.nz> <87y7j4z7at.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <07Jun1.111434pdt."57996"@synergy1.parc.xerox.com> > The *only* thing that adoption of the Unicode recommendation for line > breaking changes is that "\x0c\n" is now two empty lines with well- > defined semantics instead of some number of lines with you-won't-know- > until-you-ask-the-implementation semantics. Well, that's just the way text is. > The ASCII standard, at least as codified in ISO 646, agrees with > Unicode, by referring to ECMA-48/ISO-6249 for the definition of the 32 > C0 characters. I suspect that the ANSI standard semantics of FF and > VT haven't changed since ANSI_X3.4-1963. > > You just object to adopting a standard, period, because it might force > you to change your practices. That's reasonable, changing working > software is expensive. But interoperability is an important goal too. Where, specifically, are the breakdowns in interoperability manifesting themselves? I'm sort of amazed at the turn of this argument. Greg is arguing that it might be arbitrarily expensive to make this change, because of the way that text is used to store data by many programs, and because it's been the way it's been for 15 years of Python history. So the cost of "changing working software" could run into billions; we have no way to know. 
But Stephen is arguing that we need to do it anyway to conform to the dictates of some post-facto standards committee (yes, I know, I usually *like* that argument :-). Yesterday at Google Developer's Day, Alex Martelli told me that Python is about pragmatics; I think I know which side the pragmatic comes down on in this case.

How about a subtype of File which supports this behavior?

Bill

From collinw at gmail.com Fri Jun 1 21:08:32 2007
From: collinw at gmail.com (Collin Winter)
Date: Fri, 1 Jun 2007 12:08:32 -0700
Subject: [Python-3000] map, filter, reduce
In-Reply-To: 
References: 
Message-ID: <43aa6ff70706011208x7db0393ewbd84d874eed9a333@mail.gmail.com>

On 6/1/07, Jason Orendorff wrote:
> PEP 3100 still isn't clear on the fate of these guys, except that reduce() is gone.

I'm not sure what isn't clear: reduce() is listed as "to be removed", and since map() and filter() aren't mentioned as "to be removed", they're presumably not going to be removed. What's tripping you up?

Collin Winter

From g.brandl at gmx.net Fri Jun 1 21:17:35 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Fri, 01 Jun 2007 21:17:35 +0200
Subject: [Python-3000] map, filter, reduce
In-Reply-To: 
References: 
Message-ID: 

Terry Reedy schrieb:
> "Jason Orendorff" wrote in message news:bb8868b90706010905p2dae12b7qc538cf25190c7127 at mail.gmail.com...
> | PEP 3100 still isn't clear on the fate of these guys, except that
> | reduce() is gone.
> |
> | How about moving all three to the functools module instead?
>
> The current reduce is broken due to being a mashing together of two versions of the function (one 2 params, the other 3), with the 3-param one having an ill-formed signature (inconsistent parameter order) to allow the mashing that should not have been done. (The ill-formed signature is hard to remember and is responsible for part of some people's dislike of reduce.) I would like a proper 3-param version in functools, but have not written the exact proposal yet since library changes have been put off.
>
> I am also thinking about an ireduce, but need to make sure it cannot be easily done with current itertools.

How should an "ireduce" work? The result is not a sequence which could be returned lazily.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.

From g.brandl at gmx.net Fri Jun 1 22:14:38 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Fri, 01 Jun 2007 22:14:38 +0200
Subject: [Python-3000] Error in PEP 3115?
Message-ID: 

In PEP 3115 (the new metaclasses PEP), there is an example metaclass:

    # The metaclass
    class OrderedClass(type):

        # The prepare function
        @classmethod
        def __prepare__(metacls, name, bases):  # No keywords in this case
            return member_table()

        # The metaclass invocation
        def __init__(self, name, bases, classdict):
            # Note that we replace the classdict with a regular
            # dict before passing it to the superclass, so that we
            # don't continue to record member names after the class
            # has been created.
            result = type(name, bases, dict(classdict))
            result.member_names = classdict.member_names
            return result

Shouldn't __init__ be __new__? Also, if type(...) and not type.__new__(self, ...) is called, the type of a class using this metaclass will be type, not OrderedClass, but this may be intended.
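(Spelled out, the fix Georg is pointing at would look something like the sketch below, which combines both corrections; member_table here is the ordered-dict helper from the PEP, restated so the example is self-contained. This only runs under the py3k __prepare__ protocol:)

class member_table(dict):
    # Minimal stand-in for the PEP's table that records definition order.
    def __init__(self):
        dict.__init__(self)
        self.member_names = []

    def __setitem__(self, key, value):
        if key not in self:
            self.member_names.append(key)
        dict.__setitem__(self, key, value)

class OrderedClass(type):

    @classmethod
    def __prepare__(metacls, name, bases):
        return member_table()

    def __new__(metacls, name, bases, classdict):
        # type.__new__ (not a plain type(...) call), so the created
        # class really is an OrderedClass.
        result = type.__new__(metacls, name, bases, dict(classdict))
        result.member_names = classdict.member_names
        return result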
Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.

From jason.orendorff at gmail.com Fri Jun 1 22:55:56 2007
From: jason.orendorff at gmail.com (Jason Orendorff)
Date: Fri, 1 Jun 2007 16:55:56 -0400
Subject: [Python-3000] map, filter, reduce
In-Reply-To: <43aa6ff70706011208x7db0393ewbd84d874eed9a333@mail.gmail.com>
References: <43aa6ff70706011208x7db0393ewbd84d874eed9a333@mail.gmail.com>
Message-ID: 

On 6/1/07, Collin Winter wrote:
> On 6/1/07, Jason Orendorff wrote:
> > PEP 3100 still isn't clear on the fate of these guys, except that reduce() is gone.
>
> I'm not sure what isn't clear: reduce() is listed as "to be removed", and since map() and filter() aren't mentioned as "to be removed", they're presumably not going to be removed. What's tripping you up?

"I think these features should be cut from Python 3000. [...] I think dropping filter() and map() is pretty uncontroversial [...]."
http://www.artima.com/weblogs/viewpost.jsp?thread=98196

I know it was two years ago, and just a blog post for crying out loud, but apparently it was pretty traumatic for some people, because I still hear people whinge about it. I would like to have an authoritative document to point those people toward. Perhaps PEP 3099?

-j

From tjreedy at udel.edu Fri Jun 1 23:44:48 2007
From: tjreedy at udel.edu (Terry Reedy)
Date: Fri, 1 Jun 2007 17:44:48 -0400
Subject: [Python-3000] map, filter, reduce
References: 
Message-ID: 

"Georg Brandl" wrote in message news:f3prce$9c$1 at sea.gmane.org...
| How should an "ireduce" work? The result is not a sequence which could be
| returned lazily.

It would generate the sequence of partial reductions (potentially indefinitely).

    list(ireduce(summer, 0, range(5))) = [0, 1, 3, 6, 10]

This is obviously *not* the same as a reduce() which only returns the final value without the intermediate values.

Terry Jan Reedy

From alexandre at peadrop.com Sat Jun 2 00:57:41 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Fri, 1 Jun 2007 18:57:41 -0400
Subject: [Python-3000] Handling of wide Unicode characters
Message-ID: 

Hi,

I was doing some testing on the new _string_io module, since I was slightly skeptical on my handling of wide Unicode characters (32-bit of length, instead of the usual 16-bit in UTF-16). So, I ran this little test:

>>> s = _string_io.StringIO()
>>> s.write(u'晉')
>>> s.tell()
2

Like I expected, wide Unicode characters count for two. However, I was surprised that Python treats them as two characters as well:

>>> len(u'晉')
2
>>> u'晉'
u'\ud87e\udccd'

Is it a bug, or only an implementation choice?

Cheers,
-- 
Alexandre

From jcarlson at uci.edu Sat Jun 2 01:11:35 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Fri, 01 Jun 2007 16:11:35 -0700
Subject: [Python-3000] Handling of wide Unicode characters
In-Reply-To: 
References: 
Message-ID: <20070601160728.6EF5.JCARLSON@uci.edu>

"Alexandre Vassalotti" wrote:
> Hi,
>
> I was doing some testing on the new _string_io module, since I was slightly skeptical on my handling of wide Unicode characters (32-bit of length, instead of the usual 16-bit in UTF-16). So, I ran this little test:
>
> >>> s = _string_io.StringIO()
> >>> s.write(u'晉')
> >>> s.tell()
> 2
>
> Like I expected, wide Unicode characters count for two.
> However, I was surprised that Python treats them as two characters as well:
>
> >>> len(u'晉')
> 2
> >>> u'晉'
> u'\ud87e\udccd'
>
> Is it a bug, or only an implementation choice?

If your Python is compiled as a UTF-16 build, then any character in the extended plane will be seen as two characters by Python. If you are using a UCS-4 build (it's the same as UTF-32), then you should be seeing the single wide character as a single wide character. The only exception to this rule is if you enter the wide character as a surrogate pair, in which case Python doesn't normalize it into the single wide character. To get a real wide character, you would need to use a proper escape, or decode from an encoded string.

- Josiah

_______________________________________________
Python-3000 mailing list
Python-3000 at python.org
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org

From guido at python.org Sat Jun 2 01:11:29 2007
From: guido at python.org (Guido van Rossum)
Date: Sat, 2 Jun 2007 07:11:29 +0800
Subject: [Python-3000] map, filter, reduce
In-Reply-To: 
References: 
Message-ID: 

I see no benefit in ireduce(), just more ways to write obfuscated code.

Regarding map() and filter(), I don't see what's unclear about PEP 3100:

"""
* Make built-ins return an iterator where appropriate (e.g. ``range()``, ``zip()``, ``map()``, ``filter()``, etc.) [zip and range: done]
"""

--Guido

On 6/2/07, Terry Reedy wrote:
>
> "Georg Brandl" wrote in message news:f3prce$9c$1 at sea.gmane.org...
> | How should an "ireduce" work? The result is not a sequence which could be
> | returned lazily.
>
> It would generate the sequence of partial reductions (potentially indefinitely).
>
>     list(ireduce(summer, 0, range(5))) = [0, 1, 3, 6, 10]
>
> This is obviously *not* the same as a reduce() which only returns the final value without the intermediate values.
>
> Terry Jan Reedy
>
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org
>

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org Sat Jun 2 01:18:47 2007
From: guido at python.org (Guido van Rossum)
Date: Sat, 2 Jun 2007 07:18:47 +0800
Subject: [Python-3000] Error in PEP 3115?
In-Reply-To: 
References: 
Message-ID: 

You're right. Fixed now. I also fixed dict.setitem (should be dict.__setitem__). Thanks for noticing!

--Guido

On 6/2/07, Georg Brandl wrote:
> In PEP 3115 (the new metaclasses PEP), there is an example metaclass:
>
>     # The metaclass
>     class OrderedClass(type):
>
>         # The prepare function
>         @classmethod
>         def __prepare__(metacls, name, bases):  # No keywords in this case
>             return member_table()
>
>         # The metaclass invocation
>         def __init__(self, name, bases, classdict):
>             # Note that we replace the classdict with a regular
>             # dict before passing it to the superclass, so that we
>             # don't continue to record member names after the class
>             # has been created.
>             result = type(name, bases, dict(classdict))
>             result.member_names = classdict.member_names
>             return result
>
> Shouldn't __init__ be __new__? Also, if type(...) and not type.__new__(self, ...) is called, the type of a class using this metaclass will be type, not OrderedClass, but this may be intended.
>
> Georg
>
> -- 
> Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
> Four shall be the number of spaces thou shalt indent, and the number of thy
> indenting shall be four. Eight shalt thou not indent, nor either indent thou
> two, excepting that thou then proceed to four. Tabs are right out.
>
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org
>

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org Sat Jun 2 01:24:57 2007
From: guido at python.org (Guido van Rossum)
Date: Sat, 2 Jun 2007 07:24:57 +0800
Subject: [Python-3000] Handling of wide Unicode characters
In-Reply-To: <20070601160728.6EF5.JCARLSON@uci.edu>
References: <20070601160728.6EF5.JCARLSON@uci.edu>
Message-ID: 

What he said. IOW, we're treating each half of a surrogate as a "character", at least for purposes of counting items in a string. (Otherwise operations like len() and indexing/slicing would no longer be O(1).)

--Guido

On 6/2/07, Josiah Carlson wrote:
>
> "Alexandre Vassalotti" wrote:
> > Hi,
> >
> > I was doing some testing on the new _string_io module, since I was slightly skeptical on my handling of wide Unicode characters (32-bit of length, instead of the usual 16-bit in UTF-16). So, I ran this little test:
> >
> > >>> s = _string_io.StringIO()
> > >>> s.write(u'晉')
> > >>> s.tell()
> > 2
> >
> > Like I expected, wide Unicode characters count for two. However, I was surprised that Python treats them as two characters as well:
> >
> > >>> len(u'晉')
> > 2
> > >>> u'晉'
> > u'\ud87e\udccd'
> >
> > Is it a bug, or only an implementation choice?
>
> If your Python is compiled as a UTF-16 build, then any character in the extended plane will be seen as two characters by Python. If you are using a UCS-4 build (it's the same as UTF-32), then you should be seeing the single wide character as a single wide character. The only exception to this rule is if you enter the wide character as a surrogate pair, in which case Python doesn't normalize it into the single wide character. To get a real wide character, you would need to use a proper escape, or decode from an encoded string.
>
> - Josiah
>
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org
>

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From alexandre at peadrop.com Sat Jun 2 01:49:01 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Fri, 1 Jun 2007 19:49:01 -0400
Subject: [Python-3000] Handling of wide Unicode characters
In-Reply-To: <20070601160728.6EF5.JCARLSON@uci.edu>
References: <20070601160728.6EF5.JCARLSON@uci.edu>
Message-ID: 

Thanks for the explanation. Anyway, it's certainly much simpler to deal with surrogate pairs than with variable-width characters.

On 6/1/07, Josiah Carlson wrote:
>
> "Alexandre Vassalotti" wrote:
> > Hi,
> >
> > I was doing some testing on the new _string_io module, since I was slightly skeptical on my handling of wide Unicode characters (32-bit of length, instead of the usual 16-bit in UTF-16). So, I ran this little test:
So, I ran this > > > little test: > > > > > > >>> s = _string_io.StringIO() > > > >>> s.write(u'????') > > > >>> s.tell() > > > 2 > > > > > > Like I expected, wide Unicode characters count for two. However, I was > > > surprised that Python treats them as two characters as well: > > > > > > >>> len(u'????') > > > 2 > > > >>> u'????' > > > u'\ud87e\udccd' > > > > > > Is it a bug, or only an implementation choice? > > > > If your Python is compiled as a UTF-16 build, then any character in the > > extended plane will be seen as two characters by Python. If you are > > using a UCS-4 build (it's the same as UTF-32), then you should be seeing > > the single wide character as a single wide character. The only > > exception to this rule is if you enter the wide character as a surrogate > > pair, in which case Python doesn't normalize it into the single wide > > character. To get a real wide character, you would need to use a proper > > escape, or decode from an encoded string. > > > > > > - Josiah > > > > > > > -- > Alexandre Vassalotti From stephen at xemacs.org Sat Jun 2 06:03:11 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 2 Jun 2007 13:03:11 +0900 (JST) Subject: [Python-3000] Lines breaking In-Reply-To: <07Jun1.111434pdt."57996"@synergy1.parc.xerox.com> References: Message-ID: <20070602040311.C82AA1A25CB@uwakimon.sk.tsukuba.ac.jp> <465B814D.2060101 at canterbury.ac.nz> <465BB994.9050309 at v.loewis.de> <465CD016.7050002 at canterbury.ac.nz> <87ps4j0zi6.fsf at uwakimon.sk.tsukuba.ac.jp> <465E37C3.9070407 at canterbury.ac.nz> <8764691mpq.fsf at uwakimon.sk.tsukuba.ac.jp> <465F59E9.4030702 at canterbury.ac.nz> <87y7j4z7at.fsf at uwakimon.sk.tsukuba.ac.jp> <07Jun1.111434pdt."57996"@synergy1.parc.xerox.com> X-Mailer: VM 7.17 under 21.5 (beta27) "fiddleheads" (+CVS-20070324) XEmacs Lucid Date: Sat, 02 Jun 2007 13:03:10 +0900 Message-ID: <87tztrypdt.fsf at uwakimon.sk.tsukuba.ac.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Bill Janssen writes: > > The *only* thing that adoption of the Unicode recommendation for line > > breaking changes is that "\x0c\n" is now two empty lines with well- > > defined semantics instead of some number of lines with you-won't-know- > > until-you-ask-the-implementation semantics. > > Well, that's just the way text is. If it were *text*, it wouldn't matter, you say yourself. People would be able to live with an empty first line. The issue arises because you've defined a *formal data format* embedded in text which conflicts with long-established standards for text. Now we have an attempt to define a universal standard for text, which conflicts with your practice. > > You just object to adopting a standard, period, because it might force > > you to change your practices. That's reasonable, changing working > > software is expensive. But interoperability is an important goal too. > > Where, specifically, are the breakdowns in interoperability > manifesting themselves? That's not the point; this is like the logical operations on decimal thing. Adopting a standard in full is reassuring to potential users, who won't complain, they just go away. > I'm sort of amazed at the turn of this argument. Greg is arguing that > it might be arbitrarily expensive to make this change, Which I've acknowledged. But we have no data at all. We're talking about Python 3000, and we know that many programs will require porting effort anyway. How expensive is it? "Arbitrarily" is FUD. 
> But Stephen is arguing that we need to do it anyway to conform to > the dictates of some post-facto standards committee (yes, I know, I > usually *like* that argument :-). And you should. We only *need* to do it if we want to claim Unicode conformance in this area. I think that is desirable; readline functionality is very basic to a text-processing language. > How about a subtype of File which supports this behavior? We're talking about Python 3000, right? If we're going to claim conformance, it should be default. If it's not going to be default, there's no need to talk about it until somebody writes the module and submits it for inclusion in the stdlib. From rauli.ruohonen at gmail.com Sat Jun 2 06:14:21 2007 From: rauli.ruohonen at gmail.com (Rauli Ruohonen) Date: Sat, 2 Jun 2007 07:14:21 +0300 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 5/27/07, Stephen J. Turnbull wrote: >James Y Knight writes: >> a 'pyidchar.txt' file with a list of character ranges, and now that >> pyidchar.txt file is going to have separate sections based on module >> name? Sorry, but are you !@# kidding me?!? > >The scalability issue was raised by Guido, not the ASCII advocates. He did not say that such files or command-line options would be scalable either. They are fine tools for auditing, but not for using finished products. One should provide both auditing tools and ease of use of already audited code. One possibility for providing both: (1) Add a mandatory ASCII-only special comment at the beginning of each module. The comment would continue until the first empty line and would contain only valid directives matching some regular expression. Only whitespace is allowed before the comment. Anything else is a syntax error. (2) Allow directives in the special comment to change encoding and tab/space rules. Also allow them to restrict the identifier character set and the string character set. (3) Defaults: utf-8 encoding, no mixed tabs and spaces, identifier and string content is not restricted. (beyond the restrictions in PEP 3131 etc. which the user can't lift, of course) One could change these in site.py, but the directives in (2) override the defaults, so they can't be used for B&D. (4) Have a command line parameter for restricting the character sets of all modules. Every module must satisfy both this and its own directives simultaneously. A default value for this could be set in site.py, but it must be immutable after first assignment. This way everything "just works" for quick hacks and for naive users who only run code they trust. For real projects it's easy to add a couple of lines in modules to enforce project policy. When you see code that doesn't specify a character set you trust, then you know you may have to be careful. If you don't want to be careful, then you can set the command line parameter default to e.g. ascii in site.py and nothing using non-ascii identifiers will work for you. If you're fine with explicit charset tags but not implicit ones, then you can set the defaults for tagless modules to ascii in site.py. 
Example 1 (the defaults, implicit):

#!/usr/bin/env python
# Real code starts here. This comment is not special and you
# can even usé whátévér cháráctérs yóú wánt tó héré.

Example 2 (the defaults, explicit):

#!/usr/bin/env python
#
# coding: utf-8
# identifier_charset: 0-1fffff
# string_charset: 0-1fffff
# indentation: unmixed

# Real code.

Example 3 (strawman for some Japanese code):

# identifier_charset:
#     0-7f 3b1-3c9 "Hiragana" "Katakana" "CJK Unified Ideographs"
#     "CJK Unified Ideographs Extension A"
#     "CJK Unified Ideographs Extension B"
# The range 3b1-3c9 is lowercase Greek, which is often used in math.

Example 4 (inclusion from a file, similar to import):

# identifier_charset: fooproject.codingstyle.identifier_charset

From gproux+py3000 at gmail.com Sat Jun 2 07:15:31 2007
From: gproux+py3000 at gmail.com (Guillaume Proux)
Date: Sat, 2 Jun 2007 14:15:31 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <19dd68ba0706012215y6af9044bw98a9c7a3119795a6@mail.gmail.com>

On 6/2/07, Rauli Ruohonen wrote:
> (1) Add a mandatory ASCII-only special comment at the beginning of
>     each module. The comment would continue until the first empty
>     line and would contain only valid directives matching some
>     regular expression. Only whitespace is allowed before the
>     comment. Anything else is a syntax error.

Interesting proposal. I really like it indeed. I wonder how people against "magic" will like it. Although there is a precedent with the first line giving the path to the interpreter... but it sounds quite a fair proposal and would solve some of the issues raised here before.

But I wonder if you would see the security issue with some person sending you a diff file that would (among other changes...) do something like this

    1a2
    >

(inserted a blank line) Then you would be in trouble again...

An alternative would be to require any comment before any code that purports to set identifier encoding restrictions etc... to be preceded by a specific character (just like the unix convention with the " ! "). Any comment line starting with this character would be restricted to be ascii only. "%" sounds like a good character for this purpose. Like for example...

#!/usr/bin/env python
#
# %coding: utf-8
# %identifier_charset: 0-1fffff
# %string_charset: 0-1fffff
# %indentation: unmixed

Regards,

Guillaume

From jcarlson at uci.edu Sat Jun 2 09:14:58 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sat, 02 Jun 2007 00:14:58 -0700
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
References: <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> 
Message-ID: <20070601235750.6F01.JCARLSON@uci.edu>

"Rauli Ruohonen" wrote:
> On 5/27/07, Stephen J. Turnbull wrote:
> >James Y Knight writes:
> >> a 'pyidchar.txt' file with a list of character ranges, and now that
> >> pyidchar.txt file is going to have separate sections based on module
> >> name? Sorry, but are you !@# kidding me?!?
> >
> >The scalability issue was raised by Guido, not the ASCII advocates.
>
> He did not say that such files or command-line options would be scalable either. They are fine tools for auditing, but not for using finished products. One should provide both auditing tools and ease of use of already audited code.
> > One possibility for providing both: > > (1) Add a mandatory ASCII-only special comment at the beginning of > each module. The comment would continue until the first empty > line and would contain only valid directives matching some > regular expression. Only whitespace is allowed before the > comment. Anything else is a syntax error. """ If a comment in the first or second line of the Python script matches the regular expression coding[=:]\s*([-\w.]+), this comment is processed as an encoding declaration; the first group of this expression names the encoding of the source code file. """ Your suggestion would unnecessarily change the semantics of the encoding declarations. I would call this gratuitous breakage. > (2) Allow directives in the special comment to change encoding and > tab/space rules. Also allow them to restrict the identifier > character set and the string character set. Sounds like the application of vim settings as a solution to a whole bunch of completely unrelated "problems" in Python (especially with 4 space indents being the "one true way to indent" and the encoding declaration already being established). Please keep your vim out of my Python ;) . > (3) Defaults: utf-8 encoding, no mixed tabs and spaces, identifier > and string content is not restricted. All except for the identifier content is already going to be the default with Python 3.0 . I've never heard a particularly good reason to allow for mixing tabs and spaces, and the current encoding declaration works just fine (except for the whole unicode character thing). And as stated by basically everyone, the only *sane* default is ascii identifiers. Since the vast majority of users will have no use for unicode identifiers in the short or long term, making them the default is overzealous at best. > (4) Have a command line parameter for restricting the character sets > of all modules. Every module must satisfy both this and its own > directives simultaneously. A default value for this could be set > in site.py, but it must be immutable after first assignment. > Example 3 (inclusion from a file, similar to import): > > # identifier_charset: fooproject.codingstyle.identifier_charset I really don't like the idea of adding a *different* import-like thing. We already have imports (that are evaluated at run time, not compile time), and due to their semantics, can't use a mechanism like the above. Obviously I'm overall -1 . I don't see this as a good solution to the character set problem. And I think its a step back regarding encodings, indentation, etc. - Josiah From g.brandl at gmx.net Sat Jun 2 09:24:02 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Sat, 02 Jun 2007 09:24:02 +0200 Subject: [Python-3000] map, filter, reduce In-Reply-To: <4660B69C.5010806@canterbury.ac.nz> References: <4660B69C.5010806@canterbury.ac.nz> Message-ID: Greg Ewing schrieb: > Terry Reedy wrote: >> It would generate the sequence of partial reductions (potentially >> indefinately). >> list(ireduce(summer, 0, range(5)) = [0, 1, 3, 6, 10] >> >> This is obviously *not* the same as a reduce() which only returns the final >> value without the intermediate values. > > It's sufficiently different that I think calling it > 'ireduce' would just be confusing. > > It's more like a 'running_reduce' or something. ISTM that this application is even more suited for a plain old `for` loop. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. 
Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.

From talin at acm.org Sat Jun 2 09:38:40 2007
From: talin at acm.org (Talin)
Date: Sat, 02 Jun 2007 00:38:40 -0700
Subject: [Python-3000] Updating PEP 3101
In-Reply-To: <465E6D13.2030606@acm.org>
References: <465E6D13.2030606@acm.org>
Message-ID: <46611E80.5010002@acm.org>

Some more thoughts on this, and some questions.

PEP 3101 defines two layers of APIs for string formatting: a low-level formatting engine, and a high-level set of convenience methods (primarily str.format).

Both layers have grown complex due to the desire to satisfy feature requests from various folks. What I would like to do is move the design back to a more OOWTDI style.

The way I propose to do this is to redesign the low-level engine as a class, called Formatter, with overridable methods. To support the high-level API, there will be a single, built-in global singleton instance of Formatter. Calls to str.format will simply be routed to this singleton instance. So for example, when you call:

    "The value is {0}".format(1)

This will call:

    builtin_formatter.format("The value is {0}", 1)

I'm not sure that it makes any sense to allow the built-in formatter instance to be replaceable or mutable, since that would cause all string formatting behavior to change. Also, there's no way to negotiate conflicts between various library modules that might want different behavior. Fortunately, the base formatter has no state, so all we have to worry about is preventing it from being replaced.

Rather, I think it makes more sense to allow people to create their own Formatter instances and use them directly. This does mean, however, that people who want to use their own custom Formatter instance won't be able to use the high-level convenience methods.

The Formatter class has at least three overridable methods:

1) The method that parses a format string into constant characters and replacement fields.
2) A method that retrieves a field value given a field name or index.
3) A method that formats an individual replacement field, given a value and a conversion specifier string.

So:

-- If you want a different syntax for format strings, you override method #1. This satisfies the feature requests of people who wanted variations in the format string syntax.

-- If you want to be able to change the way that field values are accessed, you override #2. This satisfies the desire of people who want to have it automatically access locals() or globals(). You can do this via passing in those namespaces as a constructor parameter, or if you want to get fancy, you can look at the stack frames and figure it out automatically. The main point is that this functionality won't be built in by default, but it could be a cookbook recipe.

Another reason to override this method is to change the rules for tracking what field names are legal. The built-in method does not allow fields beginning with an underscore to be used as attributes, i.e. you cannot say "{0._index}" as a format string. If you override the field value method, however, you can change this behavior. Similarly, if you want to add/remove functionality to insure that all positional arguments are used, or change the way errors are handled, you can do that here as well.

-- If you want to change the way that built-in types are converted to string form, you override #3.
(For non-builtin types you can just add a __format__ special method to the type.)

The main point is, however, that none of these overrides affect the behavior of the built-in string.format function.

Now, in the current version of the PEP, all of the things that I just mentioned can be changed on a per-call basis by passing in specially-named parameters, i.e.:

    "The name is {0._index}".format(1, flags=ALLOW_LEADING_UNDERSCORES)

I'm proposing to eliminate all of that extra flexibility, and instead say that if you want to be able to do that, use a custom formatter class, but without the syntactical convenience of str.format.

So my first question is to get a sense of how many people would find that agreeable. In other words, is it reasonable to require people to give up the syntactical convenience of "string".format() when they want to do custom formatting?

My second question deals with implementation. Because 'str' is a built-in type, all of its methods must be built-in as well, and therefore implemented in C. If 'str' depends on a built-in formatter singleton instance, that singleton instance must also be implemented in C, and must be initialized in the Parser before any calls to str.format. Since I am not an expert in the internals of the Python interpreter C code, I would ask how feasible is this?

-- Talin

From eric+python-dev at trueblade.com Sat Jun 2 13:46:06 2007
From: eric+python-dev at trueblade.com (Eric V. Smith)
Date: Sat, 02 Jun 2007 07:46:06 -0400
Subject: [Python-3000] Updating PEP 3101
In-Reply-To: <46611E80.5010002@acm.org>
References: <465E6D13.2030606@acm.org> <46611E80.5010002@acm.org>
Message-ID: <4661587E.3090807@trueblade.com>

Talin wrote:
> Some more thoughts on this, and some questions.
>
> PEP 3101 defines two layers of APIs for string formatting: a low-level formatting engine, and a high-level set of convenience methods (primarily str.format).
>
> Both layers have grown complex due to the desire to satisfy feature requests from various folks. What I would like to do is move the design back to a more OOWTDI style.
>
> The way I propose to do this is to redesign the low-level engine as a class, called Formatter, with overridable methods.

I think this is a good idea, in order to keep "str".format() really simple, and thereby increase its usage.

> To support the high-level API, there will be a single, built-in global singleton instance of Formatter. Calls to str.format will simply be routed to this singleton instance.

I'm not so sure this is actually required, see below.

> So my first question is to get a sense of how many people would find that agreeable. In other words, is it reasonable to require people to give up the syntactical convenience of "string".format() when they want to do custom formatting?

I like keeping "str".format() simple, because I see its main use as a slightly more flexible version of '%' for strings. I hope its usage will be ubiquitous. It's especially handy for i18n.

> My second question deals with implementation. Because 'str' is a built-in type, all of its methods must be built-in as well, and therefore implemented in C. If 'str' depends on a built-in formatter singleton instance, that singleton instance must also be implemented in C, and must be initialized in the Parser before any calls to str.format.

I don't think there actually needs to be a built-in singleton formatter, but it just needs to appear "as-if" there is one. Not having a singleton makes hiding the singleton a non-issue.
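(To picture the division of labor under discussion, here is a rough, illustrative Python sketch of a Formatter with the three hooks Talin lists, plus the namespace-lookup override he mentions as a cookbook recipe. The method names, the toy field syntax, and all details are invented here, not the PEP's final API:)

import re

class Formatter(object):
    # Hook 1: parsing. Toy syntax: fields look like {name}, with no
    # attribute access and no conversion specifiers.
    field_pattern = re.compile(r'{([^}]*)}')

    def format(self, format_string, *args, **kwargs):
        result = []
        pos = 0
        for match in self.field_pattern.finditer(format_string):
            result.append(format_string[pos:match.start()])
            value = self.get_value(match.group(1), args, kwargs)  # hook 2
            result.append(self.format_field(value))               # hook 3
            pos = match.end()
        result.append(format_string[pos:])
        return ''.join(result)

    def get_value(self, name, args, kwargs):
        if name.isdigit():
            return args[int(name)]
        return kwargs[name]

    def format_field(self, value):
        return str(value)

class NamespaceFormatter(Formatter):
    # The "cookbook recipe" variant: unknown names fall back to a
    # caller-supplied namespace such as locals() or globals().
    def __init__(self, namespace):
        self.namespace = namespace

    def get_value(self, name, args, kwargs):
        if not name.isdigit() and name not in kwargs and name in self.namespace:
            return self.namespace[name]
        return Formatter.get_value(self, name, args, kwargs)

greeting = 'hello'
f = NamespaceFormatter(globals())
print(f.format("{0}: {greeting}", 42))  # prints "42: hello"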
It also simplifies the C code. str.format could be implemented in C, using the existing code in the sandbox implementation. The Formatter class could be written in C or Python, and would call some of the existing code in the sandbox implementation, or refactored versions if we need to expose anything else (which I don't think we do). The only real work we'd need to do to the sandbox code is to strip out some of the code that implements the additional options (such as multiple syntaxes) and hook up str.format. Once we reach a consensus, I'm ready to put some time into this. Then we'd have to implement Formatter, of course. But it shouldn't be too hard.

One comment I'd like to make on your prior email is that I'd like to see this implemented in 2.6. To my knowledge, we're not removing any functionality in 3.0 that will be replaced by str.format, so I can't argue that it will make it easier to have code that runs in both 2.6 and 3.0. But it seems to me that the fewer new features that exist only in 3.0, the easier it will be to wrap your head around 3.0.

Eric.

From talin at acm.org  Sat Jun  2 17:19:08 2007
From: talin at acm.org (Talin)
Date: Sat, 02 Jun 2007 08:19:08 -0700
Subject: [Python-3000] Updating PEP 3101
In-Reply-To: <4661583B.5020708@trueblade.com>
References: <465E6D13.2030606@acm.org> <46611E80.5010002@acm.org> <4661583B.5020708@trueblade.com>
Message-ID: <46618A6C.60704@acm.org>

Eric V. Smith wrote:
> One comment I'd like to make on your prior email is that I'd like to see
> this implemented in 2.6. To my knowledge, we're not removing any
> functionality in 3.0 that will be replaced by str.format, so I can't
> argue that it will make it easier to have code that runs in both 2.6
> and 3.0. But it seems to me that the fewer new features that exist only
> in 3.0, the easier it will be to wrap your head around 3.0.

I think that supporting it in 2.6 is fine. Now, the PEP in your sandbox also talks about making an external module for versions earlier than 2.6. My feeling on that is if someone wants to do that, fine, but it doesn't need to be part of the PEP.

-- Talin

From rauli.ruohonen at gmail.com  Sat Jun  2 18:19:14 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Sat, 2 Jun 2007 19:19:14 +0300
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070601235750.6F01.JCARLSON@uci.edu>
References: <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <20070601235750.6F01.JCARLSON@uci.edu>
Message-ID:

On 6/2/07, Josiah Carlson wrote:
> """
> If a comment in the first or second line of the Python script matches
> the regular expression coding[=:]\s*([-\w.]+), this comment is processed
> as an encoding declaration; the first group of this expression names the
> encoding of the source code file.
> """
>
> Your suggestion would unnecessarily change the semantics of the encoding
> declarations. I would call this gratuitous breakage.

Depending on what the regular expression for the declarations is, the difference may not be big. Current code can also reliably be converted with an automated tool, so this isn't a big deal for py3k.

It may be that the change is unnecessary. Reading Guido's writings, he seems to be of the opinion that the Java way (no restrictions at all) is right here, and anything else can be delegated to pylint and similar tools.
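(As a side note on the mechanics: everything a conforming tool has to do with the declaration quoted above amounts to something like the following sketch -- the helper name is mine, and only the standard library is involved:)

    import re

    CODING_RE = re.compile(r'coding[=:]\s*([-\w.]+)')

    def source_encoding(first_line, second_line):
        # Per the spec, only the first two lines are scanned.
        for line in (first_line, second_line):
            match = CODING_RE.search(line)
            if match:
                return match.group(1)
        return None  # no declaration; the default source encoding applies

    # source_encoding('# -*- coding: utf-8 -*-', '') == 'utf-8'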
> Sounds like the application of vim settings as a solution to a whole
> bunch of completely unrelated "problems" in Python (especially with 4
> space indents being the "one true way to indent" and the encoding
> declaration already being established). Please keep your vim out of my
> Python ;) .

The encoding declaration stays mostly the same; I'm just suggesting adding similar declarations for the identifier/string character sets and making them deception-proof. You're probably right about the indentation stuff. If you got rid of all indentation-related options and simply forbade mixing tabs and spaces, I'd just say good riddance.

> And as stated by basically everyone, the only *sane* default is ascii
> identifiers. Since the vast majority of users will have no use for
> unicode identifiers in the short or long term, making them the default
> is overzealous at best.

"Basically everyone" is not true, because it does not include Guido, who matters the most. Some quotes from his latest posts on the topic:

Guido van Rossum (May 25):
:I still think such a command-line switch (or switches) is the wrong
:approach. What if I have *one* module that uses Cyrillic legitimately?
:A command-line switch would enable Cyrillic in *all* modules.

Guido van Rossum (May 25):
:On 5/24/07, Josiah Carlson wrote:
:> Where else in Python have we made the default
:> behavior only desired or useful to 5% of our users?
:
:Where are you getting that statistic? This seems an extremely
:backwards, US-centric worldview.

Guido van Rossum (May 25):
:A more useful approach would seem to be a set of auditing tools that
:can be applied routinely to all new contributions (e.g. as a
:pre-commit hook when using a source control system), or to all code in
:a given directory, download, etc. I don't see this as all that
:different from using e.g. PyChecker or PyLint.
:
:While I routinely perform visual code inspections [...], I certainly don't see
:this as a security audit [...]. Scanning for stray non-ASCII characters is best
:left to automated tools.

Guido van Rossum (May 23):
:In particular very helpful was a couple of reports from the Java
:world, where Unicode letters in identifiers have been legal for a long
:time now. (JavaScript also supports this BTW.) The Java world has not
:fallen apart,

Guido van Rossum (May 17):
:As I mentioned before, I don't expect either of these will be much of
:a concern. I guess tools like pylint could optionally warn if
:non-ascii characters are used.
:
:On 5/16/07, Jim Jewett wrote:
:> (1) Security concerns.
:> (2) Obscure bugs.

Summary of what I think Guido's saying (involves some interpretation):

- always having no restrictions (the Java way) is not a problem in practice
- because having no restrictions has worked well with Java, Python should follow
- any concerns can be adequately dealt with solely by external tools
- command line switches are a bad implementation of restriction management

It is the last one of these that I was addressing, as there was some demand for restriction management (despite Guido's leave-it-to-pylint stance) but no adequate proposal. The defaults are easily changed in any case.

> > # identifier_charset: fooproject.codingstyle.identifier_charset
>
> I really don't like the idea of adding a *different* import-like thing.
> We already have imports (that are evaluated at run time, not compile
> time), and due to their semantics, can't use a mechanism like the above.

I agree that import is problematic.
This part could be omitted with the rationale that it's more trouble than it's worth, and anyone who needs something complicated can use pylint or similar.

In the end, something like this is what you'd have most of the time in practice when you care about character sets:

    # identifier_charset: 0-7f

    # Real code.

When you have a file with Cyrillic, then it'd allow Cyrillic too. For quick hacks you could use this and everything would just work:

    #!/usr/bin/env python

    # Real code.

This isn't really anything more than a countermeasure against Ka-Ping's tricky.py exploit and the addition of a real charset restriction method instead of abusing the coding declaration for that (which would force you to use legacy codings just to restrict the charsets, as pointed out a lot earlier here).

One more thing which might be removed from the suggestion is the command line option and its associated site.py default. Such checking is more appropriate for pylint, and is probably of little use anyway. Either you trust the files you're importing, in which case the characters they use make no difference, or you don't, in which case you shouldn't be importing them at all and checking their character sets will not help you at all. For audit purposes the comment directives are enough as they can't deceive, and if you want to be extra paranoid you can use pylint to catch any surreptitious patches like in Guillaume's post.

From jcarlson at uci.edu  Sat Jun  2 19:48:49 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sat, 02 Jun 2007 10:48:49 -0700
Subject: [Python-3000] Support for PEP 3131
In-Reply-To:
References: <20070601235750.6F01.JCARLSON@uci.edu>
Message-ID: <20070602095920.6F04.JCARLSON@uci.edu>

"Rauli Ruohonen" wrote:
> On 6/2/07, Josiah Carlson wrote:
> > """
> > If a comment in the first or second line of the Python script matches
> > the regular expression coding[=:]\s*([-\w.]+), this comment is processed
> > as an encoding declaration; the first group of this expression names the
> > encoding of the source code file.
> > """
> >
> > Your suggestion would unnecessarily change the semantics of the encoding
> > declarations. I would call this gratuitous breakage.
>
> Depending on what the regular expression for the declarations is, the
> difference may not be big. Current code can also reliably be converted
> with an automated tool, so this isn't a big deal for py3k.

Whether or not there exists a tool to convert from Python 2.6 to Python 3.0 (2to3), every tool that currently handles Python source code encodings via the method specified in the documentation (just about every Python-centric editor I know) would need to be changed. Further, not all code will be passed through the 2.6 to 3.0 converter, as the tool is meant as a sort of "I don't want to go through all the trouble of converting yet, but I want to support Python 3.0". And even if it *were* all passed through, the output of the converter is not meant for future editing and consumption; it is meant as a stopgap. People who really want to support Python 3.0 should be doing the conversion by hand, possibly with guidance from the converter.

> It may be that the change is unnecessary. Reading Guido's writings, he seems
> to be of the opinion that the Java way (no restrictions at all) is
> right here, and anything else can be delegated to pylint and similar tools.

Perhaps, but there is a growing contingent here that is of the opposite opinion.
And even though this contingent is of differing opinions on whether unicode identifiers should even be allowed, we all agree that if they are allowed, they shouldn't be the default.

> > Sounds like the application of vim settings as a solution to a whole
> > bunch of completely unrelated "problems" in Python (especially with 4
> > space indents being the "one true way to indent" and the encoding
> > declaration already being established). Please keep your vim out of my
> > Python ;) .
>
> The encoding declaration stays mostly the same; I'm just suggesting adding
> similar declarations for the identifier/string character sets and making them
> deception-proof. You're probably right about the indentation stuff. If
> you got rid of all indentation-related options and simply forbade mixing
> tabs and spaces, I'd just say good riddance.

Python 2.x has a -t option that warns people about inconsistent tab/space usage. In 3.0, from what I understand, that option is automatically enabled and may result in errors instead of warnings.

> > And as stated by basically everyone, the only *sane* default is ascii
> > identifiers. Since the vast majority of users will have no use for
> > unicode identifiers in the short or long term, making them the default
> > is overzealous at best.
>
> "Basically everyone" is not true, because it does not include Guido, who
> matters the most. Some quotes from his latest posts on the topic:

Guido doesn't always overrule everyone. There is quite a long history of him changing his mind after having seen good reasoning about an issue. Most recently, see the dynamic attribute access thread about the o.{a} syntax.

And when I say "basically everyone", I'm offering everyone who has offered an opinion recently the chance to be in that camp. Please see the writings of Baptiste Carvello, Jim Jewett, Ka-Ping Yee, Steve Howell, Ivan Krstic, and myself. If you want to completely ignore the general consensus that was reached by people on both sides of the issue, that's fine. But pardon me if I ignore you from here on out.

> Guido van Rossum (May 25):
> :I still think such a command-line switch (or switches) is the wrong
> :approach. What if I have *one* module that uses Cyrillic legitimately?
> :A command-line switch would enable Cyrillic in *all* modules.

I'm not personally a really big fan of the command-line argument approach, but that doesn't mean that the only two solutions are in-module with your syntax and command-line. There are other solutions (a global registry of individual modules' allowed identifiers, in-module with a different syntax, etc.). I'm just saying that I don't like *your* solution.

> Guido van Rossum (May 25):
> :On 5/24/07, Josiah Carlson wrote:
> :> Where else in Python have we made the default
> :> behavior only desired or useful to 5% of our users?
> :
> :Where are you getting that statistic? This seems an extremely
> :backwards, US-centric worldview.

You will note that I actually responded to this, as have others. The use of unicode identifiers will be rare, and your pressure to try to make them the default won't change that; but it will confuse the hell out of the large numbers of users who have no use for unicode, and whose tools are not prepared for unicode.

> Guido van Rossum (May 25):
> :A more useful approach would seem to be a set of auditing tools that
> :can be applied routinely to all new contributions (e.g. as a
> :pre-commit hook when using a source control system), or to all code in
> :a given directory, download, etc.
> :I don't see this as all that
> :different from using e.g. PyChecker or PyLint.
> :
> :While I routinely perform visual code inspections [...], I certainly don't see
> :this as a security audit [...]. Scanning for stray non-ASCII characters is best
> :left to automated tools.

Others have also responded to this. Adding a tool to an arbitrarily large or small previously existing toolchain, so that the majority of users can verify that their code doesn't contain characters that shouldn't be allowed in the first place, isn't a very good solution.

> Guido van Rossum (May 23):
> :In particular very helpful was a couple of reports from the Java
> :world, where Unicode letters in identifiers have been legal for a long
> :time now. (JavaScript also supports this BTW.) The Java world has not
> :fallen apart,

And we reported about this. They are rarely used, and the vast majority of code that *does* have unicode identifiers is closed-source. As someone else has asked in this discussion, do we want to encourage open source (with which the only sane identifiers are ascii), or do we want to encourage closed source and the 'ghettoization' of Python source code?

> Guido van Rossum (May 17):
> :As I mentioned before, I don't expect either of these will be much of
> :a concern. I guess tools like pylint could optionally warn if
> :non-ascii characters are used.
> :
> :On 5/16/07, Jim Jewett wrote:
> :> (1) Security concerns.
> :> (2) Obscure bugs.
>
> Summary of what I think Guido's saying (involves some interpretation):
> - always having no restrictions (the Java way) is not a problem in practice
> - because having no restrictions has worked well with Java, Python
>   should follow

Only because it is so rarely used that no one really runs into unicode identifiers. As such, the only sane position is to require the explicit enabling of unicode identifiers. Also please see Nick Coghlan's discussion of *why* this isn't as much an issue with statically typed declarative languages as it is with Python.

> - any concerns can be adequately dealt with solely by external tools

And having to rely on *additional* tools to verify that what the vast majority of users want is actually happening is silly. I'll ask again, because you don't seem to have been paying attention to the messages you cited, but where else in Python has the tiny minority defined the defaults for the vast majority of users?

> - command line switches are a bad implementation of restriction management

That's the only argument that is worth listening to. But command line switches aren't our only option here.

[snip]
> This isn't really anything more than a countermeasure against Ka-Ping's
> tricky.py exploit and the addition of a real charset restriction method instead
> of abusing the coding declaration for that (which would force you to use legacy
> codings just to restrict the charsets, as pointed out a lot earlier here).

Thankfully, no one who has bothered to think for more than a few minutes about this issue has seriously considered using legacy encodings. So it's a non-issue.

> One more thing which might be removed from the suggestion is the command
> line option and its associated site.py default. Such checking is more
> appropriate for pylint, and is probably of little use anyway. Either you
> trust the files you're importing, in which case the characters they use
> make no difference, or you don't, in which case you shouldn't be importing
> them at all and checking their character sets will not help you at all.
> For audit purposes the comment directives are enough as they can't deceive,
> and if you want to be extra paranoid you can use pylint to catch any
> surreptitious patches like in Guillaume's post.

Adding pylint to verify that I don't have characters that shouldn't be allowed in the first place, when Python should tell me *the moment* modules are being compiled, is silly.

Now, you have had the opportunity to go through the hundreds of posts on the matter and compose a message, yet you still don't understand that ascii is the only sane default. Please read posts in the 3131 thread from the authors I list above, and please try to inform yourself on the content of postings from people that are not Guido.

- Josiah

From rauli.ruohonen at gmail.com  Sat Jun  2 22:39:53 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Sat, 2 Jun 2007 23:39:53 +0300
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070602095920.6F04.JCARLSON@uci.edu>
References: <20070601235750.6F01.JCARLSON@uci.edu> <20070602095920.6F04.JCARLSON@uci.edu>
Message-ID:

On 6/2/07, Josiah Carlson wrote:
> Whether or not there exists a tool to convert from Python 2.6 to
> Python 3.0 (2to3), every tool that currently handles Python source
> code encodings via the method specified in the documentation
> (just about every Python-centric editor I know) would need to be
> changed.

How so? The old regexp can still match the encoding tag unless the user insists on using it in an incompatible way. As syntax changes go, this one causes little trouble for editors.

> Guido doesn't always overrule everyone.

Yet he makes the decisions. That's why I used his latest comments on the topic to set the defaults in the suggestion. These are easily changed when necessary, and the whole issue of defaults is quite minor. What matters more is having a convenient way of setting the character set restrictions of a module. The reason I quoted him at such length was that I thought that you might have missed some of his posts because you simply ignored what he had to say (and no, I generally don't remember people's names).

> There are other solutions (a global registry of individual modules'
> allowed identifiers, in-module with a different syntax, etc.).

These are more to the point. Do you have anything concrete? A global registry sounds unwieldy, and most would probably enable everything instead of going through the trouble of using it. What kind of in-module syntax would you use?

> Adding a tool to an arbitrarily large or small previously existing
> toolchain, so that the majority of users can verify that their code
> doesn't contain characters that shouldn't be allowed in the first
> place, isn't a very good solution.

I doubt the majority of users care, so the verifiers would be a minority. You're exaggerating the amount of work caused by Guido's solution. I made my suggestion because in my opinion it or something like it is a more convenient solution for most cases, but Guido's isn't as bad as you make it out to be.

> Only because it is so rarely used that no one really runs into
> unicode identifiers.

It doesn't really matter why they're not a problem in practice, just that they aren't. A non-issue is a non-issue, no matter why.

> As such, the only sane position is to require
> the explicit enabling of unicode identifiers.

Neither default would cause big problems, so there are at least two sane positions. One may be better than the other or they may be equally good; it's hard to say which.
> where else in Python has the tiny minority defined the defaults for
> the vast majority of users?

I'm sure you will find tinier minorities if you search for them, but most users don't use extended slice notation to its full extent, yet it's enabled by default even though it silently accepts a probable typo. Confusing non-ascii characters are also accepted by default in strings, even though only a tiny minority uses those particular characters in strings (I'm sure you've seen the examples).

> yet you still don't understand that ascii is the only sane default.

It is not the default in Java, which is a major language, and I don't hear constant complaints about it having to be changed, so there are quite a few people who think that the above statement is not true for programming languages in general. The claim that static typing makes a big enough difference here is less than convincing.

From martin at v.loewis.de  Sun Jun  3 01:08:01 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 03 Jun 2007 01:08:01 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <43aa6ff70705271741w2b3eefcbj29921e81822d189@mail.gmail.com>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <87r6p540n4.fsf@uwakimon.sk.tsukuba.ac.jp> <87646g3u9q.fsf@uwakimon.sk.tsukuba.ac.jp> <87veee2wj4.fsf@uwakimon.sk.tsukuba.ac.jp> <43aa6ff70705271741w2b3eefcbj29921e81822d189@mail.gmail.com>
Message-ID: <4661F851.4020403@v.loewis.de>

> Sincere question: if these characters aren't needed, why are they
> provided? From what I can tell by googling, they're needed when, e.g.,
> Arabic is embedded in an otherwise left-to-right script. Do I have
> that right?

I think not. In principle, each character has a directionality (available through unicodedata.bidirectional), and a rendering algorithm should be able to detect runs of characters that differ in directionality from the surrounding text, rendering them properly. As a special case, certain characters are declared "neutral", extending the run across, say, spaces. So embedding Arabic in an LTR text *alone* makes no requirement for these control characters.

I'm unsure whether there are cases where the standard BIDI algorithm would produce incorrect results; it's certainly the case that not all tools implement it correctly, so the control characters can help those tools (assuming the tool implements the control character at least).

Regards,
Martin

From jcarlson at uci.edu  Sun Jun  3 02:59:27 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sat, 02 Jun 2007 17:59:27 -0700
Subject: [Python-3000] Support for PEP 3131
In-Reply-To:
References: <20070602095920.6F04.JCARLSON@uci.edu>
Message-ID: <20070602174905.6F13.JCARLSON@uci.edu>

"Rauli Ruohonen" wrote:
>
> On 6/2/07, Josiah Carlson wrote:
> > Whether or not there exists a tool to convert from Python 2.6 to
> > Python 3.0 (2to3), every tool that currently handles Python source
> > code encodings via the method specified in the documentation
> > (just about every Python-centric editor I know) would need to be
> > changed.
>
> How so? The old regexp can still match the encoding tag unless
> the user insists on using it in an incompatible way. As syntax
> changes go, this one causes little trouble for editors.

As per the spec, only the first two lines need to be scanned.
By your change, any editor of Python that wanted to follow the spec (like Vim and Emacs, which helped *define* the spec) would need to scan until comments stopped being found at the beginning of the source file. Further, some editors that don't even understand Python are currently able to handle alternate encodings precisely because there is exactly one true way to define encodings: the way Emacs and Vim have defined it, which Python adopted.

> > Guido doesn't always overrule everyone.
>
> Yet he makes the decisions. That's why I used his latest comments
> on the topic to set the defaults in the suggestion. These are
> easily changed when necessary, and the whole issue of
> defaults is quite minor. What matters more is having a convenient
> way of setting the character set restrictions of a module. The
> reason I quoted him at such length was that I thought that you
> might have missed some of his posts because you simply ignored
> what he had to say (and no, I generally don't remember people's
> names).

Guido last replied before some 30+ messages more or less closed out the discussion, many of which addressed precisely the issues that you quoted as "proof". If you aren't even going to be bothered to read the thread, I'm not going to bother replying to you.

As I said before, and as I'm saying again, read the thread. Until then, you aren't bringing up anything new to the discussion and are just wasting everyone's time.

- Josiah

From jimjjewett at gmail.com  Sun Jun  3 03:31:11 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sat, 2 Jun 2007 21:31:11 -0400
Subject: [Python-3000] Lines breaking
In-Reply-To: <20070602040311.C82AA1A25CB@uwakimon.sk.tsukuba.ac.jp>
References: <20070602040311.C82AA1A25CB@uwakimon.sk.tsukuba.ac.jp>
Message-ID:

On 6/2/07, Stephen J. Turnbull wrote:
> That's not the point; this is like the logical operations on decimal
> thing. Adopting a standard in full is reassuring to potential users,
> who won't complain, they just go away. ...
> We only *need* to do it if we want to claim Unicode conformance in
> this area. I think that is desirable; readline functionality is very
> basic to a text-processing language.

Even then, I don't think we *need* to do it. Unicode generally allows tailoring (so long as you specify), and the entirety of chapter 5 (Implementation Guidelines) is explicitly non-normative.

That said, it might be a sensible change anyhow, particularly if we treat it like the CRLF combination, so that a Form Feed at the end of a line doesn't force splitlines to produce an empty line.

-jJ

From jimjjewett at gmail.com  Sun Jun  3 05:14:38 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sat, 2 Jun 2007 23:14:38 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To:
References: <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <20070601235750.6F01.JCARLSON@uci.edu>
Message-ID:

On 6/2/07, Rauli Ruohonen wrote:
> On 6/2/07, Josiah Carlson wrote:
> > Your suggestion would unnecessarily change the semantics of the encoding
> > declarations. I would call this gratuitous breakage.

> Depending on what the regular expression for the declarations is, the
> difference may not be big.

I suspect that if coding were always still first, and the identifier charset followed it (or were on the same line), that would take care of this objection.

> something like this is what you'd
> have most of the time in practice when you care about character sets:
> # identifier_charset: 0-7f

Why not ASCII?
Why not be more specific, with 0x30-0x39, 0x41-0x5a, 0x5f, 0x61-0x7a?

When adding characters, this isn't such a problem. When restricting them, a standard spelling is more important.

> For quick hacks you could use this and everything would just work:
> #!/usr/bin/env python
>
> # Real code.

> This isn't really anything more than a countermeasure against Ka-Ping's
> tricky.py exploit

uhh... I don't see any charset comment there, so his coding: with a non-ASCII letter in "coding" would still work.

-jJ

From rauli.ruohonen at gmail.com  Sun Jun  3 10:31:30 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Sun, 3 Jun 2007 11:31:30 +0300
Subject: [Python-3000] Support for PEP 3131
In-Reply-To:
References: <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <20070601235750.6F01.JCARLSON@uci.edu>
Message-ID:

On 6/3/07, Jim Jewett wrote:
> On 6/2/07, Rauli Ruohonen wrote:
> > # identifier_charset: 0-7f
>
> Why not ASCII?
> Why not be more specific, with 0x30-0x39, 0x41-0x5a, 0x5f, 0x61-0x7a?
>
> When adding characters, this isn't such a problem. When restricting
> them, a standard spelling is more important.

I followed Stephen Turnbull's convention of only adding additional restrictions to those already provided by PEP 3131. Here 0-7f would block out all non-7-bit characters, and within that range the PEP rule is "Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.5."

> > #!/usr/bin/env python
> >
> > # Real code.
>
> > This isn't really anything more than a countermeasure against Ka-Ping's
> > tricky.py exploit
>
> uhh... I don't see any charset comment there, so his coding: with a
> non-ASCII letter in "coding" would still work.

If it came in the comments before the first empty line, then it would cause a syntax error, because non-ASCII wouldn't be allowed there to prevent such trickery. The "first empty line" rule was there to make the safe area visually clear to the reader.

From stephen at xemacs.org  Sun Jun  3 14:48:38 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sun, 03 Jun 2007 21:48:38 +0900
Subject: [Python-3000] Lines breaking
In-Reply-To:
References: <20070602040311.C82AA1A25CB@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <87lkf1yzix.fsf@uwakimon.sk.tsukuba.ac.jp>

Jim Jewett writes:

> Even then, I don't think we *need* to do it. Unicode generally allows
> tailoring (so long as you specify), and the entirety of chapter 5
> (Implementation Guidelines) is explicitly non-normative.

"Non-normative" in this case means you can claim Unicode conformance without conforming to UAX#14. However, that means we have to deny that we conform to UAX#14. If we want to claim conformance, we have no choice about FORM FEED; the "bk" class is not tailorable and "support" for FORM FEED is not optional. :-( (I don't understand why they did that; to me Bill's example is compelling.)

> That said, it might be a sensible change anyhow, particularly if we
> treat it like the CRLF combination, so that a Form Feed at the end
> of a line doesn't force splitlines to produce an empty line.

I don't think that's conformant, but it might be a good enough compromise to be conformant*, and is Pythonic (ie, similar to CRLF).
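To make the behavior under discussion concrete (Python 2.5, unicode strings; typed from memory, so check before quoting me):

    >>> u'page one\f\npage two\n'.splitlines()
    [u'page one', u'', u'page two']

Treating the FF LF sequence the way CRLF is treated would count it as a single break, giving [u'page one', u'page two'] instead.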
From rauli.ruohonen at gmail.com  Sun Jun  3 15:12:20 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Sun, 3 Jun 2007 16:12:20 +0300
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <46371BD2.7050303@v.loewis.de>
References: <46371BD2.7050303@v.loewis.de>
Message-ID:

(sorry about replying to such an old mail, but I didn't find a better place to put this)

On 5/1/07, "Martin v. Löwis" wrote:
> All identifiers are converted into the normal form NFC while parsing;

Actually, shouldn't the whole file be converted to NFC, instead of only identifiers? If you have decomposable characters in strings and your editor decides to normalize them to a different form than in the original source, the meaning of the code will change when you save without you noticing anything.

It's always better to be explicit when you want to make invisible distinctions. In the rare cases anything but NFC is really needed you can do explicit conversion or use escapes. Having to add normalization calls around all unicode strings to code defensively is neither convenient nor obvious.

From turnbull at sk.tsukuba.ac.jp  Sun Jun  3 15:42:23 2007
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Sun, 03 Jun 2007 22:42:23 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To:
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>

Rauli Ruohonen writes:

> He did not say that such files or command-line options would be
> scalable either. They are fine tools for auditing, but not for using
> finished products. One should provide both auditing tools and ease
> of use of already audited code.

Ease of use of audited code is trivial; turn the checks off. The question is how to do that.

> (1) Add a mandatory ASCII-only special comment at the beginning of
>     each module. The comment would continue until the first empty
>     line and would contain only valid directives matching some
>     regular expression. Only whitespace is allowed before the
>     comment. Anything else is a syntax error.

-1

You still need command-line options or local configuration files to decide *what* to audit. We *don't* trust the file! Just because it audits to having the character sets it claims doesn't mean it doesn't use constructs we want to prohibit. Merely to define those is non-trivial, and it is absolutely out of the question to expect that the average Python user will know what the character set "strictly-conforms-to-UTR39-restrictions-allows-confusables" is. So those character sets are basically meaningless for ease of use; ease of use is "globally restrict to what my students can read = ASCII + Japanese".

Now, the same code that would be needed to audit the declarations you propose could easily be generalized to *generate* them. Once you've got that, who needs the auditing code in the Python translator? AIUI the implementation of PEP 263, you could just substitute an auditing UTF-8 codec based on that code for the PEP 263 standard UTF-8 codec. This codec is Python code, and thus could be configured using a file, which could be generated by the codec and compared with the old version; the possibilities are endless ...
and in no way need to be defined in the language if I'm correct about the implementation.[1]

The reason I favor the single command line flag (perhaps even restricted to the binary choice of compatibility ASCII vs. PEP 3131 Unicode) is as a transition strategy. I do not agree with Ka-Ping inter alia that there are bogeymen under the bed, but then I live in Japan, and there *is* no "under the bed" (we sleep on mats on the floor). I think it's quite reasonable to provide a non-invasive, *simple* auditing facility for those who want it. When you're talking about security holes, the burden of proof should *not* be on the paranoid, especially when the backward-compatibility cost of security is *zero* (there are *no* Python programs containing non-ASCII identifiers in the wild yet!)

As James Knight says, the "configure the world in one file" strategy that jJ and I were batting around is a bit nuts, but it might not be a bad strategy for configuring a loadable auditing codec or external utility; I don't think that's wasted mental effort at all.

Footnotes:
[1] Caveat, the implementation will be much more heavyweight than a standard codec since it must contain a Python parser.

From turnbull at sk.tsukuba.ac.jp  Sun Jun  3 15:51:06 2007
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Sun, 03 Jun 2007 22:51:06 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070601235750.6F01.JCARLSON@uci.edu>
References: <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <20070601235750.6F01.JCARLSON@uci.edu>
Message-ID: <87ira5ywmt.fsf@uwakimon.sk.tsukuba.ac.jp>

Josiah Carlson writes:

> And as stated by basically everyone, the only *sane* default is ascii
> identifiers.

That's a misrepresentation. I prefer the full range of PEP 3131 as the default for use by consenting adults. But you should have the right to unilaterally refuse to grant that consent, yet still enjoy the benefits of the rest of Python.

From rauli.ruohonen at gmail.com  Sun Jun  3 17:21:43 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Sun, 3 Jun 2007 18:21:43 +0300
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID:

On 6/3/07, Stephen J. Turnbull wrote:
> Merely to define those is non-trivial, and it is absolutely out
> of the question to expect that the average Python user will know
> what the character set "strictly-conforms-to-UTR39-restrictions-
> allows-confusables" is.

This is a bit of a strawman, as most of the time the charset would be ascii or everything, which are much easier concepts. Point taken about trying anything more complex, as the reader will generally no longer understand that anyway. A special-purpose tool can handle the complex cases much better.

> ease of use is "globally restrict to what my students can read =
> ASCII + Japanese".

I prefer your first definition of ease of use:

> Ease of use of audited code is trivial; turn the checks off.

This along with your other idea sounds fairly good, actually:

> The reason I favor the single command line flag (perhaps even
> restricted to the binary choice of compatibility ASCII vs. PEP
> 3131 Unicode) is as a transition strategy.
The KISS way of having a single flag for either ASCII or PEP 3131 (if the even simpler way of only PEP 3131 is too simple) should take care of most (all?) of the use cases, and nobody's head will explode. If it's this simple, then it's not a problem to have it on the command line, and my suggestion is unnecessary.

> I do not agree with Ka-Ping inter alia that there are bogeymen
> under the bed,

Looks like the only ones who do agree want pure ASCII, so a binary option is sufficient. You could also argue that it's a choice of old behavior and new behavior, and anything else is unnecessary. You might even use "from __future__ import unicode_identifiers" instead of a command line flag, if you view it like that.

> but then I live in Japan, and there *is* no "under
> the bed" (we sleep on mats on the floor).

?????????????????????

> I think it's quite reasonable to provide a non-invasive, *simple*
> auditing facility for those who want it.

Emphasis on simple, indeed. If you start adding more complex auditing systems, then it would make sense for the files to declare which specification they conform to.

> When you're talking about security holes, the burden of proof
> should *not* be on the paranoid

The default doesn't really matter much. It's simple to use "#!/usr/bin/env python -U" or whatever in scripts, whether that option selects PEP 3131 or ascii.

> As James Knight says, the "configure the world in one file"
> strategy that jJ and I were batting around is a bit nuts, but it
> might not be a bad strategy for configuring a loadable auditing
> codec or external utility; I don't think that's wasted mental
> effort at all.

True, but such details have clearly gone beyond a "*simple* auditing facility" and sound like a solution looking for a problem.

From martin at v.loewis.de  Sun Jun  3 19:11:21 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Sun, 03 Jun 2007 19:11:21 +0200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To:
References: <46371BD2.7050303@v.loewis.de>
Message-ID: <4662F639.2070806@v.loewis.de>

>> All identifiers are converted into the normal form NFC while parsing;
>
> Actually, shouldn't the whole file be converted to NFC, instead of
> only identifiers? If you have decomposable characters in strings and
> your editor decides to normalize them to a different form than in the
> original source, the meaning of the code will change when you save
> without you noticing anything.

Sure - but how can Python tell whether a non-normalized string was intentionally put into the source, or ended up there as a side effect of the editor modifying it? In most cases, it won't matter. If it does, it should be explicit in the code, e.g. by putting an n() function around the string literal.

> It's always better to be explicit when you want to make invisible
> distinctions. In the rare cases anything but NFC is really needed you
> can do explicit conversion or use escapes. Having to add normalization
> calls around all unicode strings to code defensively is neither
> convenient nor obvious.

However, it typically isn't necessary, either. Also, there is still room for subtle issues, e.g. concatenating two normalized strings can produce a string that isn't normalized. Also, in many cases, strings come from IO, not from source, so if it is important that they are in NFC, you need to normalize anyway.
Regards,
Martin

From rauli.ruohonen at gmail.com  Sun Jun  3 20:30:01 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Sun, 3 Jun 2007 21:30:01 +0300
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <4662F639.2070806@v.loewis.de>
References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de>
Message-ID:

On 6/3/07, "Martin v. Löwis" wrote:
> Sure - but how can Python tell whether a non-normalized string was
> intentionally put into the source, or ended up there as a side effect
> of the editor modifying it?

It can't, but does it really need to? It could always assume the latter.

> In most cases, it won't matter. If it does, it should be explicit
> in the code, e.g. by putting an n() function around the string
> literal.

This is only almost true. Consider these two hypothetical files written by naive newbies:

data.py:

    favorite_colors = {'Martin Löwis': 'blue'}

code.py:

    import data
    print data.favorite_colors['Martin Löwis']

Now if these are written by two different people using different editors, one might be normalized in a different way than the other, and the code would look all right but mysteriously fail to work. Even more mysteriously, when the files are opened and saved (possibly even automatically) by one of the people without any changes, the code would then start to work. And magically break again when the other person edits one of the files.

The most important thing about normalization is that it should be consistent for internal strings. Similarly, when reading in a text file, you really should normalize it first if you're going to handle it as *text*, not binary.

The most common normalization is NFC, because it works best everywhere and causes the least amount of surprise. E.g. "Löwis"[2] results in "w", not in u'\u0308' (COMBINING DIAERESIS), which most naive users won't expect.

> Also, there is still room for subtle issues, e.g. concatenating
> two normalized strings can produce a string that isn't normalized.

Sure:

    >>> from unicodedata import normalize as n
    >>> a = n('NFD', u'ö'); n('NFC', a[0]) + n('NFC', a[1:]) == n('NFC', a)
    False

But a partial solution is better than no solution.

> Also, in many cases, strings come from IO, not from source, so if
> it is important that they are in NFC, you need to normalize anyway.

Indeed, and it would be best if this happened automatically, like the handling of line endings. It doesn't need to always work, just most of the time. I haven't read the description of Python's syntax, but this happens with Python 2.5:

test.py:

    a = """
    """
    print repr(a)

Output:

    '\n'

The line ending there is '\r\n', and Python normalizes it when reading in the source code, even though '\r\n' matters even less than doing NFC normalization.

From martin at v.loewis.de  Sun Jun  3 20:43:03 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Sun, 03 Jun 2007 20:43:03 +0200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To:
References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de>
Message-ID: <46630BB7.2030205@v.loewis.de>

Rauli Ruohonen schrieb:
> This is only almost true. Consider these two hypothetical files
> written by naive newbies:
>
> data.py:
>
>     favorite_colors = {'Martin Löwis': 'blue'}
>
> code.py:
>
>     import data
>     print data.favorite_colors['Martin Löwis']

That is an unrealistic example.
It's more likely that the second access reads

    user = find_current_user()
    print data.favorite_colors[user]

To deal with that safely, I would recommend writing

    favorite_colors = nfc_dict({'Martin Löwis': 'blue'})

> The most important thing about normalization is that it should be
> consistent for internal strings. Similarly, when reading in a text
> file, you really should normalize it first if you're going to
> handle it as *text*, not binary.
>
> The most common normalization is NFC, because it works best
> everywhere and causes the least amount of surprise. E.g.
> "Löwis"[2] results in "w", not in u'\u0308' (COMBINING DIAERESIS),
> which most naive users won't expect.

Sure. If you think it is worth the effort, write a PEP. PEP 3131 is only about identifiers.

Regards,
Martin

From talin at acm.org  Sun Jun  3 21:05:32 2007
From: talin at acm.org (Talin)
Date: Sun, 03 Jun 2007 12:05:32 -0700
Subject: [Python-3000] Substantial rewrite of PEP 3101
Message-ID: <466310FC.8020707@acm.org>

I've rewritten large portions of PEP 3101, incorporating some material from Patrick Maupin and Eric Smith, as well as rethinking the whole custom formatter design as I discussed earlier. Although it isn't showing up on the web site yet, you can view the copy in subversion (and the diffs) here:

http://svn.python.org/view/peps/trunk/pep-3101.txt

Please let me know of any errors you find. Thanks.

-- Talin

From showell30 at yahoo.com  Sun Jun  3 21:49:20 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Sun, 3 Jun 2007 12:49:20 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070602095920.6F04.JCARLSON@uci.edu>
Message-ID: <706536.22850.qm@web33508.mail.mud.yahoo.com>

--- Josiah Carlson wrote:
>
> Perhaps, but there is a growing contingent here that is of the opposite
> opinion. And even though this contingent is of differing opinions on
> whether unicode identifiers should even be allowed, we all agree that if
> they are allowed, they shouldn't be the default.

I have always supported allowing unicode identifiers, but as somebody who now uses ascii identifiers in all the code that I write and all the code that I consume, I am still 60/40 in favor of having ascii-only be the default. It will not be the end of the world for me if unicode-friendly turns out to be the default behavior, but it does seem reasonable that *some* concession be made to my general usage, like a simple environment variable that I could set to disable unicode identifiers. In my case, security is not a complete non-issue, but I mainly want this feature from a usability standpoint.

I think PEP 3131 could be improved in two ways:

1) In the Objections section, summarize some of the reservations that folks have had about allowing Unicode identifiers into the language, and then address those reservations with the proposed solutions. Rauli's excellent post a few replies back would be a good starting point.

2) Propose an ASCII_ONLY environment variable.

From showell30 at yahoo.com  Sun Jun  3 22:18:17 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Sun, 3 Jun 2007 13:18:17 -0700 (PDT)
Subject: [Python-3000] example Python code under PEP 3131?
Message-ID: <315672.7992.qm@web33512.mail.mud.yahoo.com>

There has been a lot of interesting debate about PEP 3131, but I think some perspective could be brought to the table by showing actual code examples. Can somebody post a few examples of what Python code would look like under PEP 3131? Maybe 10-to-15 line programs that illustrate the following use cases.

1) Dutch tax lawyer using Dutch identifiers and English reserved words (def, import, if, while, etc.)

2) Japanese student using Japanese identifiers and English reserved words (re, search, match, print, etc.).

As somebody who has never worked with a language where I don't know the reserved words, I'm trying to imagine this type of program:

1) English student using English identifiers and Japanese reserved words.

My perspective on this issue is limited by the fact that I happen to speak English natively. I often wonder whether I'd be using Python if the keywords were in Dutch, and my identifiers weren't allowed to include certain Anglicisms (say I had to spell "y" as "ij"), but I was allowed to use English in my strings. Then, I wonder how much my decision to use Python would have been influenced by the ability to use identifiers in Dutch. I'm suspecting the answers would be no and no, even though Dutch is fairly closely related to English.

I can tell you that it would be a complete showstopper if Matz had written a Python-like language that required Japanese reserved words, even if he had allowed English in other places. Matz wisely internationalized the whole language. (Never mind that I don't use Ruby much anyway--that has more to do with other linguistic issues).

From jimjjewett at gmail.com  Mon Jun  4 02:43:44 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sun, 3 Jun 2007 20:43:44 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <4661F851.4020403@v.loewis.de>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <87r6p540n4.fsf@uwakimon.sk.tsukuba.ac.jp> <87646g3u9q.fsf@uwakimon.sk.tsukuba.ac.jp> <87veee2wj4.fsf@uwakimon.sk.tsukuba.ac.jp> <43aa6ff70705271741w2b3eefcbj29921e81822d189@mail.gmail.com> <4661F851.4020403@v.loewis.de>
Message-ID:

On 6/2/07, "Martin v. Löwis" wrote:
> I'm unsure whether there are cases where
> the standard BIDI algorithm would produce incorrect results;

Yes, but I'm not sure any of those cases are appropriate for programming language identifiers. Quoting from the introduction to Unicode Annex 9:

"""
However, in the case of bidirectional text, there are circumstances where an implicit bidirectional ordering is not sufficient to produce comprehensible text
"""

Neither the example given (mixed-script part numbers, section 2.2), nor those I could come up with (all involving archaic scripts) were appropriate for variable *names*.

> it's certainly the case that not all tools implement it correctly,
> so the control characters can help those tools (assuming the
> tool implements the control character at least).

To be honest, that is probably what I would do; I'm not quite sure I even understand the correct algorithm for numbers.

-jJ

From stephen at xemacs.org  Mon Jun  4 03:29:29 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Mon, 04 Jun 2007 10:29:29 +0900
Subject: [Python-3000] example Python code under PEP 3131?
In-Reply-To: <315672.7992.qm@web33512.mail.mud.yahoo.com>
References: <315672.7992.qm@web33512.mail.mud.yahoo.com>
Message-ID: <87ejkszeva.fsf@uwakimon.sk.tsukuba.ac.jp>

Steve Howell writes:

> 2) Japanese student using Japanese identifiers and
> English reserved words (re, search, match, print,
> etc.).

I don't have time to cook up something in Python, but I can give an example of working code in Lisp:

You probably already know enough Lisp to read this, but if not, here are a few hints. `define-edict-rule' is a factory function, not part of the Lisp language. Comments are prefixed by ";" and run to the end of the line. Strings are delimited by '"' and may contain newlines. Pretty much everything else that is not punctuation is an identifier. "-" may be embedded in an identifier.

Note that the rest of the application contains Japanese only in comments. This section deals with de-inflection of Japanese words (ie, deducing dictionary form from words that occur in natural text), and thus needs concepts not available in English, or where available, the English word would not make sense to a Japanese reader.

BTW, I don't know whether the breakage is in ViewCVS, FireFox, or both, but several places in the file result in confusion of content and HTML markup, with more-or-less amusing results.

From jimjjewett at gmail.com  Mon Jun  4 03:27:06 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sun, 3 Jun 2007 21:27:06 -0400
Subject: [Python-3000] Conservative Defaults (was: Re: Support for PEP 3131)
Message-ID:

On 6/2/07, Rauli Ruohonen wrote:
> and the whole issue of defaults is quite minor.

I disagree; the defaults are the most important issue.

Those most eager for unicode identifiers are afraid that people (particularly beginning students) won't be able to use local-script identifiers unless it is the default. My feeling is that the teacher (or the person who pointed them to python) can change the default on a per-install basis, since it can be a one-time change.

Those of us most nervous about unicode identifiers are concerned precisely because "anything goes" may become a default.

If national characters become the default in Sweden or Japan, that is OK. These national divisions are already there, and probably unavoidable.

On the other hand, if "anything from *any* script" becomes the default, even on a single widespread distribution, then the community starts to splinter in a new way. It starts to divide people who distribute source code (generally ASCII) from people who are effectively distributing binaries (not for human end-users to read). That is bad enough on its own, but even worse because the distinction isn't clearly marked. As the misleading examples have shown, these (effective) binaries can pretend to be regular source code doing one thing, even though they actually do something different.

> On 6/2/07, Josiah Carlson wrote:
> > Adding a tool to an arbitrarily large or small previously existing
> > toolchain, so that the majority of users can verify that their code
> > doesn't contain characters that shouldn't be allowed in the first
> > place, isn't a very good solution.

> I doubt the majority of users care, so the verifiers would be
> a minority.

Agreed, because the majority of users don't care about security at all. Outside the python context, this is one reason we have so much spam (from compromised computers). To protect the group at large, security has to be the default.
Of course, security also has to be non-intrusive, or people will turn it off. A one-time decision to allow your own national characters, which could be rolled into the initial install, or even a local distribution -- that is fairly non-intrusive.

> You're exaggerating the amount of work caused [by adding to the toolchain]

No, he isn't. My own process is often exactly:

(1) Read or skim the code.
(2) (a) Download it/save it as text, or
    (b) Cut and paste the snippet from the webpage.
(3) Run it.

There is no external automated tool in the middle; forcing me to add one would move python from the "things just work, and you can test immediately" category into a compile/build/wait/test language. I have used python this way (when developing for a machine I could not access directly), and ... I don't recommend it.

Hopefully, I can set my own python to enforce ASCII IDs (rather than ASCII strings and comments). But if too many people start to assume that distributed code can freely mix other scripts, I'll start to get random failures. I'll probably allow Latin-1. I might end up allowing a few other scripts -- but then how should I say "script X or script Y; not both"? Keeping the default at ASCII for another release or two will provide another release or two to answer this question.

> > Only because it is so rarely used that no one really runs into
> > unicode identifiers.

> It doesn't really matter why they're not a problem in practice,
> just that they aren't. A non-issue is a non-issue, no matter why.

Of course it matters. If it isn't a problem only because of something that wouldn't apply to python, then we still have to worry.

> ... Java, ... don't hear constant complaints

They aren't actually a problem because they aren't used; they aren't used because almost no one knows about them. Python would presumably advertise the feature, and see more use. (We shouldn't add it at all *unless* we expect much more usage than unicode IDs have seen in other programming languages.)

Also note that Java in particular already has static type checking (which would resolve many of the objections) and is already a compile/build/wait/test language (so the cost of additional tools is less). (I believe that C# is in this category too, but won't swear to it.)

Not seeing problems in Lisp would be a valid argument -- except that the internationalized IDs are explicitly marked. Not just the files; the individual IDs. You have to write |lowercase| to get an ID made of unexpected characters (including explicitly lower-case letters).

JavaScript would provide a legitimate example of a dynamic language where unicode IDs caused no problem. On the other hand, broken javascript is already so common that I doubt anyone would have noticed; python should (and currently does) meet a higher standard for cross-platform interoperability.

In other words, python will be going out on a limb. That doesn't mean we shouldn't allow such identifiers, but it does mean that we should be cautious.

As an analogy, remember that function decorators were added to python in version 2.4. The initial patch would also have handled class decorators. No one came up with a single reason to disallow them that didn't also apply to function decoration -- except one. Guido wasn't *sure* they were needed, and it would be easier to add them later (in 2.6) than it would have been to pull them back out.

The same one-step-at-a-time reasoning applies to unicode identifiers.
Allowing IDs in your native language (or others that you explicitly
approve) is probably a good step.  Allowing IDs in *any* language by
default is probably going too far.

-jJ

From stephen at xemacs.org  Mon Jun  4 03:53:21 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Mon, 04 Jun 2007 10:53:21 +0900
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To:
References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de>
Message-ID: <87d50czdri.fsf@uwakimon.sk.tsukuba.ac.jp>

Rauli Ruohonen writes:

 > On 6/3/07, "Martin v. Löwis" wrote:
 > > Sure - but how can Python tell whether a non-normalized string was
 > > intentionally put into the source, or as a side effect of the editor
 > > modifying it?

 > It can't, but does it really need to? It could always assume the latter.

No, it can't.  One might want to write Python code that implements
normalization algorithms, for example, and there will be "binary
strings".  Only in the context of Unicode text are you allowed to do
those things.  This would require Python to internally distinguish
between Unicode text files and other files.

[example of a dictionary application using Unicode strings]

 > Now if these are written by two different people using different
 > editors, one might be normalized in a different way than the other,
 > and the code would look all right but mysteriously fail to work.

It seems to me that once we have a proper separation between bytes
objects and unicode objects, that the latter should always be compared
internally to the dictionary using the kinds of techniques described
in UTS#10 and UTR#30.  External normalization is not the right way to
handle this issue.

 > But a partial solution is better than no solution.

Not if it leads to unexpected failures that are hard to diagnose,
especially in the face of human belief that this problem has been
"solved".

 > The line ending there is '\r\n', and Python normalizes it when
 > reading in the source code, even though '\r\n' matters even less
 > than doing NFC normalization.

That's not a Python language normalization; that's an artifact of the
line-reading function.  It's deliberate, of course, but it's not
really character-level, it's a line-level transformation.  If I start
up an interpreter and type

>>> a = """^V^M^V^J"""
>>> repr(a)
"'\\r\\n'"

(On my Mac, on other systems the quoting character for key entry of
control characters is probably different.)

From bjourne at gmail.com  Mon Jun  4 03:58:59 2007
From: bjourne at gmail.com (=?ISO-8859-1?Q?BJ=F6rn_Lindqvist?=)
Date: Mon, 4 Jun 2007 03:58:59 +0200
Subject: [Python-3000] Conservative Defaults (was: Re: Support for PEP 3131)
In-Reply-To:
References:
Message-ID: <740c3aec0706031858q1dde5ccbxb39d4a808902f5d2@mail.gmail.com>

> Those most eager for unicode identifiers are afraid that people
> (particularly beginning students) won't be able to use local-script
> identifiers, unless it is the default.  My feeling is that the teacher
> (or the person who pointed them to python) can change the default on a
> per-install basis, since it can be a one-time change.

What if the person discovers Python by him/herself?

> On the other hand, if "anything from *any* script" becomes the
> default, even on a single widespread distribution, then the community
> starts to splinter in a new way.  It starts to separate between people
> who distribute source code (generally ASCII) and people who are
> effectively distributing binaries (not for human end-users to read).

That is FUD.
> Hopefully, I can set my own python to enforce ASCII IDs (rather than
> ASCII strings and comments).  But if too many people start to assume
> that distributed code can freely mix other scripts, I'll start to get
> random failures.  I'll probably allow Latin-1.  I might end up
> allowing a few other scripts -- but then how should I say "script X or
> script Y; not both"?  Keeping the default at ASCII for another release
> or two will provide another release or two to answer this question.

Answer what question? If people will use the feature? Of course they
won't if it isn't default.

> > ... Java, ... don't hear constant complaints
>
> They aren't actually a problem because they aren't used; they aren't
> used because almost no one knows about them.  Python would presumably
> advertise the feature, and see more use.  (We shouldn't add it at all
> *unless* we expect much more usage than unicode IDs have seen in other
> programming languages.)

Every Swedish book I've read about Java (only 2) mentioned that feature.

> The same one-step-at-a-time reasoning applies to unicode identifiers.
> Allowing IDs in your native language (or others that you explicitly
> approve) is probably a good step.  Allowing IDs in *any* language by
> default is probably going too far.

If you set different native languages won't you get the exact same
problems that codepages caused and that unicode was invented to solve?

--
mvh Björn

From showell30 at yahoo.com  Mon Jun  4 04:45:57 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Sun, 3 Jun 2007 19:45:57 -0700 (PDT)
Subject: [Python-3000] Conservative Defaults (was: Re: Support for PEP 3131)
In-Reply-To: <740c3aec0706031858q1dde5ccbxb39d4a808902f5d2@mail.gmail.com>
Message-ID: <863853.58342.qm@web33508.mail.mud.yahoo.com>

--- BJörn Lindqvist wrote:
> > Those most eager for unicode identifiers are afraid that people
> > (particularly beginning students) won't be able to use local-script
> > identifiers, unless it is the default.  My feeling is that the
> > teacher (or the person who pointed them to python) can change the
> > default on a per-install basis, since it can be a one-time change.
>
> What if the person discovers Python by him/herself?

How many people discover Python in a cultural vacuum?  People find out
about Python from other Python users.  There are user groups all over
the planet:

http://wiki.python.org/moin/LocalUserGroups

From stephen at xemacs.org  Mon Jun  4 05:45:11 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Mon, 04 Jun 2007 12:45:11 +0900
Subject: [Python-3000] Conservative Defaults (was: Re: Support for PEP 3131)
In-Reply-To:
References:
Message-ID: <87bqfwz8l4.fsf@uwakimon.sk.tsukuba.ac.jp>

Jim Jewett writes:

 > > You're exaggerating the amount of work caused [by adding to the toolchain]
 >
 > No, he isn't.

It is exaggeration.  AFAICS the work of auditing character sets can be
done by the same codec APIs that implement PEP 263.  The only question
is whether the additional work of parsing out the identifiers would
cause noticeable inefficiency in codec operation.  AFAIK, parsing out
the identifiers is cheap (though possibly several times as expensive
as the UTF-8 -> unicode object conversion, if it needs to be done once
in the codec and once in the compiler).
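For concreteness, here is a rough sketch of such an identifier audit,
written against the current tokenize module rather than a codec
(Python 3 spelling; audit_identifiers and the ASCII-only `allowed` set
are illustrative placeholders for whatever per-site policy gets
configured, not anything the PEP specifies):

import io
import keyword
import tokenize

ASCII_ID_CHARS = frozenset(
    'abcdefghijklmnopqrstuvwxyz'
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    '0123456789_')

def audit_identifiers(source, allowed=ASCII_ID_CHARS):
    """Yield (row, col, name) for each identifier that uses characters
    outside the allowed set.  Assumes the tokenizer itself already
    accepts the identifiers, which is exactly what PEP 3131 provides."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            if not set(tok.string) <= allowed:
                yield tok.start[0], tok.start[1], tok.string

Whether this runs in the codec, the compiler, or an external checker is
an implementation detail; the point is only that the identifier scan
itself is a few lines, not a new toolchain.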
> Hopefully, I can set my own python to enforce ASCII IDs (rather than
> ASCII strings and comments).  But if too many people start to assume
> that distributed code can freely mix other scripts, I'll start to get
> random failures.

This is unlikely to be a major problem, IMHO.  It definitely is a
consideration, though, and some people will face more difficulty than
others, perhaps a lot more.

> Not seeing problems in Lisp would be a valid argument -- except that
> the internationalized IDs are explicitly marked.  Not just the files;
> the individual IDs.  You have to write |lowercase| to get an ID made
> of unexpected characters (including explicitly lower-case letters).

This is not true of Emacs Lisp, which not only accepts non-ASCII
characters, but is case-sensitive.

> noticed; python should (and currently does) meet a higher standard for
> cross-platform interoperability.

As does Emacs.

> The same one-step-at-a-time reasoning applies to unicode identifiers.
> Allowing IDs in your native language (or others that you explicitly
> approve) is probably a good step.  Allowing IDs in *any* language by
> default is probably going too far.

I don't really see that distinction.  IMO the scenarios where allowing
a native language makes sense are (a) localized (like a programming
class), and you won't run into anything else anyway, and (b)
internationalized, where you'll be sharing with others who have
enabled *their* native languages.  Those with stricter auditing
requirements will be vetting production code with more powerful
external tools anyway.

From stephen at xemacs.org  Mon Jun  4 06:01:08 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Mon, 04 Jun 2007 13:01:08 +0900
Subject: [Python-3000] Conservative Defaults (was: Re: Support for PEP 3131)
In-Reply-To: <740c3aec0706031858q1dde5ccbxb39d4a808902f5d2@mail.gmail.com>
References: <740c3aec0706031858q1dde5ccbxb39d4a808902f5d2@mail.gmail.com>
Message-ID: <87abvgz7uj.fsf@uwakimon.sk.tsukuba.ac.jp>

BJörn Lindqvist writes:

 > > On the other hand, if "anything from *any* script" becomes the
 > > default, even on a single widespread distribution, then the community
 > > starts to splinter in a new way.  It starts to separate between people
 > > who distribute source code (generally ASCII) and people who are
 > > effectively distributing binaries (not for human end-users to read).
 >
 > That is FUD.

Not entirely.  XEmacs has found it appropriate to divide its
approximation to a standard library into "no-MULE" and "MULE-required"
groups of packages (~= Python modules).  GNU Emacs did not, and
suffered a lot of internal dissension for their decision to impose
MULE on all users.  Interestingly, they use no non-ASCII identifiers
that I know of.  (edict.el is not included in GNU Emacs due to an
assignment refusenik among the principal authors.)  The technology has
advanced dramatically since then, but there is real precedent for
balkanization.

The phrase "effectively distributing binaries (not for human end-users
to read)" is over the top, though.  Of course they're for end users to
read, they still are Python source, etc.

> Answer what question? If people will use the feature? Of course they
> won't if it isn't default.

I assure you, my students will if it is available to my knowledge.

> If you set different native languages won't you get the exact same
> problems that codepages caused and that unicode was invented to solve?

No.  There is no confusion of character identity.
This is a perfectly legitimate way to support Unicode, as long as the
subset of Unicode that is allowed is properly declared.  It does not
violate the principles of Unicode in any way.

From martin at v.loewis.de  Mon Jun  4 07:12:39 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 04 Jun 2007 07:12:39 +0200
Subject: [Python-3000] example Python code under PEP 3131?
In-Reply-To: <315672.7992.qm@web33512.mail.mud.yahoo.com>
References: <315672.7992.qm@web33512.mail.mud.yahoo.com>
Message-ID: <46639F47.4070909@v.loewis.de>

> Can somebody post a few examples of what Python code
> would look like under PEP 3131? Maybe 10-to-15 line
> programs that illustrate the following use cases.

Here is a class definition of the kind that students often propose in
oral exams.  Midway through, they then start to wonder whether this is
even allowed.

# Definition von Element sei gegeben

class Liste:
    def __init__(self):
        self.erstes_element = None

    def einfügen(self, objekt):
        if not self.erstes_element:
            self.erstes_element = Element(objekt)
        else:
            zeiger = self.erstes_element
            while zeiger.nächstes_element:
                zeiger = zeiger.nächstes_element
            zeiger.nächstes_element = Element(objekt)

    def löschen(self, objekt):
        if self.erstes_element.wert == objekt:
            self.erstes_element = self.erstes_element.nächstes_element
        else:
            zeiger = self.erstes_element
            while zeiger.nächstes_element:
                if zeiger.nächstes_element.wert == objekt:
                    zeiger.nächstes_element = \
                        zeiger.nächstes_element.nächstes_element
                    return
                zeiger = zeiger.nächstes_element

With kind regards,
Martin

From martin at v.loewis.de  Mon Jun  4 07:26:29 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 04 Jun 2007 07:26:29 +0200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <87d50czdri.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de> <87d50czdri.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4663A285.4090009@v.loewis.de>

Stephen J. Turnbull schrieb:
> > > Sure - but how can Python tell whether a non-normalized string was
> > > intentionally put into the source, or as a side effect of the editor
> > > modifying it?
> >
> > It can't, but does it really need to? It could always assume the latter.
>
> No, it can't.  One might want to write Python code that implements
> normalization algorithms, for example, and there will be "binary
> strings".  Only in the context of Unicode text are you allowed to do
> those things.

Of course, such an algorithm really should \u-escape the relevant
characters in source, so that editors can't mess them up.

> > Now if these are written by two different people using different
> > editors, one might be normalized in a different way than the other,
> > and the code would look all right but mysteriously fail to work.
>
> It seems to me that once we have a proper separation between bytes
> objects and unicode objects, that the latter should always be compared
> internally to the dictionary using the kinds of techniques described
> in UTS#10 and UTR#30.  External normalization is not the right way to
> handle this issue.

By default, comparison and dictionary lookup won't do normalization,
as that is too expensive and too infrequently needed.  In any case,
this has nothing to do with PEP 3131.
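A minimal illustration of that default, using only the unicodedata
module (Python 3 spelling; nothing here is mandated by PEP 3131):

import unicodedata

key_nfc = unicodedata.normalize('NFC', 'café')  # e-acute as one code point
key_nfd = unicodedata.normalize('NFD', 'café')  # 'e' followed by U+0301

d = {key_nfc: 1}
print(key_nfc == key_nfd)  # False: visually identical, different code points
print(key_nfd in d)        # False: dict lookup does no normalization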
Regards,
Martin

From rauli.ruohonen at gmail.com  Mon Jun  4 07:52:14 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Mon, 4 Jun 2007 08:52:14 +0300
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <87d50czdri.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de> <87d50czdri.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID:

On 6/4/07, Stephen J. Turnbull wrote:
> No, it can't.  One might want to write Python code that implements
> normalization algorithms, for example, and there will be "binary
> strings".  Only in the context of Unicode text are you allowed to
> do those things.

But Python files are text and should be readable to humans. Invisible
differences in code that are significant aren't good practice - I
think that was well established in the PEP 3131 discussion :-)

Is there some reason normalization algorithm implementations can't
use escapes (which are ASCII and thus not normalized) for non-NFC
strings? Note that editors are allowed to normalize as they will
(though the ones I use don't). From the Unicode standard, chapter 3:

:C9 A process shall not assume that the interpretations of two
:   canonical-equivalent character sequences are distinct.
:
:   - The implications of this conformance clause are twofold. First,
:     a process is never required to give different interpretations
:     to two different, but canonical-equivalent character sequences.
:     Second, no process can assume that another process will make
:     a distinction between two different, but canonical-equivalent
:     character sequences.

As other programs processing Python source code files may not be
assumed to distinguish between normalization forms, depending on them
to do so (in normalization algorithm source code or elsewhere) is a
bit disquieting.

> It seems to me that once we have a proper separation between bytes
> objects and unicode objects, that the latter should always be
> compared internally to the dictionary using the kinds of techniques
> described in UTS#10 and UTR#30.

This sounds good if it's feasible performance-wise.

> External normalization is not the right way to handle this issue.

It depends on what problem you're solving. What I'm concerned about
most is that there may be rare (because NFC is so ubiquitous) but
annoying heisenbugs whose immediate cause is an invisible difference
in the source code. Such a class of problems shouldn't exist without
a good reason, and the reason "someone might want to write code that
depends on invisible easter eggs in the source code" doesn't sound
like a good reason to me.

Collation also doesn't solve all of the problem for naive users.
E.g. is len('???') 3 or 4? It depends on the normalization. Whether
each index in it is a hiragana character or not also depends on the
normalization. Same for e.g. 'café'.

> > But a partial solution is better than no solution.
>
> Not if it leads to unexpected failures that are hard to diagnose,
> especially in the face of human belief that this problem has been
> "solved".

Sure, the concatenation of two normalized strings is not necessarily
a normalized string because you can have a string with a combining
character at the beginning, but people who deal with such things know
(or at least really, really, should!) how to fend for themselves.
There's nothing you can do to help them either, except education.
There's value in keeping simple things simple and ensuring nothing
unexpected happens with simple things.
In a large class of use cases you really don't need to care that it's
a complex world. This is the case with many legacy encodings (such as
Latin-1), and the users of those will surely be surprised if switching
to utf-8 causes single characters to sometimes be split to multiple
parts depending on the phase of the Moon.

> If I start up an interpreter and type
>
> >>> a = """^V^M^V^J"""
> >>> repr(a)
> "'\\r\\n'"

What the interpreter prompt does is less of an issue, as the code is
not long-lived and the programmer is there all the time observing what
the code does.

Anyway, the deadline for PEPs for py3k has passed and there's no PEP
this one would fit in, so I guess this wart will have to stay. It's
not a pressing issue, as everyone who's sane uses NFC anyway, and if
someone edits your code with an NFD-normalizing editor you can just
beat them over the head with a stick and force them to use vim as a
penance :-)

From eric+python-dev at trueblade.com  Mon Jun  4 12:37:34 2007
From: eric+python-dev at trueblade.com (Eric V. Smith)
Date: Mon, 04 Jun 2007 06:37:34 -0400
Subject: [Python-3000] Substantial rewrite of PEP 3101
In-Reply-To: <466310FC.8020707@acm.org>
References: <466310FC.8020707@acm.org>
Message-ID: <4663EB6E.4080302@trueblade.com>

> Formatter Creation and Initialization
>
>     The Formatter class takes a single initialization argument, 'flags':
>
>         Formatter(flags=0)
>
>     The 'flags' argument is used to control certain subtle behavioral
>     differences in formatting that would be cumbersome to change via
>     subclassing.  The flags values are defined as static variables
>     in the "Formatter" class:
>
>         Formatter.ALLOW_LEADING_UNDERSCORES
>
>             By default, leading underscores are not allowed in identifier
>             lookups (getattr or getitem).  Setting this flag will allow
>             this.
>
>         Formatter.CHECK_UNUSED_POSITIONAL
>
>             If this flag is set, any positional arguments which are
>             supplied to the 'format' method but which are not used by
>             the format string will cause an error.
>
>         Formatter.CHECK_UNUSED_NAME
>
>             If this flag is set, any named arguments which are
>             supplied to the 'format' method but which are not used by
>             the format string will cause an error.

I'm not sure I'm wild about these flags which would have to be or'd
together, as opposed to discrete parameters.  I realize having a single
flag field is likely more extensible, but my impression of the
standard library is a move away from bitfield flags.  Perhaps that's
only in my own mind, though!

Also, why put this in the base class at all?  These could all be
implemented in a derived class (or classes), which would leave the
base class state-free and therefore without a constructor.

> Formatter Methods
>
>     The methods of class Formatter are as follows:
>
>         -- format(format_string, *args, **kwargs)
>         -- vformat(format_string, args, kwargs)
>         -- get_positional(args, index)
>         -- get_named(kwds, name)
>         -- format_field(value, conversion)

I've started a sample implementation to test this API.  For starters,
I'm writing it in pure Python, but my intention is to use the code in
the pep3101 sandbox once I have some tests written and we're happy
with the API.

From showell30 at yahoo.com  Mon Jun  4 12:47:04 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Mon, 4 Jun 2007 03:47:04 -0700 (PDT)
Subject: [Python-3000] example Python code under PEP 3131?
In-Reply-To: <46639F47.4070909@v.loewis.de>
Message-ID: <471885.91417.qm@web33503.mail.mud.yahoo.com>

--- "Martin v. Löwis" wrote:
L?wis" wrote: > # Definition von Element sei gegeben > > class Liste: > def __init__(self): > self.erstes_element = None > > def einf?gen(self, objekt): > if not self.erstes_element: > self.erstes_element = Element(objekt) > else: > zeiger = self.erstes_elment > while zeiger.n?chstes_element: > zeiger = zeiger.n?chstes_element > zeiger.n?chstes_element = Element(objekt) > > def l?schen(self, objekt): > if self.erstes_element.wert == objekt: > self.erstes_element = > self.erstes_element.n?chstes_element > else: > zeiger = self.erstes_element > while zeiger.n?chstes_element: > if zeiger.n?chstes_element.wert == objekt: > zeiger.n?chstes_element = \ > zeiger.n?chstes_element.n?chstes_element > return > zeiger = zeiger.n?chstes_element > Neat. Danke f?r das Beispiel. (I hope that makes sense.) FWIW I can follow most of the above program, with a tiny bit of help from Babelfish. These were easy for me: Liste = list nachstes = next erstes = first objekt = object These I looked up: einfugen = in joints (????) gegeben = given zeiger = pointer ____________________________________________________________________________________ Food fight? Enjoy some healthy debate in the Yahoo! Answers Food & Drink Q&A. http://answers.yahoo.com/dir/?link=list&sid=396545367 From python at zesty.ca Mon Jun 4 13:01:02 2007 From: python at zesty.ca (Ka-Ping Yee) Date: Mon, 4 Jun 2007 06:01:02 -0500 (CDT) Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Fri, 25 May 2007, Guillaume Proux wrote: > If you are really paranoid to see evil chars take over your > python src dir On Sun, 3 Jun 2007, Stephen J. Turnbull wrote: > I do not agree with Ka-Ping inter alia that there are bogeymen > under the bed Sigh. I have lost count of the number of times I (and others by association) have been labelled "paranoid" or something similar in this discussion, and I am now asking you all to put a stop to it. Name-calling isn't going to do us any good. (I am sorry that this is in reply to your message, Stephen -- your message above is one of the gentlest of the lot; it just happens to be the most recent, and I have finally been pushed over the edge into saying something about it.) Please: can we all stick to statements about usage, problems, and solutions, not about the personalities of those who propose them? Here is what I have to say (to everyone in this discussion, not specifically to you, Stephen) in response to said labelling: Many of us value a *predictable* identifier character set. Whether "predictable" means ASCII only, or user-selectable, or restricted by default, I think we all agree in this sentiment: We believe that we should try to make it easier, not harder, for programmers to understand what Python code says. This has many benefits (reliability, readability, transparency, reviewability, debuggability). I consider these core strengths of Python. Python is a source code language. In other languages you share binaries, but in Python you share and directly run source code. 
This is fundamental to its impact on open source, its impact on
education, and its prevalence as an extension language.  That is what
makes these strengths so important.

I hope this helps you understand why these concerns can't and
shouldn't be brushed off as "paranoia" -- this really has to do with
the core values of the language.

-- ?!ng

From python at zesty.ca  Mon Jun  4 13:08:44 2007
From: python at zesty.ca (Ka-Ping Yee)
Date: Mon, 4 Jun 2007 06:08:44 -0500 (CDT)
Subject: [Python-3000] PEP 3131 roundup
Message-ID:

Hi,

Here's a summary of some of the remaining open issues and unaddressed
arguments regarding PEP 3131.  These are the ones I'm familiar with,
so I don't claim this to be complete.  I hope it helps give some
perspective on this huge thread, though.

A.  Should identifiers be allowed to contain any Unicode letter?

    Drawbacks of allowing non-ASCII identifiers wholesale:

    1.  Python will lose the ability to make a reliable round trip to
        a human-readable display on screen or on paper.
        http://mail.python.org/pipermail/python-3000/2007-May/007855.html

    2.  Python will become vulnerable to a new class of security
        exploits; code and submitted patches will be much harder to
        inspect.
        http://mail.python.org/pipermail/python-3000/2007-May/007855.html

    3.  Humans will no longer be able to validate Python syntax.
        http://mail.python.org/pipermail/python-3000/2007-May/007855.html

    4.  Unicode is young; its problems are not yet well understood and
        solved; tool support is weak.
        http://mail.python.org/pipermail/python-3000/2007-May/007855.html

    5.  Languages with non-ASCII identifiers use different character
        sets and normalization schemes; PEP 3131's choices are
        non-obvious.
        http://mail.python.org/pipermail/python-3000/2007-May/007947.html
        http://mail.python.org/pipermail/python-3000/2007-May/007725.html

    6.  The Unicode bidi algorithm yields an extremely confusing
        display order for RTL text when digits or operators are nearby.
        http://www.w3.org/International/iri-edit/draft-duerst-iri.html#anchor5
        http://mail.python.org/pipermail/python-3000/2007-May/007823.html

B.  Should the default behaviour accept only ASCII identifiers, or
    should it accept identifiers containing non-ASCII characters?

    Arguments for ASCII only by default:

    1.  Non-ASCII identifiers by default make common
        practice/assumptions subtly/unknowingly wrong; rarely wrong is
        worse than obviously wrong.
        http://mail.python.org/pipermail/python-3000/2007-May/007992.html
        http://mail.python.org/pipermail/python-3000/2007-May/008009.html
        http://mail.python.org/pipermail/python-3000/2007-May/007961.html

    2.  Better to raise a warning than to fail silently when
        encountering a probably unexpected situation.
        http://mail.python.org/pipermail/python-3000/2007-May/007993.html
        http://mail.python.org/pipermail/python-3000/2007-May/007945.html

    3.  All of current usage is ASCII-only; the vast majority of
        future usage will be ASCII-only.
        http://mail.python.org/pipermail/python-3000/2007-May/007952.html
        http://mail.python.org/pipermail/python-3000/2007-May/007927.html

    4.  It is the pockets of Unicode adoption that are parochial, not
        the ASCII advocates.
        http://mail.python.org/pipermail/python-3000/2007-May/008010.html

    5.  Python should audit for ASCII-only identifiers for the same
        reasons that it audits for tab-space consistency.
        http://mail.python.org/pipermail/python-3000/2007-May/007942.html

    6.  Incremental change is safer.
        http://mail.python.org/pipermail/python-3000/2007-May/008000.html

    7.  An ASCII-only default favors open-source development and
        sharing of source code.
        http://mail.python.org/pipermail/python-3000/2007-May/007988.html
        http://mail.python.org/pipermail/python-3000/2007-May/007990.html

    8.  Existing projects won't have to waste any brainpower worrying
        about the implications of Unicode identifiers.
        http://mail.python.org/pipermail/python-3000/2007-May/007957.html

C.  Should non-ASCII identifiers be optional?

    Various voices in support of a flag (although there's been debate
    over which should be the default, no one seems to be saying that
    there shouldn't be an off switch):

    http://mail.python.org/pipermail/python-3000/2007-May/007855.html
    http://mail.python.org/pipermail/python-3000/2007-May/007916.html
    http://mail.python.org/pipermail/python-3000/2007-May/007923.html
    http://mail.python.org/pipermail/python-3000/2007-May/007935.html
    http://mail.python.org/pipermail/python-3000/2007-May/007948.html

D.  Should the identifier character set be configurable?

    Various voices proposing and supporting a selectable character
    set, so that users can get all the benefits of using their own
    language without the drawbacks of confusable/unfamiliar
    characters:

    http://mail.python.org/pipermail/python-3000/2007-May/007890.html
    http://mail.python.org/pipermail/python-3000/2007-May/007896.html
    http://mail.python.org/pipermail/python-3000/2007-May/007935.html
    http://mail.python.org/pipermail/python-3000/2007-May/007950.html
    http://mail.python.org/pipermail/python-3000/2007-May/007977.html
    http://mail.python.org/pipermail/python-3000/2007-May/007957.html
    http://mail.python.org/pipermail/python-3000/2007-May/008038.html
    http://mail.python.org/pipermail/python-3000/2007-June/008121.html

E.  Which identifier characters should be allowed?

    1.  What to do about bidi format control characters?
        http://mail.python.org/pipermail/python-3000/2007-May/007750.html
        http://mail.python.org/pipermail/python-3000/2007-May/007823.html
        http://mail.python.org/pipermail/python-3000/2007-May/007826.html

    2.  What about other ID_Continue characters?  What about
        characters that look like punctuation?  What about other
        recommendations in UTS #39?  What about mixed-script
        identifiers?
        http://mail.python.org/pipermail/python-3000/2007-May/007836.html

F.  Which normalization form should be used, NFC or NFKC?

    http://mail.python.org/pipermail/python-3000/2007-May/007995.html

G.  Should source code be required to be in normalized form?

    http://mail.python.org/pipermail/python-3000/2007-May/007997.html
    http://mail.python.org/pipermail/python-3000/2007-June/008137.html

-- ?!ng

From python at zesty.ca  Mon Jun  4 13:11:50 2007
From: python at zesty.ca (Ka-Ping Yee)
Date: Mon, 4 Jun 2007 06:11:50 -0500 (CDT)
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <4662F639.2070806@v.loewis.de>
References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de>
Message-ID:

On Sun, 3 Jun 2007, "Martin v. Löwis" wrote:
> >> All identifiers are converted into the normal form NFC while parsing;
> >
> > Actually, shouldn't the whole file be converted to NFC, instead of
> > only identifiers?  If you have decomposable characters in strings and
> > your editor decides to normalize them to a different form than in the
> > original source, the meaning of the code will change when you save
> > without you noticing anything.
>
> Sure - but how can Python tell whether a non-normalized string was
> intentionally put into the source, or as a side effect of the editor
> modifying it?

It seems to me the simplest thing to do is to require that Python
source files be normalized.
Then the ambiguity just goes away. Everyone knows what form their files should be in, and if you really need to construct a non-normalized string, you can do that explicitly using "\u" notation. -- ?ng From ncoghlan at gmail.com Mon Jun 4 14:12:35 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 04 Jun 2007 22:12:35 +1000 Subject: [Python-3000] Substantial rewrite of PEP 3101 In-Reply-To: <4663EB6E.4080302@trueblade.com> References: <466310FC.8020707@acm.org> <4663EB6E.4080302@trueblade.com> Message-ID: <466401B3.3060904@gmail.com> Eric V. Smith wrote: > > Formatter.ALLOW_LEADING_UNDERSCORES > > Formatter.CHECK_UNUSED_POSITIONAL > > Formatter.CHECK_UNUSED_NAME > I'm not sure I'm wild about these flags which would have to be or'd > together, as opposed to discrete parameters. I realize have a single > flag field is likely more extensible, but my impression of the > standard library is a move away from bitfield flags. Perhaps that's > only in my own mind, though! > > Also, why put this in the base class at all? These could all be > implemented in a derived class (or classes), which would leave the > base class state-free and therefore without a constructor. I think the dict/defaultdict cooperative implementation based on the __missing__ method is a good guide to follow here. Instead of having flags to the constructor, instead define methods that the base class invokes to deal with the relevant checks - subclasses can then override them as they see fit. A couple of possible method signatures: def allowed_name(self, name): "Return True if name is allowed, False otherwise" # default implementation return False if name starts with '_' def allow_unused(self, unused_args, unused_kwds): "Return True if unused args/names are allowed, False otherwise" # default implementation always returns True Subclasses can then either return False to get a standard 'disallowed' exception, or else raise their own exception explicitly. A few common alternate implementations of the latter method would be: def allow_unused(self, unused_args, unused_kwds): # All positional arguments must be used return not unused_args def allow_unused(self, unused_args, unused_kwds): # All keyword arguments must be used return not unused_kwds def allow_unused(self, unused_args, unused_kwds): # All arguments must be used return not unused_args and not unused_kwds Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From murman at gmail.com Mon Jun 4 15:04:58 2007 From: murman at gmail.com (Michael Urman) Date: Mon, 4 Jun 2007 08:04:58 -0500 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 6/4/07, Ka-Ping Yee wrote: > Many of us value a *predictable* identifier character set. > Whether "predictable" means ASCII only, or user-selectable, or > restricted by default, I think we all agree in this sentiment: As someone who would rather see non-ASCII characters gain even ground, even I agree with that sentiment. The rest of your message - stressing that we should make things easier to understand and the importance of source code - strikes a very strong chord with me. 
However to me it sounds like an argument to allow Unicode identifiers,
not one to prevent them.

I think that's the biggest problem with this exchange.  We have
similar goals but disagree about which option does a better job
fulfilling those goals.  All the rhetoric from all sides about why the
shared goals are good won't convince anyone of anything new.

The arguments then feel reduced to "Unicode enhances readability" vs.
"Unicode impedes readability" and since clearly it does both, how do
we make the value judgement about which it does more?  How do we weigh
the ability to use native language identifiers against the risk that
there will be visually indistinguishable differences introduced?

Michael
--
Michael Urman

From showell30 at yahoo.com  Mon Jun  4 15:28:56 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Mon, 4 Jun 2007 06:28:56 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To:
Message-ID: <717682.96972.qm@web33510.mail.mud.yahoo.com>

--- Michael Urman wrote:
>
> The arguments then feel reduced to "Unicode enhances readability" vs.
> "Unicode impedes readability" and since clearly it does both, how do
> we make the value judgement about which it does more?  How do we
> weigh the ability to use native language identifiers against the
> risk that there will be visually indistinguishable differences
> introduced?
>

I think offering some Unicode examples will enhance the "Unicode
enhances readability" argument.  Martin recently posted a small
example program written in German.  As a German non-reader, I still
found it pretty easy to read, with a little bit of effort.
Interestingly, the one word that I wasn't able to translate, even with
the help of Babelfish, was the German word for "insert."  It turns out
the thing that threw me off was that I omitted the umlaut.  That was a
bit of an epiphany for me.

I'd also be interested in actual testimonials from teachers, Dutch tax
lawyers, etc., that they will embrace this feature.

I hate to make a decision by majority rule, but I think there is the
argument that you need to weigh the population of ascii-literate
people vs. ascii-illiterate people.  (I don't mean ascii-illiterate as
any kind of a slam; I just think that's really the target audience for
this feature.  I am kanji-illiterate, but I am also not lobbying for
any kanji programming languages to be more ascii-friendly.)

(I also recognize that Guido did get quite a few testimonials from
Unicode users that suggest they embrace this idea, but I haven't seen
much in the last couple weeks.)

From jimjjewett at gmail.com  Mon Jun  4 16:05:13 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 4 Jun 2007 10:05:13 -0400
Subject: [Python-3000] Conservative Defaults (was: Re: Support for PEP 3131)
In-Reply-To: <740c3aec0706031858q1dde5ccbxb39d4a808902f5d2@mail.gmail.com>
References: <740c3aec0706031858q1dde5ccbxb39d4a808902f5d2@mail.gmail.com>
Message-ID:

On 6/3/07, BJörn Lindqvist wrote:

[Most deleted, Stephen Turnbull already answered better than I knew,
let alone could write]

> > The same one-step-at-a-time reasoning applies to unicode identifiers.
> > Allowing IDs in your native language (or others that you explicitly
> > approve) is probably a good step.  Allowing IDs in *any* language by
> > default is probably going too far.
> If you set different native languages won't you get the exact same
> problems that codepages caused and that unicode was invented to solve?

Not at all; if anything, it is the opposite.

(1)  Those different code pages were mainly used for text, not
programming logic.  No one has suggested (re-)limiting comments or
even (continuing to limit) strings.

(2)  The biggest problem that I saw in practice was partial overlap;
people would assume WYSIWYG, and the different code pages were close
enough (mostly overlapping in ASCII) that they didn't usually need to
use the same code page -- but then when the differences did bite, they
were harder to notice.

If you happen to use both Sanskrit and Ethiopic, you can set your own
computer to accept both.  The only catch is that you probably can't
share the Sanskrit with the Coptic community (or vice versa), unless
at least one of the following is true:

(2a)  The code itself (not comments or strings) is in ASCII, so both
can read it.  Note that this is already the recommended policy for
shared code.

or

(2b)  The people you are sharing with trust you enough to add your
script as an acceptable alternate.  (Again, preferably a simple
one-time step -- but an explicit decision.)

or

(2c)  The people you are sharing with have already decided to accept
Sanskrit (or Coptic) because other people they trusted were using it,
and said it was safe.

The existence of 2b and 2c relies on the "consenting adults" policy,
but they encourage "informed consent".  I wouldn't be surprised to
discover that Latin-1, Sanskrit, Coptic, and the Japanese characters
were all OK with me.  That still wouldn't mean I want to allow
Cyrillic (which carries more confusable risk).  I already know I don't
want to auto-allow FF10-FF19 (fullwidth ASCII numbers[1]), simply
because I don't see any good (non-presentational) reason to use them
in place of the normal ASCII numbers -- so the more likely result of
using them is confusion.

Adding one script (or character range) at a time lets me add things
that I (or people I trust) think are reasonable.  Turning unicode on
or off with a single blunt switch does not.

-jJ

[1] Yes, the fullwidth ASCII variants are allowed as ID characters
according to both the unicode ID_* and XID_ properties, which means
they are allowed by the current draft.

From talin at acm.org  Mon Jun  4 18:34:47 2007
From: talin at acm.org (Talin)
Date: Mon, 04 Jun 2007 09:34:47 -0700
Subject: [Python-3000] Substantial rewrite of PEP 3101
In-Reply-To: <4663EB44.1010507@trueblade.com>
References: <466310FC.8020707@acm.org> <4663EB44.1010507@trueblade.com>
Message-ID: <46643F27.2040804@acm.org>

Eric V. Smith wrote:
> > Formatter Creation and Initialization
> >
> >     The Formatter class takes a single initialization argument, 'flags':
> >
> >         Formatter(flags=0)
> >
> >     The 'flags' argument is used to control certain subtle behavioral
> >     differences in formatting that would be cumbersome to change via
> >     subclassing.  The flags values are defined as static variables
> >     in the "Formatter" class:
> >
> >         Formatter.ALLOW_LEADING_UNDERSCORES
> >
> >             By default, leading underscores are not allowed in
> >             identifier lookups (getattr or getitem).  Setting this
> >             flag will allow this.
> >
> >         Formatter.CHECK_UNUSED_POSITIONAL
> >
> >             If this flag is set, any positional arguments which are
> >             supplied to the 'format' method but which are not used by
> >             the format string will cause an error.
> >         Formatter.CHECK_UNUSED_NAME
> >
> >             If this flag is set, any named arguments which are
> >             supplied to the 'format' method but which are not used by
> >             the format string will cause an error.
>
> I'm not sure I'm wild about these flags which would have to be or'd
> together, as opposed to discrete parameters.  I realize having a
> single flag field is likely more extensible, but my impression of the
> standard library is a move away from bitfield flags.  Perhaps that's
> only in my own mind, though!

Making them separate fields is fine if that's easier.  Another
possibility is to make them setter methods rather than constructor
params.

> Also, why put this in the base class at all?  These could all be
> implemented in a derived class (or classes), which would leave the
> base class state-free and therefore without a constructor.

My reason for doing this is as follows.  Certain kinds of
customizations are pretty easy to do via subclassing.  For example,
supporting a default namespace takes only a few lines of code in a
subclass.  Other kinds of customization require replacing a much
larger chunk of code.  Changing the "underscores" and "check-unused"
behavior requires overriding 'vformat', which means replacing the
entire template string parser.  I figured that there would be a lot of
people who might want these features, but didn't want to rewrite all
of vformat.

Now, some of this could be resolved by breaking up vformat into a set
of smaller, overridable functions which controlled these behaviors.
However, I didn't do this because I didn't want the PEP to
micro-manage the implementation of vformat - I wanted to leave you
guys some leeway as to design choices.

For example, I had thought perhaps to break out a separate method that
would just do the parsing of a replacement field (the part inside the
brackets) - so in other words, you'd have one function that recognizes
the start of a replacement field, which then calls a method which
consumes the contents of that field, and so on.  You could also break
that up into two pieces, one which recognizes the field reference, and
one which recognizes the conversion string.

However, these various parsing functions aren't entirely isolated from
each other.  The various parsers would need to pass the current parse
position (character iterator or whatever) and other state back and
forth.  Exposing this requires codifying in the API a lot of the
internal state of parsing.  Also, the syntax defining the end of a
replacement field is a mirror of the syntax that starts one; and
conversion specs can contain replacement fields too.  Which means that
the various parsing methods aren't entirely independent.  (Although I
think that in your earlier proposal, the syntax for 'internal'
replacement fields inside conversion specifiers was always the same,
regardless of the markup syntax chosen.)

What I wanted to avoid in the PEP was having to specify how all of
these different parts fit together and the exact nature of the
parameters being passed between them.  And I think that even if we do
break up vformat this way, we still end up with people having to
replace a fairly substantial chunk of code in order to change the
behaviors represented by these flags.
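To make the trade-off concrete, here is a deliberately tiny sketch of
the hook-method alternative.  The allowed_name/allow_unused names are
taken from Nick Coghlan's earlier message, not from the PEP, and the
toy vformat below only understands bare {0}/{name} fields with none of
the conversion machinery, so this illustrates the subclassing story
rather than an implementation:

import re

class Formatter:
    field_re = re.compile(r'\{([^{}]*)\}')

    def format(self, format_string, *args, **kwargs):
        return self.vformat(format_string, args, kwargs)

    def vformat(self, format_string, args, kwargs):
        used_args, used_kwds = set(), set()
        def replace(match):
            name = match.group(1)
            if not self.allowed_name(name):
                raise ValueError('field name %r not allowed' % name)
            if name.isdigit():
                used_args.add(int(name))
                return str(args[int(name)])
            used_kwds.add(name)
            return str(kwargs[name])
        result = self.field_re.sub(replace, format_string)
        unused_args = set(range(len(args))) - used_args
        unused_kwds = set(kwargs) - used_kwds
        if not self.allow_unused(unused_args, unused_kwds):
            raise ValueError('unused arguments: %r, %r'
                             % (sorted(unused_args), sorted(unused_kwds)))
        return result

    # The two behavioral hooks, replacing the bit flags:

    def allowed_name(self, name):
        "Return True if a field name may be looked up."
        return not name.startswith('_')

    def allow_unused(self, unused_args, unused_kwds):
        "Return True if leftover arguments are acceptable."
        return True

class StrictFormatter(Formatter):
    def allow_unused(self, unused_args, unused_kwds):
        # Equivalent of CHECK_UNUSED_POSITIONAL plus CHECK_UNUSED_NAME
        return not unused_args and not unused_kwds

With this, StrictFormatter().format('{greeting}, {0}', 'world',
greeting='Hello') succeeds, while passing an extra unused argument
raises instead of passing silently - and neither behavior change
required touching the parser.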
> > Formatter Methods
> >
> >     The methods of class Formatter are as follows:
> >
> >         -- format(format_string, *args, **kwargs)
> >         -- vformat(format_string, args, kwargs)
> >         -- get_positional(args, index)
> >         -- get_named(kwds, name)
> >         -- format_field(value, conversion)
>
> I've started a sample implementation to test this API.  For starters,
> I'm writing it in pure Python, but my intention is to use the code in
> the pep3101 sandbox once I have some tests written and we're happy
> with the API.

Cool.

-- Talin

From jcarlson at uci.edu  Mon Jun  4 21:43:13 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Mon, 04 Jun 2007 12:43:13 -0700
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <717682.96972.qm@web33510.mail.mud.yahoo.com>
References: <717682.96972.qm@web33510.mail.mud.yahoo.com>
Message-ID: <20070604090135.6F26.JCARLSON@uci.edu>

Steve Howell wrote:
> --- Michael Urman wrote:
> >
> > The arguments then feel reduced to "Unicode enhances readability"
> > vs. "Unicode impedes readability" and since clearly it does both,
> > how do we make the value judgement about which it does more?  How
> > do we weigh the ability to use native language identifiers against
> > the risk that there will be visually indistinguishable differences
> > introduced?
>
> I think offering some Unicode examples will enhance the "Unicode
> enhances readability" argument.  Martin recently posted a small
> example program written in German.  As a German non-reader, I still
> found it pretty easy to read, with a little bit of effort.
> Interestingly, the one word that I wasn't able to translate, even
> with the help of Babelfish, was the German word for "insert."  It
> turns out the thing that threw me off was that I omitted the umlaut.
> That was a bit of an epiphany for me.

Maybe I'm worse with languages than other people are; it wouldn't
surprise me terribly.  I had some difficulty, primarily because I
didn't try to translate it (as such would be quite difficult with
longer programs and other languages).

Here is some code borrowed right from the Python standard library.
I've gone ahead and mangled names in a consistent fashion using the
tokenize module.  Can you guess what it does?

class RTrCOlOrB :
    nBBjIUrB =0

    def __init__ (self ,uX ,nBBjIUrB =1 ):
        self .uX =uX
        self .nCIZj =[]# KAzWn ezWQ
        self .rBGBr =0
        self .rInC =0
        if nBBjIUrB :
            self .nBBjIUrB =1
            self .nCIAC =self .uX .tell ()
            self .XznnCIZj =[]# KAzWn ezWQ

    def tell (self ):
        if self .rBGBr >0 :
            return self .rInCXzn
        return self .uX .tell ()-self .nCIAC

    def nBBj (self ,Xzn ,WDBQZB =0 ):
        DBAB =self .tell ()
        if WDBQZB :
            if WDBQZB ==1 :
                Xzn =Xzn +DBAB
            elif WDBQZB ==2 :
                if self .rBGBr >0 :
                    Xzn =Xzn +self .rInCXzn
                else :
                    raise Error ,"ZIQ'C TnB WDBQZB=2 yBC"
        if not 0 <=Xzn <=DBAB or self .rBGBr >0 and Xzn >self .rInCXzn :
            raise Error ,'UIe RTrCOlOrB.nBBj() ZIrr'
        self .uX .seek (Xzn +self .nCIAC )
        self .rBGBr =0
        self .rInC =0

> I hate to make a decision by majority rule, but I think there is the
> argument that you need to weigh the population of ascii-literate
> people vs. ascii-illiterate people.

That's a very poor criterion, as not everyone in the world is a
potential programmer (despite what the BASIC folks tried to do).
Further, of those that become programmers in *any* substantial
programming language today, 100% of them learn ascii.  Even Java,
which has been touted here as being the premier language for allowing
unicode identifiers (yes, a bit of hyperbole), requires ascii to
access the java libraries.
This will be the case for the foreseeable future in *any* programming
language of substantial use worldwide (regardless of what Python does
regarding unicode identifiers).

Since the PEP does not discuss the localization of every name in the
Python standard library (nor the builtins, __magic__ methods, etc.),
people are *still* going to need to learn the latin alphabet, at least
as much to distinguish and use Python keywords, builtins, and the
standard library.

With that said, the only question I believe that really matters in
this discussion is:

* Where would you use unicode identifiers if they were available in
Python?  Open source, closed source, personal projects?

Since everyone needs to learn ascii to use Python anyways, for the
ability to share, ascii will continue to dominate regardless of
potentially substantial closed source and personal project use.  This
has been seen (according to various reports available in this list) in
the Java world*.

As for closed source or personal projects, as long as we offer people
the ability to use unicode identifiers (since PEP 3131 is accepted,
this will happen), I don't see that there is any problem being
conservative in our choice of default.  If we discover that ascii
defaults are wrong, we can always add unicode defaults later.  The
converse is not the case.

As I have stated before; offer people the ability to easily add
character sets that they want to see and allow to execute (I would be
happy to write an internationalizable interactive command-line and
wxPython interface for whatever method we choose), and those who want
to use non-ascii identifiers can do so.

- Josiah

* There also seems to be a limited amount of information (available to
us) regarding how well known Java unicode identifiers are.  We hear
reports from some that no one knows of unicode identifiers, but then
we hear about closed Java source using them in China and abroad, and
BJörn Lindqvist saying that unicode identifiers were mentioned in the
two Swedish Java books he read.

From martin at v.loewis.de  Mon Jun  4 21:56:38 2007
From: martin at v.loewis.de (=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 04 Jun 2007 21:56:38 +0200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To:
References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de>
Message-ID: <46646E76.8060804@v.loewis.de>

> It seems to me the simplest thing to do is to require that Python
> source files be normalized.  Then the ambiguity just goes away.
> Everyone knows what form their files should be in, and if you really
> need to construct a non-normalized string, you can do that explicitly
> using "\u" notation.

However, what would that mean wrt. non-Unicode source encodings?
Say you have a Latin-1-encoded source code.  Is that in NFC or not?
Regards,
Martin

From jimjjewett at gmail.com  Mon Jun  4 22:50:09 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 4 Jun 2007 16:50:09 -0400
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <46646E76.8060804@v.loewis.de>
References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de> <46646E76.8060804@v.loewis.de>
Message-ID:

On 6/4/07, "Martin v. Löwis" wrote:
> > It seems to me the simplest thing to do is to require that Python
> > source files be normalized.  Then the ambiguity just goes away.
> > Everyone knows what form their files should be in, and if you really
> > need to construct a non-normalized string, you can do that explicitly
> > using "\u" notation.

> However, what would that mean wrt. non-Unicode source encodings?
> Say you have a Latin-1-encoded source code.  Is that in NFC or not?

Doesn't that depend on whether they happened to ever write some of the
combined characters (such as ö) using a two-character form, like an o
followed by a combining diaeresis?

FWIW, I would prefer "the parser will normalize" to "the parser will
reject unnormalized", to support even the dumbest of editors.

-jJ

From martin at v.loewis.de  Mon Jun  4 22:58:11 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 04 Jun 2007 22:58:11 +0200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To:
References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de> <46646E76.8060804@v.loewis.de>
Message-ID: <46647CE3.8070200@v.loewis.de>

>> Say you have a Latin-1-encoded source code.  Is that in NFC or not?
>
> Doesn't that depend on whether they happened to ever write some of the
> combined characters (such as ö) using a two-character form, like an o
> followed by a combining diaeresis?

No.  Latin-1 does not support that form; the concept does not exist in
that encoding.  When converting to a UCS representation, it's the
codec's choice to either produce a pre-composed or decomposed form.

Regards,
Martin

From dima at hlabs.spb.ru  Mon Jun  4 17:18:38 2007
From: dima at hlabs.spb.ru (Dmitry Vasiliev)
Date: Mon, 04 Jun 2007 19:18:38 +0400
Subject: [Python-3000] example Python code under PEP 3131?
In-Reply-To: <46639F47.4070909@v.loewis.de>
References: <315672.7992.qm@web33512.mail.mud.yahoo.com> <46639F47.4070909@v.loewis.de>
Message-ID: <46642D4E.807@hlabs.spb.ru>

Martin v. Löwis wrote:
>> Can somebody post a few examples of what Python code
>> would look like under PEP 3131?  Maybe 10-to-15 line
>> programs that illustrate the following use cases.
>
> class Liste:
>     def __init__(self):
>         self.erstes_element = None
>
>     def einfügen(self, objekt):
>         if not self.erstes_element:
>             self.erstes_element = Element(objekt)
>         else:
>             zeiger = self.erstes_element
>             while zeiger.nächstes_element:
>                 zeiger = zeiger.nächstes_element
>             zeiger.nächstes_element = Element(objekt)
>
>     def löschen(self, objekt):
>         if self.erstes_element.wert == objekt:
>             self.erstes_element = self.erstes_element.nächstes_element
>         else:
>             zeiger = self.erstes_element
>             while zeiger.nächstes_element:
>                 if zeiger.nächstes_element.wert == objekt:
>                     zeiger.nächstes_element = \
>                         zeiger.nächstes_element.nächstes_element
>                     return
>                 zeiger = zeiger.nächstes_element

I think the example above isn't so cool because, except for three
characters with umlauts, it's just plain ASCII, so you can write
almost the same code in the current Python.  I guess the following
example in Russian is more striking:

def ????????_??_???????_?_???????_?????(???_?????):
    ???? = open(???_?????, "rb")
    for ?????? in ????:
        yield ??????.split()

While I can understand the code above I have mixed feelings about it,
but I think it is better than any code written in broken English.
Many years ago I saw code with functions named 'wright_*', 'writi_*',
'wrete_*' instead of 'write_*'.

--
Dmitry Vasiliev
http://hlabs.spb.ru

From showell30 at yahoo.com  Tue Jun  5 00:28:40 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Mon, 4 Jun 2007 15:28:40 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070604090135.6F26.JCARLSON@uci.edu>
Message-ID: <996481.46939.qm@web33502.mail.mud.yahoo.com>

--- Josiah Carlson wrote:
> Here is some code borrowed right from the Python standard library.
> I've gone ahead and mangled names in a consistent fashion using the
> tokenize module.
> Can you guess what it does?
>
> class RTrCOlOrB :
>     nBBjIUrB =0
>
>     def __init__ (self ,uX ,nBBjIUrB =1 ):
>         self .uX =uX
>         self .nCIZj =[]# KAzWn ezWQ
>         self .rBGBr =0
>         self .rInC =0
>         if nBBjIUrB :
>             self .nBBjIUrB =1
>             self .nCIAC =self .uX .tell ()
>             self .XznnCIZj =[]# KAzWn ezWQ
>
> [...]

At first glance, no, although obviously it has something to do with
randomly accessing a file.  If I were trying to reverse engineer this
code back to English, the first thing I'd do is use tokenize to mangle
the tokens back to consistent, easy to pronounce, relatively
meaningless English words like aardvark, bobble, dog_chow, fredness,
parplesnarper, etc., as XznnCIZj doesn't have even a false cognate to
hook on to in my brain.

From greg.ewing at canterbury.ac.nz  Tue Jun  5 01:39:34 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 05 Jun 2007 11:39:34 +1200
Subject: [Python-3000] example Python code under PEP 3131?
In-Reply-To: <471885.91417.qm@web33503.mail.mud.yahoo.com>
References: <471885.91417.qm@web33503.mail.mud.yahoo.com>
Message-ID: <4664A2B6.903@canterbury.ac.nz>

Steve Howell wrote:
> einfugen = in joints (????)

Maybe "join in" (as a verb)?

--
Greg

From greg.ewing at canterbury.ac.nz  Tue Jun  5 01:50:13 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 05 Jun 2007 11:50:13 +1200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <717682.96972.qm@web33510.mail.mud.yahoo.com>
References: <717682.96972.qm@web33510.mail.mud.yahoo.com>
Message-ID: <4664A535.1070505@canterbury.ac.nz>

Steve Howell wrote:
> the one word that I wasn't able to translate, even with the help of
> Babelfish, was the German word for "insert."  It turns out the thing
> that threw me off was that I omitted the umlaut.

Although that probably wouldn't be such a big problem for a native
German speaker, who I guess would still be able to recognise what was
meant.

--
Greg

From showell30 at yahoo.com  Tue Jun  5 03:34:12 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Mon, 4 Jun 2007 18:34:12 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <4664A535.1070505@canterbury.ac.nz>
Message-ID: <529517.11234.qm@web33502.mail.mud.yahoo.com>

--- Greg Ewing wrote:
> Steve Howell wrote:
> > the one word that I wasn't able to translate, even with the help
> > of Babelfish, was the German word for "insert."  It turns out the
> > thing that threw me off was that I omitted the umlaut.
>
> Although that probably wouldn't be such a big problem for a native
> German speaker, who I guess would still be able to recognise what
> was meant.

Sure, but my point was not so much whether the umlaut improved clarity
for German readers; my point was that it would also improve clarity
for non-German readers aided by Babelfish.

But I do think the experiment of me reading the German code was
weakened by the similarity of German to English; plus, the code was
small enough that the intent of the code was just plain obvious from
the overall logical structure.
From showell30 at yahoo.com  Tue Jun  5 04:33:46 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Mon, 4 Jun 2007 19:33:46 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070604090135.6F26.JCARLSON@uci.edu>
Message-ID: <762386.78677.qm@web33515.mail.mud.yahoo.com>

--- Josiah Carlson wrote:
>
> > I hate to make a decision by majority rule, but I
> > think there is the argument that you need to weigh the
> > population of ascii-literate people vs.
> > ascii-illiterate people.
>
> That's a very poor criteria, as not everyone in the
> world is a potential
> programmer (despite what the BASIC folks tried to
> do).

I didn't think that I needed to call out, for both groups, the criterion
that potential Python programmers need the aptitude and desire to learn
programming in general, but of course you're correct.

>
> Since the PEP does not discuss the localization of
> every name in the
> Python standard library (nor the builtins, __magic__
> methods, etc.),
> people are *still* going to need to learn the latin
> alphabet, at least
> as much to distinguish and use Python keywords,
> builtins, and the
> standard library.
>

I agree with that 100%. Unless you internationalize Python completely
for certain languages [1], I think anybody coming to Py3K, even with PEP
3131 accepted, will still need first-semester familiarity with English,
or at least an English-like language, to be able to use Python
effectively.

In certain parts of the United States we have the concept of "restaurant
Spanish" that native English speakers need to learn when they wait
tables. I think there's something like "Python English" that you need to
learn to start writing Python, and it's a pretty small subset of the
whole language, but the alphabet's a pretty key part of it.

Cheers,
Steve

[1] - ...but regarding fully internationalizing Python in Asia, see this
post from Ryan Ginstrom (Japanese-to-English translator):
http://mail.python.org/pipermail/python-list/2007-June/443862.html

From jimjjewett at gmail.com  Tue Jun  5 04:37:31 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 4 Jun 2007 22:37:31 -0400
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
Message-ID:

Ligatures, such as Ĳ and ĳ (unicode 0x0132, 0x0133) are considered
acceptable identifier characters unless explicitly tailored out. (They
appear in both ID and XID)

Do we really want this, or should we assume that ĳ and ij should be
equivalent? If so, then we need to enforce this somehow.

To me, this suggests that we should use the NFKD form. Examples at
http://www.unicode.org/reports/tr15/tr15-28.html show that only the
compatibility decomposition forms split ﬁ (ligature 0xFB01) into the
constituents f and i. Kompatibility form is needed to merge characters
that are "the same" except for some presentational quirk, such as being
superscripted or half-width.

The PEP assumes NFC, but I haven't really understood why, unless that is
required for compatibility with other systems (in which case, it should
be made explicit).
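For what it's worth, the difference is easy to see with the stdlib
unicodedata module (a minimal sketch; the sample strings are just
illustrative):

    import unicodedata

    ligature = '\ufb01ne'  # 'fine' spelled with U+FB01 LATIN SMALL LIGATURE FI
    digraph = '\u0133'     # U+0133 LATIN SMALL LIGATURE IJ
    for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
        # The canonical forms (NFC/NFD) leave both ligatures alone;
        # only the compatibility forms (NFKC/NFKD) split them apart.
        print(form, unicodedata.normalize(form, ligature),
              unicodedata.normalize(form, digraph))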
-jJ

From talin at acm.org  Tue Jun  5 04:45:51 2007
From: talin at acm.org (Talin)
Date: Mon, 04 Jun 2007 19:45:51 -0700
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: References:
Message-ID: <4664CE5F.3040204@acm.org>

Ka-Ping Yee wrote:
> Hi,
>
> Here's a summary of some of the remaining open issues and unaddressed
> arguments regarding PEP 3131.  These are the ones I'm familiar with,
> so I don't claim this to be complete.  I hope it helps give some
> perspective on this huge thread, though.

Thanks so much for this excellent roundup from the RoundUp Master :)
Seriously, I've been staying well away from the PEP 3131 threads, and I
was hoping that someone would post a summary of the issues so I could
catch up.

I'd like to make a couple of modest proposals on the PEP 3131 issue that
I'm hoping will short-circuit some parts of this discussion.

1) My first proposal is that someone - one of the PEP 3131 advocates
probably - create a set of patches, or possibly a branch, that
implements unicode identifiers in whatever manner they think is
appropriate. Write some actual code instead of just talking about it.
This fork will consist of a Python interpreter with a different name -
let's call it 'upython' for 'unicode python'.

These same PEP 3131 advocates should also distribute precompiled
packages containing the upython interpreter. For simplicity, it is OK to
assume that regular Python is already installed as a prerequisite. The
'upython' interpreter can live in the same binary directory as regular
python.

The students who want to learn Python with Japanese identifiers can
easily be taught to run 'upython' instead of 'python'. Since upython
runs regular python scripts, they still have access to all of the
regular python libraries and extension modules.

Once upython becomes available to the public, it will be the goal of the
3131 advocates to get widespread adoption of upython. If there is much
adoption, then that makes a strong argument for merging those features
into regular python. On the other hand, if there is little adoption,
then that's an argument to either maintain it as a fork, or drop it
altogether.

In other words - instead of endless discussions of hypotheticals, let
people vote with their feet. Because I can already tell that as far as
this mailing list goes, there will never be a consensus on this issue,
due to basic value differences.

2) My second proposal is to drop all discussions of bidirectional
support, since I think it's a red herring. So far, I haven't heard
anyone whose native language is RTL lobbying for support of their
language. Most of the vocal proponents of 3131 have been mainly
concerned with asian languages. The people who are mainly bringing up
the issue of Bidi are the people arguing against 3131, using it as the
basis of an "excluded middle" argument that says that since it's too
difficult to do Bidi properly, then it's too difficult to do unicode
identifiers.

Yes, it may be technically "unfair" to certain ethnic groups to not
support Bidi, but frankly, I don't see why the python-dev community has
to solve all of the world's problems in one go. I would even go so far
as to say that it's OK to drop support for any languages that are "hard
to do". (Note that I've done a fair bit of work supporting Bidi in my
previous job, so I at least have a passing familiarity with the issues
involved.)
-- Talin

From foom at fuhm.net  Tue Jun  5 04:48:26 2007
From: foom at fuhm.net (James Y Knight)
Date: Mon, 4 Jun 2007 22:48:26 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070604090135.6F26.JCARLSON@uci.edu>
References: <717682.96972.qm@web33510.mail.mud.yahoo.com> <20070604090135.6F26.JCARLSON@uci.edu>
Message-ID:

On Jun 4, 2007, at 3:43 PM, Josiah Carlson wrote:
> Here is some code borrowed right from the Python standard library.
> I've
> gone ahead and mangled names in a consistent fashion using the
> tokenize
> module.  Can you guess what it does?

Nope, it's absolutely inscrutable. And actually, after I found this
module in the stdlib and read the same excerpt in english, I *still*
couldn't figure out what it was doing. (it's in multifile.py, btw). Of
course, the given excerpt doesn't really do anything useful (without the
rest of the class), which doesn't help things.

Anyhow, if it was in a human language, I'd paste it into an online
translator. e.g. from another recent message:

> def итератор_по_токенам_в_строках_файла(имя_файла):
>     файл = open(имя_файла, "rb")
>     for строка in файл:
>         yield строка.split()

pasted verbatim right into google translator results in:

> def iterator_po_tokenam_v_strokah_fayla (filename) : file = open
> (filename, "rb") for strings in the file : stroka.split yield ()

Not entirely successful -- it's not built to translate code, of course.
:) Let's try some of those phrases again:
"итератор по токенам в строках файла" -> "standard for token lines in
the file". Hm, I liked "iterator" better than "standard" there, but
okay. So, this is supposed to iterate tokens from lines in a file. Okay.
"строка" -> "line". All right, I think I've got it.

In fact, translation is *much* *easier* when the code in the other
language is spelled with the proper characters of that language, instead
of some random romanization. I'd have extremely little hope of being
able to convert a romanization of russian into real russian in order to
be able to translate it into english.

So, all things considered, allowing russian identifiers is a huge plus
for my ability to read russian code. +1.

James

From stephen at xemacs.org  Tue Jun  5 05:53:16 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 05 Jun 2007 12:53:16 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <20070604090135.6F26.JCARLSON@uci.edu>
References: <717682.96972.qm@web33510.mail.mud.yahoo.com> <20070604090135.6F26.JCARLSON@uci.edu>
Message-ID: <87lkezxdjn.fsf@uwakimon.sk.tsukuba.ac.jp>

Josiah Carlson writes:

> gone ahead and mangled names in a consistent fashion using the tokenize
> module.  Can you guess what it does?

OK, here's your straight line:

Throw a lot of "AttributeError: rInCXzn is not defined"?

From martin at v.loewis.de  Tue Jun  5 06:10:32 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Tue, 05 Jun 2007 06:10:32 +0200
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: References:
Message-ID: <4664E238.9020700@v.loewis.de>

> The PEP assumes NFC, but I haven't really understood why, unless that
> is required for compatibility with other systems (in which case, it
> should be made explicit).

It's because UAX#31 tells us to use NFC, in section 5

"Generally if the programming language has case-sensitive identifiers,
then Normalization Form C is appropriate; whereas, if the programming
language has case-insensitive identifiers, then Normalization Form KC is
more appropriate."
As Python has case-sensitive identifiers, NFC is appropriate.

Regards,
Martin

From rauli.ruohonen at gmail.com  Tue Jun  5 07:21:37 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Tue, 5 Jun 2007 08:21:37 +0300
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de> <46646E76.8060804@v.loewis.de>
Message-ID:

On 6/4/07, Jim Jewett wrote:
> On 6/4/07, "Martin v. Löwis" wrote:
> > However, what would that mean wrt. non-Unicode source encodings.
>
> > Say you have a Latin-1-encoded source code. Is that in NFC or not?

The path of least surprise for legacy encodings might be for the codecs
to produce whatever is closest to the original encoding if possible.
I.e. what was one code point would remain one code point, and if that's
not possible then normalize. I don't know if this is any different from
always normalizing (it certainly is the same for Latin-1).

Always normalizing would have the advantage of simplicity (no matter
what the encoding, the result is the same), and I think that is the real
path of least surprise if you sum over all surprises.

> FWIW, I would prefer "the parser will normalize" to "the parser will
> reject unnormalized", to support even the dumbest of editors.

Me too, as simple open-save in a dumb editor wouldn't change the
semantics of the code, and if any edits are made where the user expects
for some reason that normalization is not done then the first trial run
will immediately disabuse them of this notion. The behavior is simple to
infer and reliable (at least for "always normalize").

FWIW, I looked at what Java and XML 1.1 do, and they *don't* normalize
for some reason. Java doesn't even normalize identifiers AFAICS, it's
not even mentioned at
http://java.sun.com/docs/books/jls/third_edition/html/lexical.html
and they even process escapes very early (those should certainly not be
normalized, as escapes are the Word of Programmer and meddling with them
will incur holy wrath). XML 1.1 says this:

:XML processors MUST NOT transform the input to be in fully normalized
:form. XML applications that create XML 1.1 output from either XML 1.1
:or XML 1.0 input SHOULD ensure that the output is fully normalized;
:it is not necessary for internal processing forms to be fully
:normalized.
:
:The purpose of this section is to strongly encourage XML processors
:to ensure that the creators of XML documents have properly normalized
:them, so that XML applications can make tests such as identity
:comparisons of strings without having to worry about the different
:possible "spellings" of strings which Unicode allows.
:
:When entities are in a non-Unicode encoding, if the processor
:transcodes them to Unicode, it SHOULD use a normalizing transcoder.

I do not know why they've done this, but XML 1.0 does not mention
normalization at all, so perhaps they felt normalization would be too
big a change. Some random comments I read mentioned that XML 1.1 is
supposed to be independent of changes to Unicode and normalization may
change for new code points in new versions, and some said that the
unavailability of normalizers to implementors would be a reason.
Verification is specified in XML 1.1, though:

:However, a document is still well-formed even if it is not fully
:normalized. XML processors SHOULD provide a user option to verify
:that the document being processed is in fully normalized form, and
:report to the application whether it is or not. The option to not
:verify SHOULD be chosen only when the input text is certified, as
:defined by B Definitions for Character Normalization.

Note that all this applies after character entity (=escape) replacement,
and applies also to what passes for "identifiers" in XML documents.

I still think simply always normalizing the whole source code file to
NFC before any processing would be the right thing to do :-) I'm not
sure about processing of text files in Python code, it's certainly easy
to do the normalization yourself. Still, it's probably what's wanted in
most cases where line separators are normalized.

From rauli.ruohonen at gmail.com  Tue Jun  5 08:16:19 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Tue, 5 Jun 2007 09:16:19 +0300
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <4664CE5F.3040204@acm.org>
References: <4664CE5F.3040204@acm.org>
Message-ID:

On 6/5/07, Talin wrote:
> Thanks so much for this excellent roundup from the RoundUp Master :)
> Seriously, I've been staying well away from the PEP 3131 threads, and I
> was hoping that someone would post a summary of the issues so I could
> catch up.

I agree that the roundup is excellent, but it fails to mention a couple
of things, the most important of which is that PEP 3131 has already been
accepted. All the discussion is about details such as what's the
default, what the normalization should be, etc. A fork is therefore not
necessary.

From jcarlson at uci.edu  Tue Jun  5 09:15:23 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Tue, 05 Jun 2007 00:15:23 -0700
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <4664CE5F.3040204@acm.org>
References: <4664CE5F.3040204@acm.org>
Message-ID: <20070604235811.6F2B.JCARLSON@uci.edu>

Talin wrote:
> In other words - instead of endless discussions of hypotheticals, let
> people vote with their feet. Because I can already tell that as far as
> this mailing list goes, there will never be a consensus on this issue,
> due to basic value differences.

If the underlying runtime were written to handle unicode identifiers,
the Python runtime could be easily modified to discern the command used
to execute it. Alternatively, if we went with a command-line option,
Python could easily ship with a script called 'upython' (on *nix,
upython.bat on Windows) that automatically runs python with the proper
arguments (a rough sketch of such a wrapper appears below).

> 2) My second proposal is to drop all discussions of bidirectional
> support, since I think it's a red herring. So far, I haven't heard
> anyone whose native language is RTL lobbying for support of their
> language. Most of the vocal proponents of 3131 have been mainly
> concerned with asian languages. The people who are mainly bringing up
> the issue of Bidi are the people arguing against 3131, using it as the
> basis of an "excluded middle" argument that says that since its too
> difficult to do Bidi properly, then it's too difficult to do unicode
> identifiers.

While there has been discussion about how to handle bidi issues, I don't
believe I've read anything saying "since bidi is hard, let's not do
unicode at all".
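A rough sketch of such a wrapper, purely for illustration (the
--unicode-identifiers flag is hypothetical; no such option actually
exists):

    #!/usr/bin/env python
    # upython: re-exec the regular interpreter with a HYPOTHETICAL
    # flag that would enable non-ASCII identifiers.
    import os
    import sys

    os.execvp('python', ['python', '--unicode-identifiers'] + sys.argv[1:])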
 - Josiah

From martin at v.loewis.de  Tue Jun  5 09:54:30 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 05 Jun 2007 09:54:30 +0200
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <4664CE5F.3040204@acm.org>
References: <4664CE5F.3040204@acm.org>
Message-ID: <466516B6.8060005@v.loewis.de>

> 1) My first proposal is that someone - one of the PEP 3131 advocates
> probably - create a set of patches, or possibly a branch, that
> implements unicode identifiers in whatever manner they think is
> appropriate. Write some actual code instead of just talking about it.

I'm working on that. I want to base it on the py3k-struni branch, where
identifiers need to become Unicode (string) objects first before this
can be implemented. Completing that will likely take several weeks.

> These same PEP 3131 advocates should also distribute precompiled
> packages containing the upython interpreter. For simplicity, it is OK to
> assume that regular Python is already installed as a prerequisite.

That will likely not work, as the 3k interpreter will probably break
with a 2.x installation.

> Once upython becomes available to the public, it will be the goal of the
> 3131 advocates to get widespread adoption of upython. If there is much
> adoption, then that makes a strong argument for merging those features
> into regular python. On the other hand, if there is little adoption,
> then that's an argument to either maintain it as a fork, or drop it
> altogether.

That really isn't necessary. The PEP is already approved, so the feature
will be implemented in Python 3.

Regards,
Martin

From martin at v.loewis.de  Tue Jun  5 10:02:37 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Tue, 05 Jun 2007 10:02:37 +0200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de> <46646E76.8060804@v.loewis.de>
Message-ID: <4665189D.4020301@v.loewis.de>

> The path of least surprise for legacy encodings might be for
> the codecs to produce whatever is closest to the original encoding
> if possible. I.e. what was one code point would remain one code
> point, and if that's not possible then normalize. I don't know if
> this is any different from always normalizing (it certainly is
> the same for Latin-1).

Depends on the normalization form. For Latin 1, the straightforward
codec produces output that is not in NFKC, as MICRO SIGN should get
normalized to GREEK SMALL LETTER MU. However, it is normalized under
NFC. Not sure about other codecs; for the CJK ones, I would expect to
see all sorts of issues.

> Always normalizing would have the advantage of simplicity (no matter
> what the encoding, the result is the same), and I think that is
> the real path of least surprise if you sum over all surprises.

I'd like to repeat that this is out of scope of this PEP, though. This
PEP doesn't, and shouldn't, specify how string literals get from source
to execution.

> FWIW, I looked at what Java and XML 1.1 do, and they *don't* normalize
> for some reason.

For XML, I believe the reason is performance. It is *fairly* expensive
to compute NFC in the general case, and I'm as yet uncertain what a good
way would be to reduce execution cost in the "common case" (i.e. data is
already in NFC). For XML, enforcing this performance hit on top of the
already costly processing of XML would be unacceptable.

Regards,
Martin

From stephen at xemacs.org  Tue Jun  5 11:19:03 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 05 Jun 2007 18:19:03 +0900
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <4664E238.9020700@v.loewis.de>
References: <4664E238.9020700@v.loewis.de>
Message-ID: <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>

Jim Jewett writes:

> > The PEP assumes NFC, but I haven't really understood why, unless that
> > is required for compatibility with other systems (in which case, it
> > should be made explicit).

"Martin v. Löwis" writes:

> It's because UAX#31 tells us to use NFC, in section 5
>
> "Generally if the programming language has case-sensitive identifiers,
> then Normalization Form C is appropriate; whereas, if the programming
> language has case-insensitive identifiers, then Normalization Form KC is
> more appropriate."
>
> As Python has case-sensitive identifiers, NFC is appropriate.

It seems to me that what UAX#31 is saying is "Distinguishing (or not)
between 0035 DIGIT 3 and 2075 SUPERSCRIPT 3 should be equivalent to
distinguishing (or not) between LATIN CAPITAL LETTER A and LATIN SMALL
LETTER A." I don't know that I agree (or disagree) in principle.

Here's what UAX#15 has to say:

----------------
Normalization Forms KC and KD must not be blindly applied to arbitrary
text. Because they erase many formatting distinctions, they will prevent
round-trip conversion to and from many legacy character sets, and unless
supplanted by formatting markup, they may remove distinctions that are
important to the semantics of the text. It is best to think of these
Normalization Forms as being like uppercase or lowercase mappings:
useful in certain contexts for identifying core meanings, but also
performing modifications to the text that may not always be appropriate.
They can be applied more freely to domains with restricted character
sets, such as in Section 13, Programming Language Identifiers.
----------------

Note that Section 13 == UAX#31 (from which Martin is quoting). I don't
see this section as being at all supportive of NFC over NFKC, though.

Some detailed observations biased by my personal tastes:

It seems to me that while I sometimes find it useful for FOO and foo to
be different identifiers, I would almost always consider R3RS and R³RS
to be the same identifier. The contrast is just too small to be useful.
And I would never distinguish between a three-character fine (ﬁ - n - e)
and a four-character fine (f - i - n - e). I'd really love to see the
printer's ligatures gone.

I'd love to get rid of full-width ASCII and halfwidth kana (via
compatibility decomposition). Native Japanese speakers often use them
interchangeably with the "proper" versions when correcting typos and
updating numbers in a series. Ugly, to say the least. I don't think that
native Japanese would care, as long as the decomposition is done
internally to Python.

A scan of the full table for Unicode Version 2.0 (what I have here in
print) suggests that problematic decompositions actually are restricted
to only a few scripts. LATIN (CAPITAL|SMALL) LETTER L WITH MIDDLE DOT
(used in Catalan, cf sec. 5.1 of UAX#31) are compatibility
decompositions, unlike almost all other Latin decompositions (which are
canonical, and thus get recomposed in NFKC). 'n (Afrikaans), and a
half-dozen Croatian digraphs corresponding to Serbian Cyrillic would get
lost. The Koreans would lose a truckload of partially composed Hangul
and some archaic ones,
And that's about it (but I may have missed a bunch because that database doesn't give the character classes, so I guessed for stuff like technical symbols -> not ID characters). I suspect that as long as they have the precomposed Hangul, partial- syllable "ligature" forms won't be an issue for Koreans. I can't even distinguish the archaic versions from their compatibility equivalents by eye, although I'm comfortable with pronouncing Hangul. I have no opinion on the Latin decompositions mentioned above or the Arabic presentation forms. However, of the ones I can judge to some extent (Latin printer's ligatures, width variants, non-syllabic precomposed Korean Jamo), *not one* of the compatibility decompositions would be a loss in my opinion. On the other hand, there are a bunch of cases where NKFC would be a marked improvement. From rauli.ruohonen at gmail.com Tue Jun 5 13:06:53 2007 From: rauli.ruohonen at gmail.com (Rauli Ruohonen) Date: Tue, 5 Jun 2007 14:06:53 +0300 Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures? In-Reply-To: <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp> References: <4664E238.9020700@v.loewis.de> <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 6/5/07, Stephen J. Turnbull wrote: > I'd love to get rid of full-width ASCII and halfwidth kana (via > compatibility decomposition). If you do forbid compatibility characters in identifiers, then they should be flagged as an error, not converted silently. NFC, on the other hand, should be applied silently. The reason is that character equivalence is the same thing as binary equivalence of the NFC form in Unicode, and adding extra equivalences (whether it's "FoO" == "foo", "??" == "??" or "????" == "A123") is surprising. In short, I would like this function to return 'OK' or be a syntax error, but it should not fail or return something else: def test(): if 'A' == '?': return 'OK' A = 'O' ? = 'K' # as tested above, 'A' and '?' are not the same thing return locals()['A']+locals()['?'] Note that 'A' == '?' should be false (no automatic NFKC for strings, please). From ncoghlan at gmail.com Tue Jun 5 16:32:31 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 06 Jun 2007 00:32:31 +1000 Subject: [Python-3000] Substantial rewrite of PEP 3101 In-Reply-To: <46643F27.2040804@acm.org> References: <466310FC.8020707@acm.org> <4663EB44.1010507@trueblade.com> <46643F27.2040804@acm.org> Message-ID: <466573FF.8060001@gmail.com> Talin wrote: > What I wanted to avoid in the PEP was having to specify how all of these > different parts fit together and the exact nature of the parameters > being passed between them. > > And I think that even if we do break up vformat this way, we still end > up with people having to replace a fairly substantial chunk of code in > order to change the behaviors represented by these flags. If you make the methods to be overridden simple stateless queries with a True/False return like the two I suggested in my other message, then it becomes easy to tailor these behaviours without replacing the whole parser. For cases where changing the behaviour of those cases isn't enough then you would still have the option of completely overriding vformat. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From eric+python-dev at trueblade.com Tue Jun 5 17:16:14 2007 From: eric+python-dev at trueblade.com (Eric V. 
Date: Tue, 05 Jun 2007 11:16:14 -0400
Subject: [Python-3000] Substantial rewrite of PEP 3101
In-Reply-To: <46643F27.2040804@acm.org>
References: <466310FC.8020707@acm.org> <4663EB44.1010507@trueblade.com> <46643F27.2040804@acm.org>
Message-ID: <46657E3E.2000508@trueblade.com>

Talin wrote:
> Other kinds of customization require replacing a much larger chunk of
> code. Changing the "underscores" and "check-unused" behavior requires
> overriding 'vformat', which means replacing the entire template string
> parser. I figured that there would be a lot of people who might want
> these features, but didn't want to rewrite all of vformat.

Actually you only have to replace get_positional or get_named, I think.

And I don't see how the "check-unused" behavior can be written in the
base class, in the presence of get_positional and get_named. If the list
of identifiers isn't known to the base class (as in your example of
NamespaceFormatter), then how can the base class know if they're all
used?

>> I've started a sample implementation to test this API.  For starters,
>> I'm writing it in pure Python, but my intention is to use the code in
>> the pep3101 sandbox once I have some tests written and we're happy
>> with the API.
>
> Cool.

I think we'll know more when I've made some more progress on this.

Eric.

From jimjjewett at gmail.com  Tue Jun  5 17:18:48 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 5 Jun 2007 11:18:48 -0400
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <20070604235811.6F2B.JCARLSON@uci.edu>
References: <4664CE5F.3040204@acm.org> <20070604235811.6F2B.JCARLSON@uci.edu>
Message-ID:

On 6/5/07, Josiah Carlson wrote:
> Talin wrote:
> > I haven't heard anyone whose native language is RTL
> > lobbying for support of their language.  ...
> I don't believe I've read anything saying "since bidi is hard,
> let's not do unicode at all".

Not in those exact words, but Tomer did say, effectively

    bidi is hard -- probably too hard to get right yet. The current
    situation is better than rushing it. It wouldn't be fair to add
    support for some languages, but to exclude his.

Note, though, that this objection is really only to "unicode as a
single-switch". It doesn't argue against letting individuals (or system
admins or local redistributors) add one script at a time for local use,
and letting each language community work things out for themselves.

I expect the issues to be settled more easily in Swedish than in Hebrew
or Arabic, but they'll both be supported to the extent that they *can*
use their letters if they work out a local agreement on reasonable
limits.

(Also note that Arabic and probably Hebrew have additional issues to
work out beyond bidi, such as whether to allow certain presentational
forms. The unicode consortium recommends against them, but they are
still included in the ID_ group, as they are technically letters.)

-jJ

From jimjjewett at gmail.com  Tue Jun  5 17:37:48 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 5 Jun 2007 11:37:48 -0400
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <4665189D.4020301@v.loewis.de>
References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de> <46646E76.8060804@v.loewis.de> <4665189D.4020301@v.loewis.de>
Message-ID:

On 6/5/07, "Martin v. Löwis" wrote:
> > Always normalizing would have the advantage of simplicity (no
> > matter what the encoding, the result is the same), and I think
> > that is the real path of least surprise if you sum over all
> > surprises.
> I'd like to repeat that this is out of scope of this PEP, though. > This PEP doesn't, and shouldn't, specify how string literals get > from source to execution. I see that as a gray area. Unicode does say pretty clearly that (at least) canonical equivalents must be treated the same. In theory, this could be done only to identifiers, but then it needs to be done inline for getattr. Since we don't want the results of (str1 == str2) to change based on context, I think string equality also needs to look at canonicalized (though probably not compatibility) forms. This in turn means that hashing a unicode string should first canonicalize it. (I believe that is a change from 2.x.) This means that all literal unicode characters are subject to normalization unless they appear in a comment. At that point, it might be simpler to just canonicalize the whole source file up front. -jJ From talin at acm.org Tue Jun 5 18:01:39 2007 From: talin at acm.org (Talin) Date: Tue, 05 Jun 2007 09:01:39 -0700 Subject: [Python-3000] Substantial rewrite of PEP 3101 In-Reply-To: <466573FF.8060001@gmail.com> References: <466310FC.8020707@acm.org> <4663EB44.1010507@trueblade.com> <46643F27.2040804@acm.org> <466573FF.8060001@gmail.com> Message-ID: <466588E3.4020600@acm.org> Nick Coghlan wrote: > Talin wrote: >> What I wanted to avoid in the PEP was having to specify how all of >> these different parts fit together and the exact nature of the >> parameters being passed between them. >> >> And I think that even if we do break up vformat this way, we still end >> up with people having to replace a fairly substantial chunk of code in >> order to change the behaviors represented by these flags. > > If you make the methods to be overridden simple stateless queries with a > True/False return like the two I suggested in my other message, then it > becomes easy to tailor these behaviours without replacing the whole parser. > > For cases where changing the behaviour of those cases isn't enough then > you would still have the option of completely overriding vformat. I don't have a problem with this approach either. -- Talin From talin at acm.org Tue Jun 5 18:15:23 2007 From: talin at acm.org (Talin) Date: Tue, 05 Jun 2007 09:15:23 -0700 Subject: [Python-3000] Substantial rewrite of PEP 3101 In-Reply-To: <46657E3E.2000508@trueblade.com> References: <466310FC.8020707@acm.org> <4663EB44.1010507@trueblade.com> <46643F27.2040804@acm.org> <46657E3E.2000508@trueblade.com> Message-ID: <46658C1B.8090008@acm.org> Eric V. Smith wrote: > Talin wrote: >> Other kinds of customization require replacing a much larger chunk of >> code. Changing the "underscores" and "check-unused" behavior requires >> overriding 'vformat', which means replacing the entire template string >> parser. I figured that there would be a lot of people who might want >> these features, but didn't want to rewrite all of vformat. > > Actually you only have to replace get_positional or get_named, I think. I don't think that people writing replacements for get_positional/named should have to reimplement the checking code. I'd like for them to worry only about accessing values, and leave the usage checking out of it. > And I don't see how the "check-unused" behavior can be written in the > base class, in the presence of get_positional and get_named. If the > list of identifiers isn't known to the base class (as in your example of > NamespaceFormatter), then how can the base class know if they're all used? 
Because the checking only applies to arguments that are explicitly
passed in to vformat(). It never applies to the default namespace.

Think of it this way: Would you consider it an error if the format
string failed to refer to every global variable? Of course not. The
default namespace is open-ended, whereas the positional and keyword
arguments to vformat are a bounded set. So vformat can know exactly
which arguments are and aren't used.

The checking code is, I think, relatively simple:

    checked_args = set()
    if checking_positional:
        checked_args.update(range(0, len(positional)))
    if checking_named:
        checked_args.update(kwds.iterkeys())

    # now parse the template string, removing from the set
    # any arg names/indices that are referred to.

    if checked_args:  # If set non-empty
        # error

The code to populate the set of checked args could be in an overridable
method, as suggested by Nick Coghlan. This method could simply return
the set of args to check or None if checking is turned off.

The other way to do it would be to always build the set of 'used' names,
and then call the method afterwards to do a set.difference operation.
However, this means you always build a set even if you aren't checking,
whereas with the first method you can skip creating the set if checking
is turned off.

>>> I've started a sample implementation to test this API.  For starters,
>>> I'm writing it in pure Python, but my intention is to use the code in
>>> the pep3101 sandbox once I have some tests written and we're happy
>>> with the API.
>>
>> Cool.
>
> I think we'll know more when I've made some more progress on this.
>
> Eric.
>

From martin at v.loewis.de  Tue Jun  5 18:56:37 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 05 Jun 2007 18:56:37 +0200
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <4664E238.9020700@v.loewis.de> <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <466595C5.6070301@v.loewis.de>

> I'd love to get rid of full-width ASCII and halfwidth kana (via
> compatibility decomposition). Native Japanese speakers often use them
> interchangeably with the "proper" versions when correcting typos and
> updating numbers in a series. Ugly, to say the least. I don't think
> that native Japanese would care, as long as the decomposition is done
> internally to Python.

Not sure what the proposal is here. If people say "we want the PEP to do
NFKC", I understand that as "instead of saying NFC, it should say NFKC",
which in turn means "all identifiers are converted into the normal form
NFKC while parsing".

With that change, the full-width ASCII characters would still be allowed
in source - they just wouldn't be different from the regular ones
anymore when comparing identifiers.

Another option would be to require that the source is in NFKC already,
where I then ask again what precisely that means in presence of non-UTF
source encodings.

Regards,
Martin

From jimjjewett at gmail.com  Tue Jun  5 19:10:02 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 5 Jun 2007 13:10:02 -0400
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <4664E238.9020700@v.loewis.de> <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID:

On 6/5/07, Stephen J. Turnbull wrote:
> It seems to me that what UAX#31 is saying is "Distinguishing (or not)
> between 0035 DIGIT 3 and 2075 SUPERSCRIPT 3 should be
> equivalent to distinguishing (or not) between LATIN CAPITAL
> LETTER A and LATIN SMALL LETTER A."  I don't know that
> I agree (or disagree) in principle.

So effectively, they consider "a" and "A" to be presentational variants.

In some languages, certain presentational variants are used depending on
word position. I think the ID_START property does exclude letters that
cannot appear in an initial position, but putting a final character in
the middle or vice versa would still be wrong.

If identifiers are only ever typed, I suppose that isn't a problem. If
identifiers are built up in the equivalent of

    handler = "do_" + name

then the character will sometimes be wrong in a way that many editors
will either hide or silently "correct." (A short sketch of that failure
mode appears at the end of this message.)

The standard also says (but I can't verify) that replacing the
presentational variant with the generic form will generally *improve*
presentation, presumably because there are now more systems which do the
font shaping correctly than there are systems able to handle the old
character formats.

The folding rules do say that it is OK (even good) to exclude certain
characters from certain foldings; I think we could preserve case
(including title-case?) as the only presentational variant we recognize.

> A scan of the full table for Unicode Version 2.0 (what I have here in
> print) suggests that problematic decompositions actually are
> restricted to only a few scripts.  LATIN (CAPITAL|SMALL)
> LETTER L WITH MIDDLE DOT (used in Catalan, cf sec. 5.1 of
> UAX#31)

As best I understand it, this one would be helped by using compatibility
mappings. There is an official way to spell l-middle dot, but enough old
texts used the "wrong" character that it has to be special-cased for
round-tripping. Since the ID is a final destination, we care less about
round-trips, and more about "if they switch editors, will the identifier
still match".

At the very least, it is mentioned as needing special care (when used as
an identifier) in http://www.unicode.org/reports/tr31/ section 5.1
paragraph 1.

> decompositions, unlike almost all other Latin decompositions (which
> are canonical, and thus get recomposed in NFKC).  'n (Afrikaans), and
> a half-dozen Croatian digraphs corresponding to Serbian Cyrillic would
> get lost.  The Koreans would lose a truckload of partially composed
> Hangul and some archaic ones,

http://www.unicode.org/versions/corrigendum3.html suggests that many of
the Hangul are either pronunciation guide variants or even exact
duplicates (that were presumably missed when the canonicalization was
frozen?)

> the Arabic speakers their presentation forms.

http://www.unicode.org/reports/tr31/ 5.1 paragraph 3 includes:

"""It is recommended that all Arabic presentation forms be excluded from
identifiers in any event, although only a few of them must be excluded
for normalization to guarantee identifier closure."""

> And that's about it (but I may have missed a bunch because
> that database doesn't give the character classes, so I guessed for
> stuff like technical symbols -> not ID characters).

Depends on what you mean by technical symbols. IMHO, many of them are in
fact listed as ID characters. The math versions (generally 1D400 -
1DC7B) are included. But
http://unicode.org/reports/tr39/data/xidmodifications.txt suggests
excluding them again.
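Here is the sketch promised above. It assumes the parser normalizes
identifiers while parsing (as the PEP proposes) but that plain string
lookup, e.g. in getattr, does not; the class and method names are made
up for the example:

    import unicodedata

    class Handler:
        def do_café(self):           # the parser normalizes this identifier
            return 'handled'

    name = 'cafe\u0301'              # the same word in decomposed (NFD) form
    obj = Handler()
    try:
        getattr(obj, 'do_' + name)() # plain string lookup: no normalization
    except AttributeError:
        print('lookup missed the attribute')
    print(getattr(obj, 'do_' + unicodedata.normalize('NFKC', name))())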
> However, of the ones I can judge to some extent (Latin printer's
> ligatures, width variants, non-syllabic precomposed Korean Jamo), *not
> one* of the compatibility decompositions would be a loss in my
> opinion.  On the other hand, there are a bunch of cases where NFKC
> would be a marked improvement.

-jJ

From jimjjewett at gmail.com  Tue Jun  5 19:14:59 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 5 Jun 2007 13:14:59 -0400
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <466595C5.6070301@v.loewis.de>
References: <4664E238.9020700@v.loewis.de> <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp> <466595C5.6070301@v.loewis.de>
Message-ID:

On 6/5/07, "Martin v. Löwis" wrote:
> > I'd love to get rid of full-width ASCII and halfwidth kana (via
> > compatibility decomposition). Native Japanese speakers often use them
> > interchangeably with the "proper" versions when correcting typos and
> > updating numbers in a series. Ugly, to say the least. I don't think
> > that native Japanese would care, as long as the decomposition is done
> > internally to Python.

> Not sure what the proposal is here. If people say "we want the PEP to do
> NFKC", I understand that as "instead of saying NFC, it should say
> NFKC", which in turn means "all identifiers are converted into the
> normal form NFKC while parsing".

I would prefer that.

> With that change, the full-width ASCII characters would still be
> allowed in source - they just wouldn't be different from the regular
> ones anymore when comparing identifiers.

I *think* that would be OK; so long as they mean the same thing, it is
just a quirk like using a different font.

I am slightly concerned that it might mean "string as string" and
"string as identifier" have different tests for equality.

> Another option would be to require that the source is in NFKC already,
> where I then ask again what precisely that means in presence of
> non-UTF source encodings.

My own opinion is that it would be reasonable to put those in NFKC form
as part of the parser's internal translation to unicode. (But I agree
that it makes sense to do that for all encodings, if it is done for
any.)

-jJ

From martin at v.loewis.de  Tue Jun  5 19:15:35 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 05 Jun 2007 19:15:35 +0200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de> <46646E76.8060804@v.loewis.de> <4665189D.4020301@v.loewis.de>
Message-ID: <46659A37.4000900@v.loewis.de>

Jim Jewett schrieb:
> On 6/5/07, "Martin v. Löwis" wrote:
>> > Always normalizing would have the advantage of simplicity (no
>> > matter what the encoding, the result is the same), and I think
>> > that is the real path of least surprise if you sum over all
>> > surprises.
>
>> I'd like to repeat that this is out of scope of this PEP, though.
>> This PEP doesn't, and shouldn't, specify how string literals get
>> from source to execution.
>
> I see that as a gray area.

Please read the PEP title again. What is unclear about "Supporting
Non-ASCII Identifiers"?

> Unicode does say pretty clearly that (at least) canonical equivalents
> must be treated the same.

Chapter and verse, please?

> In theory, this could be done only to identifiers, but then it needs
> to be done inline for getattr.

Why that? The caller of getattr would need to apply normalization in
case the input isn't known to be normalized?
> Since we don't want the results of (str1 == str2) to change based on
> context, I think string equality also needs to look at canonicalized
> (though probably not compatibility) forms.  This in turn means that
> hashing a unicode string should first canonicalize it.  (I believe
> that is a change from 2.x.)

And you think this is still within the scope of the PEP? Please, if you
want that to happen, write your own PEP.

Regards,
Martin

From jimjjewett at gmail.com  Tue Jun  5 19:33:59 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 5 Jun 2007 13:33:59 -0400
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: References: <4664E238.9020700@v.loewis.de> <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID:

On 6/5/07, Rauli Ruohonen wrote:
> On 6/5/07, Stephen J. Turnbull wrote:
> > I'd love to get rid of full-width ASCII and halfwidth kana (via
> > compatibility decomposition).

> If you do forbid compatibility characters in identifiers, then they
> should be flagged as an error, not converted silently.

Forbidding them seems reasonable to me; the only catch is that it is the
first step toward making a ton of individual decisions, some of which
will be wrong. Better than getting them all wrong, of course, but not
better than postponing.

(I don't mean "ban all unicode characters"; I do mean to ban far more of
them, or to use a site-specific incremental whitelist, or both.)

-jJ

From jimjjewett at gmail.com  Tue Jun  5 20:48:40 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 5 Jun 2007 14:48:40 -0400
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <46659A37.4000900@v.loewis.de>
References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de> <46646E76.8060804@v.loewis.de> <4665189D.4020301@v.loewis.de> <46659A37.4000900@v.loewis.de>
Message-ID:

On 6/5/07, "Martin v. Löwis" wrote:
> Jim Jewett schrieb:
> > On 6/5/07, "Martin v. Löwis" wrote:
> >> > Always normalizing would have the advantage of simplicity (no
> >> > matter what the encoding, the result is the same), and I think
> >> > that is the real path of least surprise if you sum over all
> >> > surprises.

> >> I'd like to repeat that this is out of scope of this PEP, though.
> >> This PEP doesn't, and shouldn't, specify how string literals get
> >> from source to execution.

> > I see that as a gray area.

> Please read the PEP title again. What is unclear about
> "Supporting Non-ASCII Identifiers"?

That strings can also be used as identifiers.

> > Unicode does say pretty clearly that (at least) canonical equivalents
> > must be treated the same.

> Chapter and verse, please?

I am pretty sure this list is not exhaustive, but it may be helpful:

The Identifiers Annex http://www.unicode.org/reports/tr31/

"""
UAX31-C2. An implementation claiming conformance to Level 1 of this
specification shall describe which of the following it observes:

    R1 Default Identifiers
    R2 Alternative Identifiers
    R3 Pattern_White_Space and Pattern_Syntax Characters
    R4 Normalized Identifiers
    R5 Case-Insensitive Identifiers
"""

I interpret this as "If we normalize the Identifiers, then we must
observe R4." R4 lets us exclude individual characters from
normalization, but it says that two IDs with the same Normalization Form
are equivalent, unless they include specifically excluded characters.

"""
R4 Normalized Identifiers

To meet this requirement, an implementation shall specify the
Normalization Form and shall provide a precise list of any characters
that are excluded from normalization.
If the Normalization Form is NFKC, the implementation shall apply the
modifications in Section 5.1, NFKC Modifications, given by the
properties XID_Start and XID_Continue.

Except for identifiers containing excluded characters, any two
identifiers that have the same Normalization Form shall be treated as
equivalent by the implementation.
"""

Additional Support:

The Normalization Annex http://www.unicode.org/reports/tr15/ near the
end of section 1 (but before 1.1)

"""
Normalization Forms KC and KD must not be blindly applied to arbitrary
text.
"""
...
"""
They can be applied more freely to domains with restricted character
sets, such as in Section 13, Programming Language Identifiers.
"""
(section 13 then forwards back to UAX31)

TR 15, section 19, numbered paragraph 3

"""
Higher-level processes that transform or compare strings, or that
perform other higher-level functions, must respect canonical equivalence
or problems will result.
"""

Looking at the main standard, I fall back to Unicode 4 because it is
online at http://www.unicode.org/versions/Unicode4.0.0/

2.2 Equivalent Sequences

"""
... If an application or user attempts to distinguish non-identical
sequences which are nonetheless considered to be equivalent sequences,
as shown in the examples in Figure 2-6, it would not be guaranteed that
other applications or users would recognize the same distinctions. To
prevent introducing interoperability problems between applications, such
distinctions must be avoided wherever possible.
"""

which is echoed in chapter 3 (conformance)

"""
C9 A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct.
...
Ideally, an implementation would always interpret two
canonical-equivalent character sequences identically. There are
practical circumstances under which implementations may reasonably
distinguish them.
"""

"""
C10 When a process purports not to modify the interpretation of a valid
coded character representation, it shall make no change to that coded
character representation other than the possible replacement of
character sequences by their canonical-equivalent sequences or the
deletion of noncharacter code points.
...
All processes and higher-level protocols are required to abide by C10 as
a minimum. However, higher-level protocols may define additional
equivalences that do not constitute modifications under that protocol.
For example, a higher-level protocol may allow a sequence of spaces to
be replaced by a single space.
"""

> > In theory, this could be done only to identifiers, but then it needs
> > to be done inline for getattr.

> Why that? The caller of getattr would need to apply normalization in
> case the input isn't known to be normalized?

OK, I suppose that might work, if documented, but ... it seems like
another piece of boilerplate; when it isn't there, it won't really be
because the input is normalized so often as it is because the author
didn't think about normalization.

-jJ

From martin at v.loewis.de  Tue Jun  5 21:09:03 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 05 Jun 2007 21:09:03 +0200
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: References:
Message-ID: <4665B4CF.2050107@v.loewis.de>

> Here's a summary of some of the remaining open issues and unaddressed
> arguments regarding PEP 3131.  These are the ones I'm familiar with,
> so I don't claim this to be complete.  I hope it helps give some
> perspective on this huge thread, though.

Thanks, I added them all to the PEP.
Not sure which of these you would consider "open issues", or "unaddressed arguments"; I'll indicate below how I see them dealt with by the PEP currently. > A. Should identifiers be allowed to contain any Unicode letter? Not an open issue; the PEP has been accepted. > 1. Python will lose the ability to make a reliable round trip to > a human-readable display on screen or on paper. Correct. Was already the case, though, because of comments and string literals. > 2. Python will become vulnerable to a new class of security exploits; > code and submitted patches will be much harder to inspect. The first class is correct; I'd question the second part (in particular the "much" part of it). It's now addressed in the PEP by being listed in the discussion section. > 3. Humans will no longer be able to validate Python syntax. That's not true. Instead, they might not be able to do that for *all* Python programs - however, that is the case already: if programs are sufficiently complex, people cannot validate Python syntax today. Addressed by being listed. > 4. Unicode is young; its problems are not yet well understood and > solved; tool support is weak. Now listed. I disagree that Unicode is young; it is roughly as old as Python. > 5. Languages with non-ASCII identifiers use different character sets > and normalization schemes; PEP 3131's choices are non-obvious. I disagree. PEP 3131 follows UAX#31 literally, and makes that decision very clear. If people still cannot see that, please provide wording to make it more clear. > 6. The Unicode bidi algorithm yields an extremely confusing display > order for RTL text when digits or operators are nearby. Now listed. > B. Should the default behaviour accept only ASCII identifiers, or > should it accept identifiers containing non-ASCII characters? Added as an open issue. > C. Should non-ASCII identifiers be optional? How is that different from B? > D. Should the identifier character set be configurable? Still seems to be the same open issue. > E. Which identifier characters should be allowed? > > 1. What to do about bidi format control characters? That was already listed as an open issue. > 2. What about other ID_Continue characters? What about characters > that look like punctuation? What about other recommendations > in UTS #39? What about mixed-script identifiers? > > http://mail.python.org/pipermail/python-3000/2007-May/007836.html That was also listed as an open issue. > F. Which normalization form should be used, NFC or NFKC? Now listed as an open issue. > G. Should source code be required to be in normalized form? Should I add a section "Rejected ideas"? This is out of scope of the PEP. Regards, Martin From martin at v.loewis.de Tue Jun 5 21:21:59 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 05 Jun 2007 21:21:59 +0200 Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers In-Reply-To: References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de> <46646E76.8060804@v.loewis.de> <4665189D.4020301@v.loewis.de> <46659A37.4000900@v.loewis.de> Message-ID: <4665B7D7.6030501@v.loewis.de> >> > Unicode does say pretty clearly that (at least) canonical equivalents >> > must be treated the same. > >> Chapter and verse, please? > > I am pretty sure this list is not exhaustive, but it may be helpful: > > The Identifiers Annex http://www.unicode.org/reports/tr31/ Ah, that's in the context of identifiers, not in the context of text in general. > """ > UAX31-C2. 
An implementation claiming conformance to Level 1 of this
> specification shall describe which of the following it observes:
>
> R1 Default Identifiers
> R2 Alternative Identifiers
> R3 Pattern_White_Space and Pattern_Syntax Characters
> R4 Normalized Identifiers
> R5 Case-Insensitive Identifiers
> """
>
> I interpret this as "If we normalize the Identifiers, then we must
> observe R4."  R4 lets us exclude individual characters from
> normalization, but it says that two IDs with the same Normalization
> Form are equivalent, unless they include specifically excluded
> characters.

Correct, and that's indeed what PEP 3131 does.

> """
> Normalization Forms KC and KD must not be blindly applied to arbitrary
> text.
> """ ... """
> They can be applied more freely to domains with restricted character
> sets, such as in Section 13, Programming Language Identifiers.
> """
> (section 13 then forwards back to UAX31)

How is that a requirement that comparison should apply normalization?

> TR 15, section 19, numbered paragraph 3
> """
> Higher-level processes that transform or compare strings, or that
> perform other higher-level functions, must respect canonical
> equivalence or problems will result.
> """

That's not a mandatory requirement, but an "important aspect". Also, it
applies to "higher-level processes"; I would expect that string
comparison is not a higher-level function. Indeed, UAX#15 only gives
definitions, no rules.

> C9 A process shall not assume that the interpretations of two
> canonical-equivalent character sequences are distinct.

Right. What is "a process"?

> ...
> Ideally, an implementation would always interpret two
> canonical-equivalent character sequences identically. There are
> practical circumstances under which implementations may reasonably
> distinguish them.
> """

So it should be the application's choice.

> """
> C10 When a process purports not to modify the interpretation of a
> valid coded character representation, it shall make no change to that
> coded character representation other than the possible replacement of
> character sequences by their canonical-equivalent sequences or the
> deletion of noncharacter code points.
> ...
> All processes and higher-level protocols are required to abide by C10
> as a minimum. However, higher-level protocols may define additional
> equivalences that do not constitute modifications under that protocol.
> For example, a higher-level protocol may allow a sequence of spaces to
> be replaced by a single space.
> """

So this *allows* canonicalizing strings, it doesn't *require* Python to
do so. Indeed, doing so would be fairly expensive, and therefore it
should not be done (IMO).

>> Why that? The caller of getattr would need to apply normalization in
>> case the input isn't known to be normalized?

> OK, I suppose that might work, if documented, but ... it seems like
> another piece of boilerplate; when it isn't there, it won't really be
> because the input is normalized so often as it is because the author
> didn't think about normalization.

No. It might also be because the author *knows* that the string is
already normalized.

Regards,
Martin

From martin at v.loewis.de  Tue Jun  5 22:59:55 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 05 Jun 2007 22:59:55 +0200
Subject: [Python-3000] example Python code under PEP 3131?
In-Reply-To: <4664A2B6.903@canterbury.ac.nz>
References: <471885.91417.qm@web33503.mail.mud.yahoo.com> <4664A2B6.903@canterbury.ac.nz>
Message-ID: <4665CECB.3000109@v.loewis.de>

Greg Ewing wrote:
> Steve Howell wrote:
>
>> einfugen = in joints (????)
>
> Maybe "join in" (as a verb)?

It's actually "insert" (into the list).

Regards,
Martin

From alexandre at peadrop.com Tue Jun 5 23:33:10 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Tue, 5 Jun 2007 17:33:10 -0400
Subject: [Python-3000] setup.py fails in the py3k-struni branch
Message-ID: 

Hi,

On Ubuntu linux, when I try to run make in the py3k-struni branch I get
a weird error about split(). However, I don't get this error when I
run ``make clean; make''.

Thanks,
-- Alexandre

% make
Traceback (most recent call last):
  File "./setup.py", line 6, in
    import sys, os, imp, re, optparse
  File "/home/alex/src/python.org/py3k-struni/Lib/optparse.py", line 412, in
    _builtin_cvt = { "int" : (_parse_int, _("integer")),
  File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line 563, in gettext
    return dgettext(_current_domain, message)
  File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line 527, in dgettext
    codeset=_localecodesets.get(domain))
  File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line 462, in translation
    mofiles = find(domain, localedir, languages, all=1)
  File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line 434, in find
    for nelang in _expand_lang(lang):
  File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line 129, in _expand_lang
    locale = normalize(locale)
  File "/home/alex/src/python.org/py3k-struni/Lib/locale.py", line 329, in normalize
    norm_encoding = encodings.normalize_encoding(encoding)
  File "/home/alex/src/python.org/py3k-struni/Lib/encodings/__init__.py", line 68, in normalize_encoding
    return '_'.join(encoding.translate(_norm_encoding_map).split())
TypeError: split() takes at least 1 argument (0 given)
make: *** [sharedmods] Error 1

From guido at python.org Wed Jun 6 00:04:30 2007
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Jun 2007 15:04:30 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: 
References: 
Message-ID: 

If "make clean" makes the problem go away, it's usually because there
were old .pyc files with incompatible byte code. We don't change the
.pyc magic number for each change to the compiler.

--Guido

On 6/5/07, Alexandre Vassalotti wrote:
> Hi,
>
> On Ubuntu linux, when I try to run make in the py3k-struni branch I get
> a weird error about split(). However, I don't get this error when I
> run ``make clean; make''.
> > Thanks, > -- Alexandre > > % make > Traceback (most recent call last): > File "./setup.py", line 6, in > import sys, os, imp, re, optparse > File "/home/alex/src/python.org/py3k-struni/Lib/optparse.py", line > 412, in > _builtin_cvt = { "int" : (_parse_int, _("integer")), > File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line > 563, in gettext > return dgettext(_current_domain, message) > File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line > 527, in dgettext > codeset=_localecodesets.get(domain)) > File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line > 462, in translation > mofiles = find(domain, localedir, languages, all=1) > File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line 434, in find > for nelang in _expand_lang(lang): > File "/home/alex/src/python.org/py3k-struni/Lib/gettext.py", line > 129, in _expand_lang > locale = normalize(locale) > File "/home/alex/src/python.org/py3k-struni/Lib/locale.py", line > 329, in normalize > norm_encoding = encodings.normalize_encoding(encoding) > File "/home/alex/src/python.org/py3k-struni/Lib/encodings/__init__.py", > line 68, in normalize_encoding > return '_'.join(encoding.translate(_norm_encoding_map).split()) > TypeError: split() takes at least 1 argument (0 given) > make: *** [sharedmods] Error 1 > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From alexandre at peadrop.com Wed Jun 6 00:43:29 2007 From: alexandre at peadrop.com (Alexandre Vassalotti) Date: Tue, 5 Jun 2007 18:43:29 -0400 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: References: Message-ID: On 6/5/07, Guido van Rossum wrote: > If "make clean" makes the problem go away, it's usually because there > were old .pyc files with incompatible byte code. We don't change the > .pyc magic number for each change to the compiler. Nope. It is still not working. I just did the following, and I still get the same error. % unset CC # to turn off ccache % make distclean % svn revert -R . % svn up % ./configure % make # run fine % make # fail -- Alexandre From rrr at ronadam.com Wed Jun 6 01:14:12 2007 From: rrr at ronadam.com (Ron Adam) Date: Tue, 05 Jun 2007 18:14:12 -0500 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: References: Message-ID: <4665EE44.2010306@ronadam.com> Alexandre Vassalotti wrote: > On 6/5/07, Guido van Rossum wrote: >> If "make clean" makes the problem go away, it's usually because there >> were old .pyc files with incompatible byte code. We don't change the >> .pyc magic number for each change to the compiler. > > Nope. It is still not working. I just did the following, and I still > get the same error. > > % unset CC # to turn off ccache > % make distclean > % svn revert -R . > % svn up > % ./configure > % make # run fine > % make # fail > > -- Alexandre I can confirm the same behavior. Works on the first make, same error on the second. I deleted the contents of the branch and did an "svn up" on an empty directory. Same thing. 
Ron From jimjjewett at gmail.com Wed Jun 6 01:18:09 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Tue, 5 Jun 2007 19:18:09 -0400 Subject: [Python-3000] PEP 3131 roundup In-Reply-To: <4665B4CF.2050107@v.loewis.de> References: <4665B4CF.2050107@v.loewis.de> Message-ID: On 6/5/07, "Martin v. L?wis" wrote: > > 1. Python will lose the ability to make a reliable round trip to > > a human-readable display on screen or on paper. > Correct. Was already the case, though, because of comments > and string literals. But these are usually less important; when written as literals, they are normally part of the User Interface, and if the user can't see the difference, it doesn't matter. There are exceptions, such as the "HELO" magic cookie in the (externally defined) SMTP protocol, but I think these exceptions are uncommon -- and outside python's control anyhow. > > 5. Languages with non-ASCII identifiers use different > > character sets and normalization schemes; PEP 3131's > > choices are non-obvious. > I disagree. PEP 3131 follows UAX#31 literally, and makes that > decision very clear. If people still cannot see that, I think "obvious" referred to the reasoning, not the outcome. I can tell that the decision was "NFC, anything goes", but I don't see why. (1) I am not sure why it was NFC; UAX 31 seems agnostic on which normalization form to use. The only explicit recommendations I can find suggest using NFKC for identifiers. http://www.unicode.org/faq/normalization.html#2 (Outside of that recommendation for KC, it isn't even clear why we should use the Composed form. As of tonight, I realized that "composed" means less than I thought, and the actual algorithm means it should work as well as the Decomposed forms -- but I had missed that detail the first several times I read about the different Normalization forms, and it certainly isn't included directly in the PEP.) (2) I cannot understand why ID_START/CONTINUE was chosen instead of the newer and more recommended XID_START/CONTINUE. From UAX31 section 2: """ The XID_Start and XID_Continue properties are improved lexical classes that incorporate the changes described in Section 5.1, NFKC Modifications. They are recommended for most purposes, especially for security, over the original ID_Start and ID_Continue properties. """ Nor can I understand why the additional restrictions in xidmodifications (from TR39) were ignored. The reason to remove those characters is given as """ The restricted characters are characters not in common use, removed so as to further reduce the possibilities for visual confusion. Initially, the following are being excluded: characters not in modern use; characters only used in specialized fields, such as liturgical characters, mathematical letter-like symbols, and certain phonetic alphabetics; and ideographic characters that are not part of a set of core CJK ideographs consisting of the CJK Unified Ideographs block plus IICore (the set of characters defined by the IRG as the minimal set of required ideographs for East Asian use). A small number of such characters are allowed back in so that the profile includes all the characters in the country-specific restricted IDN lists: """ As best I can tell, the remaining list is *still* too generous to be called conservative, but the characters being removed are almost certainly good choices for removal -- no one's native language requires them. > > B. Should the default behaviour accept only ASCII identifiers, or > > should it accept identifiers containing non-ASCII characters? 
> > D. Should the identifier character set be configurable? > Still seems to be the same open issue. Defaulting to ASCII or defaulting to "accept unicode" is one issue. A related but separate issue is whether accepting unicode is a single on/off switch, or whether it will be possible to accept only some unicode characters. As written, there is no good way to accept, say, Japanese characters, but not Cyrillic. I would prefer to whitelist individual characters or scripts, but there should at least be a way to exclude certain characters. http://www.unicode.org/reports/tr39/data/intentional.txt is a list of characters that *should* be impossible to distinguish visually. It isn't just that the standard representations are identical; (like some of the combining marks looking like quote signs), it is that the (distinct abstract) characters *should* use the same glyph, so long as they are in the same (or even harmonized) fonts. Several of the Greek and Cyrillic characters are glyph-identical with ASCII letters. I won't say that people using those scripts shouldn't be allowed to use those letters, but *I* certainly don't want to get code using them just because I allowed the ?. -jJ From alexandre at peadrop.com Wed Jun 6 01:45:24 2007 From: alexandre at peadrop.com (Alexandre Vassalotti) Date: Tue, 5 Jun 2007 19:45:24 -0400 Subject: [Python-3000] help() broken in the py3k-struni branch Message-ID: Hi, I found another bug to report. It seems there is a bug in subprocess.py that makes help() fail. -- Alexandre Python 3.0x (py3k-struni, Jun 5 2007, 18:41:44) [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> help(open) Traceback (most recent call last): File "", line 1, in File "/home/alex/src/python.org/py3k-struni/Lib/site.py", line 350, in __call__ return pydoc.help(*args, **kwds) File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1687, in __call__ self.help(request) File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1731, in help else: doc(request, 'Help on %s:') File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1514, in doc pager(render_doc(thing, title, forceload)) File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1313, in pager pager(text) File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1333, in return lambda text: pipepager(text, 'less') File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1352, in pipepager pipe = os.popen(cmd, 'w') File "/home/alex/src/python.org/py3k-struni/Lib/os.py", line 717, in popen bufsize=buffering) File "/home/alex/src/python.org/py3k-struni/Lib/subprocess.py", line 476, in __init__ raise TypeError("bufsize must be an integer") TypeError: bufsize must be an integer From guido at python.org Wed Jun 6 01:47:24 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 5 Jun 2007 16:47:24 -0700 Subject: [Python-3000] help() broken in the py3k-struni branch In-Reply-To: References: Message-ID: Feel free to mail me a patch to fix it. On 6/5/07, Alexandre Vassalotti wrote: > Hi, > > I found another bug to report. It seems there is a bug in > subprocess.py that makes help() fail. > > -- Alexandre > > Python 3.0x (py3k-struni, Jun 5 2007, 18:41:44) > [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. 
> >>> help(open) > Traceback (most recent call last): > File "", line 1, in > File "/home/alex/src/python.org/py3k-struni/Lib/site.py", line 350, > in __call__ > return pydoc.help(*args, **kwds) > File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line > 1687, in __call__ > self.help(request) > File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1731, in help > else: doc(request, 'Help on %s:') > File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1514, in doc > pager(render_doc(thing, title, forceload)) > File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1313, in pager > pager(text) > File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line > 1333, in > return lambda text: pipepager(text, 'less') > File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line > 1352, in pipepager > pipe = os.popen(cmd, 'w') > File "/home/alex/src/python.org/py3k-struni/Lib/os.py", line 717, in popen > bufsize=buffering) > File "/home/alex/src/python.org/py3k-struni/Lib/subprocess.py", line > 476, in __init__ > raise TypeError("bufsize must be an integer") > TypeError: bufsize must be an integer > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From alexandre at peadrop.com Wed Jun 6 01:51:44 2007 From: alexandre at peadrop.com (Alexandre Vassalotti) Date: Tue, 5 Jun 2007 19:51:44 -0400 Subject: [Python-3000] pdb help is broken in py3k-struni branch Message-ID: Hi again, I just found yet another bug in py3k-struni branch. This one about the pdb module. Should I start to report these bugs to the bug tracker, instead? At this pace, I will flood the mailing list. :) -- Alexandre Python 3.0x (py3k-struni, Jun 5 2007, 18:41:44) [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> raise TypeError Traceback (most recent call last): File "", line 1, in TypeError >>> import pdb >>> pdb.pm() > (1)() (Pdb) help Documented commands (type help ): ======================================== Traceback (most recent call last): File "", line 1, in File "/home/alex/src/python.org/py3k-struni/Lib/pdb.py", line 1198, in pm post_mortem(sys.last_traceback) File "/home/alex/src/python.org/py3k-struni/Lib/pdb.py", line 1195, in post_mortem p.interaction(t.tb_frame, t) File "/home/alex/src/python.org/py3k-struni/Lib/pdb.py", line 192, in interaction self.cmdloop() File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 139, in cmdloop stop = self.onecmd(line) File "/home/alex/src/python.org/py3k-struni/Lib/pdb.py", line 242, in onecmd return cmd.Cmd.onecmd(self, line) File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 216, in onecmd return func(arg) File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 336, in do_help self.print_topics(self.doc_header, cmds_doc, 15,80) File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 345, in print_topics self.columnize(cmds, maxcol-1) File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 361, in columnize ", ".join(map(str, nonstrings))) TypeError: list[i] not a string for i in 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45 >>> From guido at python.org Wed Jun 6 02:00:33 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 5 Jun 2007 17:00:33 -0700 Subject: [Python-3000] pdb help is broken in py3k-struni branch In-Reply-To: References: Message-ID: I'd rather see them here than in SF, SF is a pain to use. But unless the bugs prevent you from proceeding, you could also ignore them. There are 96 failing unit tests right now in that branch -- no need to report all of them. --Guido On 6/5/07, Alexandre Vassalotti wrote: > Hi again, > > I just found yet another bug in py3k-struni branch. This one about the > pdb module. > > Should I start to report these bugs to the bug tracker, instead? At > this pace, I will flood the mailing list. :) > > -- Alexandre > > Python 3.0x (py3k-struni, Jun 5 2007, 18:41:44) > [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. 
> >>> raise TypeError > Traceback (most recent call last): > File "", line 1, in > TypeError > >>> import pdb > >>> pdb.pm() > > (1)() > (Pdb) help > > Documented commands (type help ): > ======================================== > Traceback (most recent call last): > File "", line 1, in > File "/home/alex/src/python.org/py3k-struni/Lib/pdb.py", line 1198, in pm > post_mortem(sys.last_traceback) > File "/home/alex/src/python.org/py3k-struni/Lib/pdb.py", line 1195, > in post_mortem > p.interaction(t.tb_frame, t) > File "/home/alex/src/python.org/py3k-struni/Lib/pdb.py", line 192, > in interaction > self.cmdloop() > File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 139, in cmdloop > stop = self.onecmd(line) > File "/home/alex/src/python.org/py3k-struni/Lib/pdb.py", line 242, in onecmd > return cmd.Cmd.onecmd(self, line) > File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 216, in onecmd > return func(arg) > File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 336, in do_help > self.print_topics(self.doc_header, cmds_doc, 15,80) > File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 345, > in print_topics > self.columnize(cmds, maxcol-1) > File "/home/alex/src/python.org/py3k-struni/Lib/cmd.py", line 361, > in columnize > ", ".join(map(str, nonstrings))) > TypeError: list[i] not a string for i in 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, > 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, > 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, > 44, 45 > >>> > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From alexandre at peadrop.com Wed Jun 6 02:14:08 2007 From: alexandre at peadrop.com (Alexandre Vassalotti) Date: Tue, 5 Jun 2007 20:14:08 -0400 Subject: [Python-3000] pdb help is broken in py3k-struni branch In-Reply-To: References: Message-ID: On 6/5/07, Guido van Rossum wrote: > I'd rather see them here than in SF, SF is a pain to use. > > But unless the bugs prevent you from proceeding, you could also ignore them. The first bug that I reported today (the one about `make`) stop me from running the test suite. So, can't really test the _string_io and _bytes_io modules. > There are 96 failing unit tests right now in that branch -- no need to > report all of them. Ah, well. Then, running the test suite wouldn't really useful, after all. Thanks, -- Alexandre From guido at python.org Wed Jun 6 02:27:45 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 5 Jun 2007 17:27:45 -0700 Subject: [Python-3000] pdb help is broken in py3k-struni branch In-Reply-To: References: Message-ID: On 6/5/07, Alexandre Vassalotti wrote: > On 6/5/07, Guido van Rossum wrote: > > I'd rather see them here than in SF, SF is a pain to use. > > > > But unless the bugs prevent you from proceeding, you could also ignore them. > > The first bug that I reported today (the one about `make`) stop me > from running the test suite. So, can't really test the _string_io and > _bytes_io modules. I tried to reproduce it but it works fine for me -- I'm on Ubuntu dapper (with some Google mods) on a 2.6.18.5-gg4 kernel. 
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From python at zesty.ca Wed Jun 6 03:21:53 2007 From: python at zesty.ca (Ka-Ping Yee) Date: Tue, 5 Jun 2007 20:21:53 -0500 (CDT) Subject: [Python-3000] PEP 3131 roundup In-Reply-To: <4665B4CF.2050107@v.loewis.de> References: <4665B4CF.2050107@v.loewis.de> Message-ID: > > A. Should identifiers be allowed to contain any Unicode letter? > > Not an open issue; the PEP has been accepted. The items listed under "A." are concerns that I wanted to be noted in the PEP, so thanks for listing them. > > B. Should the default behaviour accept only ASCII identifiers, or > > should it accept identifiers containing non-ASCII characters? > > Added as an open issue. > > > C. Should non-ASCII identifiers be optional? > > How is that different from B? C asks "should there be an on/off switch"; B asks whether the default should be on or off. > > D. Should the identifier character set be configurable? > > Still seems to be the same open issue. D asks "should you be able to select which character set you want", which is finer-grained than an all-or-nothing switch. > > G. Should source code be required to be in normalized form? > > Should I add a section "Rejected ideas"? This is out of scope of the PEP. It seems to me that the issue is directly related -- since the PEP intends to change the definition of acceptable source code, ought we not to settle what we're going to accept? To your earlier question of "what about non-UTF-8 files", I imagine that the normalization restriction would apply to the decoded characters. That is, once you know the source code encoding, there's a one-to-one mapping between the sequence of bytes in the source file and the sequence of characters to be parsed. Thus, two references to the same identifier will be represented by exactly the same bytes in the source file (you can't have different byte sequences in the source file alias to the same identifier). -- ?!ng From jimjjewett at gmail.com Wed Jun 6 03:47:40 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Tue, 5 Jun 2007 21:47:40 -0400 Subject: [Python-3000] PEP 3131 roundup In-Reply-To: References: <4665B4CF.2050107@v.loewis.de> Message-ID: On 6/5/07, Ka-Ping Yee wrote: > > > G. Should source code be required to be in normalized > > > form? ... > To your earlier question of "what about non-UTF-8 files", I > imagine that the normalization restriction would apply to the > decoded characters. That is, once you know the source code > encoding, there's a one-to-one mapping between the > sequence of bytes in the source file and the sequence of > characters to be parsed. One of the unicode goals is that a given sequence of bytes in the source encoding will round-trip to a corresponding sequence of bytes in unicode. But that corresponding sequence will not always be in Normal form; normalization may prevent an (unchanged) round-trip. Even if they can produce the "correct" form, it may not be as easy. If someone's keyboard easily produces the "wrong" form, I don't want to give them syntax errors for something that can be automatically corrected. > Thus, two references to the same identifier will be > represented by exactly the same bytes in the source > file (you can't have different byte sequences in the source > file alias to the same identifier). The bytes -- and possibly even the original character -- can still be different between different files (with different encodings), even if they reference the same (imported) identifier. 
I think (limited, source) aliasing is something we just have to accept
with unicode. I believe the best we can do is to say: Python will
normalize, so if two identifiers are canonically equivalent, you won't
get any rare impossible-to-debug inequality showing as an
AttributeError.

Ideally, that "canonical equivalence" would extend to strings (or at
least be done automatically before hashing).

Ideally, either that equivalence would also include compatibility, or
else characters whose compatibility and canonical equivalents are
different would be banned for use in identifiers.

-jJ

From showell30 at yahoo.com Wed Jun 6 03:49:59 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Tue, 5 Jun 2007 18:49:59 -0700 (PDT)
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: 
Message-ID: <738986.24163.qm@web33513.mail.mud.yahoo.com>

--- Ka-Ping Yee wrote:
>
> > > B. Should the default behaviour accept only ASCII identifiers, or
> > > should it accept identifiers containing non-ASCII characters?
> >
> > Added as an open issue.
> [...]

Martin, I hope you close out this issue, and just make a firm, explicit
stance that PEP 3131 accepts non-ascii identifiers as the default, even
though I'm 60/40 against it. Guido has already posted some comments
that suggest that he is behind the already implicit idea from the PEP
that unicode would be the default.

Then I would change the open issue to be how best to address ascii
users who want to revert to an ascii-only mode. A simple environment
variable like ASCII_ONLY would do the trick.

> C asks "should there be an on/off switch"; B asks whether the
> default should be on or off.
>
> > > D. Should the identifier character set be configurable?
> >
> > Still seems to be the same open issue.
>
> D asks "should you be able to select which character set you want",
> which is finer-grained than an all-or-nothing switch.
>

I agree with the importance of this distinction. For example, in my
American corporate day job, on question (B), I'm 90/10 on ascii-only,
and (D) is a total non-issue to me, because at least in the short term,
I could probably deal with the few Unicode-identifierified modules that
I ever needed using some kind of very coarse workaround.

In a more international context, such as trying to get more
international users for some open source app I'd written, I'd be 90/10
on unicode-tolerance, and (D) would be much more of an important issue
for me, because it could affect usability of the app.

(B) and (D) really address two different classes of users, and I think
both groups could reasonably include a lot of opponents to PEP 3131 as
currently written.

Cheers,

Steve

P.S. Martin, thanks for adding the objections to the PEP. I really
think it's good to have it for the records. Maybe five years from now,
we'll look back on it and wonder what the heck we were thinking. :)

From showell30 at yahoo.com Wed Jun 6 04:24:18 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Tue, 5 Jun 2007 19:24:18 -0700 (PDT)
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: 
Message-ID: <461528.7704.qm@web33514.mail.mud.yahoo.com>

--- Jim Jewett wrote:
>
> Ideally, either that equivalence would also include compatibility, or
> else characters whose compatibility and canonical equivalents are
> different would be banned for use in identifiers.
>

Current Python has the precedence that color/colour are treated as two
separate identifiers, as are metre/meter, despite the equivalence of
"o" to "ou" and "re" to "er," and I don't think that burns too many
people.

So I'm +1 on the unquoted third option, that canonically equivalent,
but differently encoded, Unicode characters are allowed yet treated as
different.

Am I stretching the analogy too far?

From stephen at xemacs.org Wed Jun 6 05:01:10 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 06 Jun 2007 12:01:10 +0900
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: 
References: <4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <876461yefd.fsf@uwakimon.sk.tsukuba.ac.jp>

Rauli Ruohonen writes:
> On 6/5/07, Stephen J. Turnbull wrote:
> > I'd love to get rid of full-width ASCII and halfwidth kana (via
> > compatibility decomposition).
>
> If you do forbid compatibility characters in identifiers, then they
> should be flagged as an error, not converted silently.

No. The point is that people want to use their current tools; they
may not be able to easily specify normalization. We should provide
tools to pick this lint from programs, but the normalization should be
done inside of Python, not by the user.

Please look through the list (I've already done so; I'm speaking from
detailed examination of the data) and state what compatibility
characters you want to keep.

On reflection, I would make an exception for LATIN L WITH MIDDLE DOT
(both cases); just don't decompose it for the sake of Catalan. (And
there possibly should be a warning for L followed by MIDDLE DOT.)

But as a native English speaker and one who lectures and deals with
the bureaucracy in Japanese, I can tell you unequivocally I want the
fi and ffi ligatures and full-width ASCII compatibility decomposed,
and as a daily user of several Japanese input methods, I can tell you
it would be a massive pain in the ass if Python doesn't convert those,
and errors would be an on-the-minute-every-minute annoyance.

> Unicode, and adding extra equivalences (whether it's "FoO" == "foo",
> "??" == "??" or "????" == "A123") is surprising.

How many Japanese documents do you deal with on a daily basis?

I live with the half-width kana and full-width ASCII every day, and
they are simply an annoyance to me and to everybody I know. They are
treated as font variants, not different characters, by *all* users.
Users are quite happy to substitute ultra-wide ASCII fonts for JIS X
0208 ASCII, or ultra-condensed fonts for JIS X 0201 kana.

Japanese don't expect equivalence, but that's because it's too much
effort for the programmers when nobody is asking for it; the users are
unsophisticated and don't demand it. But where equivalence is provided
on web forms and the like, people are indeed surprised, they are
*impressed*. "Wow! Gaijin magic! How'd he *do* that?!" They *hate* the
fact that some forms want the postal code entered in JIS X 0208
full-width digits while others want ASCII (and I've even seen a form
that expected the address, including the yuubin mark, to be in
full-width JIS, but the postal code itself, embedded in the address,
had to be entered in ASCII or the form couldn't parse it).
> In short, I would like this function to return 'OK' or be a > syntax error, but it should not fail or return something else: > > def test(): > if 'A' == '?': return 'OK' > A = 'O' > ? = 'K' # as tested above, 'A' and '?' are not the same thing > return locals()['A']+locals()['?'] I would like this code to return "KK". This might be an unpleasant surprise, once, and there would need to be a warning on the box for distribution in Japan (and other cultures with compatibility decompositions). On the other hand, diffusion of non-ASCII identifiers at best will be moderately paced; people will have to learn about usage and will have time to get used to it. From stephen at xemacs.org Wed Jun 6 05:44:59 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 06 Jun 2007 12:44:59 +0900 Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures? In-Reply-To: References: <4664E238.9020700@v.loewis.de> <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp> <466595C5.6070301@v.loewis.de> Message-ID: <874pllycec.fsf@uwakimon.sk.tsukuba.ac.jp> Jim Jewett writes: > > Not sure what the proposal is here. If people say "we want the PEP do > > NFKC", I understand that as "instead of saying NFC, it should say > > NFKC", which in turn means "all identifiers are converted into the > > normal form NFKC while parsing". > > I would prefer that. +1 > > With that change, the full-width ASCII characters would still be > > allowed in source - they just wouldn't be different from the regular > > ones anymore when comparing identifiers. > > I *think* that would be OK; +1 For the case of Japanese compatibility characters, this would make it much easier to teach use of non-ASCII identifiers ("sensei, sensei, do I use full-width numbers or half-width numbers?" "Whatever you like, kid, whatever you like."), and eliminate a common source of typos for neophytes and experienced typists alike. Rauli Ruohonen disagrees pretty strongly. While I suspect I have a substantial edge over Rauli in experience with daily use of Japanese, that worries me. I will be polling my students (for "would you be more interested in learning Python if ...") and my more or less able-to-program colleagues. BTW -- Martin, what about numeric tokens? I don't expect ideographic numbers to be translated to decimal, but if full-width "ABC123" is decomposed to halfwidth as an identifier, I think Japanese will expect a literal full-width "123" to be recognized as the decimal number 123 (and similarly for e notation for floating point). I really think this should be in the scope of this PEP. (Feel free to count it as a reason against NFKC, if that simplifies things for you.) > so long as they mean the same thing, it is just a quirk like using > a different font. I am slightly concerned that it might mean > "string as string" and "string as identifier" have different tests > for equality. It does mean that; see Rauli's code. Does anybody know if this bothers LISP users, where identifiers are case-insensitive? (My Emacs LISP experience is useless, since identifiers are case-sensitive.) We will need (possibly external) tools to warn about such decompositions, and a sophisticated tool should warn about accesses to identifier dictionaries in the presence of such decompositions as well. > > Another option would be to require that the source is in NFKC already, > > where I then ask again what precisely that means in presence of > > non-UTF source encodings. I don't think this is a good idea. 
NB: if there's substantial resistance from users of some of the other classes of compatibility characters, I have an acceptable fallback. NFC plus external tools to audit for NFKC would be usable, and for the character sets I'm likely to encounter, it would be well-defined for the usual encodings. From turnbull at sk.tsukuba.ac.jp Wed Jun 6 06:19:33 2007 From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull) Date: Wed, 06 Jun 2007 13:19:33 +0900 Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures? In-Reply-To: References: <4664E238.9020700@v.loewis.de> <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <873b15yasq.fsf@uwakimon.sk.tsukuba.ac.jp> Jim Jewett writes: > On 6/5/07, Stephen J. Turnbull wrote: > > > It seems to me that what UAX#31 is saying is "Distinguishing (or not) > > between 0035 DIGIT 3 and 2075 SUPERSCRIPT 3 should be > > equivalent to distinguishing (or not) between LATIN CAPITAL > > LETTER A and LATIN SMALL LETTER A." I don't know that > > I agree (or disagree) in principle. > > So effectively, they consider "a" and "A" to be presentational variants. Well, no, they're pretty explicit that they have semantic content, as do superscripts. This is different from the Arabic initial, medial, and final forms, ligatures, the Croatian digraphs, and the Japanese double-byte ASCII, where there is no semantic content (not even word division for Arabic AFAIK), use is just required by "the rules" (for Arabic) or is 100% at the discretion of the user (ASCII variants). > In some languages, certain presentational variants are used depending > on word position. I think the ID_START property does exclude letters > that cannot appear in an initial position, but putting a final > character in the middle or vice versa would still be wrong. Good point. I'm going to interview some Arabic speakers who I believe have some programming skills; I'll add that to the list. > If identifiers are built up in the equivalent of > > handler="do_" + name I think this is pretty likely, and one of the attractions of languages like Python. > The folding rules do say that it is OK (even good) to exclude certain > characters from certain foldings; I think we could preserve case > (including title-case?) as the only presentational variant we > recognize. AFAICS from looking at the V2 table, case is an *analogy* used by UAX#31 to clarify when NKFC is useful. NKFC itself does not fold case, it is considered appropriate if you have a language that folds case anyway. > http://www.unicode.org/versions/corrigendum3.html suggests that many > of the Hangul are either pronunciation guide variants or even exact > duplicates (that were presumably missed when the canonicalization was > frozen?) I'll have to ask some Koreans what they would use. > """It is recommended that all Arabic presentation forms be excluded > from identifiers in any event, although only a few of them must be > excluded for normalization to guarantee identifier closure.""" Cool. I'll ask that, too. > Depends on what you mean by technical symbols. Eg, the letterlike symbols (DEGREE CELSIUS), the number forms (ROMAN NUMERAL ONE), and the APL set (2336--237A) in the BMP. [[ I really need to put together some tools to access that database from XEmacs.... ]] > IMHO, many of them are in fact listed as ID characters. The math > versions (generally 1D400 - 1DC7B) are included. But > http://unicode.org/reports/tr39/data/xidmodifications.txt suggests > excluding them again. 
I'm not really worried about people using characters outside the BMP very often, any more than people use an embedded comma in LISP identifiers or file names (eg RCS ,v), unless they use a script lately admitted to Unicode, or if they just wish to tempt the wrath of the gods. The former will not have a problem, and the latter can look out for themselves, I'm sure. From stephen at xemacs.org Wed Jun 6 06:41:28 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 06 Jun 2007 13:41:28 +0900 Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers In-Reply-To: <4665B7D7.6030501@v.loewis.de> References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de> <46646E76.8060804@v.loewis.de> <4665189D.4020301@v.loewis.de> <46659A37.4000900@v.loewis.de> <4665B7D7.6030501@v.loewis.de> Message-ID: <871wgpy9s7.fsf@uwakimon.sk.tsukuba.ac.jp> "Martin v. L?wis" writes: > > TR 15, section 19, numbered paragraph 3 > > """ > > Higher-level processes that transform or compare strings, or that > > perform other higher-level functions, must respect canonical > > equivalence or problems will result. > > """ > > That's not a mandatory requirement, but an "important aspect". Also, > it applies to "higher-level processes"; I would expect that string > comparison is not a higher-level function. Indeed, UAX#15 only > gives definitions, no rules. In the language of these standards, I would expect that string comparison is exactly the kind of higher-level process they have in mind. In fact, it is given as an example in what Jim quoted above. > > C9 A process shall not assume that the interpretations of two > > canonical-equivalent character sequences are distinct. > > Right. What is "a process"? Anything that accepts Unicode on input or produces it on output, and claims to conform to the standard. > > ... > > Ideally, an implementation would always interpret two > > canonical-equivalent character sequences identically. There are > > practical circumstances under which implementations may reasonably > > distinguish them. > > """ > > So it should be the application's choice. I don't think so. I think the kind of practical circumstance they have in mind is (eg) a Unicode document which is PGP-signed. PGP clearly will not be able to verify a canonicalized document, unless it happened to be in canonical form when transmitted. But I think it is quite clear that they do not admit that an implementation might return False when evaluating u"L\u00F6wis" == u"Lo\u0308wis". > So this *allows* to canonicalize strings, it doesn't *require* Python > to do so. Indeed, doing so would be fairly expensive, and therefore > it should not be done (IMO). It would be much more expensive to make all string comparisons grok canonical equivalence. That's why it *allows* canonicalization. Otherwise the PGP signature case would suggest that canonicalization should be forbidden (except where that is part of the definition of the process), and canonical equivalencing be done at the site of each comparison. You are correct that this is outside the scope of PEP 3131, but I don't want your interpretation of "Unicode conformance" (which I believe to be incorrect) to go unchallenged. 
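[To make the trade-off in the exchange above concrete -- normalize
once versus make every comparison equivalence-aware -- here is a
minimal sketch of the "canonicalize on creation" strategy, in Python 3
syntax, using only the stdlib unicodedata module. NFCString is a
hypothetical name for illustration, not an existing type.]

import unicodedata

class NFCString(str):
    """Hypothetical helper: canonicalize once, on creation."""
    def __new__(cls, value):
        # After NFC normalization, plain == (a raw code point
        # comparison) respects canonical equivalence with no extra
        # work at comparison time.
        return str.__new__(cls, unicodedata.normalize("NFC", value))

a = NFCString("L\u00F6wis")   # precomposed o-with-diaeresis, U+00F6
b = NFCString("Lo\u0308wis")  # 'o' followed by COMBINING DIAERESIS
print(a == b)                 # True: both canonicalized to one sequence

[Under this scheme the normalization cost is paid once per string;
every later ==, hash, or dict lookup then respects canonical
equivalence for free, which is one reading of why the standard allows
canonicalization rather than requiring equivalence-aware comparison.]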
From martin at v.loewis.de Wed Jun 6 07:15:21 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 06 Jun 2007 07:15:21 +0200 Subject: [Python-3000] PEP 3131 roundup In-Reply-To: References: <4665B4CF.2050107@v.loewis.de> Message-ID: <466642E9.1020505@v.loewis.de> > I think "obvious" referred to the reasoning, not the outcome. > > I can tell that the decision was "NFC, anything goes", but I don't see why. I think I'm repeating myself: Because UAX 31 says so. That's it. There is a standard that experts in the domain have specified, and PEP 3131 follows it. Following standards is a good thing, deviating from them is a bad thing. > (2) > I cannot understand why ID_START/CONTINUE was chosen instead of the > newer and more recommended XID_START/CONTINUE. From UAX31 section 2: > """ > The XID_Start and XID_Continue properties are improved lexical classes > that incorporate the changes described in Section 5.1, NFKC > Modifications. They are recommended for most purposes, especially for > security, over the original ID_Start and ID_Continue properties. > """ Right. I read it that these should be used when 5.1 is considered in the language. This, in turn, should be used when the normalization form is NFKC: """ Where programming languages are using NFKC to fold differences between characters, they need the following modifications of the identifier syntax from the Unicode Standard to deal with the idiosyncrasies of a small number of characters. These modifications are reflected in the XID_Start and XID_Continue properties. """ As the PEP does not use NFKC (currently), it should not use XID_Start and XID_Continue either. > Nor can I understand why the additional restrictions in > xidmodifications (from TR39) were ignored. Consideration of UTR 39 is listed as an open issue. One problem with it is that using it would restrict the language over time, so that previously correct programs might not be correct anymore in a future version. So using it might break backwards compatibility. Regards, Martin From stephen at xemacs.org Wed Jun 6 07:28:36 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 06 Jun 2007 14:28:36 +0900 Subject: [Python-3000] PEP 3131 roundup In-Reply-To: <461528.7704.qm@web33514.mail.mud.yahoo.com> References: <461528.7704.qm@web33514.mail.mud.yahoo.com> Message-ID: <87zm3dwt17.fsf@uwakimon.sk.tsukuba.ac.jp> Steve Howell writes: > So I'm +1 on the unquoted third option, that canonically > equivalent, but differently encoded, Unicode characters are allowed > yet treated as different. > > Am I stretching the analogy too far? Yes. By definition, that is nonconformant to the standard. Canonically equivalent sequences are *identical characters* in Unicode. The difference you are talking about is equivalent to the differences among "7", "07", and "0x7" as C numeric literals. They look different, but their semantics is identical in the program. Pragmatically, if you have an editor which normally produces NFD, and another which normally produces NFC, those programs will not be link-compatible under your program, yet both editors will present the user with identical displays. From rauli.ruohonen at gmail.com Wed Jun 6 09:09:43 2007 From: rauli.ruohonen at gmail.com (Rauli Ruohonen) Date: Wed, 6 Jun 2007 10:09:43 +0300 Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures? 
In-Reply-To: <876461yefd.fsf@uwakimon.sk.tsukuba.ac.jp> References: <4664E238.9020700@v.loewis.de> <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp> <876461yefd.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 6/6/07, Stephen J. Turnbull wrote: > No. The point is that people want to use their current tools; they > may not be able to easily specify normalization. > Please look through the list (I've already done so; I'm speaking from > detailed examination of the data) and state what compatibility > characters you want to keep. I cannot really say about code points I'm not familiar with, but I wouldn't use any of the ones I do know in identifiers. The only compatibility characters in ID_Continue I have used myself are, I think, halfwidth katakana and fullwidth alphanumerics. Examples: ? -> ? # halfwidth katakana ? -> x # fullwidth alphabetic ? -> 1 # fullwidth numeric Practically speaking I won't be using such things in my code. I don't like them but if it's more pragmatic to allow them then I guess it can't be helped. There are some cases where users might in the future want to make a distinction between "compatibility" characters, such as these: http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols If some day everyone writes their TeX using such things, then it'd make sense to allow and distinguish them in Python, too. For this reason I think that compatibility transformation, if any, should only be applied to characters where there's a practical reason to do so, and for other cases punting (=syntax error) is safest. When in doubt, refuse the temptation to guess. > as a daily user of several Japanese input methods, I can tell you it > would be a massive pain in the ass if Python doesn't convert those, > and errors would be an on-the-minute-every-minute annoyance. I use two Japanese input methods (MS IME and scim/anthy), but only the latter one daily. When I type text that mixes Japanese and other languages, I switch the input mode off when not typing Japanese. For code that uses a lot of Japanese this may not be convenient, but then you'd want to set your input method to use ASCII for ASCII anyway, as that would still be required in literals (???? or "?" won't work) and punctuation (??????????????? won't work). A code mixing fullwidth and halfwidth alphanumerics also looks horrible, but that's just a coding style issue :-) > > Unicode, and adding extra equivalences (whether it's "FoO" == "foo", > > "??" == > "??" or "????" == "A123") is surprising. > > How many Japanese documents do you deal with on a daily basis? Much fewer than you, as I don't live in Japan. I read a fair amount but don't type long texts in Japanese. When I do type, I usually use fullwidth alphanumerics except for foreign words that aren't acronyms. E.g. ??? but not ????????. For code, consistently using ASCII for ASCII would be the most predictable rule (TOOWTDI). You have to go out of your way to type halfwidth katakana, and it isn't really useful in identifiers IMHO. > They are treated as font variants, not different characters, by *all* > users. I think programmers in general expect identifier identity to behave the same way as string identity. In this way they are a special class of users. (those who use case-insensitive programming languages have all my sympathy :-) > I would like this code to return "KK". This might be an unpleasant > surprise, once, and there would need to be a warning on the box for > distribution in Japan (and other cultures with compatibility > decompositions). 
This won't have a big impact if you apply it only to carefully selected code points, and that way it sounds like a viable choice. Asking your students for input as you suggested is surely a good idea. From hfuerstenau at gmx.net Wed Jun 6 08:01:04 2007 From: hfuerstenau at gmx.net (=?ISO-8859-1?Q?Hagen_F=FCrstenau?=) Date: Wed, 06 Jun 2007 08:01:04 +0200 Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures? In-Reply-To: <873b15yasq.fsf@uwakimon.sk.tsukuba.ac.jp> References: <4664E238.9020700@v.loewis.de> <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp> <873b15yasq.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <46664DA0.9070307@gmx.net> Stephen J. Turnbull writes: > > http://www.unicode.org/versions/corrigendum3.html suggests that many > > of the Hangul are either pronunciation guide variants or even exact > > duplicates (that were presumably missed when the canonicalization was > > frozen?) > > I'll have to ask some Koreans what they would use. The Windows Korean Input Method chooses between Unified Han and Compatibility characters based on the reading you use to enter them. So I guess most Koreans won't be aware of what variant they're using at any given moment. Seems to me that NFKC would be essential here. From stephen at xemacs.org Wed Jun 6 10:26:33 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 06 Jun 2007 17:26:33 +0900 Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures? In-Reply-To: References: <4664E238.9020700@v.loewis.de> <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp> <876461yefd.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87tztlwksm.fsf@uwakimon.sk.tsukuba.ac.jp> Rauli Ruohonen writes: > There are some cases where users might in the future want to make > a distinction between "compatibility" characters, such as these: > http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols I don't think they belong in identifiers in a general purpose programming language, though their usefulness to mathematical printers is obvious. I think programs should be verbalizable, unlike math where most of the text is not intended to correspond to any reality, but is purely syntactic transformation. > For this reason I think that compatibility transformation, if any, > should only be applied to characters where there's a practical > reason to do so, and for other cases punting (=syntax error) is > safest. "Banzai Python!" and all that, but even if Python is in use 10,000 years from now, I think compatibility characters will still be a YAGNI. I admit that's a reasonable compromise, and allows future extension without gratuitously making existing programs illegal; I could live with it very easily (but I'd want those full-width ASCII decomposed :-). I just feel it would be wiser to limit Python identifiers to NFKC. > I use two Japanese input methods (MS IME and scim/anthy), but only the > latter one daily. When I type text that mixes Japanese and other > For code that uses a lot of Japanese this may not be convenient, > but then you'd want to set your input method to use ASCII for ASCII > anyway, Both of those address the issue of the annoyance of syntax errors in original code to a great extent, but not in debug/maintenance mode where you only type a few characters of code at a time, and typically enter from user mode. > You have to go out of your way to type halfwidth katakana, and it > isn't really useful in identifiers IMHO. I agree, but then I don't work for the Japanese Social Security Administration. 
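[For concreteness, this is what the NFC/NFKC choice under discussion
does to a few of the compatibility characters mentioned in this
thread. A quick sketch with the stdlib unicodedata module; the escapes
name standard fullwidth, halfwidth, and ligature code points.]

import unicodedata

fullwidth = "\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13"  # fullwidth 'ABC123'
halfwidth_ka = "\uFF76"       # HALFWIDTH KATAKANA LETTER KA
ligature = "e\uFB03cient"     # 'e' + LATIN SMALL LIGATURE FFI

# NFC leaves compatibility characters alone; NFKC folds them.
print(unicodedata.normalize("NFC", fullwidth) == fullwidth)     # True
print(unicodedata.normalize("NFKC", fullwidth))                 # ABC123
print(unicodedata.normalize("NFKC", halfwidth_ka) == "\u30AB")  # True: ordinary KA
print(unicodedata.normalize("NFKC", ligature))                  # efficient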
From rauli.ruohonen at gmail.com Wed Jun 6 10:50:08 2007 From: rauli.ruohonen at gmail.com (Rauli Ruohonen) Date: Wed, 6 Jun 2007 11:50:08 +0300 Subject: [Python-3000] String comparison Message-ID: (Martin's right, it's not good to discuss this in the huge PEP 3131 thread, so I'm changing the subject line) On 6/6/07, Stephen J. Turnbull wrote: > In the language of these standards, I would expect that string > comparison is exactly the kind of higher-level process they have in > mind. In fact, it is given as an example in what Jim quoted above. > > > > C9 A process shall not assume that the interpretations of two > > > canonical-equivalent character sequences are distinct. > > > > Right. What is "a process"? > > Anything that accepts Unicode on input or produces it on output, and > claims to conform to the standard. Strings are internal to Python. This is a whole separate issue from normalization of source code or its parts (such as identifiers). Once you have read in a text file and done the normalizations you want to, what you have left is an internal representation in memory, which may be anything that's convenient to the programmer. The question is, what is convenient to the programmer? > > > Ideally, an implementation would always interpret two > > > canonical-equivalent character sequences identically. There are > > > > So it should be the application's choice. > > I don't think so. I think the kind of practical circumstance they > have in mind is (eg) a Unicode document which is PGP-signed. PGP > clearly will not be able to verify a canonicalized document, unless it > happened to be in canonical form when transmitted. But I think it is > quite clear that they do not admit that an implementation might return > False when evaluating u"L\u00F6wis" == u"Lo\u0308wis". It is up to Python to define what "==" means, just like it defines what "is" means. It may be canonical equivalence for strings, but then again it may not. It depends on what you want, and what you think strings are. If you think they're sequences of code points, which is what they act like in general (assuming UTF-32 was selected at compile time), then bitwise comparison is quite consistent whether the string is in normalized form or not. Handling strings as sequences of code points is the most general and simple thing to do, but there are other options. One is to simply change comparison to be collation (and presumably also make regexp matching and methods like startswith consistent with that). Another is to always keep strings in a specific normalized form. Yet another is to have another type for strings-as-grapheme-sequences, which would strictly follow user expectations for characters (= graphemes), such as string length and indexing, comparison, etc. Changing just the comparison has the drawback that many current string invariants break. a == b would no longer imply any of len(a) == len(b), set(a) == set(b), a[i:j] == b[i:j], repr(a) == repr(b). You'd also have to use bytes for any processing of code point sequences (such as XML processing), because most operations would act as if you had normalized your strings (including dictionary and set operations), and if you have to do contortions to avoid problems with that, then it's easier to just use bytes. There would also be security implications with strings comparing equal but not always quite acting equal. Always doing normalization would still force you to use bytes for processing code point sequences (e.g. XML, which must not be normalized), which is not nice. 
It's also not nice to force a particular normalization on the programmer, as another one may be better for some uses. E.g. an editor may be simpler to implement if everything is consistently decomposed (NFD), but for most communication you'd want to use NFC, as you would for many other processing (e.g. the "code point == grapheme" equation is perfectly adequate for many purposes with NFC, but not with NFD). Having a type for grapheme sequences would seem like the least problematic choice, but there's little demand for such a type. Most intelligent Unicode processing doesn't use a grapheme representation for performance reasons, and in most other cases the "code point == grapheme" equation or treatment of strings as atoms is adequate. The standard library might provide this type if necessary. > > So this *allows* to canonicalize strings, it doesn't *require* Python > > to do so. Indeed, doing so would be fairly expensive, and therefore > > it should not be done (IMO). > > It would be much more expensive to make all string comparisons grok > canonical equivalence. That's why it *allows* canonicalization. FWIW, I don't buy that normalization is expensive, as most strings are in NFC form anyway, and there are fast checks for that (see UAX#15, "Detecting Normalization Forms"). Python does not currently have a fast path for this, but if it's added, then normalizing everything to NFC should be fast. From turnbull at sk.tsukuba.ac.jp Wed Jun 6 14:33:19 2007 From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull) Date: Wed, 06 Jun 2007 21:33:19 +0900 Subject: [Python-3000] String comparison In-Reply-To: References: Message-ID: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> Rauli Ruohonen writes: > Strings are internal to Python. This is a whole separate issue from > normalization of source code or its parts (such as identifiers). Agreed. But please note that we're not talking about representation. We're talking about the result of evaluating a comparison: if u"L\u00F6wis" == u"Lo\u0308wis": print "Python is Unicode conforming in this respect." else: print "I guess it's time to start learning Ruby." I think it's reasonable to be astonished if Python doesn't at least try to print "Python is Unicode conforming in this respect." for the above snippet by default. > It is up to Python to define what "==" means, just like it defines > what "is" means. You are of course correct. However, if given that u prefix Python chooses to define == in a way that does not respect canonical equivalence, what's the point of having these things? > Always doing normalization would still force you to use bytes for > processing code point sequences (e.g. XML, which must not be > normalized), which is not nice. I'm not talking about "nice" yet, just about Unicode conformance. How to implement conformant behavior is of course entirely up to Python. As is choosing *whether* to conform or not, but it seems bizarre to me that one might choose to implement UAX#31 verbatim, and also have u"L\u00F6wis" == u"Lo\u0308wis" evaluate to False. > FWIW, I don't buy that normalization is expensive, as most strings are > in NFC form anyway, and there are fast checks for that (see UAX#15, > "Detecting Normalization Forms"). Python does not currently have > a fast path for this, but if it's added, then normalizing everything > to NFC should be fast. If O(n) is "fast". 
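[The "fast check" Rauli refers to is the UAX #15 quick-check, which
consults per-character NFC_Quick_Check properties; the sketch below
substitutes a much cruder shortcut that is still safe, since
pure-ASCII text is always in NFC. As Stephen notes, even the fast path
remains a linear scan. nfc() is a hypothetical helper, not an existing
API.]

import unicodedata

def nfc(s):
    # Crude fast path: ASCII-only strings are already in NFC, so the
    # table-driven normalization (and its fresh string allocation)
    # can be skipped for the common case. The scan is still O(n).
    if all(ord(c) < 128 for c in s):
        return s
    return unicodedata.normalize("NFC", s)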
From jcarlson at uci.edu Wed Jun 6 17:57:45 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 06 Jun 2007 08:57:45 -0700
Subject: [Python-3000] String comparison
In-Reply-To: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <20070606084543.6F3D.JCARLSON@uci.edu>

"Stephen J. Turnbull" wrote:
> Rauli Ruohonen writes:
>
> > Strings are internal to Python. This is a whole separate issue from
> > normalization of source code or its parts (such as identifiers).
>
> Agreed. But please note that we're not talking about representation.
> We're talking about the result of evaluating a comparison:
>
> if u"L\u00F6wis" == u"Lo\u0308wis":
>     print "Python is Unicode conforming in this respect."
> else:
>     print "I guess it's time to start learning Ruby."
>
> I think it's reasonable to be astonished if Python doesn't at least
> try to print "Python is Unicode conforming in this respect." for the
> above snippet by default.
>
> > It is up to Python to define what "==" means, just like it defines
> > what "is" means.
>
> You are of course correct. However, if given that u prefix Python
> chooses to define == in a way that does not respect canonical
> equivalence, what's the point of having these things?

Maybe I'm missing something, but it seems to me that there might be a
simple solution. Don't normalize any identifiers or strings.

Hear me out for a moment. People type what they want. Isn't that the
whole point of PEP 3131? If they don't know what they want, then that
is as much a problem with display/representation as anything else that
we have discussed. Any of the flagging methods could easily disable
things like u"o\u0308" for identifiers to force them to be in the "one
true form" to begin with.

As for strings, I think we should opt for keeping it as simple as
possible. Compare by code points. To handle normalization issues, add
a normalization method that people call if they care about normalized
unicode strings*. If at some point we think that normalization should
happen on identifiers by default, all we need to do is to call
st.normalize() on any string that is used for getattr, and/or could
use a subclass of dict to make it happen automatically.

- Josiah

* Or leave out normalization altogether in 3.0. I haven't heard any
complaints about the lack of normalization in Python so far (though
maybe I'm not reading the right python-list messages), and Python has
had unicode for what, almost 10 years now?

From guido at python.org Wed Jun 6 18:46:13 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 6 Jun 2007 09:46:13 -0700
Subject: [Python-3000] String comparison
In-Reply-To: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID:

On 6/6/07, Stephen J. Turnbull wrote:
> Rauli Ruohonen writes:
>
> > Strings are internal to Python. This is a whole separate issue from
> > normalization of source code or its parts (such as identifiers).
>
> Agreed. But please note that we're not talking about representation.
> We're talking about the result of evaluating a comparison:
>
> if u"L\u00F6wis" == u"Lo\u0308wis":
>     print "Python is Unicode conforming in this respect."
> else:
>     print "I guess it's time to start learning Ruby."
>
> I think it's reasonable to be astonished if Python doesn't at least
> try to print "Python is Unicode conforming in this respect." for the
> above snippet by default.
Alas, you will remain astonished for a long time, and you're welcome
to try Ruby instead. I'm all for adding a way to do normalized string
comparisons to the library. But I'm not about to change the ==
operator to apply normalization first. It would affect too much (e.g.
hashing).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From janssen at parc.com Wed Jun 6 19:12:56 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 6 Jun 2007 10:12:56 PDT
Subject: [Python-3000] String comparison
In-Reply-To: <20070606084543.6F3D.JCARLSON@uci.edu>
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu>
Message-ID: <07Jun6.101305pdt."57996"@synergy1.parc.xerox.com>

> Hear me out for a moment. People type what they want.

I do a lot of Pythonic processing of UTF-8, which is not "typed by
people", but instead extracted from documents by automated processing.
Text is also data -- an important thing to keep in mind.

As far as normalization goes, I agree with you about identifiers, and
I use "unicodedata.normalize" extensively in the cases where I care
about normalization of data strings. The big issue is string literals.
I think I agree with Stephen here:

u"L\u00F6wis" == u"Lo\u0308wis"

should be True (assuming he typed it correctly in the first place :-),
because they are the same Unicode string. I don't understand Guido's
objection here -- it's a lexer issue, right? The underlying character
string will still be the same in both cases.

But it's complicated. Clearly we expect

(u"abc" + u"def") == (u"a" + u"bcdef")

to be True, so

(u"L" + u"\u00F6" + u"wis") == (u"Lo" + u"\u0308" + u"wis")

should also be True. Where I see difficulty is

(u"L" + unichr(0x00F6) + u"wis") == (u"Lo" + unichr(0x0308) + u"wis")

I suppose unichr(0x0308) should raise an exception -- a combining
diacritic by itself shouldn't be convertible to a character.

Bill

From guido at python.org Wed Jun 6 19:37:47 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 6 Jun 2007 10:37:47 -0700
Subject: [Python-3000] String comparison
In-Reply-To: <-6248387165431892706@unknownmsgid>
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid>
Message-ID:

On 6/6/07, Bill Janssen wrote:
> > Hear me out for a moment. People type what they want.
>
> I do a lot of Pythonic processing of UTF-8, which is not "typed by
> people", but instead extracted from documents by automated processing.
> Text is also data -- an important thing to keep in mind.
>
> As far as normalization goes, I agree with you about identifiers, and
> I use "unicodedata.normalize" extensively in the cases where I care
> about normalization of data strings. The big issue is string literals.
> I think I agree with Stephen here:
>
> u"L\u00F6wis" == u"Lo\u0308wis"
>
> should be True (assuming he typed it correctly in the first place :-),
> because they are the same Unicode string. I don't understand Guido's
> objection here -- it's a lexer issue, right? The underlying character
> string will still be the same in both cases.

So let me explain it. I see two different sequences of code points:
'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
'w', 'i', 's' on the other. Never mind that Unicode has semantics that
claim they are equivalent. They are two different sequences of code
points.
We should not hide that Python's unicode string object can store each
sequence of code points equally well, and that when viewed as a
sequence they are different: the first has len() == 5, the second has
len() == 6! When read from a file they are different. Why should the
lexer apply normalization to literals behind my back? I might be
writing either literal with the expectation to get exactly that
sequence of code points, in order to use it as a test case or as input
for another program that requires specific input.

> But it's complicated. Clearly we expect
>
> (u"abc" + u"def") == (u"a" + u"bcdef")
>
> to be True, so
>
> (u"L" + u"\u00F6" + u"wis") == (u"Lo" + u"\u0308" + u"wis")
>
> should also be True. Where I see difficulty is
>
> (u"L" + unichr(0x00F6) + u"wis") == (u"Lo" + unichr(0x0308) + u"wis")
>
> I suppose unichr(0x0308) should raise an exception -- a combining
> diacritic by itself shouldn't be convertible to a character.

There's a simpler solution. The unicode (or str, in Py3k) data type
represents a sequence of code points, not a sequence of characters.
This has always been the case, and will continue to be the case.

Note that I'm not arguing against normalization of *identifiers*. I
see that as a necessity. I also see that there will be border cases
where getattr(x, 'XXX') and x.XXX are not equivalent for some values
of XXX where the normalized form is a different sequence of code
points. But I don't believe the solution should be to normalize all
string literals. Clearly we will have a normalization routine so the
lexer can normalize identifiers, so if you need normalized data it is
as simple as writing 'XXX'.normalize() (or whatever the spelling
should be).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rauli.ruohonen at gmail.com Wed Jun 6 20:18:53 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Wed, 6 Jun 2007 21:18:53 +0300
Subject: [Python-3000] String comparison
In-Reply-To:
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid>
Message-ID:

On 6/6/07, Guido van Rossum wrote:
> Why should the lexer apply normalization to literals behind my back?

The lexer shouldn't, but NFC normalizing the source before the lexer
sees it would be slightly more robust and standards-compliant. This is
because technically an editor or any other program is allowed by the
Unicode standard to apply any normalization or other canonical
equivalent replacement it sees fit, and other programs aren't supposed
to care. The standard even says that such differences should be
rendered in an indistinguishable way. Practically everyone uses NFC,
though.

> There's a simpler solution. The unicode (or str, in Py3k) data type
> represents a sequence of code points, not a sequence of characters.
> This has always been the case, and will continue to be the case.

This is how Java and ICU (http://www.icu-project.org/) do it, too. The
latter is a library specifically designed for processing Unicode text.
Both Java and ICU are even mentioned in the Unicode FAQ.

> Clearly we will have a normalization routine so the lexer can
> normalize identifiers, so if you need normalized data it is
> as simple as writing 'XXX'.normalize() (or whatever the spelling
> should be).

The routine is at the moment at unicodedata.normalize.
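For example, here is the current spelling in action (an interactive
sketch; reprs as Python 2.5 prints them):

>>> import unicodedata
>>> unicodedata.normalize("NFC", u"Lo\u0308wis")
u'L\xf6wis'
>>> unicodedata.normalize("NFD", u"L\u00F6wis")
u'Lo\u0308wis'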
From martin at v.loewis.de Wed Jun 6 20:21:10 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Wed, 06 Jun 2007 20:21:10 +0200
Subject: [Python-3000] String comparison
In-Reply-To:
References:
Message-ID: <4666FB16.2070209@v.loewis.de>

> FWIW, I don't buy that normalization is expensive, as most strings are
> in NFC form anyway, and there are fast checks for that (see UAX#15,
> "Detecting Normalization Forms"). Python does not currently have
> a fast path for this, but if it's added, then normalizing everything
> to NFC should be fast.

That would be useful to have, anyway. Would you like to contribute it?

Regards,
Martin

From martin at v.loewis.de Wed Jun 6 20:26:01 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Wed, 06 Jun 2007 20:26:01 +0200
Subject: [Python-3000] String comparison
In-Reply-To:
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid>
Message-ID: <4666FC39.8050307@v.loewis.de>

Guido van Rossum wrote:
> Clearly we will have a normalization routine so the
> lexer can normalize identifiers, so if you need normalized data it is
> as simple as writing 'XXX'.normalize() (or whatever the spelling
> should be).

It's actually in Python already, and spelled as

unicodedata.normalize("NFC", 'XXX')

Regards,
Martin

From stephen at xemacs.org Wed Jun 6 20:41:28 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 07 Jun 2007 03:41:28 +0900
Subject: [Python-3000] String comparison
In-Reply-To:
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <87myzdgc2v.fsf@uwakimon.sk.tsukuba.ac.jp>

Guido van Rossum writes:

> But I'm not about to change the == operator to apply normalization
> first. It would affect too much (e.g. hashing).

Yah, that's one reason why Jim Jewett and I lean to normalizing on the
way in for explicitly Unicode data. But since that's not going to
happen, I guess the thing is to get cracking on that library just in
case there's some help that Python itself could give.

From martin at v.loewis.de Wed Jun 6 20:44:27 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Wed, 06 Jun 2007 20:44:27 +0200
Subject: [Python-3000] String comparison
In-Reply-To: <87myzdgc2v.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <87myzdgc2v.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <4667008B.5080302@v.loewis.de>

> > But I'm not about to change the == operator to apply normalization
> > first. It would affect too much (e.g. hashing).
>
> Yah, that's one reason why Jim Jewett and I lean to normalizing on the
> way in for explicitly Unicode data. But since that's not going to
> happen, I guess the thing is to get cracking on that library just in
> case there's some help that Python itself could give.

There are issues with that as well. Concatenation would need to
perform normalization, and then len(a+b) <> len(a)+len(b), for
some a and b.

Regards,
Martin

From guido at python.org Wed Jun 6 20:47:20 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 6 Jun 2007 11:47:20 -0700
Subject: [Python-3000] String comparison
In-Reply-To:
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid>
Message-ID:

On 6/6/07, Rauli Ruohonen wrote:
> On 6/6/07, Guido van Rossum wrote:
> > Why should the lexer apply normalization to literals behind my back?
> > The lexer shouldn't, but NFC normalizing the source before the lexer > sees it would be slightly more robust and standards-compliant. I have no opinion on this, but NFC normalizing the source shouldn't affect the use of \u.... in string literals. Remember, Python's \u is very different from \u in Java (where it happens before the lexer starts tokenizing). Python's \u is more like \x, only valid in string literals. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From jcarlson at uci.edu Wed Jun 6 22:05:37 2007 From: jcarlson at uci.edu (Josiah Carlson) Date: Wed, 06 Jun 2007 13:05:37 -0700 Subject: [Python-3000] String comparison In-Reply-To: <07Jun6.101305pdt."57996"@synergy1.parc.xerox.com> References: <20070606084543.6F3D.JCARLSON@uci.edu> <07Jun6.101305pdt."57996"@synergy1.parc.xerox.com> Message-ID: <20070606125328.6F40.JCARLSON@uci.edu> Bill Janssen wrote: > > > Hear me out for a moment. People type what they want. > > I do a lot of Pythonic processing of UTF-8, which is not "typed by > people", but instead extracted from documents by automated processing. > Text is also data -- an important thing to keep in mind. Right, but (and this is a big but), you are reading data in from a file. That is different from source code identifiers and embedded strings. If you *want* normalization to happen on your data, that is perfectly reasonable, and you can do so (Explicit is better than implicit?). But if someone didn't want normalization, and Python did it anyways, then there would be an error that passed silently. > As far as normalization goes, I agree with you about identifiers, and > I use "unicodedata.normalize" extensively in the cases where I care > about normalization of data strings. The big issue is string literals. > I think I agree with Stephen here: > > u"L\u00F6wis" == u"Lo\u0308wis" > > should be True (assuming he typed it correctly in the first place :-), > because they are the same Unicode string. I don't understand Guido's > objection here -- it's a lexer issue, right? The underlying character > string will still be the same in both cases. It's the unicode character versus code point issue. I personally prefer code points, as a code point approach does exactly what I want it to do by default; nothing. If it *does* something without me asking, then that would seem to be magic to me, and I'm a minimal magic kind of guy. - Josiah From guido at python.org Wed Jun 6 23:31:17 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 6 Jun 2007 14:31:17 -0700 Subject: [Python-3000] [Python-Dev] PEP 367: New Super In-Reply-To: <20070531170734.273393A40AA@sparrow.telecommunity.com> References: <001101c79aa7$eb26c130$0201a8c0@mshome.net> <002d01c79f6d$ce090de0$0201a8c0@mshome.net> <003f01c79fd9$66948ec0$0201a8c0@mshome.net> <009c01c7a04f$7e348460$0201a8c0@mshome.net> <20070531170734.273393A40AA@sparrow.telecommunity.com> Message-ID: On 5/31/07, Phillip J. Eby wrote: > At 07:48 PM 5/31/2007 +0800, Guido van Rossum wrote: > >I've updated the patch; the latest version now contains the grammar > >and compiler changes needed to make super a keyword and to > >automatically add a required parameter 'super' when super is used. > >This requires the latest p3yk branch (r55692 or higher). > > > >Comments anyone? What do people think of the change of semantics for > >the im_class field of bound (and unbound) methods? 
>
> Please correct me if I'm wrong, but just looking at the patch it
> seems to me that the descriptor protocol is being changed as well --
> i.e., the 'type' argument is now the found-in-type in the case of an
> instance __get__ as well as class __get__.
>
> It would seem to me that this change would break classmethods both on
> the instance and class level, since the 'cls' argument is supposed to
> be the derived class, not the class where the method was
> defined. There also don't seem to be any tests for the use of super
> in classmethods.

I've now gotten a new patch out based on a completely different
approach. (I'm afraid I didn't quite get your suggestion so this is
original work.) It creates a cell named __class__ which is shared
between all methods defined in a particular class, and initialized to
the class object (before class decorators are applied). Only a small
change is made to super(): instead of making it a keyword, it can be
invoked as a function without arguments, and then it digs around in
the frame to find the __class__ cell and the first argument, and uses
those as its arguments.

Example:

class B:
    def foo(self):
        return 'B'

class C(B):
    def foo(self):
        return 'C' + super().foo()

C().foo() will return 'CB'. The notation super() is equivalent to
super(C, self) or super(__class__, self). It works for class methods
too.

I realize this is a deviation from the PEP: you need to call
super().foo() instead of super.foo(). Looking at the examples I find
that quite acceptable; in hindsight making super a keyword smelled a
bit too magical. (Yes, I know I've been flip-flopping a lot on this
issue. Working code is convincing. :-)

This __class__ variable can also be used explicitly (thereby
implementing 33% of PEP 3130):

class C:
    def f(self):
        print(__class__)

C().f()

I wonder if this may meet the needs for your PEP 3124? In particular,
earlier on, you wrote:

> Btw, PEP 3124 needs a way to receive the same class object at more or
> less the same moment, although in the form of a callback rather than
> a cell assignment. Guido suggested I co-ordinate with you to design
> a mechanism for this.

Is this relevant at all?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jimjjewett at gmail.com Wed Jun 6 23:43:15 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 6 Jun 2007 17:43:15 -0400
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <461528.7704.qm@web33514.mail.mud.yahoo.com>
References: <461528.7704.qm@web33514.mail.mud.yahoo.com>
Message-ID:

On 6/5/07, Steve Howell wrote:
>
> --- Jim Jewett wrote:
> > Ideally, either that equivalence would also include
> > compatibility, or
> > else characters whose compatibility and canonical
> > equivalents are
> > different would be banned for use in identifiers.

> Current Python has the precedent that color/colour
> are treated as two separate identifiers, as are
> metre/meter, despite the equivalence of "o" to "ou"
> and "re" to "er," and I don't think that burns too
> many people. So I'm +1 on the unquoted third option,
> that canonically equivalent, but differently encoded,
> Unicode characters are allowed yet treated as
> different.

> Am I stretching the analogy too far?

I think so. As best I can judge, "color/colour" is arguably a
compatibility equivalent, but is not a canonical equivalent.

A better analogy for canonical equivalence would be "color" typed on a
PC vs "color" typed on an old EBCDIC mainframe terminal.
In that particular case, I think the re-encoding to unicode would be
able to use the same code points, but that "mostly invisible; might
need it for a round-trip" level of difference is the sort of thing
expressed by different code points with canonical equivalence.

-jJ

From baptiste13 at altern.org Thu Jun 7 00:01:26 2007
From: baptiste13 at altern.org (Baptiste Carvello)
Date: Thu, 07 Jun 2007 00:01:26 +0200
Subject: [Python-3000] Conservative Defaults (was: Re: Support for PEP 3131)
In-Reply-To: <740c3aec0706031858q1dde5ccbxb39d4a808902f5d2@mail.gmail.com>
References: <740c3aec0706031858q1dde5ccbxb39d4a808902f5d2@mail.gmail.com>
Message-ID:

Björn Lindqvist wrote:
>> Those most eager for unicode identifiers are afraid that people
>> (particularly beginning students) won't be able to use local-script
>> identifiers, unless it is the default. My feeling is that the teacher
>> (or the person who pointed them to python) can change the default on a
>> per-install basis, since it can be a one-time change.
>
> What if the person discovers Python by him/herself?
>
Don't people read the (funky:-) manual any more? More seriously, they
will probably read some tutorials in that case. Also, the error message
could advertise the feature, as in:

SyntaxError: if you really want to use unicode identifiers, call python
with -U

Also, think of it from the other side: the person who discovers python
by him/herself and reads no manuals won't know that you should avoid
unicode identifiers in code you later want to distribute, or that there
can be security issues.

>> On the other hand, if "anything from *any* script" becomes the
>> default, even on a single widespread distribution, then the community
>> starts to splinter in a new way. It starts to separate between people
>> who distribute source code (generally ASCII) and people who are
>> effectively distributing binaries (not for human end-users to read).
>
> That is FUD.
>
definitely not. Big open source projects will of course do the right
thing, but the smaller ones? I doubt it. Think of all those little apps
on the cheeseshop which get updated every other year. Do you really
think all of them run a test suite?

>>> ... Java, ... don't hear constant complaints
>> They aren't actually a problem because they aren't used; they aren't
>> used because almost no one knows about them. Python would presumably
>> advertise the feature, and see more use. (We shouldn't add it at all
>> *unless* we expect much more usage than unicode IDs have seen in other
>> programming languages.)
>
> Every Swedish book I've read about Java (only 2) mentioned that feature.
>
cool, then everybody reading Swedish tutorials on python will also
learn about the feature, even if it's not the default!

>> The same one-step-at-a-time reasoning applies to unicode identifiers.
>> Allowing IDs in your native language (or others that you explicitly
>> approve) is probably a good step. Allowing IDs in *any* language by
>> default is probably going too far.
>
> If you set different native languages won't you get the exact same
> problems that codepages caused and that unicode was invented to solve?
>
nope, because you do not reuse the same coding for different characters
in different languages. You just turn languages (scripts, in fact) on
or off.
Cheers,
BC

From jimjjewett at gmail.com Thu Jun 7 00:22:17 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 6 Jun 2007 18:22:17 -0400
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <466642E9.1020505@v.loewis.de>
References: <4665B4CF.2050107@v.loewis.de> <466642E9.1020505@v.loewis.de>
Message-ID:

On 6/6/07, "Martin v. Löwis" wrote:
> > I think "obvious" referred to the reasoning, not the outcome.
> > I can tell that the decision was "NFC, anything goes", but I don't see why.

> I think I'm repeating myself: Because UAX 31 says so. That's it. There
> is a standard that experts in the domain have specified, and PEP 3131
> follows it. Following standards is a good thing, deviating from them
> is a bad thing.

I think we are reading UAX31 very differently. If it is (or even
seems) ambiguous, then we need to specify our interpretation.

> > (2)
> > I cannot understand why ID_START/CONTINUE was chosen instead of the
> > newer and more recommended XID_START/CONTINUE. From UAX31 section 2:
> > """
> > The XID_Start and XID_Continue properties are improved lexical classes
> > that incorporate the changes described in Section 5.1, NFKC
> > Modifications. They are recommended for most purposes, especially for
> > security, over the original ID_Start and ID_Continue properties.
> > """

> Right. I read it that these should be used when 5.1 is considered
> in the language. This, in turn, should be used when the
> normalization form is NFKC:

I read that as XID is almost always better. XID is better for security
in particular, but also better for other things. And as an extra
bonus, XID even already takes care of some 5.1 issues for you.

And my personal opinion is that those 5.1 issues are not really
restricted to NFKC. Other normalization forms won't get syntactic
errors over them, but the results could still be nonsense.

Issue 1 is that Catalan treats a 0xB7 as a character instead of as
punctuation. The unicode recommendation (*required* only for NFKC, but
already supported by XID, since it is recommended) says "OK, it isn't
syntax or whitespace, and it is a character sometimes in practice, so
we'll allow it."

Issue 2 says "Technically these are characters, but they should never
be used to start a word, so don't start an identifier with them
anyhow." If you're not using NFKC, you *can* just ignore the problem
(and produce garbage), but you probably shouldn't. XID takes care of
it for you. (At least for these characters.)

Issue 3 says "OK, these characters don't work with NFKC -- but you
shouldn't be using them anyhow." It even says explicitly that "It is
recommended that all Arabic presentation forms be excluded from
identifiers in any event"

Note that neither ID nor XID actually remove all the Arabic
presentation forms, despite this clear recommendation. Technically,
they are characters, and *could* be processed. XID removes the ones
that break NFKC, and xidmodifications removes some more (hopefully,
all the rest, but I haven't verified that).

> """
> Where programming languages are using NFKC to fold differences between
> characters, they need the following modifications of the identifier
> syntax from the Unicode Standard to deal with the idiosyncrasies of a
> small number of characters. These modifications are reflected in the
> XID_Start and XID_Continue properties.
> """

> As the PEP does not use NFKC (currently), it should not use XID_Start
> and XID_Continue either.

I read that as "If you are using NFKC, then you need to do some extra
work.
But notice that if you are using the new and improved XID, then some
of this work was already done for you..."

> > Nor can I understand why the additional restrictions in
> > xidmodifications (from TR39) were ignored.

> Consideration of UTR 39 is listed as an open issue. One problem
> with it is that using it would restrict the language over time,
> so that previously correct programs might not be correct anymore
> in a future version. So using it might break backwards
> compatibility.

Then we should start with a more restricted charset, and expand it
over time.

The restrictions in xidmodifications are not remotely sufficient for
security, even now. (Doing that would require restricting some
characters that are actually needed in some languages.) Instead,
xidmodifications represents (a mechanically determined subset of)
characters that can be removed cheaply, because they shouldn't be used
in identifiers anyhow.

-jJ

From guido at python.org Thu Jun 7 00:57:23 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 6 Jun 2007 15:57:23 -0700
Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k?
Message-ID:

A few PEPs with numbers < 400 are now targeting Python 3000, e.g. PEP
367 (new super) and PEP 344 (exception chaining). Are there any
others? I propose that we renumber these to numbers in the 3100+
range. I can see two forms of renaming:

(a) 344 -> 3344 and 367 -> 3367, i.e. add 3000 to the number

(b) just use the next available number

Preferences?

What other PEPs should be renumbered?

Should we renumber at all?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From collinw at gmail.com Thu Jun 7 01:00:24 2007
From: collinw at gmail.com (Collin Winter)
Date: Wed, 6 Jun 2007 16:00:24 -0700
Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k?
In-Reply-To:
References:
Message-ID: <43aa6ff70706061600y568ad4b4u17730d7f3ad97691@mail.gmail.com>

On 6/6/07, Guido van Rossum wrote:
> A few PEPs with numbers < 400 are now targeting Python 3000, e.g. PEP
> 367 (new super) and PEP 344 (exception chaining). Are there any
> others? I propose that we renumber these to numbers in the 3100+
> range. I can see two forms of renaming:
>
> (a) 344 -> 3344 and 367 -> 3367, i.e. add 3000 to the number
>
> (b) just use the next available number
>
> Preferences?
>
> What other PEPs should be renumbered?
>
> Should we renumber at all?

Renumbering, +1; using the next 31xx number, +1.

Collin Winter

From jimjjewett at gmail.com Thu Jun 7 01:06:05 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 6 Jun 2007 19:06:05 -0400
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <4664E238.9020700@v.loewis.de> <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID:

On 6/5/07, Stephen J. Turnbull wrote:
> A scan of the full table for Unicode Version 2.0 ...

'n (Afrikaans), and I asked a friend who speaks Afrikaans; apparently
it is more a word than a letter.

"""
'n is derived from the Dutch word en which means "a" in English. The
' is in place of the e e.g. a woman would translate into "'n vrou" It
is used very often as it is an indefinite article. SMS language
usually just uses the n without the apostrophe.
"""
-- Tania Adendorff

So it is common, but losing it is already sort of acceptable. And that
is the strongest endorsement we have seen.

(There were mixed opinions on Technical symbols, and no one has spoken
up yet about the half-dozen Croatian digraphs corresponding to Serbian
Cyrillic.)
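For concreteness, compatibility normalization folds a ligature into
its parts, while canonical normalization leaves it alone. A quick
interactive check (reprs as Python 2.5 shows them):

>>> import unicodedata
>>> unicodedata.normalize("NFKC", u"\uFB01")  # LATIN SMALL LIGATURE FI
u'fi'
>>> unicodedata.normalize("NFC", u"\uFB01")   # no canonical decomposition
u'\ufb01'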
There is legitimate disagreement over whether to (1) forbid the Kompatibility characters in IDs (2) translate them to the canonical equivalents, (3) or just leave them alone because ID= should be the same as string=, but I think dealing with K characters is now a "least of evils" decision, instead of "we need them for something." On another note, I have no idea how Martin's name (in the Cc line) ended up as: """ L$(D+S(Bwis" """ If I knew, it *might* have a bearing on what sorts of canonicalizations should be performed, and what sorts of warnings the parser ought to emit for likely corrupted text. -jJ From jimjjewett at gmail.com Thu Jun 7 01:29:09 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Wed, 6 Jun 2007 19:29:09 -0400 Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures? In-Reply-To: <873b15yasq.fsf@uwakimon.sk.tsukuba.ac.jp> References: <4664E238.9020700@v.loewis.de> <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp> <873b15yasq.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 6/6/07, Stephen J. Turnbull wrote: > Jim Jewett writes: > > Depends on what you mean by technical symbols. ... The math > > versions (generally 1D400 - 1DC7B) are included. But > > http://unicode.org/reports/tr39/data/xidmodifications.txt suggests > > excluding them again. > Eg, the letterlike symbols (DEGREE CELSIUS), not an ID character > the number forms (ROMAN NUMERAL ONE), an ID_START (a letter), not excluded even by xidmodifications No canonical equivalent. Will be turned into the regular ASCII letters (only) by Kompatibility canonicalization. > and the APL set (2336--237A) in the BMP. not ID characters -jJ From jimjjewett at gmail.com Thu Jun 7 02:09:50 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Wed, 6 Jun 2007 20:09:50 -0400 Subject: [Python-3000] String comparison In-Reply-To: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 6/6/07, Stephen J. Turnbull wrote: > Rauli Ruohonen writes: > > FWIW, I don't buy that normalization is expensive, as most strings are > > in NFC form anyway, and there are fast checks for that (see UAX#15, > > "Detecting Normalization Forms"). Python does not currently have > > a fast path for this, but if it's added, then normalizing everything > > to NFC should be fast. > If O(n) is "fast". Normalize before hashing; then it becomes O(1) for the remaining uses. The hash is already O(N), and most literals already end up being interned, which requires hashing. -jJ From jimjjewett at gmail.com Thu Jun 7 02:38:51 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Wed, 6 Jun 2007 20:38:51 -0400 Subject: [Python-3000] String comparison In-Reply-To: References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid> Message-ID: On 6/6/07, Guido van Rossum wrote: > > about normalization of data strings. The big issue is string literals. > > I think I agree with Stephen here: > > u"L\u00F6wis" == u"Lo\u0308wis" > > should be True (assuming he typed it correctly in the first place :-), > > because they are the same Unicode string. > So let me explain it. I see two different sequences of code points: > 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308', > 'w', 'i', 's' on the other. Never mind that Unicode has semantics that > claim they are equivalent. Your (conforming) editor can silently replace one with the other. A second editor can silently use one, and not replace the other. ==> Uncontrollable, invisible bugs. 
> They are two different sequences of code points.

So "str" is about bytes, rather than text? and bytes is also about
bytes; it just happens to be mutable? Then what was the point of
switching to unicode? Why not just say "When printed, a string will be
interpreted as if it were UTF-8" and be done with it?

> We should not hide that Python's unicode string object can
> store each sequence of code points equally well, and that when viewed
> as a sequence they are different: the first has len() == 5, the second
> has len() == 6!

For a bytes object, that is true. For unicode text, they shouldn't be
different -- at least not by the time a user can see it (or measure
it).

> I might be writing either literal with the expectation to get exactly that
> sequence of code points,

Then you are assuming non-conformance with unicode, which requires you
not to depend on that distinction. You should have used bytes, rather
than text.

http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf (Conformance)

C9 A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct.

> Note that I'm not arguing against normalization of *identifiers*. I
> see that as a necessity. I also see that there will be border cases
> where getattr(x, 'XXX') and x.XXX are not equivalent for some values
> of XXX where the normalized form is a different sequence of code
> points. But I don't believe the solution should be to normalize all
> string literals.

For strings created by an extension module, that would be valid. But
python source code is human-readable text, and should be treated that
way. Either follow the unicode rules (at least for strings), or don't
call them unicode.

-jJ

From guido at python.org Thu Jun 7 02:47:38 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 6 Jun 2007 17:47:38 -0700
Subject: [Python-3000] String comparison
In-Reply-To:
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid>
Message-ID:

On 6/6/07, Jim Jewett wrote:
> On 6/6/07, Guido van Rossum wrote:
>
> > > about normalization of data strings. The big issue is string literals.
> > > I think I agree with Stephen here:
> > > u"L\u00F6wis" == u"Lo\u0308wis"
> > > should be True (assuming he typed it correctly in the first place :-),
> > > because they are the same Unicode string.
>
> > So let me explain it. I see two different sequences of code points:
> > 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> > 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> > claim they are equivalent.
>
> Your (conforming) editor can silently replace one with the other.

No it cannot. We are talking about \u escapes, not about a string
literal containing Unicode characters ("Löwis").

> A second editor can silently use one, and not replace the other.
> ==> Uncontrollable, invisible bugs.

No. Seems you're again not reading before posting. :-(

> > They are two different sequences of code points.
>
> So "str" is about bytes, rather than text?
> and bytes is also about bytes; it just happens to be mutable?

Bytes are not code points. The unicode string type has always been
about code points, not characters.

> Then what was the point of switching to unicode? Why not just say
> "When printed, a string will be interpreted as if it were UTF-8" and
> be done with it?

Manipulating code points is a lot more convenient than manipulating
UTF-8.
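A quick sketch of the difference (Python 2.x; the precomposed o-umlaut
is a single code point but two bytes in UTF-8):

s = u"L\u00F6wis"
b = s.encode("utf-8")
print len(s)       # 5 code points
print len(b)       # 6 bytes
print repr(s[1])   # u'\xf6' -- one indexing operation
print repr(b[1])   # '\xc3' -- only half of the character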
> > We should not hide that Python's unicode string object can
> > store each sequence of code points equally well, and that when viewed
> > as a sequence they are different: the first has len() == 5, the second
> > has len() == 6!
>
> For a bytes object, that is true. For unicode text, they shouldn't be
> different -- at least not by the time a user can see it (or measure
> it).

Have you ever even used the unicode string type in Python 2?

> > I might be writing either literal with the expectation to get exactly that
> > sequence of code points,
>
> Then you are assuming non-conformance with unicode, which requires you
> not to depend on that distinction. You should have used bytes, rather
> than text.

Again, bytes != code points.

> http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf (Conformance)
>
> C9 A process shall not assume that the interpretations of two
> canonical-equivalent character sequences are distinct.

That is surely contained inside all sorts of weasel words that allow
us to define a "normalized equivalence" function that works that way,
and leave the "==" operator for arrays of code points alone.

> > Note that I'm not arguing against normalization of *identifiers*. I
> > see that as a necessity. I also see that there will be border cases
> > where getattr(x, 'XXX') and x.XXX are not equivalent for some values
> > of XXX where the normalized form is a different sequence of code
> > points. But I don't believe the solution should be to normalize all
> > string literals.
>
> For strings created by an extension module, that would be valid. But
> python source code is human-readable text, and should be treated that
> way. Either follow the unicode rules (at least for strings), or don't
> call them unicode.

Again, did you realize that the example was about \u escapes?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jimjjewett at gmail.com Thu Jun 7 02:49:09 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 6 Jun 2007 20:49:09 -0400
Subject: [Python-3000] String comparison
In-Reply-To:
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid>
Message-ID:

On 6/6/07, Guido van Rossum wrote:
> On 6/6/07, Rauli Ruohonen wrote:
> > On 6/6/07, Guido van Rossum wrote:
> > > Why should the lexer apply normalization to literals behind my back?
> > The lexer shouldn't, but NFC normalizing the source before the lexer
> > sees it would be slightly more robust and standards-compliant.
> I have no opinion on this, but NFC normalizing the source shouldn't
> affect the use of \u.... in string literals.

Agreed; normalizing the source should be applied only to code points;
the code sequence <0x5c, 0x75> normalizes to itself. If there is a \u
in a string, it will still be there after normalization, before python
lexes. If there is a \u outside a string, it will still be there to
cause syntax errors.

-jJ

From jimjjewett at gmail.com Thu Jun 7 03:15:57 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 6 Jun 2007 21:15:57 -0400
Subject: [Python-3000] String comparison
In-Reply-To:
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid>
Message-ID:

On 6/6/07, Guido van Rossum wrote:
> On 6/6/07, Jim Jewett wrote:
> > On 6/6/07, Guido van Rossum wrote:
> > >
> > > about normalization of data strings. The big issue is string literals.
> > > > I think I agree with Stephen here:
> > > > u"L\u00F6wis" == u"Lo\u0308wis"
> > > > should be True (assuming he typed it correctly in the first place :-),
> > > > because they are the same Unicode string.

> > > So let me explain it. I see two different sequences of code points:
> > > 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> > > 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> > > claim they are equivalent.

> > Your (conforming) editor can silently replace one with the other.

> No it cannot. We are talking about \u escapes, not about a string
> literal containing Unicode characters ("Löwis").

ahh... my apologies. I was interpreting the \u as a way of showing the
bytes in email. I discarded the interpretation you are using because
that would require a sequence of 10 or 11 code points, rather than the
5 or 6 you mentioned.

Python lexes it into a shorter string (just as it lexes 1.0 into a
number) at a conceptually later time. Those later strings should
compare equal according to unicode, but I agree that you no longer
need to worry about editors introducing bugs.

(And I even agree that this may be a valid case for ignoring the
recommendation; if someone has been explicit by writing out 6
characters to represent one, they probably meant it.)

-jJ

From shiblon at gmail.com Thu Jun 7 03:21:58 2007
From: shiblon at gmail.com (Chris Monson)
Date: Wed, 6 Jun 2007 21:21:58 -0400
Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k?
In-Reply-To: <43aa6ff70706061600y568ad4b4u17730d7f3ad97691@mail.gmail.com>
References: <43aa6ff70706061600y568ad4b4u17730d7f3ad97691@mail.gmail.com>
Message-ID:

On 6/6/07, Collin Winter wrote:
>
> On 6/6/07, Guido van Rossum wrote:
> > A few PEPs with numbers < 400 are now targeting Python 3000, e.g. PEP
> > 367 (new super) and PEP 344 (exception chaining). Are there any
> > others? I propose that we renumber these to numbers in the 3100+
> > range. I can see two forms of renaming:
> >
> > (a) 344 -> 3344 and 367 -> 3367, i.e. add 3000 to the number
> >
> > (b) just use the next available number
> >
> > Preferences?
> >
> > What other PEPs should be renumbered?
> >
> > Should we renumber at all?
>
> Renumbering, +1; using the next 31xx number, +1.

Renumbering +1

Leaving (old PEP number) in place as a stripped down PEP that just
points to the new number: +1

> Collin Winter

From greg.ewing at canterbury.ac.nz Thu Jun 7 03:38:07 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 07 Jun 2007 13:38:07 +1200
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To:
References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de> <46646E76.8060804@v.loewis.de> <4665189D.4020301@v.loewis.de>
Message-ID: <4667617F.5060807@canterbury.ac.nz>

Jim Jewett wrote:

> Since we don't want the results of (str1 == str2) to change based on
> context, I think string equality also needs to look at canonicalized
> (though probably not compatibility) forms.

Are you suggesting that this should be done on the fly when
comparing strings?
Or that all strings should be stored in canonicalised form? I can see some big cans of worms being opened up by either approach. Surprising results could include things like s1 == s2 but len(s1) <> len(s2), or len(s1 + s2) <> len(s1) + len(s2). -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From janssen at parc.com Thu Jun 7 03:57:40 2007 From: janssen at parc.com (Bill Janssen) Date: Wed, 6 Jun 2007 18:57:40 PDT Subject: [Python-3000] String comparison In-Reply-To: References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid> Message-ID: <07Jun6.185746pdt."57996"@synergy1.parc.xerox.com> > So let me explain it. I see two different sequences of code points: > 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308', > 'w', 'i', 's' on the other. Never mind that Unicode has semantics that > claim they are equivalent. They are two different sequences of code > points. If they were sequences of integers, or sequences of bytes, I'd agree with you. But they are explicitly sequences of characters, not sequences of codepoints. There should be one internal normalized form for strings. > We should not hide that Python's unicode string object can > store each sequence of code points equally well, and that when viewed > as a sequence they are different: the first has len() == 5, the scond > has len() == 6! We should definitely not expose that difference! > When read from a file they are different. A file is in UTF-8, or UTF-2, or whatever -- it contains a string coerced to a sequence of bits. Whatever reads that file should in fact either preserve that sequence of bytes (in which case it's not a string), or coerce it to a Unicode string, in which case the file representation is immaterial and the Python normalized form is used internally. > I might be > writing either literal with the expectation to get exactly that > sequence of code points, in order to use it as a test case or as input > for another program that requires specific input. In that case you should write it as a sequence of integers, because that's what you're dealing with. > There's a simpler solution. The unicode (or str, in Py3k) data type > represents a sequence of code points, not a sequence of characters. > This has always been the case, and will continue to be the case. Bad idea, IMO. Bill From janssen at parc.com Thu Jun 7 03:59:52 2007 From: janssen at parc.com (Bill Janssen) Date: Wed, 6 Jun 2007 18:59:52 PDT Subject: [Python-3000] String comparison In-Reply-To: <20070606125328.6F40.JCARLSON@uci.edu> References: <20070606084543.6F3D.JCARLSON@uci.edu> <07Jun6.101305pdt."57996"@synergy1.parc.xerox.com> <20070606125328.6F40.JCARLSON@uci.edu> Message-ID: <07Jun6.190001pdt."57996"@synergy1.parc.xerox.com> > But > if someone didn't want normalization, and Python did it anyways, then > there would be an error that passed silently. Then they'd read it as bytes, and do the processing themselves explicitly (actually, what I do). > It's the unicode character versus code point issue. I personally prefer > code points, as a code point approach does exactly what I want it to do > by default; nothing. If it *does* something without me asking, then > that would seem to be magic to me, and I'm a minimal magic kind of guy. 
Strings are not code point sequences, which are available anyway for
people who want them as tuples of integer values.

Bill

From tjreedy at udel.edu Thu Jun 7 04:00:07 2007
From: tjreedy at udel.edu (Terry Reedy)
Date: Wed, 6 Jun 2007 22:00:07 -0400
Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k?
References: <43aa6ff70706061600y568ad4b4u17730d7f3ad97691@mail.gmail.com>
Message-ID:

"Chris Monson" wrote in message
| Leaving (old PEP number) in place as a stripped down PEP that just points to
| the new number: +1

Good idea. And new number = next available. Special PEP numbers should
be for special PEPs.

tjr

From janssen at parc.com Thu Jun 7 04:19:56 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 6 Jun 2007 19:19:56 PDT
Subject: [Python-3000] String comparison
In-Reply-To: <07Jun6.185746pdt."57996"@synergy1.parc.xerox.com>
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid> <07Jun6.185746pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <07Jun6.191956pdt."57996"@synergy1.parc.xerox.com>

I wrote:
> Guido wrote:
> > So let me explain it. I see two different sequences of code points:
> > 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> > 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> > claim they are equivalent. They are two different sequences of code
> > points.
>
> If they were sequences of integers, or sequences of bytes, I'd agree
> with you. But they are explicitly sequences of characters, not
> sequences of codepoints. There should be one internal normalized form
> for strings.

I meant to say that *strings* are explicitly sequences of characters,
not codepoints. So both sequences of codepoints should collapse to the
same *string* when they are turned into a string. While the two
sequences of codepoints should not compare equal, the strings formed
from them should compare equal.

I also believe that the literal form '\u0308' should generate a compile
error. It's a valid Unicode codepoint, sure, but not a valid string.

string((ord('L'), 0xF6, ord('w'), ord('i'), ord('s'))) ==
string((ord('L'), ord('o'), 0x308, ord('w'), ord('i'), ord('s')))

Bill

From greg.ewing at canterbury.ac.nz Thu Jun 7 04:31:38 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 07 Jun 2007 14:31:38 +1200
Subject: [Python-3000] PEP 3131 roundup
In-Reply-To: <461528.7704.qm@web33514.mail.mud.yahoo.com>
References: <461528.7704.qm@web33514.mail.mud.yahoo.com>
Message-ID: <46676E0A.6020506@canterbury.ac.nz>

Steve Howell wrote:
> Current Python has the precedent that color/colour
> are treated as two separate identifiers,

But there's always a clear visual difference between "color" and
"colour", and your editor is not going to turn one into the other
while you're not looking (unless you've got some sort of automatic
english-to-american spelling correction, which would be insane to turn
on for editing code).

--
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | Carpe post meridiem!                 |
Christchurch, New Zealand          | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz        +--------------------------------------+

From greg.ewing at canterbury.ac.nz Thu Jun 7 04:46:32 2007
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 07 Jun 2007 14:46:32 +1200
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: <874pllycec.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <4664E238.9020700@v.loewis.de> <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp> <466595C5.6070301@v.loewis.de> <874pllycec.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <46677188.2070809@canterbury.ac.nz>

Stephen J. Turnbull wrote:
> Jim Jewett writes:
>
> > I am slightly concerned that it might mean
> > "string as string" and "string as identifier" have different tests
> > for equality.
>
> It does mean that; see Rauli's code. Does anybody know if this
> bothers LISP users, where identifiers are case-insensitive?

I don't think the issue arises in Lisp, because to use a string
as an identifier you have to explicitly convert it to a symbol,
whereupon there is an opportunity for case folding, normalisation,
etc. to be done.

--
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | Carpe post meridiem!                 |
Christchurch, New Zealand          | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz        +--------------------------------------+

From nnorwitz at gmail.com Thu Jun 7 05:18:27 2007
From: nnorwitz at gmail.com (Neal Norwitz)
Date: Wed, 6 Jun 2007 20:18:27 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <4665EE44.2010306@ronadam.com>
References: <4665EE44.2010306@ronadam.com>
Message-ID:

On 6/5/07, Ron Adam wrote:
> Alexandre Vassalotti wrote:
> > On 6/5/07, Guido van Rossum wrote:
> >> If "make clean" makes the problem go away, it's usually because there
> >> were old .pyc files with incompatible byte code. We don't change the
> >> .pyc magic number for each change to the compiler.
> >
> > Nope. It is still not working. I just did the following, and I still
> > get the same error.
> >
> > % make # run fine
> > % make # fail
>
> I can confirm the same behavior. Works on the first make, same error on
> the second. I deleted the contents of the branch and did an "svn up" on an
> empty directory. Same thing.

This probably means there is a problem with marshalling the byte code
out. The first run compiles the .pyc files. Theoretically this writes
out the same thing in memory. This isn't always the case though (ie,
when there are bugs).

A work around would be to just remove the .pyc files each time rather
than do a make clean. Do:

find . -name '*.pyc' -print0 | xargs -0 rm

Bonus points for finding the bug. :-)

A quick way to test this is to try to round-trip it. Something like:

>>> s = '''\
... class F:
...     def foo(self, *args):
...         print(self, args)
... '''
>>> code = compile(s, 'foo', 'exec')
>>> import marshal
>>> marshal.loads(marshal.dumps(code)) == code
True

If it doesn't equal True, you found the problem.

n

From rauli.ruohonen at gmail.com Thu Jun 7 05:32:47 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Thu, 7 Jun 2007 06:32:47 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <4846239003818249252@unknownmsgid>
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid> <4846239003818249252@unknownmsgid>
Message-ID:

On 6/7/07, Bill Janssen wrote:
> I meant to say that *strings* are explicitly sequences of characters,
> not codepoints.

This is false. When you access the contents of a string using the
*sequence* protocol, what you get is code points, not characters
(grapheme clusters). To get those, you have to use a regexp, as
outlined in UAX#29.
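The iteration itself is easy enough to sketch for the combining-mark
case (a hypothetical helper; real UAX#29 segmentation also handles
Hangul jamo, CRLF and other special cases):

import unicodedata

def graphemes(s):
    # Group each base character with the combining marks after it.
    cluster = u""
    for ch in s:
        if cluster and unicodedata.combining(ch):
            cluster += ch
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

print list(graphemes(u"Lo\u0308wis"))  # 5 clusters; u'o\u0308' is one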
You could normalize at the same time so you can do bitwise comparison
instead of collation to compare graphemes the way the user does. If
you're going to do all that, then you could as well implement your own
type (which could even be provided by the standard library). Note that
normalization alone does not produce a sequence of grapheme clusters,
because there aren't precomposed characters for everything - for full
generality you just have to deal with combining characters.

> I also believe that the literal form '\u0308' should generate a compile
> error. It's a valid Unicode codepoint, sure, but not a valid string.

Then you wouldn't even be able to iterate over or index strings
anymore, as that could produce such "invalid" strings, which would
need to generate exceptions if you really want to ban them. Or is
there a point in making people type 'o\u0308'[1] instead of '\u0308'?

From brett at python.org Thu Jun 7 05:47:03 2007
From: brett at python.org (Brett Cannon)
Date: Wed, 6 Jun 2007 20:47:03 -0700
Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k?
In-Reply-To: <43aa6ff70706061600y568ad4b4u17730d7f3ad97691@mail.gmail.com>
References: <43aa6ff70706061600y568ad4b4u17730d7f3ad97691@mail.gmail.com>
Message-ID:

On 6/6/07, Collin Winter wrote:
>
> On 6/6/07, Guido van Rossum wrote:
> > A few PEPs with numbers < 400 are now targeting Python 3000, e.g. PEP
> > 367 (new super) and PEP 344 (exception chaining). Are there any
> > others? I propose that we renumber these to numbers in the 3100+
> > range. I can see two forms of renaming:
> >
> > (a) 344 -> 3344 and 367 -> 3367, i.e. add 3000 to the number
> >
> > (b) just use the next available number
> >
> > Preferences?
> >
> > What other PEPs should be renumbered?
> >
> > Should we renumber at all?
>
> Renumbering, +1; using the next 31xx number, +1.

+1 on this vote.

-Brett

From showell30 at yahoo.com Thu Jun 7 07:00:11 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Wed, 6 Jun 2007 22:00:11 -0700 (PDT)
Subject: [Python-3000] String comparison
In-Reply-To:
Message-ID: <652940.60544.qm@web33501.mail.mud.yahoo.com>

--- Guido van Rossum wrote:
> > http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
> > (Conformance)
> >
> > C9 A process shall not assume that the
> > interpretations of two
> > canonical-equivalent character sequences are
> > distinct.
>
> That is surely contained inside all sorts of weasel
> words that allow
> us to define a "normalized equivalence" function
> that works that way,
> and leave the "==" operator for arrays of code
> points alone.

Regarding weasel words, my reading of the text below (particularly the
word "Ideally") is that processes should not make assumptions about
other processes, but C9 is not strict on how processes themselves
behave.

'''
C9 A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct.

The implications of this conformance clause are twofold. First, a
process is never required to give different interpretations to two
different, but canonical-equivalent character sequences. Second, no
process can assume that another process will make a distinction
between two different, but canonical-equivalent character sequences.

*Ideally* [emphasis added], an implementation would always interpret
two canonical-equivalent character sequences identically.
There are practical circumstances under which implementations may reasonably distinguish them. '''

I guess you could interpret the following tidbit to say that Python should never assume that text editors will distinguish canonical-equivalent sequences, but I doubt that settles any debate about what Python should do, and I think I'm stretching the interpretation to begin with:

''' Second, no process can assume that another process will make a distinction between two different, but canonical-equivalent character sequences. '''

From nnorwitz at gmail.com Thu Jun 7 09:16:04 2007
From: nnorwitz at gmail.com (Neal Norwitz)
Date: Thu, 7 Jun 2007 00:16:04 -0700
Subject: [Python-3000] problem with checking whitespace in svn pre-commit hook
Message-ID:

When I originally tried to check in rev 55797, I got this exception:

Traceback (most recent call last):
  File "/data/repos/projects/hooks/checkwhitespace.py", line 50, in ?
    run_app(main)
  File "/usr/lib/python2.3/site-packages/svn/core.py", line 33, in run_app
    return apply(func, (pool,) + args, kw)
  File "/data/repos/projects/hooks/checkwhitespace.py", line 43, in main
    if reindenter.run():
  File "/data/repos/projects/hooks/reindent.py", line 166, in run
    tokenize.tokenize(self.getline, self.tokeneater)
  File "/usr/lib/python2.3/tokenize.py", line 153, in tokenize
    tokenize_loop(readline, tokeneater)
  File "/usr/lib/python2.3/tokenize.py", line 159, in tokenize_loop
    for token_info in generate_tokens(readline):
  File "/usr/lib/python2.3/tokenize.py", line 233, in generate_tokens
    raise TokenError, ("EOF in multi-line statement", (lnum, 0))
tokenize.TokenError: ('EOF in multi-line statement', (315, 0))

I'm guessing this is because tokenization has changed between 2.3 and 3.0. I didn't have 2.3 on my system to test with. I ran reindent prior to committing, but that had no effect (i.e., still got the error). I disabled the hook so I could check in the change, which I'm pretty sure is normalized. However, we are likely to have this problem in the future. I fixed the script so it shouldn't raise an exception any longer. But people will still be prevented from checking in if this happens again. I wish I had modified the commit hook *before* checking in so I would at least know which file caused it. Oh well, I shouldn't do these things at the end of the day or beginning depending on how you look at it. :-)

n

From turnbull at sk.tsukuba.ac.jp Thu Jun 7 09:34:51 2007
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Thu, 07 Jun 2007 16:34:51 +0900
Subject: [Python-3000] String comparison
In-Reply-To: <20070606084543.6F3D.JCARLSON@uci.edu>
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu>
Message-ID: <87lkewgquc.fsf@uwakimon.sk.tsukuba.ac.jp>

Josiah Carlson writes:

> Maybe I'm missing something, but it seems to me that there might be a
> simple solution. Don't normalize any identifiers or strings.

That's not a solution, that's denying that there's a problem.

> Hear me out for a moment. People type what they want.

You're thinking in ASCII terms still, where code points == characters.
With Unicode, what they see is the *single character* they *want*, but it may be represented by a half-dozen characters in RAM, a different set of characters in the file, and they may have typed a dozen hard-to-relate keystrokes to get it (e.g., typing a *phonetic prefix* of a word whose untyped trailing character is the one they want). And if everything that handles the text is Unicode conformant, it doesn't matter! In that context, just what does "people type what they want" mean?

By analogy, suppose I want to generate a table (such as Martin's table-331.html) algorithmically. Then doesn't it seem reasonable that the representation might be something like u"\u0041"? But you know what that sneaky ol' Python 2.5 does to me if I evaluate it? It returns u'A'! And guess what else? u"\u0041" == u'A' returns True! And when I print either of them, I see what I expect: A.

Well, what Unicode-conformant editors are allowed to do with NFD and NFC (and all the non-normalized forms as well) is quite analogous. But a conformant process is expected not to distinguish among them, just as two instances of Python are expected to compare those two *different* string literals as equal. Thus it doesn't matter (for most purposes) what those editors do, just as it doesn't matter (except as a point of style) how you spell u"A".

> As for strings, I think we should opt for keeping it as simple as
> possible. Compare by code points.

If you normalize on the way in, you can do that *correctly*. If you don't ...

> To handle normalization issues, add a normalization method that
> people call if they care about normalized unicode strings*.

...you impose the normalization on application programmers who think of unicode strings as internationalized text (but they aren't! they're arrays of unsigned shorts), or on module writers who have weak incentive to get 100% coverage. Note that these programs don't crash; they silently give false negatives. Fixing these bugs *before* selling the code is hard and expensive; who will care to do it?

E.g., *you*. You clearly *don't* care in your daily work, even though you are sincerely trying to understand on python-dev. But your (quite proper!) objective is to lower costs to you and your code since YAGNI. Where *I* need it, I will cross you off my list of acceptable vendors (of off-the-shelf modules, I can't afford your consulting rates). Well and good, that's how it *should* work. But your (off-the-shelf) modules will possibly see use by the Japanese Social Security Administration, who have demonstrated quite graphically how little they care[1]. :-(

Furthermore, there are typically an awful lot of ways that a string can get into the process, and if you do care, you want to catch them all. This is a lot easier to do *in* the Python compiler and interpreter, which have a limited number of I/O channels, than it will be to do for a large library of modules, not all of which even exist at this date.

> * Or leave out normalization altogether in 3.0. I haven't heard any
> complaints about the lack of normalization in Python so far (though
> maybe I'm not reading the right python-list messages), and Python has
> had unicode for what, almost 10 years now?

I presented a personal anecdote about docutils in my response to GvR, and a failed test from XEmacs (which, admittedly, Python already gets right). Strictly speaking the former is not a normalization issue, since it's probably a fairly idiosyncratic change in docutils, but it's the kind of problem that would be mitigated by normalization.
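The silent false negatives in question are easy to demonstrate with the stdlib unicodedata module (a sketch, not part of the original message):

>>> import unicodedata
>>> s1 = 'L\u00f6wis'   # precomposed o-umlaut (NFC form)
>>> s2 = 'Lo\u0308wis'  # 'o' plus combining diaeresis (NFD form)
>>> s1 == s2            # both render as the same five characters, yet...
False
>>> unicodedata.normalize('NFC', s1) == unicodedata.normalize('NFC', s2)
True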
But you won't see a lot, because almost all text in Western European languages is almost automatically NFC, unless somebody who knows what they're doing deliberately denormalizes or renormalizes it (as in Mac OS X). Also, a lot of problems will get attributed to legacy encodings, although proper attention to canonical (and a subset of compatibility) equivalences would go a long way to resolve them.

These issues are going to become more prevalent as more scripts are added to Unicode, and actually come into use. And as their users start deploying IT on a large scale for the first time.

Footnotes: [1] About 20 million Japanese face partial or total loss of their pensions because the Japanese SSA couldn't be bothered to canonicalize their names accurately when the system was automated in the '90s.

From rrr at ronadam.com Thu Jun 7 11:15:30 2007
From: rrr at ronadam.com (Ron Adam)
Date: Thu, 07 Jun 2007 04:15:30 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To:
References: <4665EE44.2010306@ronadam.com>
Message-ID: <4667CCB2.6040405@ronadam.com>

Neal Norwitz wrote:
> On 6/5/07, Ron Adam wrote:
>> Alexandre Vassalotti wrote:
>> > On 6/5/07, Guido van Rossum wrote:
>> >> If "make clean" makes the problem go away, it's usually because there
>> >> were old .pyc files with incompatible byte code. We don't change the
>> >> .pyc magic number for each change to the compiler.
>> >
>> > Nope. It is still not working. I just did the following, and I still
>> > get the same error.
>> >
>> > % make # run fine
>> > % make # fail
>>
>> I can confirm the same behavior. Works on the first make, same error on
>> the second. I deleted the contents of the branch and did an "svn up"
>> on an
>> empty directory. Same thing.
>
> This probably means there is a problem with marshalling the byte code
> out. The first run compiles the .pyc files. Theoretically this
> writes out the same thing in memory. This isn't always the case
> though (i.e., when there are bugs).
>
> A workaround would be to just remove the .pyc files each time rather
> than do a make clean. Do:
>
> find . -name '*.pyc' -print0 | xargs -0 rm
>
> Bonus points for finding the bug. :-)

Well, not the bug yet, but I did find the file. :-)

The following clears it so make will work.

rm ./build/lib.linux-i686-3.0/_struct.so

So maybe something to do with Modules/_struct.c, or would it be something else that uses it? Removing all the .pyc files wasn't enough, nor was removing all the .o files.

BTW, I found it by running the commands from the 'clean' section of the makefile one at a time, then narrowed it down from there by making it more and more specific.

Version info:
Python 3.0x (py3k-struni, Jun 7 2007, 03:28:43)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
On 7.04 "Feisty Fawn"

Ron

From ncoghlan at gmail.com Thu Jun 7 12:59:50 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 07 Jun 2007 20:59:50 +1000
Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k?
In-Reply-To:
References:
Message-ID: <4667E526.1000503@gmail.com>

Guido van Rossum wrote:
> A few PEPs with numbers < 400 are now targeting Python 3000, e.g. PEP
> 367 (new super) and PEP 344 (exception chaining). Are there any
> others? I propose that we renumber these to numbers in the 3100+
> range. I can see two forms of renaming:
>
> (a) 344 -> 3344 and 367 -> 3367, i.e. add 3000 to the number
>
> (b) just use the next available number
>
> Preferences?
>
> What other PEPs should be renumbered?
>
> Should we renumber at all?
> +1 for renumbering to the next available 31xx number, with the old number kept as a pointer to the new one. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From barry at python.org Thu Jun 7 13:45:45 2007 From: barry at python.org (Barry Warsaw) Date: Thu, 7 Jun 2007 07:45:45 -0400 Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k? In-Reply-To: References: <43aa6ff70706061600y568ad4b4u17730d7f3ad97691@mail.gmail.com> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jun 6, 2007, at 9:21 PM, Chris Monson wrote: > Renumbering, +1; using the next 31xx number, +1. > > Renumbering +1 > Leaving (old PEP number) in place as a stripped down PEP that just > points to the new number: +1 I don't want to (accidentally) re-use the old number for some other PEP, and PEPs are intended to be the historical record of a feature, so my own preferences would be: - - Leave the old PEP in place, with a pointer to the renumbered PEP - - Renumber the PEP by putting a '3' in front of it instead of using the next available We don't have a template for renumbered PEPs so just come up with something reasonable that fits the flavor of PEPs. If we need to generalize later, we can. - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRmfv6nEjvBPtnXfVAQL1yAP9FOGBU5TMa4HUiP8IRoqS/wemFOdotHwf GwvNPIEthJXheUBS/lOWLpSCERUzToSfqWzUJWkOUk5JfxsDP6MgWKwfkOwhvp35 oihXrkWoc/XtK2qJipLXVWLhg/5CkPuvnjXSrVMzqpu5J26YPV/QIb2Xa0ICF90e c2mQY0cuzWM= =HMOs -----END PGP SIGNATURE----- From g.brandl at gmx.net Thu Jun 7 15:01:29 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Thu, 07 Jun 2007 15:01:29 +0200 Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k? In-Reply-To: <4667E526.1000503@gmail.com> References: <4667E526.1000503@gmail.com> Message-ID: Nick Coghlan schrieb: > Guido van Rossum wrote: >> A few PEPs with numbers < 400 are now targeting Python 3000, e.g. PEP >> 367 (new super) and PEP 344 (exception chaining). Are there any >> others? I propose that we renumber these to numbers in the 3100+ >> range. I can see two forms of renaming: >> >> (a) 344 -> 3344 and 367 -> 3367, i.e. add 3000 to the number >> >> (b) just use the next available number >> >> Preferences? >> >> What other PEPs should be renumbered? >> >> Should we renumber at all? >> > > +1 for renumbering to the next available 31xx number, with the old > number kept as a pointer to the new one. That would be my vote too. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. From stephen at xemacs.org Thu Jun 7 15:30:13 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Thu, 07 Jun 2007 22:30:13 +0900 Subject: [Python-3000] String comparison In-Reply-To: References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid> Message-ID: <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp> Guido van Rossum writes: > No it cannot. We are talking about \u escapes, not about a string > literal containing Unicode characters ("L?wis"). Ah, good point. I apologize for mistyping the example. *I* *was* talking about a string literal containing Unicode characters. 
However, on my terminal, you can't see the difference! So I (ab)used the \u escapes to make clear that in one case the representation used 5 characters and in the other 6.

> > > I might be writing either literal with the expectation to get
> > > exactly that sequence of code points,

This should be possible, agreed. Couldn't rawstring read syntax be given the right semantics? And of course you've always got tuples of integers.

What bothers me about the "sequence of code points" way of thinking is that len("Löwis") is nondeterministic. To my mind, especially from the educational standpoint, but also from the point of view of implementing a text editor or docutils, that's much more horrible than Martin's point that len(a) + len(b) == len(a+b) could fail if we do NFC normalization. (NFD would work here.)

I'm not sure what happened, but after recent upgrades to Python and docutils (presumably the latter) a bunch of Japanese reST documents of mine broke. I have no idea how to count the number of characters in a line containing Japanese any more (even having fixed the tables by trial and error, it's not obvious), but of course tables require being able to do that exactly. Normalization would guarantee TOOWDTI.

But IMO the right way to do normalization in such cases is in Python itself. One is *never* going to be able to keep up with all the external libraries, and it seems very unlikely that many will be high quality from this point of view. So even if your own code does the right thing, you have to wrap every external module you call. Or you can rewrite Python to normalize in the right places once, and then you don't have to worry about it. (Bugs, yes, but then you fix them in the forked Python, and all your code benefits from the fix automatically.)

> Bytes are not code points. The unicode string type has always been
> about code points, not characters.

I wish you had named it "widechar", then. I think that a language where len("Löwis") == len("Löwis") is an invariant is one honking good idea!

> Have you ever even used the unicode string type in Python 2?

Yes. On the Mac, I often have to run unicodes through normalization NFD because some levels of Mac OS X do normalize NFD and others don't normalize at all. That means that file names in particular tend to be different depending on whether I get them from the OS or from the user. But a test as simple as creating a file with a name containing \u010D and trying to stat it can fail, AIUI because stdio normalizes NFD but the raw OS stat call doesn't. This particular test does work in Python, I'm not sure what the difference is.

Granted that that's part of the plan and not serendipity, nonetheless, I think the default case should be that text operations produce the expected result in the text domain, even at the expense of array invariants. People who need arrays of code points have several ways to get them, and the usual comparison operators will work on them as desired. While people who need operations on *text* still have no straightforward way to get them, and no promise of one as I read your remarks.

From jimjjewett at gmail.com Thu Jun 7 17:24:22 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Thu, 7 Jun 2007 11:24:22 -0400
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <4665B7D7.6030501@v.loewis.de>
References: <46371BD2.7050303@v.loewis.de> <46646E76.8060804@v.loewis.de> <4665189D.4020301@v.loewis.de> <46659A37.4000900@v.loewis.de> <4665B7D7.6030501@v.loewis.de>
Message-ID:

On 6/5/07, "Martin v.
L?wis" wrote: > >> > Unicode does say pretty clearly that (at least) canonical > >> > equivalents must be treated the same. On reflection, what it actually says is that you may not assume they are different. They can be different in the same way that two identical strings are different under "is", but anything stronger has to be strictly internal. If any code outside the python core even touches the string, then the the choice of representations becomes arbitrary, and can switch for spurious reasons. Immutability should prevent mid-run switching for a single "is" string, but not for different strings that should compare "==". Dictionaries keys need to keep working, which means hash and equality have to do the right thing. Ordering may technically be a quality-of-implementation issue, but ... normalizing strings on creation solves an awful lot of problems, including providing a "best practice" for C extensions. Not normalizing will save a small amount of time, at the cost of a never-ending hunt for rare and obscure bugs. > >> Chapter and verse, please? > > I am pretty sure this list is not exhaustive, but it may be > > helpful: > > The Identifiers Annex http://www.unicode.org/reports/tr31/ > Ah, that's in the context of identifiers, not in the context of text > in general. Yes, but that should also apply to dict and shelve keys. If you want an array of code points, then you want a tuple of ints, not text. > > """ > > Normalization Forms KC and KD must not be blindly > > applied to arbitrary text. > > """ Note that it lists only the Kompatibility forms. By implication, forms NFC and NFD *can* be blindly applied to arbitrary text. (And conformance rule C9 means you have to assume that someone else might do so, if, say, the text is python source code that may have been externally edited.) ... """ > > They can be applied more freely to domains with restricted > > character sets, such as in Section 13, Programming > > Language Identifiers. > > """ > > (section 13 then forwards back to UAX31) > How is that a requirement that comparison should apply > normalization? It isn't a requirement that we apply normalization. But (1) There is a requirement that semantics not change based on external canonical [de]normalization of source code, including literal strings. (I agree that explicit python-level escapes -- made after the file has already been converted from bytes to characters -- are legitimate, just as changing 1.0 from a string to a number is legitimate.) (2) It is a *suggestion* that we consider the stronger Kompatibility normalizations for source code. There are cases where strings which are equal under Kompatibility shouldl be treated differently, but, I think, in practice, the difference is more likely to be from typos or difficulty entering the proper characters. Normalizing to the compatibility form would be helpful for some people (Japanese and Korean input was mentioned). I think needed to distinguish the Kompatibility characters (and not even in data; in source literals) will be rare enough that it is worth making the distinction explicit. (If you need to use a compatibility character, then use an escape, rather than the character, so that people will know you really mean the alternate, instead of the "normal" character looking like that.) 
-jJ

From jimjjewett at gmail.com Thu Jun 7 17:29:39 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Thu, 7 Jun 2007 11:29:39 -0400
Subject: [Python-3000] PEP: Supporting Non-ASCII Identifiers
In-Reply-To: <4667617F.5060807@canterbury.ac.nz>
References: <46371BD2.7050303@v.loewis.de> <4662F639.2070806@v.loewis.de> <46646E76.8060804@v.loewis.de> <4665189D.4020301@v.loewis.de> <4667617F.5060807@canterbury.ac.nz>
Message-ID:

On 6/6/07, Greg Ewing wrote:
> Are you suggesting that this should be done on the fly
> when comparing strings? Or that all strings should be
> stored in canonicalised form?

Preferably the second; store them canonicalized.

> I can see some big cans of worms being opened up by
> either approach. Surprising results could include
> things like s1 == s2 but len(s1) <> len(s2), or
> len(s1 + s2) <> len(s1) + len(s2).

Yes, these are surprising, but that is the nature of unicode. People will get used to it, with the same pains they face now over "1" + "1" = "11", or output that doesn't line up because one row had a single-digit number.

-jJ

From alexandre at peadrop.com Thu Jun 7 17:47:28 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Thu, 7 Jun 2007 11:47:28 -0400
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To:
References: <4665EE44.2010306@ronadam.com>
Message-ID:

On 6/6/07, Neal Norwitz wrote:
> This probably means there is a problem with marshalling the byte code
> out. The first run compiles the .pyc files. Theoretically this
> writes out the same thing in memory. This isn't always the case
> though (i.e., when there are bugs).
>
> A workaround would be to just remove the .pyc files each time rather
> than do a make clean. Do:
>
> find . -name '*.pyc' -print0 | xargs -0 rm
>

Nope. Removing the byte-compiled Python files didn't change anything.

> Bonus points for finding the bug. :-)

Oh? :)

> A quick way to test this is to try to roundtrip it. Something like:
>
> >>> s = '''\
> ... class F:
> ...     def foo(self, *args):
> ...         print(self, args)
> ... '''
> >>> code = compile(s, 'foo', 'exec')
> >>> import marshal
> >>> marshal.loads(marshal.dumps(code)) == code
> True
>
> If it doesn't equal True, you found the problem.

I got True. So, the problem is probably not the byte code.

-- Alexandre

From alexandre at peadrop.com Thu Jun 7 17:50:05 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Thu, 7 Jun 2007 11:50:05 -0400
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <4667CCB2.6040405@ronadam.com>
References: <4665EE44.2010306@ronadam.com> <4667CCB2.6040405@ronadam.com>
Message-ID:

On 6/7/07, Ron Adam wrote:
> Well, not the bug yet, but I did find the file. :-)
>
> The following clears it so make will work.
>
> rm ./build/lib.linux-i686-3.0/_struct.so
>
> So maybe something to do with Modules/_struct.c, or would it be something
> else that uses it?

Removing any compiled extension files will work too. So, _struct isn't the source of the problem.
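A small sketch for automating that workaround (my own, not from the thread; it assumes the in-tree build/ directory layout mentioned above):

import os

# Remove every built extension module so the next make relinks them all.
for dirpath, dirnames, filenames in os.walk('build'):
    for name in filenames:
        if name.endswith('.so'):
            os.remove(os.path.join(dirpath, name))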
-- Alexandre

From janssen at parc.com Thu Jun 7 18:35:55 2007
From: janssen at parc.com (Bill Janssen)
Date: Thu, 7 Jun 2007 09:35:55 PDT
Subject: [Python-3000] String comparison
In-Reply-To:
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid> <4846239003818249252@unknownmsgid>
Message-ID: <07Jun7.093602pdt."57996"@synergy1.parc.xerox.com>

> Then you wouldn't even be able to iterate over or index strings anymore,
> as that could produce such "invalid" strings, which would need to
> generate exceptions if you really want to ban them.

I don't think that's right: iterating over the string should presumably generate an iteration of valid sub-strings, each of length one. It would not generate a sequence of integers.

[x for x in "abc"] != [ord(x) for x in "abc"]

> making people type 'o\u0308'[1] instead of '\u0308'?

'o\u0308'[1] should generate an ArrayBounds exception, since you're indexing into a string of length 1.

Bill

From rauli.ruohonen at gmail.com Thu Jun 7 18:47:17 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Thu, 7 Jun 2007 19:47:17 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid> <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID:

On 6/7/07, Stephen J. Turnbull wrote:
> I apologize for mistyping the example. *I* *was* talking about a
> string literal containing Unicode characters.

Then I misunderstood you too. To avoid such problems, I will use XML character references to denote code points here. Wherever you see such a thing in this e-mail, replace it in your mind with the corresponding code point *immediately*. E.g. len(r'&#xc5;') == 1, but len(r'\u00c5') == 6. This is not a proposal for Python syntax, it is a device to make what I say clear.

> However, on my terminal, you can't see the difference! So I (ab)used
> the \u escapes to make clear that in one case the representation used
> 5 characters and in the other 6.

Your code was:

> if u"L\u00F6wis" == u"Lo\u0308wis":
>     print "Python is Unicode conforming in this respect."

I take it, by your explanation above, that you meant that the (py3k) source code is this:

if "L&#xF6;wis" == "Lo&#x308;wis":
    print "Python is Unicode conforming in this respect."

I agree that here == should be true, but only because Python should normalize the source code to look like this before processing it:

if "L&#xF6;wis" == "L&#xF6;wis":
    print "Python is Unicode conforming in this respect."

In the following code == should be false:

if "L\u00F6wis" == "Lo\u0308wis":
    print "Python is Unicode conforming in this respect."

> I think the default case should be that text operations produce the
> expected result in the text domain, even at the expense of array
> invariants.

If you really want that, then you need a type for sequences of graphemes. E.g. 'c\u0308' is already normalized according to all four normalization rules, but it's still one grapheme ('c' with diaeresis, c&#x308;) and two code points. This type could be provided in the standard library.

> People who need arrays of code points have several ways to get them,
> and the usual comparison operators will work on them as desired.

But regexps and other string operations won't, and those are the whole point of strings, not comparison operators.
If comparisons were enough, then the string type could be removed as redundant - there's already the array module (or numpy) if you're only concerned about efficient storage.

> While people who need operations on *text* still have no
> straightforward way to get them, and no promise of one as I read your
> remarks.

Then you missed some of his earlier remarks:

Guido:
: I'm all for adding a way to do normalized string comparisons to the
: library. But I'm not about to change the == operator to apply
: normalization first.

From guido at python.org Thu Jun 7 19:10:12 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 10:10:12 -0700
Subject: [Python-3000] String comparison
In-Reply-To: <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid> <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID:

On 6/7/07, Stephen J. Turnbull wrote:
> What bothers me about the "sequence of code points" way of thinking is
> that len("Löwis") is nondeterministic.

It doesn't have to be, *for this specific example*. After what I've read so far, I'm okay with normalization happening on the text of the source code before it reaches the lexer, if that's what people prefer. I'm also okay with normalization happening by default in the text I/O layer, as long as there's a way to disable it that doesn't require me to switch to bytes.

However, I'm *not* okay with requiring all text strings to be normalized, or normalizing them before comparing/hashing, after slicing/concatenation, etc. If you want to have an abstraction that guarantees you'll never see an unnormalized text string you should design a library for doing so. I encourage you or others to contribute such a library (*). But the 3.0 core language's 'str' type (like Python 2.x's 'unicode' type) will be an array of code points that is neutral about normalization.

Python is a general programming language, not a text manipulating library. As a general programming language, it must be possible to represent unnormalized sequences of code points -- otherwise, it could not implement algorithms for normalization in Python! (Again, forcing me to do this using UTF-8-encoded bytes or lists of ints is unacceptable.)

There are also Jython and IronPython to consider. These have extensive integration in the Java and .NET runtimes, respectively, where strings are represented as sequences of code points. Having a correspondence between the "natural" string type across language boundaries is very important.

Yes, this makes text processing harder if you want to get every corner case right. We need to educate our users about Unicode and point them to relevant portions of the standard. I don't think that can be avoided anyway -- the complexity is inherent to the domain of multi-alphabet text processing, and cannot be argued away by insisting that the language handle it.

(*) It looks like such a library will not have a way to talk about "\u0308" at all, since it is considered unnormalized. Things like bidirectionality will probably have to be handled in a different way (without referencing the code points indicating text direction) as well.
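To make the point concrete, a quick interpreter sketch (not from the message itself) showing that str happily represents a lone combining mark -- exactly the kind of value a normalization algorithm has to manipulate:

>>> import unicodedata
>>> bare = '\u0308'        # COMBINING DIAERESIS with no base character
>>> len(bare)              # a perfectly valid one-element sequence of code points
1
>>> unicodedata.category(bare)
'Mn'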
--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org Thu Jun 7 19:34:22 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 10:34:22 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To:
References: <4665EE44.2010306@ronadam.com> <4667CCB2.6040405@ronadam.com>
Message-ID:

On 6/7/07, Alexandre Vassalotti wrote:
> On 6/7/07, Ron Adam wrote:
> > Well, not the bug yet, but I did find the file. :-)
> >
> > The following clears it so make will work.
> >
> > rm ./build/lib.linux-i686-3.0/_struct.so
> >
> > So maybe something to do with Modules/_struct.c, or would it be something
> > else that uses it?
>
> Removing any compiled extension files will work too. So, _struct isn't
> the source of the problem.

It's time to look at the original traceback (attached as "tb", after fixing the formatting problems). It looks like any call to encodings.normalize_encoding() causes this problem. I don't know why linking an extension avoids this, and why it's only a problem for you and not for me, but that's probably a locale setting (if you mail me the values of all your locale-specific environment variables I can try to reproduce it).

The trail leads back to the optparse module using the gettext module to translate its error messages. That seems overengineered to me, but I won't argue too strongly.

In any case, the root cause is that normalize_encoding() is badly broken. I've attached a hack that might fix it. Can you try if that helps?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tb
Type: application/octet-stream
Size: 1267 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070607/7b1608fb/attachment.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hack
Type: application/octet-stream
Size: 778 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070607/7b1608fb/attachment-0001.obj

From martin at v.loewis.de Thu Jun 7 19:37:44 2007
From: martin at v.loewis.de (martin at v.loewis.de)
Date: Thu, 07 Jun 2007 19:37:44 +0200
Subject: [Python-3000] problem with checking whitespace in svn pre-commit hook
Message-ID: <20070607193744.p3ttcg6gd4wsgokc@webmail.df.eu>

> tokenize.TokenError: ('EOF in multi-line statement', (315, 0))

I analyzed that a bit further, and found that Lib/distutils/unixccompiler.py:214 reads

if not isinstance(output_dir, (str, type(None)):

This is a syntax error; a closing parenthesis is missing. tokenize.py chokes at the EOF as the parentheses aren't balanced.

> I ran reindent prior to committing, but that had no effect (i.e.,
> still got the error).

I find that hard to believe - running reindent.py on the file fails for me with Python 2.5 as well.
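The failure is easy to reproduce outside the hook; a sketch using the modern tokenize API (my own, not the hook's actual code):

import io
import tokenize

# The unbalanced parenthesis from Lib/distutils/unixccompiler.py:214.
src = "if not isinstance(output_dir, (str, type(None)):\n    pass\n"
try:
    for token in tokenize.generate_tokens(io.StringIO(src).readline):
        pass
except tokenize.TokenError as err:
    print(err)  # ('EOF in multi-line statement', ...)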
Regards, Martin

From nnorwitz at gmail.com Thu Jun 7 19:55:35 2007
From: nnorwitz at gmail.com (Neal Norwitz)
Date: Thu, 7 Jun 2007 10:55:35 -0700
Subject: [Python-3000] problem with checking whitespace in svn pre-commit hook
In-Reply-To: <20070607193744.p3ttcg6gd4wsgokc@webmail.df.eu>
References: <20070607193744.p3ttcg6gd4wsgokc@webmail.df.eu>
Message-ID:

On 6/7/07, martin at v.loewis.de wrote:
> > tokenize.TokenError: ('EOF in multi-line statement', (315, 0))
>
> I analyzed that a bit further, and found that
> Lib/distutils/unixccompiler.py:214 reads
>
> if not isinstance(output_dir, (str, type(None)):
>
> This is a syntax error; a closing parenthesis is missing.
> tokenize.py chokes at the EOF as the parentheses aren't balanced.
>
> > I ran reindent prior to committing, but that had no effect (i.e.,
> > still got the error).
>
> I find that hard to believe - running reindent.py on the file
> fails for me with Python 2.5 as well.

I ran reindent with py3k, something like: ./python Tools/scripts/reindent.py Lib IIRC. I don't have the command line handy. I'll fix this when I get home tonight.

Has anyone tried the 3k reindent? Or did I just screw that up?

n

From guido at python.org Thu Jun 7 20:16:41 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 11:16:41 -0700
Subject: [Python-3000] problem with checking whitespace in svn pre-commit hook
In-Reply-To:
References: <20070607193744.p3ttcg6gd4wsgokc@webmail.df.eu>
Message-ID:

On 6/7/07, Neal Norwitz wrote:
> On 6/7/07, martin at v.loewis.de wrote:
> > > tokenize.TokenError: ('EOF in multi-line statement', (315, 0))
> >
> > I analyzed that a bit further, and found that
> > Lib/distutils/unixccompiler.py:214 reads
> >
> > if not isinstance(output_dir, (str, type(None)):
> >
> > This is a syntax error; a closing parenthesis is missing.
> > tokenize.py chokes at the EOF as the parentheses aren't balanced.
> >
> > > I ran reindent prior to committing, but that had no effect (i.e.,
> > > still got the error).
> >
> > I find that hard to believe - running reindent.py on the file
> > fails for me with Python 2.5 as well.
>
> I ran reindent with py3k, something like: ./python
> Tools/scripts/reindent.py Lib
> IIRC. I don't have the command line handy. I'll fix this when I get
> home tonight.
>
> Has anyone tried the 3k reindent? Or did I just screw that up?

The py3k reindent is just fine; you screwed up the closing paren on line 214 in unixccompiler.py. All versions of reindent that I can find correctly complain about that. I'm curious how you managed to bypass it! :-) I've checked in the fix.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jcarlson at uci.edu Thu Jun 7 20:34:16 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Thu, 07 Jun 2007 11:34:16 -0700
Subject: [Python-3000] String comparison
In-Reply-To: <87lkewgquc.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <20070606084543.6F3D.JCARLSON@uci.edu> <87lkewgquc.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <20070607084121.6F4A.JCARLSON@uci.edu>

"Stephen J. Turnbull" wrote:
> Josiah Carlson writes:
>
> > Maybe I'm missing something, but it seems to me that there might be a
> > simple solution. Don't normalize any identifiers or strings.
>
> That's not a solution, that's denying that there's a problem.

For core Python, there is no problem. The standard libraries don't have any normalization issues, nor will they have any normalization issues.
The only place where there could be potential for normalization issues is in to-be-written 3rd party code. With that said, from what I understand, there are three places where we could potentially do normalization; identifiers, literals, data. Identifiers and literals have the best case for normalization, data the worst (don't change my data without me telling you to!) From Guido's recent post, he seems to say more or less the same thing with normalization to text read through the text IO layer. Since I don't expect to be reading much unicode from disk (and/or I expect to be reading bytes and decoding them to unicode manually), being able to disable normalization on data from text IO is fine. Regarding the rest of it, I've come to the point of exhaustion. I no longer have the energy to care what happens with Python 3.0 and unicode (identifiers, literals, data, types, etc.), but I hope Ka-Ping is able to convince people more than I have. Good luck with the decisions. Good day, - Josiah From alexandre at peadrop.com Thu Jun 7 20:37:45 2007 From: alexandre at peadrop.com (Alexandre Vassalotti) Date: Thu, 7 Jun 2007 14:37:45 -0400 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: References: <4665EE44.2010306@ronadam.com> <4667CCB2.6040405@ronadam.com> Message-ID: On 6/7/07, Guido van Rossum wrote: > It's time to look at the original traceback (attached as "tb", after > fixing the formatting problems). it looks like any call to > encodings.normalize_encoding() causes this problem. Don't know if it will help to know that, but it seems adding a debugging print() in the normalize_encoding method, makes Python act weird: >>> print("hello") # no output [38357 refs] >>> hello? # note the exception is not shown [30684 refs] >>> exit() # does quit > I don't know why linking an extension avoids this, and why it's only > a problem for you and not for me, but that's probably a locale > setting (if you mail me the values of all your locale-specific > environment variables I can try to reproduce it). I don't think it is related to locales settings. Since even with a minimum number of environment variables, I still can reproduce the problem. % sh $ for v in `set | egrep -v 'OPTIND|PS|PATH' | cut -d "=" -f1` > do unset $v; done $ make make: *** [sharedmods] Error 1 > The trail leads back to the optparse module using the gettext module > to translate its error messages. That seems overengineered to me, > but I won't argue too strongly. > > In any case, the root cause is that normalize_encoding() is badly > broken. I've attached a hack that might fix it. Can you try if that > helps? Yep, that worked. What this new str8 type is for, btw? It is the second time I encounter it, today. -- Alexandre From alexandre at peadrop.com Thu Jun 7 20:46:15 2007 From: alexandre at peadrop.com (Alexandre Vassalotti) Date: Thu, 7 Jun 2007 14:46:15 -0400 Subject: [Python-3000] pdb help is broken in py3k-struni branch In-Reply-To: References: Message-ID: I found a way to fix the bug; look at the attached patch. Although, I am not sure it was correct way to fix it. The problem was due to str8 that is recognized as an instance of `str'. -- Alexandre On 6/5/07, Guido van Rossum wrote: > On 6/5/07, Alexandre Vassalotti wrote: > > On 6/5/07, Guido van Rossum wrote: > > > I'd rather see them here than in SF, SF is a pain to use. > > > > > > But unless the bugs prevent you from proceeding, you could also ignore them. 
> > > > The first bug that I reported today (the one about `make`) stop me > > from running the test suite. So, can't really test the _string_io and > > _bytes_io modules. > > I tried to reproduce it but it works fine for me -- I'm on Ubuntu > dapper (with some Google mods) on a 2.6.18.5-gg4 kernel. > > -- > --Guido van Rossum (home page: http://www.python.org/~guido/) > -------------- next part -------------- A non-text attachment was scrubbed... Name: pdb-help.patch Type: text/x-patch Size: 1056 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070607/b2463067/attachment.bin From alexandre at peadrop.com Thu Jun 7 20:55:08 2007 From: alexandre at peadrop.com (Alexandre Vassalotti) Date: Thu, 7 Jun 2007 14:55:08 -0400 Subject: [Python-3000] help() broken in the py3k-struni branch In-Reply-To: References: Message-ID: On 6/5/07, Guido van Rossum wrote: > Feel free to mail me a patch to fix it. > Since you asked so politely, here a patch for you. :) > On 6/5/07, Alexandre Vassalotti wrote: > > Hi, > > > > I found another bug to report. It seems there is a bug in > > subprocess.py that makes help() fail. > > > > -- Alexandre > > > > Python 3.0x (py3k-struni, Jun 5 2007, 18:41:44) > > [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2 > > Type "help", "copyright", "credits" or "license" for more information. > > >>> help(open) > > Traceback (most recent call last): > > File "", line 1, in > > File "/home/alex/src/python.org/py3k-struni/Lib/site.py", line 350, > > in __call__ > > return pydoc.help(*args, **kwds) > > File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line > > 1687, in __call__ > > self.help(request) > > File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1731, in help > > else: doc(request, 'Help on %s:') > > File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1514, in doc > > pager(render_doc(thing, title, forceload)) > > File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1313, in pager > > pager(text) > > File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line > > 1333, in > > return lambda text: pipepager(text, 'less') > > File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line > > 1352, in pipepager > > pipe = os.popen(cmd, 'w') > > File "/home/alex/src/python.org/py3k-struni/Lib/os.py", line 717, in popen > > bufsize=buffering) > > File "/home/alex/src/python.org/py3k-struni/Lib/subprocess.py", line > > 476, in __init__ > > raise TypeError("bufsize must be an integer") > > TypeError: bufsize must be an integer > > _______________________________________________ > > Python-3000 mailing list > > Python-3000 at python.org > > http://mail.python.org/mailman/listinfo/python-3000 > > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > > > > > -- > --Guido van Rossum (home page: http://www.python.org/~guido/) > -- Alexandre Vassalotti -------------- next part -------------- A non-text attachment was scrubbed... 
Name: help-buf-fix.patch
Type: text/x-patch
Size: 441 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070607/a36a4ac0/attachment.bin

From guido at python.org Thu Jun 7 20:55:40 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 11:55:40 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To:
References: <4665EE44.2010306@ronadam.com> <4667CCB2.6040405@ronadam.com>
Message-ID:

On 6/7/07, Alexandre Vassalotti wrote:
> On 6/7/07, Guido van Rossum wrote:
> > It's time to look at the original traceback (attached as "tb", after
> > fixing the formatting problems). It looks like any call to
> > encodings.normalize_encoding() causes this problem.
>
> Don't know if it will help to know that, but it seems adding a
> debugging print() in the normalize_encoding method, makes Python act
> weird:
>
> >>> print("hello") # no output
> [38357 refs]
> >>> hello? # note the exception is not shown
> [30684 refs]
> >>> exit() # does quit

That's a bootstrapping issue. normalize_encoding() is apparently called in order to set up stdin/stdout/stderr, so it shouldn't attempt to touch those (or raise errors).

> > I don't know why linking an extension avoids this, and why it's only
> > a problem for you and not for me, but that's probably a locale
> > setting (if you mail me the values of all your locale-specific
> > environment variables I can try to reproduce it).
>
> I don't think it is related to locales settings. Since even with a
> minimum number of environment variables, I still can reproduce the
> problem.
>
> % sh
> $ for v in `set | egrep -v 'OPTIND|PS|PATH' | cut -d "=" -f1`
> > do unset $v; done
> $ make
> make: *** [sharedmods] Error 1

Well, then it is up to you to come up with a hypothesis for why it doesn't happen on my system. (I tried the above thing and it still works.)

> > The trail leads back to the optparse module using the gettext module
> > to translate its error messages. That seems overengineered to me,
> > but I won't argue too strongly.
> >
> > In any case, the root cause is that normalize_encoding() is badly
> > broken. I've attached a hack that might fix it. Can you try if that
> > helps?
>
> Yep, that worked. What this new str8 type is for, btw? It is the second
> time I encounter it, today.

It is the temporary new name for the old 8-bit str type. The plan is to rename unicode->str and delete the old str type, but in the short term that doesn't quite work because there is too much C code that requires 8-bit strings (and can't be made to work with the bytes type either). So for the time being I've renamed the old str type to str8 rather than deleting it altogether. Once we have things 99% working this way we'll make another pass to get rid of str8 completely -- or perhaps keep it around under some other name with reduced functionality (since there have been requests for an immutable bytes type).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rrr at ronadam.com Thu Jun 7 22:54:07 2007
From: rrr at ronadam.com (Ron Adam)
Date: Thu, 07 Jun 2007 15:54:07 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To:
References: <4665EE44.2010306@ronadam.com> <4667CCB2.6040405@ronadam.com>
Message-ID: <4668706F.6080406@ronadam.com>

Guido van Rossum wrote:
> On 6/7/07, Alexandre Vassalotti wrote:
>> On 6/7/07, Guido van Rossum wrote:
>>> It's time to look at the original traceback (attached as "tb", after
>>> fixing the formatting problems). It looks like any call to
>>> encodings.normalize_encoding() causes this problem.
>>
>> Don't know if it will help to know that, but it seems adding a
>> debugging print() in the normalize_encoding method, makes Python act
>> weird:
>>
>> >>> print("hello") # no output
>> [38357 refs]
>> >>> hello? # note the exception is not shown
>> [30684 refs]
>> >>> exit() # does quit
>
> That's a bootstrapping issue. normalize_encoding() is apparently
> called in order to set up stdin/stdout/stderr, so it shouldn't attempt
> to touch those (or raise errors).
>
>>> I don't know why linking an extension avoids this, and why it's only
>>> a problem for you and not for me, but that's probably a locale
>>> setting (if you mail me the values of all your locale-specific
>>> environment variables I can try to reproduce it).
>> I don't think it is related to locales settings. Since even with a
>> minimum number of environment variables, I still can reproduce the
>> problem.
>>
>> % sh
>> $ for v in `set | egrep -v 'OPTIND|PS|PATH' | cut -d "=" -f1`
>> > do unset $v; done
>> $ make
>> make: *** [sharedmods] Error 1
>
> Well, then it is up to you to come up with a hypothesis for why it
> doesn't happen on my system. (I tried the above thing and it still
> works.)

There are a couple of things going on here.

The "sharedmods" section of the makefile doesn't execute on every make depending on what options are set or what targets are built. That is why the error doesn't occur on the first run after a 'make clean', and why it doesn't occur if some targets are rebuilt like _struct.so. I'm not sure why it matters which files are built in this case. Also, if you have some make flags set then it may be avoiding that particular problem because the default 'all' section is never run.

Does setup.py run without an error for you? (Without the encodings.__init__.py patch.) How about "make test"?

I've run across the same zero arg split error a while back when attempting to run 'make test'. Below was the solution I came up with. Is there going to be a unicode equivalent to the str.translate() method?

Cheers, Ron

Index: Lib/encodings/__init__.py
===================================================================
--- Lib/encodings/__init__.py (revision 55388)
+++ Lib/encodings/__init__.py (working copy)
@@ -34,19 +34,16 @@
 _cache = {}
 _unknown = '--unknown--'
 _import_tail = ['*']
-_norm_encoding_map = (' . '
-                      '0123456789 ABCDEFGHIJKLMNOPQRSTUVWXYZ '
-                      ' abcdefghijklmnopqrstuvwxyz '
-                      ' '
-                      ' '
-                      ' ')
+_norm_encoding_map = ('.0123456789'
+                      'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
+                      'abcdefghijklmnopqrstuvwxyz')
+
 _aliases = aliases.aliases

 class CodecRegistryError(LookupError, SystemError):
     pass

 def normalize_encoding(encoding):
-
     """ Normalize an encoding name.

         Normalization works as follows: all non-alphanumeric
@@ -54,18 +51,12 @@
         collapsed and replaced with a single underscore, e.g. ' -;#'
         becomes '_'. Leading and trailing underscores are removed.

-        Note that encoding names should be ASCII only; if they do use
-        non-ASCII characters, these must be Latin-1 compatible.
+        Note that encoding names should be ASCII characters only; if they
+        do use non-ASCII characters, these must be Latin-1 compatible.
     """
-    # Make sure we have an 8-bit string, because .translate() works
-    # differently for Unicode strings.
-    if isinstance(encoding, str):
-        # Note that .encode('latin-1') does *not* use the codec
-        # registry, so this call doesn't recurse.  (See unicodeobject.c
-        # PyUnicode_AsEncodedString() for details)
-        encoding = encoding.encode('latin-1')
-    return '_'.join(encoding.translate(_norm_encoding_map).split())
+    return ''.join([ch if ch in _norm_encoding_map else '_'
+                    for ch in encoding])

>>> The trail leads back to the optparse module using the gettext module
>>> to translate its error messages. That seems overengineered to me,
>>> but I won't argue too strongly.
>>>
>>> In any case, the root cause is that normalize_encoding() is badly
>>> broken. I've attached a hack that might fix it. Can you try if that
>>> helps?
>> Yep, that worked. What this new str8 type is for, btw? It is the second
>> time I encounter it, today.
> It is the temporary new name for the old 8-bit str type. The plan is
> to rename unicode->str and delete the old str type, but in the short
> term that doesn't quite work because there is too much C code that
> requires 8-bit strings (and can't be made to work with the bytes type
> either). So for the time being I've renamed the old str type to str8
> rather than deleting it altogether. Once we have things 99% working
> this way we'll make another pass to get rid of str8 completely -- or
> perhaps keep it around under some other name with reduced
> functionality (since there have been requests for an immutable bytes
> type).

From martin at v.loewis.de Thu Jun 7 23:05:26 2007
From: martin at v.loewis.de (=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu, 07 Jun 2007 23:05:26 +0200
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To:
References: <4665EE44.2010306@ronadam.com> <4667CCB2.6040405@ronadam.com>
Message-ID: <46687316.8090109@v.loewis.de>

> It's time to look at the original traceback (attached as "tb", after
> fixing the formatting problems). It looks like any call to
> encodings.normalize_encoding() causes this problem.

One problem with normalize_encoding is that it might do

encoding = encoding.encode('latin-1')
return '_'.join(encoding.translate(_norm_encoding_map).split())

Here, encoding is converted from a str (unicode) object into a bytes object. That is passed to translate, and then split, which in turn gives

py> b"Hallo, World".split()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: split() takes at least 1 argument (0 given)

So the problem is that bytes is not fully compatible with str or str8, here: it doesn't support the parameter-less split. In turn, normalize_encoding encodes as latin-1 because otherwise, translate won't work as expected.

I think the right solution would be to just fix the translate table, replacing everything but [A-Za-z0-9] with a space.

FWIW, for me the build error goes away when I unset LANG, so that the error occurs during build definitely *is* a locale issue.

Regards, Martin

From martin at v.loewis.de Thu Jun 7 23:07:32 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu, 07 Jun 2007 23:07:32 +0200
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <4668706F.6080406@ronadam.com>
References: <4665EE44.2010306@ronadam.com> <4667CCB2.6040405@ronadam.com> <4668706F.6080406@ronadam.com>
Message-ID: <46687394.4070003@v.loewis.de>

> I've run across the same zero arg split error a while back when attempting
> to run 'make test'. Below was the solution I came up with. Is there going
> to be a unicode equivalent to the str.translate() method?

The unicode type supports translate since 2.0.
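For reference, with a unicode string the translate table can simply be a dict keyed by ordinals; a quick sketch (my illustration, not Martin's code):

>>> table = {ord(c): '_' for c in ' -;#'}
>>> 'latin-1; foo'.translate(table)
'latin_1__foo'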
Regards, Martin

From guido at python.org Thu Jun 7 23:47:10 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 14:47:10 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <46687316.8090109@v.loewis.de>
References: <4665EE44.2010306@ronadam.com> <4667CCB2.6040405@ronadam.com> <46687316.8090109@v.loewis.de>
Message-ID:

On 6/7/07, "Martin v. Löwis" wrote:
> > It's time to look at the original traceback (attached as "tb", after
> > fixing the formatting problems). It looks like any call to
> > encodings.normalize_encoding() causes this problem.
>
> One problem with normalize_encoding is that it might do
>
> encoding = encoding.encode('latin-1')
> return '_'.join(encoding.translate(_norm_encoding_map).split())
>
> Here, encoding is converted from a str (unicode) object
> into a bytes object. That is passed to translate, and then
> split, which in turn gives
>
> py> b"Hallo, World".split()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: split() takes at least 1 argument (0 given)
>
> So the problem is that bytes is not fully compatible with
> str or str8, here: it doesn't support the parameter-less
> split.

Which is intentional (sort of).

> In turn, normalize_encoding encodes as latin-1 because
> otherwise, translate won't work as expected.
>
> I think the right solution would be to just fix the
> translate table, replacing everything but [A-Za-z0-9]
> with a space.

I rewrote the algorithm using more basic operations. It's slower now -- does that matter? Here's what I checked in:

    chars = []
    punct = False
    for c in encoding:
        if c.isalnum() or c == '.':
            if punct and chars:
                chars.append('_')
            chars.append(c)
            punct = False
        else:
            punct = True
    return ''.join(chars)

> FWIW, for me the build error goes away when I unset
> LANG, so that the error occurs during build definitely
> *is* a locale issue.

I still can't reproduce this. Oh well. It should be gone.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org Thu Jun 7 23:50:37 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 14:50:37 -0700
Subject: [Python-3000] pdb help is broken in py3k-struni branch
In-Reply-To:
References:
Message-ID:

Looks great -- can you check it in yourself?

On 6/7/07, Alexandre Vassalotti wrote:
> I found a way to fix the bug; look at the attached patch. Although, I
> am not sure it was correct way to fix it. The problem was due to str8
> that is recognized as an instance of `str'.
>
> -- Alexandre
>
> On 6/5/07, Guido van Rossum wrote:
> > On 6/5/07, Alexandre Vassalotti wrote:
> > > On 6/5/07, Guido van Rossum wrote:
> > > > I'd rather see them here than in SF, SF is a pain to use.
> > > >
> > > > But unless the bugs prevent you from proceeding, you could also ignore them.
> > >
> > > The first bug that I reported today (the one about `make`) stop me
> > > from running the test suite. So, can't really test the _string_io and
> > > _bytes_io modules.
> >
> > I tried to reproduce it but it works fine for me -- I'm on Ubuntu
> > dapper (with some Google mods) on a 2.6.18.5-gg4 kernel.
> >
> > --
> > --Guido van Rossum (home page: http://www.python.org/~guido/)
>

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jimjjewett at gmail.com  Thu Jun  7 23:53:32 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Thu, 7 Jun 2007 17:53:32 -0400
Subject: [Python-3000] String comparison
In-Reply-To: 
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: 

On 6/7/07, Rauli Ruohonen wrote:
> ... I will use XML character references to denote code points here.
> Wherever you see such a thing in this e-mail, replace it in your
> mind with the corresponding code point *immediately*. E.g.
> len(r'&#xc5;') == 1, but len(r'\u00c5') == 6.

> In the following code == should be false:

> if "L\u00F6wis" == "Lo\u0308wis":
>     print "Python is Unicode conforming in this respect."

> On 6/7/07, Stephen J. Turnbull wrote:
> > I think the default case should be that text operations produce the
> > expected result in the text domain, even at the expense of array
> > invariants.

(There was confusion -- an explicit escape such as \u probably stands
out enough to signal the non-default case. But even there, it would
also be reasonable to say "use something other than text.")

> > People who need arrays of code points have several ways to
> > get them, and the usual comparison operators will work on them
> > as desired.

> But regexps and other string operations won't, and those are the
> whole point of strings,

(I was thinking that regexps would actually take a buffer interface,
but...)

How would you expect them to work on arrays of code points? What sort
of answer should the following produce?

# matches by codepoints, but doesn't look like it
"Lo&#x308;wis".startswith("Lo")

# if the above did match, then people will assume &#xF6; folds to o
"L&#xF6;wis".startswith("Lo")

# looks like it matches. Matches as text. Does not match as bytes.
"Lo&#x308;wis".startswith("L&#xF6;")

-jJ

From guido at python.org  Thu Jun  7 23:54:01 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 14:54:01 -0700
Subject: [Python-3000] help() broken in the py3k-struni branch
In-Reply-To: 
References: 
Message-ID: 

Thanks for finding the issue! On this one I think subprocess.py should
be changed to allow None (like all the other open() functions). I'll
check it in.

--Guido

On 6/7/07, Alexandre Vassalotti wrote:
> On 6/5/07, Guido van Rossum wrote:
> > Feel free to mail me a patch to fix it.
>
> Since you asked so politely, here's a patch for you. :)
>
> > On 6/5/07, Alexandre Vassalotti wrote:
> > > Hi,
> > >
> > > I found another bug to report. It seems there is a bug in
> > > subprocess.py that makes help() fail.
> > >
> > > -- Alexandre
> > >
> > > Python 3.0x (py3k-struni, Jun 5 2007, 18:41:44)
> > > [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
> > > Type "help", "copyright", "credits" or "license" for more information.
> > > >>> help(open)
> > > Traceback (most recent call last):
> > >   File "<stdin>", line 1, in <module>
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/site.py", line 350,
> > > in __call__
> > >     return pydoc.help(*args, **kwds)
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line
> > > 1687, in __call__
> > >     self.help(request)
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1731, in help
> > >     else: doc(request, 'Help on %s:')
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1514, in doc
> > >     pager(render_doc(thing, title, forceload))
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line 1313, in pager
> > >     pager(text)
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line
> > > 1333, in <lambda>
> > >     return lambda text: pipepager(text, 'less')
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/pydoc.py", line
> > > 1352, in pipepager
> > >     pipe = os.popen(cmd, 'w')
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/os.py", line 717, in popen
> > >     bufsize=buffering)
> > >   File "/home/alex/src/python.org/py3k-struni/Lib/subprocess.py", line
> > > 476, in __init__
> > >     raise TypeError("bufsize must be an integer")
> > > TypeError: bufsize must be an integer
> > > _______________________________________________
> > > Python-3000 mailing list
> > > Python-3000 at python.org
> > > http://mail.python.org/mailman/listinfo/python-3000
> > > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org
> > >
> >
> > --
> > --Guido van Rossum (home page: http://www.python.org/~guido/)
>
> --
> Alexandre Vassalotti
>

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From alexandre at peadrop.com  Thu Jun  7 23:58:57 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Thu, 7 Jun 2007 17:58:57 -0400
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <46687316.8090109@v.loewis.de>
References: <4665EE44.2010306@ronadam.com>
	<4667CCB2.6040405@ronadam.com>
	<46687316.8090109@v.loewis.de>
Message-ID: 

On 6/7/07, "Martin v. Löwis" wrote:
> FWIW, for me the build error goes away when I unset
> LANG, so that the error occurs during build definitely
> *is* a locale issue.

Ah! You're right. I needed to do a `make clean` before, though. My
LANG variable was set to "en_CA.UTF-8".

-- Alexandre

From guido at python.org  Fri Jun  8 00:19:32 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 15:19:32 -0700
Subject: [Python-3000] Renumbering old PEPs now targeted for Py3k?
In-Reply-To: 
References: <4667E526.1000503@gmail.com>
Message-ID: 

On 6/7/07, Georg Brandl wrote:
> Nick Coghlan schrieb:
> > Guido van Rossum wrote:
> >> A few PEPs with numbers < 400 are now targeting Python 3000, e.g. PEP
> >> 367 (new super) and PEP 344 (exception chaining). Are there any
> >> others? I propose that we renumber these to numbers in the 3100+
> >> range. I can see two forms of renaming:
> >>
> >> (a) 344 -> 3344 and 367 -> 3367, i.e. add 3000 to the number
> >>
> >> (b) just use the next available number
> >>
> >> Preferences?
> >>
> >> What other PEPs should be renumbered?
> >>
> >> Should we renumber at all?
> >>
> >
> > +1 for renumbering to the next available 31xx number, with the old
> > number kept as a pointer to the new one.
>
> That would be my vote too.

And so it is done. 344 -> 3134, 367 -> 3135. I've left the old ones in
place with status "Replaced" and a "Numbering Note" in front of the
abstract.
Are there any other candidates for such a renumbering? -- --Guido van Rossum (home page: http://www.python.org/~guido/) From pje at telecommunity.com Fri Jun 8 00:33:12 2007 From: pje at telecommunity.com (Phillip J. Eby) Date: Thu, 07 Jun 2007 18:33:12 -0400 Subject: [Python-3000] [Python-Dev] PEP 367: New Super In-Reply-To: References: <001101c79aa7$eb26c130$0201a8c0@mshome.net> <002d01c79f6d$ce090de0$0201a8c0@mshome.net> <003f01c79fd9$66948ec0$0201a8c0@mshome.net> <009c01c7a04f$7e348460$0201a8c0@mshome.net> <20070531170734.273393A40AA@sparrow.telecommunity.com> Message-ID: <20070607223114.274A73A4060@sparrow.telecommunity.com> At 02:31 PM 6/6/2007 -0700, Guido van Rossum wrote: >I wonder if this may meet the needs for your PEP 3124? In >particularly, earlier on, you wrote: > >>Btw, PEP 3124 needs a way to receive the same class object at more or >>less the same moment, although in the form of a callback rather than >>a cell assignment. Guido suggested I co-ordinate with you to design >>a mechanism for this. > >Is this relevant at all? Well, it tells us more or less where the callback would need to be. :) Although I think that __class__ should really point to the *decorated* class, rather than the undecorated one. I have used decorators before that had to re-create the class object, but can't think of any use cases where I'd have wanted to use super() to refer to the *un*decorated class. Btw, my thought on the keyword and __class__ thing is simply that the plus of having a keyword (or other compiler support) is that we don't have to have the cell variable cluttering up the frames for every single method, whether it uses super or not. Thus, my inclination is either to require explicit use of __class__ (so the compiler would know whether to include the free variable), or to make super a keyword, so that in either case, only the functions that use it must pay for the overhead. (Currently, functions that use any cell variables are invoked more slowly than ones without them; in 2.x at least there's a fast calling path for code objects with CO_NOFREE, and this change would make it useless for everything but top-level functions.) From alexandre at peadrop.com Fri Jun 8 00:38:34 2007 From: alexandre at peadrop.com (Alexandre Vassalotti) Date: Thu, 7 Jun 2007 18:38:34 -0400 Subject: [Python-3000] pdb help is broken in py3k-struni branch In-Reply-To: References: Message-ID: Done. Commited to r55817. On 6/7/07, Guido van Rossum wrote: > Looks great -- can you check it in yourself? > > On 6/7/07, Alexandre Vassalotti wrote: > > I found a way to fix the bug; look at the attached patch. Although, I > > am not sure it was correct way to fix it. The problem was due to str8 > > that is recognized as an instance of `str'. > > > > -- Alexandre > > > > On 6/5/07, Guido van Rossum wrote: > > > On 6/5/07, Alexandre Vassalotti wrote: > > > > On 6/5/07, Guido van Rossum wrote: > > > > > I'd rather see them here than in SF, SF is a pain to use. > > > > > > > > > > But unless the bugs prevent you from proceeding, you could also ignore them. > > > > > > > > The first bug that I reported today (the one about `make`) stop me > > > > from running the test suite. So, can't really test the _string_io and > > > > _bytes_io modules. > > > > > > I tried to reproduce it but it works fine for me -- I'm on Ubuntu > > > dapper (with some Google mods) on a 2.6.18.5-gg4 kernel. 
> > > > > > -- > > > --Guido van Rossum (home page: http://www.python.org/~guido/) > > > > > > > > > > -- > --Guido van Rossum (home page: http://www.python.org/~guido/) > -- Alexandre Vassalotti From guido at python.org Fri Jun 8 00:41:09 2007 From: guido at python.org (Guido van Rossum) Date: Thu, 7 Jun 2007 15:41:09 -0700 Subject: [Python-3000] [Python-Dev] PEP 367: New Super In-Reply-To: <20070607223114.274A73A4060@sparrow.telecommunity.com> References: <001101c79aa7$eb26c130$0201a8c0@mshome.net> <003f01c79fd9$66948ec0$0201a8c0@mshome.net> <009c01c7a04f$7e348460$0201a8c0@mshome.net> <20070531170734.273393A40AA@sparrow.telecommunity.com> <20070607223114.274A73A4060@sparrow.telecommunity.com> Message-ID: On 6/7/07, Phillip J. Eby wrote: > At 02:31 PM 6/6/2007 -0700, Guido van Rossum wrote: > >I wonder if this may meet the needs for your PEP 3124? In > >particularly, earlier on, you wrote: > > > >>Btw, PEP 3124 needs a way to receive the same class object at more or > >>less the same moment, although in the form of a callback rather than > >>a cell assignment. Guido suggested I co-ordinate with you to design > >>a mechanism for this. > > > >Is this relevant at all? > > Well, it tells us more or less where the callback would need to > be. :) Although I think that __class__ should really point to the > *decorated* class, rather than the undecorated one. I have used > decorators before that had to re-create the class object, but can't > think of any use cases where I'd have wanted to use super() to refer > to the *un*decorated class. That's a problem, because I wouldn't know where to save a reference to the cell until after the decorations are done. If you want to suggest a solution, please study the patch first to see the difficulty. > Btw, my thought on the keyword and __class__ thing is simply that the > plus of having a keyword (or other compiler support) is that we don't > have to have the cell variable cluttering up the frames for every > single method, whether it uses super or not. Oh, but the patch *does* have compiler support, and only creates the cell when it is needed, and only passes it into those methods that need it. > Thus, my inclination is either to require explicit use of __class__ > (so the compiler would know whether to include the free variable), or > to make super a keyword, so that in either case, only the functions > that use it must pay for the overhead. My patch uses an intermediate solution: it assumes you need __class__ whenever you use a variable named 'super'. Thus, if you (globally) rename super to supper and use supper but not super, it won't work without arguments (but it will still work if you pass it either __class__ or the actual class object); if you have an unrelated variable named super, things will work but the method will use the slightly slower call path used for cell variables. I believe IronPython uses a similar strategy to support locals() -- AFAIK it generates slower code that provides an accessible stack frame when it thinks you may be using a global named 'locals'. So again, globally renaming locals to something else won't work, but having an unrelated variable named 'locals' will work at a slight performance penalty. > (Currently, functions that use any cell variables are invoked more > slowly than ones without them; in 2.x at least there's a fast calling > path for code objects with CO_NOFREE, and this change would make it > useless for everything but top-level functions.) Not true, explained above. 
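(To make the semantics described above concrete, here is a small
sketch -- illustrative only, assuming the patch behaves exactly as
outlined; 'supper' is a hypothetical alias:)

class A:
    def greet(self):
        return 'Hello from A'

supper = super   # a global rename of the builtin

class B(A):
    def greet(self):
        # The compiler sees the name 'super' in this body, so it
        # passes the enclosing class in via a hidden cell and the
        # no-argument form works:
        return super().greet()

    def greet2(self):
        # No occurrence of the name 'super' here: the no-argument
        # form is unavailable, but explicit arguments still work.
        return supper(B, self).greet()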
-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Fri Jun  8 00:42:08 2007
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Jun 2007 15:42:08 -0700
Subject: [Python-3000] [Python-Dev] PEP 367: New Super
In-Reply-To: 
References: <001101c79aa7$eb26c130$0201a8c0@mshome.net>
	<009c01c7a04f$7e348460$0201a8c0@mshome.net>
	<20070531170734.273393A40AA@sparrow.telecommunity.com>
	<20070607223114.274A73A4060@sparrow.telecommunity.com>
Message-ID: 

BTW, from now on this is PEP 3135.
http://python.org/dev/peps/pep-3135/

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rauli.ruohonen at gmail.com  Fri Jun  8 00:47:07 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Fri, 8 Jun 2007 01:47:07 +0300
Subject: [Python-3000] String comparison
In-Reply-To: 
References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp>
	<20070606084543.6F3D.JCARLSON@uci.edu>
	<-6248387165431892706@unknownmsgid>
	<87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: 

On 6/8/07, Jim Jewett wrote:
> How would you expect them to work on arrays of code points?

Just like they do with Python 2.5 unicode objects, as long as the
"array of code points" is str, not e.g. a numpy array or tuple of
ints, which I don't expect to grow string methods :-)

> What sort of answer should the following produce?

That depends on what Python does when it reads in the source code. I
think it should normalize to NFC (which Python 2.5 does not do).

> # matches by codepoints, but doesn't look like it
> "Lo&#x308;wis".startswith("Lo")
> # if the above did match, then people will assume &#xF6; folds to o
> "L&#xF6;wis".startswith("Lo")
> # looks like it matches. Matches as text. Does not match as bytes.
> "Lo&#x308;wis".startswith("L&#xF6;")

Normalized to NFC:

"L&#xF6;wis".startswith("Lo")
"L&#xF6;wis".startswith("Lo")
"L&#xF6;wis".startswith("L&#xF6;")

After this Python lexes, parses and executes. The first two are false,
the last one true. All of the examples should look the same in your
editor (at least ideally). The following would, OTOH, be true false
false:

"Lo\u0308wis".startswith("Lo")
"L\u00F6wis".startswith("Lo")
"Lo\u0308wis".startswith("L\u00F6")

As here the source code is pure ASCII, it's WYSIWYG everywhere.
Python 2.5's output with each:

>>> u"Lo&#x308;wis".startswith(u"Lo")
True
>>> u"L&#xF6;wis".startswith(u"Lo")
False
>>> u"Lo&#x308;wis".startswith(u"L&#xF6;")
False
>>> u"Lo\u0308wis".startswith(u"Lo")
True
>>> u"L\u00F6wis".startswith(u"Lo")
False
>>> u"Lo\u0308wis".startswith(u"L\u00F6")
False

From rrr at ronadam.com  Fri Jun  8 01:20:23 2007
From: rrr at ronadam.com (Ron Adam)
Date: Thu, 07 Jun 2007 18:20:23 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <46687316.8090109@v.loewis.de>
References: <4665EE44.2010306@ronadam.com>
	<4667CCB2.6040405@ronadam.com>
	<46687316.8090109@v.loewis.de>
Message-ID: <466892B7.4050108@ronadam.com>

Martin v. Löwis wrote:

> FWIW, for me the build error goes away when I unset
> LANG, so that the error occurs during build definitely
> *is* a locale issue.

Yes, and to pin it down a bit further...

This avoids the problem by setting the language to the default "C" which is
a unicode string and has a .split method that accepts 0 args.

Also LANG is 4th on the list of possible language setting sources, so if
one of the other 3 environment variables is set, setting or unsetting LANG
will have no effect.
--- From gettext.py --- # Locate a .mo file using the gettext strategy def find(domain, localedir=None, languages=None, all=0): # Get some reasonable defaults for arguments that were not supplied if localedir is None: localedir = _default_localedir if languages is None: languages = [] for envar in ('LANGUAGE', 'LC_ALL', 'LC_MESSAGES', 'LANG'): # ^^^ first one is accepted. val = os.environ.get(envar) #<<< should return unicode? if val: languages = val.split(':') break if 'C' not in languages: languages.append('C') # <<< unicode 'C' # now normalize and expand the languages nelangs = [] for lang in languages: for nelang in _expand_lang(lang): #<<< error in this call # when it's normalized. if nelang not in nelangs: nelangs.append(nelang) ------ Guido's patch avoids this, but that fix was also needed as unicode translate works differently than str.translate. The os.environ.get() method probably should return a unicode string. (?) Ron From guido at python.org Fri Jun 8 01:54:40 2007 From: guido at python.org (Guido van Rossum) Date: Thu, 7 Jun 2007 16:54:40 -0700 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: <466892B7.4050108@ronadam.com> References: <4665EE44.2010306@ronadam.com> <4667CCB2.6040405@ronadam.com> <46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com> Message-ID: On 6/7/07, Ron Adam wrote: > Martin v. L?wis wrote: > > > FWIW, for me the build error goes away when I unset > > LANG, so that the error occurs during build definitely > > *is* a locale issue. > > Yes, and to pin it down a bit further... > > This avoids the problem by setting the language to the default "C" which is > a unicode string and has a .split method that accepts 0 args. > > Also LANG is 4th on the list of possible language setting sources, so if > one of the other 3 environment variables is set, setting or unsetting LANG > will have no effect. > > > --- From gettext.py --- > > # Locate a .mo file using the gettext strategy > def find(domain, localedir=None, languages=None, all=0): > # Get some reasonable defaults for arguments that were not supplied > if localedir is None: > localedir = _default_localedir > if languages is None: > languages = [] > for envar in ('LANGUAGE', 'LC_ALL', 'LC_MESSAGES', 'LANG'): > # ^^^ first one is accepted. > val = os.environ.get(envar) #<<< should return unicode? > if val: > languages = val.split(':') > break > if 'C' not in languages: > languages.append('C') # <<< unicode 'C' > # now normalize and expand the languages > nelangs = [] > for lang in languages: > for nelang in _expand_lang(lang): #<<< error in this call > # when it's normalized. > if nelang not in nelangs: > nelangs.append(nelang) > > ------ > > Guido's patch avoids this, but that fix was also needed as unicode > translate works differently than str.translate. > > The os.environ.get() method probably should return a unicode string. (?) Indeed -- care to contribute a patch? -- --Guido van Rossum (home page: http://www.python.org/~guido/) From rauli.ruohonen at gmail.com Fri Jun 8 02:26:41 2007 From: rauli.ruohonen at gmail.com (Rauli Ruohonen) Date: Fri, 8 Jun 2007 03:26:41 +0300 Subject: [Python-3000] String comparison In-Reply-To: <4666FB16.2070209@v.loewis.de> References: <4666FB16.2070209@v.loewis.de> Message-ID: On 6/6/07, "Martin v. L?wis" wrote: > > FWIW, I don't buy that normalization is expensive, as most strings are > > in NFC form anyway, and there are fast checks for that (see UAX#15, > > "Detecting Normalization Forms"). 
Python does not currently have > > a fast path for this, but if it's added, then normalizing everything > > to NFC should be fast. > > That would be useful to have, anyway. Would you like to contribute it? I implemented it for all normalizations in the most straightforward way I could think of, which was adding a field to _PyUnicode_DatabaseRecord, generating data for it in makeunicodedata.py from DerivedNormalizationProps.txt of UCD 4.1, and writing a function is_normalized which uses it. The function is called from unicodedata_normalized. I made the modifications against py3k-struni. Does this sound reasonable? I haven't made any contributions to Python before, but I heard attempting such hazardous activity involves lots of hard knocks :-) Where should I send the patch? I saw some patches here in other threads, but then again http://www.python.org/dev/patches/ tells to use SourceForge. From rrr at ronadam.com Fri Jun 8 04:31:44 2007 From: rrr at ronadam.com (Ron Adam) Date: Thu, 07 Jun 2007 21:31:44 -0500 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: References: <4665EE44.2010306@ronadam.com> <4667CCB2.6040405@ronadam.com> <46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com> Message-ID: <4668BF90.5080302@ronadam.com> Guido van Rossum wrote: >> The os.environ.get() method probably should return a unicode string. (?) > > Indeed -- care to contribute a patch? I thought you might ask that. :-) It looks like os.py module imports a 'envirion' dictionary from various sources depending on the platform. posix, nt, os2 <---> posixmodule.c mac, ce, riscos <---> ?, ?, ? Then os.py uses it to initialize the os._Environ user dict. I can contribute a patch for os.py to covert the items at that point, but if someone imports the platform modules directly they will get surprises. Patching posixmodule.c and the other platform files where ever they live may still be a bit beyond me at this time. I'm still learning my way around pythons C code. :-) Cheers, Ron From turnbull at sk.tsukuba.ac.jp Fri Jun 8 05:31:44 2007 From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull) Date: Fri, 08 Jun 2007 12:31:44 +0900 Subject: [Python-3000] String comparison In-Reply-To: References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid> <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87fy53glzz.fsf@uwakimon.sk.tsukuba.ac.jp> Rauli Ruohonen writes: Stephen wrote: > > I think the default case should be that text operations produce the > > expected result in the text domain, even at the expense of array > > invariants. > > If you really want that, then you need a type for sequences of graphemes. No. "Text" != "sequence of graphemes". For example: > E.g. 'c\u0308' is already normalized according to all four normalization > rules, but it's still one grapheme ('c' with diaeresis, c~) Not on my terminal, it's not; it's two. And what about audible representation? Python cannot compute graphemes, the Python user can only observe them after some other process displays them. So Python's definition of "text" cannot be grapheme-based. > > People who need arrays of code points have several ways to get them, > > and the usual comparison operators will work on them as desired. > > But regexps and other string operations won't, I do not have any objection to treating Unicode strings as sequences of code points, and allowing them to be unnormalized -- as an option. 
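(For concreteness, such an explicit text-domain comparison can already
be spelled with today's library -- a sketch, not a proposed API:)

import unicodedata

def text_equal(a, b):
    # Canonically equivalent strings compare equal as *text*,
    # whatever their underlying code point sequences are.
    return (unicodedata.normalize('NFC', a) ==
            unicodedata.normalize('NFC', b))

text_equal(u'L\u00F6wis', u'Lo\u0308wis')   # True: the same text
u'L\u00F6wis' == u'Lo\u0308wis'             # False: different arrays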
The *default* should be to treat them as text, or there should be a simple way to make it default ("import trueunicode"). I do not want to have to check every string for normalization by hand. I don't object to the overhead---the overhead is already pretty high for Unicode conformance. It's that I know I'll make mistakes, or use libraries that do undocumented I/O or non-Unicode-conformant transformations, or whatever. The right place to do such checking is in the Unicode datatype, not in application code. > > While people who need operations on *text* still have no > > straightforward way to get them, and no promise of one as I read your > > remarks. > > Then you missed some of his earlier remarks: > > Guido: > : I'm all for adding a way to do normalized string comparisons to the > : library. But I'm not about to change the == operator to apply > : normalization first. Funny, that's precisely the remark I was thinking of. If I write a Unicode string, I want the == operator to "just work". As quoted, Guido says it will not. Note that we *already* have a way to do normalized string comparisons via unicodedata, and we can even use "==" for it. So Guido would have every right to consider his promise already fulfilled. The problem is not that a code-point oriented operator won't work if you know you have two TrueText objects; you only have to implement them correctly, and code-point comparison Just Works. The problem is that it's going to be very hard to be sure that you've got TrueText as opposed to arrays of shorts if the *language* does not provide ways to enforce the distinction. From martin at v.loewis.de Fri Jun 8 06:04:05 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 08 Jun 2007 06:04:05 +0200 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: References: <4665EE44.2010306@ronadam.com> <4667CCB2.6040405@ronadam.com> <46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com> Message-ID: <4668D535.7020103@v.loewis.de> >> The os.environ.get() method probably should return a unicode string. (?) > > Indeed -- care to contribute a patch? Ideally, such a patch would make use of the Win32 Unicode API for environment variables on Windows. People had already been complaining that they can't have "funny characters" in the value of an environment variable, even though the UI allows them to set the variable just fine. Regards, Martin From guido at python.org Fri Jun 8 06:06:49 2007 From: guido at python.org (Guido van Rossum) Date: Thu, 7 Jun 2007 21:06:49 -0700 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: <4668D535.7020103@v.loewis.de> References: <4665EE44.2010306@ronadam.com> <4667CCB2.6040405@ronadam.com> <46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com> <4668D535.7020103@v.loewis.de> Message-ID: On 6/7/07, "Martin v. L?wis" wrote: > >> The os.environ.get() method probably should return a unicode string. (?) > > > > Indeed -- care to contribute a patch? > > Ideally, such a patch would make use of the Win32 Unicode API for > environment variables on Windows. People had already been complaining > that they can't have "funny characters" in the value of an environment > variable, even though the UI allows them to set the variable just fine. Yeah, but the Windows build of py3k is currently badly broken (e.g. the _fileio.c extension probably doesn't work at all) -- and I don't have access to a Windows box to work on it. I'm afraid 3.0a1 will be released without Windows support. 
Of course I'm counting on others to fix that before 3.0 final is released. I don't mind for now that the posix.environ variable contains 8-bit strings -- people shouldn't be importing that anyway. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From martin at v.loewis.de Fri Jun 8 06:15:51 2007 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Fri, 08 Jun 2007 06:15:51 +0200 Subject: [Python-3000] String comparison In-Reply-To: References: <4666FB16.2070209@v.loewis.de> Message-ID: <4668D7F7.7000106@v.loewis.de> > I implemented it for all normalizations in the most straightforward way I > could think of, which was adding a field to _PyUnicode_DatabaseRecord, > generating data for it in makeunicodedata.py from > DerivedNormalizationProps.txt of UCD 4.1, and writing a function > is_normalized which uses it. The function is called from > unicodedata_normalized. I made the modifications against py3k-struni. > Does this sound reasonable? In principle, yes. What's the cost of the additional field in terms of a size increase? If you just need another bit, could that fit into _PyUnicode_TypeRecord.flags instead? > I haven't made any contributions to Python before, but I heard attempting > such hazardous activity involves lots of hard knocks :-) Where should I > send the patch? I saw some patches here in other threads, but then again > http://www.python.org/dev/patches/ tells to use SourceForge. That would be best. You only need to include the patch to the generator, not the generated data. I'd like to see it in 2.6, so ideally, you would test it for the trunk (not that the branch should matter much)). Don't forget to include test suite and documentation changes. Regards, Martin From stephen at xemacs.org Fri Jun 8 10:21:36 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 08 Jun 2007 17:21:36 +0900 Subject: [Python-3000] String comparison In-Reply-To: References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid> <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> Guido van Rossum writes: > If you want to have an abstraction that guarantees you'll never see > an unnormalized text string you should design a library for doing so. OK. > (*) It looks like such a library will not have a way to talk about > "\u0308" at all, since it is considered unnormalized. >From the Unicode Standard, v4.0, p. 43: "In the Unicode Standard, all sequences of character codes are permitted." Since normalization only applies to characters with decompositions, "\u0308" is indeed valid Unicode, a one-character sequence in NFC. AFAIK, the only strings the Unicode standard absolutely prohibits emitting are those containing code points guaranteed not to be characters by the standard. And normalization is simply a internal technique that allows text operations to be implemented code-point- wise without fear that emitting them would result in illegal sequences or other externally visible incompatibilities with the standard. So there's nothing "wrong by definition" about defining strings as sequences of code points, and string operations in code-point-wise fashion. It just makes that library for Unicode more expensive to design and operate, and will require auditing and reimplementation of common libraries (including the standard library) by every program that requires strict Unicode conformance. 
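(A short illustration of the code-point-sequence view under
discussion, as 2.5-era CPython behaves -- the bytes shown assume the
standard UTF-8 codec's willingness to encode lone surrogates:)

>>> s = u'\ud800'        # an isolated surrogate: one code point,
>>> len(s)               # though not an abstract character
1
>>> s.encode('utf-8')    # accepted, though strict UTF-8 rejects it
'\xed\xa0\x80'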
From rauli.ruohonen at gmail.com Fri Jun 8 10:21:01 2007 From: rauli.ruohonen at gmail.com (Rauli Ruohonen) Date: Fri, 8 Jun 2007 11:21:01 +0300 Subject: [Python-3000] String comparison In-Reply-To: <4668D7F7.7000106@v.loewis.de> References: <4666FB16.2070209@v.loewis.de> <4668D7F7.7000106@v.loewis.de> Message-ID: On 6/8/07, "Martin v. L?wis" wrote: > In principle, yes. What's the cost of the additional field in terms of > a size increase? If you just need another bit, could that fit into > _PyUnicode_TypeRecord.flags instead? The additional field is 8 bits, two bits for each normalization (a Yes/Maybe/No value). In Unicode 4.1 only 5 different combinations are used, but I don't know if that's true of later versions. As _PyUnicode_Database_Records stores only unique records, this also results in an increase of the number of records, from 219 to 304. Each record looks like this: typedef struct { const unsigned char category; const unsigned char combining; const unsigned char bidirectional; const unsigned char mirrored; const unsigned char east_asian_width; const unsigned char normalization_quick_check; /* my addition */ } _PyUnicode_DatabaseRecord; I added the field to this record because the function needs to get the record anyway for each character (it needs the field "combining", too). The new field combines values for the derived properties (trinary) NFD_Quick_Check, NFKD_Quick_Check, NFC_Quick_Check and NFKC_Quick_Check. Here's the main loop (works for all four normalizations, only the value of quickcheck_shift changes): while (i < end) { const _PyUnicode_DatabaseRecord *record = _getrecord_ex(*i++); unsigned char combining = record->combining; unsigned char quickcheck = record->normalization_quick_check; if ((quickcheck>>quickcheck_shift) & 3) return 0; /* this character might need normalization */ if (combining && prev_combining > combining) return 0; /* non-canonical order, not normalized */ prev_combining = combining; } > That would be best. You only need to include the patch to the generator, > not the generated data. I'd like to see it in 2.6, so ideally, you would > test it for the trunk (not that the branch should matter much)). This is easy to do. The differences in these files between the versions are very small, and I actually initially wrote it for 2.5, as py3k-struni's normalization test fails at the moment. > Don't forget to include test suite and documentation changes. It doesn't affect behavior or the API much(*), only performance. Current test_normalize.py uses a test suite it fetches from UCD, so it should be adequate. (*) You *can* test for its presence by e.g. checking whether id(unicodedata.normalize('NFC', u'a')) is id(u'a') or not. The documentation does not specify either way. I'd say it's an implementation detail, and both tests and documentation should ignore it. From rauli.ruohonen at gmail.com Fri Jun 8 15:38:13 2007 From: rauli.ruohonen at gmail.com (Rauli Ruohonen) Date: Fri, 8 Jun 2007 16:38:13 +0300 Subject: [Python-3000] String comparison In-Reply-To: <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid> <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp> <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 6/8/07, Stephen J. Turnbull wrote: > AFAIK, the only strings the Unicode standard absolutely prohibits > emitting are those containing code points guaranteed not to be > characters by the standard. 
The ones it absolutely prohibits in interchange are surrogates. They are also illegal in both UTF-16 and UTF-8. The pragmatic reason is that if you do encode them despite their illegality (like Python codecs do), strings won't always survive a round-trip to such pseudo-UTF-16 because multiple code point sequences necessarily map to the same byte sequence. For some reason Python's UTF-8 encoder introduces this ambiguity too, even though there's no need to do so with pseudo-UTF-8. In Python UCS-2 builds even string processing in the core works inconsistently with surrogates. Sometimes pseudo-UCS-2 is assumed, sometimes pseudo-UTF-16, and these are incompatible because pseudo-UTF-16 can't always represent surrogates, but pseudo-UCS-2 can. OTOH pseudo-UCS-2 can't represent code points outside the BMP, but pseudo-UTF-16 can. There's no way to always do the right thing as long as these two are mixed, but somebody somewhere probably depends on this behavior. Other than surrogates, there are two classes of characters with "restricted interchange". One is reserved characters, which need to be preserved if found in text for compatibility with future versions of the standard. Another is noncharacters, which are "reserved for internal use, such as for sentinel values". These should obviously be allowed, as the user may want to use them internally in their Python program. > So there's nothing "wrong by definition" about defining strings as > sequences of code points, and string operations in code-point-wise > fashion. It just makes that library for Unicode more expensive to > design and operate, and will require auditing and reimplementation of > common libraries (including the standard library) by every program > that requires strict Unicode conformance. It's not perfect, but that's the state of the art. AFAIK this (or worse) is what the other implementations do. Even the Unicode standard explains that strings generally work that way: 2.7. Unicode Strings A Unicode string datatype is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units, a Unicode 16-bit string is an ordered sequence of 16-bit code units, and a Unicode 32-bit string is an ordered sequence of 32-bit code units. Depending on the programming environment, a Unicode string may or may not also be required to be in the corresponding Unicode encoding form. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences. From jimjjewett at gmail.com Fri Jun 8 16:27:40 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Fri, 8 Jun 2007 10:27:40 -0400 Subject: [Python-3000] String comparison In-Reply-To: References: <4666FB16.2070209@v.loewis.de> <4668D7F7.7000106@v.loewis.de> Message-ID: On 6/8/07, Rauli Ruohonen wrote: > The additional field is 8 bits, two bits for each normalization (a > Yes/Maybe/No value). In Unicode 4.1 only 5 different combinations are > used, but I don't know if that's true of later versions. There are no "Maybe" values for the Decomposed forms. It is impossible to be Compatibility without also being Canonical. (The definition of Compatibility includes folding as much as possible under either form.) So there are really 3 possibilities (both, canonical only, neither) for the decomposed, and (at most) 6 for the composed forms. (I'm not sure all 6 of those can occur in practice.) But there are other normalization forms that may be added later. 
The ones I found reference to are basically orthogonal (an existing
normalization may or may not meet them). See the proposed changes at
http://www.unicode.org/reports/tr15/tr15-28.html

-jJ

From amcnabb at mcnabbs.org  Fri Jun  8 19:00:49 2007
From: amcnabb at mcnabbs.org (Andrew McNabb)
Date: Fri, 8 Jun 2007 11:00:49 -0600
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: 
References: <4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <20070608170049.GB20665@mcnabbs.org>

On Thu, Jun 07, 2007 at 06:50:57PM -0400, Jim Jewett wrote:
> On 6/7/07, Andrew McNabb wrote:
> > On Wed, Jun 06, 2007 at 07:06:05PM -0400, Jim Jewett wrote:
> > > (There were mixed opinions on Technical symbols, and no one has spoken
> > > up yet about the half-dozen Croatian digraphs corresponding to Serbian
> > > Cyrillic.)
>
> If the digraphs were converted to compatibility characters, would
> that be good, bad, or no big deal?
>
> I'm not entirely certain which letters Stephen was talking about, but
> believe they are the (upper, lower, and titlecase) digraphs for Ǉ, Ǌ,
> Ǳ, Ǆ (DZ caron)
>
> Would it be acceptable if (only in identifier names, not normal text)
> python treated those the same as the two-character sequences LJ, NJ,
> DZ, and DŽ?

I speak Serbian as a second language (and lived in Serbia for a few
years), and my opinion is that a Serbian/Croatian speaker would expect
the digraphs to be treated the same as the two-character sequences.

The issue doesn't seem to come up too often, but people using
typewriters have been typing the digraphs as separate characters for
years. The place I noticed the issue most frequently was if there was
a vertical sign, such as a storefront. A sign saying "bookstore" would
look like this:

K ǌ i ž a r a

or:

K nj i ž a r a

The following would be incorrect:

K n j i ž a r a

But even many native speakers make this mistake. Other than that, ǌ is
practically indistinguishable from nj, and the other Croatian digraphs
have the same behavior.

I hope this helps in the discussion.

-- 
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 186 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070608/867eb658/attachment.pgp

From martin at v.loewis.de  Fri Jun  8 19:36:30 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Fri, 08 Jun 2007 19:36:30 +0200
Subject: [Python-3000] String comparison
In-Reply-To: 
References: <4666FB16.2070209@v.loewis.de>
	<4668D7F7.7000106@v.loewis.de>
Message-ID: <4669939E.2020700@v.loewis.de>

> The additional field is 8 bits, two bits for each normalization (a
> Yes/Maybe/No value). In Unicode 4.1 only 5 different combinations are
> used, but I don't know if that's true of later versions. As
> _PyUnicode_Database_Records stores only unique records, this also results
> in an increase of the number of records, from 219 to 304. Each record
> looks like this:

If I count correctly, this gives roughly 900 additional bytes. That's
fine.

> It doesn't affect behavior or the API much(*), only performance. Current
> test_normalize.py uses a test suite it fetches from UCD, so it
> should be adequate.

I assumed you want to expose it to Python also, as an is_normalized
function.
I guess not having such a function is fine if applications can do normalize(form, s) == s and have that be efficient as long as the outcome is true (i.e. if it is more expensive only if it's not normalized). Regards, Martin From martin at v.loewis.de Fri Jun 8 22:31:28 2007 From: martin at v.loewis.de (=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 08 Jun 2007 22:31:28 +0200 Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures? In-Reply-To: <20070608170049.GB20665@mcnabbs.org> References: <4664E238.9020700@v.loewis.de> <87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp> <20070608170049.GB20665@mcnabbs.org> Message-ID: <4669BCA0.6000902@v.loewis.de> > I hope this helps in the discussion. Indeed it does. When I find the time, I'll propose a change to the PEP to do NFKC. Regards, Martin From martin at v.loewis.de Fri Jun 8 22:41:31 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 08 Jun 2007 22:41:31 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <465615C9.4080505@v.loewis.de> <320102.38046.qm@web33515.mail.mud.yahoo.com> <19dd68ba0705241805y52ba93fdt284a2c696b004989@mail.gmail.com> Message-ID: <4669BEFB.1000301@v.loewis.de> > This keeps getting characterized as only a security argument, but > it's much deeper; it's a basic code comprehension issue. Despite you repeating this over and over, I still honestly, sincerely do not understand the concern. You might be technically correct, but I feel that the cases where these issues could really arise in practice are so obscure that I can safely ignore them. More specifically: > Python will lose the ability to make a reliable round trip > between a computer file and any human-accessible medium > such as a visual display or a printed page. Practically, this is just not true. *Of course* you will be able to type in a piece of Python code written on a paper, provided you understand the natural language that the identifiers use. That the glyphs might be ambiguous is not an issue at all. What could really stop you from typing in the code is that you don't know how to type the characters, however I don't see that as a problem, either - I rarely need to type in code from a piece of paper, anyway, and only ever do so when I understand what the code does (so I likely don't type it in *literally*). > The Python language will become too large for any single > person to fully know Again, practically, this is not true. We both know what PEP 3131 says about identifiers: they start with a letter, followed by letters and digits. I fully well know the entire language. The fact that I cannot enumerate all letters doesn't bother me to the slightest. > Python programs that reuse other Python modules may come > to contain a mix of character sets such that no one can > fully read them or properly display them. We will see. I find that unlikely to happen (although not entirely impossible). > Unicode is young and unfinished. I commented on this earlier already: this is non-sense. Unicode is as old as Python (so perhaps Python is also young and unfinished). Regards, Martin From guido at python.org Sat Jun 9 00:27:51 2007 From: guido at python.org (Guido van Rossum) Date: Fri, 8 Jun 2007 15:27:51 -0700 Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers? Message-ID: PEP 3127 (Integer Literal Support and Syntax) introduces new notations for octal and binary integers. This isn't implemented yet. Are there any takers? It shouldn't be particularly complicated. 
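(For reference, the notation the PEP specifies looks like this; the
examples are illustrative rather than taken from the PEP text:)

0o755             # new octal literal == 493; the old 0755 form goes away
0b1010            # new binary literal == 10
int('0o755', 0)   # base 0 follows the literal syntax -> 493
int('010', 0)     # old-style octal is rejected -> ValueError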
Separately, the 2to3 tool needs a fixer for this (and it should also accept the new notations in its input). -- --Guido van Rossum (home page: http://www.python.org/~guido/) From collinw at gmail.com Sat Jun 9 00:36:30 2007 From: collinw at gmail.com (Collin Winter) Date: Fri, 8 Jun 2007 15:36:30 -0700 Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers? In-Reply-To: References: Message-ID: <43aa6ff70706081536t73376a18t35e7c47a67bbc54a@mail.gmail.com> On 6/8/07, Guido van Rossum wrote: > Separately, the 2to3 tool needs a fixer for this (and it should also > accept the new notations in its input). I wrote a num_literals fixer when the debate over this feature was still in progress. It's checked in, but I need to sync it with the latest version of the PEP. I'll take care of that. Collin Winter From collinw at gmail.com Sat Jun 9 00:37:55 2007 From: collinw at gmail.com (Collin Winter) Date: Fri, 8 Jun 2007 15:37:55 -0700 Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers? In-Reply-To: <43aa6ff70706081536t73376a18t35e7c47a67bbc54a@mail.gmail.com> References: <43aa6ff70706081536t73376a18t35e7c47a67bbc54a@mail.gmail.com> Message-ID: <43aa6ff70706081537gc0b218ap379d3383fa139051@mail.gmail.com> On 6/8/07, Collin Winter wrote: > On 6/8/07, Guido van Rossum wrote: > > Separately, the 2to3 tool needs a fixer for this (and it should also > > accept the new notations in its input). > > I wrote a num_literals fixer when the debate over this feature was > still in progress. It's checked in, but I need to sync it with the > latest version of the PEP. I'll take care of that. Oops, Georg Brandl was actually the fixer's original author. Sorry, Georg! From stephen at xemacs.org Sat Jun 9 06:33:07 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 09 Jun 2007 13:33:07 +0900 Subject: [Python-3000] String comparison In-Reply-To: References: <87r6opw9dc.fsf@uwakimon.sk.tsukuba.ac.jp> <20070606084543.6F3D.JCARLSON@uci.edu> <-6248387165431892706@unknownmsgid> <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp> <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> Rauli Ruohonen writes: > The ones it absolutely prohibits in interchange are surrogates. Excuse me? Surrogates are code points with a specific interpretation if it is "purported that the stream is in UTF-16". Otherwise, Unicode 4.0 explicitly says that there is nothing illegal about an isolated surrogate (p.75, where an example is given of how such a surrogate might occur). That surrogate may not be interpreted as an abstract character (C4, p.58), but it is not a non-character (Table 2-2, p.25). I agree that it's unfortunate that some parts of Python treat Unicode strings objects purely as sequences of Unicode code points, and others purport (apparently without checking) that such strings are in UTF-16. Unicode conformance is not part of the Python language. That's life. But let's try to avoid creating difficulties that don't exist in the standard. > > So there's nothing "wrong by definition" about defining strings as > > sequences of code points, and string operations in code-point-wise > > fashion. > It's not perfect, but that's the state of the art. AFAIK this (or worse) > is what the other implementations do. My point was precisely that I don't object to this implementation. I want Unicode-ly-correct behavior to be a goal of the language, the community disagrees, and Guido disagrees. That's that. 
Thank you for starting work on implementation; let's concentrate on
that.

From stephen at xemacs.org  Sat Jun  9 09:45:02 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 9 Jun 2007 16:45:02 +0900
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: 
References: <4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <18026.23166.928863.613890@uwakimon.sk.tsukuba.ac.jp>

Jim Jewett writes:

> but I think dealing with K characters is now a "least of evils"
> decision, instead of "we need them for something."

Agreed.

> On another note, I have no idea how Martin's name (in the Cc line)
> ended up as: [scrambled stuff]

That's almost surely me. The composer part of my MUA of choice handles
Japanese fine, but doesn't like general Unicode much. So I've switched
to a different composer, but the two MUAs differ on the protocol for
passing reply information from the reader to the composer. RFC 2047
headers are one thing that often gets fumbled.

From g.brandl at gmx.net  Sat Jun  9 09:39:19 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Sat, 09 Jun 2007 09:39:19 +0200
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: 
References: 
Message-ID: 

Guido van Rossum schrieb:
> PEP 3127 (Integer Literal Support and Syntax) introduces new notations
> for octal and binary integers. This isn't implemented yet. Are there
> any takers? It shouldn't be particularly complicated.

I have a patch lying around here which might be quite complete...

One thing that's unclear to me though: didn't we decide to drop the
uppercase string modifiers/number suffixes/prefixes?

Also, I'm not sure what int() should do with "010".

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no
less. Four shall be the number of spaces thou shalt indent, and the
number of thy indenting shall be four. Eight shalt thou not indent,
nor either indent thou two, excepting that thou then proceed to four.
Tabs are right out.

From martin at v.loewis.de  Sat Jun  9 09:55:42 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Sat, 09 Jun 2007 09:55:42 +0200
Subject: [Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
In-Reply-To: 
References: <4664E238.9020700@v.loewis.de>
	<87fy56yd14.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <466A5CFE.5040906@v.loewis.de>

> On another note, I have no idea how Martin's name (in the Cc line) ended
> up as:
>
> """
> L$(D+S(Bwis"
> """
>
> If I knew, it *might* have a bearing on what sorts of
> canonicalizations should be performed, and what sorts of warnings the
> parser ought to emit for likely corrupted text.

That results from a faulty iso-2022-jp-1 conversion. ESC $ ( D switches
to JIS X 0212-1990 (which apparently includes ? at code position
0x25B3); ESC ( B switches back to ASCII. I don't think this has
anything to do with normalization.

Regards,
Martin

From ncoghlan at gmail.com  Sat Jun  9 13:19:00 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 09 Jun 2007 21:19:00 +1000
Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers?
In-Reply-To: 
References: 
Message-ID: <466A8CA4.5030906@gmail.com>

Georg Brandl wrote:
> Also, I'm not sure what int() should do with "010".

The only change would be for int(x, 0), and that should raise a
ValueError, just like any other invalid string.

Cheers,
Nick.
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From guido at python.org Sat Jun 9 17:39:11 2007 From: guido at python.org (Guido van Rossum) Date: Sat, 9 Jun 2007 08:39:11 -0700 Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers? In-Reply-To: References: Message-ID: On 6/9/07, Georg Brandl wrote: > Guido van Rossum schrieb: > > PEP 3127 (Integer Literal Support and Syntax) introduces new notations > > for octal and binary integers. This isn't implemented yet. Are there > > any takers? It shouldn't be particularly complicated. > > I have a patch lying around here which might be quite complete... Cool! > One thing that's unclear to me though: didn't we decide to drop the uppercase > string modfiers/number suffixes/prefixes? In the end (doesn't the PEP confirms this?) we decided to keep them and make it a style rule instead. Some folks have generated data sets using uppercase. > Also, I'm not sure what int() should do with "010". int("010") should return (decimal) 10. int("010", 0) should raise ValueError. I thought that was also in the PEP. Anyway, with these tweaks, feel free to just check it in (well, if you also fix the standard library to use the new notation). --Guido -- --Guido van Rossum (home page: http://www.python.org/~guido/) From rauli.ruohonen at gmail.com Sat Jun 9 23:01:57 2007 From: rauli.ruohonen at gmail.com (Rauli Ruohonen) Date: Sun, 10 Jun 2007 00:01:57 +0300 Subject: [Python-3000] String comparison In-Reply-To: <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> References: <-6248387165431892706@unknownmsgid> <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp> <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 6/9/07, Stephen J. Turnbull wrote: > Rauli Ruohonen writes: > > The ones it absolutely prohibits in interchange are surrogates. > > Excuse me? Surrogates are code points with a specific interpretation > if it is "purported that the stream is in UTF-16". Otherwise, Unicode > 4.0 explicitly says that there is nothing illegal about an isolated > surrogate (p.75, where an example is given of how such a surrogate > might occur). I meant interchange instead of strings. Anything is allowed in strings. Chapter 2 (not normative, but clear) explains on page 26: Restricted interchange. [...] - Surrogate code points cannot be conformantly interchanged using Unicode encoding forms. [...] - Noncharacter code points are reserved for internal use, such as for sentinel values. They should never be interchanged. [...] > My point was precisely that I don't object to this implementation. I > want Unicode-ly-correct behavior to be a goal of the language, the > community disagrees, and Guido disagrees. That's that. My understanding is that it is a goal, but practicality beats purity. I think the only disagreement is on what's practical. From tomerfiliba at gmail.com Sun Jun 10 01:32:05 2007 From: tomerfiliba at gmail.com (tomer filiba) Date: Sun, 10 Jun 2007 01:32:05 +0200 Subject: [Python-3000] rethinking pep 3115 Message-ID: <1d85506f0706091632q7cedac6dj1ce3cb6d5954ec8@mail.gmail.com> pep 3115 (new metaclasses) seems overly complicated imho. it fails my understanding of "keeping it simple", among other heuristics. 
(1) the trivial fix-up would be to extend the type constructor to take 4 arguments: (name, bases, attrs, order), where 'attrs' is a plain old dict, and 'order' is a list, into which the names are appended in the order they were defined in the body of the class. this way, no new types are introduced and 99% of the use cases are covered. things like "forward referencing in the class namespace" are evil. and besides, it's not possible to do with functions and modules, so why should classes be allowed such a mischief? (2) the second-best solution i could think of is just passing the dict as a keyword argument to the class, like so: class Spam(metaclass = Bacon, dict = {}): ... so you could explicitly state you need a special dict. following the cosmetic change of removing the magical __metaclass__ attribute from the class body into the class header, it makes so sense to replace it by another magical method, __prepare__. the straight-forward-and-simple way would be to make it a keyword argument, just like 'metaclass'. (3) personally, i refrain from metaclasses. according to my experience, they just cause trouble, while the benefits of using them are marginal. the problem is noticeable especially when trying to understand and debug third-party code. metaclasses + bugs = blackmagic. moreover, they introduce inheritance issues. the class hierarchy becomes rigid and difficult to evolve as the need arises, which contradicts my perception of agile languages. i like to view programming as an iterative task which approaches the final objective after several loops. rigidness makes each loop longer, which is why i prefer dynamic languages to compiled ones. on the other hand, i do understand the need for metaclasses, even if for the sake of symmetry (as types are objects). but the solution proposed by pep 3115, of making metaclasses even more complicated and magical, seems all wrong to me. i understand it's already been accepted, but i'm hoping there's still time to reconsider this before 3.0 becomes final. -tomer -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20070610/a2c7369a/attachment.htm From stephen at xemacs.org Sun Jun 10 10:03:19 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sun, 10 Jun 2007 17:03:19 +0900 Subject: [Python-3000] String comparison In-Reply-To: References: <-6248387165431892706@unknownmsgid> <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp> <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> Rauli Ruohonen writes: > On 6/9/07, Stephen J. Turnbull wrote: > > Rauli Ruohonen writes: > > > The ones it absolutely prohibits in interchange are surrogates. > > > > Excuse me? Surrogates are code points with a specific interpretation > > if it is "purported that the stream is in UTF-16". Otherwise, Unicode > > 4.0 explicitly says that there is nothing illegal about an isolated > > surrogate (p.75, where an example is given of how such a surrogate > > might occur). > > I meant interchange instead of strings. Anything is allowed in > strings. I think you misunderstand. Anything in Unicode that is normative is about interchange. Strings are also a means of interchange---between modules (separate Unicode processes) in a program (single OS process). Python language and library implementation is going to be primarily concerned with interchange in the intermodule sense. 
Your complaint about Python mixing "pseudo-UTF-16" with "pseudo-UCS-2" is precisely a statement that various modules in Python do not specify what encoding forms they purport to accept or emit. The purpose of the definitions in chapter 3 is to clarify the requirements of conformance. The discussion of strings is implicitly about interchange, otherwise it would be somewhere else than the chapter about conformance. > My understanding is that it is a goal, but practicality beats purity. > I think the only disagreement is on what's practical. It is not a goal of the *language*; there is no object in the *language* that we can say is buggy if it doesn't conform to the Unicode standard. Unicode conformance for Python, as of today, is a WIBNI. As Guido points out, the goal is a language that can be used to write efficient implementations of Unicode *if the users want to pay that cost*, not to provide an implementation so the users don't have to. From g.brandl at gmx.net Sun Jun 10 10:30:30 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Sun, 10 Jun 2007 10:30:30 +0200 Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers? In-Reply-To: References: Message-ID: Guido van Rossum schrieb: > On 6/9/07, Georg Brandl wrote: >> Guido van Rossum schrieb: >> > PEP 3127 (Integer Literal Support and Syntax) introduces new notations >> > for octal and binary integers. This isn't implemented yet. Are there >> > any takers? It shouldn't be particularly complicated. >> >> I have a patch lying around here which might be quite complete... > > Cool! > >> One thing that's unclear to me though: didn't we decide to drop the uppercase >> string modifiers/number suffixes/prefixes? > > In the end (doesn't the PEP confirm this?) we decided to keep them > and make it a style rule instead. Some folks have generated data sets > using uppercase. The PEP lists it as an "Open Issue". >> Also, I'm not sure what int() should do with "010". > > int("010") should return (decimal) 10. > int("010", 0) should raise ValueError. > > I thought that was also in the PEP. Yes, but rather than follow the PEP blindly, which might not have been updated to the latest discussion results, asking can't hurt :) > Anyway, with these tweaks, feel free to just check it in (well, if you > also fix the standard library to use the new notation). That should be easy enough. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. From martin at v.loewis.de Sun Jun 10 10:46:12 2007 From: martin at v.loewis.de (Martin v. Löwis) Date: Sun, 10 Jun 2007 10:46:12 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <73543.28835.qm@web33502.mail.mud.yahoo.com> References: <73543.28835.qm@web33502.mail.mud.yahoo.com> Message-ID: <466BBA54.3020300@v.loewis.de> > To truly enable Python in a non-English teaching > environment, I think you'd actually want to go a step > further and just internationalize the whole program. I don't know why that theory keeps popping up when people have repeatedly pointed out that it is just false. People *can* get used to the keywords of Python even if they have no clue what they mean. There is plenty of evidence for that. Likewise for the standard library.
Regards, Martin From martin at v.loewis.de Sun Jun 10 11:00:17 2007 From: martin at v.loewis.de (Martin v. Löwis) Date: Sun, 10 Jun 2007 11:00:17 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <466BBDA1.7070808@v.loewis.de> > Here is what I have to say (to everyone in this discussion, not > specifically to you, Stephen) in response to said labelling: Interestingly enough, we agree on the principles, and just judge the PEP differently wrt. these principles > Many of us value a *predictable* identifier character set. > Whether "predictable" means ASCII only, or user-selectable, or > restricted by default, I think we all agree in this sentiment: Indeed, PEP 3131 gives a predictable identifier character set. Adding per-site options to change the set of allowable characters makes it less predictable. > We believe that we should try to make it easier, not harder, for > programmers to understand what Python code says. This has many > benefits (reliability, readability, transparency, reviewability, > debuggability). I consider these core strengths of Python. Indeed. That was my primary motivation for the PEP: to make it easier for programmers to understand Python, and to allow people to write more transparent programs. > That is what makes these strengths so important. I hope this > helps you understand why these concerns can't and shouldn't be > brushed off as "paranoia" -- this really has to do with the > core values of the language. It just seems that the concerns don't directly follow from the principles. Something else has to be added to make that conclusion. It may not be paranoia (i.e. excessive anxiety), but there surely is some fear, no? Regards, Martin From rauli.ruohonen at gmail.com Sun Jun 10 18:20:44 2007 From: rauli.ruohonen at gmail.com (Rauli Ruohonen) Date: Sun, 10 Jun 2007 19:20:44 +0300 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <466BBA54.3020300@v.loewis.de> References: <73543.28835.qm@web33502.mail.mud.yahoo.com> <466BBA54.3020300@v.loewis.de> Message-ID: On 6/10/07, "Martin v. Löwis" wrote: > > To truly enable Python in a non-English teaching > > environment, I think you'd actually want to go a step > > further and just internationalize the whole program. > > I don't know why that theory keeps popping up when people > have repeatedly pointed out that it is just false. It isn't contrary to the PEP either. If somebody wants to go a step further with syntax (keywords etc), then they can provide alternative BNF syntaxes for different languages. It wouldn't necessitate any changes to PEP 3131. OTOH, PEP 3131 cannot be implemented at the syntax level. > People *can* get used to the keywords of Python even if > they have no clue what they mean. There is plenty of > evidence for that. Likewise for the standard library. True, but your PEP does not preclude later implementing the "step further". For libraries the step further would mean separate wrapped versions, as there probably isn't any other general solution.
Using gettext() or something for identifiers would easily break with introspection, and would in any case be complicated (which is worse than complex, which is worse than simple, and wrappers are simple :-). BTW, I submitted the normalization patch for 2.6, if you want to look at it. From martin at v.loewis.de Sun Jun 10 18:55:18 2007 From: martin at v.loewis.de (Martin v. Löwis) Date: Sun, 10 Jun 2007 18:55:18 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <465667AE.2090000@v.loewis.de> <20070524215742.864E.JCARLSON@uci.edu> <46568116.202@v.loewis.de> Message-ID: <466C2CF6.30508@v.loewis.de> >> "I know what you want, and I could easily do it, but I don't feel >> like doing it, read these ten pages of text to learn more about the >> problem". >> > in one word: exit That's indeed close, and has caused grief for this exact property. However, the case is actually different: exit could *not* easily do what was requested; for that to work, exit would have to be promoted to a keyword. Regards, Martin From martin at v.loewis.de Sun Jun 10 18:59:59 2007 From: martin at v.loewis.de (Martin v. Löwis) Date: Sun, 10 Jun 2007 18:59:59 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <20070525095511.866D.JCARLSON@uci.edu> References: <20070525091105.8663.JCARLSON@uci.edu> <19dd68ba0705250945j3dadcefcu8db91b3d2c055fdf@mail.gmail.com> <20070525095511.866D.JCARLSON@uci.edu> Message-ID: <466C2E0F.3010402@v.loewis.de> > It does, but it also refuses the temptation to guess that *everyone* > wants to use unicode identifiers by default. Please call them non-ASCII identifiers. All identifiers are Unicode, anyway, since Python 1.0 or so. They will be represented as Unicode strings in Python 3. Regards, Martin From jimjjewett at gmail.com Sun Jun 10 21:40:08 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Sun, 10 Jun 2007 15:40:08 -0400 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <466BBDA1.7070808@v.loewis.de> References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> Message-ID: On 6/10/07, "Martin v. Löwis" wrote: > > Many of us value a *predictable* identifier character set. > > Whether "predictable" means ASCII only, or user-selectable, or > > restricted by default, I think we all agree in this sentiment: > Indeed, PEP 3131 gives a predictable identifier character set. > Adding per-site options to change the set of allowable characters > makes it less predictable. Not in practice. Today, identifiers are drawn from [A-Za-z0-9], which is a fairly small set. Under the current PEP 3131 proposal, they will be drawn from a much larger set. There won't normally be many more letters actually used in any given program, but there will be many more that are possible (with very low probability). Unfortunately, some of these are visually identical. (Even with modified XID, they don't get rid of confusables; the unicode consortium is very unwilling to rule out anything which might theoretically be needed for valid reasons.) Many more are visually indistinguishable in practice, simply because the reader hasn't seen them before. While Unicode is still a finite set, it is much larger than ASCII.
By allowing site modifications, the rule becomes: It will use ASCII. Local code can also use local characters. There are potential exceptions for code that gets shared beyond local groups without ASCII-fication, but this is a strict subset of the "unreadable" code used under "anything-goes". Distribution without ASCIIfication is discouraged (by the extra decision required at installation time), users have explicit notice (by accepting it at install time), and the expanded charset is still a tiny fraction of what PEP 3131 currently proposes (you can accept French characters without accepting Chinese ideographs). -jJ From martin at v.loewis.de Sun Jun 10 21:51:27 2007 From: martin at v.loewis.de (Martin v. Löwis) Date: Sun, 10 Jun 2007 21:51:27 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <236066.59081.qm@web33506.mail.mud.yahoo.com> References: <236066.59081.qm@web33506.mail.mud.yahoo.com> Message-ID: <466C563F.6090305@v.loewis.de> > I think this whole debate could be put to rest by > agreeing to err on the side of ascii in 3.0 beta, and > if in real world experience, that turns out to be the > wrong decision, simply fix it in 3.0 production, 3.1, > or 3.2. Likewise, this whole debate could also be put to rest by agreeing to err on the side of unrestricted support for the PEP, and if that turns out to be the wrong decision, simply fix any problems discovered in 3.0 production, 3.1, or 3.2. IOW, any debate can be put to rest by agreeing. Regards, Martin From martin at v.loewis.de Sun Jun 10 21:55:28 2007 From: martin at v.loewis.de (Martin v. Löwis) Date: Sun, 10 Jun 2007 21:55:28 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <19685.97380.qm@web33511.mail.mud.yahoo.com> References: <19685.97380.qm@web33511.mail.mud.yahoo.com> Message-ID: <466C5730.3060003@v.loewis.de> > That describes me perfectly. I am self-interested to > the extent that my employers just pay me to write > working Python code, so I want the simplicity of ASCII > only. What I don't understand is why you can't simply continue to do so, with PEP 3131 implemented? If you have no need for accessing the NIS database, or for TLS sockets, you just don't use them - no need to make these features optional in the library. Regards, Martin From martin at v.loewis.de Sun Jun 10 22:04:30 2007 From: martin at v.loewis.de (Martin v. Löwis) Date: Sun, 10 Jun 2007 22:04:30 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <465667AE.2090000@v.loewis.de> <20070524215742.864E.JCARLSON@uci.edu> <46568116.202@v.loewis.de> Message-ID: <466C594E.6030106@v.loewis.de> >> People should not have to read long system configuration pages >> just to run the program that they intuitively wrote correctly >> right from the start. > > It is not intuitive. One thing I learned from the discussion here > about Unicode identifiers in other languages is that, though this > support exists in several other languages, it is *different* in each > of them. And PEP 3131 is different still. They allow different > sets of characters, and even worse, use different normalization rules. This is a theoretical problem only. People intuitively know what a "word" is in their language, and now we tell them they can use words as identifiers, as long as there are no spaces in them.
That different (programming) languages encode that intuition in slightly different rules makes no practical difference: the actual differences are only in boundary cases that are unlikely to occur in real life (unless somebody deliberately tries to come up with border cases). Regards, Martin From martin at v.loewis.de Sun Jun 10 22:09:39 2007 From: martin at v.loewis.de (Martin v. Löwis) Date: Sun, 10 Jun 2007 22:09:39 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <4656EAB5.6080405@gmail.com> References: <465615C9.4080505@v.loewis.de> <320102.38046.qm@web33515.mail.mud.yahoo.com> <19dd68ba0705241805y52ba93fdt284a2c696b004989@mail.gmail.com> <46568CEF.2030900@v.loewis.de> <4656EAB5.6080405@gmail.com> Message-ID: <466C5A83.8090703@v.loewis.de> Nick Coghlan schrieb: > Martin v. Löwis wrote: >>> I think that's a pretty strong reason for making the new, more complex >>> behaviour optional. >> >> Thus making it simpler????? The more complex behavior still remains, >> to fully understand the language, you have to understand that behavior, >> *plus* you need to understand that it may sometimes not be present. > > It's simpler because any existing automated unit tests will flag > non-ascii identifiers without modification. Not only does it prevent > surreptitious insertion of malicious code, but existing projects don't > have to even waste any brainpower worrying about the implications of > Unicode identifiers (because library code typically doesn't care about > client code's identifiers, only about the objects the library is asked > to deal with). I don't understand why existing projects would worry about the feature, for reasons different from the malicious code issue. If you don't want to waste brainpower on it, then just don't. > A free-for-all wasn't even proposed for strings and comments in PEP 263 > - why shouldn't we be equally conservative when it comes to > progressively enabling Unicode identifiers? Unfortunately, I don't understand this sentence. What is a "free-for-all", and why could it have been proposed by PEP 263, but wasn't? Regards, Martin From martin at v.loewis.de Sun Jun 10 22:14:47 2007 From: martin at v.loewis.de (Martin v. Löwis) Date: Sun, 10 Jun 2007 22:14:47 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <20070525091105.8663.JCARLSON@uci.edu> References: <20070524234516.8654.JCARLSON@uci.edu> <4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu> Message-ID: <466C5BB7.8050909@v.loewis.de> >> If it is the latter, I don't understand why the 95% ascii users need >> to run additional verification and checking tools. If they don't >> know the full language, they won't use it - why should they run >> any checking tools? > > I drop this > package into my tree, add the necessary imports and... > > ImportError: non-ascii identifier used without -U option > > Huh, apparently this 3rd party package uses non-ascii identifiers. If I > wanted to keep my codebase ascii-only (a not unlikely case), I can > choose to either look for a different package, look for a variant of > this package with only ascii identifiers, or attempt to convert the > package myself (a tool that does the unicode -> ascii transliteration > process would make this smoother). I cannot imagine this scenario as realistic. It is certainly realistic that you want to keep your own code base ASCII-only - what I don't understand is why such a policy would extend to libraries that you use.
If the interfaces of the library are non-ASCII, you will automatically notice; if it only has some non-ASCII identifiers inside, why would you bother? > * Or I copy and paste code from the Python Cookbook, a blog, etc. You copy code from the Python Cookbook and don't notice that it contains Chinese characters in identifiers??? Regards, Martin From martin at v.loewis.de Sun Jun 10 22:16:34 2007 From: martin at v.loewis.de (Martin v. Löwis) Date: Sun, 10 Jun 2007 22:16:34 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <73543.28835.qm@web33502.mail.mud.yahoo.com> <466BBA54.3020300@v.loewis.de> Message-ID: <466C5C22.4060708@v.loewis.de> > BTW, I submitted the normalization patch for 2.6, if you want to look > at it. Thanks. It might take some time until I get a chance (or somebody else may respond quicker); the 2.6 release is still ahead, so there is still plenty of time. Regards, Martin From martin at v.loewis.de Sun Jun 10 22:23:30 2007 From: martin at v.loewis.de (Martin v. Löwis) Date: Sun, 10 Jun 2007 22:23:30 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> Message-ID: <466C5DC2.6090109@v.loewis.de> >> Indeed, PEP 3131 gives a predictable identifier character set. >> Adding per-site options to change the set of allowable characters >> makes it less predictable. > > Not in practice. > > Today, identifiers are drawn from [A-Za-z0-9], which is a fairly small set. > > Under the current PEP 3131 proposal, they will be drawn from a much > larger set. There won't normally be many more letters actually used > in any given program, but there will be many more that are possible > (with very low probability). It's true that nobody could realistically enumerate all characters that would be allowed in identifiers. However, it is still practically easily predictable whether a given string makes an identifier *for a speaker of the language it is in*. The rule still is "letters, digits, and the underscore". It's certainly possible to come up with obscure cases where people will guess incorrectly whether they are valid syntax, but it is always possible to deliberately obfuscate code. Except for the malicious-user case (which apparently needs to be addressed), I don't see a problem with the existence of obscure cases. > By allowing site modifications, the rule becomes: > > It will use ASCII. Not universally - only on that site. I don't know what rule is in force on my buddy's machine, so predicting it becomes harder. > There are potential exceptions for code that gets shared beyond local > groups without ASCII-fication, but this is a strict subset of the > "unreadable" code used under "anything-goes". Distribution without > ASCIIfication is discouraged (by the extra decision required at > installation time), users have explicit notice (by accepting it at > install time), and the expanded charset is still a tiny fraction of > what PEP 3131 currently proposes (you can accept French characters > without accepting Chinese ideographs). I just put wording in the PEP that makes it clear that, whatever the problem, a global flag is not an acceptable solution.
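[Editorial note: the "checking tools" debated above are easy to sketch concretely. The following is a minimal illustration using the Python 3 tokenize module as it exists today; it is not part of PEP 3131 or of any proposal in this thread, and the reporting format is invented for the example.]

    # Minimal ASCII-only checker: report every identifier token that
    # contains a non-ASCII character.
    import sys
    import tokenize

    def report_non_ascii(path):
        with open(path, 'rb') as f:
            # tokenize.tokenize() reads bytes so that it can honor the
            # source file's own coding declaration.
            for tok in tokenize.tokenize(f.readline):
                if tok.type == tokenize.NAME and any(ord(ch) > 127 for ch in tok.string):
                    print('%s:%d: non-ASCII identifier %r'
                          % (path, tok.start[0], tok.string))

    if __name__ == '__main__':
        for path in sys.argv[1:]:
            report_non_ascii(path)

A project that wants an "ASCII unless explicitly allowed" policy could run such a check from its test suite, with no interpreter support at all.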
Regards, Martin From baptiste13 at altern.org Sun Jun 10 22:57:47 2007 From: baptiste13 at altern.org (Baptiste Carvello) Date: Sun, 10 Jun 2007 22:57:47 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <466BBDA1.7070808@v.loewis.de> References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> Message-ID: Martin v. Löwis wrote: >> Here is what I have to say (to everyone in this discussion, not >> specifically to you, Stephen) in response to said labelling: > > Interestingly enough, we agree on the principles, and just > judge the PEP differently wrt. these principles > >> Many of us value a *predictable* identifier character set. >> Whether "predictable" means ASCII only, or user-selectable, or >> restricted by default, I think we all agree in this sentiment: > > Indeed, PEP 3131 gives a predictable identifier character set. > Adding per-site options to change the set of allowable characters > makes it less predictable. > true. However, this will only matter if you distribute code with non-ASCII identifiers to the wider public. Something that we agree is a bad idea, don't we? >> We believe that we should try to make it easier, not harder, for >> programmers to understand what Python code says. This has many >> benefits (reliability, readability, transparency, reviewability, >> debuggability). I consider these core strengths of Python. > > Indeed. That was my primary motivation for the PEP: to make > it easier for programmers to understand Python, and to allow > people to write more transparent programs. > The real question is: transparent *to whom*. Transparent to the developer himself when he rereads his own code (which I value as a developer), or transparent to the user of the program when he tries to fix a bug (which I value as a user of open-source software)? Non-ASCII identifiers are marginally better for the first case, but can be dramatically worse for the second one. Clearly, there is a tradeoff. >> That is what makes these strengths so important. I hope this >> helps you understand why these concerns can't and shouldn't be >> brushed off as "paranoia" -- this really has to do with the >> core values of the language. > > It just seems that the concerns don't directly follow from > the principles. Something else has to be added to make that > conclusion. It may not be paranoia (i.e. excessive anxiety), > but there surely is some fear, no? > That argument is not really honest :-) Every risk can be estimated optimistically or pessimistically. In both cases, there is some part of irrationality.
> Regards, > Martin Cheers, Baptiste From santagada at gmail.com Mon Jun 11 00:06:20 2007 From: santagada at gmail.com (Leonardo Santagada) Date: Sun, 10 Jun 2007 19:06:20 -0300 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> Message-ID: <1A0C026D-CB40-40AE-8E89-738ECD42D01E@gmail.com> On 10/06/2007, at 17:57, Baptiste Carvello wrote: >> Indeed, PEP 3131 gives a predictable identifier character set. >> Adding per-site options to change the set of allowable characters >> makes it less predictable. >> > true. However, this will only matter if you distribute code with > non-ASCII > identifiers to the wider public. Something that we agree is a bad > idea, don't we? I don't. It is a bad idea to distribute non-ASCII code for libraries that are supposed to be used by the world as a whole. But distributing Chinese code for doing something like taxes using Chinese rules is ok and should be encouraged (now, I don't know whether they have taxes in China, but that is not the point). And not only that: in a school you would have to have all computers and students' computers configured for the same "locale" to make working code from one machine work on another. > >>> We believe that we should try to make it easier, not harder, for >>> programmers to understand what Python code says. This has many >>> benefits (reliability, readability, transparency, reviewability, >>> debuggability). I consider these core strengths of Python. >> >> Indeed. That was my primary motivation for the PEP: to make >> it easier for programmers to understand Python, and to allow >> people to write more transparent programs. >> > The real question is: transparent *to whom*. Transparent to the > developer > himself when he rereads his own code (which I value as a > developer), or > transparent to the user of the program when he tries to fix a bug > (which I value > as a user of open-source software) ? Non-ASCII identifiers are > marginally better > for the first case, but can be dramatically worse for the second > one. Clearly, > there is a tradeoff. No they are not, people doing open source work are probably going to still be coding in English, so that is not a problem; but if that Chinese tax system is open sourced, people in China can easily help fix bugs, because the identifiers are in their own language, which they can identify. > >>> That is what makes these strengths so important. I hope this >>> helps you understand why these concerns can't and shouldn't be >>> brushed off as "paranoia" -- this really has to do with the >>> core values of the language. >> >> It just seems that the concerns don't directly follow from >> the principles. Something else has to be added to make that >> conclusion. It may not be paranoia (i.e. excessive anxiety), >> but there surely is some fear, no? >> > That argument is not really honest :-) Every risk can be estimated > optimistically > or pessimistically. In both cases, there is some part of > irrationality. The thing is, people are predicting a future for Python code in the open source world.
One in which devs of open source libraries and programs will start coding in different languages if you support unicode identifiers, something that is not common today (using some form of ASCIIfication of their languages) and didn't happen with the Java, C#, Javascript and Common Lisp communities. In light of all that I think this prediction is probably wrong. We are all consenting adults and we know that we should code in English if we want our code to be used and to be a first class citizen of the open source world. What do you have to support your prediction? -- Leonardo Santagada "If it looks like a duck, and quacks like a duck, we have at least to consider the possibility that we have a small aquatic bird of the family anatidae on our hands." - Douglas Adams From g.brandl at gmx.net Mon Jun 11 00:39:51 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Mon, 11 Jun 2007 00:39:51 +0200 Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers? In-Reply-To: References: Message-ID: Guido van Rossum schrieb: > PEP 3127 (Integer Literal Support and Syntax) introduces new notations > for octal and binary integers. This isn't implemented yet. Are there > any takers? It shouldn't be particularly complicated. Okay, it's done. I'll be grateful for reviews. I've also removed traces of the "L" literal suffix where I encountered them, but may not have gotten them all. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. From foom at fuhm.net Mon Jun 11 00:50:55 2007 From: foom at fuhm.net (James Y Knight) Date: Sun, 10 Jun 2007 18:50:55 -0400 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> Message-ID: On Jun 10, 2007, at 4:57 PM, Baptiste Carvello wrote: > >> Indeed. That was my primary motivation for the PEP: to make >> it easier for programmers to understand Python, and to allow >> people to write more transparent programs. > The real question is: transparent *to whom*. Transparent to the > developer > himself when he rereads his own code (which I value as a > developer), or > transparent to the user of the program when he tries to fix a bug > (which I value > as a user of open-source software) ? Non-ASCII identifiers are > marginally better > for the first case, but can be dramatically worse for the second > one. Clearly, > there is a tradeoff. If another developer is planning to write code in English, this whole debate is moot. So, let's take as a given that he is going to write a program in his own non-English language. Now, will he write in an asciified form of his language, or using the proper character set? Right now, the only option is the first. The PEP proposes to also allow the second. So, your question should be: is it easier to understand an ASCIIified form of another language, or the actual language itself?
For me (who doesn't speak said language, nor perhaps even know its character set), I'm pretty sure the answer is still going to be the second: I'd rather a program written in Chinese use Chinese characters, rather than a transliteration of Chinese into ASCII, because it is actually feasible for me to do automatic translation of Chinese into something resembling English. And of course, that's even more true when talking about a language like French, which uses an alphabet quite familiar to me, but yet online translators still fail to function if it's been transliterated into ASCII. James From guido at python.org Mon Jun 11 00:54:09 2007 From: guido at python.org (Guido van Rossum) Date: Sun, 10 Jun 2007 15:54:09 -0700 Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers? In-Reply-To: References: Message-ID: Very cool; thanks!!! No problems so far. I wonder if we need a bin() built-in that is to 0b like oct() is to 0o and hex() to 0x? --Guido On 6/10/07, Georg Brandl wrote: > Guido van Rossum schrieb: > > PEP 3127 (Integer Literal Support and Syntax) introduces new notations > > for octal and binary integers. This isn't implemented yet. Are there > > any takers? It shouldn't be particularly complicated. > > Okay, it's done. > > I'll be grateful for reviews. I've also removed traces of the "L" literal > suffix where I encountered them, but may not have gotten them all. > > Georg > > -- > Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. > Four shall be the number of spaces thou shalt indent, and the number of thy > indenting shall be four. Eight shalt thou not indent, nor either indent thou > two, excepting that thou then proceed to four. Tabs are right out.
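[Editorial note: bin() was ultimately added alongside the 0b literal. Purely as an illustration of the behavior Guido is asking about, here is a rough pure-Python sketch for plain integers; the name bin_ and the exact semantics are assumptions for the example, not the eventual CPython implementation.]

    def bin_(n):
        # Mirror hex()/oct(): an optional sign, then a '0b' prefix,
        # then the binary digits.
        if n == 0:
            return '0b0'
        sign = '-' if n < 0 else ''
        n = abs(n)
        digits = []
        while n:
            digits.append(str(n & 1))
            n >>= 1
        return sign + '0b' + ''.join(reversed(digits))

    assert bin_(10) == '0b1010'
    assert bin_(-3) == '-0b11'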
-- --Guido van Rossum (home page: http://www.python.org/~guido/) From showell30 at yahoo.com Mon Jun 11 01:21:33 2007 From: showell30 at yahoo.com (Steve Howell) Date: Sun, 10 Jun 2007 16:21:33 -0700 (PDT) Subject: [Python-3000] Support for PEP 3131 In-Reply-To: Message-ID: <912322.26284.qm@web33510.mail.mud.yahoo.com> --- James Y Knight wrote: > > I'm pretty sure the answer is still going to > be the second: I'd > rather a program written in Chinese use Chinese > characters, rather > than a transliteration of Chinese into ASCII, > because it is actually > feasible for me to do automatic translation of > Chinese into something > resembling English. And of course, that's even more > true when talking > about a language like French, which uses an alphabet > quite familiar > to me, but yet online translators still fail to > function if it's been > transliterated into ASCII. > This was exactly my experience with translating the German program Martin posted a while back. I used Babelfish to translate it to English, and the one word that I didn't translate properly was a word with an umlaut. (It was my own error not to use the umlaut when looking up the translation; Martin's program did include the umlaut, and once I was clued in to the errors of my ways, I went back to babelfish with the umlaut and I got the exact translation I was looking for.) From showell30 at yahoo.com Mon Jun 11 01:13:08 2007 From: showell30 at yahoo.com (Steve Howell) Date: Sun, 10 Jun 2007 16:13:08 -0700 (PDT) Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <466C563F.6090305@v.loewis.de> Message-ID: <323254.95713.qm@web33508.mail.mud.yahoo.com> --- "Martin v. Löwis" wrote: > > I think this whole debate could be put to rest by > > agreeing to err on the side of ascii in 3.0 beta, > and > > if in real world experience, that turns out to be > the > > wrong decision, simply fix it in 3.0 production, > 3.1, > > or 3.2. > I wrote this a while back, and at the time I wrote, I felt it was a pretty reasonable statement. Having said that, after following this thread a little more... > Likewise, this whole debate could also be put to > rest > by agreeing to err on the side of unrestricted > support > for the PEP, and if that turns out to be the wrong > decision, simply fix any problems discovered in 3.0 > production, 3.1, or 3.2. > ...I am now in favor of the PEP, with no restrictions, even though it now goes a little further than I'd like. I wish the debate would turn to actual use cases. For example, one of the arguments behind PEP 3131 is that it will facilitate the use of Python in educational environments. It would be interesting to hear from actual teachers what their biggest impediments to using Python are right now. It could be that the lack of foreign language documentation is far bigger an impediment to using Python in a Chinese classroom than the current restrictions on ASCII identifiers. It could be that the standard library involves knowing too much English, which PEP 3131 won't really address. It could be that teachers simply want error messages to be internationalized, so that students can follow tracebacks, and identifiers aren't really an issue.
It could be that some foreign schools actually embrace the use of an English alphabet in Python, as it allows for a more integrated education opportunity (students learn an important programming language while simultaneously mastering one of the world's most commercially important written languages...). From martin at v.loewis.de Mon Jun 11 05:07:06 2007 From: martin at v.loewis.de (Martin v. Löwis) Date: Mon, 11 Jun 2007 05:07:06 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> Message-ID: <466CBC5A.3050907@v.loewis.de> >> Indeed, PEP 3131 gives a predictable identifier character set. >> Adding per-site options to change the set of allowable characters >> makes it less predictable. >> > true. However, this will only matter if you distribute code with non-ASCII > identifiers to the wider public. No - it will matter for any kind of distribution, not just to the "wider public". If I move code to the next machine it may stop working, or if I upgrade to the next Python version, assuming the default is to restrict identifiers. > The real question is: transparent *to whom*. Transparent to the developer > himself when he rereads his own code (which I value as a developer), or > transparent to the user of the program when he tries to fix a bug (which I value > as a user of open-source software) ? Non-ASCII identifiers are marginally better > for the first case, but can be dramatically worse for the second one. Clearly, > there is a tradeoff. Why do you say that? Non-ASCII identifiers significantly improve the readability of code to speakers of the natural language from which the identifiers are drawn. With ASCII identifiers, the reader needs to understand the English words, or recognize the transliteration. With non-ASCII identifiers, the intended meaning of the class or function becomes immediately apparent, in the way identifiers have always been self-documentation for English-speaking people. >>> That is what makes these strengths so important. I hope this >>> helps you understand why these concerns can't and shouldn't be >>> brushed off as "paranoia" -- this really has to do with the >>> core values of the language. >> It just seems that the concerns don't directly follow from >> the principles. Something else has to be added to make that >> conclusion. It may not be paranoia (i.e. excessive anxiety), >> but there surely is some fear, no? >> > That argument is not really honest :-) Every risk can be estimated optimistically > or pessimistically. In both cases, there is some part of irrationality. Still, what is the risk being estimated? Is it that somebody maliciously tries to provide patches that use look-alike characters? I honestly don't know what risks you see.
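[Editorial note: the look-alike risk Martin asks about can be demonstrated concretely. The snippet below assumes an interpreter with PEP 3131 support and is only an illustration of the concern raised in this thread, not code taken from it.]

    # Two identifiers that render identically in many fonts but are
    # distinct code points: Latin 'a' (U+0061) vs. Cyrillic 'а' (U+0430).
    payload = 1     # spelled with a Latin 'a'
    pаyload = 2     # spelled with a Cyrillic 'а' -- a different name entirely
    print(payload + pаyload)    # prints 3: two separate bindings, not one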
Regards, Martin From martin at v.loewis.de Mon Jun 11 05:27:45 2007 From: martin at v.loewis.de (Martin v. Löwis) Date: Mon, 11 Jun 2007 05:27:45 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <323254.95713.qm@web33508.mail.mud.yahoo.com> References: <323254.95713.qm@web33508.mail.mud.yahoo.com> Message-ID: <466CC131.3030006@v.loewis.de> > I wish the debate would turn to actual use cases. For > example, one of the arguments behind PEP 3131 is that > it will facilitate the use of Python in educational > environments. It would be interesting to hear from > actual teachers what their biggest impediments to > using Python are right now. It could be that the lack > of foreign language documentation is far bigger an > impediment to using Python in a Chinese classroom than > the current restrictions on ASCII identifiers. I don't know whether you have seen http://groups.google.com/group/comp.lang.python/msg/ccffec1abd4dd24d which discusses these points precisely. See also a few follow-up messages in that part of the thread. FWIW, I don't think that foreign-language documentation is lacking. I don't know about Chinese, but for German, there is plenty of documentation. I wrote a German Python book myself 10 years ago, and other people have since written other books. A PowerPoint presentation discussing Python for school can be found at http://ada.rg16.asn-wien.ac.at/~python/Py4KidsFolien1.ppt Gregor Lingl is the author of "Python für Kids". > It could be that the standard library involves knowing > too much English, which PEP 3131 won't really address. > It could be that teachers simply want error messages > to be internationalized, so that students can follow > tracebacks, and identifiers aren't really an issue. > It could be that some foreign schools actually embrace > the use of an English alphabet in Python, as it allows > for a more integrated education opportunity (students > learn an important programming language while > simultaneously mastering one of the world's most > commercially important written languages...). Unfortunately, teachers don't participate in python-3000, as don't many other Python users. So it's unlikely that you find a teacher posting *here*, it was pure luck that I found a Chinese teacher posting on comp.lang.python. You would need to go to places where teachers discuss in the internet, which likely isn't even Usenet. Not being a (high school) teacher myself, I don't know how to find them. Regards, Martin From showell30 at yahoo.com Mon Jun 11 06:27:29 2007 From: showell30 at yahoo.com (Steve Howell) Date: Sun, 10 Jun 2007 21:27:29 -0700 (PDT) Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <466CC131.3030006@v.loewis.de> Message-ID: <954258.3730.qm@web33506.mail.mud.yahoo.com> --- "Martin v. Löwis" wrote: > > Unfortunately, teachers don't participate > in python-3000, as don't many other Python users. > So it's unlikely that you find a teacher posting > *here*, it was pure luck that I found a Chinese > teacher posting on comp.lang.python. You would > need to go to places where teachers discuss > in the internet, which likely isn't even Usenet. > Not being a (high school) teacher myself, I don't > know how to find them. > In high schools? :) Seriously, that's where you find high school teachers.
I've been in high school environments where Python is being taught, and that's why I'm a little skeptical that folks arguing on either side of this argument are maybe a bit too much in the ivory tower, and not enough dealing with actual use cases. The Chinese teacher that you mention made some interesting points in his posts, and I take his advocacy for PEP 3131 very seriously, but I think he would actually be well served using a language more suitable for educational purposes than Python. I have experience with using a learning language in the classroom, and it was very positive for students. From ncoghlan at gmail.com Mon Jun 11 08:04:16 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 11 Jun 2007 16:04:16 +1000 Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers? In-Reply-To: References: Message-ID: <466CE5E0.3020106@gmail.com> Guido van Rossum wrote: > On 6/10/07, Georg Brandl wrote: >> Guido van Rossum schrieb: >>> Very cool; thanks!!! No problems so far. >>> >>> I wonder if we need a bin() built-in that is to 0b like oct() is to 0o >>> and hex() to 0x? >> Would that also require a __bin__() special method? > > If the other two use it, we might as well model it that way. > I must admit I've never understood why hex() and oct() don't just go through __int__() (Note that the integer formats are all defined as going through int() in PEP 3101). If we only want them to work for true integers, then we have __index__() available now. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From python at zesty.ca Mon Jun 11 08:54:26 2007 From: python at zesty.ca (Ka-Ping Yee) Date: Mon, 11 Jun 2007 01:54:26 -0500 (CDT) Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <466C563F.6090305@v.loewis.de> References: <236066.59081.qm@web33506.mail.mud.yahoo.com> <466C563F.6090305@v.loewis.de> Message-ID: Steve Howell wrote: > I think this whole debate could be put to rest by > agreeing to err on the side of ascii in 3.0 beta, and > if in real world experience, that turns out to be the > wrong decision, simply fix it in 3.0 production, 3.1, > or 3.2. On Sun, 10 Jun 2007, "Martin v. Löwis" wrote: > Likewise, this whole debate could also be put to rest > by agreeing to err on the side of unrestricted support > for the PEP, and if that turns out to be the wrong > decision, simply fix any problems discovered in 3.0 > production, 3.1, or 3.2. Your attempted parallel does not match: it breaks code, whereas Steve's does not. -- ?!ng From python at zesty.ca Mon Jun 11 09:20:42 2007 From: python at zesty.ca (Ka-Ping Yee) Date: Mon, 11 Jun 2007 02:20:42 -0500 (CDT) Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <466C5730.3060003@v.loewis.de> References: <19685.97380.qm@web33511.mail.mud.yahoo.com> <466C5730.3060003@v.loewis.de> Message-ID: On Sun, 10 Jun 2007, "Martin v. Löwis" wrote: > > That describes me perfectly. I am self-interested to > > the extent that my employers just pay me to write > > working Python code, so I want the simplicity of ASCII > > only. > > What I don't understand is why you can't simply continue > to do so, with PEP 3131 implemented?
> > If you have no need for accessing the NIS database, > or for TLS sockets, you just don't use them - no > need to make these features optional in the library. Because the existence of these library modules does not make it impossible to reliably read source code. We're talking about changing the definition of the language here, which is deeper than adding or removing things in the library. Python currently provides to everyone the restriction of identifiers to a character set that everyone knows and trusts. Many of us want Python to continue to provide such restriction for those who want identifiers to be in a character set they know and trust. This is not incompatible with your desire to permit alternative character sets, as long as Python offers an option to make that choice. We can continue to discuss the details of how that choice is expressed, but this general idea is a solution that would give us both what we want. Can we agree on that? -- ?!ng From aleaxit at gmail.com Mon Jun 11 11:19:44 2007 From: aleaxit at gmail.com (Alex Martelli) Date: Mon, 11 Jun 2007 11:19:44 +0200 Subject: [Python-3000] rethinking pep 3115 In-Reply-To: <1d85506f0706091632q7cedac6dj1ce3cb6d5954ec8@mail.gmail.com> References: <1d85506f0706091632q7cedac6dj1ce3cb6d5954ec8@mail.gmail.com> Message-ID: On 6/10/07, tomer filiba wrote: > pep 3115 (new metaclasses) seems overly complicated imho. It does look over-engineered to me, too. > it fails my understanding of "keeping it simple", among other heuristics. > > (1) > the trivial fix-up would be to extend the type constructor to take > 4 arguments: (name, bases, attrs, order), where 'attrs' is a plain > old dict, and 'order' is a list, into which the names are appended > in the order they were defined in the body of the class. this way, > no new types are introduced and 99% of the use cases are covered. Agreed, but it doesn't look very elegant. > > things like "forward referencing in the class namespace" are evil. > and besides, it's not possible to do with functions and modules, > so why should classes be allowed such a mischief? > > (2) > the second-best solution i could think of is just passing the dict as a > keyword argument to the class, like so: > > class Spam(metaclass = Bacon, dict = {}): > ... > > so you could explicitly state you need a special dict. I like this one, with classdict being the keyword (dict is the name of a builtin type and we shouldn't encourage the frequent but iffy practice of 'overriding' builtin identifiers). > > following the cosmetic change of removing the magical __metaclass__ > attribute from the class body into the class header, it makes so > sense to replace it by another magical method, __prepare__. > the straight-forward-and-simple way would be to make it a keyword > argument, just like 'metaclass'. > > (3) > personally, i refrain from metaclasses. according to my experience, > they just cause trouble, while the benefits of using them are marginal. > the problem is noticeable especially when trying to understand > and debug third-party code. metaclasses + bugs = blackmagic. > > moreover, they introduce inheritance issues. the class hierarchy > becomes rigid and difficult to evolve as the need arises, which > contradicts my perception of agile languages. i like to view programming > as an iterative task which approaches the final objective after > several loops. rigidness makes each loop longer, which is why > i prefer dynamic languages to compiled ones. 
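[Editorial note: for comparison with proposal (1), here is a minimal runnable sketch of the mechanism PEP 3115 actually specifies: a metaclass whose __prepare__ returns an order-recording namespace. It assumes a PEP 3115 implementation (i.e. Python 3) and an ordered mapping type; the Struct example mirrors the one Nick Coghlan gives later in this thread.]

    import collections

    class OrderedMeta(type):
        @classmethod
        def __prepare__(mcls, name, bases, **kwds):
            # The class body executes with this mapping as its namespace,
            # so it records assignments in definition order.
            return collections.OrderedDict()

        def __new__(mcls, name, bases, namespace, **kwds):
            cls = super().__new__(mcls, name, bases, dict(namespace))
            cls._field_order = [k for k in namespace
                                if not k.startswith('__')]
            return cls

    class Struct(metaclass=OrderedMeta):
        pass

    class MyStruct(Struct):
        first = 1
        second = 2
        third = 3

    print(MyStruct._field_order)    # ['first', 'second', 'third']

The ordering machinery stays inside Struct's metaclass, so client code never mentions it -- which is the trade-off being debated here.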
> > on the other hand, i do understand the need for metaclasses, > even if for the sake of symmetry (as types are objects). > but the solution proposed by pep 3115, of making metaclasses > even more complicated and magical, seems all wrong to me. > > i understand it's already been accepted, but i'm hoping there's > still time to reconsider this before 3.0 becomes final. I agree with your observations and with your hope. Alex From murman at gmail.com Mon Jun 11 15:15:20 2007 From: murman at gmail.com (Michael Urman) Date: Mon, 11 Jun 2007 08:15:20 -0500 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <236066.59081.qm@web33506.mail.mud.yahoo.com> <466C563F.6090305@v.loewis.de> Message-ID: On 6/11/07, Ka-Ping Yee wrote: > Your attempted parallel does not match: it breaks code, > whereas Steve's does not. However the same code which would break only if we find we need to restrict the characters in identifiers further than the restrictions in the PEP, is broken off the bat in Steve's scenario because it won't run in differently configured environments. Michael -- Michael Urman From murman at gmail.com Mon Jun 11 15:29:35 2007 From: murman at gmail.com (Michael Urman) Date: Mon, 11 Jun 2007 08:29:35 -0500 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <19685.97380.qm@web33511.mail.mud.yahoo.com> <466C5730.3060003@v.loewis.de> Message-ID: On 6/11/07, Ka-Ping Yee wrote: > Because the existence of these library modules does not make it > impossible to reliably read source code. We're talking about > changing the definition of the language here, which is deeper > than adding or removing things in the library. This has already been demonstrated to be false - you already cannot visually inspect a printed python program and know what it will do. There is the risk of visually aliased identifiers, but how is that qualitatively worse than the truly conflicting identifiers you can import with a *, or have inserted by modules mucking with __builtins__? > permit alternative character sets, as long as Python offers an > option to make that choice. We can continue to discuss the > details of how that choice is expressed, but this general idea > is a solution that would give us both what we want. I can't agree with this. The predictability of needing only to duplicate dependencies (version of python, modules) to ensure a program that ran over there will run over here (and vice versa) is too important to me. When end users see a NameError or SyntaxError when they try to run a python script, they will generally assume it is the script at fault, not their environment. Michael -- Michael Urman From jimjjewett at gmail.com Mon Jun 11 15:37:00 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Mon, 11 Jun 2007 09:37:00 -0400 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <466C5BB7.8050909@v.loewis.de> References: <20070524234516.8654.JCARLSON@uci.edu> <4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu> <466C5BB7.8050909@v.loewis.de> Message-ID: On 6/10/07, "Martin v. L?wis" wrote: > > * Or I copy and paste code from the Python Cookbook, a blog, etc. > You copy code from the Python Cookbook and don't notice that it > contains Chinese characters in identifiers??? Chinese in particular you would recognize as "not what I expected". Cyrillic you might not recognize, because it looks like ASCII letters. Prime (or tone) marks, you might not recognize, because they look like ASCII quote marks. 
If you're retyping, I'm not sure how much problem this would cause in practice. I wouldn't want to ban those letters entirely, but I would like some indication that I should expect characters in the Cyrillic range. -jJ From jimjjewett at gmail.com Mon Jun 11 15:58:58 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Mon, 11 Jun 2007 09:58:58 -0400 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <466C5DC2.6090109@v.loewis.de> References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> <466C5DC2.6090109@v.loewis.de> Message-ID: On 6/10/07, "Martin v. Löwis" wrote: > >> Indeed, PEP 3131 gives a predictable identifier character set. > >> Adding per-site options to change the set of allowable characters > >> makes it less predictable. > > Not in practice. ... > > By allowing site modifications, the rule becomes: > > It will use ASCII. [and clipped "programs intended only for local use will use ASCII plus letters that local users recognize."] > Not universally - only on that site. Yes, universally. By allowing "any unicode character", you have reason to believe the next piece of code isn't doing something strange, either by accident or by malice. By allowing "ASCII + those listed in the site config", then the rule will change from "It will use ASCII, always" (today) to "It will use ASCII if it is intended for distribution." plus "local programs can use ASCII + locally recognized letters" That is slightly more complicated than ASCII-only, but only for those who want to use the extended charsets -- and either rule is still straightforward. The rule proposed in PEP 3131 is "It will use something that is numerically a letter or number, to someone somewhere." Given the style guide of ASCII for internationally targeted open source, that will degrade to "It should use ASCII". "But it might not, since there will be no feedback or apparent downside to violating the style rule, even for distributed code." "In fact, it might even use something downright misleading, and you won't have any warning, because we thought that maybe someone, somewhere, might have wanted that character in a different context." And no, I don't think I'm exaggerating with that last one; we aren't proposing rules against mixed script identifiers (or even limiting script switches to occur only at the _ character). It will be perfectly legitimate to apparently end a string with three consecutive prime characters. It will be bad style, but there will be nothing to tip off the non-paranoid. In theory, we could solve this by limiting the non-ASCII characters, but I don't think we can do that in practice. The unicode consortium hasn't even tried; even XID + security modifications + NFKC still includes characters that are intended to look identical; all the security modifications do is eliminate characters that do *not* have any expected legitimate use. (Example: no living language uses them.) I don't think we want to wade too deeply into the morass of confusables detection; the unicode consortium itself says the problem is neither solved nor stable. It might be a good idea to restrict (within-a-single-ID) script switches to only occur at the "_", but I'm not sure a 95% solution is worth doing.
By saying "Only charcacters you or your sysadmin expected", we at least limit it to things the user will be expecting and can recognize. (Unless the sysadmin decides otherwise.) > I don't know what rule is > in force on my buddy's machine, so predicting it becomes harder. But you know ASCII will work. If he used the same local install (classroom peer, member of the same user group, etc), then your local characters will probably work too. If he is really your buddy, he probably trusts you enough to allow your charset if you tell him about it. > I just put wording in the PEP that makes it clear that, whatever > the problem, a global flag is not an acceptable solution. I agree that a single flag doesn't really solve the problem. But a global configuration does go a long way. For me personally, I would be more willing to allow Latin-1 than Hangul, because I can recognize the Latin-1 characters. (I still wouldn't allow them all by default; the difference between the various lower-case i's is small enough -- to me -- that I want a warning when one is used.) Hangul is more acceptable than Cyrillic, because at least it is obviously foreign; I won't mistake it for something. Someone who uses Cyrillic on a daily basis might well have the opposite preferences. I support letting her use Cyrillic if she wants to; I just don't want it to work on my machine without my knowing about it. But I would like to be able to accept ? and ? (French characters) without shutting off the warning for Cyrillic or Ogham. Allowing ASCII plus "chars specified by the site or user through a config file" meets that goal. -jJ From jimjjewett at gmail.com Mon Jun 11 16:09:25 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Mon, 11 Jun 2007 10:09:25 -0400 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <1A0C026D-CB40-40AE-8E89-738ECD42D01E@gmail.com> References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> <1A0C026D-CB40-40AE-8E89-738ECD42D01E@gmail.com> Message-ID: On 6/10/07, Leonardo Santagada wrote: > We are all consenting > adults and we know that we should code in english if we want our code > to be used and to be a first class citizen of the open source world. I have no objection to Open Source being written in Chinese. My objection is to not knowing which file a script is using. Think of it like the coding directive. Once upon a time, if you didn't have a coding directive, but used characters outside of ASCII, the results were system-dependent. It didn't cause much of a problem, because most people stuck to ASCII, and the exceptions mostly stuck to characters that were common across codesets. Still, it was better to be explicit. I want an explicit notice of which scripts are being used. I'll settle for an explicit choice of which scripts can be used, so that I can just exclude the ones I wasn't expecting. This doesn't fully cover the malicious (or careless) user case, but it gives me the tools to set my own ease-of-use tradeoffs between "it just runs" and "it does what I think it does". 
-jJ From ncoghlan at gmail.com Mon Jun 11 16:10:28 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 12 Jun 2007 00:10:28 +1000 Subject: [Python-3000] rethinking pep 3115 In-Reply-To: References: <1d85506f0706091632q7cedac6dj1ce3cb6d5954ec8@mail.gmail.com> Message-ID: <466D57D4.8070702@gmail.com> Alex Martelli wrote: >> (2) >> the second-best solution i could think of is just passing the dict as a >> keyword argument to the class, like so: >> >> class Spam(metaclass = Bacon, dict = {}): >> ... >> >> so you could explicitly state you need a special dict. > > I like this one, with classdict being the keyword (dict is the name of > a builtin type and we shouldn't encourage the frequent but iffy > practice of 'overriding' builtin identifiers). So instead of being able to write: class MyStruct(Struct): first = 1 second = 2 third = 3 everyone defining a Struct subclass has to write: class MyStruct(Struct, classdict=OrderedDict()): first = 1 second = 2 third = 3 Forgive my confusion, but exactly *how* is that meant to be an improvement? The use of a special ordered dictionary should be an internal implementation detail of the Struct class, and PEP 3115 makes it exactly that. The PEP's approach means that simple cases, while possibly being slightly harder to write, will 'just work' when it comes time to use them, while more complicated cases involving multiple metaclasses should still be possible. I will also note that the PEP allows someone to write their own base class which accepts the 'classdict' keyword argument if they so choose. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From jimjjewett at gmail.com Mon Jun 11 16:29:16 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Mon, 11 Jun 2007 10:29:16 -0400 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> <466C5DC2.6090109@v.loewis.de> Message-ID: On 6/11/07, Jim Jewett wrote: > Yes, universally. By allowing "any unicode character", you have (oops -- apparently this posted with only half the edits) > reason to believe the next piece of code isn't doing something > strange, either by accident or by malice. By allowing "any unicode character", you have NO reason to believe... From guido at python.org Mon Jun 11 16:42:12 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 11 Jun 2007 07:42:12 -0700 Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers? In-Reply-To: <466CE5E0.3020106@gmail.com> References: <466CE5E0.3020106@gmail.com> Message-ID: On 6/10/07, Nick Coghlan wrote: > Guido van Rossum wrote: > > On 6/10/07, Georg Brandl wrote: > >> Guido van Rossum schrieb: > >>> Very cool; thanks!!! No problems so far. > >>> > >>> I wonder if we need a bin() built-in that is to 0b like oct() is to 0o > >>> and hex() to 0x? > >> Would that also require a __bin__() special method? > > > > If the other two use it, we might as well model it that way. > > I must admit I've never understood why hex() and oct() don't just go > through __int__() (Note that the integer formats are all defined as > going through int() in PEP 3101). > > If we only want them to work for true integers, then we have __index__() > available now. 
Well, maybe it's time to kill __oct__ and __hex__ then. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
From jimjjewett at gmail.com Mon Jun 11 16:43:35 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Mon, 11 Jun 2007 10:43:35 -0400 Subject: [Python-3000] PEP 3131: what are the risks? Message-ID: On 6/10/07, "Martin v. Löwis" wrote: > Still, what is the risk being estimated? Is it that somebody > maliciously tries to provide patches that use look-alike > characters? I honestly don't know what risks you see. Here are the top three that I see; note that none of these concerns say "Don't use non-ASCII ids". They do all say "Don't use ids from a script the user hasn't said to expect". (1) Malicious user is indeed one risk. A small probability, but a big enough loss that I want a warning when the door is unlocked. (2) Typos are another risk. Even in mono-lingual environments, it is possible to get a wrong letter. If you're expecting ì, it is fine. If you're not, then it shouldn't pass silently. (3) "Reados". When doing maintenance later, if I wasn't expecting ì, I may see it as a regular i, and code that way. Now I have two doppelganger/döppelganger variables (or inherited methods) serving the same purpose, but using different memory locations. Ideally, the test cases will catch this. In real life, even the python stdlib has plenty of modules with poor test coverage. I can't expect better of random code, particularly given that it has chosen to ignore the style-guide (and history) about sticking to ASCII for distributed code. (Learning to store your tests generally comes long after picking up the basic style guidelines.) -jJ
From tomerfiliba at gmail.com Mon Jun 11 16:52:47 2007 From: tomerfiliba at gmail.com (tomer filiba) Date: Mon, 11 Jun 2007 16:52:47 +0200 Subject: [Python-3000] rethinking pep 3115 In-Reply-To: <466D57D4.8070702@gmail.com> References: <1d85506f0706091632q7cedac6dj1ce3cb6d5954ec8@mail.gmail.com> <466D57D4.8070702@gmail.com> Message-ID: <1d85506f0706110752x3af8f232o9a66fb11d68b7e75@mail.gmail.com> On 6/11/07, Nick Coghlan wrote: > So instead of being able to write: > > class MyStruct(Struct): > first = 1 > second = 2 > third = 3 > [...] > > Forgive my confusion, but exactly *how* is that meant to be an improvement? as your example shows, the most common use-case is an ordered dict, so as i was saying, just "upgrading" the type() constructor to accept four arguments solves almost all of the desired use cases. imho, "forward name binding" is an undesired side effect. what i'm trying to say is, this pep is an *overkill*. yes, it is "more powerful" than what i'm suggesting, but my point is we don't want to have all that "power". it's too complex and provides only a marginal benefit. you're just using classes as syntactic sugar for namespaces (because python lacks other syntactic namespaces), which is useful -- but conceptually wrong. python should have introduced a separate namespace construct, not to be confused with classes (something like the "make pep") the pep at hand is basically *overloading* classes into a generic namespace device -- to which i'm saying: (a) it's wrong and (b) it's not used frequently enough to deserve complicating the interpreter for it. -tomer P.S. per your "class Something(Struct)" example above, you might want to check out how Construct solves that (see below). Construct's declarative approach is able to express more kinds of relations between data structures than simple structs, such as nested structs, arrays, switches, etc.
http://construct.wikispaces.com http://sebulbasvn.googlecode.com/svn/trunk/construct/formats/filesystem/mbr.py http://sebulbasvn.googlecode.com/svn/trunk/construct/formats/executable/elf32.py
From g.brandl at gmx.net Mon Jun 11 17:29:04 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Mon, 11 Jun 2007 17:29:04 +0200 Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers? In-Reply-To: References: <466CE5E0.3020106@gmail.com> Message-ID: Guido van Rossum schrieb: > On 6/10/07, Nick Coghlan wrote: >> Guido van Rossum wrote: >> > On 6/10/07, Georg Brandl wrote: >> >> Guido van Rossum schrieb: >> >>> Very cool; thanks!!! No problems so far. >> >>> >> >>> I wonder if we need a bin() built-in that is to 0b like oct() is to 0o >> >>> and hex() to 0x? >> >> Would that also require a __bin__() special method? >> > >> > If the other two use it, we might as well model it that way. >> >> I must admit I've never understood why hex() and oct() don't just go >> through __int__() (Note that the integer formats are all defined as >> going through int() in PEP 3101). >> >> If we only want them to work for true integers, then we have __index__() >> available now. > > Well, maybe it's time to kill __oct__ and __hex__ then. Sounds fine to me; using __index__ to get at the number to convert would be ideal. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
From g.brandl at gmx.net Mon Jun 11 18:12:01 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Mon, 11 Jun 2007 18:12:01 +0200 Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers? In-Reply-To: References: <466CE5E0.3020106@gmail.com> Message-ID: Guido van Rossum schrieb: > On 6/10/07, Nick Coghlan wrote: >> Guido van Rossum wrote: >> > On 6/10/07, Georg Brandl wrote: >> >> Guido van Rossum schrieb: >> >>> Very cool; thanks!!! No problems so far. >> >>> >> >>> I wonder if we need a bin() built-in that is to 0b like oct() is to 0o >> >>> and hex() to 0x? >> >> Would that also require a __bin__() special method? >> > >> > If the other two use it, we might as well model it that way. >> >> I must admit I've never understood why hex() and oct() don't just go >> through __int__() (Note that the integer formats are all defined as >> going through int() in PEP 3101). >> >> If we only want them to work for true integers, then we have __index__() >> available now. > > Well, maybe it's time to kill __oct__ and __hex__ then. Okay, attached is a patch to do that. It adds a new abstract function, PyNumber_ToBase, that converts an __index__able integer to an arbitrary base. bin(), oct() and hex() just use it. (I've left the slots in the PyNumberMethods struct for now.) There was not much library code to change: only tests used the special methods. Though /me wonders if we shouldn't just expose PyNumber_ToBase as a single function that mirrors int(str, base).
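(For illustration only -- a rough pure-Python model of the idea, not the actual C patch; the helper name and digit table are made up:)

    import operator

    DIGITS = "0123456789abcdef"

    def to_base(n, base, prefix):
        # Go through __index__, so any true integer type is accepted.
        n = operator.index(n)
        sign = "-" if n < 0 else ""
        n = abs(n)
        digits = []
        while True:
            n, r = divmod(n, base)
            digits.append(DIGITS[r])
            if n == 0:
                break
        return sign + prefix + "".join(reversed(digits))

    def bin_(n): return to_base(n, 2, "0b")    # bin_(10)  -> '0b1010'
    def oct_(n): return to_base(n, 8, "0o")    # oct_(8)   -> '0o10'
    def hex_(n): return to_base(n, 16, "0x")   # hex_(255) -> '0xff'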
Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. -------------- next part -------------- A non-text attachment was scrubbed... Name: no_hexoct.diff Type: text/x-patch Size: 10727 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070611/5d1f29c2/attachment.bin
From rauli.ruohonen at gmail.com Mon Jun 11 18:43:58 2007 From: rauli.ruohonen at gmail.com (Rauli Ruohonen) Date: Mon, 11 Jun 2007 19:43:58 +0300 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> <466C5DC2.6090109@v.loewis.de> Message-ID: On 6/11/07, Jim Jewett wrote: > "In fact, it might even use something downright misleading, and > you won't have any warning, because we thought that maybe someone, > somewhere, might have wanted that character in a different context." > > And no, I don't think I'm exaggerating with that last one; we aren't > proposing rules against mixed script identifiers (or even limiting > script switches to occur only at the _ character). This isn't limited to identifiers, though. You can already write "tricky" code in 2.5, but the coding directive and the unicode/str separation make it obvious that something funny is going on. The former will not be necessary in 3.0 and the latter will be gone. Won't restricting identifiers only give you a false sense of security? Small example using strings:

    authors = ['Michèlle Mischié-Vous', 'Günther Gutenberg']
    clearances = ['infrared', 'red', 'orange', 'yellow', 'green', 'blue',
                  'indigo', 'violet', 'ultraviolet']

    class ClearanceError(ValueError):
        pass

    def validate_clearance(clearance):
        if clearance not in clearances:
            raise ClearanceError(clearance)

    def big_red_button451(clearance):
        validate_clearance(clearance)
        if clearance == 'infrarеd':  # cyrillic e
            # Even this button has *some* standards! -- Michèlle
            raise ClearanceError(clearance)
        # Set Günther's printer on fire

    def main():
        try:
            big_red_button451('infrarеd')  # cyrillic e
        except ClearanceError:
            pass
        else:
            print('BRB 451 does not check clearances properly!')

    if __name__ == '__main__':
        main()  # run tests

From guido at python.org Mon Jun 11 18:45:40 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 11 Jun 2007 09:45:40 -0700 Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers? In-Reply-To: References: <466CE5E0.3020106@gmail.com> Message-ID: On 6/11/07, Georg Brandl wrote: > Guido van Rossum schrieb: > > On 6/10/07, Nick Coghlan wrote: > >> Guido van Rossum wrote: > >> > On 6/10/07, Georg Brandl wrote: > >> >> Guido van Rossum schrieb: > >> >>> Very cool; thanks!!! No problems so far. > >> >>> > >> >>> I wonder if we need a bin() built-in that is to 0b like oct() is to 0o > >> >>> and hex() to 0x? > >> >> Would that also require a __bin__() special method? > >> > > >> > If the other two use it, we might as well model it that way. > >> > >> I must admit I've never understood why hex() and oct() don't just go > >> through __int__() (Note that the integer formats are all defined as > >> going through int() in PEP 3101). > >> > >> If we only want them to work for true integers, then we have __index__() > >> available now. > > > > Well, maybe it's time to kill __oct__ and __hex__ then. > > Okay, attached is a patch to do that. > > It adds a new abstract function, PyNumber_ToBase, that converts an __index__able > integer to an arbitrary base. bin(), oct() and hex() just use it. > (I've left the slots in the PyNumberMethods struct for now.)
> > There was not much library code to change: only tests used the special methods. Beautiful. Check it in please! > Though /me wonders if we shouldn't just expose PyNumber_ToBase as a single > function that mirrors int(str, base). I think not. int(), oct(), hex() mirror the literal notations, and because of this, they can insert the 0[box] prefix. I think the discussions about this issue have revealed that there really isn't any use case for other bases. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From aleaxit at gmail.com Mon Jun 11 18:51:06 2007 From: aleaxit at gmail.com (Alex Martelli) Date: Mon, 11 Jun 2007 18:51:06 +0200 Subject: [Python-3000] rethinking pep 3115 In-Reply-To: <466D57D4.8070702@gmail.com> References: <1d85506f0706091632q7cedac6dj1ce3cb6d5954ec8@mail.gmail.com> <466D57D4.8070702@gmail.com> Message-ID: On Jun 11, 2007, at 4:10 PM, Nick Coghlan wrote: Alex Martelli wrote: (2) the second-best solution i could think of is just passing the dict as a keyword argument to the class, like so: class Spam(metaclass = Bacon, dict = {}): ... so you could explicitly state you need a special dict. I like this one, with classdict being the keyword (dict is the name of a builtin type and we shouldn't encourage the frequent but iffy practice of 'overriding' builtin identifiers). So instead of being able to write: class MyStruct(Struct): first = 1 second = 2 third = 3 everyone defining a Struct subclass has to write: class MyStruct(Struct, classdict=OrderedDict()): first = 1 second = 2 third = 3 Forgive my confusion, but exactly *how* is that meant to be an improvement? Why can't the classdict get inherited just like the metaclass can? I'm not sure, btw, if we want the classdict to be an _instance_ of a mapping, exactly because of inheritance -- a type or factory seems more natural to me. Sure, the metaclass might deal with that if necessary, but I don't see an advantage in making it have to do so. Thus, I'd use classdict=dict instead of classdict={}, etc. The use of a special ordered dictionary should be an internal implementation detail of the Struct class, and PEP 3115 makes it exactly that. The PEP's approach means that simple cases, while possibly being slightly harder to write, will 'just work' when it comes time to use them, while more complicated cases involving multiple metaclasses should still be possible. I will also note that the PEP allows someone to write their own base class which accepts the 'classdict' keyword argument if they so choose. The PEP seems to allow a whole lot of things. I'm with Tomer in wondering whether this lot may be too much. Alex From guido at python.org Mon Jun 11 19:00:02 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 11 Jun 2007 10:00:02 -0700 Subject: [Python-3000] PEP 3135 (New Super) - what to do? Message-ID: I'm very tempted to check in my patch even though the PEP isn't updated (it's been renamed from PEP 367 though). Any objections? It is python.org/sf/1727209, use the latest (topmost) super2.diff patch. This would make the new and improved syntax super().foo(), which gets the class and object from the current frame, as the __class__ cell and the first argument, respectively. Neither __class__ nor super are keywords, but the compiler spots the use of 'super' as a free variable and makes sure the '__class__' is available in any frame where 'super' is used. Code is welcome to also use __class__ directly. 
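For instance, a minimal sketch of the intended usage (the class names here are made up):

    class A:
        def save(self):
            print('A.save')

    class B(A):
        def save(self):
            super().save()         # class and first argument are implicit
            print('B.save')

    class C(B):
        def save(self):
            super(C, self).save()  # the old spelling still works
            assert __class__ is C  # __class__ is usable on its own
            print('C.save')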
It is set to the class before decoration (since this is the only way that I can figure out how to generate the code). The old syntax super(SomeClass, self) still works; also, super(__class__, self) is equivalent (assuming SomeClass is the nearest lexically containing class). -- --Guido van Rossum (home page: http://www.python.org/~guido/) From g.brandl at gmx.net Mon Jun 11 20:48:56 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Mon, 11 Jun 2007 20:48:56 +0200 Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers? In-Reply-To: References: Message-ID: Georg Brandl schrieb: > Guido van Rossum schrieb: >> PEP 3127 (Integer Literal Support and Syntax) introduces new notations >> for octal and binary integers. This isn't implemented yet. Are there >> any takers? It shouldn't be particularly complicated. > > Okay, it's done. > > I'll be grateful for reviews. I've also removed traces of the "L" literal > suffix where I encountered them, but may not have gotten them all. Ah, one thing in the PEP I haven't implemented is the special helpful syntax error message if you have an old-style octal literal in your code. If someone who wouldn't have to dig into tokenizer/parser details could do that, I'd be grateful :) Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. From guido at python.org Mon Jun 11 20:50:39 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 11 Jun 2007 11:50:39 -0700 Subject: [Python-3000] PEP 3127 (integer literal syntax) -- any takers? In-Reply-To: References: Message-ID: On 6/11/07, Georg Brandl wrote: > Georg Brandl schrieb: > > Guido van Rossum schrieb: > >> PEP 3127 (Integer Literal Support and Syntax) introduces new notations > >> for octal and binary integers. This isn't implemented yet. Are there > >> any takers? It shouldn't be particularly complicated. > > > > Okay, it's done. > > > > I'll be grateful for reviews. I've also removed traces of the "L" literal > > suffix where I encountered them, but may not have gotten them all. > > Ah, one thing in the PEP I haven't implemented is the special helpful > syntax error message if you have an old-style octal literal in your code. > > If someone who wouldn't have to dig into tokenizer/parser details could > do that, I'd be grateful :) Or you could leave this up to the 2.6 backport team. :-) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From timothy.c.delaney at gmail.com Mon Jun 11 22:51:07 2007 From: timothy.c.delaney at gmail.com (Tim Delaney) Date: Tue, 12 Jun 2007 06:51:07 +1000 Subject: [Python-3000] PEP 3135 (New Super) - what to do? References: Message-ID: <00dd01c7ac6a$3e767550$0201a8c0@mshome.net> Guido van Rossum wrote: > I'm very tempted to check in my patch even though the PEP isn't > updated (it's been renamed from PEP 367 though). Any objections? Sorry - had to go visit family on the long weekend - only got back late last night. My only objection is the special-casing of the 'super' name - specifically, that it *won't* work if super is assigned to something else, and then called with the no-arg version. But I'm happy to have the changes checked in, and look at whether we can fix that without a performance penalty later. I'll update the PEP when I get the chance to reflect the new direction. 
Ironically, it's now gone back more towards Calvin's original approach (and my original self.super recipe). So - just clarifying the semantics for the PEP: 1. super() is a shortcut for super(__class__, first_arg). Any reason we wouldn't just emit bytecode for the above if we detect a no-arg call of super()? Ah - in case 'super' had been rebound. We could continue to make 'super' a non-rebindable name. 2. __class__ can be called directly. __class__ will be available in any frame that uses either 'super' or '__class__' (including inner functions of methods). What if the function is *not* inside a class (lexically)? Will __class__ exist, or will it be None? > It is python.org/sf/1727209, use the latest (topmost) super2.diff > patch. > This would make the new and improved syntax super().foo(), which gets > the class and object from the current frame, as the __class__ cell and > the first argument, respectively. > > Neither __class__ nor super are keywords, but the compiler spots the > use of 'super' as a free variable and makes sure the '__class__' is > available in any frame where 'super' is used. > > Code is welcome to also use __class__ directly. It is set to the class > before decoration (since this is the only way that I can figure out > how to generate the code). > > The old syntax super(SomeClass, self) still works; also, > super(__class__, self) is equivalent (assuming SomeClass is the > nearest lexically containing class). Tim Delaney
From guido at python.org Mon Jun 11 23:16:44 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 11 Jun 2007 14:16:44 -0700 Subject: [Python-3000] PEP 3135 (New Super) - what to do? In-Reply-To: <00dd01c7ac6a$3e767550$0201a8c0@mshome.net> References: <00dd01c7ac6a$3e767550$0201a8c0@mshome.net> Message-ID: On 6/11/07, Tim Delaney wrote: > Guido van Rossum wrote: > > > I'm very tempted to check in my patch even though the PEP isn't > > updated (it's been renamed from PEP 367 though). Any objections? > > Sorry - had to go visit family on the long weekend - only got back late last > night. > > My only objection is the special-casing of the 'super' name - specifically, > that it *won't* work if super is assigned to something else, and then called > with the no-arg version. Well, what's the use case? I don't see that there would ever be a reason to alias super. So the use case seems to be only semantic purity. > But I'm happy to have the changes checked in, and > look at whether we can fix that without a performance penalty later. There's the rub -- it's easy to always add a reference to __class__ to every method, but that means that every method call slows down a tiny bit on account of passing the __class__ cell. Anyway, I'll check it in. > I'll update the PEP when I get the chance to reflect the new direction. > Ironically, it's now gone back more towards Calvin's original approach (and > my original self.super recipe). Working implementations talk. :-) > So - just clarifying the semantics for the PEP: > > 1. super() is a shortcut for super(__class__, first_arg). Yes. > Any reason we wouldn't just emit bytecode for the above if we detect a > no-arg call of super()? Ah - in case 'super' had been rebound. We could > continue to make 'super' a non-rebindable name. Believe me, modifying the byte code would be much harder. I don't like the idea of non-rebindable names that aren't keywords -- there are too many syntactic loopholes (e.g. someone found a way to bind __debug__ via an argument). > 2. __class__ can be called directly. But why should you?
It's not there for calling, but for referencing (e.g. isinstance). Or maybe you meant "__class__ can be *used* directly." Yes, in that case. > __class__ will be available in any frame that uses either 'super' or > '__class__' (including inner functions of methods). Yeah, but that's only relevant to code digging around in the frame object. More usefully, the __class__ variable will be available in all function definitions that are lexically contained inside a class. > What if the function is *not* inside a class (lexically)? Will __class__ > exist, or will it be None? It will be undefined (i.e. give a NameError). super can still be used, but you must provide arguments as before. I should note that with the current patch, while using __class__ in a nested function works, using super() in a nested function doesn't really work: while it gets the __class__ variable just fine, it gets the first argument of the nested function, which is most likely useless. Not that I can think of a use case for super() in a nested function anyway, but it should be noted. > > It is python.org/sf/1727209, use the latest (topmost) super2.diff > > patch. > > This would make the new and improved syntax super().foo(), which gets > > the class and object from the current frame, as the __class__ cell and > > the first argument, respectively. > > > > Neither __class__ nor super are keywords, but the compiler spots the > > use of 'super' as a free variable and makes sure the '__class__' is > > available in any frame where 'super' is used. > > > > Code is welcome to also use __class__ directly. It is set to the class > > before decoration (since this is the only way that I can figure out > > how to generate the code). > > > > The old syntax super(SomeClass, self) still works; also, > > super(__class__, self) is equivalent (assuming SomeClass is the > > nearest lexically containing class). > > Tim Delaney -- --Guido van Rossum (home page: http://www.python.org/~guido/)
From martin at v.loewis.de Mon Jun 11 23:20:17 2007 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Mon, 11 Jun 2007 23:20:17 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <236066.59081.qm@web33506.mail.mud.yahoo.com> <466C563F.6090305@v.loewis.de> Message-ID: <466DBC91.7050101@v.loewis.de> Ka-Ping Yee schrieb: > Steve Howell wrote: >> I think this whole debate could be put to rest by >> agreeing to err on the side of ascii in 3.0 beta, and >> if in real world experience, that turns out to be the >> wrong decision, simply fix it in 3.0 production, 3.1, >> or 3.2. > > On Sun, 10 Jun 2007, [ISO-8859-1] "Martin v. Löwis" wrote: >> Likewise, this whole debate could also be put to rest >> by agreeing to err on the side of unrestricted support >> for the PEP, and if that turns out to be the wrong >> decision, simply fix any problems discovered in 3.0 >> production, 3.1, or 3.2. > > Your attempted parallel does not match: it breaks code, > whereas Steve's does not. PEP 3131 does not break any code. Regards, Martin
From martin at v.loewis.de Mon Jun 11 23:23:32 2007 From: martin at v.loewis.de (=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Mon, 11 Jun 2007 23:23:32 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <19685.97380.qm@web33511.mail.mud.yahoo.com> <466C5730.3060003@v.loewis.de> Message-ID: <466DBD54.4030006@v.loewis.de> > Python currently provides to everyone the restriction of > identifiers to a character set that everyone knows and trusts.
> Many of us want Python to continue to provide such restriction > for those who want identifiers to be in a character set they > know and trust. This is not incompatible with your desire to > permit alternative character sets, as long as Python offers an > option to make that choice. We can continue to discuss the > details of how that choice is expressed, but this general idea > is a solution that would give us both what we want. > > Can we agree on that? So far, all proposals I have seen *are* incompatible, or had some other flaws, so I'm not certain that this general idea provides a non-empty solution set. Regards, Martin
From martin at v.loewis.de Mon Jun 11 23:26:40 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Mon, 11 Jun 2007 23:26:40 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <20070524234516.8654.JCARLSON@uci.edu> <4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu> <466C5BB7.8050909@v.loewis.de> Message-ID: <466DBE10.1070804@v.loewis.de> > Chinese in particular you would recognize as "not what I expected". > Cyrillic you might not recognize, because it looks like ASCII letters. Please take a look at http://ru.wikipedia.org/wiki/Python In what way does that look like ASCII letters? Cyrillic is *significantly* different from Latin. > Prime (or tone) marks, you might not recognize, because they look > like ASCII quote marks. Not to me. Quote marks are before and after letters; tone marks are above letters. Regards, Martin
From martin at v.loewis.de Mon Jun 11 23:42:36 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Mon, 11 Jun 2007 23:42:36 +0200 Subject: [Python-3000] PEP 3131: what are the risks? In-Reply-To: References: Message-ID: <466DC1CC.2020208@v.loewis.de> > Here are the top three that I see; note that none of these concerns > say "Don't use non-ASCII ids". They do all say "Don't use ids from a > script the user hasn't said to expect". > > (1) Malicious user is indeed one risk. A small probability, but a > big enough loss that I want a warning when the door is unlocked. > > (2) Typos are another risk. Even in mono-lingual environments, it is > possible to get a wrong letter. If you're expecting ì, it is fine. > If you're not, then it shouldn't pass silently. > > (3) "Reados". When doing maintenance later, if I wasn't expecting ì, > I may see it as a regular i, and code that way. Now I have two > doppelganger/döppelganger variables (or inherited methods) serving the > same purpose, but using different memory locations. I can see 1 as a risk, and I agree it has a small probability (because the risk for the submitter of being discovered is much higher). I can't see issues 2 or 3 as a risk. It *never* happened to me that I mistakenly typed ì, as this just isn't on my keyboard. If it was on my keyboard, I would be using a natural language that actually uses that character, and then my eye would be trained to easily recognize the typo. Likewise for 3: I could *never* confuse these two words, and would always recognize both of them as typos for doppelgänger (which is where the umlauts really belong). To elaborate on the ì issue: there is a mode for German keyboards where the accent characters are "dead", i.e. you type the accent character first, then the regular character. I usually turn that mode off, but even if it was on, I would not *mistakenly* type ` first, then the i.
If I type ` on a keyboard with dead keys, I *always* get puzzled about no character appearing, and then if the next vowel eats the character, I immediately recognize: I meant to type a backquote, but got none. If the backquote is part of the syntax, the vowel "eating" it actually makes the entire text a syntax error. Regards, Martin
From jimjjewett at gmail.com Mon Jun 11 23:55:58 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Mon, 11 Jun 2007 17:55:58 -0400 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <466DBE10.1070804@v.loewis.de> References: <20070524234516.8654.JCARLSON@uci.edu> <4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu> <466C5BB7.8050909@v.loewis.de> <466DBE10.1070804@v.loewis.de> Message-ID: On 6/11/07, "Martin v. Löwis" wrote: > > Chinese in particular you would recognize as "not what I expected". > > Cyrillic you might not recognize, because it looks like ASCII letters. > Please take a look at http://ru.wikipedia.org/wiki/Python > In what way does that look like ASCII letters? Cyrillic is > *significantly* different from Latin. In long stretches of long words, yes. In isolated abbreviations, not so much. From the second key-value pair in the top box, the value is интерпретатор. I can tell that isn't English, but I have to slow down a bit before I recognize that it isn't ASCII. (The "N" is backwards, and the "n"-looking thing between the "p"s isn't quite an n.) ??????????? I wouldn't recognize at all (except that for the next few weeks, I might know to check). One reason this matters -- even when the original author had good intentions -- is that I edit my code as text, rather than graphics. I will often retype rather than cutting and pasting. Since Тор and НТер are not the same as the visually similar Top and HTep, that will eventually cause problems. If I can say "I accept Latin-1, but not Cyrillic", then I won't have this problem; at the very least, I will be forewarned. -jJ
From jimjjewett at gmail.com Mon Jun 11 23:59:12 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Mon, 11 Jun 2007 17:59:12 -0400 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <466DBC91.7050101@v.loewis.de> References: <236066.59081.qm@web33506.mail.mud.yahoo.com> <466C563F.6090305@v.loewis.de> <466DBC91.7050101@v.loewis.de> Message-ID: On 6/11/07, "Martin v. Löwis" wrote: > Ka-Ping Yee schrieb: > > Steve Howell wrote: > >> I think this whole debate could be put to rest by > >> agreeing to err on the side of ascii in 3.0 beta, and > >> if in real world experience, that turns out to be the > >> wrong decision, simply fix it in 3.0 production, 3.1, > >> or 3.2. > > On Sun, 10 Jun 2007, [ISO-8859-1] "Martin v. Löwis" wrote: > >> Likewise, this whole debate could also be put to rest > >> by agreeing to err on the side of unrestricted support > >> for the PEP, and if that turns out to be the wrong > >> decision, simply fix any problems discovered in 3.0 > >> production, 3.1, or 3.2. > > Your attempted parallel does not match: it breaks code, > > whereas Steve's does not. > PEP 3131 does not break any code. Going with the widest possible set of source characters (as PEP 3131 does) and restricting them later (in 3.1) would break code. Going with a smaller set of possible source characters and expanding them later would not break code. Going with ASCII by default plus locally approved charset-extensions would not break code unless the new restrictions overrode local decisions.
-jJ
From martin at v.loewis.de Tue Jun 12 00:13:34 2007 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Tue, 12 Jun 2007 00:13:34 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <20070524234516.8654.JCARLSON@uci.edu> <4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu> <466C5BB7.8050909@v.loewis.de> <466DBE10.1070804@v.loewis.de> Message-ID: <466DC90E.1070009@v.loewis.de> > One reason this matters -- even when the original author had good > intentions -- is that I edit my code as text, rather than graphics. I > will often retype rather than cutting and pasting. Since Тор and НТер > are not the same as the visually similar Top and HTep, that will > eventually cause problems. It's actually unlikely that you encounter "Тор" or "НТер" - they don't mean anything in Russian (FWIW, интерпретатор
means interpreter; so "Тор" is akin "tor" and "НТер" akin "nter"). I cannot believe that you would actually consider retyping code that contains Cyrillic characters (you won't understand what it does, will you?), and even if you did - how would an ASCII-only flag on the interpreter help? If you type Top and HTep (again, please look in my eyes and tell me that you would *actually* type in these identifiers), the error in the interpreter won't trigger. Regards, Martin
From bjourne at gmail.com Tue Jun 12 00:50:22 2007 From: bjourne at gmail.com (=?ISO-8859-1?Q?BJ=F6rn_Lindqvist?=) Date: Tue, 12 Jun 2007 00:50:22 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <466DBD54.4030006@v.loewis.de> References: <19685.97380.qm@web33511.mail.mud.yahoo.com> <466C5730.3060003@v.loewis.de> <466DBD54.4030006@v.loewis.de> Message-ID: <740c3aec0706111550s38e32d1dsf307f4e2c16c71e4@mail.gmail.com> On 6/11/07, "Martin v. Löwis" wrote: > > Python currently provides to everyone the restriction of > > identifiers to a character set that everyone knows and trusts. > > Many of us want Python to continue to provide such restriction > > for those who want identifiers to be in a character set they > > know and trust. This is not incompatible with your desire to > > permit alternative character sets, as long as Python offers an > > option to make that choice. We can continue to discuss the > > details of how that choice is expressed, but this general idea > > is a solution that would give us both what we want. > > > > Can we agree on that? > > So far, all proposals I have seen *are* incompatible, or had > some other flaws, so I'm not certain that this general idea > provides a non-empty solution set. python -ascii-only -- mvh Björn
From jimjjewett at gmail.com Tue Jun 12 01:13:14 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Mon, 11 Jun 2007 19:13:14 -0400 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <466DC90E.1070009@v.loewis.de> References: <20070524234516.8654.JCARLSON@uci.edu> <4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu> <466C5BB7.8050909@v.loewis.de> <466DBE10.1070804@v.loewis.de> <466DC90E.1070009@v.loewis.de> Message-ID: On 6/11/07, "Martin v. Löwis" wrote: > > One reason this matters -- even when the original author had good > > intentions -- is that I edit my code as text, rather than graphics. I > > will often retype rather than cutting and pasting. Since Тор and НТер > > are not the same as the visually similar Top and HTep, that will > > eventually cause problems. > It's actually unlikely that you encounter "Тор" or "НТер" - they > don't mean anything in Russian (FWIW, интерпретатор means interpreter; > so "Тор" is akin "tor" and "НТер" akin "nter"). > I cannot believe that you would actually consider retyping code > that contains Cyrillic characters Not if I realized they were Cyrillic -- and that is exactly my point. By allowing any unicode letters, we would allow Cyrillic, and I might open a file that uses Cyrillic without realizing it. By allowing ASCII + locally approved charsets, I either won't have Cyrillic identifiers, or I will have turned them on explicitly, and will know to look out for them. > would you?), and even if you did - how would an ASCII-only flag > on the interpreter help? With ASCII-only, I would have gotten an error when I loaded the original module in the first place, so I would know that I'm dealing with Cyrillic (or at least with non-ASCII). > If you type Top and HTep (again, please > look in my eyes and tell me that you would *actually* type in > these identifiers), the error in the interpreter won't trigger. To repeat: Yes, if I thought those were the variable names, I would type them -- and I've seen dumber variable names than those. Of course, I wouldn't type them if I knew they were wrong. With an ASCII-only install, I would get that error-check because the (remaining original uses) were in Cyrillic. With an "any unicode character" install, ... well, I might figure out my problem the next morning. -jJ
From pje at telecommunity.com Tue Jun 12 01:18:40 2007 From: pje at telecommunity.com (Phillip J. Eby) Date: Mon, 11 Jun 2007 19:18:40 -0400 Subject: [Python-3000] Pre-PEP on fast imports In-Reply-To: <466DD0D8.7040407@develer.com> References: <466DD0D8.7040407@develer.com> Message-ID: <20070611231640.8A55E3A407F@sparrow.telecommunity.com> At 12:46 AM 6/12/2007 +0200, Giovanni Bajo wrote: >Hi Philip, > >I'm going to submit a PEP for Python 3000 (and possibly backported >as an option off by default in Python 2). It's related to imports >and how to make them faster. Given your expertise on the subject, >I'd appreciate if you could review my ideas. I briefly spoke of it >with Alex Martelli a few days ago at PyCon Italia and he was not >negative about it. > >Problems: > >- A single import causes many syscalls (.pyo, .pyc, .py, in both >directory and .zip file). >- Situation is getting worse and worse with the advent of >easy_install which produces many .pth files (longer sys.path). >- Python startup time is slow, and a noticeable fraction of it is >dominated by site.py-related stuff (a simple hello world takes >0.012s if run without -S, and 0.008s if run with -S). >- Many people might not be interested in this, but others are really >concerned. Eg: again at PyCon Italia, I spoke with one of the >leading Sugar programmers (OLPC) who told me that one of the biggest >blockers right now is the python startup time (applications on the latest >OLPC prototype take 3-4 seconds to start up). He suggested that this >was related to the large number of syscalls made for imports. > > >Proposed solution: > >- A site cache is introduced. It's a dictionary mapping module names >to absolute file paths. >- When an import occurs, for each directory/zipfile we walk in >sys.path, we read all directory entries, and update the site cache >with all the Python modules found in it (all the Python modules >found in the directory/zipfile). >- If the filepath for a certain module is found in the site cache, >the module is directly accessed. Otherwise, sys.path is walked. >- The site cache can be cleared with sys.clear_site_cache().
This >must be used after manual editing of sys.path (or could be done >automatically by making sys.path a list subclass which notices each >modification). >- The site cache must be manually cleared if a Python file is added >to a directory in sys.path after the application has started. This >is a rare-enough scenario to require an additional explicit call. >- If for whatever reason a filepath found in the site cache cannot >be accessed (unmounted device, whatever) ImportError is raised. >Again, this is something which is very rare and does not require >much attention. Here's a simpler solution, one that's easily testable using existing Python versions. Create a subclass of pkgutil.ImpImporter (Python >=2.5) that caches a listdir of its contents, and uses it to immediately reject any find_module() requests for which matching data is not in its cached listdir. Add this class to sys.path_hooks, and see if it speeds things up. If it doesn't produce an improvement, your more-ambitious version of the idea won't work. If it does produce an improvement, it's likely to be much simpler to implement at the C level than your idea is. Meanwhile, it doesn't tear up the import machinery with a new special-purpose mechanism; it simply leverages the existing hooks. The subclass might look something like this:

    import imp, os, sys
    from pkgutil import ImpImporter

    suffixes = set(ext for ext, mode, typ in imp.get_suffixes())

    class CachedImporter(ImpImporter):
        def __init__(self, path):
            if not os.path.isdir(path):
                raise ImportError("Not an existing directory")
            super(CachedImporter, self).__init__(path)
            self.refresh()

        def refresh(self):
            # Cache the base names of everything in this directory that
            # could satisfy an import: modules with a recognized suffix,
            # plus subdirectories (candidate packages).
            self.cache = set()
            for fname in os.listdir(self.path):
                base, ext = os.path.splitext(fname)
                if ext in suffixes and '.' not in base:
                    self.cache.add(base)
                elif '.' not in fname and os.path.isdir(os.path.join(self.path, fname)):
                    self.cache.add(fname)

        def find_module(self, fullname, path=None):
            if fullname.split(".")[-1] not in self.cache:
                return None  # no need to check further
            return super(CachedImporter, self).find_module(fullname, path)

    sys.path_hooks.append(CachedImporter)

Stick this at the top of your site.py and see what happens. I'll be interested to hear the results. (Notice, by the way, that with this implementation one can easily clear the entire cache by clearing sys.path_importer_cache, or deleting the entry for a specific path, as well as by taking the entry for that path and calling its refresh() method.)
From baptiste13 at altern.org Tue Jun 12 01:48:47 2007 From: baptiste13 at altern.org (Baptiste Carvello) Date: Tue, 12 Jun 2007 01:48:47 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <1A0C026D-CB40-40AE-8E89-738ECD42D01E@gmail.com> References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> <1A0C026D-CB40-40AE-8E89-738ECD42D01E@gmail.com> Message-ID: Leonardo Santagada a écrit : > I don't. It is a bad idea to distribute non-ASCII code for libraries > that are supposed to be used by the world as a whole. But > distributing a Chinese code for doing something like taxes using > Chinese rules is ok and should be encouraged (now, I don't know they > have taxes in China, but that is not the point). > I wouldn't be so sure. In open source, you never know in advance to whom your code can be useful.
Maybe some part of your Chinese tax software can be refactored into a more generic library. If you write the software with non-ASCII identifiers, this refactored library won't be usable for non-Chinese speakers. A good opportunity will be missed, but *you won't even know*. > No they are not, people doing open source work are probably going to > still be coding in English so that is not a problem, but that Chinese > tax system if it is open sourced people in China can easily help > fixing bugs because identifiers are in their own language, which they > can identify. > Good point, but I'm not sure it is so much more difficult to identify identifiers, given that you already need to know ASCII characters in order to identify the keywords. Sure, you won't understand what the identifiers mean, but you'll probably be able to tell them from one another. > The thing is, people are predicting a future for python code in the > open source world. One in which devs of open source libraries and > programs will start coding in different languages if you support > unicode identifiers, something that is not common today (using some > form of ASCIIfication of their languages) and didn't happen with the > Java, C#, Javascript and Common Lisp communities. In light of all > that I think this prediction is probably wrong. > Well that's only true for the open source libraries and programs *that we know of*. Maybe there is useful software that we don't know of, precisely because it is not "marketed" to a global audience. That's what I call lost opportunities. > We are all consenting > adults and we know that we should code in English if we want our code > to be used and to be a first class citizen of the open source world. > What do you have to support your prediction? > I have experience in another community, namely the community of physicists. Here, most people don't know in advance how you're supposed to write open source code. They learn by doing. And if someone starts coding with non-ASCII identifiers, he won't have time to recode his program later. So he will simply not publish it. Lost opportunity again. Cheers, BC
From baptiste13 at altern.org Tue Jun 12 02:00:44 2007 From: baptiste13 at altern.org (Baptiste Carvello) Date: Tue, 12 Jun 2007 02:00:44 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> Message-ID: James Y Knight a écrit : > If another developer is planning to write code in English, this whole > debate is moot. So, let's take as a given that he is going to write a > program in his own non-English language. Now, will he write in an > asciified form of his language, or using the proper character set? > Right now, the only option is the first. The PEP proposes to also > allow the second. That's a very nice summary of the situation. > So, your question should be: is it easier to understand an ASCIIified > form of another language, or the actual language itself?
For me (who > doesn't speak said language, nor perhaps even know its character > set), I'm pretty sure the answer is still going to be the second: I'd > rather a program written in Chinese use Chinese characters, rather > than a transliteration of Chinese into ASCII. > This is where we strongly disagree. If an identifier is written in transliterated Chinese, I cannot understand what it means, but I can recognise it when it is used in the code. I will then find out the meaning from the context. By contrast, with Chinese identifiers, I will not recognise them from one another. So I won't be able to make any sense from the code without going through the complex task of translating everything. > because it is actually > feasible for me to do automatic translation of Chinese into something > resembling English. And of course, that's even more true when talking > about a language like French, which uses an alphabet quite familiar > to me, but yet online translators still fail to function if it's been > transliterated into ASCII. > Dream on! Automatic translation won't work. For example, if you actually try feeding python code to a French-to-English translator, you might be surprised by what happens to the keyword "if" (just try it :-). You would have to translate the identifiers one by one, which is not practical. Cheers, BC
From baptiste13 at altern.org Tue Jun 12 02:22:04 2007 From: baptiste13 at altern.org (Baptiste Carvello) Date: Tue, 12 Jun 2007 02:22:04 +0200 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: <466CBC5A.3050907@v.loewis.de> References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> <466CBC5A.3050907@v.loewis.de> Message-ID: Martin v. Löwis a écrit : >>> Indeed, PEP 3131 gives a predictable identifier character set. >>> Adding per-site options to change the set of allowable characters >>> makes it less predictable. >>> >> true. However, this will only matter if you distribute code with non-ASCII >> identifiers to the wider public. > > No - it will matter for any kind of distribution, not just to the "wider > public". If I move code to the next machine it may stop working, > if that machine is controlled by you (or your sysadmin), you should be able to reconfigure Python the way you like. However, I have to agree that this is suboptimal. > or if I upgrade to the next Python version, assuming the default is > to restrict identifiers. > That would only happen if the default changes to a more strict rule. If we start with ASCII only, this is unlikely to ever happen! >> The real question is: transparent *to whom*. Transparent to the developer >> himself when he rereads his own code (which I value as a developer), or >> transparent to the user of the program when he tries to fix a bug (which I value >> as a user of open-source software)? Non-ASCII identifiers are marginally better >> for the first case, but can be dramatically worse for the second one. Clearly, >> there is a tradeoff. > > Why do you say that? Non-ASCII identifiers significantly improve the > readability of code to speakers of the natural language from which > the identifiers are drawn. With ASCII identifiers, the reader needs
With ASCII identifiers, the reader needs > to understand the English words, or recognize the transliteration. > With non-ASCII identifiers, the intended meaning of the class or > function becomes immediately apparent, in the way identifiers have > always been self-documentation for English-speaking people. > my problem is then: what happens if the reader does not speak the same language as the author of the code? Right now, if I come across python code written in a language I don't speak, I can still try to make sense of it. Sure, I may have to do without the comments, sure, I may not understand what the identifier names mean. But I can still follow the instructions flow and try to figure out what happens. With non-ASCII identifiers, I cannot do that because I cannot recognise the identifiers from one another. >>>> That is what makes these strengths so important. I hope this >>>> helps you understand why these concerns can't and shouldn't be >>>> brushed off as "paranoia" -- this really has to do with the >>>> core values of the language. >>> It just seems that the concerns don't directly follow from >>> the principles. Something else has to be added to make that >>> conclusion. It may not be paranoia (i.e. excessive anxiety), >>> but there surely is some fear, no? >>> >> That argument is not really honest :-) Every risk can be estimated opimistically >> or pessimistically. In both cases, there is some part of irrationallity. > > Still, what is the risk being estimated? Is it that somebody > maliciously tries to provide patches that use look-alike > characters? I honestly don't know what risks you see. > Well, I have not followed acurately the discussion about security risks. However, I see a much simpler risk: the risk that I come across with code that is technically open source, but that I can't even debug in case of need because I cannot make sense of it. This would reduce the usefulness of such code, and cause fragmentation for the community. Cheers, BC From gproux+py3000 at gmail.com Tue Jun 12 02:34:26 2007 From: gproux+py3000 at gmail.com (Guillaume Proux) Date: Tue, 12 Jun 2007 09:34:26 +0900 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> Message-ID: <19dd68ba0706111734t1f0f12d6y6c6584e41bf6e211@mail.gmail.com> Hello, On 6/12/07, Baptiste Carvello wrote: > context. By contrast, with chineses identifiers, I will not recognise them from > one another. So I won't be able to make any sense from the code without going > through the complex task of translating everything. You would be surprised how well you can do if you would actually try to recognize a set of Chinese characters, especially if you would use some tool to put a meaning on them. Well, I never formally learned any Chinese (nor any Japanese actually) , but I can now effortlessly parse both languages now. But really, if you ever find any code with Chinese written all over it that you would believe might be very useful to you, you would have one of the following choice: (a) use a tokenizer and use some tool to do a hanzi -> ascii automatic transliteration/translation (b) try to wrap the Chinese things with an ASCII veil (which would make you work on your Chinese a bit) or you could ask your Chinese girlfriend to help you (WHAT you don't have a Chinese girlfriend yet? 
:))
(c) actually contact the person who submitted the code to let him know you are very much interested in the code....

In most cases, this would give you the possibility to reach out to different communities and to work together with people with whom you might never have talked before. From what we can see on English-only mailing lists, this is the kind of Python users we don't normally have access to currently, because they simply are secluded in their own little universe, in the comfortable realm of their own linguistic barriers.

Of course, sometimes they step out and offer a plea for help on an English ML in broken English... PEP 3131 is unlikely to change this. However, I can see it might have two ethnically interesting consequences:
1) Python usage in communities where ASCII has little place should find more uses, because people will become empowered with Python and able to express themselves like never before: my bet is that, for example, the Japanese Python community will become stronger and welcome new people, younger and older, who do not know much English.
2) If ever a program written with non-ASCII characters finds some good use in ASCII-only communities, then the usual plea for help will be reversed. People will seek out e.g. Japanese programmers and request help, maybe in broken Japanese. From this point on, all programming communities will be on an equal footing and able to talk together on the same standpoint. I guess you know "Liberté, Égalité, Fraternité". Maybe this should be the PEP subtitle.

>> what happens to the keyword "if" (just try it :-). You would have to translate
>> the identifiers one by one, which is not practical.

That would be possible with the tokenizer, actually :)

Droit comme un if !

À bientôt,

Guillaume

From baptiste13 at altern.org Tue Jun 12 02:34:54 2007
From: baptiste13 at altern.org (Baptiste Carvello)
Date: Tue, 12 Jun 2007 02:34:54 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
References: <19685.97380.qm@web33511.mail.mud.yahoo.com> <466C5730.3060003@v.loewis.de> 
Message-ID: 

Michael Urman wrote:
> On 6/11/07, Ka-Ping Yee wrote:
>> Because the existence of these library modules does not make it
>> impossible to reliably read source code. We're talking about
>> changing the definition of the language here, which is deeper
>> than adding or removing things in the library.
>
> This has already been demonstrated to be false - you already cannot
> visually inspect a printed python program and know what it will do.
> There is the risk of visually aliased identifiers, but how is that
> qualitatively worse than the truly conflicting identifiers you can
> import with a *, or have inserted by modules mucking with
> __builtins__?
>
Oh come on! Imports are usually located at the top of the file, so they won't clobber other names. And mucking with __builtins__ is rare and frowned upon. On the contrary, non-ASCII identifiers will be encouraged, anywhere in the code. The amount of information you get from today's Python code is most of the time sufficient for debugging, or for using it as an inspiration. With non-ASCII identifiers, these features will be lost to all users who cannot read the needed characters. Denying the problem is not a good way to answer other people's concerns.
Cheers,
BC

From baptiste13 at altern.org Tue Jun 12 02:38:06 2007
From: baptiste13 at altern.org (Baptiste Carvello)
Date: Tue, 12 Jun 2007 02:38:06 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466C5BB7.8050909@v.loewis.de>
References: <20070524234516.8654.JCARLSON@uci.edu> <4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu> <466C5BB7.8050909@v.loewis.de>
Message-ID: 

Martin v. Löwis wrote:
> I cannot imagine this scenario as realistic. It is certainly realistic
> that you want to keep your own code base ASCII-only - what I don't
> understand is why such a policy would extend to libraries that you use.
> If the interfaces of the library are non-ASCII, you will automatically
> notice; if it only has some non-ASCII identifiers inside, why would
> you bother?
>
Well, for the same reason I prefer to use open source software: because I can debug it in case of need, and because I can use it as an inspiration if I need to write a similar program.

Baptiste

From showell30 at yahoo.com Tue Jun 12 02:45:19 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Mon, 11 Jun 2007 17:45:19 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
Message-ID: <511504.28584.qm@web33502.mail.mud.yahoo.com>

--- Baptiste Carvello wrote:
> Leonardo Santagada wrote:
> > I don't. It is a bad idea to distribute non-ASCII code for libraries
> > that are supposed to be used by the world as a whole. But
> > distributing Chinese code for doing something like taxes using
> > Chinese rules is ok and should be encouraged (now, I don't know if they
> > have taxes in China, but that is not the point).
> >
> I wouldn't be so sure. In open source, you never know in advance to whom your
> code can be useful. Maybe some part of your Chinese tax software can be
> refactored into a more generic library. If you write the software with
> non-ASCII identifiers, this refactored library won't be usable for non-Chinese
> speakers. A good opportunity will be missed, but *you won't even know*.
>

A couple of people have made the point that it's easier for a non-Chinese-speaking person to translate from Unicode Chinese to their target language than from ASCII pseudo-Chinese, due to the current state of the art of translation engines like Babelfish, Google, etc.

A more likely translation scenario is that somebody semi-literate in a language attempts the translation. For example, I'm not fluent in French, but I could translate a small useful French module to English without too much effort, assuming that the underlying algorithms were within my capability and I had Babelfish to overcome my rusty high school French. Here Unicode would probably help me, unless my browser were just completely lame and the accents somehow encumbered my ability to copy and paste. My French spelling when it comes to accents is bad, but accents don't affect me when it comes to reading.

The most likely translation scenario is that somebody truly bilingual does the translation. I'm sure there are something like 50,000 people in the U.S. alone who are Chinese/English bilingual, and once you get something translated to English, that opens up even more doors.

Having said all that, I agree with the underlying premise that the availability of Unicode will provide some mild disincentive for the original authors to publish their work in English, to the extent that the author doesn't predict (or stand to benefit from?) the utility of his module outside the Chinese-reading community.
But you do have to weigh that against the disincentive to write the module in the first place, if ASCII is the only option.

From brett at python.org Tue Jun 12 03:53:43 2007
From: brett at python.org (Brett Cannon)
Date: Mon, 11 Jun 2007 18:53:43 -0700
Subject: [Python-3000] Pre-PEP on fast imports
In-Reply-To: <20070611231640.8A55E3A407F@sparrow.telecommunity.com>
References: <466DD0D8.7040407@develer.com> <20070611231640.8A55E3A407F@sparrow.telecommunity.com>
Message-ID: 

On 6/11/07, Phillip J. Eby wrote:
>
> At 12:46 AM 6/12/2007 +0200, Giovanni Bajo wrote:
> >Hi Phillip,
> >
> >I'm going to submit a PEP for Python 3000 (and possibly backported
> >as an option, off by default, in Python 2). It's related to imports
> >and how to make them faster. Given your expertise on the subject,
> >I'd appreciate it if you could review my ideas. I briefly spoke of it
> >with Alex Martelli a few days ago at PyCon Italia and he was not
> >negative about it.
> >
> >Problems:
> >
> >- A single import causes many syscalls (.pyo, .pyc, .py, in both
> >directory and .zip file).
> >- The situation is getting worse and worse with the advent of
> >easy_install, which produces many .pth files (longer sys.path).
> >- Python startup time is slow, and a noticeable fraction of it is
> >dominated by site.py-related stuff (a simple hello world takes
> >0.012s if run without -S, and 0.008s if run with -S).
> >- Many people might not be interested in this, but others are really
> >concerned. E.g.: again at PyCon Italia, I spoke with one of the
> >leading Sugar programmers (OLPC) who told me that one of the biggest
> >blockers right now is the Python startup time (applications on the latest
> >OLPC prototype take 3-4 seconds to start up). He suggested that this
> >was related to the large number of syscalls made for imports.
> >
> >
> >Proposed solution:
> >
> >- A site cache is introduced. It's a dictionary mapping module names
> >to absolute file paths.
> >- When an import occurs, for each directory/zipfile we walk in
> >sys.path, we read all directory entries, and update the site cache
> >with all the Python modules found in that directory/zipfile.
> >- If the filepath for a certain module is found in the site cache,
> >the module is directly accessed. Otherwise, sys.path is walked.
> >- The site cache can be cleared with sys.clear_site_cache(). This
> >must be used after manual editing of sys.path (or could be done
> >automatically by making sys.path a list subclass which notices each
> >modification).
> >- The site cache must be manually cleared if a Python file is added
> >to a directory in sys.path after the application has started. This
> >is a rare enough scenario to require an additional explicit call.
> >- If for whatever reason a filepath found in the site cache cannot
> >be accessed (unmounted device, whatever) ImportError is raised.
> >Again, this is something which is very rare and does not require
> >much attention.

> Here's a simpler solution, one that's easily testable using existing
> Python versions. Create a subclass of pkgutil.ImpImporter
> (Python >=2.5) that caches a listdir of its contents, and uses it to
> immediately reject any find_module() requests for which matching data
> is not in its cached listdir.
> Add this class to sys.path_hooks, and
> see if it speeds things up.

I thought about this use case when writing importlib for lowering the penalty of importing over NFS, and this is exactly how I would do it as well (except I would use the code from importlib instead of pkgutil =).

-Brett
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20070611/4191e4fe/attachment.html

From stephen at xemacs.org Tue Jun 12 05:02:24 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 12 Jun 2007 12:02:24 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
References: <20070524234516.8654.JCARLSON@uci.edu> <4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu> <466C5BB7.8050909@v.loewis.de> <466DBE10.1070804@v.loewis.de> <466DC90E.1070009@v.loewis.de> 
Message-ID: <87fy4xg9j3.fsf@uwakimon.sk.tsukuba.ac.jp>

Jim Jewett writes:

> Of course, I wouldn't type them if I knew they were wrong. With an
> ASCII-only install, I would get that error-check because the
> (remaining original uses) were in Cyrillic. With an "any unicode
> character" install, ... well, I might figure out my problem the next
> morning.

But this is something that only a small subset of developers-of-Python seem to be concerned about. If that's generally the case for all developers-in-Python, shouldn't the burden be placed on those who do care?

It seems to me that rather than *impose* restrictions on third parties, the sensible thing to do is to provide those restrictions to those who want them. But as Guido points out, that's outside of the scope of this PEP because it can easily be done by external tools.

You object that running an auditor program would "cramp your style". I don't mean that in a pejorative way; like Josiah's desire to continue using certain tools, a developer's style is a BCP for him and should *not* be gratuitously undermined. But I see no reason why that auditor program can't be run as a PEP 263 codec. AFAICS, the following objections could be raised, and answered:

1. PEP 263 codecs delegate the decision to the code's author; an auditor shouldn't do that.

You personally could modify your Python installation to replace all the codecs with a wrapper codec that processes the input by calling the "real" codec, then audits the resulting stream as it passes it back to the compiler. But it can be done with a vanilla Python executable today. This is *proof of concept*; possibly there should be a UI to install such a codec via command line flag or environment variable, although there may be other creative ways to install it without altering the current interface to PEP 263 codecs. I'm not yet familiar enough with the implementation to guess.

2. The auditor would have to duplicate the work of the parser, and might get it wrong.

AIUI, the parser is available as a module. Use it.

3. Parsing is expensive in time and other resources.

No, it's not. It's the other stuff that the compiler does that is expensive. This is going to be O(size of source), like any codec, with a somewhat higher constant than typical codecs. More important, AIUI PEP 263 codecs don't get run on compiled code, so in a production environment it isn't an issue.

That doesn't mollify those who think I should not be allowed to use non-ASCII identifiers at all. But I think that should work for you (modulo the UI for invoking the auditor). Does it?
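[For concreteness, here is a minimal sketch of the auditing step such a wrapper codec would perform, running the real tokenizer over already-decoded source. All the names here are hypothetical, the LATIN-only policy is an assumption, and the script check via unicodedata.name() is deliberately crude:]

    import io
    import tokenize
    import unicodedata

    # Hypothetical per-user policy; not a real Python setting.
    ALLOWED_SCRIPTS = ("LATIN",)

    def audit_source(source):
        """Yield (identifier, lineno) for identifiers outside the allowed scripts."""
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok[0] != tokenize.NAME:
                continue
            for ch in tok[1]:
                if ord(ch) < 128:
                    continue  # ASCII is always acceptable
                charname = unicodedata.name(ch, "")
                if not charname.startswith(ALLOWED_SCRIPTS):
                    yield tok[1], tok[2][0]
                    break

    # Example use:
    #     for ident, lineno in audit_source(source_text):
    #         print("suspicious identifier %r on line %d" % (ident, lineno))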
From murman at gmail.com Tue Jun 12 04:56:10 2007
From: murman at gmail.com (Michael Urman)
Date: Mon, 11 Jun 2007 21:56:10 -0500
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
References: <19685.97380.qm@web33511.mail.mud.yahoo.com> <466C5730.3060003@v.loewis.de> 
Message-ID: 

On 6/11/07, Baptiste Carvello wrote:
> Michael Urman wrote:
> > There is the risk of visually aliased identifiers, but how is that
> > qualitatively worse than the truly conflicting identifiers you can
> > import with a *, or have inserted by modules mucking with
> > __builtins__?
> >
> Oh come on! Imports are usually located at the top of the file, so they won't
> clobber other names. And mucking with __builtins__ is rare and frowned upon. On
> the contrary, non-ASCII identifiers will be encouraged, anywhere in the code.
> The amount of information you get from today's Python code is most of the time
> sufficient for debugging, or for using it as an inspiration. With non-ASCII
> identifiers, these features will be lost to all users who cannot read the
> needed characters. Denying the problem is not a good way to answer other
> people's concerns.

I think you overestimate my understanding of "the problem". To me there is no problem (equal parts blindness and YAGNI; neither feels like denial). As I am not going to be interested in trying to understand code written in Chinese, Russian, etc., I'm not bothered by the idea that someone might write code I will have a strong disincentive to read. Am I underrating this concern because it doesn't bother me? I don't see transliterated-into-ASCII as any better for comprehension.

So to me that leaves the various potential aliasing problems that have been described, and those honestly feel to me on par with import * and builtins hackery. Yes, these are discouraged, and aren't cause for major concern. Similarly, code intentionally designed to confuse would be discouraged.

I understand that Ka-Ping and several others do see visual aliasing as a problem, so that is why I asked how it's qualitatively worse. I'm hoping that seeing answers from that angle (how is the potential for aliasing worse than the potential for overriding int or str or __import__ in some module you import) will help me understand why what seems to me like a non-issue can be so important to others whose opinions I respect. Is your concern that a flood of library code you cannot read will be the only code written for things you want to do? Or something else entirely?

-- 
Michael Urman

From stephen at xemacs.org Tue Jun 12 05:35:44 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 12 Jun 2007 12:35:44 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> <1A0C026D-CB40-40AE-8E89-738ECD42D01E@gmail.com> 
Message-ID: <87ejkhg7zj.fsf@uwakimon.sk.tsukuba.ac.jp>

Baptiste Carvello writes:

> I wouldn't be so sure. In open source, you never know in advance to
> whom your code can be useful. Maybe some part of your Chinese tax
> software can be refactored into a more generic library.
> If you
> write the software with non-ASCII identifiers, this refactored
> library won't be usable for non-Chinese speakers. A good
> opportunity will be missed, but *you won't even know*.

You won't know anyway, because you can't read the project's home page. You won't be able to refactor, because you can't read the comments. Only if the developer already has the ASCII/English discipline will it be practical for a third party to do the refactoring. Otherwise, it will be easier and more reliable to just write from scratch. Such developers, who want a global audience, will pretty quickly learn that external APIs need to be ASCII.

> good point, but I'm not sure it is so much more difficult to
> identify identifiers, given that you already need to know ASCII
> characters in order to identify the keywords. Sure, you won't
> understand what the identifiers mean, but you'll probably be able
> to tell them from one another.

It is much harder to do so if your everyday language does not use a Latin alphabet. At least at the student level, here in Japan, Japanese students (and to some extent, Chinese and Korean exchange students) make many more spelling typos and punctuation errors in their programs than do exchange students from the West. One sees a lot of right/write (or, for Japanese, right/light) errors, as well as transpositions and one-letter mis-entries that most native speakers automatically catch even in a cursory review. I don't know that it would be better if they could use Japanese identifiers, but I'd sure like to try.

> > We are all consenting adults and we know that we should code in
> > English if we want our code to be used and to be a first class
> > citizen of the open source world. What do you have to support
> > your prediction?
>
> I have experience in another community, namely the community of
> physicists. Here, most people don't know in advance how you're
> supposed to write open source code. They learn by doing. And if
> someone starts coding with non-ASCII identifiers, he won't have
> time to recode his program later. So he will simply not publish
> it. Lost opportunity again.

Why won't he publish it? The only reason I can see is that somebody has indoctrinated him with all the FUD about how non-ASCII identifiers make a program useless. If you tell him the truth, "it will be more useful to the world if you make those identifiers ASCII", won't the great majority of physicists just say "it does what people like me need, generalization is somebody else's job" and publish as-is?

From martin at v.loewis.de Tue Jun 12 06:59:26 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Tue, 12 Jun 2007 06:59:26 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
References: <20070524234516.8654.JCARLSON@uci.edu> <4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu> <466C5BB7.8050909@v.loewis.de>
Message-ID: <466E282E.5070603@v.loewis.de>

Baptiste Carvello schrieb:
> Martin v. Löwis wrote:
>> I cannot imagine this scenario as realistic. It is certainly
>> realistic that you want to keep your own code base ASCII-only -
>> what I don't understand is why such a policy would extend to libraries
>> that you use. If the interfaces of the library are non-ASCII, you
>> will automatically notice; if it only has some non-ASCII
>> identifiers inside, why would you bother?
>>
> well, for the same reason I prefer to use open source software:
> because I can debug it in case of need, and because I can use it as
> an inspiration if I need to write a similar program.

Ok, but why do you then need *Python* to tell you that the file has non-ASCII identifiers? Just look inside the file, and see whether you like its source code. It's not that non-ASCII identifiers *necessarily* make the file unmaintainable for you, they just do so when you cannot easily recognize or type the characters being used.

Also, that all identifiers are ASCII is not sufficient for you to be able to debug the program in case of need: it also needs to be commented well, and the comments also should be in a language you understand. Furthermore, it has been demonstrated that ASCII-only identifiers are no guarantee that you can actually understand the code, if they happen to be non-English in a convoluted way, e.g. through transliteration.

So I don't see that an automatic detection of non-ASCII identifiers actually helps much in determining whether you can use the source code as inspiration. But even if you wanted to enforce a strict "ASCII-only" policy, I don't see why you need Python to *reject* identifiers outside ASCII - a warning would surely be enough to indicate to you that your policy was violated.

Regards,
Martin

From martin at v.loewis.de Tue Jun 12 06:59:31 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Tue, 12 Jun 2007 06:59:31 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <87sl9o5dvi.fsf@uwakimon.sk.tsukuba.ac.jp> <87646i5td6.fsf@uwakimon.sk.tsukuba.ac.jp> <781A2C3C-011E-4048-A72A-BE631C0C5127@fuhm.net> <87ps4p3zot.fsf@uwakimon.sk.tsukuba.ac.jp> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> <466CBC5A.3050907@v.loewis.de>
Message-ID: <466E2833.3090202@v.loewis.de>

>> or if I upgrade to the next Python version, assuming the default is
>> to restrict identifiers.
>>
> That would only happen if the default changes to a more strict rule. If we start
> with ASCII only, this is unlikely to ever happen!

It will likely happen. In 3.0, I change the installation default to allow for the characters I want. Then I install 3.1, and my code stops working. I have to remember how to change the installation default again, and locate the place in the 3.1 installation where I need to change the same setting (assuming it is a per-installation setting). In any case, global (application-wide) flags for restricting identifiers have already been ruled out as solutions to whatever the problem is they try to solve.

> my problem is then: what happens if the reader does not speak the same language
> as the author of the code? Right now, if I come across Python code written in a
> language I don't speak, I can still try to make sense of it. Sure, I may have to
> do without the comments, sure, I may not understand what the identifier names
> mean. But I can still follow the instruction flow and try to figure out what
> happens. With non-ASCII identifiers, I cannot do that because I cannot recognise
> the identifiers from one another.

I think it was Ping who demonstrated that with ASCII-only identifiers, you may not be able to reasonably analyze the code, either, so restricting to ASCII is no *guarantee* that you can maintain the code.
> Well, I have not followed the discussion about security risks closely.
> However, I see a much simpler risk: the risk that I come across code that
> is technically open source, but that I can't even debug in case of need because
> I cannot make sense of it. This would reduce the usefulness of such code, and
> cause fragmentation for the community.

How do you know this hasn't happened already? I'm *fairly* certain that the community is *already* fragmented, and that there are open source developers in other parts of the world writing Python programs that will just never show up in your part of the world, because of language and culture barriers.

In any case, the PEP advises that international projects should constrain themselves to ASCII and English; beyond that advice, I think there should be freedom of choice. It is not the interpreter's job to be language police.

Regards,
Martin

From martin at v.loewis.de Tue Jun 12 07:01:38 2007
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Tue, 12 Jun 2007 07:01:38 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <740c3aec0706111550s38e32d1dsf307f4e2c16c71e4@mail.gmail.com>
References: <19685.97380.qm@web33511.mail.mud.yahoo.com> <466C5730.3060003@v.loewis.de> <466DBD54.4030006@v.loewis.de> <740c3aec0706111550s38e32d1dsf307f4e2c16c71e4@mail.gmail.com>
Message-ID: <466E28B2.8090905@v.loewis.de>

>> > Python currently provides to everyone the restriction of
>> > identifiers to a character set that everyone knows and trusts.
>> > Many of us want Python to continue to provide such restriction
>> > for those who want identifiers to be in a character set they
>> > know and trust. This is not incompatible with your desire to
>> > permit alternative character sets, as long as Python offers an
>> > option to make that choice. We can continue to discuss the
>> > details of how that choice is expressed, but this general idea
>> > is a solution that would give us both what we want.
>> >
>> > Can we agree on that?
>>
>> So far, all proposals I have seen *are* incompatible, or had
>> some other flaws, so I'm not certain that this general idea
>> provides a non-empty solution set.
>
> python -ascii-only

That doesn't implement the requirement "restriction for those who want identifiers to be in a character set they know and trust", if that character set is not ASCII. It also fails Guido's requirement "no global options", which is "some other flaw".

Regards,
Martin

From rauli.ruohonen at gmail.com Tue Jun 12 08:33:35 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Tue, 12 Jun 2007 09:33:35 +0300
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> 
Message-ID: 

On 6/12/07, Baptiste Carvello wrote:
> This is where we strongly disagree. If an identifier is written in
> transliterated Chinese, I cannot understand what it means, but I can
> recognise it when it is used in the code. I will then find out the
> meaning from the context. By contrast, with Chinese identifiers, I
> will not recognise them from one another. So I won't be able to make
> any sense from the code without going through the complex task of
> translating everything.

I don't know any Chinese, but real Chinese is much more legible to me than a transliteration.
Transliterations are complete gibberish to me, but because I know Japanese and it uses many of the same characters with the same meaning, real Chinese makes at least *some* sense, and if I need to learn a few variable names in it then it's easier to do so with the proper characters. It's also much easier to look up what they mean, as others have already mentioned. The same should be true for anyone who knows Japanese, and there's a whole nation full of those.

From rrr at ronadam.com Tue Jun 12 09:28:34 2007
From: rrr at ronadam.com (Ron Adam)
Date: Tue, 12 Jun 2007 02:28:34 -0500
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: 
References: <4665EE44.2010306@ronadam.com> <4667CCB2.6040405@ronadam.com> <46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com> <4668D535.7020103@v.loewis.de>
Message-ID: <466E4B22.6020408@ronadam.com>

Guido van Rossum wrote:
> On 6/7/07, "Martin v. Löwis" wrote:
>> >> The os.environ.get() method probably should return a unicode
>> >> string. (?)
>> >
>> > Indeed -- care to contribute a patch?
>>
>> Ideally, such a patch would make use of the Win32 Unicode API for
>> environment variables on Windows. People had already been complaining
>> that they can't have "funny characters" in the value of an environment
>> variable, even though the UI allows them to set the variable just fine.
>
> Yeah, but the Windows build of py3k is currently badly broken (e.g.
> the _fileio.c extension probably doesn't work at all) -- and I don't
> have access to a Windows box to work on it. I'm afraid 3.0a1 will be
> released without Windows support. Of course I'm counting on others to
> fix that before 3.0 final is released.
>
> I don't mind for now that the posix.environ variable contains 8-bit
> strings -- people shouldn't be importing that anyway.

Here's a diff of the patch. It looks like this may be backported to 2.6 since it isn't Unicode specific but casts to the current str type.

- Cast environ keys and values to the current Python str type in os.py.
- Added a test for environ string types to test_os.py.
- Fixed the test_update2 (bug 1110478) test, which was being skipped.
- Test test_tmpfile in test_os.py fails. Haven't looked into it yet.

Index: Lib/os.py
===================================================================
--- Lib/os.py	(revision 55924)
+++ Lib/os.py	(working copy)
@@ -505,7 +505,8 @@
         def copy(self):
             return dict(self)
 
-
+    # Make sure all environment keys and values are correct str type.
+    environ = dict([(str(k), str(v)) for k, v in environ.items()])
     environ = _Environ(environ)
 
 def getenv(key, default=None):

Index: Lib/test/test_os.py
===================================================================
--- Lib/test/test_os.py	(revision 55924)
+++ Lib/test/test_os.py	(working copy)
@@ -266,12 +266,25 @@
         os.environ.clear()
         os.environ.update(self.__save)
 
+class EnvironTests2(unittest.TestCase):
+    """Test os.environ for specific problems."""
+    def setUp(self):
+        self.__save = dict(os.environ)
+    def tearDown(self):
+        os.environ.clear()
+        os.environ.update(self.__save)
     # Bug 1110478
     def test_update2(self):
         if os.path.exists("/bin/sh"):
             os.environ.update(HELLO="World")
             value = os.popen("/bin/sh -c 'echo $HELLO'").read().strip()
             self.assertEquals(value, "World")
+    # Verify environ keys and values from the OS are of the
+    # correct str type.
+    def test_keyvalue_types(self):
+        for key, val in os.environ.items():
+            self.assertEquals(type(key), str)
+            self.assertEquals(type(val), str)
 
 class WalkTests(unittest.TestCase):
     """Tests for os.walk()."""
@@ -466,6 +479,7 @@
         TemporaryFileTests,
         StatAttributeTests,
         EnvironTests,
+        EnvironTests2,
         WalkTests,
         MakedirTests,
         DevNullTests,

From python at zesty.ca Tue Jun 12 09:30:33 2007
From: python at zesty.ca (Ka-Ping Yee)
Date: Tue, 12 Jun 2007 02:30:33 -0500 (CDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466E282E.5070603@v.loewis.de>
References: <20070524234516.8654.JCARLSON@uci.edu> <4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu> <466C5BB7.8050909@v.loewis.de> <466E282E.5070603@v.loewis.de>
Message-ID: 

On Tue, 12 Jun 2007, "Martin v. Löwis" wrote:
> Also, that all identifiers are ASCII is not sufficient
> for you to be able to debug the program in case of need: it also
> needs to be commented well, and the comments also should be in
> a language you understand. Furthermore, it has been demonstrated
> that ASCII-only identifiers are no guarantee that you can actually
> understand the code, if they happen to be non-English in a convoluted
> way, e.g. through transliteration.

You keep making arguments of this type: that lacking a 100% guarantee of a desirable property is reason to abandon consideration of the property altogether. I reject such arguments, as they should be rejected by any sound application of logic. They don't belong in this debate; please stop making them.

-- ?!ng

From python at zesty.ca Tue Jun 12 09:40:29 2007
From: python at zesty.ca (Ka-Ping Yee)
Date: Tue, 12 Jun 2007 02:40:29 -0500 (CDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <87fy4xg9j3.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <20070524234516.8654.JCARLSON@uci.edu> <4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu> <466C5BB7.8050909@v.loewis.de> <466DBE10.1070804@v.loewis.de> <466DC90E.1070009@v.loewis.de> <87fy4xg9j3.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: 

On Tue, 12 Jun 2007, Stephen J. Turnbull wrote:
> It seems to me that rather than *impose* restrictions on third
> parties, the sensible thing to do is to provide those restrictions to
> those who want them.

Hang on a second. No one is *imposing* new restrictions. Python uses ASCII-only identifiers today and has always been that way. The proposed change is to *expand* the identifier character set, and some of us want to have control over this expansion.

> But I see no reason why that auditor program can't be run as a PEP 263
> codec. AFAICS, the following objections could be raised, and answered:

The big missing concern from your list is that the vast majority won't *know* that the character set is changing on them, so they won't know that they need to do any of these things.

> 1. PEP 263 codecs delegate the decision to the code's author; an
> auditor shouldn't do that.

I'd be okay with this if the rules were adjusted so the codec declaration was guaranteed to occur within a small bounded region at the beginning of the file (e.g. the first two lines or first 80 characters, whichever is less) and that region was required to be in ASCII. Then you can easily know reliably, and at a glance, what character set you are dealing with.

> 2. The auditor would have to duplicate the work of the parser, and
> might get it wrong.

> 3. Parsing is expensive in time and other resources.
Both of these come down to the wastefulness of redoing something that the Python interpreter itself already knows how to do very well, and is, in some sense by definition, the authority on how to do it correctly.

-- ?!ng

From rauli.ruohonen at gmail.com Tue Jun 12 12:27:45 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Tue, 12 Jun 2007 13:27:45 +0300
Subject: [Python-3000] String comparison
In-Reply-To: <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp> <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: 

On 6/10/07, Stephen J. Turnbull wrote:
> I think you misunderstand. Anything in Unicode that is normative is
> about interchange. Strings are also a means of interchange---between
> modules (separate Unicode processes) in a program (single OS process).

Like Martin said, "what is a process?" :-) If you have a module that uses noncharacters to mean something and it documents that, then that may well be useful to its users. In my mind everything in a Python program is within a single Unicode process, unless you have a very high-level component which specifies otherwise in its API documentation.

> Your complaint about Python mixing "pseudo-UTF-16" with "pseudo-UCS-2"
> is precisely a statement that various modules in Python do not specify
> what encoding forms they purport to accept or emit.

Actually, I said that there's no way to always do the right thing as long as they are mixed, but that was too theoretical an argument. Practically speaking, there's little need to interpret surrogate pairs as two code points instead of as one non-BMP code point. The best use case I could come up with was reading in an ill-formed UTF-8 file to see what makes it ill-formed, but that's best done using bytes anyway.

E.g. '\xed\xa0\x80\xed\xb0\x80\xf0\x90\x80\x80' decodes to u'\ud800\udc00\U00010000' on both builds, but as on a UCS-2 build u'\U00010000' == u'\ud800\udc00', the distinction is lost there. Effectively the codec has decoded the first two code points to UCS-2 and the last code point to UTF-16, forming a string which mixes the two interpretations instead of using one of them consistently, and because of that you can no longer recover the original code point stream. But what the decoder should really do is raise an exception anyway, as the input is ill-formed.

Java and C# (and thus Jython and IronPython too) also sometimes use UCS-2, sometimes UTF-16. As long as it works as you expect, there isn't a problem, really. On UCS-4 builds of CPython it's the same (either UCS-4 or UTF-32 with the extension that surrogates work as in UTF-16), but you get the extra complication that some equal strings don't compare equal, e.g. u'\U00010000' != u'\ud800\udc00'. Even that doesn't cause problems in practice, because you shouldn't have strings like u'\ud800\udc00' in the first place.

From stephen at xemacs.org Tue Jun 12 13:35:22 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 12 Jun 2007 20:35:22 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com> <9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net> <87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de> 
Message-ID: <87vedte77p.fsf@uwakimon.sk.tsukuba.ac.jp>

Rauli Ruohonen writes:

> I don't know any Chinese, but real Chinese is much more legible to me
> than a transliteration. Transliterations are complete gibberish to me,

And they will be to most Chinese, too, unless Mandarin is used, since pronunciation varies infinitely from dialect to dialect, although the characters and grammar are mostly the same. You'd have to ask a non-Beijing Chinese, but I suspect some of them feel about Mandarin as a standard roughly the way Japanese feel about English.

From turnbull at sk.tsukuba.ac.jp Tue Jun 12 14:04:35 2007
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Tue, 12 Jun 2007 21:04:35 +0900
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
References: <20070524234516.8654.JCARLSON@uci.edu> <4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu> <466C5BB7.8050909@v.loewis.de> <466DBE10.1070804@v.loewis.de> <466DC90E.1070009@v.loewis.de> <87fy4xg9j3.fsf@uwakimon.sk.tsukuba.ac.jp> 
Message-ID: <87tztde5v0.fsf@uwakimon.sk.tsukuba.ac.jp>

Ka-Ping Yee writes:

> On Tue, 12 Jun 2007, Stephen J. Turnbull wrote:
> > It seems to me that rather than *impose* restrictions on third
> > parties, the sensible thing to do is to provide those restrictions to
> > those who want them.
>
> Hang on a second. No one is *imposing* new restrictions. Python
> uses ASCII-only identifiers today and has always been that way.

Who said "new"? PEP 3131 is approved, so "reimpose", if you like. But I don't want it, and definitely consider it an imposition, and have done so since the discussion of PEP 263.

> The big missing concern from your list is that the vast majority
> won't *know* that the character set is changing on them, so they
> won't know that they need to do any of these things.

Deliberate omission. Such restrictions seem unacceptable to both Guido and Martin; the *only* point of this proposal is to see if there's a way we can achieve Jim's goal of no change to his best current practice without a global setting unacceptable to Guido. If you want to use this technology to change the default, fine, but it's not part of my proposal.

> > 1. PEP 263 codecs delegate the decision to the code's author;
>
> I'd be okay with this if [...]

[I'm not sure what you mean; I've deliberately edited to show the meaning I took.]

I'm not OK with it. Auditing by definition is under control of the user, not the source code. I don't see a real point in doing this if the user or site can't enforce auditing, since they *can* do so by explicitly running an external utility.

> > 2. The auditor would have to duplicate the work of the parser, and
> > might get it wrong.
>
> > 3. Parsing is expensive in time and other resources.
>
> Both of these come down to the wastefulness of redoing something
> that the Python interpreter itself already knows how to do very
> well, and is, in some sense by definition, the authority on how
> to do it correctly.

True. However, Guido has already indicated that he favors some approach like this, as an external lint utility. My question is how to minimize impact on users who desire flexible automatic auditing.
From jimjjewett at gmail.com Tue Jun 12 16:03:53 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 12 Jun 2007 10:03:53 -0400
Subject: [Python-3000] String comparison
In-Reply-To: 
References: <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp> <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> 
Message-ID: 

On 6/12/07, Rauli Ruohonen wrote:
> Practically
> speaking, there's little need to interpret surrogate pairs as two
> code points instead of as one non-BMP code point.

Depends on your definition of "practically". Python does interpret them that way to maintain O(1) positional access within strings encoded with 16 bits/char.

-jJ

From rauli.ruohonen at gmail.com Tue Jun 12 16:39:48 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Tue, 12 Jun 2007 17:39:48 +0300
Subject: [Python-3000] String comparison
In-Reply-To: 
References: <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp> <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> 
Message-ID: 

On 6/12/07, Jim Jewett wrote:
> On 6/12/07, Rauli Ruohonen wrote:
> > Practically speaking, there's little need to interpret surrogate pairs
> > as two code points instead of as one non-BMP code point.
>
> Depends on your definition of "practically".
>
> Python does interpret them that way to maintain O(1) positional access
> within strings encoded with 16 bits/char.

Indexing does not try to interpret the string as code points at all; it works on code units. The difference is easier to see if you imagine Python using UTF-8 for strings. Indexing would still work on (8-bit) code units instead of code points. It is higher-level operations such as unicodedata.normalize() that need to interpret strings as code points. For 16-bit code units there are two interpretations, depending on whether you think that surrogate pairs mean one (UTF-16) or two (UCS-2) code points.

Incidentally, unicodedata.normalize() is an example that currently does interpret its input as UCS-2 instead of UTF-16. If you pass it a surrogate pair it thinks of them as two code points, and won't do any normalization for anything outside the BMP on a UCS-2 build. Another example would be unichr(), which gives you TypeError if you pass it a surrogate pair (oddly enough, as strings of different lengths are of the same type).

From rauli.ruohonen at gmail.com Tue Jun 12 16:45:20 2007
From: rauli.ruohonen at gmail.com (Rauli Ruohonen)
Date: Tue, 12 Jun 2007 17:45:20 +0300
Subject: [Python-3000] String comparison
In-Reply-To: 
References: <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> 
Message-ID: 

On 6/12/07, Rauli Ruohonen wrote:
> Another example would be unichr(), which gives you TypeError if you
> pass it a surrogate pair (oddly enough, as strings of different lengths
> are of the same type).

Sorry, I meant ord(), not unichr(). Anyway, ord(unichr(i)) == i doesn't work for all code points on a UCS-2 build.
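[To make the build difference concrete, a short sketch; the commented results assume a "narrow" UCS-2/UTF-16 build of that era, and differ on a UCS-4 build:]

    s = u'\U00010000'                  # one non-BMP code point
    len(s)                             # 2 on a narrow build: indexing sees 16-bit code units
    s[0], s[1]                         # (u'\ud800', u'\udc00') -- the surrogate pair
    u'\U00010000' == u'\ud800\udc00'   # True on a narrow build, False on a UCS-4 build
    ord(s)                             # TypeError on a narrow build: ord() wants length 1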
From bborcic at gmail.com Tue Jun 12 18:27:11 2007
From: bborcic at gmail.com (Boris Borcic)
Date: Tue, 12 Jun 2007 18:27:11 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
References: <20070524234516.8654.JCARLSON@uci.edu> <4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu> <466C5BB7.8050909@v.loewis.de> <466DBE10.1070804@v.loewis.de> <466DC90E.1070009@v.loewis.de> <87fy4xg9j3.fsf@uwakimon.sk.tsukuba.ac.jp> 
Message-ID: 

Ka-Ping Yee wrote:
>
> Hang on a second. No one is *imposing* new restrictions. Python
> uses ASCII-only identifiers today and has always been that way.

That restriction clearly wasn't imposed on the standard www.python.org Windows distributions of Python - for quite a few versions already. See below.

Cheers,

Boris Borcic

Python 2.4.2 (#67, Jan 17 2006, 15:36:03) [MSC v.1310 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
****************************************************************
Personal firewall software may warn about the connection IDLE makes to its subprocess using this computer's internal loopback interface. This connection is not visible on any external interface and no data is sent to or received from the Internet.
****************************************************************
IDLE 1.1.2
>>> ça_marchait = 1
>>> print ça_marchait
1

==========================================================================

Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
****************************************************************
Personal firewall software may warn about the connection IDLE makes to its subprocess using this computer's internal loopback interface. This connection is not visible on any external interface and no data is sent to or received from the Internet.
****************************************************************
IDLE 1.2
>>> ça_marche = 2
>>> à_l_évidence = 3
>>> ça_marche + à_l_évidence
5

From pje at telecommunity.com Tue Jun 12 18:30:46 2007
From: pje at telecommunity.com (Phillip J. Eby)
Date: Tue, 12 Jun 2007 12:30:46 -0400
Subject: [Python-3000] Pre-PEP on fast imports
In-Reply-To: <20070611231640.8A55E3A407F@sparrow.telecommunity.com>
References: <466DD0D8.7040407@develer.com> <20070611231640.8A55E3A407F@sparrow.telecommunity.com>
Message-ID: <20070612162845.5F00A3A407F@sparrow.telecommunity.com>

At 07:18 PM 6/11/2007 -0400, Phillip J. Eby wrote:
>The subclass might look something like this:
>
>    import imp, os, sys
>    from pkgutil import ImpImporter
>
>    suffixes = set(ext for ext, mode, typ in imp.get_suffixes())
>
>    class CachedImporter(ImpImporter):
>        def __init__(self, path):
>            if not os.path.isdir(path):
>                raise ImportError("Not an existing directory")
>            super(CachedImporter, self).__init__(path)
>            self.refresh()
>
>        def refresh(self):
>            self.cache = set()
>            for fname in os.listdir(self.path):
>                base, ext = os.path.splitext(fname)
>                if ext in suffixes and '.' not in base:
>                    self.cache.add(base)
>
>        def find_module(self, fullname, path=None):
>            if fullname.split(".")[-1] not in self.cache:
>                return None  # no need to check further
>            return super(CachedImporter, self).find_module(fullname, path)
>
>    sys.path_hooks.append(CachedImporter)

After a bit of reflection, it seems the refresh() method needs to be a bit different:

    def refresh(self):
        cache = set()
        for fname in os.listdir(self.path):
            base, ext = os.path.splitext(fname)
            if not ext or (ext in suffixes and '.' not in base):
                cache.add(base)
        self.cache = cache

This version fixes two problems: first, a race condition could occur if you called refresh() while an import was taking place in another thread. This version fixes that by only updating self.cache after the new cache is completely built.

Second, the old version didn't handle packages at all. This version handles them by treating extension-less filenames as possible package directories. I originally thought this should check for a subdirectory and __init__, but this could get very expensive if a sys.path directory has a lot of subdirectories (whether or not they're packages). Having false positives in the cache (i.e. names that can't actually be imported) could slow things down a bit, but *only* if those names match something you're trying to import. Thus, it seems like a reasonable trade-off versus needing to scan every subdirectory at startup or even to check whether all those names *are* subdirectories.

From jimjjewett at gmail.com Tue Jun 12 18:48:46 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 12 Jun 2007 12:48:46 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
References: <19685.97380.qm@web33511.mail.mud.yahoo.com> <466C5730.3060003@v.loewis.de> 
Message-ID: 

On 6/11/07, Baptiste Carvello wrote:
> Michael Urman wrote:
> > ... you already cannot visually inspect ...
> > There is the risk of visually aliased identifiers, but how is that
> > qualitatively worse than the truly conflicting identifiers you can
> > import with a *, or have inserted by modules mucking with
> > __builtins__?
> Oh come on! Imports are usually located at the top of the file, so they won't
> clobber other names. And mucking with __builtins__ is rare and frowned upon.

Also, both are at least obvious. Because of the (unexpected) visually similar possibilities, a closer analogy would be a module that did

    def fn1():
        global mydata
        mydata = sys.modules['__builtin__']

and later changes mydata. This is certainly possible, but it isn't common, or accepted as good style.

> On the contrary, non-ASCII identifiers will be encouraged,
> anywhere in the code.

And that's OK with me -- but I want a warning when they are used, at least as conspicuous as

    import *
    __builtins__

I have no objection to letting people turn that warning off locally (preferably per-charset, rather than as a single switch), but I want that decision to be local and explicit.

-jJ

From jimjjewett at gmail.com Tue Jun 12 19:08:30 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 12 Jun 2007 13:08:30 -0400
Subject: [Python-3000] String comparison
In-Reply-To: 
References: <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> 
Message-ID: 

On 6/12/07, Rauli Ruohonen wrote:
> On 6/12/07, Jim Jewett wrote:
> > On 6/12/07, Rauli Ruohonen wrote:
> > > Practically speaking, there's little need to interpret
> > > surrogate pairs as two code points instead of as one
> > > non-BMP code point.
> > Depends on your definition of "practically".
> > Python does interpret them that way to maintain O(1) positional
> > access within strings encoded with 16 bits/char.
> Indexing does not try to interpret the string as code points at all; it
> works on code units.

Even assuming that (when most people will assume "letters", and could maybe understand that accent marks sometimes count), it still doesn't quite work. Slicing (or iterating over) a string claims to return strings of the same type.
    >>> for x in u"abc": print type(x)

Strictly speaking, the surrogate pairs should be returned together, rather than as separate code units. It probably won't be fixed, since those who care most are probably using 4-byte unicode characters.

-jJ

From g.brandl at gmx.net Tue Jun 12 22:28:21 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Tue, 12 Jun 2007 22:28:21 +0200
Subject: [Python-3000] Unicode identifiers
In-Reply-To: <466EF680.10200@v.loewis.de>
References: <466BCB1D.2050101@v.loewis.de> <466CCF0D.5000304@v.loewis.de> <466DBBA8.5020405@v.loewis.de> <466E3509.6020703@v.loewis.de> <466E9C08.5000704@gmx.net> <466EC238.1060502@gmx.net> <466EF680.10200@v.loewis.de>
Message-ID: <466F01E5.3030204@gmx.net>

[crossposting to python-3000]

Martin v. Löwis schrieb:

[removing string->string codecs]

>>> You're not losing functionality -- these conversions will remain
>>> available by importing the appropriate module. You're losing a very
>>> minor amount of convenience.
>>
>> Of the mentioned encodings base64, uu, zlib, rot_13, hex, and quopri (bz2
>> should be in there as well) these all could work in the unicode<->bytes
>> way, except for rot13.
>
> What does it mean to apply base64 to a string that contains characters
> with ordinals > 256? Likewise for uu, zlib, hex, and quopri.
>
> They really encode bytes, not characters.

Perhaps I may then suggest a new bytes API, transform(). b"abc".transform("base64") would then do the same as today's "abc".encode("base64"). A unified bytestring transforming API could then make many of the functions scattered across many modules obsolete. Whether the opposite way would be via a different transformer name or an untransform() method remains to be debated.

Georg
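[A minimal sketch of what such an API could look like as plain functions, built on existing stdlib modules; the transform/untransform names follow Georg's suggestion, and the codec table here is an assumption, not an existing interface:]

    import base64, binascii, zlib

    # Hypothetical registry for a bytes<->bytes transform API.
    _ENCODERS = {"base64": base64.b64encode,
                 "hex": binascii.hexlify,
                 "zlib": zlib.compress}
    _DECODERS = {"base64": base64.b64decode,
                 "hex": binascii.unhexlify,
                 "zlib": zlib.decompress}

    def transform(data, name):
        return _ENCODERS[name](data)

    def untransform(data, name):
        return _DECODERS[name](data)

    # transform(b"abc", "base64") == b"YWJj"
    # untransform(b"YWJj", "base64") == b"abc"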
From baptiste13 at altern.org Tue Jun 12 22:59:35 2007
From: baptiste13 at altern.org (Baptiste Carvello)
Date: Tue, 12 Jun 2007 22:59:35 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
References: <19685.97380.qm@web33511.mail.mud.yahoo.com> <466C5730.3060003@v.loewis.de> 
Message-ID: 

Michael Urman wrote:
> As I am not going to be interested in trying to
> understand code written in Chinese, Russian, etc., I'm not bothered by
> the idea that someone might write code I will have a strong
> disincentive to read.
>
The question is: is it worth it? Will the new feature allow more useful code to be written, or will it cause unnecessary duplication of effort? Probably both, but I cannot tell in which proportions, and neither can you, I guess.

I think it helps a lot in this regard if developers make a conscious choice as to whether they use non-ASCII identifiers or not in a given project (as opposed to just using the feature because it is there). Thus they will only use them when they really feel the need. Having the feature disabled by default is a way to make sure people take some time to think about it. However, maybe it is not absolutely necessary, and a prominent explanation in the various documentations is sufficient? I'm not 100% sure one way or the other.

Cheers,
Baptiste

From baptiste13 at altern.org Tue Jun 12 23:29:56 2007
From: baptiste13 at altern.org (Baptiste Carvello)
Date: Tue, 12 Jun 2007 23:29:56 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466E282E.5070603@v.loewis.de>
References: <20070524234516.8654.JCARLSON@uci.edu> <4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu> <466C5BB7.8050909@v.loewis.de> <466E282E.5070603@v.loewis.de>
Message-ID: 

Martin v. Löwis wrote:
> Baptiste Carvello schrieb:
>> Martin v. Löwis wrote:
>>> I cannot imagine this scenario as realistic. It is certainly
>>> realistic that you want to keep your own code base ASCII-only -
>>> what I don't understand is why such a policy would extend to libraries
>>> that you use. If the interfaces of the library are non-ASCII, you
>>> will automatically notice; if it only has some non-ASCII
>>> identifiers inside, why would you bother?
>>>
>> well, for the same reason I prefer to use open source software:
>> because I can debug it in case of need, and because I can use it as
>> an inspiration if I need to write a similar program.
>
> Ok, but why do you then need *Python* to tell you that the file has
> non-ASCII identifiers? Just look inside the file, and see whether
> you like its source code.
>
Well, doing that for all code before using it is not practical. And finding out you can't read the code at the precise time when you have a bug you need to solve is a really bad surprise.

> It's not that non-ASCII identifiers
> *necessarily* make the file unmaintainable for you, they just
> do so when you cannot easily recognize or type the characters
> being used.
>
True, but better safe than sorry :-)

> Also, that all identifiers are ASCII is not sufficient
> for you to be able to debug the program in case of need: it also
> needs to be commented well, and the comments also should be in
> a language you understand.
>
Comments are nice to have, but you can often figure out what the code does without them. It's not like all code is heavily commented...

> Furthermore, it has been demonstrated
> that ASCII-only identifiers are no guarantee that you can actually
> understand the code, if they happen to be non-English in a convoluted
> way, e.g. through transliteration.
>
This is the same as comments: if the identifier name is gibberish, you can still figure out what it stands for from the context (OK, at that point, it starts getting very uncomfortable :-).

> So I don't see that an automatic detection of non-ASCII identifiers
> actually helps much in determining whether you can use the source
> code as inspiration. But even if you wanted to enforce a strict
> "ASCII-only" policy, I don't see why you need Python to *reject*
> identifiers outside ASCII - a warning would surely be enough to
> indicate to you that your policy was violated.
>
Indeed, but I doubt a warning would be acceptable as a default policy, given how annoying warnings are. So there would still need to be a configuration option to disable (resp. enable) the warning. Also note that a warning would not solve the security problems others have discussed, because it would only be shown after the code has been executed.

> Regards,
> Martin
>

Cheers,
Baptiste

PS: I think I'm going to reduce my participation in this thread, as I don't have many new thoughts to add. I'm not convinced that non-ASCII identifiers allowed by default is the way to go, but I'm not 100% sure otherwise, so count me as a -0 on that.
From baptiste13 at altern.org  Tue Jun 12 23:34:25 2007
From: baptiste13 at altern.org (Baptiste Carvello)
Date: Tue, 12 Jun 2007 23:34:25 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <19dd68ba0706111734t1f0f12d6y6c6584e41bf6e211@mail.gmail.com>
References: <19dd68ba0705120817k61788659n83da8d2c09dba0e1@mail.gmail.com>
	<9D017904-5A64-40EC-8A5C-23502FB1E314@fuhm.net>
	<87wsyu2zj7.fsf@uwakimon.sk.tsukuba.ac.jp>
	<87k5ulyx1c.fsf@uwakimon.sk.tsukuba.ac.jp> <466BBDA1.7070808@v.loewis.de>
	<19dd68ba0706111734t1f0f12d6y6c6584e41bf6e211@mail.gmail.com>
Message-ID: 

Guillaume Proux a écrit :
> Hello,
>
> On 6/12/07, Baptiste Carvello wrote:
>> context. By contrast, with chinese identifiers, I will not recognise them
>> from one another. So I won't be able to make any sense from the code
>> without going through the complex task of translating everything.
>
> You would be surprised how well you can do if you would actually try
> to recognize a set of Chinese characters, especially if you would use
> some tool to put a meaning on them. Well, I never formally learned any
> Chinese (nor any Japanese actually), but I can now effortlessly parse
> both languages.
>
> But really, if you ever find any code with Chinese written all over it
> that you would believe might be very useful to you, you would have one
> of the following choices:
> (a) use a tokenizer and use some tool to do a hanzi -> ascii automatic
> transliteration/translation
> (b) try to wrap the Chinese things with an ASCII veil (which would
> make you work on your Chinese a bit) or you could ask your Chinese
> girlfriend to help you (WHAT you don't have a Chinese girlfriend yet?
> :))
> (c) actually contact the person who submitted the code to let him know
> you are very much interested in the code....
>
> In most cases, this would give you the possibility to reach out to
> different communities and to work together with people you might never
> have talked to. From what we can see on English-only mailing lists,
> these are the kind of python users we don't normally have access to
> currently, because they simply are secluded in their own little
> universe, in the comfortable realm of their own linguistic barriers.
>
> Of course, sometimes they step out and offer a plea for help on
> English MLs in broken English...
> PEP3131 is unlikely to change this. However I can see it might have
> two ethnically interesting consequences:
> 1) Python usage in communities where ascii has little place should find
> more uses, because people will become empowered with Python and able to
> express themselves like never before: my bet is that, for example, the
> Japanese python community will become stronger and welcome new people,
> younger and older, who do not know much English.
> 2) If ever a program written with non-ASCII characters finds some good
> usage in ascii-only communities, then the usual plea for help will be
> reversed. People will seek out e.g. Japanese programmers and request
> help, maybe in broken Japanese. From this point on, all programming
> communities will be on an equal footing and able to talk together on
> the same standpoint. I guess you know "Liberté Egalité Fraternité".
> Maybe this should be the PEP subtitle.
>
>> what happens to the keyword "if" (just try it :-). You would have to
>> translate the identifiers one by one, which is not practical.
>
> would be possible with the tokenizer actually :)
>
> Droit comme un if !
> > A bientôt,
> > Guillaume

si tu me prends par les sentiments :-) (you're appealing to my
feelings there...) Really, you make it sound so nice I would almost
change my mind. Still wondering how much of an effort it will be,
though.

Ciao,
Baptiste

From jimjjewett at gmail.com  Tue Jun 12 23:46:32 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 12 Jun 2007 17:46:32 -0400
Subject: [Python-3000] external dependencies (PEP 3131)
Message-ID: 

On 6/11/07, Michael Urman wrote:
> I can't agree with this. The predictability of needing only to
> duplicate dependencies (version of python, modules) to ensure a
> program that ran over there will run over here (and vice versa) is too
> important to me.

This provides almost an exact analogy for locally allowed additional
scripts (scripts = writing systems, not .py files). Your cherished
assumption may already be false, depending on what is in sitecustomize.

sitecustomize is imported by site.py; it is not overwritten by a new
install; it can do arbitrary things. In theory, you need to add
sitecustomize to your list of dependencies. In practice, it almost
never exists, let alone does anything. But you could use it if you
wanted to...

By exact analogy: In theory, you need to pre-authorize the characters
used in your identifiers, just as you have to set up your
sitecustomize. In practice, you generally won't need to change anything
because ASCII is enough. (Or if not, there will be a standard
distribution for your language community that already adds what you
actually use.) But if your local community does start using Tengwar,
you are free to add that too. This script-permission file can last
across installations, just like sitecustomize does.

And to be honest, from my perspective, a fine spelling would be to just
add it right to sitecustomize, perhaps as:

    __allow_id_chars__("Kanji")
    __allow_id_chars__("0x1043..0x1059")

__allow_id_chars__ should be restricted in Brett's security-conscious
build, but I think it is OK to expose it to normal python. If a strange
file does __allow_id_chars__("0x1043") up near the import block, that
provides about the same level of warning as use of "__builtin__", or
"import *". (That is, less warning than I would ideally prefer, but
probably enough to prevent *accidental* charset confusion.)

-jJ
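__allow_id_chars__ does not exist; it is only a proposed spelling. A
user-space approximation that a sitecustomize could provide today is
sketched below. All names are invented, and the first word of a
character's Unicode name serves as a crude stand-in for a real script
property, so e.g. allow_id_chars("CJK") would approve most Kanji (their
names start with "CJK UNIFIED IDEOGRAPH"):

    import io, tokenize, unicodedata

    _allowed_scripts = set(["LATIN"])  # ASCII is always acceptable

    def allow_id_chars(script):
        _allowed_scripts.add(script.upper())

    def check_id_chars(source):
        # Reject identifiers whose non-ASCII characters fall outside
        # the locally approved scripts.
        readline = io.StringIO(source).readline
        for toktype, tok, start, end, line in \
                tokenize.generate_tokens(readline):
            if toktype != tokenize.NAME:
                continue
            for ch in tok:
                if ord(ch) < 128:
                    continue
                name = unicodedata.name(ch, "")
                if not any(name.startswith(s) for s in _allowed_scripts):
                    raise SyntaxError("identifier character %r on line %d"
                                      " is not locally allowed"
                                      % (ch, start[0]))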
From jimjjewett at gmail.com  Wed Jun 13 00:09:17 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 12 Jun 2007 18:09:17 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <87fy4xg9j3.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <20070524234516.8654.JCARLSON@uci.edu>
	<4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu>
	<466C5BB7.8050909@v.loewis.de> <466DBE10.1070804@v.loewis.de>
	<466DC90E.1070009@v.loewis.de> <87fy4xg9j3.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: 

On 6/11/07, Stephen J. Turnbull wrote:
> Jim Jewett writes:
> > Of course, I wouldn't type them if I knew they were wrong. With an
> > ASCII-only install, I would get that error-check because the
> > (remaining original uses) were in Cyrillic. With an "any unicode
> > character" install, ... well, I might figure out my problem the next
> > morning.

> But this is something that only a small subset of developers-of-Python
> seem to be concerned about.

(Almost) no one ever cares about typos (or fire escapes, for that
matter) in advance; if non-ASCII characters were common enough (in your
local environment) that people expected and recognized them, then they
wouldn't be a problem. That is why I have no objection to using
Japanese on systems configured for it.

That is also why I want even systems configured for Japanese to be able
to still get warnings about Latin-1 (beyond ASCII). I figure the
difference between ? and i may be as subtle to them as the difference
between two of their letters that happen to look similar to me, and
they might appreciate the heads-up to look carefully.

> But I see no reason why that auditor program can't be run as a PEP 263
> codec. AFAICS, the following objections could be raised, and answered:

This can of course be turned around. The "codec does a bit more than
you expect" option has been available since 2.3 for people who want an
expanded ID charset. (Just transliterate the extra characters into the
moral equivalent of an escape.) It doesn't seem to have been used.

I'll freely agree that it hasn't been used in part because the expanded
charset is aimed largely at people not ready to use the "write or at
least install a codec that cheats" level of magic. It is also partly
because the use of non-ASCII IDs is expected to stay small in widely
distributed code. But the same facts argue against silently allowing
unrecognized characters; the use will be rare enough that people won't
be expecting it, and the level of magic required to write (or even know
to install) such a codec ... tends to come after someone has already
found a workaround for "strange characters".

> That doesn't mollify those who think I should not be allowed to use
> non-ASCII identifiers at all.

There is a subtle distinction there. I am among those who think you
should not use non-ASCII identifiers *without an explicit declaration.*
Putting that declaration at the top of the file itself would be fine
(modulo possible security issues, such as the "coding" with a Cyrillic
c).

-jJ

From showell30 at yahoo.com  Wed Jun 13 00:15:49 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Tue, 12 Jun 2007 15:15:49 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <87tztde5v0.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <636530.83129.qm@web33503.mail.mud.yahoo.com>

--- "Stephen J. Turnbull" wrote:
> Ka-Ping Yee writes:
> > Both of these come down to the wastefulness of redoing something
> > that the Python interpreter itself already knows how to do very
> > well, and is, in some sense by definition, the authority on how
> > to do it correctly.
>
> True. However, Guido has already indicated that he favors some
> approach like this, as an external lint utility. My question is how
> to minimize impact on users who desire flexible automatic auditing.

I would like to comment on both points. I am somebody who would use
such an external lint utility, even if it was just out of idle
curiosity about the code I was importing (in other words, no fear
involved). It seems like such a utility would need to be able to do the
following.

1) The utility would need to tokenize my code. It seems like this could
be done by the tokenize module pretty easily, even under PEP 3131.
tokenize.py does not tap into Python internals right now AFAIK, and I
don't think it would need to under Py3K.

2) The utility should triage my identifiers according to their alphabet
content. In an ideal world, since I'm not a Unicode expert, I would
like it somewhat simplified for me -- e.g. the utility would classify
identifiers as ascii-only, totally mixed, definitely Cyrillic,
definitely French, German, mixed Latin variant, Japanese, etc.
To the extent that Python knows how to classify strings on those
general levels, I would hope that those functions would be exposed at
the Python level. But to the extent that CPython really shouldn't care,
I don't see a big problem with some third-party library implementing
some kind of routine that can deduce languages from Unicode spellings.
It's basically a big dictionary, maybe a small tree structure, and
something like a forty-line algorithm: walk through the letters, look
up their most specific language, then, with all the letters, walk up
the tree until you find the most specific species/phylum/kingdom etc.
of languages that encompasses all of them. (A sketch of this follows
below.)

3) The utility should be able to efficiently figure out which files I
want to inspect, by statically walking the import structure. To Ping's
point, I think this is one area where you lose something by having to
do this outside of the interpreter, but it doesn't seem to be a
terribly difficult problem to solve. (To the extent that Python can
dynamically import stuff at run-time, I'm willing to accept that
limitation in an external lint utility, since even if CPython were
doing the auditing for me, I'd still only find out at runtime.)
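Steve's "forty-line algorithm" can be sketched in far fewer lines if
the first word of each character's Unicode name stands in for the "most
specific language" and the tree is collapsed to a flat summary. This is
illustrative only, not a proposed stdlib API:

    import unicodedata

    def classify_identifier(name):
        # Collect a rough "script" tag for every character, then
        # summarize the identifier from the collected set.
        scripts = set()
        for ch in name:
            if ord(ch) < 128:
                scripts.add("ASCII")
            else:
                scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
        if scripts == set(["ASCII"]):
            return "ascii-only"
        if "ASCII" in scripts:
            return "mixed: ASCII + " + ", ".join(sorted(scripts
                                                        - set(["ASCII"])))
        return ", ".join(sorted(scripts))

With this, classify_identifier("caf\u00e9") returns "mixed: ASCII +
LATIN", and a purely Cyrillic name comes back as "CYRILLIC".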
From showell30 at yahoo.com  Wed Jun 13 00:36:11 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Tue, 12 Jun 2007 15:36:11 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
Message-ID: <402199.77500.qm@web33508.mail.mud.yahoo.com>

--- Baptiste Carvello wrote:
>
> si tu me prends par les sentiments :-) Really, you make it sound so
> nice I would almost change my mind. Still wondering how much of an
> effort it will be, though.

I would again make a call out for actual examples of what Python code
would look like under PEP 3131. Then people would not need to speculate
on the effort involved; they could try it out.

In my best franglais: je pense que les avocats de PEP 3131 pourraient
surmonter le doute, l'incertitude, la crainte, etc., de PEP 3131 en
montrant des exemples. (Roughly: I think the advocates of PEP 3131
could overcome the doubt, the uncertainty, the fear, etc., about PEP
3131 by showing examples.)

From jimjjewett at gmail.com  Wed Jun 13 03:31:10 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 12 Jun 2007 21:31:10 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <466E282E.5070603@v.loewis.de>
References: <20070524234516.8654.JCARLSON@uci.edu>
	<4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu>
	<466C5BB7.8050909@v.loewis.de> <466E282E.5070603@v.loewis.de>
Message-ID: 

On 6/12/07, "Martin v. Löwis" wrote:
> Ok, but why do you then need *Python* to tell you that the file has
> non-ASCII identifiers? Just look inside the file, and see whether
> you like its source code.

That is just what many users (including, in some environments, me)
cannot do *because* of the extended charset. I can't see whether I have
an ASCII o or a Cyrillic o, because they look the same, even though
they aren't. If the whole thing is in Cyrillic, I may notice; if only a
few identifiers are, I probably won't notice, at least until I've
already saved it (and possibly broken it, depending on how
unicode-unaware my editor is).

> I don't see why you need Python to *reject*
> identifiers outside ASCII - a warning would surely be enough to
> indicate to you that your policy was violated.

A warning would indeed be sufficient.

-jJ

From jimjjewett at gmail.com  Wed Jun 13 03:44:12 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 12 Jun 2007 21:44:12 -0400
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: <402199.77500.qm@web33508.mail.mud.yahoo.com>
References: <402199.77500.qm@web33508.mail.mud.yahoo.com>
Message-ID: 

On 6/12/07, Steve Howell wrote:
> In my best franglais: je pense que les avocats de PEP 3131 pourraient
> surmonter le doute, l'incertitude, la crainte, etc., de PEP 3131 en
> montrant des exemples.

Not really; I think everyone agrees that you *can* produce well-written
code with non-ASCII identifiers.

The concerns are with not-so-well written code, or code that is mostly
ASCII with a few non-ASCII characters (or ids) thrown in around line
343.

-jJ

From showell30 at yahoo.com  Wed Jun 13 04:01:43 2007
From: showell30 at yahoo.com (Steve Howell)
Date: Tue, 12 Jun 2007 19:01:43 -0700 (PDT)
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
Message-ID: <776344.63529.qm@web33512.mail.mud.yahoo.com>

--- Jim Jewett wrote:
> On 6/12/07, Steve Howell wrote:
>
> > In my best franglais: je pense que les avocats de PEP 3131
> > pourraient surmonter le doute, l'incertitude, la crainte, etc., de
> > PEP 3131 en montrant des exemples.
>
> Not really; I think everyone agrees that you *can* produce
> well-written code with non-ASCII identifiers.
>
> The concerns are with not-so-well written code, or code that is
> mostly ASCII with a few non-ASCII characters (or ids) thrown in
> around line 343.

But then I would extend the same challenge to you. Post a piece of code
that follows the pattern that you predict, and then see if an actual
example of not-so-well-written, non-ASCII-pure code can resonate in the
minds of folks who aren't able to imagine the validity of your points
without an example in front of them.

I know that Ping has produced one example of deceptive non-ASCII-pure
code, but it didn't really sway me, even though I've been basically
sympathetic to his overall conservatism about either keeping ASCII
purity or introducing Unicode only with some proper safeguards.

From martin at v.loewis.de  Wed Jun 13 05:57:36 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Wed, 13 Jun 2007 05:57:36 +0200
Subject: [Python-3000] Support for PEP 3131
In-Reply-To: 
References: <19685.97380.qm@web33511.mail.mud.yahoo.com>
	<466C5730.3060003@v.loewis.de>
Message-ID: <466F6B30.3070308@v.loewis.de>

>> As I am not going to be interested in trying to
>> understand code written in Chinese, Russian, etc., I'm not bothered by
>> the idea that someone might write code I will have a strong
>> disincentive to read.
>>
> The question is: is it worth it? Will the new feature allow more useful
> code to be written, or will it cause unnecessary duplication of effort?
> Probably both, but I cannot tell in which proportions, and neither can
> you, I guess.

This question has already been decided; PEP 3131 is accepted. So please
stop questioning it fundamentally.
Regards, Martin From turnbull at sk.tsukuba.ac.jp Wed Jun 13 06:28:23 2007 From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull) Date: Wed, 13 Jun 2007 13:28:23 +0900 Subject: [Python-3000] String comparison In-Reply-To: References: <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp> <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp> Rauli Ruohonen writes: > In my mind everything in a Python program is within a single > Unicode process, Which is a *serious* mistake. It is *precisely* the mistake that leads to mixing UTF-16 and UCS-2 interpretations in the standard library. What you are saying is that if you write a 10-line script that claims Unicode conformance, you are responsible for the Unicode- correctness of all modules you call implicitly as well as that of the Python interpreter. This is what I mean by "Unicode conformance is not a goal of the language." Now, it's really not so bad. If you look at what MAL and MvL are doing (inter alia, it's their work I'm most familiar with), what you will see is that they are gradually implementing conformant modules here and there. Eg, I am sure it is not MvL's laziness or inability to come up with a reasonable spec himself that causes PEP 3131 to be a profile of UAX #31. > Actually, I said that there's no way to always do the right thing as long > as they are mixed, but that was a too theoretical argument. Practically > speaking, there's little need to interpret surrogate pairs as two > code points instead of as one non-BMP code point. Again, a mistake. In the standard library, the question is not "do I need this?", but "what happens if somebody else does it?" They may receive the same answer, but then again they may not. For example, suppose you have a supplier-consumer pair sharing a fixed-length buffer of 2-octet code units. If it should happen that the supplier uses the UCS-2 interpretation, then a surrogate pair may get split when the buffer is full. Will a UTF-16 consumer be prepared for this? Almost surely some will not, because that would imply maintaining an internal buffer, which is stupidly inefficient if you have an external buffer protocol. Note that an UTF-16 supplier feeding a UCS-2 consumer will have no problems (unless the UCS-2 consumer can't handle "short reads", but that's unlikely), and if you have a chain starting with a UTF-16 source, then none of the downstream UTF-16 processes have a problem. The problem is, suppose somehow you get a UCS-2 source? Whose responsibility is it to detect that? > Java and C# (and thus Jython and IronPython too) also sometimes use > UCS-2, sometimes UTF-16. As long as it works as you expect, there > isn't a problem, really. That depends on how big a penalty you face if you break a promise of conformance to your client. Death, taxes, and Murphy's Law are inescapable. > On UCS-4 builds of CPython it's the same (either UCS-4 or UTF-32 with the > extension that surrogates work as in UTF-16), but you get the extra > complication that some equal strings don't compare equal, e.g. > u'\U00010000' != u'\ud800\udc00'. Even that doesn't cause problems in > practice, because you shouldn't have strings like u'\ud800\udc00' in the > first place. But the Unicode standard itself gives (the equivalent of) u'\ud800' + u'\udc00' as an example of the kind of thing you *should be able to do*. 
Because, you know, clients of the standard library *will* be doing half-witted[1] things like that. Footnotes: [1] What I wanted to say was ????????? From stephen at xemacs.org Wed Jun 13 06:43:33 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 13 Jun 2007 13:43:33 +0900 Subject: [Python-3000] Support for PEP 3131 In-Reply-To: References: <20070524234516.8654.JCARLSON@uci.edu> <4656920F.9040001@v.loewis.de> <20070525091105.8663.JCARLSON@uci.edu> <466C5BB7.8050909@v.loewis.de> <466DBE10.1070804@v.loewis.de> <466DC90E.1070009@v.loewis.de> <87fy4xg9j3.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87ps40ea6i.fsf@uwakimon.sk.tsukuba.ac.jp> Jim Jewett writes: > On 6/11/07, Stephen J. Turnbull wrote: > > But this is something that only a small subset of developers-of-Python > > seem to be concerned about. This is a statement about the politics of changing an accepted PEP. Without massive outcry, ain' agonna happ'm, Cap'n. Remember, I've been in your camp w.r.t. "Python should provide auditing" throughout. If we can't get it in the language, I'm looking for an *existing mechanism*. > The "codec does a bit more than you expect" option has been available > since 2.3 for people who want an expanded ID charset. (Just > transliterate the extra characters into the moral equivalent of an > escape.) It doesn't seem to have been used. I would have done it immediately after PEP 263 if I had known how to implement codecs. It doesn't surprise me nobody else has done it. This concept is probably sufficiently unobvious to meet the USPTO criteria for patentability. > > That doesn't mollify those who think I should not be allowed to use > > non-ASCII identifiers at all. > > There is a subtle distinction there. I am among those who think you > should not use non-ASCII identifiers *without an explicit > declaration.* So you're saying that you want to impose this check on all Python *users* (not the developers; you can refuse the developers' code yourself). Fine, if you can get that past Guido and Martin. All I'm trying to do here is find a way that *you* and *Josiah* can get what you want in *your* installations, with existing mechanisms. If you want to make it a default, discuss that with Guido and Martin; that requires modifying the PEP. From rauli.ruohonen at gmail.com Wed Jun 13 12:24:33 2007 From: rauli.ruohonen at gmail.com (Rauli Ruohonen) Date: Wed, 13 Jun 2007 13:24:33 +0300 Subject: [Python-3000] String comparison In-Reply-To: <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp> References: <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp> <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 6/13/07, Stephen J. Turnbull wrote: > What you are saying is that if you write a 10-line script that claims > Unicode conformance, you are responsible for the Unicode-correctness of > all modules you call implicitly as well as that of the Python interpreter. If text files are by default read in normalized and noncharacters stripped, where will you get problems in practice? A higher-level string type may be useful, but there's no single obvious design. > > Practically speaking, there's little need to interpret surrogate pairs > > as two code points instead of as one non-BMP code point. > > Again, a mistake. In the standard library, the question is not "do I > need this?", but "what happens if somebody else does it?" They may > receive the same answer, but then again they may not. 
What I meant is that the stdlib should only have string operations that effectively work on (1) sequences of code units or (2) sequences of code points, and that the choice between these two should be made reasonably. One way to check whether a choice is reasonable is to consider what it would mean for UTF-8, as there the difference between code units (0...ff) and code points (0...10ffff) is the easiest to see. E.g. normalization doesn't make any sense on code units, but slicing does. Once you have determined that the reasonable choice is code points for some operation in general, then you shouldn't use the UCS-2 interpretation for 16-bit strings in particular, because it muddies the underlying rule, and Unicode is clear as mud without extra muddying already :-) > For example, suppose you have a supplier-consumer pair sharing a > fixed-length buffer of 2-octet code units. If it should happen that > the supplier uses the UCS-2 interpretation, then a surrogate pair may > get split when the buffer is full. I.e. you have a supplier that works on code units. If you document this, then there's no problem, especially if that's what the user expects. > Will a UTF-16 consumer be prepared for this? This also needs to be documented, especially if it isn't. The consumer is more useful if it is prepared for it. I've been excavating some Cambrian period discussions on the topic recently, and this brings one post to mind: http://mail.python.org/pipermail/i18n-sig/2001-June/001010.html > Almost surely some will not, because that would imply maintaining an > internal buffer, which is stupidly inefficient if you ave an external > buffer protocol. You only need to buffer one code unit at most, it's not inefficient. > The problem is, suppose somehow you get a UCS-2 source? Whose > responsibility is it to detect that? The user should check the API documentation. If the documentation is missing, then you have to test or UTSL it (testing is good to do anyway). If the documentation is wrong, then it's a bug. > But the Unicode standard itself gives (the equivalent of) u'\ud800' + > u'\udc00' as an example of the kind of thing you *should be able to > do*. Because, you know, clients of the standard library *will* be > doing half-witted[1] things like that. For UTF-16, yes, but for UTF-32, no. Any surrogate code units make UTF-32 ill-formed, so there's no need to use them to make UTF-32 strings. In UTF-16 surrogate pairs are allowed, and allowing isolated surrogates makes some operations simpler. Kind of like negative integers make calculations simpler, even if the end result is always non-negative. Python itself has both UTF-16 and UTF-32 behavior on UCS-4 builds, but that's an original invention probably intended to make code written for UTF-16 work unchanged on UCS-4 builds, following the rule "be lenient in what you accept and strict in what you emit". > Footnotes: > [1] What I wanted to say was ????????? ????????? From stephen at xemacs.org Wed Jun 13 21:05:35 2007 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Thu, 14 Jun 2007 04:05:35 +0900 Subject: [Python-3000] String comparison In-Reply-To: References: <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp> <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp> Rauli Ruohonen writes: > What I meant is that the stdlib should only have string operations > that effectively work on (1) sequences of code units or (2) > sequences of code points, and that the choice between these two > should be made reasonably. I think we've reached a dead end. AIUI, that's a matter for a PEP, and the window for Python 3 is closed. I'm pretty sure that Python 3 is going to have sequences of code units only (I know, Guido said "code points", but I doubt he's read TR#17), except that people will sneak in some UTF-16 behavior where it seems useful. Until one or more of the senior developers says otherwise, I'm going to assume that. From guido at python.org Wed Jun 13 22:03:39 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 13 Jun 2007 13:03:39 -0700 Subject: [Python-3000] String comparison In-Reply-To: <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp> References: <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp> <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 6/13/07, Stephen J. Turnbull wrote: > Rauli Ruohonen writes: > > > What I meant is that the stdlib should only have string operations > > that effectively work on (1) sequences of code units or (2) > > sequences of code points, and that the choice between these two > > should be made reasonably. > > I think we've reached a dead end. AIUI, that's a matter for a PEP, > and the window for Python 3 is closed. I'm pretty sure that Python 3 > is going to have sequences of code units only (I know, Guido said > "code points", but I doubt he's read TR#17), except that people will > sneak in some UTF-16 behavior where it seems useful. > > Until one or more of the senior developers says otherwise, I'm going > to assume that. Yeah, what's the difference between code units and points? -- --Guido van Rossum (home page: http://www.python.org/~guido/) From martin at v.loewis.de Wed Jun 13 22:30:09 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 13 Jun 2007 22:30:09 +0200 Subject: [Python-3000] String comparison In-Reply-To: <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp> References: <87hcpjhoyi.fsf@uwakimon.sk.tsukuba.ac.jp> <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp> <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <467053D1.7050107@v.loewis.de> > I think we've reached a dead end. AIUI, that's a matter for a PEP, > and the window for Python 3 is closed. I'm pretty sure that Python 3 > is going to have sequences of code units only (I know, Guido said > "code points", but I doubt he's read TR#17), except that people will > sneak in some UTF-16 behavior where it seems useful. > > Until one or more of the senior developers says otherwise, I'm going > to assume that. I think it is *very* likely that Python 3 will work that way. There isn't anything remotely that might look like an implementation of an alternative. 
Regards, Martin From martin at v.loewis.de Wed Jun 13 22:37:45 2007 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 13 Jun 2007 22:37:45 +0200 Subject: [Python-3000] String comparison In-Reply-To: References: <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp> <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <46705599.8090301@v.loewis.de> >> Until one or more of the senior developers says otherwise, I'm going >> to assume that. > > Yeah, what's the difference between code units and points? A code unit is the atomic base in some encoding. It is a single byte in most encodings, but a 16-bit quantity in UTF-16 (and a 32-bit quantity in UTF-32). A code point is something that has a 1:1 relationship with a logical character (in particular, a Unicode character). In UCS-2, a code point can be represented in 16 bits, and you can represent all BMP characters. The low and high surrogates don't encode characters and are reserved. In UCS-4, you need more than 16 bits to represent a code point. For example, you might use UTF-16, where you can use a single code unit for all BMP characters, and two of them for code points above U+FFFF. Ever since PEP 261, Python admits that the elements of a Unicode string are code units, and that you might need more than one of them (specifically, for non-BMP characters in a narrow build) to represent a code point. Regards, Martin From guido at python.org Wed Jun 13 23:05:21 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 13 Jun 2007 14:05:21 -0700 Subject: [Python-3000] String comparison In-Reply-To: <46705599.8090301@v.loewis.de> References: <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp> <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp> <46705599.8090301@v.loewis.de> Message-ID: On 6/13/07, "Martin v. L?wis" wrote: > >> Until one or more of the senior developers says otherwise, I'm going > >> to assume that. > > > > Yeah, what's the difference between code units and points? > > A code unit is the atomic base in some encoding. It is a single byte > in most encodings, but a 16-bit quantity in UTF-16 (and a 32-bit > quantity in UTF-32). > > A code point is something that has a 1:1 relationship with a logical > character (in particular, a Unicode character). > > In UCS-2, a code point can be represented in 16 bits, and you can > represent all BMP characters. The low and high surrogates don't > encode characters and are reserved. > > In UCS-4, you need more than 16 bits to represent a code point. > For example, you might use UTF-16, where you can use a single > code unit for all BMP characters, and two of them for code points > above U+FFFF. > > Ever since PEP 261, Python admits that the elements of a Unicode > string are code units, and that you might need more than one of > them (specifically, for non-BMP characters in a narrow build) > to represent a code point. Thanks for clearing that up. It sounds like we really use code units, not code points (except when building with the 4-byte Unicode option, when they are equivalent). Is there anywhere were we use code points, apart from the UTF-8 codecs, which encode properly matched surrogate pairs as a single code point? Is it correct to say that a surrogate in UCS-16 is two code units representing a single code point? 
Apart from the surrogates, are there code points that aren't characters? Are there characters that don't have a representation as a single code point? (I know some characters have multiple representations, some of which use multiple code points.) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Wed Jun 13 23:53:50 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 13 Jun 2007 14:53:50 -0700 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: <466E4B22.6020408@ronadam.com> References: <4667CCB2.6040405@ronadam.com> <46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com> <4668D535.7020103@v.loewis.de> <466E4B22.6020408@ronadam.com> Message-ID: I couldn't get this exact patch to apply, but I implemented something equivalent in the py3kstruni branch. See revisions 55964 and 55965. Thanks for the suggestion! --Guido On 6/12/07, Ron Adam wrote: > Guido van Rossum wrote: > > On 6/7/07, "Martin v. L?wis" wrote: > >> >> The os.environ.get() method probably should return a unicode > >> string. (?) > >> > > >> > Indeed -- care to contribute a patch? > >> > >> Ideally, such a patch would make use of the Win32 Unicode API for > >> environment variables on Windows. People had already been complaining > >> that they can't have "funny characters" in the value of an environment > >> variable, even though the UI allows them to set the variable just fine. > > > > Yeah, but the Windows build of py3k is currently badly broken (e.g. > > the _fileio.c extension probably doesn't work at all) -- and I don't > > have access to a Windows box to work on it. I'm afraid 3.0a1 will be > > released without Windows support. Of course I'm counting on others to > > fix that before 3.0 final is released. > > > > I don't mind for now that the posix.environ variable contains 8-bit > > strings -- people shouldn't be importing that anyway. > > > Here's a diff of the patch. It looks like this may be backported to 2.6 > since it isn't Unicode specific but casts to the current str type. > > > > Cast environ keys and values to current python str type in os.py > Added test for environ string types to test_os.py > Fixed test_update2, bug 1110478 test, that was being skipped. > > Test test_tmpfile in test_os.py fails. Haven't looked into it yet. > > > Index: Lib/os.py > =================================================================== > --- Lib/os.py (revision 55924) > +++ Lib/os.py (working copy) > @@ -505,7 +505,8 @@ > def copy(self): > return dict(self) > > - > + # Make sure all environment keys and values are correct str type. > + environ = dict([(str(k), str(v)) for k, v in environ.items()]) > environ = _Environ(environ) > > def getenv(key, default=None): > Index: Lib/test/test_os.py > =================================================================== > --- Lib/test/test_os.py (revision 55924) > +++ Lib/test/test_os.py (working copy) > @@ -266,12 +266,25 @@ > os.environ.clear() > os.environ.update(self.__save) > > +class EnvironTests2(unittest.TestCase): > + """Test os.environ for specific problems.""" > + def setUp(self): > + self.__save = dict(os.environ) > + def tearDown(self): > + os.environ.clear() > + os.environ.update(self.__save) > # Bug 1110478 > def test_update2(self): > if os.path.exists("/bin/sh"): > os.environ.update(HELLO="World") > value = os.popen("/bin/sh -c 'echo $HELLO'").read().strip() > self.assertEquals(value, "World") > + # Verify environ keys and values from the OS are of the > + # correct str type. 
> +    def test_keyvalue_types(self):
> +        for key, val in os.environ.items():
> +            self.assertEquals(type(key), str)
> +            self.assertEquals(type(val), str)
>
>  class WalkTests(unittest.TestCase):
>      """Tests for os.walk()."""
> @@ -466,6 +479,7 @@
>          TemporaryFileTests,
>          StatAttributeTests,
>          EnvironTests,
> +        EnvironTests2,
>          WalkTests,
>          MakedirTests,
>          DevNullTests,
>
>

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From martin at v.loewis.de  Thu Jun 14 00:18:25 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Thu, 14 Jun 2007 00:18:25 +0200
Subject: [Python-3000] String comparison
In-Reply-To: 
References: <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp>
	<87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp>
	<87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp>
	<87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp>
	<46705599.8090301@v.loewis.de>
Message-ID: <46706D31.6030001@v.loewis.de>

> Thanks for clearing that up. It sounds like we really use code units,
> not code points (except when building with the 4-byte Unicode option,
> when they are equivalent). Is there anywhere where we use code points,
> apart from the UTF-8 codecs, which encode properly matched surrogate
> pairs as a single code point?

The literal syntax also supports it: \U00010000 is supported even in a
narrow build, and gets transparently encoded to the corresponding two
code units; likewise for repr(). There is an SF patch to make
unicodedata.lookup support them also.

> Is it correct to say that a surrogate in UCS-16 is two code units
> representing a single code point?

That's my understanding, yes.

> Apart from the surrogates, are there code points that aren't
> characters? Are there characters that don't have a representation as a
> single code point? (I know some characters have multiple
> representations, some of which use multiple code points.)

[assuming you mean "code unit" again]
Not in the Unicode type, no. In the byte string type, this happens all
the time with multi-byte encodings.

[assuming you really mean "code point" in the first question]
There are numerous unassigned code points in Unicode, i.e. they don't
represent a character *yet*. There are also several code points that
are "noncharacters", in particular U+FFFE and U+FFFF. These are
permanently reserved and should never be interpreted as abstract
characters (rule C5). FFFE is reserved because it is the byte-swapped
BOM; I believe FFFF is reserved so that APIs can use -1 as an error
value. (FWIW, U+FFFD *is* assigned and means "REPLACEMENT CHARACTER",
�.)

As for "combining characters": I think the Unicode terminology really
is that they are separate characters. They get combined into a single
grapheme, and different character sequences might be considered as
equivalent under canonical forms - but the decomposed ö (o + combining
diaeresis) actually is understood as a two-character (i.e.
two-codepoint) sequence. Whether that matches the intuitive definition
of "character", I don't know - and I'm sure somebody will correct me if
I presented it incorrectly.

Regards,
Martin
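The build dependence and the combining-character point are easy to
demonstrate at the interpreter prompt. The session below shows a narrow
2.x build; the comments mark where a wide build differs:

    >>> len(u'\U00010000')     # 2 on a narrow build, 1 on a wide build
    2
    >>> u'\U00010000' == u'\ud800\udc00'   # False on a wide (UCS-4) build
    True
    >>> import unicodedata
    >>> unicodedata.normalize('NFC', u'o\u0308')   # two code points in
    u'\xf6'

where u'\xf6' is the single precomposed code point ö coming out.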
L?wis" wrote: > > A code point is something that has a 1:1 relationship with a logical > > character (in particular, a Unicode character). and > > A code unit is the atomic base in some encoding. It is a single byte > > in most encodings, but a 16-bit quantity in UTF-16 (and a 32-bit > > quantity in UTF-32). ... > Is it correct to say that a surrogate in UCS-16 is two code units > representing a single code point? Basically, assuming you meant both halves of the surrogate pair put together. "A" surrogate often refers to only one of them. > Apart from the surrogates, are there code points that aren't > characters? Yes. The BOM mark, for one. Plenty of other code points are reserved for private use, or not yet assigned, or never will be assigned. There are also some that are explicitly not characters. (U+FD00..U+FDEF), and some that might be debatable (unprinted control characters, or U+FFFC: OBJECT REPLACEMENT CHARACTER) > Are there characters that don't have a representation as a > single code point? (I know some characters have multiple > representations, some of which use multiple code points.) There are plenty of (mostly archaic?) characters which don't (yet?) have an assigned unicode code point. There are also plenty of things that a native speaker may view as a single character, but which unicode treats as (at most) a Named Sequence. -jJ From rrr at ronadam.com Thu Jun 14 01:49:26 2007 From: rrr at ronadam.com (Ron Adam) Date: Wed, 13 Jun 2007 18:49:26 -0500 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: References: <4667CCB2.6040405@ronadam.com> <46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com> <4668D535.7020103@v.loewis.de> <466E4B22.6020408@ronadam.com> Message-ID: <46708286.6090201@ronadam.com> Guido van Rossum wrote: > I couldn't get this exact patch to apply, but I implemented something > equivalent in the py3kstruni branch. See revisions 55964 and 55965. > Thanks for the suggestion! This is actually closer to how I started to do it, but I wasn't sure if it would catch everything. Looking at it again, it looks good with the exception of riscos. (The added test should catch that if it's a problem so it can be fixed later.) The reason I made a new test case for added tests is that the existing test case based on mapping_tests.BasicTestMappingProtocol doesn't run the added test methods. So I put those under a new test case based on unittest.TestCase. I can't re-verify this currently because the latest merge broke something in my build process. I'm getting a "lost stderr" message. I've seen it before so it's probably something on my end. I think the last time this happened to me I was able to clear it up by deleting the branch and re-updating it. Another suggestion is to make a change in stringobject.c to represent 8 bits strings as "str8('somes_tring')" or just s"some_string" so they can more easily be found from unicode strings. Particularly in the tests. This will force a few more tests to fail, but they are things that need to be fixed. Only about 3 or 4 additional modules fail when I tried it. I was getting failed expect/got test cases that looked exactly the same. But after changing the str8 representation those became obvious st8 vs unicode comparisons. Using the shorter 's"string"' form will cause places, where eval or exec are using str8, to cause syntax errors. Which may also be helpful. BTW, I will make a new remove_raw_escapes patch so it applies cleanly. 
I'm trying to track down why my patched version of test_tokenize.py
passes sometimes but not at others. (I think it's either a tempfile or
string io issue, or both.) This was what initiated the above
suggestion. ;-)

Cheers,
Ron

> --Guido
>
> [...]
From guido at python.org  Thu Jun 14 01:56:39 2007
From: guido at python.org (Guido van Rossum)
Date: Wed, 13 Jun 2007 16:56:39 -0700
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <46708286.6090201@ronadam.com>
References: <46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com>
	<4668D535.7020103@v.loewis.de> <466E4B22.6020408@ronadam.com>
	<46708286.6090201@ronadam.com>
Message-ID: 

On 6/13/07, Ron Adam wrote:
>
> Guido van Rossum wrote:
> > I couldn't get this exact patch to apply, but I implemented something
> > equivalent in the py3kstruni branch. See revisions 55964 and 55965.
> > Thanks for the suggestion!
>
> This is actually closer to how I started to do it, but I wasn't sure if
> it would catch everything. Looking at it again, it looks good with the
> exception of riscos. (The added test should catch that if it's a
> problem, so it can be fixed later.)

If riscos is even still supported. ;-(

> The reason I made a new test case for added tests is that the existing
> test case based on mapping_tests.BasicTestMappingProtocol doesn't run
> the added test methods. So I put those under a new test case based on
> unittest.TestCase.

I don't understand this. The test_keyvalue_types() test *does* run,
regardless of whether I use regrtest.py test_os or test_os.py.

> I can't re-verify this currently because the latest merge broke
> something in my build process. I'm getting a "lost stderr" message.
> I've seen it before so it's probably something on my end. I think the
> last time this happened to me I was able to clear it up by deleting the
> branch and re-updating it.

Your best bet is to remove all .pyc files under Lib:
rm `find Lib -name \*.pyc`
(make clean also works)

> Another suggestion is to make a change in stringobject.c to represent
> 8-bit strings as "str8('some_string')" or just s"some_string" so they
> can more easily be found from unicode strings, particularly in the
> tests. This will force a few more tests to fail, but they are things
> that need to be fixed. Only about 3 or 4 additional modules fail when I
> tried it.

I've considered this, but then we should also support that notation on
input. I've also thought of using different string quote conventions,
e.g. "..." to mean Unicode and '...' to mean 8-bit.

> I was getting failed expect/got test cases that looked exactly the
> same. But after changing the str8 representation those became obvious
> str8 vs unicode comparisons.

Right.

> Using the shorter 's"string"' form will cause syntax errors in places
> where eval or exec are using str8. Which may also be helpful.

Why would this help?

> BTW, I will make a new remove_raw_escapes patch so it applies cleanly.
> I'm trying to track down why my patched version of test_tokenize.py
> passes sometimes but not at others. (I think it's either a tempfile or
> string io issue, or both.) This was what initiated the above suggestion.

Please send it as a proper attachment; somehow gmail doesn't make it
easy to extract patches pasted directly into the text (nor "inline"
attachments).
> ;-)
>
> Cheers,
> Ron
>
> [...]
--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rasky at develer.com  Tue Jun 12 18:40:01 2007
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 12 Jun 2007 18:40:01 +0200
Subject: [Python-3000] Pre-PEP on fast imports
In-Reply-To: <20070612162845.5F00A3A407F@sparrow.telecommunity.com>
References: <466DD0D8.7040407@develer.com>
	<20070611231640.8A55E3A407F@sparrow.telecommunity.com>
	<20070612162845.5F00A3A407F@sparrow.telecommunity.com>
Message-ID: <466ECC61.7070505@develer.com>

On 6/12/2007 6:30 PM, Phillip J. Eby wrote:

>> import imp, os, sys
>> from pkgutil import ImpImporter
>>
>> suffixes = set(ext for ext, mode, typ in imp.get_suffixes())
>>
>> class CachedImporter(ImpImporter):
>>     def __init__(self, path):
>>         if not os.path.isdir(path):
>>             raise ImportError("Not an existing directory")
>>         super(CachedImporter, self).__init__(path)
>>         self.refresh()
>>
>>     def refresh(self):
>>         self.cache = set()
>>         for fname in os.listdir(self.path):
>>             base, ext = os.path.splitext(fname)
>>             if ext in suffixes and '.' not in base:
>>                 self.cache.add(base)
>>
>>     def find_module(self, fullname, path=None):
>>         if fullname.split(".")[-1] not in self.cache:
>>             return None  # no need to check further
>>         return super(CachedImporter, self).find_module(fullname, path)
>>
>> sys.path_hooks.append(CachedImporter)
>
> After a bit of reflection, it seems the refresh() method needs to be a
> bit different:
>
>     def refresh(self):
>         cache = set()
>         for fname in os.listdir(self.path):
>             base, ext = os.path.splitext(fname)
>             if not ext or (ext in suffixes and '.' not in base):
>                 cache.add(base)
>         self.cache = cache
>
> This version fixes two problems: first, a race condition could occur if
> you called refresh() while an import was taking place in another
> thread. This version fixes that by only updating self.cache after the
> new cache is completely built.
>
> Second, the old version didn't handle packages at all. This version
> handles them by treating extension-less filenames as possible package
> directories. I originally thought this should check for a subdirectory
> and __init__, but this could get very expensive if a sys.path directory
> has a lot of subdirectories (whether or not they're packages). Having
> false positives in the cache (i.e. names that can't actually be
> imported) could slow things down a bit, but *only* if those names match
> something you're trying to import. Thus, it seems like a reasonable
> trade-off versus needing to scan every subdirectory at startup or even
> to check whether all those names *are* subdirectories.

There are another couple of things I'll fix as soon as I try it. First,
I'd call refresh() lazily on the first find_module, because I don't
want to listdir() directories on sys.path that will never be accessed.

The idea of using sys.path_hooks is very clever (I hadn't thought of
it... because I didn't know of path_hooks in the first place! It
appears to be undocumented and sparsely indexed by google as well), and
it will probably help me a lot in my task of fixing this problem in the
2.x series.
--
Giovanni Bajo
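One wiring detail the snippets above leave implicit: sys.path entries
are checked against sys.path_importer_cache before sys.path_hooks is
consulted, so a hook appended after some imports have already happened
will not see previously-visited directories unless that cache is
flushed. Assuming the CachedImporter class from the quoted code:

    import sys

    sys.path_hooks.append(CachedImporter)
    # Entries already visited are bound to their old importer (or to
    # None, meaning "use the classic machinery"), so clear the cache:
    sys.path_importer_cache.clear()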
-- Giovanni Bajo From rrr at ronadam.com Thu Jun 14 04:13:44 2007 From: rrr at ronadam.com (Ron Adam) Date: Wed, 13 Jun 2007 21:13:44 -0500 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: References: <46687316.8090109@v.loewis.de> <466892B7.4050108@ronadam.com> <4668D535.7020103@v.loewis.de> <466E4B22.6020408@ronadam.com> <46708286.6090201@ronadam.com> Message-ID: <4670A458.7050206@ronadam.com> Guido van Rossum wrote: > On 6/13/07, Ron Adam wrote: >> >> >> Guido van Rossum wrote: >> > I couldn't get this exact patch to apply, but I implemented something >> > equivalent in the py3kstruni branch. See revisions 55964 and 55965. >> > Thanks for the suggestion! >> >> This is actually closer to how I started to do it, but I wasn't sure >> if it >> would catch everything. Looking at it again, it looks good with the >> exception of riscos. (The added test should catch that if it's a problem >> so it can be fixed later.) > > If riscos is even still supported. ;-( I have no idea. Looking at the overall structure of os.py makes me think the platform specific code could be abstracted out a bit further. Possibly have one public "platform" module (or package) that is an alias or built from private _platform package files. So instead of having "import mac" or "from mac import ..." in if-else structures, just do "from platform import ...". That moves all the platform testing to either the build process or as part of site.py so it can set 'platform' to the correct platform module or package. After that everything else is platform independent (or mostly). >> The reason I made a new test case for added tests is that the existing >> test >> case based on mapping_tests.BasicTestMappingProtocol doesn't run the >> added >> test methods. So I put those under a new test case based on >> unittest.TestCase. > > I don't understand this. The test_keyvalue_types() test *does* run, > regardless of whether I use regrtest.py test_os or test_os.py. Just tested it again and you are right. I did test it earlier and it did not run those tests when I wrote the test exactly as you did. (So if it was broke, it got fixed someplace else.) >> I can't re-verify this currently because the latest merge broke something >> in my build process. I'm getting a "lost stderr" message. I've seen it >> before so it's probably something on my end. I think the last time this >> happened to me I was able to clear it up by deleting the branch and >> re-updating it. > > Your best bet is to remove all .pyc files under Lib: rm `find Lib -name > \*.pyc` > (make clean also works) You fixed this when you added the missing abc.py file. >> Another suggestion is to make a change in stringobject.c to represent >> 8-bit strings as "str8('some_string')" or just s"some_string" so they can >> more easily be found from unicode strings. Particularly in the tests. >> This will force a few more tests to fail, but they are things that >> need to >> be fixed. Only about 3 or 4 additional modules fail when I tried it. > > I've considered this, but then we should also support that notation on > input. I've also thought of using different string quote conventions, > e.g. "..." to mean Unicode and '...' to mean 8-bit. Are str8 types going to be part of the final distribution? I thought the goal was to eventually remove all of those wherever possible. I think "" vs '' is too subtle. >> I was getting failed expect/got test cases that looked exactly the same.
>> But after changing the str8 representation those became obvious str8 vs >> unicode comparisons. > > Right. > >> Using the shorter 's"string"' form will cause places, where eval or exec >> are using str8, to cause syntax errors. Which may also be helpful. > > Why would this help? This would be only a temporary debugging aid to be removed later. Often eval and exec get their inputs from temporary files or other file-like sources. So this moves the point of failure a bit closer to the problem in these cases. I don't think there should be any places where a str8 string created by a Python program will be used this way; those will be unicode strings. Think of it as just another test, but it's more general in scope than a highly specific unit test with usually very controlled inputs. And its purpose is to help expose some harder-to-find problems, not the easy-to-fix ones. >> BTW, I will make a new remove_raw_escapes patch so it applies cleanly. >> I'm trying to track down why my patched version of test_tokenize.py >> passes >> sometimes but not at others. (I think it's either a tempfile or >> string io >> issue, or both.) This was what initiated the above suggestion. > > Please send it as a proper attachment; somehow gmail doesn't make it > easy to extract patches pasted directly into the text (nor "inline" > attachments). Ok, will do. I'll update the patch on the patch tracker since it's already started as well. Cheers, Ron From guido at python.org Thu Jun 14 04:25:15 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 13 Jun 2007 19:25:15 -0700 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: <4670A458.7050206@ronadam.com> References: <466892B7.4050108@ronadam.com> <4668D535.7020103@v.loewis.de> <466E4B22.6020408@ronadam.com> <46708286.6090201@ronadam.com> <4670A458.7050206@ronadam.com> Message-ID: On 6/13/07, Ron Adam wrote: > Looking at the overall structure of os.py makes me think the platform > specific code could be abstracted out a bit further. Possibly have one > public "platform" module (or package) that is an alias or built from > private _platform package files. > > So instead of having "import mac" or "from mac import ..." in if-else > structures, just do "from platform import ...". That moves all the > platform testing to either the build process or as part of site.py so it > can set 'platform' to the correct platform module or package. After that > everything else is platform independent (or mostly). Yeah, but I'm not going to rewrite the standard library -- I'm only going to keep the current architecture working. Others will have to help with improving the architecture. You have the right idea -- can you make it work as a patch? > You fixed this when you added the missing abc.py file. Sorry about that. I think it was a svnmerge glitch; I didn't notice it until long after the merge. > Are str8 types going to be part of the final distribution? I thought the > goal was to eventually remove all of those wherever possible. I don't know yet. There's been a cry for an "immutable bytes" type -- it could be str8 (perhaps renamed). Also, much C code doesn't deal with Unicode strings yet and expects char* strings whose lifetime is the same as the Unicode string. Having a str8 permanently attached to the Unicode string is a convenient solution -- especially since it's already implemented. :-) > I think "" vs '' is too subtle. Fair enough. > >> I was getting failed expect/got test cases that looked exactly the same.
> >> But after changing the str8 representation those became obvious str8 vs > >> unicode comparisons. > > > > Right. > > > >> Using the shorter 's"string"' form will cause places, where eval or exec > >> are using str8, to cause syntax errors. Which may also be helpful. > > > > Why would this help? > > This would be only a temporary debugging aid to be removed later. Often > eval and exec get their inputs from temporary files or other file-like > sources. So this moves the point of failure a bit closer to the problem in > these cases. I don't think there should be any places where a str8 string > created by a Python program will be used this way; those will be > unicode strings. > > Think of it as just another test, but it's more general in scope than a > highly specific unit test with usually very controlled inputs. And its > purpose is to help expose some harder-to-find problems, not the easy-to-fix > ones. Makes some sense. Could you come up with a patch? -- --Guido van Rossum (home page: http://www.python.org/~guido/) From brett at python.org Thu Jun 14 06:01:43 2007 From: brett at python.org (Brett Cannon) Date: Wed, 13 Jun 2007 21:01:43 -0700 Subject: [Python-3000] Pre-PEP on fast imports In-Reply-To: <466ECC61.7070505@develer.com> References: <466DD0D8.7040407@develer.com> <20070611231640.8A55E3A407F@sparrow.telecommunity.com> <20070612162845.5F00A3A407F@sparrow.telecommunity.com> <466ECC61.7070505@develer.com> Message-ID: On 6/12/07, Giovanni Bajo wrote: > > On 6/12/2007 6:30 PM, Phillip J. Eby wrote: > > >> import imp, os, sys > >> from pkgutil import ImpImporter > >> > >> suffixes = set(ext for ext,mode,typ in imp.get_suffixes()) > >> > >> class CachedImporter(ImpImporter): > >> def __init__(self, path): > >> if not os.path.isdir(path): > >> raise ImportError("Not an existing directory") > >> super(CachedImporter, self).__init__(path) > >> self.refresh() > >> > >> def refresh(self): > >> self.cache = set() > >> for fname in os.listdir(self.path): > >> base, ext = os.path.splitext(fname) > >> if ext in suffixes and '.' not in base: > >> self.cache.add(base) > >> > >> def find_module(self, fullname, path=None): > >> if fullname.split(".")[-1] not in self.cache: > >> return None # no need to check further > >> return super(CachedImporter, self).find_module(fullname, > >> path) > >> > >> sys.path_hooks.append(CachedImporter) > > > > After a bit of reflection, it seems the refresh() method needs to be a > > bit different: > > > > def refresh(self): > > cache = set() > > for fname in os.listdir(self.path): > > base, ext = os.path.splitext(fname) > > if not ext or (ext in suffixes and '.' not in base): > > cache.add(base) > > self.cache = cache > > > > This version fixes two problems: first, a race condition could occur if > > you called refresh() while an import was taking place in another > > thread. This version fixes that by only updating self.cache after the > > new cache is completely built. > > > > Second, the old version didn't handle packages at all. This version > > handles them by treating extension-less filenames as possible package > > directories. I originally thought this should check for a subdirectory > > and __init__, but this could get very expensive if a sys.path directory > > has a lot of subdirectories (whether or not they're packages). Having > > false positives in the cache (i.e. names that can't actually be > > imported) could slow things down a bit, but *only* if those names match > > something you're trying to import.
Thus, it seems like a reasonable > > trade-off versus needing to scan every subdirectory at startup or even > > to check whether all those names *are* subdirectories. > > There are a couple of other things I'll fix as soon as I try it. First is > that I'd call refresh() lazily on the first find_module because I don't > want to listdir() directories on sys.path that will never be accessed. > > The idea of using sys.path_hooks is very clever (I hadn't thought of > it... because I didn't know of path_hooks in the first place! It appears > to be undocumented and sparsely indexed by Google as well), and it will > probably help me a lot in my task of fixing this problem in the 2.x series. PEP 302 documents all of this, but unfortunately it never made it into the official docs. I also have some pseudocode of how import (roughly) works at sandbox/trunk/import_in_py/pseudocode.py . -Brett -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/python-3000/attachments/20070613/8998fc87/attachment.htm From rrr at ronadam.com Thu Jun 14 06:27:49 2007 From: rrr at ronadam.com (Ron Adam) Date: Wed, 13 Jun 2007 23:27:49 -0500 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: References: <466892B7.4050108@ronadam.com> <4668D535.7020103@v.loewis.de> <466E4B22.6020408@ronadam.com> <46708286.6090201@ronadam.com> <4670A458.7050206@ronadam.com> Message-ID: <4670C3C5.4070907@ronadam.com> Guido van Rossum wrote: > On 6/13/07, Ron Adam wrote: >> Looking at the overall structure of os.py makes me think the platform >> specific code could be abstracted out a bit further. Possibly have one >> public "platform" module (or package) that is an alias or built from >> private _platform package files. >> >> So instead of having "import mac" or "from mac import ..." in if-else >> structures, just do "from platform import ...". That moves all the >> platform testing to either the build process or as part of site.py so it >> can set 'platform' to the correct platform module or package. After that >> everything else is platform independent (or mostly). > > Yeah, but I'm not going to rewrite the standard library -- I'm only > going to keep the current architecture working. Others will have to > help with improving the architecture. You have the right idea -- can > you make it work as a patch? Yes, I expect it would be part of the library reorganization which is still down the road a bit. I'll try to look into it a bit more sometime between now and then. Maybe I can get enough of it started and get others motivated to contribute to it. >> You fixed this when you added the missing abc.py file. > > Sorry about that. I think it was a svnmerge glitch; I didn't notice it > until long after the merge. > >> Are str8 types going to be part of the final distribution? I thought the >> goal was to eventually remove all of those wherever possible. > > I don't know yet. There's been a cry for an "immutable bytes" type -- > it could be str8 (perhaps renamed). Also, much C code doesn't deal > with Unicode strings yet and expects char* strings whose lifetime is > the same as the Unicode string. Having a str8 permanently attached to > the Unicode string is a convenient solution -- especially since it's > already implemented. :-) Well I can see where a str8() type with an __encoded_with__ attribute could be useful. It would use a bit more memory, but it won't be the default/primary string type anymore so maybe it's ok.
Then bytes can be bytes, and unicode can be unicode, and str8 can be encoded strings for interfacing with the outside non-unicode world. Or something like that. >> I think "" vs '' is too subtle. > Fair enough. > >> >> I was getting failed expect/got test cases that looked exactly the >> same. >> >> But after changing the str8 representation those became obvious str8 vs >> >> unicode comparisons. >> > >> > Right. >> > >> >> Using the shorter 's"string"' form will cause places, where eval or >> exec >> >> are using str8, to cause syntax errors. Which may also be helpful. >> > >> > Why would this help? >> >> This would be only a temporary debugging aid to be removed later. Often >> eval and exec get their inputs from temporary files or other file-like >> sources. So this moves the point of failure a bit closer to the >> problem in >> these cases. I don't think there should be any places where a str8 >> string >> created by a Python program will be used this way; those will be >> unicode strings. >> >> Think of it as just another test, but it's more general in scope than a >> highly specific unit test with usually very controlled inputs. And its >> purpose is to help expose some harder-to-find problems, not the easy >> to fix >> ones. > > Makes some sense. Could you come up with a patch? Done :-) Attached both the str8 repr as s"..." and s'...', and the latest no_raw_escape patch which I think is complete now and should apply with no problems. I tracked the random fails I am having in test_tokenize.py down to it doing a round trip on random test_*.py files. If one of those files has a problem it causes test_tokenize.py to fail also. So I added a line to the test to output the file name it does the round trip on so those can be fixed as they are found. Let me know if it needs to be adjusted or something doesn't look right. Cheers, Ron -------------- next part -------------- A non-text attachment was scrubbed... Name: norawescape3.diff Type: text/x-patch Size: 18923 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070613/d4f1846d/attachment-0002.bin -------------- next part -------------- A non-text attachment was scrubbed... Name: stingobject_str8repr.diff Type: text/x-patch Size: 758 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070613/d4f1846d/attachment-0003.bin From martin at v.loewis.de Thu Jun 14 08:28:34 2007 From: martin at v.loewis.de ("Martin v. Löwis") Date: Thu, 14 Jun 2007 08:28:34 +0200 Subject: [Python-3000] String comparison In-Reply-To: References: <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp> <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp> <46705599.8090301@v.loewis.de> Message-ID: <4670E012.2090803@v.loewis.de> > Yes. The BOM mark, for one. Actually, the BOM *is* a character: ZERO WIDTH NO-BREAK SPACE, character class Cf. This function of the code point (as a character) is deprecated, though. > There are also some that are explicitly not characters. > (U+FD00..U+FDEF) ??? U+FD00 is ARABIC LIGATURE HAH WITH YEH ISOLATED FORM, U+FDEF is unassigned. Regards, Martin From stephen at xemacs.org Thu Jun 14 09:43:55 2007 From: stephen at xemacs.org (Stephen J.
Turnbull) Date: Thu, 14 Jun 2007 16:43:55 +0900 Subject: [Python-3000] String comparison In-Reply-To: References: <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp> <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp> <46705599.8090301@v.loewis.de> Message-ID: <877iq7dlqc.fsf@uwakimon.sk.tsukuba.ac.jp> Jim Jewett writes: > > Apart from the surrogates, are there code points that aren't > > characters? > Yes. The BOM mark, for one. Nitpick: The BOM *is* a character (FEFF, aka ZERO-WIDTH NO-BREAK SPACE). Its byte-swapped counterpart FFFE is guaranteed *not* to be a character. (Martin wrote that correctly.) FFFF is guaranteed *not* to be a character; in fact all code points U that are equal to FFFE or FFFF modulo 0x10000 are guaranteed not to be characters (ie, the last two characters in each plane). > Plenty of other code points are reserved > for private use, or not yet assigned, Or reserved for use as surrogates, and therefore should never appear in UTF-8 or UTF-32 streams -- but if they do, AIUI they must be passed on uninterpreted unless the API explicitly says what it does with them. > or never will be assigned. There are also some that are explicitly > not characters. (U+FD00..U+FDEF), Guaranteed not to be assigned == not a character. The special range of non-characters is quite a bit smaller, FDD0..FDEF. > and some that might be debatable (unprinted control > characters, or U+FFFC: OBJECT REPLACEMENT CHARACTER) Not a good idea to classify this way. Those *are* characters, and a process may interpret them or not. Python (the language and the stdlib, except where it explicitly says otherwise) definitely should *not* worry about these things. They're characters, that's the most Python needs to know. > > Are there characters that don't have a representation as a > > single code point? (I know some characters have multiple > > representations, some of which use multiple code points.) Not a question that can be answered without reference to a specific application. An application may treat each code point as a character, or it may choose to compose code points (eg, into private space). The most Python might want to do is deal with canonical equivalence, but even then there are issues, such as the ö in the English word coördinate. I would consider the diaeresis as a separate diacritic (meaning "don't pronounce as 'oo', pronounce as 'oh-oh'"), not a component of a single character. There may be even clearer examples. > There are also plenty of things that a native speaker may view as a > single character, but which unicode treats as (at most) a Named > Sequence. Eg, the New Line Function (Unicode's name for "universal newline"), which can be any of the usual suspects (CR, LF, CRLF) depending on context. From rauli.ruohonen at gmail.com Thu Jun 14 14:34:06 2007 From: rauli.ruohonen at gmail.com (Rauli Ruohonen) Date: Thu, 14 Jun 2007 15:34:06 +0300 Subject: [Python-3000] String comparison In-Reply-To: References: <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp> <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp> <46705599.8090301@v.loewis.de> Message-ID: On 6/14/07, Guido van Rossum wrote: > On 6/13/07, "Martin v. Löwis" wrote: > > A code point is something that has a 1:1 relationship with a logical > > character (in particular, a Unicode character). As the word "character" is ambiguous, I'd put it this way: - code point: the smallest unit Unicode deals with that's independent of encoding.
Takes values in range(0, 0x110000) - grapheme (or "grapheme cluster"): what users think of as characters. May consist of multiple code points, e.g. "ä" can be represented with one or two code points. Depends on the language the user speaks > It sounds like we really use code units, not code points (except when > building with the 4-byte Unicode option, when they are equivalent). Not quite equivalent in current Python. From some past discussions I thought this was by design, but now having seen this odd behavior, maybe it isn't:

>>> sys.maxunicode
1114111
>>> x = u'\ud840\udc21'
>>> marshal.loads(marshal.dumps(x)) == x
False
>>> pickle.loads(pickle.dumps(x, 2)) == x
False
>>> pickle.loads(pickle.dumps(x, 1)) == x
False
>>> pickle.loads(pickle.dumps(x)) == x
True
>>>

Pickling should work the same way regardless of protocol, right? And probably should not modify the objects it pickles if it can help it. The reason the above happens is that binary pickles use UTF-8 to encode unicode, and this is what happens with codecs:

>>> u'\ud840\udc21' == u'\U00020021'
False
>>> u'\ud840\udc21'.encode('utf-8').decode('utf-8')
u'\U00020021'
>>> u'\ud840\udc21'.encode('punycode').decode('punycode')
u'\ud840\udc21'
>>> u'\ud840\udc21'.encode('utf-16').decode('utf-16')
u'\U00020021'
>>> u'\U00020021'.encode('utf-16').decode('utf-16')
u'\U00020021'
>>> u'\ud840\udc21'.encode('big5hkscs').decode('big5hkscs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'big5hkscs' codec can't encode character u'\ud840' in position 0: illegal multibyte sequence
>>> u'\U00020021'.encode('big5hkscs').decode('big5hkscs')
u'\U00020021'
>>>

Should codecs treat u'\ud840\udc21' and u'\U00020021' the same even on UCS-4 builds (like current UTF-8 and UTF-16 codecs do) or not (like current punycode and big5hkscs codecs do)? From rauli.ruohonen at gmail.com Thu Jun 14 15:51:09 2007 From: rauli.ruohonen at gmail.com (Rauli Ruohonen) Date: Thu, 14 Jun 2007 16:51:09 +0300 Subject: [Python-3000] String comparison In-Reply-To: <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp> References: <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp> <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 6/13/07, Stephen J. Turnbull wrote: > except that people will sneak in some UTF-16 behavior where it seems useful. How about sneaking these in py3k-struni: - chr(i) returns a len-1 or len-2 string for all i in range(0, 0x110000) and ord(chr(i)) == i for all i in range(0, 0x110000) - unicodedata.name(chr(i)) returns the same result for all i on both UCS-2 and UCS-4 builds (and same for bidirectional(), category(), combining(), decimal(), decomposition(), digit(), east_asian_width(), mirrored() and numeric() in unicodedata) - return len-1 or len-2 strings on unicodedata.lookup(), instead of always len-1 strings (e.g. unicodedata.lookup('AEGEAN WORD SEPARATOR LINE') returns '\u0100' on UCS-2 builds, but '\U00010100' on UCS-4 builds) - unicodedata.normalize(s) interprets its input as UTF-16 on UCS-2 builds - use ValueError instead of TypeError in the above when passed an inappropriate string, e.g. ord('aa') Any chances?
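For reference, the chr()/ord() behavior asked for in the first item can be prototyped in pure Python on a narrow build; a sketch using 2.x spellings (wide_chr and wide_ord are names invented here, and s/unichr/chr/ applies on the py3k branch):

    def wide_chr(i):
        # One code unit below 0x10000, else a UTF-16 surrogate pair.
        if not 0 <= i <= 0x10FFFF:
            raise ValueError("code point out of range")
        if i < 0x10000:
            return unichr(i)
        i -= 0x10000
        return unichr(0xD800 + (i >> 10)) + unichr(0xDC00 + (i & 0x3FF))

    def wide_ord(s):
        # Accept a lone code unit or a well-formed surrogate pair.
        if len(s) == 1:
            return ord(s)
        if len(s) == 2:
            hi, lo = ord(s[0]), ord(s[1])
            if 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF:
                return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
        raise ValueError("expected a single character")

    # wide_chr(0x20021) == u'\ud840\udc21', the pair from the session above,
    # and wide_ord(wide_chr(0x20021)) == 0x20021.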
From jimjjewett at gmail.com Thu Jun 14 18:54:15 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Thu, 14 Jun 2007 12:54:15 -0400 Subject: [Python-3000] String comparison In-Reply-To: <4670E012.2090803@v.loewis.de> References: <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp> <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp> <46705599.8090301@v.loewis.de> <4670E012.2090803@v.loewis.de> Message-ID: On 6/14/07, "Martin v. Löwis" wrote: > > There are also some that are explicitly not characters. > > (U+FD00..U+FDEF) > ??? U+FD00 is ARABIC LIGATURE HAH WITH YEH ISOLATED FORM, > U+FDEF is unassigned. Sorry; typo on my part. The start of the range is U+FDD0, not U+FD00. I suspect there may be others that are guaranteed never to get an assignment, because of their location. (Example: The character would have to have certain properties or be part of a specific script, but adding more such characters would violate some other stability rule.) -jJ From jimjjewett at gmail.com Thu Jun 14 19:56:20 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Thu, 14 Jun 2007 13:56:20 -0400 Subject: [Python-3000] String comparison In-Reply-To: <877iq7dlqc.fsf@uwakimon.sk.tsukuba.ac.jp> References: <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp> <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp> <46705599.8090301@v.loewis.de> <877iq7dlqc.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 6/14/07, Stephen J. Turnbull wrote: > > There are also plenty of things that a native speaker may view as a > > single character, but which unicode treats as (at most) a Named > > Sequence. > Eg, the New Line Function (Unicode's name for "universal newline"), > which can be any of the usual suspects (CR, LF, CRLF) depending on > context. I hadn't even thought of such abstract characters; I was thinking of (Normative Appendix) UAX 34: Unicode Named Character Sequences at http://unicode.org/reports/tr34/ These are more like ?, or the NJ digraph, except that a single-character equivalent has not been coded (and probably never will be coded -- see http://www.unicode.org/faq/ligature_digraph.html#3). The current list of named sequences is available at http://www.unicode.org/Public/UNIDATA/NamedSequences.txt -jJ From stephen at xemacs.org Thu Jun 14 20:53:52 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 15 Jun 2007 03:53:52 +0900 Subject: [Python-3000] String comparison In-Reply-To: References: <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp> <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp> <46705599.8090301@v.loewis.de> <4670E012.2090803@v.loewis.de> Message-ID: <87tztacqpr.fsf@uwakimon.sk.tsukuba.ac.jp> Jim Jewett writes: > I suspect there may be others that are guaranteed never to get an > assignment, because of their location. (Example: The character would > have to have certain properties or be part of a specific script, but > adding more such characters would violate some other stability rule.) In Unicode 4.1, there are precisely 66. 34 in the highest 2 positions in each page (nnFFFE and nnFFFF), and the 32 point gap from FDD0 to FDEF. The text doesn't explain that latter gap, but does say it's a historical anomaly. I doubt there will ever be any more.
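Stephen's count of 66 is easy to verify mechanically; a small illustrative check (names invented here):

    nonchars = set(range(0xFDD0, 0xFDF0))       # the 32-point FDD0..FDEF gap
    for plane in range(17):                     # planes 0 through 16
        nonchars.add((plane << 16) + 0xFFFE)    # nnFFFE
        nonchars.add((plane << 16) + 0xFFFF)    # nnFFFF
    assert len(nonchars) == 66                  # 32 + 17*2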
From guido at python.org Fri Jun 15 01:57:28 2007 From: guido at python.org (Guido van Rossum) Date: Thu, 14 Jun 2007 16:57:28 -0700 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: <4670C3C5.4070907@ronadam.com> References: <4668D535.7020103@v.loewis.de> <466E4B22.6020408@ronadam.com> <46708286.6090201@ronadam.com> <4670A458.7050206@ronadam.com> <4670C3C5.4070907@ronadam.com> Message-ID: On 6/13/07, Ron Adam wrote: > Well I can see where a str8() type with an __encoded_with__ attribute could > be useful. It would use a bit more memory, but it won't be the > default/primary string type anymore so maybe it's ok. > > Then bytes can be bytes, and unicode can be unicode, and str8 can be > encoded strings for interfacing with the outside non-unicode world. Or > something like that. Hm... Requiring each str8 instance to have an encoding might be a problem -- it means you can't just create one from a bytes object. What would be the use of this information? What would happen on concatenation? On slicing? (Slicing can break the encoding!) > Attached both the str8 repr as s"..." and s'...', and the latest > no_raw_escape patch which I think is complete now and should apply with no > problems. I like the str8 repr patch enough to check it in. > I tracked the random fails I am having in test_tokenize.py down to it doing > a round trip on random test_*.py files. If one of those files has a > problem it causes test_tokenize.py to fail also. So I added a line to the > test to output the file name it does the round trip on so those can be > fixed as they are found. > > Let me know if it needs to be adjusted or something doesn't look right. Well, I'm still philosophically uneasy with r'\' being a valid string literal, for various reasons (one being that writing a string parser becomes harder and harder). I definitely want r'\u1234' to be a 6-character string, however. Do you have a patch that does just that? (We can argue over the rest later in a larger forum.) -- --Guido van Rossum (home page: http://www.python.org/~guido/) From rrr at ronadam.com Fri Jun 15 06:51:10 2007 From: rrr at ronadam.com (Ron Adam) Date: Thu, 14 Jun 2007 23:51:10 -0500 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: References: <4668D535.7020103@v.loewis.de> <466E4B22.6020408@ronadam.com> <46708286.6090201@ronadam.com> <4670A458.7050206@ronadam.com> <4670C3C5.4070907@ronadam.com> Message-ID: <46721ABE.9090203@ronadam.com> Guido van Rossum wrote: > On 6/13/07, Ron Adam wrote: >> Well I can see where a str8() type with an __encoded_with__ attribute >> could >> be useful. It would use a bit more memory, but it won't be the >> default/primary string type anymore so maybe it's ok. >> >> Then bytes can be bytes, and unicode can be unicode, and str8 can be >> encoded strings for interfacing with the outside non-unicode world. Or >> something like that. > > Hm... Requiring each str8 instance to have an encoding might be a > problem -- it means you can't just create one from a bytes object. > What would be the use of this information? What would happen on > concatenation? On slicing? (Slicing can break the encoding!) Round trips to and from bytes should work just fine. Why would that be a problem? There really is no safety in concatenation and slicing of encoded 8-bit strings now. If by accident two strings of different encodings are combined, then all bets are off.
And since there is no way to ask a string what its current encoding is, it becomes an easy-to-make and hard-to-find silent error. So we have to be very careful not to mix encoded strings with different encodings. It's not too different from trying to find the current unicode and str8 issues in the py3k-struni branch. Concatenating str8 and str types is a bit safer, as long as the str8 is in "the" default encoding, but it may still be an unintended implicit conversion. And if it's not in the default encoding, then all bets are off again. The use would be in ensuring the integrity of encoded strings. Concatenating strings with different encodings could then produce errors. Explicit casting could automatically decode and encode as needed. Which would eliminate a lot of encode/decode confusion. This morning I was thinking all of this could be done as a module that possibly uses metaclasses or mixins to create encoded string types. Then it wouldn't need an attribute on the instances. Possibly someone has already done something along those lines? But back to the issues at hand... >> Attached both the str8 repr as s"..." and s'...', and the latest >> no_raw_escape patch which I think is complete now and should apply >> with no >> problems. > > I like the str8 repr patch enough to check it in. > >> I tracked the random fails I am having in test_tokenize.py down to it >> doing >> a round trip on random test_*.py files. If one of those files has a >> problem it causes test_tokenize.py to fail also. So I added a line to >> the >> test to output the file name it does the round trip on so those can be >> fixed as they are found. >> >> Let me know if it needs to be adjusted or something doesn't look right. > > Well, I'm still philosophically uneasy with r'\' being a valid string > literal, for various reasons (one being that writing a string parser > becomes harder and harder). Hmmm.. It looks to me the thing that makes it somewhat hard is in determining whether or not it's a single-quote, empty-single-quote, or triple-quote string. I made some improvements to that in tokenize.c although it may not be clear from just looking at the unified diff. After that, it was just a matter of checking a !is_raw_str flag before always blindly accepting the following character. Before that it was a matter of doing that, and checking the quote type status, as well, which wasn't intuitive since the string parsing loop was entered before the beginning quote type was confirmed. I can remove the raw string flag and flag-check and leave the other changes in or revert the whole file back. Any preference? The latter makes it an easy, approximately three-line change to add r'\' support back in. I'll have to look at tokenize.py again to see what needs to be done there. It uses regular expressions to parse the file. > I definitely want r'\u1234' to be a > 6-character string, however. Do you have a patch that does just that? > (We can argue over the rest later in a larger forum.) I can split the patch into two patches. And the second allow escape at end of strings patch can be reviewed later. What about br'\'? Should that be excluded also?
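To make the scanner point concrete: a toy scanner for single-quoted literals needs only one extra flag to let a raw string end in a backslash. This is a rough sketch of the *proposed* behavior (what Ron's patch allows, not what standard Python does), and it is not the real tokenize.c logic:

    def scan_string(text, pos, is_raw):
        # text[pos] is the opening quote; return the index just past the close.
        quote = text[pos]
        i = pos + 1
        while i < len(text):
            ch = text[i]
            if ch == '\\' and not is_raw:
                i += 2          # consume the escaped character too
                continue
            i += 1
            if ch == quote:
                return i
        raise SyntaxError("unterminated string literal")

    # Under this rule r'\' is a complete one-character string:
    # scan_string("'\\'", 0, is_raw=True) returns 3, while with
    # is_raw=False the same input is an unterminated literal.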
Ron From martin at v.loewis.de Fri Jun 15 08:03:41 2007 From: martin at v.loewis.de ("Martin v. Löwis") Date: Fri, 15 Jun 2007 08:03:41 +0200 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: <46721ABE.9090203@ronadam.com> References: <4668D535.7020103@v.loewis.de> <466E4B22.6020408@ronadam.com> <46708286.6090201@ronadam.com> <4670A458.7050206@ronadam.com> <4670C3C5.4070907@ronadam.com> <46721ABE.9090203@ronadam.com> Message-ID: <46722BBD.5060100@v.loewis.de> >>> Then bytes can be bytes, and unicode can be unicode, and str8 can be >>> encoded strings for interfacing with the outside non-unicode world. Or >>> something like that. >> >> Hm... Requiring each str8 instance to have an encoding might be a >> problem -- it means you can't just create one from a bytes object. >> What would be the use of this information? What would happen on >> concatenation? On slicing? (Slicing can break the encoding!) > > Round trips to and from bytes should work just fine. Why would that be > a problem? I'm strongly opposed to adding encoding information to str8 objects. I think they will eventually go away, anyway; adding that kind of overhead now is both a waste of developer's time and of memory resources; plus it has all the semantic issues that Guido points out. As for creating str8 objects from bytes objects: If you want the str8 object to carry an encoding, you would have to *specify* the encoding when creating the str8 object, since the bytes object does not have that information. This is *very* hard, as you may not know what the encoding is when you need to create the str8 object. > There really is no safety in concatenation and slicing of encoded 8-bit > strings now. If by accident two strings of different encodings are > combined, then all bets are off. And since there is no way to ask a > string what its current encoding is, it becomes an easy-to-make and > hard-to-find silent error. So we have to be very careful not to mix > encoded strings with different encodings. Please answer the question: what would happen on concatenation? In particular, what is the value of the encoding for the result of the concatenated string if one input is "latin-1", and the other one is "utf-8"? It's easy to tell what happens now: the bytes of those input strings are just appended; the result string does not follow a consistent character encoding anymore. This answer does not apply to your proposed modification, as it does not answer what the value of the .encoding attribute of the str8 would be after concatenation (likewise for slicing). > It's not too different from trying to find the current unicode and str8 > issues in the py3k-struni branch. This sentence I do not understand. What is not too different from trying to find issues? > Concatenating str8 and str types is a bit safer, as long as the str8 is > in "the" default encoding, but it may still be an unintended implicit > conversion. And if it's not in the default encoding, then all bets are > off again. Sure. However, the str8 type will go away, and along with it all these issues. > The use would be in ensuring the integrity of encoded strings. > Concatenating strings with different encodings could then produce > errors. Ok. What about slicing?
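The slicing question is easy to demonstrate concretely: whatever encoding tag a byte string carried, a byte-level slice can land mid-character and invalidate it. A small illustration, using py3k-style spellings:

    s = 'naïve'.encode('utf-8')   # b'na\xc3\xafve' -- 6 bytes, 5 characters
    head = s[:3]                  # b'na\xc3' -- cuts the ï in half
    head.decode('utf-8')          # raises UnicodeDecodeError: unexpected end of data

No .encoding attribute on s could make head a valid 'utf-8' string again; the slice itself destroyed the property the tag was supposed to guarantee.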
Regards, Martin From martin at v.loewis.de Fri Jun 15 08:13:29 2007 From: martin at v.loewis.de ("Martin v. Löwis") Date: Fri, 15 Jun 2007 08:13:29 +0200 Subject: [Python-3000] String comparison In-Reply-To: References: <87ejkmhn5b.fsf@uwakimon.sk.tsukuba.ac.jp> <877iqdhhmk.fsf@uwakimon.sk.tsukuba.ac.jp> <87tztgfd88.fsf@uwakimon.sk.tsukuba.ac.jp> <87r6ogeavs.fsf@uwakimon.sk.tsukuba.ac.jp> <87abv3eku8.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <46722E09.3020502@v.loewis.de> > - chr(i) returns a len-1 or len-2 string for all i in range(0, 0x110000) and > ord(chr(i)) == i for all i in range(0, 0x110000) This would contradict an explicit decision in PEP 261. I don't quite remember the rationale for that; however, the PEP mentions that ord() should be symmetric with chr(). Whether it would be acceptable to allow selected length-two strings in ord, I don't know. > - unicodedata.name(chr(i)) returns the same result for all i on both UCS-2 > and UCS-4 builds (and same for bidirectional(), category(), combining(), > decimal(), decomposition(), digit(), east_asian_width(), mirrored() and > numeric() in unicodedata) There is a patch on SF requesting such a change for .lookup. I think this should be done in 2.6, not 3.0. It doesn't have the ord/unichr issue, so I think the same concerns don't apply. > - return len-1 or len-2 strings on unicodedata.lookup(), instead of always > len-1 strings (e.g. unicodedata.lookup('AEGEAN WORD SEPARATOR LINE') > returns '\u0100' on UCS-2 builds, but '\U00010100' on UCS-4 builds) See the patch on SF. > - unicodedata.normalize(s) interprets its input as UTF-16 on UCS-2 builds Definitely; somebody would have to write the code. > - use ValueError instead of TypeError in the above when passed an > inappropriate string, e.g. ord('aa') I'm not sure about this one. The TypeError is deliberate currently. Regards, Martin From rrr at ronadam.com Sat Jun 16 00:38:58 2007 From: rrr at ronadam.com (Ron Adam) Date: Fri, 15 Jun 2007 17:38:58 -0500 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: <46722BBD.5060100@v.loewis.de> References: <4668D535.7020103@v.loewis.de> <466E4B22.6020408@ronadam.com> <46708286.6090201@ronadam.com> <4670A458.7050206@ronadam.com> <4670C3C5.4070907@ronadam.com> <46721ABE.9090203@ronadam.com> <46722BBD.5060100@v.loewis.de> Message-ID: <46731502.9050400@ronadam.com> Martin v. Löwis wrote: >>>> Then bytes can be bytes, and unicode can be unicode, and str8 can be >>>> encoded strings for interfacing with the outside non-unicode world. Or >>>> something like that. >>> Hm... Requiring each str8 instance to have an encoding might be a >>> problem -- it means you can't just create one from a bytes object. >>> What would be the use of this information? What would happen on >>> concatenation? On slicing? (Slicing can break the encoding!) >> Round trips to and from bytes should work just fine. Why would that be >> a problem? > > I'm strongly opposed to adding encoding information to str8 objects. > I think they will eventually go away, anyway; adding that kind of > overhead now is both a waste of developer's time and of memory > resources; plus it has all the semantic issues that Guido points > out.
The same semantic issues will also be present in bytes objects in one form or another when handling data acquired from sources that use encoded strings. They don't go away even if str8 does go away. It sort of depends on how someone wants to handle situations where encoded strings are encountered. Do they decode them and convert everything to unicode and then convert back as needed for any output. Or can they keep the data in the encoded form for the duration? I expect different people will feel differently on this. > As for creating str8 objects from bytes objects: If you want > the str8 object to carry an encoding, you would have to *specify* > the encoding when creating the str8 object, since the bytes object > does not have that information. This is *very* hard, as you may > not know what the encoding is when you need to create the str8 > object. True, and this also applies if you want to convert an already encoded bytes object to unicode as well. >> There really is no safety in concatenation and slicing of encoded 8bit >> strings now. If by accident two strings of different encodings are >> combined, then all bets are off. And since there is no way to ask a >> string what it's current encoding is, it becomes an easy to make and >> hard to find silent error. So we have to be very careful not to mix >> encoded strings with different encodings. > > Please answer the question: what would happen on concatenation? In > particular, what is the value of the encoding for the result > of the concatenated string if one input is "latin-1", and the > other one is "utf-8"? I was trying to avoid this becoming a long thread. If these Ideas seem worth discussing, maybe we can move the reply to python ideas and we can work out the details there. But to not avoid your questions... Concatenation of unlike encoded objects should cause an error message of course. It's just not possible to do presently. I agree that putting an attribute on a str8 object instance is only a partial solution and does waste some space. (I changed my mind on this yesterday morning after thinking about it some more.) So I offered an alternative suggestion that it may be possibly to use dynamically created encoded str types, which avoids putting an attribute on every instance, and can handle the problems of slicing, concatenation, and conversion. I didn't go into the details because it was, and is, only a general suggestion or thought. One approach is to possibly use a factory function that uses metaclass's or mixins to create these based either on a str base type or a bytes object. Latin1 = get_encoded_str_type('latin-1') s1 = Latin1('Hello ') Utf8 = get_encoded_str_type('utf-8') s2 = Utf8('World') s = s1 + s2 -> Exception Raised s = s1 + type(s1)(s2) -> latin-1 string s = type(s2)(s1) + s2 -> utf-8 string lines = [s1, s2, ..., sn] s = Utf8.join([Utf8(s) for s in lines]) In this last case the strings in s1 can even be of arbitrary encoding types and they would still all get re-encoded to utf-8 correctly. Chances are you would never have a list of strings with many different encodings, but you may have a list of strings with types unknown to a local function. There can probably be various ways of creating these types that do not require them to be built in. The advantage is they can be smarter about concatenation, slicing, and transforming to bytes and unicode and back. It's really just a higher level API. Weather it's a waste of time and effort, , I suppose that depends on who is doing it and weather or not they think so. 
It could also be a third party module as well. Then if it becomes popular it can be included in Python some time in a future version. >> It's easy to tell what happens now: the bytes of those input >> strings are just appended; the result string does not follow >> a consistent character encoding anymore. This answer does >> not apply to your proposed modification, as it does not answer >> what the value of the .encoding attribute of the str8 would be >> after concatenation (likewise for slicing). And what is the use of appending unlike encoded str8 types? Most anything I can think of are hacks. I disagree about it being easy to tell what happens. That's only true on a micro level. On a macro level, it may work out ok, or it may cause an error to be raised at some point, or it may be completely silent and the data you send out is corrupted. In which case, something even worse may happen when the data is used. Like missing Mars orbiters or crashed landers. That does not sound like it is "easy to tell what happens" to me. I think what Guido is thinking is we may need to keep str8 around (for a while) as a 'C' compatible string type for purposes of interfacing to 'C' code. What I was thinking about was to simplify encoding and decoding and avoid issues that are caused by mismatched strings of *any* type. A different problem set, that may need a different solution. >> It's not too different from trying to find the current unicode and str8 >> issues in the py3k-struni branch. > This sentence I do not understand. What is not too different from > trying to find issues? It was a general statement reflecting on the process of converting the py3k-struni branch to unicode. As I said above: >> ... it becomes easy-to-make and >> hard-to-find silent errors ... In this case the errors are expected, but finding them is still difficult. It's not quite the same thing, but I did say "not too different", meaning there are some differences. >> Concatenating str8 and str types is a bit safer, as long as the str8 is >> in "the" default encoding, but it may still be an unintended implicit >> conversion. And if it's not in the default encoding, then all bets are >> off again. > Sure. However, the str8 type will go away, and along with it all these > issues. Yes, hopefully it will, eventually, along with encoded strings in the wild, as well. But probably not immediately. >> The use would be in ensuring the integrity of encoded strings. >> Concatenating strings with different encodings could then produce >> errors. > Ok. What about slicing? Details... for which all of these can be solved. Encoded string types as I described above can also know how to slice themselves correctly. Cheers and Regards, Ron From martin at v.loewis.de Sat Jun 16 01:00:10 2007 From: martin at v.loewis.de ("Martin v. Löwis") Date: Sat, 16 Jun 2007 01:00:10 +0200 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: <46731502.9050400@ronadam.com> References: <4668D535.7020103@v.loewis.de> <466E4B22.6020408@ronadam.com> <46708286.6090201@ronadam.com> <4670A458.7050206@ronadam.com> <4670C3C5.4070907@ronadam.com> <46721ABE.9090203@ronadam.com> <46722BBD.5060100@v.loewis.de> <46731502.9050400@ronadam.com> Message-ID: <467319FA.8000000@v.loewis.de> > This was in the context that it is decided by the community that a str8 > type is needed and it does not go away. I think *that* context has not occurred. People wanted a read-only bytes type, not a byte-oriented character string type.
> The alternative is for str8 to be replaced by byte objects which I > believe was, and still is, the plan if possible. That type is already implemented. > The same semantic issues will also be present in bytes objects in one > form or another when handling data acquired from sources that use > encoded strings. They don't go away even if str8 does go away. No they don't. The bytes type doesn't have an encoding associated with it, and it shouldn't. Values may not even represent text, but, say, image data. > It sort of depends on how someone wants to handle situations where > encoded strings are encountered. Do they decode them and convert > everything to unicode and then convert back as needed for any output? > Or can they keep the data in the encoded form for the duration? I > expect different people will feel differently on this. In Py3k, they will use the string type, because anything else will just be too tedious. >> As for creating str8 objects from bytes objects: If you want >> the str8 object to carry an encoding, you would have to *specify* >> the encoding when creating the str8 object, since the bytes object >> does not have that information. This is *very* hard, as you may >> not know what the encoding is when you need to create the str8 >> object. > True, and this also applies if you want to convert an already encoded > bytes object to unicode as well. Right, and therefore it can never be automatic - whereas the conversion between a bytes object and a str8 object *could* be automatic otherwise (assuming the str8 type survives at all). > One approach is to possibly use a factory function that uses metaclasses > or mixins to create these based either on a str base type or a bytes > object. > > Latin1 = get_encoded_str_type('latin-1') > > s1 = Latin1('Hello ') [snip] I think I lost track now what problem precisely you are trying to solve. >> It's easy to tell what happens now: the bytes of those input >> strings are just appended; the result string does not follow >> a consistent character encoding anymore. This answer does >> not apply to your proposed modification, as it does not answer >> what the value of the .encoding attribute of the str8 would be >> after concatenation (likewise for slicing). > And what is the use of appending unlike encoded str8 types? You may need to put encoded text into binary data, e.g. putting a file name into a zip file. Some of the bytes will be utf-8 encoded, others will be cp437 encoded, others will be data structures of the zip file, and the rest will be compressed bytes. Likewise for constructing MIME messages: different pieces will use different encodings. > I think what Guido is thinking is we may need to keep str8 around (for a > while) as a 'C' compatible string type for purposes of interfacing to > 'C' code. That might be. I hope not, and I have plans to eliminate the need for many such places (providing Unicode-oriented APIs in some cases, and using the bytes type in other cases). In cases where we still have char*, I think the API should specify that this must be ASCII most of the time, with UTF-8 in selected other cases; arbitrary binary data only when interfacing to the bytes type.
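To put the zip example in concrete terms: a single run of bytes can mix encodings and non-text data, so no one encoding attribute could describe it truthfully. A hedged illustration — the layout below is schematic, not the real zip record format:

    import zlib

    name = 'nötfil.txt'                 # a file name with a non-ASCII character
    blob = (name.encode('cp437')        # the name, as legacy zip tools store it
            + b'\x00\x00\x00\x00'       # header fields: numbers, not text
            + zlib.compress(name.encode('utf-8')))  # compressed payload

    # No single encoding describes blob: decoding or slicing it as
    # cp437 or utf-8 would be wrong for most of its bytes.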
Regards, Martin From rrr at ronadam.com Sat Jun 16 02:31:02 2007 From: rrr at ronadam.com (Ron Adam) Date: Fri, 15 Jun 2007 19:31:02 -0500 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: <467319FA.8000000@v.loewis.de> References: <4668D535.7020103@v.loewis.de> <466E4B22.6020408@ronadam.com> <46708286.6090201@ronadam.com> <4670A458.7050206@ronadam.com> <4670C3C5.4070907@ronadam.com> <46721ABE.9090203@ronadam.com> <46722BBD.5060100@v.loewis.de> <46731502.9050400@ronadam.com> <467319FA.8000000@v.loewis.de> Message-ID: <46732F46.8020701@ronadam.com> Martin v. Löwis wrote: >> This was in the context that it is decided by the community that a str8 >> type is needed and it does not go away. > > I think *that* context has not occurred. People wanted a read-only > bytes type, not a byte-oriented character string type. > >> The alternative is for str8 to be replaced by byte objects which I >> believe was, and still is, the plan if possible. > > That type is already implemented. But the actual replacing of str8 by bytes is still a work in progress. >> The same semantic issues will also be present in bytes objects in one >> form or another when handling data acquired from sources that use >> encoded strings. They don't go away even if str8 does go away. > > No they don't. The bytes type doesn't have an encoding associated > with it, and it shouldn't. Values may not even represent text, > but, say, image data. Right, and in the cases where the bytes are an encoded form of string data, you will need to be very careful about how it is sliced. But this isn't any different for any other byte-type data. It's a low-level interface meant to do low-level things. Which is good, we need that. >> It sort of depends on how someone wants to handle situations where >> encoded strings are encountered. Do they decode them and convert >> everything to unicode and then convert back as needed for any output? >> Or can they keep the data in the encoded form for the duration? I >> expect different people will feel differently on this. > > In Py3k, they will use the string type, because anything else will > just be too tedious. I agree, this will be the preferred way, and should be. >>> As for creating str8 objects from bytes objects: If you want >>> the str8 object to carry an encoding, you would have to *specify* >>> the encoding when creating the str8 object, since the bytes object >>> does not have that information. This is *very* hard, as you may >>> not know what the encoding is when you need to create the str8 >>> object. >> True, and this also applies if you want to convert an already encoded >> bytes object to unicode as well. > > Right, and therefore it can never be automatic - whereas the conversion > between a bytes object and a str8 object *could* be automatic otherwise > (assuming the str8 type survives at all). But conversion between different encodings won't be automatic. It will still be as tedious and confusing as it always has been. The improvement that Python 3000 makes here is that maybe it won't be needed as often with unicode strings being the default. >> One approach is to possibly use a factory function that uses metaclasses >> or mixins to create these based either on a str base type or a bytes >> object. >> >> Latin1 = get_encoded_str_type('latin-1') >> >> s1 = Latin1('Hello ') > [snip] > > I think I lost track now what problem precisely you are trying to solve.
A case of abstract motivation, prompting a very general idea, which elicits subjective responses, which prompts even more concrete examples, etc... The original motivation wasn't explicitly stated at the beginning and got lost. ;-) My primary reason for the suggestion is that maybe it can increase string data integrity and make finding errors easier. This was just a thought in that direction. A more specific example or issue that is much more relevant at this time might be, should the conversion to bytes be automatic when combining str8 and bytes? (str and bytes in Python 2.6+) The first answer might be yes since it's a one-to-one conversion. But it's implicit.

>>> str8('hello ') + b'world'
b'hello world'
>>> b'hello ' + str8('world')
b'hello world'

That's clear enough, but what about...

>>> ''.join(slist)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 0: expected string or Unicode, bytes found

And so starts yet another tedious session of tracing variables back to find where the bytes type actually occurred. Which may not be obvious since it could have been an unintentional and implicit conversion. >>> It's easy to tell what happens now: the bytes of those input >>> strings are just appended; the result string does not follow >>> a consistent character encoding anymore. This answer does >>> not apply to your proposed modification, as it does not answer >>> what the value of the .encoding attribute of the str8 would be >>> after concatenation (likewise for slicing). >> And what is the use of appending unlike encoded str8 types? > > You may need to put encoded text into binary data, e.g. putting > a file name into a zip file. Some of the bytes will be utf-8 > encoded, others will be cp437 encoded, others will be data structures > of the zip file, and the rest will be compressed bytes. > > Likewise for constructing MIME messages: different pieces will > use different encodings. Wouldn't you need some sort of wrapper in these cases to indicate what the encoding is and where it starts and stops? So even in binary data, extracting it to bytes and then decoding each section to its particular encoded type should not be a problem. Same goes for the other way around. For text encoded data within other text encoded data, it's a nested encoding that needs to be decoded in the correct sequence. Not a sequential encoding that is done and appended together as is. Is that correct? And it still needs headers to indicate its encoding, start, and length. Or something equivalent. What am I missing? Cheers, Ron >> I think what Guido is thinking is we may need to keep str8 around (for a >> while) as a 'C' compatible string type for purposes of interfacing to >> 'C' code. > > That might be. I hope not, and I have plans to eliminate the need for > many such places (providing Unicode-oriented APIs in some cases, > and using the bytes type in other cases). > > In cases where we still have char*, I think the API should specify that > this must be ASCII most of the time, with UTF-8 in selected other > cases; arbitrary binary data only when interfacing to the bytes > type.
> > Regards, > Martin > > From rrr at ronadam.com Sun Jun 17 04:38:12 2007 From: rrr at ronadam.com (Ron Adam) Date: Sat, 16 Jun 2007 21:38:12 -0500 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: References: <4668D535.7020103@v.loewis.de> <466E4B22.6020408@ronadam.com> <46708286.6090201@ronadam.com> <4670A458.7050206@ronadam.com> <4670C3C5.4070907@ronadam.com> Message-ID: <46749E94.5010301@ronadam.com> Guido van Rossum wrote: > On 6/13/07, Ron Adam wrote: >> Attached both the str8 repr as s"..." and s'...', and the latest >> no_raw_escape patch which I think is complete now and should apply >> with no >> problems. > > I like the str8 repr patch enough to check it in. > >> I tracked the random fails I am having in test_tokenize.py down to it >> doing >> a round trip on random test_*.py files. If one of those files has a >> problem it causes test_tokenize.py to fail also. So I added a line to >> the >> test to output the file name it does the round trip on so those can be >> fixed as they are found. >> >> Let me know if it needs to be adjusted or something doesn't look right. > > Well, I'm still philosophically uneasy with r'\' being a valid string > literal, for various reasons (one being that writing a string parser > becomes harder and harder). I definitely want r'\u1234' to be a > 6-character string, however. Do you have a patch that does just that? > (We can argue over the rest later in a larger forum.) The str8 patch caused tokenize.py to fail again also. ;-) Those s'' == '' asserts of course. I tracked it down to cStringIO.c only returning str8 types. Fixing that *may* fix a number of other modules as well, but I'm not sure how, so I put a str() around the returned value in tokenize.py with a note for now. The attached patch has various minor fixes. (But not the no raw escape stuff yet.) Cheers, Ron tokenize.py - Get rid of s'' errors by converting the returned value from cStringIO.c to unicode with str(). test_tokenize.py - Added printing of roundtrip file names to the test. This is needed because files are a random sample and if they have errors, it causes this module to fail. Without this there is no way to tell what is going on. _fileio.c - Return unicode strings instead of str8 strings. (check this one.) smtplib - Fixed strip() without args on bytes. test_fileinput.py - Replaced bad writelines() call with a for loop with a write() call. (buffer object doesn't have a writelines method, and try-finally hid the error.) -------------- next part -------------- A non-text attachment was scrubbed... Name: variousfixes.diff Type: text/x-patch Size: 3447 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070616/acfc33d9/attachment.bin From guido at python.org Mon Jun 18 20:37:44 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 18 Jun 2007 11:37:44 -0700 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: <46749E94.5010301@ronadam.com> References: <466E4B22.6020408@ronadam.com> <46708286.6090201@ronadam.com> <4670A458.7050206@ronadam.com> <4670C3C5.4070907@ronadam.com> <46749E94.5010301@ronadam.com> Message-ID: Thanks for the patches! Applied, except for the change to tokenize.py; instead, I changed test_tokenize.py to use io.StringIO. --Guido On 6/16/07, Ron Adam wrote: > > > Guido van Rossum wrote: > > On 6/13/07, Ron Adam wrote: > >> Attached both the str8 repr as s"..." and s'...', and the latest > >> no_raw_escape patch which I think is complete now and should apply > >> with no > >> problems.
> > > > I like the str8 repr patch enough to check it in. > > > >> I tracked the random fails I am having in test_tokenize.py down to it > >> doing > >> a round trip on random test_*.py files. If one of those files has a > >> problem it causes test_tokenize.py to fail also. So I added a line to > >> the > >> test to output the file name it does the round trip on so those can be > >> fixed as they are found. > >> > >> Let me know if it needs to be adjusted or something doesn't look right. > > > > Well, I'm still philosophically uneasy with r'\' being a valid string > > literal, for various reasons (one being that writing a string parser > > becomes harder and harder). I definitely want r'\u1234' to be a > > 6-character string, however. Do you have a patch that does just that? > > (We can argue over the rest later in a larger forum.) > > The str8 patch caused tokenize.py to fail again also. ;-) Those s'' == '' > asserts of course. > > I tracked it down to cStringIO.c only returning str8 types. Fixing that > *may* fix a number of other modules as well, but I'm not sure how, so I put > a str() around the returned value in tokenize.py with a note for now. > > The attached patch has various minor fixes. (But not the no raw escape > stuff yet.) > > Cheers, > Ron > > > > tokenize.py - Get rid of s'' errors by converting the returned value from > cStringIO.c to unicode with str(). > > test_tokenize.py - Added printing of roundtrip file names to the test. This is > needed because files are a random sample and if they have errors, it causes > this module to fail. Without this there is no way to tell what is going on. > > _fileio.c - Return unicode strings instead of str8 strings. (check this one.) > > smtplib - Fixed strip() without args on bytes. > > test_fileinput.py - Replaced bad writelines() call with a for loop with a > write() call. (buffer object doesn't have a writelines method, and > try-finally hid the error.) > > > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Tue Jun 19 08:32:59 2007 From: guido at python.org (Guido van Rossum) Date: Mon, 18 Jun 2007 23:32:59 -0700 Subject: [Python-3000] Python 3000 Status Update (Long!) Message-ID: I've written up a comprehensive status report on Python 3000. Please read: http://www.artima.com/weblogs/viewpost.jsp?thread=208549 -- --Guido van Rossum (home page: http://www.python.org/~guido/) From rrr at ronadam.com Tue Jun 19 12:04:27 2007 From: rrr at ronadam.com (Ron Adam) Date: Tue, 19 Jun 2007 05:04:27 -0500 Subject: [Python-3000] setup.py fails in the py3k-struni branch In-Reply-To: References: <466E4B22.6020408@ronadam.com> <46708286.6090201@ronadam.com> <4670A458.7050206@ronadam.com> <4670C3C5.4070907@ronadam.com> <46749E94.5010301@ronadam.com> Message-ID: <4677AA2B.8000704@ronadam.com> Guido van Rossum wrote: > Thanks for the patches! Applied, except for the change to > tokenize.py; instead, I changed test_tokenize.py to use io.StringIO. > > --Guido Glad to have the opportunity to help make the future happen. ;-) This next one converts unicode literals in tokenize.py and its tests to byte literals. I've also fixed some more unicode literals in a few other places I found. By doing this first it will make the no raw escape patches not include anything else. Cheers, Ron M Lib/tokenize.py M Lib/test/tokenize_tests.txt M Lib/test/output/test_tokenize - Removed unicode literals from test results and tokenize.py. And made it pass again. M Lib/test/output/test_pep277 - Removed unicode literals from test results.
This is a Windows-only test, so I can't test it. M Lib/test/test_codeccallbacks.py M Objects/exceptions.c - Removed unicode literals from test_codeccallbacks.py and removed unicode literal quoting from exceptions.c to make it pass again. M Lib/test/test_codecs.py M Lib/test/test_doctest.py M Lib/test/re_tests.py - Removed some literals from comments. -------------- next part -------------- A non-text attachment was scrubbed... Name: variousfixes2.diff Type: text/x-patch Size: 14354 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070619/749db319/attachment.bin From ncoghlan at gmail.com Tue Jun 19 13:57:44 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 19 Jun 2007 21:57:44 +1000 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: References: Message-ID: <4677C4B8.8010508@gmail.com> Georg Brandl wrote: > Guido van Rossum schrieb: >> I've written up a comprehensive status report on Python 3000. Please read: >> >> http://www.artima.com/weblogs/viewpost.jsp?thread=208549 > > Thank you! Now I have something to show to interested people except "read > the PEPs". > > A minuscule nit: the rot13 codec has no library equivalent, so it won't be > supported anymore :) Given that there are valid use cases for bytes-to-bytes translations, and a common API for them would be nice, does it make sense to have an additional category of codec that is invoked via specific recoding methods on bytes objects? For example: encoded = data.encode_bytes('bz2') decoded = encoded.decode_bytes('bz2') assert data == decoded Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From gabor at nekomancer.net Tue Jun 19 13:54:32 2007 From: gabor at nekomancer.net (=?ISO-8859-1?Q?G=E1bor_Farkas?=) Date: Tue, 19 Jun 2007 13:54:32 +0200 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: References: Message-ID: <4677C3F8.3050305@nekomancer.net> Guido van Rossum wrote: > I've written up a comprehensive status report on Python 3000. Please read: > > http://www.artima.com/weblogs/viewpost.jsp?thread=208549 > why do map and filter stay, but reduce leave? i understand that some people think that an explicit for-loop is more understandable, but also many people claim that list-comprehensions are more understandable than map/filter.., and map/filter can be trivially written using list-comprehensions.. so why _reduce_? gabor From g.brandl at gmx.net Tue Jun 19 14:20:06 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Tue, 19 Jun 2007 14:20:06 +0200 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: <4677C4B8.8010508@gmail.com> References: <4677C4B8.8010508@gmail.com> Message-ID: Nick Coghlan schrieb: > Georg Brandl wrote: >> Guido van Rossum schrieb: >>> I've written up a comprehensive status report on Python 3000. Please read: >>> >>> http://www.artima.com/weblogs/viewpost.jsp?thread=208549 >> >> Thank you! Now I have something to show to interested people except "read >> the PEPs". >> >> A minuscule nit: the rot13 codec has no library equivalent, so it won't be >> supported anymore :) > > Given that there are valid use cases for bytes-to-bytes translations, > and a common API for them would be nice, does it make sense to have an > additional category of codec that is invoked via specific recoding > methods on bytes objects?
For example: > > encoded = data.encode_bytes('bz2') > decoded = encoded.decode_bytes('bz2') > assert data == decoded This is exactly what I proposed a while before under the name bytes.transform(). IMO it would make a common use pattern much more convenient and should be given thought. If a PEP is called for, I'd be happy to at least co-author it. Georg From walter at livinglogic.de Tue Jun 19 14:40:57 2007 From: walter at livinglogic.de (=?UTF-8?B?V2FsdGVyIETDtnJ3YWxk?=) Date: Tue, 19 Jun 2007 14:40:57 +0200 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: References: <4677C4B8.8010508@gmail.com> Message-ID: <4677CED9.1060800@livinglogic.de> Georg Brandl wrote: > Nick Coghlan schrieb: >> Georg Brandl wrote: >>> Guido van Rossum schrieb: >>>> I've written up a comprehensive status report on Python 3000. Please read: >>>> >>>> http://www.artima.com/weblogs/viewpost.jsp?thread=208549 >>> Thank you! Now I have something to show to interested people except "read >>> the PEPs". >>> >>> A minuscule nit: the rot13 codec has no library equivalent, so it won't be >>> supported anymore :) >> Given that there are valid use cases for bytes-to-bytes translations, >> and a common API for them would be nice, does it make sense to have an >> additional category of codec that is invoked via specific recoding >> methods on bytes objects? For example: >> >> encoded = data.encode_bytes('bz2') >> decoded = encoded.decode_bytes('bz2') >> assert data == decoded > > This is exactly what I proposed a while before under the name > bytes.transform(). > > IMO it would make a common use pattern much more convenient and > should be given thought. > > If a PEP is called for, I'd be happy to at least co-author it. Codecs are a major exception to Guido's law: Never have a parameter whose value switches between completely unrelated algorithms. Why don't we put all string transformation functions into a common module (the string module might be a good place): >>> import string >>> string.rot13('abc') Servus, Walter From mal at egenix.com Tue Jun 19 15:19:50 2007 From: mal at egenix.com (M.-A. Lemburg) Date: Tue, 19 Jun 2007 15:19:50 +0200 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: <4677CED9.1060800@livinglogic.de> References: <4677C4B8.8010508@gmail.com> <4677CED9.1060800@livinglogic.de> Message-ID: <4677D7F6.3040304@egenix.com> On 2007-06-19 14:40, Walter Dörwald wrote: > Georg Brandl wrote: >>>> A minuscule nit: the rot13 codec has no library equivalent, so it won't be >>>> supported anymore :) >>> Given that there are valid use cases for bytes-to-bytes translations, >>> and a common API for them would be nice, does it make sense to have an >>> additional category of codec that is invoked via specific recoding >>> methods on bytes objects? For example: >>> >>> encoded = data.encode_bytes('bz2') >>> decoded = encoded.decode_bytes('bz2') >>> assert data == decoded >> This is exactly what I proposed a while before under the name >> bytes.transform(). >> >> IMO it would make a common use pattern much more convenient and >> should be given thought. >> >> If a PEP is called for, I'd be happy to at least co-author it. > > Codecs are a major exception to Guido's law: Never have a parameter > whose value switches between completely unrelated algorithms. I don't see much of a problem with that. Parameters are per se intended to change the behavior of a function or method.
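To make the bytes-to-bytes proposal above concrete, here is a minimal sketch of the kind of named-transform registry being discussed (hypothetical throughout: encode_bytes/decode_bytes and bytes.transform() were only proposals, and this toy registry stands in for the real codec machinery):

    import bz2
    import zlib

    # Hypothetical registry mapping a transform name to an
    # (encoder, decoder) pair of bytes-to-bytes functions.
    _transforms = {
        'bz2': (bz2.compress, bz2.decompress),
        'zlib': (zlib.compress, zlib.decompress),
    }

    def encode_bytes(data, name):
        # Look up and apply the named forward transform.
        return _transforms[name][0](data)

    def decode_bytes(data, name):
        # Apply the matching inverse transform.
        return _transforms[name][1](data)

    data = b'some payload'
    assert decode_bytes(encode_bytes(data, 'bz2'), 'bz2') == data

Walter's module-of-functions alternative amounts to calling the encoder/decoder pairs directly, without the string-keyed lookup.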
Note that you are referring to the .encode() and .decode() methods - these are just easy-to-use interfaces to the codecs registered in the system. The codec design allows for different input and output types as it doesn't impose restrictions on these. Codecs are more general in that respect: they don't just deal with Unicode encodings, it's a more general approach that also works with other kinds of data types. The access methods, OTOH, can impose restrictions and probably should, to restrict the return types to a predictable set. > Why don't we put all string transformation functions into a common > module (the string module might be a good place): > >>>> import string >>>> string.rot13('abc') I think the string module will have to go away. It doesn't really separate between text and bytes data. Adding more confusion will not really help with making this distinction clear, either, I'm afraid. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 19 2007) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2007-07-09: EuroPython 2007, Vilnius, Lithuania 19 days to go :::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 From him at online.de Tue Jun 19 15:56:39 2007 From: him at online.de (=?ISO-8859-1?Q?Joachim_K=F6nig?=) Date: Tue, 19 Jun 2007 15:56:39 +0200 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: References: Message-ID: <4677E097.5060205@online.de> Guido van Rossum schrieb: > I've written up a comprehensive status report on Python 3000. Please read: > > http://www.artima.com/weblogs/viewpost.jsp?thread=208549 > > Nice summary, thanks. I'm sure it has been proposed before (and I've googled for it but did not find it), but could someone enlighten me why {,} can't be used for the empty set, analogous to the empty tuple (,)? No, I do not want to start a new discussion about it. Joachim From g.brandl at gmx.net Tue Jun 19 15:03:26 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Tue, 19 Jun 2007 15:03:26 +0200 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: <4677CED9.1060800@livinglogic.de> References: <4677C4B8.8010508@gmail.com> <4677CED9.1060800@livinglogic.de> Message-ID: Walter Dörwald schrieb: > Georg Brandl wrote: >> Nick Coghlan schrieb: >>> Georg Brandl wrote: >>>> Guido van Rossum schrieb: >>>>> I've written up a comprehensive status report on Python 3000. Please read: >>>>> >>>>> http://www.artima.com/weblogs/viewpost.jsp?thread=208549 >>>> Thank you! Now I have something to show to interested people except "read >>>> the PEPs". >>>> >>>> A minuscule nit: the rot13 codec has no library equivalent, so it won't be >>>> supported anymore :) >>> Given that there are valid use cases for bytes-to-bytes translations, >>> and a common API for them would be nice, does it make sense to have an >>> additional category of codec that is invoked via specific recoding >>> methods on bytes objects?
For example: >>> >>> encoded = data.encode_bytes('bz2') >>> decoded = encoded.decode_bytes('bz2') >>> assert data == decoded >> >> This is exactly what I proposed a while before under the name >> bytes.transform(). >> >> IMO it would make a common use pattern much more convenient and >> should be given thought. >> >> If a PEP is called for, I'd be happy to at least co-author it. > > Codecs are a major exception to Guido's law: Never have a parameter > whose value switches between completely unrelated algorithms. I don't think that applies here. This is more like __import__(): depending on the first parameter, completely different things can happen. Yes, the same import algorithm is used, but in the case of bytes.encode_bytes, the same algorithm is used to find and execute the codec. Georg From lists at cheimes.de Tue Jun 19 15:05:42 2007 From: lists at cheimes.de (Christian Heimes) Date: Tue, 19 Jun 2007 15:05:42 +0200 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <4677C3F8.3050305@nekomancer.net> References: <4677C3F8.3050305@nekomancer.net> Message-ID: Gábor Farkas wrote: > why do map and filter stay, but reduce leave? > > i understand that some people think that an explicit for-loop is more > understandable, but also many people claim that list-comprehensions are > more understandable than map/filter.., and map/filter can be trivially > written using list-comprehensions.. so why _reduce_? Don't worry, it wasn't completely removed. Reduce was moved to functools. $ ./python Python 3.0x (p3yk:56022, Jun 18 2007, 21:10:13) [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> map >>> filter >>> from functools import reduce >>> reduce From benji at benjiyork.com Tue Jun 19 16:37:00 2007 From: benji at benjiyork.com (Benji York) Date: Tue, 19 Jun 2007 10:37:00 -0400 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: <4677E097.5060205@online.de> References: <4677E097.5060205@online.de> Message-ID: <4677EA0C.3020107@benjiyork.com> Joachim König wrote: > could someone enlighten me why > > {,} > > can't be used for the empty set, analogous to the empty tuple (,)? Partially because (,) is not the empty tuple, () is. -- Benji York http://benjiyork.com From walter at livinglogic.de Tue Jun 19 16:45:46 2007 From: walter at livinglogic.de (=?UTF-8?B?V2FsdGVyIETDtnJ3YWxk?=) Date: Tue, 19 Jun 2007 16:45:46 +0200 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: References: <4677C4B8.8010508@gmail.com> <4677CED9.1060800@livinglogic.de> Message-ID: <4677EC1A.10306@livinglogic.de> Georg Brandl wrote: > Walter Dörwald schrieb: >> Georg Brandl wrote: >>> Nick Coghlan schrieb: >>>> Georg Brandl wrote: >>>>> Guido van Rossum schrieb: >>>>>> I've written up a comprehensive status report on Python 3000. Please read: >>>>>> >>>>>> http://www.artima.com/weblogs/viewpost.jsp?thread=208549 >>>>> Thank you! Now I have something to show to interested people except "read >>>>> the PEPs". >>>>> >>>>> A minuscule nit: the rot13 codec has no library equivalent, so it won't be >>>>> supported anymore :) >>>> Given that there are valid use cases for bytes-to-bytes translations, >>>> and a common API for them would be nice, does it make sense to have an >>>> additional category of codec that is invoked via specific recoding >>>> methods on bytes objects?
For example: >>>> >>>> encoded = data.encode_bytes('bz2') >>>> decoded = encoded.decode_bytes('bz2') >>>> assert data == decoded >>> This is exactly what I proposed a while before under the name >>> bytes.transform(). >>> >>> IMO it would make a common use pattern much more convenient and >>> should be given thought. >>> >>> If a PEP is called for, I'd be happy to at least co-author it. >> Codecs are a major exception to Guido's law: Never have a parameter >> whose value switches between completely unrelated algorithms. > > I don't think that applies here. This is more like __import__(): > depending on the first parameter, completely different things can happen. > Yes, the same import algorithm is used, but in the case of > bytes.encode_bytes, the same algorithm is used to find and execute the > codec. What would a registry of transformation algorithms buy us compared to a module with transformation functions? The function version is shorter: transform.rot13('foo') compared to: 'foo'.transform('rot13') If each transformation has its own function, these functions can have their own arguments, e.g. transform.bz2encode(data: bytes, level: int=6) -> bytes Of course str.transform() could pass along all arguments to the registered function, but that's worse from a documentation viewpoint, because the real signature is hidden deep in the registry. Servus, Walter From brandon at rhodesmill.org Tue Jun 19 16:43:40 2007 From: brandon at rhodesmill.org (Brandon Craig Rhodes) Date: Tue, 19 Jun 2007 10:43:40 -0400 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: <4677E097.5060205@online.de> (Joachim =?utf-8?Q?K=C3=B6nig's?= message of "Tue, 19 Jun 2007 15:56:39 +0200") References: <4677E097.5060205@online.de> Message-ID: <87bqfcj97n.fsf@ten22.rhodesmill.org> Joachim König writes: > ... could someone enlighten me why > > {,} > > can't be used for the empty set, analogous to the empty tuple (,)? And now that someone else has broken the ice regarding questions that have probably been exhausted already, I want to comment that Python 3k seems to perpetuate a vast asymmetry. Observe: (a) Syntactic constructors [ 1,2,3 ] works { 1,2,3 } works { 1:1, 2:4, 3:9 } works (b) Generators + constructor functions list(i for i in (1,2,3)) works set(i for i in (1,2,3)) works dict((i,i*i) for i in (1,2,3)) works (c) Comprehensions [ i for i in (1,2,3) ] works { i for i in (1,2,3) } works { i:i*i for i in (1,2,3) } returns a SyntaxError! This seems offensive. It spells trouble for new students, who will have to simply memorize which of the three syntactically-supported containers support comprehensions and which do not. It spells trouble when trying to explain Python to seasoned programmers, who will think that they detect trouble in a language that breaks obvious symmetries over something so basic as instantiating container types. The PEP for dictionary comprehensions, when I last reviewed it, argued that dict comprehensions are unnecessary, because we have generators now. It seems to me that either: 1) The grounds for rejecting dict comprehensions are valid, and therefore should be extended so that everything in (c) above goes away. That is, if generators + built-in constructor functions are such a great solution for creating dicts, then list comprehensions and set comprehensions should both go away as well in favor of generators. The language would become simpler, the parser would become simpler, and Python would be easier to learn.
2) The grounds for rejecting dict comprehensions are invalid, so they should be introduced in Python 3k so that everything in (c) works. Given that Python 3k is making such strides in other areas where cruft and asymmetry needed to be removed, it would seem a shame to leave the container types in such disarray. -- Brandon Craig Rhodes brandon at rhodesmill.org http://rhodesmill.org/brandon From guido at python.org Tue Jun 19 17:20:25 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 19 Jun 2007 08:20:25 -0700 Subject: [Python-3000] [Python-Dev] Issues with PEP 3101 (string formatting) In-Reply-To: References: Message-ID: Those are valid concerns. I'm cross-posting this to the python-3000 list in the hope that the PEP's author and defenders can respond. I'm sure we can work something out. Please keep further discussion on the python-3000 at python.org list. --Guido On 6/19/07, Chris McDonough wrote: > Wrt http://www.python.org/dev/peps/pep-3101/ > > PEP 3101 says Py3K should allow item and attribute access syntax > within string templating expressions but "to limit potential security > issues", access to underscore prefixed names within attribute/item > access expressions will be disallowed. > > I am a person who has lived with the aftermath of a framework > designed to prevent data access by restricting access to underscore-prefixed names (Zope 2, ahem), and I've found it's very hard to > explain and justify. As a result, I feel that this is a poor default > policy choice for a framework. > > In some cases, underscore names must become part of an object's > external interface. Consider a URL with one or more underscore-prefixed path segment elements (because prefixing a filename with an > underscore is a perfectly reasonable thing to do on a filesystem, and > path elements are often named after file names) fed to a traversal > algorithm that attempts to resolve each path element into an object > by calling __getitem__ against the parent found by the last path > element's traversal result. Perhaps this is poor design and > __getitem__ should not be consulted here, but I doubt that highly > because there's nothing particularly special about calling a method > named __getitem__ as opposed to some method named "traverse". > > The only precedent within Python 2 for this sort of behavior is > limiting access to variables that begin with __ and which do not end > with __ to the scope defined by a class and its instances. I > personally don't believe this is a very useful feature, but it's > still only an advisory policy and you can worm around it with enough > gyrations. > > Given that security is a concern at all, the only truly reasonable > way to "limit security issues" is to disallow item and attribute > access completely within the string templating expression syntax. It > seems gratuitous to me to encourage string templating expressions > with item/attribute access, given that you could do it within the > format arguments just as easily in the 99% case, and we've (well... > I've) happily been living with that restriction for years now. > > But if this syntax is preserved, there really should be no *default* > restrictions on the traversable names within an expression because > this will almost certainly become a hard-to-explain, hard-to-justify > bug magnet as it has become in Zope.
> > - C > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From him at online.de Tue Jun 19 16:49:10 2007 From: him at online.de (=?ISO-8859-1?Q?Joachim_K=F6nig?=) Date: Tue, 19 Jun 2007 16:49:10 +0200 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: <4677EA0C.3020107@benjiyork.com> References: <4677E097.5060205@online.de> <4677EA0C.3020107@benjiyork.com> Message-ID: <4677ECE6.2040402@online.de> Benji York schrieb: > Joachim König wrote: >> could someone enlighten me why >> >> {,} >> >> can't be used for the empty set, analogous to the empty tuple (,)? > > Partially because (,) is not the empty tuple, () is. Oh, yes, of course. I was thinking of (x) vs. (x,), and that the comma after the last element is optional if len() > 1, but required when len() == 1, and forgot that it is forbidden when len() == 0. Sorry about that. Joachim From jimjjewett at gmail.com Tue Jun 19 17:29:30 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Tue, 19 Jun 2007 11:29:30 -0400 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <4677C3F8.3050305@nekomancer.net> References: <4677C3F8.3050305@nekomancer.net> Message-ID: On 6/19/07, Gábor Farkas wrote: > Guido van Rossum wrote: > > I've written up a comprehensive status report on Python 3000. Please read: > > http://www.artima.com/weblogs/viewpost.jsp?thread=208549 > why do map and filter stay, but reduce leave? > i understand that some people think that an explicit for-loop is more > understandable, but also many people claim that list-comprehensions are > more understandable than map/filter.., and map/filter can be trivially > written using list-comprehensions.. so why _reduce_? Note: these are my opinions, which may be unrelated to Guido's reasoning. In practice, reduce is almost always difficult to read and understand. There are counterexamples, but they tend to already be written without reduce. (They may use sum instead of a for loop, but they don't use reduce, unless they are intended as an example of reduce usage.) filter is at least well-named; no one has any doubts over what it is doing. map over a simple function is better written as a list comprehension, but if the function is complicated, has side effects, sends output to multiple places ... then map is probably a less-bad choice. -jJ From janssen at parc.com Tue Jun 19 18:34:32 2007 From: janssen at parc.com (Bill Janssen) Date: Tue, 19 Jun 2007 09:34:32 PDT Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: References: <4677C3F8.3050305@nekomancer.net> Message-ID: <07Jun19.093441pdt."57996"@synergy1.parc.xerox.com> > > written using list-comprehensions.. so why _reduce_? > > Don't worry, it wasn't completely removed. Reduce was moved to functools Though, really, same question! There are functional equivalents (list comprehensions) for "map" and "filter", but not for "reduce". Shouldn't "reduce" stay in the 'built-in' space, while the other two move to "functools"? Or move them all to "functools"? Bizarre recombination, IMO. Bill From collinw at gmail.com Tue Jun 19 18:46:08 2007 From: collinw at gmail.com (Collin Winter) Date: Tue, 19 Jun 2007 09:46:08 -0700 Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <7301715244131583311@unknownmsgid> References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> Message-ID: <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> On 6/19/07, Bill Janssen wrote: > > > written using list-comprehensions.. so why _reduce_? > > > > Don't worry, it wasn't complete removed. Reduce was moved to functools > > Though, really, same question! There are functional equivalents (list > comprehensions) for "map" and "filter", but not for "reduce". There is a range of list comprehensions that are more readably/concisely expressed as calls to map or filter: [f(x) for x in y] -> map(f, y) [x for x in y if f(x)] -> filter(f, y) Turning a for loop into the equivalent reduce() may be more concise, but as Guido has remarked before, someone new to your code generally has to break out pen and paper to figure out what's going on. > Shouldn't "reduce" stay in the 'built-in' space, while the other two > move to "functools"? Or move them all to "functools"? Bizarre > recombination, IMO. Arguing from the standpoint of purity, that, "these functions are builtins, why not this other one" isn't going to get you very far. Another data point to consider is that map and filter are used far, far more often than reduce (100000 and 62000 usages to 10000, says Google Code Search), so there's more resistance to moving them. Collin Winter From janssen at parc.com Tue Jun 19 19:51:15 2007 From: janssen at parc.com (Bill Janssen) Date: Tue, 19 Jun 2007 10:51:15 PDT Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> Message-ID: <07Jun19.105121pdt."57996"@synergy1.parc.xerox.com> > > Shouldn't "reduce" stay in the 'built-in' space, while the other two > > move to "functools"? Or move them all to "functools"? Bizarre > > recombination, IMO. > > Arguing from the standpoint of purity, that, "these functions are > builtins, why not this other one" isn't going to get you very far. If you think that's what I was arguing, you'd better re-read that message. Though, from the standpoint of pragmatism, removing "reduce" from the built-in space will break code (*my* code, among others), and leaving it in will not affect "purity", as both "map" and "reduce" are being left in. So leaving it alone seems the more Pythonic response to me. Guido's argument (http://www.artima.com/weblogs/viewpost.jsp?thread=98196) is that "any" and "all" (and "filter", of course) are better ways to do the same thing. I'm not sure, but it's an interesting hypothesis. But while we run the experiment, why not leave "reduce" where it is? Bill From eric+python-dev at trueblade.com Tue Jun 19 19:55:07 2007 From: eric+python-dev at trueblade.com (Eric V. Smith) Date: Tue, 19 Jun 2007 13:55:07 -0400 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: References: Message-ID: <4678187B.2060402@trueblade.com> Guido van Rossum wrote: > I've written up a comprehensive status report on Python 3000. Please read: > > http://www.artima.com/weblogs/viewpost.jsp?thread=208549 I think this sentence: "Python 2.6 will contain backported versions of many Py3k features, either enabled through __future__ statements or simply by allowing old and new syntax to be used side-by-side (if the new syntax would be a syntax error in 2.x)." 
Should end with "syntax error in 2.5", not "syntax error in 2.x". Or, state that x <= 5, in this sentence only. But I think we really mean exactly 2.5. Eric. From janssen at parc.com Tue Jun 19 20:02:32 2007 From: janssen at parc.com (Bill Janssen) Date: Tue, 19 Jun 2007 11:02:32 PDT Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> Message-ID: <07Jun19.110237pdt."57996"@synergy1.parc.xerox.com> And, while I'm at it, why isn't there a built-in function called "output()", which matches "input()", that is, it's equivalent to import sys sys.stdout.write(MESSAGE) It could be easily implemented in terms of the built-in function called "print". The fact that it's not there is going to confuse the heck out of the same audience "input" was designed for. I realize that there are good individual reasons for each of these point decisions; my fear is that by making them individually, we make the task of keeping Python in one's head unacceptably complex. Bill From mike.klaas at gmail.com Tue Jun 19 20:19:09 2007 From: mike.klaas at gmail.com (Mike Klaas) Date: Tue, 19 Jun 2007 11:19:09 -0700 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <07Jun19.105121pdt."57996"@synergy1.parc.xerox.com> References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <07Jun19.105121pdt."57996"@synergy1.parc.xerox.com> Message-ID: <435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com> On 19-Jun-07, at 10:51 AM, Bill Janssen wrote: > > Though, from the standpoint of pragmatism, removing "reduce" from the > built-in space will break code (*my* code, among others), and leaving > it in will not affect "purity", as both "map" and "reduce" are being > left in. So leaving it alone seems the more Pythonic response to me. map (especially the new iterized version) is a frequently-used builtin, while reduce is a rarely-used builtin that requires some head-wrapping. It makes sense to me to move it out of builtins. To pick a codebase at random (mine): $ find -name \*.py | xargs wc -l 137952 total $ pygrep map\( | wc -l 220 $ pygrep imap\( | wc -l 13 $ pygrep reduce\( | wc -l 2 Of the two uses of reduce(), one is in a unittest that should be using any(): self.assertTrue(not reduce((lambda b1, b2: b1 or b2), ... and the other is a tricky combination of callable "iterator filters" that looks something like this: df = lambda itr: reduce(lambda x, f: f(x), filter_list, itr) this isn't the clearest piece of code, even with more explanation. It would require a multi-line inner-function generator to replace it. I'm have no qualms importing reduce for such uses. In contrast, partial(), which should have less use as most of the codebase was written pre-2.5, and requires an import, is used four times: $ pygrep partial\( | wc -l 4 -Mike From janssen at parc.com Tue Jun 19 20:47:46 2007 From: janssen at parc.com (Bill Janssen) Date: Tue, 19 Jun 2007 11:47:46 PDT Subject: [Python-3000] On PEP 3116: new I/O base classes Message-ID: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com> A few comments here: I'd get rid of "readinto", and just make the buffer an optional argument to "read". If you keep "readinto", why not rename "write" to "writefrom"? 
The "seek" method on RawIOBase is awfully quaint and UNIX-y, what with the "whence" argument. It could be made considerably more Pythonic by splitting it into two methods: .seek(POS: int) where positive values for POS are from the beginning of the file, and negative values of POS are from the end of the file, and .nudge(POS: int) where the value of POS, positive or negative, is from the current location. Or call the two methods "seek_absolute" and "seek_relative". Of course, you don't really need "nudge" if you have "tell". Might even rename "seek" to "position". And I'd consider putting these two methods in a separate mixin; lots of file-like things can't seek. =============================================== ``If and only if a RawIOBase implementation operates on an underlying file descriptor, it must additionally provide a .fileno() member function. This could be defined specifically by the implementation, or a mix-in class could be used (need to decide about this).'' I'd suggest a mixin. =============================================== TextIOBase: this seems an odd mix of high-level and low-level. I'd remove "seek", "tell", "read", and "write". Remember that in Python, mixins actually work, so that you can provide a file object that combines several different I/O classes. And The Java-ish notion in TextIOBase.read(), that you can specify a count for the number of characters (or is that the number of UTF-8 bytes, etc... -- rich source of subtle bugs), just doesn't work in practice. And the "codecs" module already provides a way of doing this, for those who feel the need. Stick to just "readline" and "writeline" for text I/O. Bill From guido at python.org Tue Jun 19 20:49:10 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 19 Jun 2007 11:49:10 -0700 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: <4678187B.2060402@trueblade.com> References: <4678187B.2060402@trueblade.com> Message-ID: Thanks, you're right, I've fixed it. On 6/19/07, Eric V. Smith wrote: > Guido van Rossum wrote: > > I've written up a comprehensive status report on Python 3000. Please read: > > > > http://www.artima.com/weblogs/viewpost.jsp?thread=208549 > > I think this sentence: > > "Python 2.6 will contain backported versions of many Py3k features, > either enabled through __future__ statements or simply by allowing old > and new syntax to be used side-by-side (if the new syntax would be a > syntax error in 2.x)." > > Should end with "syntax error in 2.5", not "syntax error in 2.x". Or, > state that x <= 5, in this sentence only. But I think we really mean > exactly 2.5. > > Eric. > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From janssen at parc.com Tue Jun 19 21:13:24 2007 From: janssen at parc.com (Bill Janssen) Date: Tue, 19 Jun 2007 12:13:24 PDT Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com> References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <07Jun19.105121pdt."57996"@synergy1.parc.xerox.com> <435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com> Message-ID: <07Jun19.121327pdt."57996"@synergy1.parc.xerox.com> > map (especially the new iterized version) is a frequently-used > builtin, while reduce is a rarely-used builtin that requires some > head-wrapping. It makes sense to me to move it out of builtins. I've never understood this kind of argument. 
Because most people don't program in Python, we should abandon the project as a whole? For those who have "wrapped their head" around functional programming, "reduce" is a very clear and easy-to-understand primitive. But posting results gleaned from grepping over some random codebase written by someone who may or may not have done that head-wrapping at various points in time where some feature X may more may not have been available, seems even less of an argument. As I said, Guido's argument that "filter" (in the guise of [x for x in y if f(x)]), "any", and "all" are sufficient for almost every case seems like an interesting one to me, and he may well be right, but while we find out... Bill From lists at cheimes.de Tue Jun 19 21:17:06 2007 From: lists at cheimes.de (Christian Heimes) Date: Tue, 19 Jun 2007 21:17:06 +0200 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <07Jun19.105121pdt."57996"@synergy1.parc.xerox.com> References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <07Jun19.105121pdt."57996"@synergy1.parc.xerox.com> Message-ID: Bill Janssen wrote: > Though, from the standpoint of pragmatism, removing "reduce" from the > built-in space will break code (*my* code, among others), and leaving > it in will not affect "purity", as both "map" and "reduce" are being > left in. So leaving it alone seems the more Pythonic response to me. Python 3000 tries to reduce (hehe) the amount of builtins so reduce was removed since it is rarely used. I don't understand why map and filter wasn't moved to functools, too. You made one good point. At the moment you can't write code that utilizes reduce and works under 2.6 and 3.0. from functools import reduce fails in 2.6. The 2to3 suite has no fixer for reduce. My patch removes the flaw: http://www.python.org/sf/1739906 Christian From martin at v.loewis.de Tue Jun 19 22:53:09 2007 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Tue, 19 Jun 2007 22:53:09 +0200 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: References: <4677C4B8.8010508@gmail.com> <4677CED9.1060800@livinglogic.de> <4677EC1A.10306@livinglogic.de> Message-ID: <46784235.5050102@v.loewis.de> >> What would a registry of tranformation algorithms buy us compared to a >> module with transformation functions? > > Easier registering of custom transformations. Without a registry, you'd have > to monkey-patch a module. Or users would have to invoke the module directly. I think a convention would be enough: rot13.encode(foo) rot13.decode(bar) Then, "registration" would require to put the module on sys.path, which it would for any other kind of registry as well. My main objection to using an encoding is that for these, the algorithm name will *always* be a string literal, completely unlike "real" codecs, where the encoding name often comes from the environment (either from the process environment, or from some kind of input). Regards, Martin From mike.klaas at gmail.com Tue Jun 19 22:53:35 2007 From: mike.klaas at gmail.com (Mike Klaas) Date: Tue, 19 Jun 2007 13:53:35 -0700 Subject: [Python-3000] Python 3000 Status Update (Long!) 
In-Reply-To: <07Jun19.121327pdt."57996"@synergy1.parc.xerox.com> References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <07Jun19.105121pdt."57996"@synergy1.parc.xerox.com> <435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com> <07Jun19.121327pdt."57996"@synergy1.parc.xerox.com> Message-ID: <4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com> On 19-Jun-07, at 12:13 PM, Bill Janssen wrote: >> map (especially the new iterized version) is a frequently-used >> builtin, while reduce is a rarely-used builtin that requires some >> head-wrapping. It makes sense to me to move it out of builtins. > > I've never understood this kind of argument. Because most people > don't program in Python, we should abandon the project as a whole? No, but it certainly is an argument for not pre-installing a python dev environment on all windows boxes, for instance. Surely frequency of use should be at least _one_ of the criteria involved in making decisions about including functionality as a builtin? > For those who have "wrapped their head" around functional programming, > "reduce" is a very clear and easy-to-understand primitive. Granted. However, I claim that most python users would find a reduce ()-based construct less natural than an alternative, even if they understand what it does. The suggestion is simply to move the function to the "FP toolbox". > But posting results gleaned from grepping over some random codebase > written by someone who may or may not have done that head-wrapping at > various points in time where some feature X may more may not have been > available, seems even less of an argument. reduce() was always available. But that isn't the point: I'm not presenting the statistics as evidence of the entire python world, but I think they still indicate _something_, if not only what are the usage patterns of some type of python programmer (namely, one who is familiar with FP and uses many of its concepts in their python programming, though is by no means a disciple). Stats from _any_ large python project is better than anecdotes. Perhaps it would be better to turn to the stdlib (367289 lines)? Python2.5/Lib $ pygrep -E '\breduce\(' | wc -l 31 15 of those are tests for reduce()/iterators 7 are in pickle.py (nomenclature clash) Which leaves a few uses over binary operators: ./test/test_random.py: return reduce(int.__mul__, xrange (1, n), 1) ./idlelib/MultiCall.py:_state_names = [reduce(lambda x, y: x + y, ./idlelib/MultiCall.py:_state_codes = [reduce(lambda x, y: x | y, ./idlelib/AutoCompleteWindow.py: elif reduce(lambda x, y: x or y, ./difflib.py: matches = reduce(lambda sum, triple: sum + triple [-1], self.get_matching_blocks(), 0) Some trickiness in csv.py: quotechar = reduce(lambda a, b, quotes = quotes: (quotes[a] > quotes[b]) and a or b, quotes.keys()) delim = reduce(lambda a, b, delims = delims: (delims[a] > delims[b]) and a or b, delims.keys()) modes[char] = reduce(lambda a, b: a[1] > b[1] and a or b, items) (which can be replaced with max(..., key=...) reduce(lambda a, b: (0, a[1] + b[1]), items)[1] (which could be written sum(x[1] for x in items) > As I said, Guido's > argument that "filter" (in the guise of [x for x in y if f(x)]), > "any", and "all" are sufficient for almost every case seems like an > interesting one to me, and he may well be right, but while we find > out... How will we find out, if reduce() continues to be availabe? Regardless, that's my 2c. 
I don't think I have anything further to add to this (settled) matter. -Mike From brett at python.org Tue Jun 19 23:13:56 2007 From: brett at python.org (Brett Cannon) Date: Tue, 19 Jun 2007 14:13:56 -0700 Subject: [Python-3000] How best to handle failing tests in struni? Message-ID: After reading Guido's blog post and noticing his comment about lack of delegation, I decided to delegate to myself a look at struni and what tests were failing (which turned out to be a lot). I just started at the beginning and so that meant looking at test_anydbm. That's failing because _bsddb.c requires PyInt_Check or PyString_Check to pass for keys. That doesn't work in a world where string constants are all Unicode. =) So, my question is how best to handle this test (and thus other tests like it). Should it just continue to fail until someone fixes _bsddb.c to accept Unicode keys (and thus start up a FAILING file listing the various tests that are failing and doc which ones are expected to fail until something specific changes)? Or do we silence the failure by making the constants pass through str8? Or should str8 not even be used at all since (I assume) it won't survive the merge back into p3yk? -Brett From guido at python.org Wed Jun 20 01:22:06 2007 From: guido at python.org (Guido van Rossum) Date: Tue, 19 Jun 2007 16:22:06 -0700 Subject: [Python-3000] How best to handle failing tests in struni? In-Reply-To: References: Message-ID: Check out what the dbm-based modules do. I believe they use strings for keys and bytes for values, and if the keys are unicode, it converts them to UTF-8. On 6/19/07, Brett Cannon wrote: > After reading Guido's blog post and noticing his comment about lack of > delegation, I decided to delegate to myself a look at struni and what > tests were failing (which turned out to be a lot). > > I just started at the beginning and so that meant looking at > test_anydbm. That's failing because _bsddb.c requires PyInt_Check or > PyString_Check to pass for keys. That doesn't work in a world where > string constants are all Unicode. =) > > So, my question is how best to handle this test (and thus other tests > like it). Should it just continue to fail until someone fixes > _bsddb.c to accept Unicode keys (and thus start up a FAILING file > listing the various tests that are failing and doc which ones are > expected to fail until something specific changes)? Or do we silence > the failure by making the constants pass through str8? Or should str8 > not even be used at all since (I assume) it won't survive the merge > back into p3yk? > > -Brett > _______________________________________________ > Python-3000 mailing list > Python-3000 at python.org > http://mail.python.org/mailman/listinfo/python-3000 > Unsubscribe: http://mail.python.org/mailman/options/python-3000/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From gareth.mccaughan at pobox.com Wed Jun 20 01:26:16 2007 From: gareth.mccaughan at pobox.com (Gareth McCaughan) Date: Wed, 20 Jun 2007 00:26:16 +0100 Subject: [Python-3000] On PEP 3116: new I/O base classes In-Reply-To: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com> References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com> Message-ID: <200706200026.16568.gareth.mccaughan@pobox.com> On Tuesday 19 June 2007 19:47, Bill Janssen wrote: > The "seek" method on RawIOBase is awfully quaint and UNIX-y, what with > the "whence" argument.
It could be made considerably more Pythonic by > splitting it into two methods: > > .seek(POS: int) > > where positive values for POS are from the beginning of the file, and > negative values of POS are from the end of the file, and > > .nudge(POS: int) > > where the value of POS, positive or negative, is from the current > location. Presumably this would go along with introducing a new "wink" method. I wonder what it would do. (Close the file briefly?) -- Gareth McCaughan From showell30 at yahoo.com Wed Jun 20 02:19:40 2007 From: showell30 at yahoo.com (Steve Howell) Date: Tue, 19 Jun 2007 17:19:40 -0700 (PDT) Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> Message-ID: <931752.18195.qm@web33504.mail.mud.yahoo.com> +1 on deciding on keeping builtins built in based on popularity within actual source code. Stats will never be perfect, and nobody can practically sample all Python code ever written, but anybody who measures a large codebase to argue for keeping a builtin built in gets a +1 from me. Regarding map and filter, I never use them myself, but I also never collide with the keywords, even though a lot of my code really comes down to mapping and filtering. From lists at cheimes.de Wed Jun 20 02:29:01 2007 From: lists at cheimes.de (Christian Heimes) Date: Wed, 20 Jun 2007 02:29:01 +0200 Subject: [Python-3000] On PEP 3116: new I/O base classes In-Reply-To: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com> References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com> Message-ID: Bill Janssen wrote: > The "seek" method on RawIOBase is awfully quaint and UNIX-y, what with > the "whence" argument. It could be made considerably more Pythonic by > splitting it into two methods: > > .seek(POS: int) > > where positive values for POS are from the beginning of the file, and > negative values of POS are from the end of the file, and How would I seek to EOF with your proposal? seek(-0)? Christian From benji at benjiyork.com Wed Jun 20 03:11:03 2007 From: benji at benjiyork.com (Benji York) Date: Tue, 19 Jun 2007 21:11:03 -0400 Subject: [Python-3000] On PEP 3116: new I/O base classes In-Reply-To: <200706200026.16568.gareth.mccaughan@pobox.com> References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com> <200706200026.16568.gareth.mccaughan@pobox.com> Message-ID: <46787EA7.6050903@benjiyork.com> Gareth McCaughan wrote: > On Tuesday 19 June 2007 19:47, Bill Janssen wrote: > >> The "seek" method on RawIOBase is awfully quaint and UNIX-y, what with >> the "whence" argument. It could be made considerably more Pythonic by >> splitting it into two methods: >> >> .seek(POS: int) >> >> where positive values for POS are from the beginning of the file, and >> negative values of POS are from the end of the file, and >> >> .nudge(POS: int) >> >> where the value of POS, positive or negative, is from the current >> location. > > Presumably this would go along with introducing a new "wink" method. > I wonder what it would do. (Close the file briefly?) That's a great idea! It can be called in response to a HUP to rotate log files.
me.wink()-ly y'rs -- Benji York http://benjiyork.com From janssen at parc.com Wed Jun 20 03:46:50 2007 From: janssen at parc.com (Bill Janssen) Date: Tue, 19 Jun 2007 18:46:50 PDT Subject: [Python-3000] On PEP 3116: new I/O base classes In-Reply-To: References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com> Message-ID: <07Jun19.184653pdt."57996"@synergy1.parc.xerox.com> > How would I seek to EOF with your proposal? seek(-0)? Good point. Though I just grepped all my Python sources, and I never do that, so presumably the obvious workaround of seek_eof = lambda fp: (fp.seek(-1), fp.nudge(+1)) would be OK for that case. Bill From foom at fuhm.net Wed Jun 20 06:19:26 2007 From: foom at fuhm.net (James Y Knight) Date: Wed, 20 Jun 2007 00:19:26 -0400 Subject: [Python-3000] On PEP 3116: new I/O base classes In-Reply-To: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com> References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com> Message-ID: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net> On Jun 19, 2007, at 2:47 PM, Bill Janssen wrote: > TextIOBase: this seems an odd mix of high-level and low-level. I'd > remove "seek", "tell", "read", and "write". Remember that in Python, > mixins actually work, so that you can provide a file object that > combines several different I/O classes. Huh? All those operations you want to remove are entirely necessary for a number of applications. I'm not sure what you meant about mixins? > And The Java-ish notion in > TextIOBase.read(), that you can specify a count for the number of > characters (or is that the number of UTF-8 bytes, etc... -- rich > source of subtle bugs), just doesn't work in practice. It doesn't work? Why not? Of course read() should take the number of characters as a parameter, not number of bytes. > And the > "codecs" module already provides a way of doing this, for those who > feel the need. Stick to just "readline" and "writeline" for text I/O. Ah, not everyone dealing with text is dealing with line-delimited text, you know... James From aurelien.campeas at logilab.fr Wed Jun 20 10:57:01 2007 From: aurelien.campeas at logilab.fr (=?iso-8859-1?Q?Aur=E9lien_Camp=E9as?=) Date: Wed, 20 Jun 2007 10:57:01 +0200 Subject: [Python-3000] [Python-Dev] Issues with PEP 3101 (string formatting) In-Reply-To: References: Message-ID: <20070620085701.GA31968@crater.logilab.fr> On Tue, Jun 19, 2007 at 08:20:25AM -0700, Guido van Rossum wrote:
> > I am a person who has lived with the aftermath of a framework
> > designed to prevent data access by restricting access to underscore-
> > prefixed names (Zope 2, ahem), and I've found it's very hard to
> > explain and justify. As a result, I feel that this is a poor default
> > policy choice for a framework.

And it's even poorer in the context of a language (for it's probably harder to escape language-level restrictions than framework obscurities ...).

> > In some cases, underscore names must become part of an object's
> > external interface. Consider a URL with one or more underscore-
> > prefixed path segment elements (because prefixing a filename with an
> > underscore is a perfectly reasonable thing to do on a filesystem, and
> > path elements are often named after file names) fed to a traversal
> > algorithm that attempts to resolve each path element into an object
> > by calling __getitem__ against the parent found by the last path
> > element's traversal result. Perhaps this is poor design and
> > __getitem__ should not be consulted here, but I doubt that highly
> > because there's nothing particularly special about calling a method
> > named __getitem__ as opposed to some method named "traverse".

This is trying to make a technical argument, but the 'consenting adults' policy might be enough. In my experience, Zope forbidding access to _-prefixed attributes just led people to work around the limitation, adding more useless indirection to an already crufty code base. The result is more obfuscation and probably even less security (as in auditability of the code).

> > The only precedent within Python 2 for this sort of behavior is
> > limiting access to variables that begin with __ and which do not end
> > with __ to the scope defined by a class and its instances. I
> > personally don't believe this is a very useful feature, but it's
> > still only an advisory policy and you can worm around it with enough
> > gyrations.

FWIW I've come to never use __attrs. The obfuscation feature seems to bring nothing but pain (the few times I've fallen into that trap as a beginner python programmer).

> > Given that security is a concern at all, the only truly reasonable
> > way to "limit security issues" is to disallow item and attribute
> > access completely within the string templating expression syntax. It
> > seems gratuitous to me to encourage string templating expressions
> > with item/attribute access, given that you could do it within the
> > format arguments just as easily in the 99% case, and we've (well...
> > I've) happily been living with that restriction for years now.
> >
> > But if this syntax is preserved, there really should be no *default*
> > restrictions on the traversable names within an expression because
> > this will almost certainly become a hard-to-explain, hard-to-justify
> > bug magnet as it has become in Zope.

I'd add that Zope in general looks to me like a giant collection of python anti-patterns and as such can be used as a clue source about what not to do, especially what not to include in Py3k.

I don't want to offend people, well, no more than necessary (imho zope *is* an offense to common sense in many ways), but that's the opinion of someone who earns his living mostly from zope/plone product development and maintenance (these days, anyway).

Regards,
Aurélien.
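For readers skimming the thread: the contested feature lets replacement fields traverse attributes at format time. A sketch of why a names-based guard is shallow (the class is hypothetical, the field syntax is the PEP's, and the shown rejection is the PEP's proposal, not behavior implemented anywhere):

    class User:
        def __init__(self, name, pw_hash):
            self.name = name
            self._pw_hash = pw_hash   # "private" by convention only

    u = User("alice", "d41d8cd9")
    "{0.name}".format(u)              # 'alice'
    "{0._pw_hash}".format(u)          # the PEP would reject this name

    # But any public path to the same object defeats the check,
    # which is exactly the Zope experience Chris describes:
    u.pw = u._pw_hash                 # hypothetical public alias
    "{0.pw}".format(u)                # 'd41d8cd9'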
From walter at livinglogic.de Wed Jun 20 11:26:59 2007
From: walter at livinglogic.de (=?UTF-8?B?V2FsdGVyIETDtnJ3YWxk?=)
Date: Wed, 20 Jun 2007 11:26:59 +0200
Subject: [Python-3000] setup.py fails in the py3k-struni branch
In-Reply-To: <4677AA2B.8000704@ronadam.com>
References: <466E4B22.6020408@ronadam.com> <46708286.6090201@ronadam.com> <4670A458.7050206@ronadam.com> <4670C3C5.4070907@ronadam.com> <46749E94.5010301@ronadam.com> <4677AA2B.8000704@ronadam.com>
Message-ID: <4678F2E3.7080900@livinglogic.de>

Ron Adam wrote:
> [...]
> M Lib/tokenize.py
> M Lib/test/tokenize_tests.txt
> M Lib/test/output/test_tokenize
> - Removed unicode literals from test results and tokenize.py, and made
> it pass again.
>
> M Lib/test/output/test_pep277
> - Removed unicode literals from test results. This is a windows only
> test, so I can't test it.
>
> M Lib/test/test_codeccallbacks.py
> M Objects/exceptions.c
> - Removed unicode literals from test_codeccallbacks.py and removed
> unicode literal quoting from exceptions.c to make it pass again.
>
> M Lib/test/test_codecs.py
> M Lib/test/test_doctest.py
> M Lib/test/re_tests.py
> - Removed some literals from comments.

The following changes looked good to me:

M Lib/test/test_codeccallbacks.py
M Objects/exceptions.c
M Lib/test/test_codecs.py

so I checked them in. No opinion about the rest.

Servus,
Walter

From ncoghlan at gmail.com Wed Jun 20 12:31:38 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 20 Jun 2007 20:31:38 +1000
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To:
References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <07Jun19.105121pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <4679020A.8020609@gmail.com>

Christian Heimes wrote:
> Bill Janssen wrote:
>> Though, from the standpoint of pragmatism, removing "reduce" from the
>> built-in space will break code (*my* code, among others), and leaving
>> it in will not affect "purity", as both "map" and "reduce" are being
>> left in. So leaving it alone seems the more Pythonic response to me.
>
> Python 3000 tries to reduce (hehe) the amount of builtins so reduce was
> removed since it is rarely used. I don't understand why map and filter
> wasn't moved to functools, too.

Because (str(x) for x in seq) is not an improvement over map(str, seq) - applying a single existing function to a sequence is a very common operation.

map() accepts any function (given an appropriate number of sequences), and thus has wide applicability.

filter() accepts any single-argument predicate function (using bool() by default), and thus also has wide applicability.

reduce(), on the other hand, works only with functions that are specially designed to be fed to it - you are unlikely to have an appropriate function just lying around. Given the likely need to write a special function to perform the desired reduction, importing the reduce function itself isn't going to be much additional overhead. From the point of view of readability, it is probably going to be better to hide the fact that reduce is being used at all behind a named reduction function (or, where possible, just use one of the builtin sequence reduction functions like any(), all(), sum(), min(), max()).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org
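Nick's distinction, in code (a sketch using the Py3k semantics discussed in this thread, where map() and filter() are to return iterators):

    seq = [-3, -1, 0, 2]

    list(map(str, seq))        # applying an existing function: map reads well
    [str(x) for x in seq]      # the equivalent comprehension

    list(filter(None, seq))    # default predicate: drop false values
    [x for x in seq if x]

    # reduce() needs a purpose-built binary function, so a named helper
    # or a builtin reduction (sum, min, max, any, all) usually reads better:
    from functools import reduce   # reduce's Py3k home (a builtin in 2.x)
    import operator
    reduce(operator.mul, seq, 1)   # 0: the product of seq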
From nicko at nicko.org Wed Jun 20 12:22:11 2007
From: nicko at nicko.org (Nicko van Someren)
Date: Wed, 20 Jun 2007 11:22:11 +0100
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com>
References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <07Jun19.105121pdt."57996"@synergy1.parc.xerox.com> <435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com> <07Jun19.121327pdt."57996"@synergy1.parc.xerox.com> <4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com>
Message-ID: <4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org>

On 19 Jun 2007, at 21:53, Mike Klaas wrote:
> ...
> Stats from _any_ large python project is better than anecdotes.
> Perhaps it would be better to turn to the stdlib (367289 lines)?
...
> reduce(lambda a, b: (0, a[1] + b[1]), items)[1]
>
> (which could be written sum(x[1] for x in items)

Only if the items at index 1 happen to be numbers. That's another bugbear of mine. The sum(l) built-in is NOT equivalent to reduce(operator.add, l) in Python 2.x:

>>> reduce(operator.add, [1,2,3])
6
>>> reduce(operator.add, ['a','b','c'])
'abc'
>>> reduce(operator.add, [["a"],[u'b'],[3]])
['a', u'b', 3]
>>> sum([1,2,3])
6
>>> sum(['a','b','c'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'
>>> sum([["a"],[u'b'],[3]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'list'

Given that reduce is moving one step further away in Python 3, and given that its use seems to be somewhat discouraged these days anyway, perhaps the sum() function could be made properly polymorphic so as to remove one more class of use cases for reduce().

Nicko

From ncoghlan at gmail.com Wed Jun 20 16:44:10 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 21 Jun 2007 00:44:10 +1000
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org>
References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <07Jun19.105121pdt."57996"@synergy1.parc.xerox.com> <435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com> <07Jun19.121327pdt."57996"@synergy1.parc.xerox.com> <4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com> <4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org>
Message-ID: <46793D3A.8020303@gmail.com>

Nicko van Someren wrote:
> >>> sum(['a','b','c'])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: unsupported operand type(s) for +: 'int' and 'str'
> >>> sum([["a"],[u'b'],[3]])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: unsupported operand type(s) for +: 'int' and 'list'

You can already make the second example work properly by supplying an appropriate starting value:

>>> sum([["a"],[u'b'],[3]], [])
['a', u'b', 3]

(and a similar call will also work for the new bytes type, as well as other sequences)

Strings are explicitly disallowed (because Guido doesn't want a second way to spell ''.join(seq), as far as I know):

>>> sum(['a','b','c'], '')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sum() can't sum strings [use ''.join(seq) instead]

Cheers,
Nick.

--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org
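One way to get the polymorphism Nicko wants while keeping the empty case well-defined is to pass the identity element explicitly, just as sum() already requires for lists. A sketch (the helper name is made up; reduce is used from its Py3k home in functools, where it is still a builtin in 2.x):

    import operator
    from functools import reduce

    def polysum(items, zero):
        # `zero` doubles as the result for an empty input.
        return reduce(operator.add, items, zero)

    polysum([1, 2, 3], 0)            # 6
    polysum([["a"], ["b"], [3]], []) # ['a', 'b', 3]
    polysum([], [])                  # [] -- no ambiguity for empty input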
From ncoghlan at gmail.com Wed Jun 20 16:49:41 2007
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 21 Jun 2007 00:49:41 +1000
Subject: [Python-3000] Issues with PEP 3101 (string formatting)
In-Reply-To:
References:
Message-ID: <46793E85.4000402@gmail.com>

Chris McDonough wrote:
> Wrt http://www.python.org/dev/peps/pep-3101/
>
> PEP 3101 says Py3K should allow item and attribute access syntax
> within string templating expressions but "to limit potential security
> issues", access to underscore prefixed names within attribute/item
> access expressions will be disallowed.

Personally, I'd be fine with leaving at least the embedded attribute access out of the initial implementation of the PEP. I'd even be OK with leaving out the embedded item access, but if we leave it in, "vars(obj)" and the embedded item access would still provide a shorthand notation for access to instance variable attributes in a format string.

So +1 for leaving out embedded attribute access from the initial implementation of PEP 3101, and -0 for leaving out the embedded item access.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org

From thomas at python.org Wed Jun 20 17:13:00 2007
From: thomas at python.org (Thomas Wouters)
Date: Wed, 20 Jun 2007 08:13:00 -0700
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <46793D3A.8020303@gmail.com>
References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com> <4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com> <4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org> <46793D3A.8020303@gmail.com>
Message-ID: <9e804ac0706200813y68a7664dt206e65c3fbb9fcf7@mail.gmail.com>

On 6/20/07, Nick Coghlan wrote:
> Strings are explicitly disallowed (because Guido doesn't want a second
> way to spell ''.join(seq), as far as I know):

More importantly, because it has positively abysmal performance, just like the reduce() solution (and, in fact, many reduce solutions to problems better solved otherwise :-) Like the old input(), backticks and allowing the mixing of tabs and spaces, while it has uses, the ease and frequency with which it is misused outweigh the utility enough that it should not be in such a prominent place.

--
Thomas Wouters

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20070620/9ca59b39/attachment.htm

From lists at cheimes.de Wed Jun 20 17:32:27 2007
From: lists at cheimes.de (Christian Heimes)
Date: Wed, 20 Jun 2007 17:32:27 +0200
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <07Jun19.184653pdt."57996"@synergy1.parc.xerox.com>
References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com> <07Jun19.184653pdt."57996"@synergy1.parc.xerox.com>
Message-ID:

Bill Janssen wrote:
> Good point. Though I just grepped all my Python sources, and I never
> do that, so presumably the obvious workaround of

I'm using seek(0, 2) + tell() sometimes when I need to know the file size and don't want to worry about buffers.
    pos = fd.tell()
    size = None
    try:
        fd.seek(0, 2)
        size = fd.tell()
    finally:
        fd.seek(pos)

IMO you made a good point. The seek() arguments are really too UNIX-centric and hard to understand for newbies. The os module contains three aliases for seek (SEEK_CUR, SEEK_END, SEEK_SET) (why is it called SET and not START?) but they are rarely used. What do you think about adding two additional functions which act as aliases for whence = 1 and whence = 2?

    def seek(self, pos: int, whence: int = 0) -> int:
        """Change stream position.

        Seek to byte offset pos relative to position indicated by whence:
            0  Start of stream (the default); pos should be >= 0.
            1  Current position; pos may be negative.
            2  End of stream; pos usually negative.
        Returns the new absolute position.
        """

    def seekcur(self, pos: int) -> int:
        """Seek relative to the current position.

        (Alternative names: seekrel, seek_relative.)
        """
        return self.seek(pos, 1)

    def seekend(self, pos: int) -> int:
        """Seek relative to the end of the stream.

        (Alternative name: seekeof.)
        """
        return self.seek(pos, 2)

From lists at cheimes.de Wed Jun 20 17:43:51 2007
From: lists at cheimes.de (Christian Heimes)
Date: Wed, 20 Jun 2007 17:43:51 +0200
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <4679020A.8020609@gmail.com>
References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <07Jun19.105121pdt."57996"@synergy1.parc.xerox.com> <4679020A.8020609@gmail.com>
Message-ID:

Nick Coghlan wrote:
> Because (str(x) for x in seq) is not an improvement over map(str, seq) -
> applying a single existing function to a sequence is a very common
> operation.
>
> map() accepts any function (given an appropriate number of sequences),
> and thus has wide applicability.
>
> filter() accepts any single-argument predicate function (using bool() by
> default), and thus also has wide applicability.

But map() and filter() can be easily replaced with a generator expression or list comprehension. [str(x) for x in seq] and [x for x in seq if func(x)] are considered easier to read these days.

IIRC map and filter are a bit slower than list comprehensions and may use much more memory than a generator expression since map and filter return a list of results. Personally I don't see the reason why map and filter are still builtins when they can be replaced with easier to read, faster and less memory-consuming code. OK, I have to type some more characters but that's not an issue for me.

Christian

From chrism at plope.com Wed Jun 20 17:52:47 2007
From: chrism at plope.com (Chris McDonough)
Date: Wed, 20 Jun 2007 11:52:47 -0400
Subject: [Python-3000] Issues with PEP 3101 (string formatting)
In-Reply-To: <46793E85.4000402@gmail.com>
References: <46793E85.4000402@gmail.com>
Message-ID:

Allowing attribute and/or item access within templating expressions has historically been the domain of full-on templating languages (which invariably also have a way to do repeats, conditionals, arbitrary method calls, etc). I think it should probably stay that way because to me, at least, there's not much more compelling about being able to do item/attribute access within a template expression than there is to be able to do replacements using results from arbitrary method calls. It's fairly arbitrary to allow calls to __getitem__ and __getattr__ but prevent, say, calls to "traverse", at least if the format arguments are not restricted to plain lists/tuples/dicts.
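Concretely, the PEP's field syntax bottoms out in ordinary protocol calls, which is the equivalence Chris is pointing at (a sketch using the proposed syntax; the dict is illustrative):

    data = {"users": ["alice", "bob"]}
    "{0[users]}".format(data)   # calls data.__getitem__('users')
    "{0.keys}".format(data)     # calls getattr(data, 'keys')

    # Either hook may run arbitrary code on a user-defined object, so
    # drawing the security line at underscored names buys very little.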
That's not to say that maybe an extended templating thingy shouldn't ship within the stdlib though, maybe even one that extends the default interpolation syntax in these sorts of ways.

- C

On Jun 20, 2007, at 10:49 AM, Nick Coghlan wrote:
> Chris McDonough wrote:
>> Wrt http://www.python.org/dev/peps/pep-3101/
>> PEP 3101 says Py3K should allow item and attribute access syntax
>> within string templating expressions but "to limit potential
>> security issues", access to underscore prefixed names within
>> attribute/item access expressions will be disallowed.
>
> Personally, I'd be fine with leaving at least the embedded
> attribute access out of the initial implementation of the PEP. I'd
> even be OK with leaving out the embedded item access, but if we
> leave it in, "vars(obj)" and the embedded item access would still
> provide a shorthand notation for access to instance variable
> attributes in a format string.
>
> So +1 for leaving out embedded attribute access from the initial
> implementation of PEP 3101, and -0 for leaving out the embedded
> item access.
>
> Cheers,
> Nick.
>
> --
> Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
> ---------------------------------------------------------------
> http://www.boredomandlaziness.org
>

From veloso at verylowsodium.com Wed Jun 20 19:00:59 2007
From: veloso at verylowsodium.com (Greg Falcon)
Date: Wed, 20 Jun 2007 13:00:59 -0400
Subject: [Python-3000] [Python-Dev] Issues with PEP 3101 (string formatting)
In-Reply-To:
References:
Message-ID: <3cdcefb80706201000t91f5fd4ia8fac97d1dbc3d@mail.gmail.com>

On 6/19/07, Chris McDonough wrote:
> Given that security is a concern at all, the only truly reasonable
> way to "limit security issues" is to disallow item and attribute
> access completely within the string templating expression syntax. It
> seems gratuitous to me to encourage string templating expressions
> with item/attribute access, given that you could do it within the
> format arguments just as easily in the 99% case, and we've (well...
> I've) happily been living with that restriction for years now.
>
> But if this syntax is preserved, there really should be no *default*
> restrictions on the traversable names within an expression because
> this will almost certainly become a hard-to-explain, hard-to-justify
> bug magnet as it has become in Zope.

This sounds exactly right to me. I don't have strong feelings either way about attribute lookups in formatting strings, or the security problems they raise. But while it seems a reasonable stance that user-injected getattr()s may pose a security problem, what seems indefensible is the stance that user-injected getattr()s are okay precisely when the attribute being looked up doesn't start with an underscore.

A single underscore prefix is a hint to human readers, not to the language itself, and things should stay that way.

Greg F

From janssen at parc.com Wed Jun 20 19:03:49 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 20 Jun 2007 10:03:49 PDT
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net>
References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com> <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net>
Message-ID: <07Jun20.100358pdt."57996"@synergy1.parc.xerox.com>

> > TextIOBase: this seems an odd mix of high-level and low-level. I'd
> > remove "seek", "tell", "read", and "write". Remember that in Python,
> > mixins actually work, so that you can provide a file object that
> > combines several different I/O classes.
>
> Huh? All those operations you want to remove are entirely necessary
> for a number of applications. I'm not sure what you meant about mixins?

I meant that TextIOBase should just provide the operations for text. The other operations would be supported, when appropriate, by mixing in an appropriate class that provides them. Remember that this is a PEP about base classes.

> It doesn't work? Why not? Of course read() should take the number of
> characters as a parameter, not number of bytes.

Unfortunately, files contain encodings of characters, and those encodings may at times be mapped to multiple equivalent strings, at least with respect to Unicode, the target for Python-3000. The standard Unicode support for Python-3000 seems to be settling on having code-point representations of those strings exposed to the application, which means that any specific automatic normalization is precluded. So any particular "readchars(1)" operation may validly return different strings even if operating on the same underlying file, and may require a different number of read operations to read the same underlying bytes. That is, I believe that the string and/or file operations are not well-specified enough to guarantee that this won't happen. This is the same situation we have today, which means that the only real way to read Unicode strings from a file will be the same as today, that is, read raw bytes from a file, decode them and normalize them in some specific way, and then see what string you wind up with. You could probably fix this in the PEP by specifying a specific Unicode normalization to use when returning strings.

> > feel the need. Stick to just "readline" and "writeline" for text I/O.
>
> Ah, not everyone dealing with text is dealing with line-delimited
> text, you know...

It's really the only difference between text and non-text.

Bill

From janssen at parc.com Wed Jun 20 19:09:04 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 20 Jun 2007 10:09:04 PDT
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <46793D3A.8020303@gmail.com>
References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <07Jun19.105121pdt."57996"@synergy1.parc.xerox.com> <435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com> <07Jun19.121327pdt."57996"@synergy1.parc.xerox.com> <4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com> <4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org> <46793D3A.8020303@gmail.com>
Message-ID: <07Jun20.100913pdt."57996"@synergy1.parc.xerox.com>

> Strings are explicitly disallowed (because Guido doesn't want a second
> way to spell ''.join(seq), as far as I know):

Isn't "map(str, x)" just a second way to write "[str(y) for y in x]"? This "second way" argument is another often-heard bogon. There are lots of second-way and third-way techniques in Python, and properly so. It's more important to make things work consistently than to only have "one way". "sum" should concatenate strings.

Bill
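For the record, the performance asymmetry behind Guido's choice (a sketch; timings indicative only):

    words = ["ab"] * 1000

    "".join(words)       # one pass, one allocation: linear

    total = ""
    for w in words:      # what a generic sum() would do: each +=
        total += w       # copies the string built so far, so quadratic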
From janssen at parc.com Wed Jun 20 19:11:08 2007
From: janssen at parc.com (Bill Janssen)
Date: Wed, 20 Jun 2007 10:11:08 PDT
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To:
References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com> <07Jun19.184653pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <07Jun20.101108pdt."57996"@synergy1.parc.xerox.com>

Not bad, but if you're going that route, I think I'd get rid of the optional arguments, and just say

    seek_from_beginning(INCR: int)

    seek_from_current(INCR: int)

    seek_from_end(DECR: int)

Bill

From nicko at nicko.org Wed Jun 20 19:12:20 2007
From: nicko at nicko.org (Nicko van Someren)
Date: Wed, 20 Jun 2007 18:12:20 +0100
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <46793D3A.8020303@gmail.com>
References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <07Jun19.105121pdt."57996"@synergy1.parc.xerox.com> <435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com> <07Jun19.121327pdt."57996"@synergy1.parc.xerox.com> <4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com> <46793D3A.8020303@gmail.com>
Message-ID: <5DF9AC3D-CD82-4912-8C19-004131106F40@nicko.org>

On 20 Jun 2007, at 15:44, Nick Coghlan wrote:
> Nicko van Someren wrote:
>> >>> sum(['a','b','c'])
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> TypeError: unsupported operand type(s) for +: 'int' and 'str'
>> >>> sum([["a"],[u'b'],[3]])
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> TypeError: unsupported operand type(s) for +: 'int' and 'list'
>
> You can already make the second example work properly by supplying
> an appropriate starting value:
>
> >>> sum([["a"],[u'b'],[3]], [])
> ['a', u'b', 3]
>
> (and a similar call will also work for the new bytes type, as well
> as other sequences)

The need to have an explicit 'start' value just seems wrong. It's horribly inconsistent. Things that can be added to integers work without initialisers but things that can be added to each other (for instance numbers in number fields or vectors in vector spaces) cannot. I think in most people's minds the 'sum' operation is like an evaluation of "+".join(...): you are sticking an addition operation between the elements of the list. The need to have an explicit initial value means that sum() is not the sum function for anyone who does math in any sort of non-standard number space.

> Strings are explicitly disallowed (because Guido doesn't want a
> second way to spell ''.join(seq), as far as I know):
>
> >>> sum(['a','b','c'], '')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: sum() can't sum strings [use ''.join(seq) instead]

I can appreciate the value of TOOWTDI, and I appreciate that (in the absence of string concatenation by reference) the performance of string sum() would suck, but I still think that wilfully making things inconsistent in order to enforce TOOWTDI is going too far.

Nicko

From martin at v.loewis.de Wed Jun 20 19:20:42 2007
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 20 Jun 2007 19:20:42 +0200
Subject: [Python-3000] How best to handle failing tests in struni?
In-Reply-To:
References:
Message-ID: <467961EA.4060007@v.loewis.de>

> So, my question is how best to handle this test (and thus other tests
> like it). Should it just continue to fail until someone fixes
> _bsddb.c to accept Unicode keys (and thus start up a FAILING file
> listing the various tests that are failing and doc which ones are
> expected to fail until something specific changes)? Or do we silence
> the failure by making the constants pass through str8? Or should str8
> not even be used at all since (I assume) it won't survive the merge
> back into p3yk?

This goes back to the text-vs-binary debate. I _think_ bsddb inherently operates on binary data, i.e. neither keys nor values need to be text in some sense.

So the most natural way would be to make it accept binary data only on input, and always produce binary data on output. Any *usage* that expects to be able to pass in strings is broken.

Regards,
Martin
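The string-keyed usage Martin calls broken is easy to support one layer up, which is roughly where Mike Klaas lands later in the thread. A sketch, assuming a dict-like bsddb handle (this wrapper is hypothetical, not the module's actual API):

    class TextKeyedDB:
        # Present a str-keyed view over a bytes-keyed database.
        def __init__(self, db, encoding="utf-8"):
            self.db = db
            self.encoding = encoding

        def __setitem__(self, key, value):
            self.db[key.encode(self.encoding)] = value.encode(self.encoding)

        def __getitem__(self, key):
            return self.db[key.encode(self.encoding)].decode(self.encoding)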
Should it just continue to fail until someone fixes > _bsddb.c to accept Unicode keys (and thus start up a FAILING file > listing the various tests that are failing and doc which ones are > expected to fail until something specific changes)? Or do we silence > the failure by making the constants pass through str8? Or should str8 > not even be used at all since (I assume) it won't survive the merge > back into p3yk? This goes back to the text-vs-binary debate. I _think_ bsddb inherently operates on binary data, i.e. neither keys nor values need to be text in some sense. So the most natural way would be to make it accept binary data only on input, and always produce binary data on output. Any *usage* that expect to be able to pass in strings is broken. Regards, Martin From guido at python.org Wed Jun 20 19:24:55 2007 From: guido at python.org (Guido van Rossum) Date: Wed, 20 Jun 2007 10:24:55 -0700 Subject: [Python-3000] How best to handle failing tests in struni? In-Reply-To: <467961EA.4060007@v.loewis.de> References: <467961EA.4060007@v.loewis.de> Message-ID: On 6/20/07, "Martin v. L?wis" wrote: > > So, my question is how best to handle this test (and thus other tests > > like it). Should it just continue to fail until someone fixes > > _bsddb.c to accept Unicode keys (and thus start up a FAILING file > > listing the various tests that are failing and doc which ones are > > expected to fail until something specific changes)? Or do we silence > > the failure by making the constants pass through str8? Or should str8 > > not even be used at all since (I assume) it won't survive the merge > > back into p3yk? > > This goes back to the text-vs-binary debate. I _think_ bsddb inherently > operates on binary data, i.e. neither keys nor values need to be text > in some sense. > > So the most natural way would be to make it accept binary data only on > input, and always produce binary data on output. Any *usage* that > expect to be able to pass in strings is broken. OTOH, pragmatically, people will generally use text strings for db keys. I'm not sure how to decide this; perhaps we need to take it public. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From aleaxit at gmail.com Wed Jun 20 19:50:48 2007 From: aleaxit at gmail.com (Alex Martelli) Date: Wed, 20 Jun 2007 10:50:48 -0700 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <4679020A.8020609@gmail.com> Message-ID: On 6/20/07, Christian Heimes wrote: > Nick Coghlan wrote: > > Because (str(x) for x in seq) is not an improvement over map(str, x) - > > applying a single existing function to a sequence is a very common > > operation. > > > > map() accepts any function (given an appropriate number of sequences), > > and thus has wide applicability. > > > > filter() accepts any single argument predicate function (using bool() by > > default), and thus also has wide applicability. > > But map() and filter() can be easily replaced with a generator or list > comprehensive expression. [str(x) for x in seq] and [x for x in seq if > func(x)] are consider easier to read these days. > > IIRC map and filter are a bit slower than list comprehensive and may use > much more memory than a generator expression since map and filter are > returning a list of results. 
No, in 3.0 they'll return iterables -- you really SHOULD read Guido's blog entry referred to at the top of this thread before discussing Python 3.0 issues. So, there's no reason their performance should suffer, either -- using today's itertools.imap as a stand-in for 3.0's map, for example:

$ python -mtimeit -s'import itertools as it' -s'L=range(-7,17)' 'for x in it.imap(abs,L): pass'
100000 loops, best of 3: 3 usec per loop
$ python -mtimeit -s'import itertools as it' -s'L=range(-7,17)' 'for x in (abs(y) for y in L): pass'
100000 loops, best of 3: 4.47 usec per loop

(imap is faster in this case because the built-in name 'abs' is looked up only once -- in the genexp, it's looked up each time, sigh -- possibly the biggest "we should REALLY tweak the language to let this be optimized sensibly" gotcha in Python, IMHO).

Alex

From exarkun at divmod.com Wed Jun 20 19:51:25 2007
From: exarkun at divmod.com (Jean-Paul Calderone)
Date: Wed, 20 Jun 2007 13:51:25 -0400
Subject: [Python-3000] How best to handle failing tests in struni?
In-Reply-To:
Message-ID: <20070620175125.4947.1259671454.divmod.quotient.3074@ohm>

On Wed, 20 Jun 2007 10:24:55 -0700, Guido van Rossum wrote:
>On 6/20/07, "Martin v. Löwis" wrote:
>> > So, my question is how best to handle this test (and thus other tests
>> > like it). Should it just continue to fail until someone fixes
>> > _bsddb.c to accept Unicode keys (and thus start up a FAILING file
>> > listing the various tests that are failing and doc which ones are
>> > expected to fail until something specific changes)? Or do we silence
>> > the failure by making the constants pass through str8? Or should str8
>> > not even be used at all since (I assume) it won't survive the merge
>> > back into p3yk?
>>
>> This goes back to the text-vs-binary debate. I _think_ bsddb inherently
>> operates on binary data, i.e. neither keys nor values need to be text
>> in some sense.
>>
>> So the most natural way would be to make it accept binary data only on
>> input, and always produce binary data on output. Any *usage* that
>> expects to be able to pass in strings is broken.
>
>OTOH, pragmatically, people will generally use text strings for db keys.
>
>I'm not sure how to decide this; perhaps we need to take it public.

If it helps, after having used bsddb for a couple of years and developed a non-trivial library on top of it, what Martin said seems most sensible to me.

Jean-Paul

From alexandre at peadrop.com Wed Jun 20 19:58:09 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Wed, 20 Jun 2007 13:58:09 -0400
Subject: [Python-3000] Summary of the differences between StringIO and cStringIO for PEP-3108
Message-ID:

Hi,

I've written a short summary of the differences between the StringIO and cStringIO modules. I attached it as a patch for PEP-3108.

-- Alexandre
-------------- next part --------------
A non-text attachment was scrubbed...
Name: semantic_diff_stringio.patch
Type: text/x-patch
Size: 1655 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070620/0d65cca8/attachment.bin

From daniel at stutzbachenterprises.com Wed Jun 20 20:17:14 2007
From: daniel at stutzbachenterprises.com (Daniel Stutzbach)
Date: Wed, 20 Jun 2007 13:17:14 -0500
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <6002181751375776921@unknownmsgid>
References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net> <6002181751375776921@unknownmsgid>
Message-ID:

On 6/20/07, Bill Janssen wrote:
> > Ah, not everyone dealing with text is dealing with line-delimited
> > text, you know...
>
> It's really the only difference between text and non-text.

Text is a sequence of characters. Non-text is a sequence of bytes. Characters may be multi-byte. It is no longer an ASCII world.

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises LLC

From jimjjewett at gmail.com Wed Jun 20 20:33:10 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 20 Jun 2007 14:33:10 -0400
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <6382747756546996159@unknownmsgid>
References: <6382747756546996159@unknownmsgid>
Message-ID:

On 6/20/07, Bill Janssen wrote:
> Not bad, but if you're going that route, I think I'd get rid of the
> optional arguments, and just say
>
> seek_from_beginning(INCR: int)
>
> seek_from_current(INCR: int)
>
> seek_from_end(DECR: int)

    goto(pos)        # absolute
    move(incr: int)  # relative to current position

negative numbers can be interpreted naturally; for move they go backwards, and for goto they count from the end.

This would require either a length, or a special value (None?) for at least one of Start and End, because 0 == -0.

Note that this makes sense for bytes; I'm not sure exactly how unicode characters even should be counted, without a normalization promise.

-jJ

From lists at cheimes.de Wed Jun 20 19:45:40 2007
From: lists at cheimes.de (Christian Heimes)
Date: Wed, 20 Jun 2007 19:45:40 +0200
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <07Jun20.101108pdt."57996"@synergy1.parc.xerox.com>
References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com> <07Jun19.184653pdt."57996"@synergy1.parc.xerox.com> <07Jun20.101108pdt."57996"@synergy1.parc.xerox.com>
Message-ID:

Bill Janssen wrote:
> Not bad, but if you're going that route, I think I'd get rid of the
> optional arguments, and just say
>
> seek_from_beginning(INCR: int)
>
> seek_from_current(INCR: int)
>
> seek_from_end(DECR: int)

I don't like it. It's too noisy and too much to type. My mini proposal has the benefit that it is backward compatible. Besides, your argument names aren't correct: seek_from_current can be negative to seek backward, and seek_from_end can be positive to enlarge a file. On some OSes a seek after EOF creates a sparse file.

From brett at python.org Wed Jun 20 20:50:50 2007
From: brett at python.org (Brett Cannon)
Date: Wed, 20 Jun 2007 11:50:50 -0700
Subject: [Python-3000] How best to handle failing tests in struni?
In-Reply-To:
References: <467961EA.4060007@v.loewis.de>
Message-ID:

On 6/20/07, Guido van Rossum wrote:
> On 6/20/07, "Martin v. Löwis" wrote:
> > > So, my question is how best to handle this test (and thus other tests
> > > like it).
> > > Should it just continue to fail until someone fixes
> > > _bsddb.c to accept Unicode keys (and thus start up a FAILING file
> > > listing the various tests that are failing and doc which ones are
> > > expected to fail until something specific changes)? Or do we silence
> > > the failure by making the constants pass through str8? Or should str8
> > > not even be used at all since (I assume) it won't survive the merge
> > > back into p3yk?
> >
> > This goes back to the text-vs-binary debate. I _think_ bsddb inherently
> > operates on binary data, i.e. neither keys nor values need to be text
> > in some sense.
> >
> > So the most natural way would be to make it accept binary data only on
> > input, and always produce binary data on output. Any *usage* that
> > expects to be able to pass in strings is broken.
>
> OTOH, pragmatically, people will generally use text strings for db keys.
>
> I'm not sure how to decide this; perhaps we need to take it public.

That's fine since I don't want to fix it. =) So kick this out to python-dev then?

And speaking of struni, when I realized that fixing _bsddb.c was not going to be simple, I moved on to the next test (test_asynchat) and came across a string with an 's' prefix. Just to make sure I got everything straight, str8 produces a classic str instance (pure ASCII) and a string with an 's' prefix is a str8 string. Are there any other differences to be aware of when working on the branch?

And I assume the PyString API is going away, so when working on a module one should just tear out use of the API and convert it over to PyUnicode, correct? And do the same for "s" format characters in Py_BuildValue and PyArg_ParseTuple?

I just want to get an idea of the basic process going on to do the conversion so that I don't have to figure out the hard way.

-Brett

From brett at python.org Wed Jun 20 20:58:44 2007
From: brett at python.org (Brett Cannon)
Date: Wed, 20 Jun 2007 11:58:44 -0700
Subject: [Python-3000] Summary of the differences between StringIO and cStringIO for PEP-3108
In-Reply-To:
References:
Message-ID:

On 6/20/07, Alexandre Vassalotti wrote:
> Hi,
>
> I've written a short summary of the differences between the StringIO and
> cStringIO modules. I attached it as a patch for PEP-3108.

Thanks for the summary, Alexandre. Luckily your new version for the io library does away with all of those issues for Py3K.

-Brett

From alexandre at peadrop.com Wed Jun 20 21:05:17 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Wed, 20 Jun 2007 15:05:17 -0400
Subject: [Python-3000] Summary of the differences between StringIO and cStringIO for PEP-3108
In-Reply-To:
References:
Message-ID:

On 6/20/07, Brett Cannon wrote:
> Thanks for the summary, Alexandre. Luckily your new version for the
> io library does away with all of those issues for Py3K.

Yes, all these issues are fixed (except the pickle thing), in my new version.

-- Alexandre

From lists at cheimes.de Wed Jun 20 20:40:19 2007
From: lists at cheimes.de (Christian Heimes)
Date: Wed, 20 Jun 2007 20:40:19 +0200
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To:
References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <4679020A.8020609@gmail.com>
Message-ID:

Alex Martelli wrote:
> No, in 3.0 they'll return iterables -- you really SHOULD read Guido's
> blog entry referred to at the top of this thread, before
> discussing Python 3.0 issues.

I read it. I also wasn't sure if map returns a special iterable like dict.keys() or a list, so I tried it before I wrote my posting:

Python 3.0x (p3yk:56022, Jun 18 2007, 21:10:13)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> map(str, range(5))
['0', '1', '2', '3', '4']
>>> type(map(str, range(5)))

It looks like an ordinary list to me.

Christian

From eopadoan at altavix.com Wed Jun 20 21:18:41 2007
From: eopadoan at altavix.com (Eduardo "EdCrypt" O. Padoan)
Date: Wed, 20 Jun 2007 16:18:41 -0300
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To:
References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <4679020A.8020609@gmail.com>
Message-ID:

> It looks like an ordinary list to me.

There are many things to implement yet.

From martin at v.loewis.de Wed Jun 20 21:25:32 2007
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Wed, 20 Jun 2007 21:25:32 +0200
Subject: [Python-3000] How best to handle failing tests in struni?
In-Reply-To:
References: <467961EA.4060007@v.loewis.de>
Message-ID: <46797F2C.8090505@v.loewis.de>

> And speaking of struni, when I realized that fixing _bsddb.c was not
> going to be simple, I moved on to the next test (test_asynchat) and
> came across a string with an 's' prefix. Just to make sure I got
> everything straight, str8 produces a classic str instance (pure ASCII)
> and a string with an 's' prefix is a str8 string. Are there any
> other differences to be aware of when working on the branch?

There appears to be some disagreement on what the objective for that branch is. I would personally like to see str8 disappear, at least from the Python API. IOW, if a str8 shows up somewhere, check whether it might be easy to replace it with a Unicode string.

> And I assume the PyString API is going away, so when working on a
> module one should just tear out use of the API and convert it over to
> PyUnicode, correct?

The API will stay. However, it should get used less and less. Whether to convert to Unicode depends on the use case. It might be that converting to binary is the right answer.

> And do the same for "s" format characters in
> Py_BuildValue and PyArg_ParseTuple?

Again, depends. On ParseTuple, the default encoding is applied, which is supposed to always work, and always provides you with a char*. (it currently produces a str8 internally, but eventually should create a bytes object instead). For BuildValue, I would recommend that the s format code produce a Unicode object. That might be ambiguous, as some people might want to create bytes instead, but I recommend designating a different code for creating bytes in BuildValue.

> I just want to get an idea of the basic process going on to do the
> conversion so that I don't have to figure out the hard way.

I think many questions are still open, and should be discussed (or Guido will have to publish a policy, in case he has already made up his mind).

Regards,
Martin

From g.brandl at gmx.net Wed Jun 20 21:14:17 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Wed, 20 Jun 2007 21:14:17 +0200
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To:
References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <4679020A.8020609@gmail.com>
Message-ID:

Christian Heimes schrieb:
> Alex Martelli wrote:
>> No, in 3.0 they'll return iterables -- you really SHOULD read Guido's
>> blog entry referred to at the top of this thread, before
>> discussing Python 3.0 issues.
>
> I read it. I also wasn't sure if map returns a special iterable like
> dict.keys() or a list, so I tried it before I wrote my posting:
>
> Python 3.0x (p3yk:56022, Jun 18 2007, 21:10:13)
> [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> map(str, range(5))
> ['0', '1', '2', '3', '4']
>>>> type(map(str, range(5)))
>
> It looks like an ordinary list to me.

Well, not everything that's planned is implemented yet in Py3k. So, you should really believe the *plans* rather than the *branch*.

Georg

From rrr at ronadam.com Wed Jun 20 21:29:10 2007
From: rrr at ronadam.com (Ron Adam)
Date: Wed, 20 Jun 2007 14:29:10 -0500
Subject: [Python-3000] How best to handle failing tests in struni?
In-Reply-To:
References: <467961EA.4060007@v.loewis.de>
Message-ID: <46798006.70407@ronadam.com>

Brett Cannon wrote:
> On 6/20/07, Guido van Rossum wrote:
>> On 6/20/07, "Martin v. Löwis" wrote:
>>>> So, my question is how best to handle this test (and thus other tests
>>>> like it). Should it just continue to fail until someone fixes
>>>> _bsddb.c to accept Unicode keys (and thus start up a FAILING file
>>>> listing the various tests that are failing and doc which ones are
>>>> expected to fail until something specific changes)? Or do we silence
>>>> the failure by making the constants pass through str8? Or should str8
>>>> not even be used at all since (I assume) it won't survive the merge
>>>> back into p3yk?
>>> This goes back to the text-vs-binary debate. I _think_ bsddb inherently
>>> operates on binary data, i.e. neither keys nor values need to be text
>>> in some sense.
>>>
>>> So the most natural way would be to make it accept binary data only on
>>> input, and always produce binary data on output. Any *usage* that
>>> expects to be able to pass in strings is broken.
>> OTOH, pragmatically, people will generally use text strings for db keys.
>>
>> I'm not sure how to decide this; perhaps we need to take it public.
>
> That's fine since I don't want to fix it. =) So kick this out to
> python-dev then?
>
> And speaking of struni, when I realized that fixing _bsddb.c was not
> going to be simple, I moved on to the next test (test_asynchat) and
> came across a string with an 's' prefix. Just to make sure I got
> everything straight, str8 produces a classic str instance (pure ASCII)
> and a string with an 's' prefix is a str8 string. Are there any
> other differences to be aware of when working on the branch?

There's no 'u' prefix on unicode strings obviously. ;-)

The 's' prefix was my idea as a temporary way to differentiate unicode and str8 while the conversion is taking place. It will most likely be removed after all or most of the str8 values are replaced by unicode or bytes.

Ron

> And I assume the PyString API is going away, so when working on a
> module one should just tear out use of the API and convert it over to
> PyUnicode, correct? And do the same for "s" format characters in
> Py_BuildValue and PyArg_ParseTuple?
>
> I just want to get an idea of the basic process going on to do the
> conversion so that I don't have to figure out the hard way.

From mike.klaas at gmail.com Wed Jun 20 22:34:15 2007
From: mike.klaas at gmail.com (Mike Klaas)
Date: Wed, 20 Jun 2007 13:34:15 -0700
Subject: [Python-3000] How best to handle failing tests in struni?
In-Reply-To: <20070620175125.4947.1259671454.divmod.quotient.3074@ohm>
References: <20070620175125.4947.1259671454.divmod.quotient.3074@ohm>
Message-ID: <30C6361F-4BDC-4638-9AF0-2BB1790BF4BD@gmail.com>

On 20-Jun-07, at 10:51 AM, Jean-Paul Calderone wrote:
> On Wed, 20 Jun 2007 10:24:55 -0700, Guido van Rossum wrote:
>
> > OTOH, pragmatically, people will generally use text strings for db keys.
>>
>> I'm not sure how to decide this; perhaps we need to take it public.
>
> If it helps, after having used bsddb for a couple of years and
> developed a non-trivial library on top of it, what Martin said
> seems most sensible to me.

As an extremely heavy user of bsddb, +1. Berkeley db is rather sensitive to how things are serialized (for instance, big-endian is much better for ints, performance-wise), so it is necessary to let the developer control this on a bytestring level. It is easy to write a wrapper on top of this to do the serialization automatically.

-Mike

From gareth.mccaughan at pobox.com Thu Jun 21 00:58:15 2007
From: gareth.mccaughan at pobox.com (Gareth McCaughan)
Date: Wed, 20 Jun 2007 23:58:15 +0100
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <5DF9AC3D-CD82-4912-8C19-004131106F40@nicko.org>
References: <46793D3A.8020303@gmail.com> <5DF9AC3D-CD82-4912-8C19-004131106F40@nicko.org>
Message-ID: <200706202358.16087.gareth.mccaughan@pobox.com>

On Wednesday 20 June 2007 18:12, Nicko van Someren wrote (about summing strings, etc.):

> The need to have an explicit 'start' value just seems wrong. It's
> horribly inconsistent. Things that can be added to integers work
> without initialisers but things that can be added to each other (for
> instance numbers in number fields or vectors in vector spaces) cannot.
-- Gareth McCaughan From janssen at parc.com Thu Jun 21 02:33:45 2007 From: janssen at parc.com (Bill Janssen) Date: Wed, 20 Jun 2007 17:33:45 PDT Subject: [Python-3000] On PEP 3116: new I/O base classes In-Reply-To: References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net> <6002181751375776921@unknownmsgid> Message-ID: <07Jun20.173354pdt."57996"@synergy1.parc.xerox.com> Daniel Stutzbach wrote: > On 6/20/07, Bill Janssen wrote: > > > Ah, not everyone dealing with text is dealing with line-delimited > > > text, you know... > > > > It's really the only difference between text and non-text. > > Text is a sequence of characters. Non-text is a sequence of bytes. > Characters may be multi-byte. It is no longer an ASCII world. Yes, of course, Daniel, but I was speaking of the contents of files, and files are inherently sequences of bytes. If we are talking about some layer which interprets the contents of a file, just saying "give me N characters" isn't enough. We need to say, "N characters assuming a text encoding of M, with a normalization policy of Q, and a newline policy of R". If we don't, we can't just "read" N characters safely. So I think it's broken to put this in the TextIOBase class; instead, there should be some wrapper class that does buffering and can be configured as to (M, Q, R). Bill From janssen at parc.com Thu Jun 21 02:46:42 2007 From: janssen at parc.com (Bill Janssen) Date: Wed, 20 Jun 2007 17:46:42 PDT Subject: [Python-3000] On PEP 3116: new I/O base classes In-Reply-To: References: <6382747756546996159@unknownmsgid> Message-ID: <07Jun20.174649pdt."57996"@synergy1.parc.xerox.com> > goto(pos) # absolute > > move(incr:int) # relative to current position > > negative numbers can be interpreted naturally; for move they go > backwards, and for goto they count from the end. > > This would require either a length, or a special value (None?) for at > least one of Start and End, because 0 == -0. I like this idea. Define START and END as values in the "file" class. > I'm not sure exactly how unicode > characters even should be counted, without a normalization promise. No one's sure. That's why "read(N: int) => str" doesn't make sense. Bill From janssen at parc.com Thu Jun 21 02:49:47 2007 From: janssen at parc.com (Bill Janssen) Date: Wed, 20 Jun 2007 17:49:47 PDT Subject: [Python-3000] On PEP 3116: new I/O base classes In-Reply-To: References: <07Jun19.114752pdt."57996"@synergy1.parc.xerox.com> <07Jun19.184653pdt."57996"@synergy1.parc.xerox.com> <07Jun20.101108pdt."57996"@synergy1.parc.xerox.com> Message-ID: <07Jun20.174955pdt."57996"@synergy1.parc.xerox.com> Christian Heimes wrote: > Bill Janssen wrote: > > Not bad, but if you're going that route, I think I'd get rid of the > > optional arguments, and just say > > > > seek_from_beginning(INCR: int) > > > > seek_from_current(INCR: int) > > > > seek_from_end(DECR: int) > > I don't like it. It's too noisy and too much to type. Well, it would be noisy, and the complaint about length would apply, if these were widely used many times in one piece of code, but they aren't. So that doesn't matter, and in its favor, it's clear, consistent, and easy to remember. 
Bill From daniel at stutzbachenterprises.com Thu Jun 21 02:54:17 2007 From: daniel at stutzbachenterprises.com (Daniel Stutzbach) Date: Wed, 20 Jun 2007 19:54:17 -0500 Subject: [Python-3000] On PEP 3116: new I/O base classes In-Reply-To: <-665892861201335771@unknownmsgid> References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net> <6002181751375776921@unknownmsgid> <-665892861201335771@unknownmsgid> Message-ID: On 6/20/07, Bill Janssen wrote: > Yes, of course, Daniel, but I was speaking of the contents of files, > and files are inherently sequences of bytes. If we are talking about > some layer which interprets the contents of a file, just saying "give > me N characters" isn't enough. We need to say, "N characters assuming > a text encoding of M, with a normalization policy of Q, and a newline > policy of R". If we don't, we can't just "read" N characters safely. > So I think it's broken to put this in the TextIOBase class; instead, > there should be some wrapper class that does buffering and can be > configured as to (M, Q, R). The PEP specifies that TextIOWrapper objects (the primary implementation of the TextIOBase interface) are created via the following signature: .__init__(self, buffer, encoding=None, newline=None) In other words, TextIOBase *is* the wrapper type that does the buffering and allows the user to configure M and R. Are you suggesting that TextIOBase should be split into two classes, one of which provides the (M, R) functionality and one of which does not? If so, how would the later be different from the RawIOBase and BufferedIOBase classes, already described in the PEP? I'm not sure I 100% understand what you mean by "normalization policy" (Q). Could you give an example? -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises LLC From janssen at parc.com Thu Jun 21 04:00:57 2007 From: janssen at parc.com (Bill Janssen) Date: Wed, 20 Jun 2007 19:00:57 PDT Subject: [Python-3000] On PEP 3116: new I/O base classes In-Reply-To: References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net> <6002181751375776921@unknownmsgid> <-665892861201335771@unknownmsgid> Message-ID: <07Jun20.190105pdt."57996"@synergy1.parc.xerox.com> > I'm not sure I 100% understand what you mean by "normalization policy" > (Q). Could you give an example? I was speaking of the 4 different normalization forms for Unicode, which can produce different code-point sequences. Since "strings" in Python-3000 aren't really strings, but instead are immutable code-point sequences, this means that any byte-to-string transformation which doesn't specify this can produce different strings from the same bytes without violating its constraints. Bill From greg.ewing at canterbury.ac.nz Thu Jun 21 04:43:00 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Thu, 21 Jun 2007 14:43:00 +1200 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org> References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <07Jun19.105121pdt.57996@synergy1.parc.xerox.com> <435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com> <07Jun19.121327pdt.57996@synergy1.parc.xerox.com> <4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com> <4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org> Message-ID: <4679E5B4.4040507@canterbury.ac.nz> Nicko van Someren wrote: > perhaps the sum() function could be made properly polymorphic > so as to remove one more class of use cases for reduce(). That's unlikely to happen. 
As I remember things, sum() was deliberately restricted to numbers so as not to present an attractive nuisance as an inefficient way to concatenate a list of strings. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From greg.ewing at canterbury.ac.nz Thu Jun 21 05:31:18 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Thu, 21 Jun 2007 15:31:18 +1200 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <07Jun19.105121pdt.57996@synergy1.parc.xerox.com> <4679020A.8020609@gmail.com> Message-ID: <4679F106.2010307@canterbury.ac.nz> Christian Heimes wrote: > IIRC map and filter are a bit slower than list comprehensions But isn't that true only when the function passed is a Python function? Or are LCs faster now even for C functions? -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From showell30 at yahoo.com Thu Jun 21 11:49:46 2007 From: showell30 at yahoo.com (Steve Howell) Date: Thu, 21 Jun 2007 02:49:46 -0700 (PDT) Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <07Jun20.100913pdt."57996"@synergy1.parc.xerox.com> Message-ID: <77903.26658.qm@web33507.mail.mud.yahoo.com> --- Bill Janssen wrote: > [...] It's more important to make things work > consistently than to only > have "one way". "sum" should concatenate strings. > "Sum" should sum stuff. You can't sum strings. It makes no sense in English. You can concatenate strings, or you can join them using a connecting string. Since concatenating is just a degenerate case of joining, it's hard to justify a concat() builtin when you already have ''.join(), but I'd rather have a concat() builtin than an insensible interpretation of sum(). Multiple additions (with "+") mean "sum" in arithmetic, but you can't generalize that to strings and text processing. The "+" operator for any two strings is not about adding--it's about joining/concatenating. So multiple applications of "+" on strings aren't a sum. They're just a longer join/concatenation. Remember also that you can't have "+" operate on a string/integer pair. It's just practicality that Python uses the same punctuation for addition and concatenation. In English it's sensible to have punctuation for addition, so it has "+," but it needs no punctuation for joining/concatenation, so Python had to pick the closest match. From ncoghlan at gmail.com Thu Jun 21 15:40:29 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 21 Jun 2007 23:40:29 +1000 Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <9e804ac0706200813y68a7664dt206e65c3fbb9fcf7@mail.gmail.com> References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <435D7523-7A85-48CE-ACA2-B1722D3EC530@gmail.com> <4FED5BD6-F3B9-4860-A22D-074DE2C9E6C3@gmail.com> <4B6A0165-4E03-421E-BC9F-3D20C5427C70@nicko.org> <46793D3A.8020303@gmail.com> <9e804ac0706200813y68a7664dt206e65c3fbb9fcf7@mail.gmail.com> Message-ID: <467A7FCD.6010506@gmail.com> Thomas Wouters wrote: > > > On 6/20/07, *Nick Coghlan* > wrote: > > Strings are explicitly disallowed (because Guido doesn't want a second > way to spell ''.join(seq), as far as I know): > > > More importantly, because it has positively abysmal performance, just > like the reduce() solution (and, in fact, many reduce solutions to > problems better solved otherwise :-) Like the old input(), backticks and > allowing the mixing of tabs and spaces, while it has uses, the ease and > frequency with which it is misused outweigh the utility enough that it > should not be in such a prominent place. The rejected suggestion which led to the current error message was for sum(seq, '') to call ''.join(seq) behind the scenes to actually do the string concatenation - the performance would have been identical to calling ''.join(seq) directly. You certainly wouldn't want to use the same summing algorithm as is used for mutable sequences - as you say, the performance would be terrible. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From ncoghlan at gmail.com Thu Jun 21 15:45:20 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 21 Jun 2007 23:45:20 +1000 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <200706202358.16087.gareth.mccaughan@pobox.com> References: <46793D3A.8020303@gmail.com> <5DF9AC3D-CD82-4912-8C19-004131106F40@nicko.org> <200706202358.16087.gareth.mccaughan@pobox.com> Message-ID: <467A80F0.7060904@gmail.com> Gareth McCaughan wrote: > That is: if you're writing code that expects sum() to do something > sensible with lists of strings, you'll usually need it to do something > sensible with *empty* lists of strings -- but that isn't possible, > because there's only one empty list and it has to serve as the empty > list of integers too. That is indeed the reason for the explicit start value - sum() needs to know what to return when the supplied iterable is empty. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From jimjjewett at gmail.com Thu Jun 21 18:12:22 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Thu, 21 Jun 2007 12:12:22 -0400 Subject: [Python-3000] canonicalization [was: On PEP 3116: new I/O base classes] Message-ID: Should canonicalization be an extra feature of the Text IO, on par with character encoding? On 6/20/07, Daniel Stutzbach wrote: > On 6/20/07, Bill Janssen wrote: [For the TextIO, as opposed to the raw IO, Bill originally proposed dropping read(n), because character count is not well-defined. Dan objected that not all text has useful line breaks.] > > ... just saying "give me N characters" isn't enough. > > We need to say, "N characters assuming a text > > encoding of M, with a normalization policy of Q, > > and a newline policy of R".
[ Daniel points out that TextIO already handles M and R ] > I'm not sure I 100% understand what you mean by > "normalization policy" (Q). Could you give an example? How many characters are there in ö? If I ask for just one character, do I get only the o, without the diaeresis, or do I get both (since they are linguistically one letter), or does it depend on how some editor happened to store it? Distinguishing strings based on an accident of storage would violate unicode standards. (More precisely, it would be a violation of standards to assume that they are distinguished.) To the extent that you are treating the data as text rather than binary, NFC or NFD normalization should always be appropriate. In practice, binary concerns do intrude even for text data; you may well want to save it back out in the original encoding, without any spurious changes. Proposal: open would default to NFC. import would open source code with NFKC. An explicit None canonicalization would allow round-trips without spurious binary-level changes. -jJ From amcnabb at mcnabbs.org Thu Jun 21 17:45:01 2007 From: amcnabb at mcnabbs.org (Andrew McNabb) Date: Thu, 21 Jun 2007 09:45:01 -0600 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <77903.26658.qm@web33507.mail.mud.yahoo.com> References: <07Jun20.100913pdt."57996"@synergy1.parc.xerox.com> <77903.26658.qm@web33507.mail.mud.yahoo.com> Message-ID: <20070621154501.GA1607@mcnabbs.org> On Thu, Jun 21, 2007 at 02:49:46AM -0700, Steve Howell wrote: > > "Sum" should sum stuff. You can't sum strings. It makes no sense in > English. I think you're technically right, but I frequently find myself using the phrase "add together a list of strings" when it would be more accurate to say "concatenate a list of strings." I can't say I feel bad when I use this terminology. > Multiple additions (with "+") mean "sum" in arithmetic, but you can't > generalize that to strings and text processing. The "+" operator for > any two strings is not about adding--it's about joining/concatenating. > So multiple applications of "+" on strings aren't a sum. They're just > a longer join/concatenation. I guess I don't find the distinction between adding and concatenating as strong as you do. When we write 'a' + 'b', I don't see any problem with saying that we're adding 'a' and 'b', and I don't think there's anything unclear about sum(['a', 'b', 'c']). -- Andrew McNabb http://www.mcnabbs.org/andrew/ PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868 From janssen at parc.com Thu Jun 21 18:19:04 2007 From: janssen at parc.com (Bill Janssen) Date: Thu, 21 Jun 2007 09:19:04 PDT Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <77903.26658.qm@web33507.mail.mud.yahoo.com> References: <77903.26658.qm@web33507.mail.mud.yahoo.com> Message-ID: <07Jun21.091906pdt."57996"@synergy1.parc.xerox.com> > Multiple additions (with "+") mean "sum" in > arithmetic, but you can't generalize that to strings > and text processing. The "+" operator for any two > strings is not about adding--it's about > joining/concatenating. So multiple applications of > "+" on strings aren't a sum. They're just a longer > join/concatenation. Hmmm.
Your argument would be more persuasive if you couldn't do this in Python: >>> a = "abc" + "def" + "ghi" + "jkl" >>> a 'abcdefghijkl' >>> The real problem with "sum", I think, is that the parameter list is ill-conceived (perhaps because it was added before variable length parameter lists were?). It should be sum(*operands) not sum(operands, initialvalue=?) It should amount to "map(+, operands)". Bill From showell30 at yahoo.com Thu Jun 21 19:12:30 2007 From: showell30 at yahoo.com (Steve Howell) Date: Thu, 21 Jun 2007 10:12:30 -0700 (PDT) Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <07Jun21.091906pdt."57996"@synergy1.parc.xerox.com> Message-ID: <380622.43139.qm@web33515.mail.mud.yahoo.com> --- Bill Janssen wrote: > > Multiple additions (with "+") mean "sum" in > > arithmetic, but you can't generalize that to > strings > > and text processing. The "+" operator for any two > > strings is not about adding--it's about > > joining/concatenating. So multiple applications > of > > "+" on strings aren't a sum. They're just a > longer > > join/concatenation. > > Hmmm. Your argument would be more persuasive if you > couldn't do this > in Python: > > >>> a = "abc" + "def" + "ghi" + "jkl" > >>> a > 'abcdefghijkl' > >>> > > The real problem with "sum", I think, is that the > parameter list is > ill-conceived (perhaps because it was added before > variable length > parameter lists were?). It should be > > sum(*operands) > > not > > sum(operands, initialvalue=?) > > It should amount to "map(+, operands)". > I think you were missing my point, which is that sum doesn't and shouldn't necessarily have the same semantics as map(+). "Sum," in both Python and common English usage, is a generalization of arithmetic addition, but it's not a generalization of applying operators that happen to be spelled "+." There's no natural English punctuation for concatenation, and Python's choice of "+" could be called mostly arbitrary (although it's consistent with a few other programming languages.) The following operators can mean concatenation in various programming languages: + . & || Oddly, in English a common way to concatenate words is with the "-" character. It means a hyphen in English, and it's used to create multi-word-thingies, but the operator itself is also a subtraction operator. So you could speciously argue that when you concatenate words in English, you're doing a difference, but under your proposed Python, you'd be doing a sum. From jimjjewett at gmail.com Thu Jun 21 19:18:21 2007 From: jimjjewett at gmail.com (Jim Jewett) Date: Thu, 21 Jun 2007 13:18:21 -0400 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <-405694068883174656@unknownmsgid> References: <77903.26658.qm@web33507.mail.mud.yahoo.com> <-405694068883174656@unknownmsgid> Message-ID: On 6/21/07, Bill Janssen wrote: > The real problem with "sum", I think, is that the parameter list is > ill-conceived (perhaps because it was added before variable length > parameter lists were?). It should be > sum(*operands) > not > sum(operands, initialvalue=?) Is this worth fixing in Python 3, where keyword-only parameters become an option?
sum(*operands, start=0) -jJ From showell30 at yahoo.com Thu Jun 21 19:33:41 2007 From: showell30 at yahoo.com (Steve Howell) Date: Thu, 21 Jun 2007 10:33:41 -0700 (PDT) Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <20070621154501.GA1607@mcnabbs.org> Message-ID: <141120.84388.qm@web33510.mail.mud.yahoo.com> --- Andrew McNabb wrote: > > I think you're technically right, but I frequently > find myself using the > phrase "add together a list of strings" when it > would be more accurate > to say "concatenate a list of strings." I can't say > I feel bad when I > use this terminology. > Nope, and I wouldn't throw the grammar book at you if you did. But if you said a compound word is a "sum" of smaller words, I might look at you a little funny. :) > > I guess I don't find the distinction between adding > and concatenating as > strong as you do. > Fair enough. > When we write 'a' + 'b', I don't see any problem > with saying that we're > adding 'a' and 'b', and I don't think there's > anything unclear about > sum(['a', 'b', 'c']). I think you're right that most people would guess that the above returns 'abc', so we're not in major disagreement. But I'm approaching usability from another direction, I guess. If I wanted to join a series of strings together, sum() wouldn't be the most naturally occurring method to me. To me it would be concat() or join(). I think ''.join(...) in Python is a tiny wart, since it's pretty unlikely for a newcomer to guess the syntax, but I'm not sure they'd guess sum() either. The one advantage of ''.join() is that you can at least deduce it, via introspection, by doing dir('foo'). My other concern with sum() is just the common pitfall that you do sum(line_of_numbers.split(',')) and get '35' when you intended to write code to get 8. I'd rather have that fail obviously than subtly. From janssen at parc.com Thu Jun 21 19:45:22 2007 From: janssen at parc.com (Bill Janssen) Date: Thu, 21 Jun 2007 10:45:22 PDT Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <141120.84388.qm@web33510.mail.mud.yahoo.com> References: <141120.84388.qm@web33510.mail.mud.yahoo.com> Message-ID: <07Jun21.104531pdt."57996"@synergy1.parc.xerox.com> > My other concern with sum() is just the common pitfall > that you do sum(line_of_numbers.split(',')) and get > '35' when you intended to write code to get 8. I'd > rather have that fail obviously than subtly. Common pitfall? I doubt it. Possible pitfall? Sure. Bill From jjb5 at cornell.edu Thu Jun 21 19:32:42 2007 From: jjb5 at cornell.edu (Joel Bender) Date: Thu, 21 Jun 2007 13:32:42 -0400 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <07Jun21.091906pdt."57996"@synergy1.parc.xerox.com> References: <77903.26658.qm@web33507.mail.mud.yahoo.com> <07Jun21.091906pdt."57996"@synergy1.parc.xerox.com> Message-ID: <467AB63A.7050505@cornell.edu> > It should be > > sum(*operands) > > not > > sum(operands, initialvalue=?) > > It should amount to "map(+, operands)". Or, to be pedantic, this: reduce(lambda x, y: x.__add__(y), operands) Joel From janssen at parc.com Thu Jun 21 19:51:32 2007 From: janssen at parc.com (Bill Janssen) Date: Thu, 21 Jun 2007 10:51:32 PDT Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <380622.43139.qm@web33515.mail.mud.yahoo.com> References: <380622.43139.qm@web33515.mail.mud.yahoo.com> Message-ID: <07Jun21.105136pdt."57996"@synergy1.parc.xerox.com> > I think you were missing my point, which is that sum > doesn't and shouldn't necessarily have the same > semantics as map(+). It's not that I don't understand your argument, Steve. I just don't find it effective. If we are going to distinguish between "arithmetic addition" and "concatenation", we should find another operator. As long as we *don't* do that, my personal preference would be to either remove "sum" completely, or have it work in a regular fashion, depending on which data type is passed to it, either as arithmetic addition or as sequence concatenation. Bill From showell30 at yahoo.com Thu Jun 21 20:04:37 2007 From: showell30 at yahoo.com (Steve Howell) Date: Thu, 21 Jun 2007 11:04:37 -0700 (PDT) Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <07Jun21.104531pdt."57996"@synergy1.parc.xerox.com> Message-ID: <794697.77794.qm@web33514.mail.mud.yahoo.com> --- Bill Janssen wrote: > > My other concern with sum() is just the common > pitfall > > that you do sum(line_of_numbers.split(',')) and > get > > '35' when you intended to write code to get 8. > I'd > > rather have that fail obviously than subtly. > > Common pitfall? I doubt it. Possible pitfall? > Sure. > It's a common mistake, for me anyway, to forget to cast something that I just read from a file into an integer before performing arithmetic on it. But it's usually not a pitfall now. It's just a quick exception that I can quickly diagnose and fix. Try this code under Python 2: name, amount, tip = 'Bill,20,1.5'.split(',') print name + ' paid ' + sum(amount,tip) It will throw an obvious exception. Obviously, this is a pitfall even under current Python: name, amount, tip = 'Bill,20,1.5'.split(',') print name + ' paid ' + amount + tip So then you have three choices on how to improve Python, one of which you sort of alluded to in your other reply: 1) Eliminate the current pitfall by introducing another operator for concatenation. 2) Keep sum() as it is, but make the error message more clear when somebody uses it on strings. Example: Sum() cannot be used to join strings. Perhaps you meant to use ''.join(). 3) Make sum() have a consistent pitfall with the "+" operator, even though English/Python has a lot more latitude with words than punctuation when it comes to disambiguating concepts. IMHO #1 is too extreme, #2 is the best option, and #3 doesn't really solve any practical problems. The arguments for #3 seem to come from consistency/purity vantages, which are fine, but not as important as usability. I concede this entire argument is based on the perhaps shaky premise that *most* people never forget to turn strings into integers, but I fully admit my fallibility in this regard. From amcnabb at mcnabbs.org Thu Jun 21 20:46:18 2007 From: amcnabb at mcnabbs.org (Andrew McNabb) Date: Thu, 21 Jun 2007 12:46:18 -0600 Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <141120.84388.qm@web33510.mail.mud.yahoo.com> References: <20070621154501.GA1607@mcnabbs.org> <141120.84388.qm@web33510.mail.mud.yahoo.com> Message-ID: <20070621184618.GB1607@mcnabbs.org> On Thu, Jun 21, 2007 at 10:33:41AM -0700, Steve Howell wrote: > > Nope, and I wouldn't throw the grammar book at you if you did. But if > you said a compound word is a "sum" of smaller words, I might look at > you a little funny. :) It wouldn't be the first time someone looked at me a little funny. :) > But I'm approaching usability from another direction, I guess. If I > wanted to join a series of strings together, sum() wouldn't be the > most naturally occurring method to me. I agree that on its own, it's not the most natural method. However, once you've already used the + operator to join two strings, you are much more likely to consider sum() for concatenating a list of strings. I remember being confused the first time I tried it and found that it didn't work. In the end, though, it's really not that big a deal. -- Andrew McNabb http://www.mcnabbs.org/andrew/ PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868 From showell30 at yahoo.com Thu Jun 21 20:55:46 2007 From: showell30 at yahoo.com (Steve Howell) Date: Thu, 21 Jun 2007 11:55:46 -0700 (PDT) Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <20070621184618.GB1607@mcnabbs.org> Message-ID: <582524.76259.qm@web33515.mail.mud.yahoo.com> --- Andrew McNabb wrote: > > I agree that on its own, it's not the most natural > method. However, > once you've already used the + operator to join two > strings, you are > much more likely to consider sum() for concatenating > a list of strings. > I remember being confused the first time I tried it > and found that it > didn't work. > > In the end, though, it's really not that big a deal. > Sure, I agree with that, although I think we are right to quibble about this, because both problems, summing numbers and summing/joining strings, are pretty darn common, so any brainstorm to make those more natural under the language is worthy of consideration. I've had ''.join() burned into my brain pretty well by now, so I think I'm unfairly biased, and this is really a case where newbies have more perspectives than most people on this list. But I also hate to have a language be *too* driven by newbie concerns, because I think Python also needs to be appreciated over the long haul. From rrr at ronadam.com Thu Jun 21 21:09:07 2007 From: rrr at ronadam.com (Ron Adam) Date: Thu, 21 Jun 2007 14:09:07 -0500 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <07Jun21.105136pdt."57996"@synergy1.parc.xerox.com> References: <380622.43139.qm@web33515.mail.mud.yahoo.com> <07Jun21.105136pdt."57996"@synergy1.parc.xerox.com> Message-ID: <467ACCD3.6070608@ronadam.com> Bill Janssen wrote: >> I think you were missing my point, which is that sum >> doesn't and shouldn't necessarily have the same >> semantics as map(+). > > It's not that I don't understand your argument, Steve.
> > I just don't find it effective. If we are going to distinguish > between "arithmetic addition" and "concatenation", we should find > another operator. > > As long as we *don't* do that, my personal preference would be to > either remove "sum" completely, or have it work in a regular fashion, > depending on which data type is passed to it, either as arithmetic > addition or as sequence concatenation. From the standpoint of readability and being able to know what a particular section of code does I believe it is better to have limits that make sense in cases where the behavior of a function may change based on what the data is. My preference would be to limit sum() to value addition only, and never do concatenation. For bytes types, it could be the summing of bytes. This could be useful for image data. For all non numeric types it would generate an exception. And if a general function that joins and/or extends is desired, a separate function possibly called merge() might be better. Then sum() would always do numerical addition and merge() would always do concatenation of objects. That makes the code much easier to read 6 months from now with a lower chance of having subtle bugs. The main thing for me is how quickly I can look at a block of code and determine what it does with a minimum of back tracking and data tracing. Cheers, Ron From showell30 at yahoo.com Thu Jun 21 21:15:43 2007 From: showell30 at yahoo.com (Steve Howell) Date: Thu, 21 Jun 2007 12:15:43 -0700 (PDT) Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <467ACCD3.6070608@ronadam.com> Message-ID: <121087.44934.qm@web33505.mail.mud.yahoo.com> --- Ron Adam wrote: > > > Bill Janssen wrote: > >> I think you were missing my point, which is that > sum > >> doesn't and shouldn't necessarily have the same > >> semantics as map(+). > > > > It's not that I don't understand your argument, > Steve. > > > > I just don't find it effective. If we are going > to distinguish > > between "arithmetic addition" and "concatenation", > we should find > > another operator. > > > > As long as we *don't* do that, my personal > preference would be to > > either remove "sum" completely, or have it work in > a regular fashion, > > depending on which data type is passed to it, > either as arithmetic > > addition or as sequence concatenation. > > From the standpoint of readability and being able > to know what a > particular section of code does I believe it is > better to have limits that > make sense in cases where the behavior of a function > may change based on > what the data is. > > My preference would be to limit sum() to value > addition only, and never do > concatenation. For bytes types, it could be the > summing of bytes. This > could be useful for image data. For all non numeric > types it would > generate an exception. > > And if a general function that joins and/or extends > is desired, a separate > function possibly called merge() might be better. > Then sum() would always > do numerical addition and merge() would always do > concatenation of objects. > That makes the code much easier to read 6 months > from now with a lower > chance of having subtle bugs. > > The main thing for me is how quickly I can look at a > block of code and > determine what it does with a minimum of back > tracking and data tracing. > Ron, I obviously agree with your overriding points 100%, and thank you for expressing them better than I did, but I would object to the name merge(). 
"Merge" to me has the semantics of blending strings, not joining them. English already has two perfectly well understood words for this concept: "abc", "def" -> "abcdef" The two English words are "join" and "concatenate." Python wisely chose the shorter word, although I can see arguments for the longer word, as "join" probably is a tiny bit more semantically overloaded than "concatenate." The best slang word for the above is "mush." ____________________________________________________________________________________ Now that's room service! Choose from over 150,000 hotels in 45,000 destinations on Yahoo! Travel to find your fit. http://farechase.yahoo.com/promo-generic-14795097 From jjb5 at cornell.edu Thu Jun 21 21:59:03 2007 From: jjb5 at cornell.edu (Joel Bender) Date: Thu, 21 Jun 2007 15:59:03 -0400 Subject: [Python-3000] join vs. add [was: Python 3000 Status Update (Long!)] In-Reply-To: <467ACCD3.6070608@ronadam.com> References: <380622.43139.qm@web33515.mail.mud.yahoo.com> <07Jun21.105136pdt."57996"@synergy1.parc.xerox.com> <467ACCD3.6070608@ronadam.com> Message-ID: <467AD887.2000601@cornell.edu> > My preference would be to limit sum() to value addition only, and never do > concatenation. I would be happy with that, provided there was join function and operator: >>> join = lambda x: reduce(lambda y, z: y.__join__(z), x) I think this is clearer than sum(): >>> join(['a', 'b', 'c']) 'abc' It wouldn't interfere with ''.join(), and ''.__add__() could be redirected to ''.__join__(). > For all non numeric types it would generate an exception. How about generating an exception where __add__ isn't defined, so it would work on MyFunkyVector type. I could join my vectors together as well, since in MyNonEucledeanSpace, it doesn't mean the same thing as "add". Joel From fdrake at acm.org Thu Jun 21 22:14:29 2007 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Thu, 21 Jun 2007 16:14:29 -0400 Subject: [Python-3000] join vs. add [was: Python 3000 Status Update (Long!)] In-Reply-To: <467AD887.2000601@cornell.edu> References: <380622.43139.qm@web33515.mail.mud.yahoo.com> <467ACCD3.6070608@ronadam.com> <467AD887.2000601@cornell.edu> Message-ID: <200706211614.29277.fdrake@acm.org> On Thursday 21 June 2007, Joel Bender wrote: > I think this is clearer than sum(): > >>> join(['a', 'b', 'c']) > 'abc' > > It wouldn't interfere with ''.join(), and ''.__add__() could be > redirected to ''.__join__(). And then int.__join__ could be defined in confusing ways, too: >>> join([4, 2]) 42 There's something appealing about that specific example. ;-) -Fred -- Fred L. Drake, Jr. From janssen at parc.com Thu Jun 21 22:21:10 2007 From: janssen at parc.com (Bill Janssen) Date: Thu, 21 Jun 2007 13:21:10 PDT Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <467AB63A.7050505@cornell.edu> References: <77903.26658.qm@web33507.mail.mud.yahoo.com> <07Jun21.091906pdt."57996"@synergy1.parc.xerox.com> <467AB63A.7050505@cornell.edu> Message-ID: <07Jun21.132117pdt."57996"@synergy1.parc.xerox.com> > > It should amount to "map(+, operands)". > > Or, to be pedantic, this: > > reduce(lambda x, y: x.__add__(y), operands) Don't you mean: reduce(lambda x, y: x.__add__(y), operands[1:], operands[0]) Bill From rrr at ronadam.com Fri Jun 22 01:18:04 2007 From: rrr at ronadam.com (Ron Adam) Date: Thu, 21 Jun 2007 18:18:04 -0500 Subject: [Python-3000] Python 3000 Status Update (Long!) 
In-Reply-To: <121087.44934.qm@web33505.mail.mud.yahoo.com> References: <121087.44934.qm@web33505.mail.mud.yahoo.com> Message-ID: <467B072C.7000906@ronadam.com> Steve Howell wrote: > --- Ron Adam wrote: > >> >> Bill Janssen wrote: >>>> I think you were missing my point, which is that >> sum >>>> doesn't and shouldn't necessarily have the same >>>> semantics as map(+). >>> It's not that I don't understand your argument, >> Steve. >>> I just don't find it effective. If we are going >> to distinguish >>> between "arithmetic addition" and "concatenation", >> we should find >>> another operator. >>> >>> As long as we *don't* do that, my personal >> preference would be to >>> either remove "sum" completely, or have it work in >> a regular fashion, >>> depending on which data type is passed to it, >> either as arithmetic >>> addition or as sequence concatenation. >> From the standpoint of readability and being able >> to know what a >> particular section of code does I believe it is >> better to have limits that >> make sense in cases where the behavior of a function >> may change based on >> what the data is. >> >> My preference would be to limit sum() to value >> addition only, and never do >> concatenation. For bytes types, it could be the >> summing of bytes. This >> could be useful for image data. For all non numeric >> types it would >> generate an exception. >> >> And if a general function that joins and/or extends >> is desired, a separate >> function possibly called merge() might be better. >> Then sum() would always >> do numerical addition and merge() would always do >> concatenation of objects. >> That makes the code much easier to read 6 months >> from now with a lower >> chance of having subtle bugs. >> >> The main thing for me is how quickly I can look at a >> block of code and >> determine what it does with a minimum of back >> tracking and data tracing. >> > > Ron, I obviously agree with your overriding points > 100%, and thank you for expressing them better than I > did, but I would object to the name merge(). "Merge" > to me has the semantics of blending strings, not > joining them. English already has two perfectly well > understood words for this concept: > > "abc", "def" -> "abcdef" > > The two English words are "join" and "concatenate." > Python wisely chose the shorter word, although I can > see arguments for the longer word, as "join" probably > is a tiny bit more semantically overloaded than > "concatenate." > > The best slang word for the above is "mush." Yes, join is the better choice for strings only, but I think the discussion was also concerned with joining sequences of other types as well. list.join() or tuple.join() doesn't work. For joining sequences we have... str.join() list.extend() # but returns None Sets have an have a .union() method, but it only works on two sets at a time. There is no equivalent of .extend() for tuples. There is an .__add__() method. A join() function or generator might be able to unify these operations and remove the need for sum() to do this. It can't be a method as it would break the general rule that an object should not mutate itself and also return it self. As for the numeric cases... There are no int or float methods to get a single value from a sequence of values. So we need to use a function or some other way of doing it. So sum is needed for this. (well, it's nice to have.) 
Currently our choices are: sum(seq) reduce(lambda x, y: x+y, seq) # not limited to addition Summing items across sequences of same length might be a sumitems() function. But then we are getting into numeric territory. The more complex uses of these are probably just as well done with a for loop. Cheers, Ron From guido at python.org Fri Jun 22 02:39:34 2007 From: guido at python.org (Guido van Rossum) Date: Thu, 21 Jun 2007 17:39:34 -0700 Subject: [Python-3000] On PEP 3116: new I/O base classes In-Reply-To: <2614969285506109322@unknownmsgid> References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net> <2614969285506109322@unknownmsgid> Message-ID: On 6/20/07, Bill Janssen wrote: > > > TextIOBase: this seems an odd mix of high-level and low-level. I'd > > > remove "seek", "tell", "read", and "write". Remember that in Python, > > > mixins actually work, so that you can provide a file object that > > > combines several different I/O classes. > > > > Huh? All those operations you want to remove are entirely necessary > > for a number of applications. I'm not sure what you meant about mixins? > > I meant that TextIOBase should just provide the operations for text. > The other operations would be supported, when appropriate, by mixing > in an appropriate class that provides them. Remember that this is > a PEP about base classes. Um, it's not meant to be just about base classes -- it's also meant to be about the actual implementations -- both abstract and concrete classes will be importable from the same module, 'io'. Have you checked out io.py in the p3yk branch? > > It doesn't work? Why not? Of course read() should take the number of > > characters as a parameter, not number of bytes. > > Unfortunately, files contain encodings of characters, and those > encodings may at times be mapped to multiple equivalent strings, at > least with respect to Unicode, the target for Python-3000. The > standard Unicode support for Python-3000 seems to be settling on > having code-point representations of those strings exposed to the > application, which means that any specific automatic normalization is > precluded. So any particular "readchars(1)" operation may validly > return different strings even if operating on the same underlying > file, and may require a different number of read operations to read > the same underlying bytes. That is, I believe that the string and/or > file operations are not well-specified enough to guarantee that this > won't happen. This is the same situation we have today, which means > that the only real way to read Unicode strings from a file will be the > same as today, that is, read raw bytes from a file, decode them and > normalize them in some specific way, and then see what string you wind > up with. You could probably fix this in the PEP by specifying a > specific Unicode normalization to use when returning strings. I don't understand exactly what you're saying, but here's the semantic model from which I've been operating. A file contains a sequence of bytes. If you read it all in one fell swoop, and then decoded it to Unicode (using a specific encoding), you'd get a specific text string. This is a sequence of code units. (Whether they are valid code points or characters I don't think we can guarantee -- I use the GIGO principle.) *Conceptually*, read(n) simply returns the next n code units; readline() is equivalent to read(n) for some n, whose value is determined by looking ahead until the first \n is found. 
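In toy code, setting aside incremental decoding for a moment (a sketch of the conceptual model only, not the io.py implementation):

    class ConceptualTextFile:
        def __init__(self, data: bytes, encoding: str):
            self._text = data.decode(encoding)  # the whole code unit sequence
            self._pos = 0
        def read(self, n):
            chunk = self._text[self._pos:self._pos + n]  # next n code units
            self._pos += len(chunk)
            return chunk
        def readline(self):
            end = self._text.find('\n', self._pos)
            if end < 0:
                return self.read(len(self._text) - self._pos)
            return self.read(end - self._pos + 1)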
Universal newlines collapse \r\n into \n and turn lone \r into \n (or whatever algorithm is deemed right, I'm not sure the latter is still needed) *before* we reach the sequence of code points that read() and readline() see. Files are all about making this conceptual model efficient even if the file doesn't fit in memory. We have incremental codecs which make this possible. (We always assume the file doesn't change while we're reading it; if it does, certain bets are off.) In my mind, seek() and tell() should work like getpos() and setpos() in modern C stdio -- tell() returns a "cookie" whose only use is that you can later pass it to seek() and it will reset the position in the sequence of code units to where it was when tell() was called. For many encodings, in practice, seek() and tell() can just use byte positions since the boundaries between code points always fall on byte boundaries (but not the other way around). For other encodings, the implementation currently in io.py encodes the incremental codec state in the (very) high bits of the cookie (this is convenient since we have arbitrary precision integers). Relative seeks (except for a few end cases) are not supported for text files. > > > feel the need. Stick to just "readline" and "writeline" for text I/O. > > > > Ah, not everyone dealing with text is dealing with line-delimited > > text, you know... > > It's really the only difference between text and non-text. Again, I don't quite follow this. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From greg.ewing at canterbury.ac.nz Fri Jun 22 03:27:34 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Fri, 22 Jun 2007 13:27:34 +1200 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <20070621154501.GA1607@mcnabbs.org> References: <07Jun20.100913pdt.57996@synergy1.parc.xerox.com> <77903.26658.qm@web33507.mail.mud.yahoo.com> <20070621154501.GA1607@mcnabbs.org> Message-ID: <467B2586.4050901@canterbury.ac.nz> Andrew McNabb wrote: > I think you're technically right, but I frequently find myself using the > phrase "add together a list of strings" when it would be more accurate > to say "concatenate a list of strings." The word "add" has a wider connotation in English than "sum". Consider the following two sentences: I put a sandwich and an apple in my lunchbox, then I added a banana. I put the sum of a sandwich, an apple and a banana in my lunchbox. -- Greg From greg.ewing at canterbury.ac.nz Fri Jun 22 03:34:01 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Fri, 22 Jun 2007 13:34:01 +1200 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <07Jun21.091906pdt.57996@synergy1.parc.xerox.com> References: <77903.26658.qm@web33507.mail.mud.yahoo.com> <07Jun21.091906pdt.57996@synergy1.parc.xerox.com> Message-ID: <467B2709.5050808@canterbury.ac.nz> Bill Janssen wrote: > It should be > > sum(*operands) That would incur copying of the sequence. It would be justifiable only if the vast majority of use cases involved passing the operands as separate arguments, which I don't think is true. 
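To make the copying concrete -- a hypothetical star-args version has to materialize every element as a positional argument before any adding starts:

    def sum_star(*operands, start=0):   # hypothetical signature
        total = start
        for x in operands:
            total += x
        return total

    values = list(range(1000000))
    sum_star(*values)   # first copies a million references into an argument tuple
    sum(values)         # just iterates over the existing list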
-- Greg From daniel at stutzbachenterprises.com Fri Jun 22 05:40:38 2007 From: daniel at stutzbachenterprises.com (Daniel Stutzbach) Date: Thu, 21 Jun 2007 22:40:38 -0500 Subject: [Python-3000] On PEP 3116: new I/O base classes In-Reply-To: References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net> <2614969285506109322@unknownmsgid> Message-ID: On 6/21/07, Guido van Rossum wrote: > In my mind, seek() and tell() should work like getpos() and setpos() > in modern C stdio -- tell() returns a "cookie" whose only use is that > you can later pass it to seek() and it will reset the position in the > sequence of code units to where it was when tell() was called. For > many encodings, in practice, seek() and tell() can just use byte > positions since the boundaries between code points always fall on byte > boundaries (but not the other way around). For other encodings, the > implementation currently in io.py encodes the incremental codec state > in the (very) high bits of the cookie (this is convenient since we > have arbitrary precision integers). If the cookie is meant to be opaque to the caller, is there a reason that the cookie must be an integer? Specifying the return type as opaque might also reduce the temptation to perform arithmetic on them, which will work for some codecs (ASCII), but break later in odd ways for others. -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises LLC From guido at python.org Fri Jun 22 07:43:37 2007 From: guido at python.org (Guido van Rossum) Date: Thu, 21 Jun 2007 22:43:37 -0700 Subject: [Python-3000] On PEP 3116: new I/O base classes In-Reply-To: References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net> <2614969285506109322@unknownmsgid> Message-ID: On 6/21/07, Daniel Stutzbach wrote: > On 6/21/07, Guido van Rossum wrote: > > In my mind, seek() and tell() should work like getpos() and setpos() > > in modern C stdio -- tell() returns a "cookie" whose only use is that > > you can later pass it to seek() and it will reset the position in the > > sequence of code units to where it was when tell() was called. For > > many encodings, in practice, seek() and tell() can just use byte > > positions since the boundaries between code points always fall on byte > > boundaries (but not the other way around). For other encodings, the > > implementation currently in io.py encodes the incremental codec state > > in the (very) high bits of the cookie (this is convenient since we > > have arbitrary precision integers). > > If the cookie is meant to be opaque to the caller, is there a reason > that the cookie must be an integer? Yes, so the API for seek() and tell() can be the same for binary and text files. It also makes it easier to persist cookies. > Specifying the return type as opaque might also reduce the temptation > to perform arithmetic on them, which will work for some codecs > (ASCII), but break later in odd ways for others. I actually like the "open kimono" approach where users can work around the system if they really need to. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From guido at python.org Fri Jun 22 07:54:23 2007 From: guido at python.org (Guido van Rossum) Date: Thu, 21 Jun 2007 22:54:23 -0700 Subject: [Python-3000] canonicalization [was: On PEP 3116: new I/O base classes] In-Reply-To: References: Message-ID: On 6/21/07, Jim Jewett wrote: > Should canonicalization be an extra feature of the Text IO, on > par with character encoding?
> > On 6/20/07, Daniel Stutzbach wrote: > > On 6/20/07, Bill Janssen wrote: > > [For the TextIO, as opposed to the raw IO, Bill originally proposed > dropping read(n), because character count is not well-defined. Dan > objected that not all text has useful line breaks.] > > > > ... just saying "give me N characters" isn't enough. > > > We need to say, "N characters assuming a text > > > encoding of M, with a normalization policy of Q, > > > and a newline policy of R". > > [ Daniel points out that TextIO already handles M and R ] > > > I'm not sure I 100% understand what you mean by > > "normalization policy" (Q). Could you give an example? > > How many characters are there in ö? > > If I ask for just one character, do I get only the o, without the > diaeresis, or do I get both (since they are linguistically one > letter), or does it depend on how some editor happened to store it? It should get you the next code unit as it comes out of the incremental codec. (Did you see my semantic model I described in a different thread?) > Distinguishing strings based on an accident of storage would violate > unicode standards. (More precisely, it would be a violation of > standards to assume that they are distinguished.) I don't give a damn about this requirement of the Unicode standard. At least, I don't think Python should enforce it at the level of the str data type, and that includes str objects returned by the I/O library. > To the extent that you are treating the data as text rather than > binary, NFC or NFD normalization should always be appropriate. > > In practice, binary concerns do intrude even for text data; you may > well want to save it back out in the original encoding, without any > spurious changes. > > Proposal: > > open would default to NFC. > > import would open source code with NFKC. > > An explicit None canonicalization would allow round-trips without > spurious binary-level changes. Counter-proposal: normalization is provided as library functionality. Applications are responsible for normalizing data when they need it to be normalized and they can't be sure that it isn't already normalized. The source parser used by import and a few other places is an "application" in this sense and can certainly apply whatever normalization is required. Have we agreed on the level of normalization for source code yet? I'm pretty sure we have agreed on *when* it happens, i.e. (logically) before the lexer starts scanning the source code. I would not be against an additional optional layer in the I/O stack that applies normalization. We could even have an optional parameter to open() to push this onto the stack. But I don't think it should be the default. What is the status of normalization in Java? Does Java source code get normalized before it is parsed? What if \u.... is used? Do the Java I/O library classes normalize text? -- --Guido van Rossum (home page: http://www.python.org/~guido/) From martin at v.loewis.de Fri Jun 22 08:45:27 2007 From: martin at v.loewis.de (Martin v. Löwis) Date: Fri, 22 Jun 2007 08:45:27 +0200 Subject: [Python-3000] canonicalization [was: On PEP 3116: new I/O base classes] In-Reply-To: References: Message-ID: <467B7007.1040503@v.loewis.de> > Counter-proposal: normalization is provided as library functionality. > Applications are responsible for normalizing data when they need it > to be normalized and they can't be sure that it isn't already > normalized.
> The source parser used by import and a few other places is > an "application" in this sense and can certainly apply whatever > normalization is required. Have we agreed on the level of > normalization for source code yet? I'm pretty sure we have agreed on > *when* it happens, i.e. (logically) before the lexer starts scanning > the source code. That isn't actually my view: I would apply normalization *only* to identifiers, i.e. leave string literals unmodified. If people would rather see normalization applied to the entire input, that would be an option, of course (although perhaps more expensive to implement, as you need to perform it on all source, even if that source turns out to be ASCII only). > What is the status of normalization in Java? Does Java source code get > normalized before it is parsed? The JLS is silent on that issue, so I think the answer is "no". A quick test (see attached file) shows that it doesn't: i.e. it reports an error "cannot find symbol" even though the symbol would be defined under NFC (or NFD). > What if \u.... is used? It just gets inserted as-is. > Do the Java I/O library classes normalize text? The java.io.InputStreamReader doesn't, see attached code. It appears that Java JRE doesn't support normalization at all until Java 6, where you can use java.text.Normalizer. Before, this class was in sun.text.Normalizer, and (apparently) only used for URI (normalizing to NFC), collation (performing NFD on request), and regular expressions (likewise). Apparently, Sun doesn't consider Unicode normalization as an issue. Regards, Martin From ncoghlan at gmail.com Fri Jun 22 11:11:14 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 22 Jun 2007 19:11:14 +1000 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <07Jun21.132117pdt."57996"@synergy1.parc.xerox.com> References: <77903.26658.qm@web33507.mail.mud.yahoo.com> <07Jun21.091906pdt."57996"@synergy1.parc.xerox.com> <467AB63A.7050505@cornell.edu> <07Jun21.132117pdt."57996"@synergy1.parc.xerox.com> Message-ID: <467B9232.2090002@gmail.com> Bill Janssen wrote: >>> It should amount to "map(+, operands)". >> Or, to be pedantic, this: >> >> reduce(lambda x, y: x.__add__(y), operands) > > Don't you mean: > > reduce(lambda x, y: x.__add__(y), operands[1:], operands[0]) This is a nice illustration of a fairly significant issue with the usability of reduce: two attempts to rewrite sum() using reduce(), and both of them are buggy. Neither of the solutions above can correctly handle an empty sequence: .>>> operands = [] .>>> reduce(lambda x, y: x.__add__(y), operands) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: reduce() of empty sequence with no initial value .>>> reduce(lambda x, y: x.__add__(y), operands[1:], operands[0]) Traceback (most recent call last): File "<stdin>", line 1, in <module> IndexError: list index out of range Cheers, Nick.
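P.S. Both attempts work once reduce() is given an explicit initial value, which is exactly the job sum()'s start argument does:

.>>> import operator
.>>> reduce(operator.add, operands, 0)
0
.>>> reduce(operator.add, [1, 2, 3], 0)
6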
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From aurelien.campeas at logilab.fr Fri Jun 22 11:46:20 2007 From: aurelien.campeas at logilab.fr (Aurélien Campéas) Date: Fri, 22 Jun 2007 11:46:20 +0200 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <467B9232.2090002@gmail.com> References: <77903.26658.qm@web33507.mail.mud.yahoo.com> <07Jun21.091906pdt."57996"@synergy1.parc.xerox.com> <467AB63A.7050505@cornell.edu> <07Jun21.132117pdt."57996"@synergy1.parc.xerox.com> <467B9232.2090002@gmail.com> Message-ID: <20070622094620.GA25641@crater.logilab.fr> On Fri, Jun 22, 2007 at 07:11:14PM +1000, Nick Coghlan wrote: > Bill Janssen wrote: > >>> It should amount to "map(+, operands)". > >> Or, to be pedantic, this: > >> > >> reduce(lambda x, y: x.__add__(y), operands) > > > > Don't you mean: > > > > reduce(lambda x, y: x.__add__(y), operands[1:], operands[0]) > > This is a nice illustration of a fairly significant issue with the > usability of reduce: two attempts to rewrite sum() using reduce(), and > both of them are buggy. Neither of the solutions above can correctly Maybe the specification/documentation is missing some phrasing like that : "The function must also be able to accept no arguments." (taken from another language spec.) ? Better fix the documentation than blame reduce. Of course, reduce was taken from Lisp, where lambda is not castrated and thus allows one to write the no-argument case with more ease. Castrated lambdas limit the usefulness of reduce *in python*, not in general. Regards, Aurélien. > handle an empty sequence: > > .>>> operands = [] > .>>> reduce(lambda x, y: x.__add__(y), operands) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > TypeError: reduce() of empty sequence with no initial value > .>>> reduce(lambda x, y: x.__add__(y), operands[1:], operands[0]) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > IndexError: list index out of range > > Cheers, > Nick. > > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia > --------------------------------------------------------------- > http://www.boredomandlaziness.org From showell30 at yahoo.com Fri Jun 22 12:24:08 2007 From: showell30 at yahoo.com (Steve Howell) Date: Fri, 22 Jun 2007 03:24:08 -0700 (PDT) Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <467B2586.4050901@canterbury.ac.nz> Message-ID: <525768.82779.qm@web33502.mail.mud.yahoo.com> --- Greg Ewing wrote: > The word "add" has a wider connotation in English > than > "sum". [...] Just to elaborate on the point... And, likewise, symbolic operators have a wider connotation in programming languages than do keywords. Keywords can, and should, be more specifically spelled for a task than punctuation characters. From ncoghlan at gmail.com Fri Jun 22 14:12:08 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 22 Jun 2007 22:12:08 +1000 Subject: [Python-3000] On PEP 3116: new I/O base classes In-Reply-To: <07Jun20.190105pdt."57996"@synergy1.parc.xerox.com> References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net> <6002181751375776921@unknownmsgid> <-665892861201335771@unknownmsgid> <07Jun20.190105pdt."57996"@synergy1.parc.xerox.com> Message-ID: <467BBC98.1070201@gmail.com> Bill Janssen wrote: >> I'm not sure I 100% understand what you mean by "normalization policy" >> (Q). Could you give an example? > > I was speaking of the 4 different normalization forms for Unicode, > which can produce different code-point sequences. Since "strings" in > Python-3000 aren't really strings, but instead are immutable > code-point sequences, this means that any byte-to-string > transformation which doesn't specify this can produce different > strings from the same bytes without violating its constraints. A given codec won't randomly decide to change its normalisation policy, though - so when you pick the codec, you're picking the normalisation as well. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From ncoghlan at gmail.com Fri Jun 22 14:18:10 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 22 Jun 2007 22:18:10 +1000 Subject: [Python-3000] On PEP 3116: new I/O base classes In-Reply-To: References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net> <2614969285506109322@unknownmsgid> Message-ID: <467BBE02.9030103@gmail.com> Daniel Stutzbach wrote: > If the cookie is meant to be opaque to the caller, is there a reason > that the cookie must be an integer? > > Specifying the return type as opaque might also reduce the temptation > to perform arithmetic on them, which will work for some codecs > (ASCII), but break later in odd ways for others. seek() & tell() are already documented as using opaque cookies for text files (quote is from the documentation of file.seek()): If the file is opened in text mode (without 'b'), only offsets returned by tell() are legal. Use of other offsets causes undefined behavior. (Seeking to an arbitrary byte index on a file with DOS line endings may put you in the middle of a \r\n sequence, which may cause weirdness) Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From stephen at xemacs.org Fri Jun 22 16:15:14 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 22 Jun 2007 23:15:14 +0900 Subject: [Python-3000] canonicalization [was: On PEP 3116: new I/O base classes] In-Reply-To: References: Message-ID: <87645g5b4d.fsf@uwakimon.sk.tsukuba.ac.jp> Guido van Rossum writes: > > If I ask for just one character, do I get only the o, without the > > diaeresis, or do I get both (since they are linguistically one > > letter), or does it depend on how some editor happened to store it? > > It should get you the next code unit as it comes out of the > incremental codec. (Did you see my semantic model I described in a > different thread?) I don't like this, but since that's the way it's gonna be ... > > Distinguishing strings based on an accident of storage would violate > > unicode standards.
> > standards to assume that they are distinguished.)
>
> I don't give a damn about this requirement of the Unicode standard.

... this requirement does not apply to the Python str type as you have
described it.

I think at this stage we're asking for trouble to have any
normalization by default, even in the TextIO module. str is not text,
it's an array of code units. str is going to be used to implement
codecs, I/O buffers, all kinds of things that don't necessarily have
Unicode text semantics. Unless the Python language itself defines the
semantics of the array of code units, EIBTI. This accords with
Martin's statement about identifiers being the only thing he proposed
normalizing.

Even if we know a user wants text, I don't see any state of the art
that allows us to guess which normalization will be most useful to
him. I think for identifiers, NFKC is almost a no-brainer. But for
strings it is not at all obvious. NFC violates useful string
invariants such as len(a) + len(b) == len(a+b). AFAICS, NFD does not.
OTOH, if you don't need strings to obey array invariants, NFC is much
more friendly to "dumb" UIs that just display the characters as they
get them, without trying to find an equivalent that is in the font for
missing characters. And it seems plausible that some applications will
mix normalizations inside of the Python instance. The app must handle
this; Python can't. Even if you carry normalization information around
with your str object, what normalization is Python supposed to apply
to nfd_str + nfc_str? But surely that operation is permissible!

> > In practice, binary concerns do intrude even for text data; you may
> > well want to save it back out in the original encoding, without any
> > spurious changes.

Then for the purposes of this discussion, it's not text, it's binary.
In many cases it will need to be read as bytes and stored that way
until written back out. I.e., many legacy encodings do not support
roundtrips, such as those that use ISO 2022 extension techniques:
there's no rule against having a mode-changing sequence and its
inverse in succession, and it's occasionally seen in the wild. Even
UTF-8 has unnormalized representations for many characters, and it was
only recently that Unicode came to require that they be treated as
errors, and not interpreted (producing them has always been
forbidden).

From guido at python.org  Fri Jun 22 17:58:04 2007
From: guido at python.org (Guido van Rossum)
Date: Fri, 22 Jun 2007 08:58:04 -0700
Subject: [Python-3000] canonicalization [was: On PEP 3116: new I/O base classes]
In-Reply-To: <467B7007.1040503@v.loewis.de>
References: <467B7007.1040503@v.loewis.de>
Message-ID: 

On 6/21/07, "Martin v. Löwis" wrote:

[Guido]
> > Have we agreed on the level of
> > normalization for source code yet? I'm pretty sure we have agreed on
> > *when* it happens, i.e. (logically) before the lexer starts scanning
> > the source code.
>
> That isn't actually my view: I would apply normalization *only* to
> identifiers, i.e. leave string literals unmodified. If people would
> rather see normalization applied to the entire input, that would be
> an option, of course (although perhaps more expensive to implement,
> as you need to perform it on all source, even if that source turns
> out to be ASCII only).

OK, sorry, I must've stopped reading that thread at the wrong moment.
No need to change it on my behalf.
--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From janssen at parc.com  Fri Jun 22 18:37:43 2007
From: janssen at parc.com (Bill Janssen)
Date: Fri, 22 Jun 2007 09:37:43 PDT
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <467BBC98.1070201@gmail.com>
References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net> <6002181751375776921@unknownmsgid> <-665892861201335771@unknownmsgid> <07Jun20.190105pdt."57996"@synergy1.parc.xerox.com> <467BBC98.1070201@gmail.com>
Message-ID: <07Jun22.093746pdt."57996"@synergy1.parc.xerox.com>

> A given codec won't randomly decide to change its normalisation policy,
> though - so when you pick the codec, you're picking the normalisation as
> well.

You're sure? Between CPython and Jython and IronPython and
JavascriptPython and ...? Might as well specify it up front.

Bill

From janssen at parc.com  Fri Jun 22 18:41:19 2007
From: janssen at parc.com (Bill Janssen)
Date: Fri, 22 Jun 2007 09:41:19 PDT
Subject: [Python-3000] canonicalization [was: On PEP 3116: new I/O base classes]
In-Reply-To: <87645g5b4d.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <87645g5b4d.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <07Jun22.094122pdt."57996"@synergy1.parc.xerox.com>

> > > In practice, binary concerns do intrude even for text data; you may
> > > well want to save it back out in the original encoding, without any
> > > spurious changes.
>
> Then for the purposes of this discussion, it's not text, it's binary.
> In many cases it will need to be read as bytes and stored that way
> until written back out.

That was more or less my original point; the string situation has
gotten complicated enough that I believe any careful coder will do any
transformations in application code, rather than relying on (and
trying to understand) the particular machinations of some text wrapper
in the I/O library.

Bill

From guido at python.org  Fri Jun 22 19:21:18 2007
From: guido at python.org (Guido van Rossum)
Date: Fri, 22 Jun 2007 10:21:18 -0700
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: <-3030247401668859168@unknownmsgid>
References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net> <6002181751375776921@unknownmsgid> <-665892861201335771@unknownmsgid> <467BBC98.1070201@gmail.com> <-3030247401668859168@unknownmsgid>
Message-ID: 

On 6/22/07, Bill Janssen wrote:
> > A given codec won't randomly decide to change its normalisation policy,
> > though - so when you pick the codec, you're picking the normalisation as
> > well.
>
> You're sure? Between CPython and Jython and IronPython and
> JavascriptPython and ...? Might as well specify it up front.

I'm not sure I see the use case.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From janssen at parc.com  Fri Jun 22 20:40:55 2007
From: janssen at parc.com (Bill Janssen)
Date: Fri, 22 Jun 2007 11:40:55 PDT
Subject: [Python-3000] On PEP 3116: new I/O base classes
In-Reply-To: 
References: <21065E51-C749-4DF9-8743-EF4CDE79F3FD@fuhm.net> <6002181751375776921@unknownmsgid> <-665892861201335771@unknownmsgid> <467BBC98.1070201@gmail.com> <-3030247401668859168@unknownmsgid>
Message-ID: <07Jun22.114102pdt."57996"@synergy1.parc.xerox.com>

Guido writes:
> On 6/22/07, Bill Janssen wrote:
> > > A given codec won't randomly decide to change its normalisation policy,
> > > though - so when you pick the codec, you're picking the normalisation as
> > > well.
> >
> > You're sure? Between CPython and Jython and IronPython and
> > JavascriptPython and ...? Might as well specify it up front.
>
> I'm not sure I see the use case.

Portable Python code that reads and writes "text" files the same way
in any implementation of Python.

Bill

From ntoronto at cs.byu.edu  Fri Jun 22 21:32:42 2007
From: ntoronto at cs.byu.edu (Neil Toronto)
Date: Fri, 22 Jun 2007 13:32:42 -0600
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: 
References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <4679020A.8020609@gmail.com>
Message-ID: <467C23DA.2040507@cs.byu.edu>

Alex Martelli wrote:
> $ python -mtimeit -s'import itertools as it' -s'L=range(-7,17)' 'for x
> in it.imap(abs,L): pass'
> 100000 loops, best of 3: 3 usec per loop
> $ python -mtimeit -s'import itertools as it' -s'L=range(-7,17)' 'for x
> in (abs(y) for y in L): pass'
> 100000 loops, best of 3: 4.47 usec per loop
>
> (imap is faster in this case because the built-in name 'abs' is looked
> up only once -- in the genexp, it's looked up each time, sigh --
> possibly the biggest "we should REALLY tweak the language to let this
> be optimized sensibly" gotcha in Python, IMHO).

What is it about the language as it stands that requires abs() to be
looked up each iteration?

Neil

From amcnabb at mcnabbs.org  Fri Jun 22 21:42:19 2007
From: amcnabb at mcnabbs.org (Andrew McNabb)
Date: Fri, 22 Jun 2007 13:42:19 -0600
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <467C23DA.2040507@cs.byu.edu>
References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <4679020A.8020609@gmail.com> <467C23DA.2040507@cs.byu.edu>
Message-ID: <20070622194219.GB26333@mcnabbs.org>

On Fri, Jun 22, 2007 at 01:32:42PM -0600, Neil Toronto wrote:
> > (imap is faster in this case because the built-in name 'abs' is looked
> > up only once -- in the genexp, it's looked up each time, sigh --
> > possibly the biggest "we should REALLY tweak the language to let this
> > be optimized sensibly" gotcha in Python, IMHO).
>
> What is it about the language as it stands that requires abs() to be
> looked up each iteration?

Calling abs() could change locals()['abs'], in which case a different
function would be called the next time through. You lookup 'abs' each
time just in case it's changed.

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868
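A note for readers following this subthread: the slowdown Alex measured
comes from the genexp re-resolving the name 'abs' (a global lookup that
falls back to the builtins) on every iteration, and the stock
workaround is to bind the builtin to a local name once, since local
variable access is much cheaper. A small illustrative sketch -- the
helper name and the default-argument idiom are conventions, not
anything proposed in the thread:

    def abs_all(items, _abs=abs):
        # _abs is bound once, when the def statement executes; inside
        # the loop it is a fast local lookup instead of a global one.
        return [_abs(x) for x in items]

    assert abs_all([-3, 5]) == [3, 5]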
From ntoronto at cs.byu.edu  Fri Jun 22 22:13:39 2007
From: ntoronto at cs.byu.edu (Neil Toronto)
Date: Fri, 22 Jun 2007 14:13:39 -0600
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <20070622194219.GB26333@mcnabbs.org>
References: <4677C3F8.3050305@nekomancer.net> <7301715244131583311@unknownmsgid> <43aa6ff70706190946w15cf2c17jf8195cddba024d17@mail.gmail.com> <4679020A.8020609@gmail.com> <467C23DA.2040507@cs.byu.edu> <20070622194219.GB26333@mcnabbs.org>
Message-ID: <467C2D73.3020600@cs.byu.edu>

Andrew McNabb wrote:
> On Fri, Jun 22, 2007 at 01:32:42PM -0600, Neil Toronto wrote:
>
>>> (imap is faster in this case because the built-in name 'abs' is looked
>>> up only once -- in the genexp, it's looked up each time, sigh --
>>> possibly the biggest "we should REALLY tweak the language to let this
>>> be optimized sensibly" gotcha in Python, IMHO).
>>>
>> What is it about the language as it stands that requires abs() to be
>> looked up each iteration?
>>
>
> Calling abs() could change locals()['abs'], in which case a different
> function would be called the next time through. You lookup 'abs' each
> time just in case it's changed.

I can't think of a reason to allow that outside of something like an
obfuscated Python code contest. I'm sure there exists someone who
thinks differently...

Neil

From exarkun at divmod.com  Fri Jun 22 22:50:01 2007
From: exarkun at divmod.com (Jean-Paul Calderone)
Date: Fri, 22 Jun 2007 16:50:01 -0400
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <467C2D73.3020600@cs.byu.edu>
Message-ID: <20070622205001.4947.1910896405.divmod.quotient.3570@ohm>

On Fri, 22 Jun 2007 14:13:39 -0600, Neil Toronto wrote:
>Andrew McNabb wrote:
>> On Fri, Jun 22, 2007 at 01:32:42PM -0600, Neil Toronto wrote:
>>
>>>> (imap is faster in this case because the built-in name 'abs' is looked
>>>> up only once -- in the genexp, it's looked up each time, sigh --
>>>> possibly the biggest "we should REALLY tweak the language to let this
>>>> be optimized sensibly" gotcha in Python, IMHO).
>>>>
>>> What is it about the language as it stands that requires abs() to be
>>> looked up each iteration?
>>>
>>
>> Calling abs() could change locals()['abs'], in which case a different
>> function would be called the next time through. You lookup 'abs' each
>> time just in case it's changed.
>>
>
>I can't think of a reason to allow that outside of something like an
>obfuscated Python code contest. I'm sure there exists someone who thinks
>differently...

The perfectly good reason to allow it is that it is a completely
predictable, unsurprising consequence of how the Python language
is defined.

Making a special case for the way names are looked up in a genexp
means making it harder to learn Python and to understand programs
written in Python. Keeping this simple isn't about letting people
obfuscate code, it's about making it _easy_ for people to understand
Python programs.

If the goal is to make it easier to write obscure code, _that_ would
be a valid motivation for changing the lookup rules here. Preventing
people from writing obfuscated programs is _not_.

Jean-Paul

From aleaxit at gmail.com  Fri Jun 22 23:06:07 2007
From: aleaxit at gmail.com (Alex Martelli)
Date: Fri, 22 Jun 2007 14:06:07 -0700
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <20070622205001.4947.1910896405.divmod.quotient.3570@ohm>
References: <467C2D73.3020600@cs.byu.edu> <20070622205001.4947.1910896405.divmod.quotient.3570@ohm>
Message-ID: 

On 6/22/07, Jean-Paul Calderone wrote:
  ...
> >> Calling abs() could change locals()['abs'], in which case a different
> >> function would be called the next time through.
> >> You lookup 'abs' each
> >> time just in case it's changed.
> >>
> >
> >I can't think of a reason to allow that outside of something like an
> >obfuscated Python code contest. I'm sure there exists someone who thinks
> >differently...
>
> The perfectly good reason to allow it is that it is a completely
> predictable, unsurprising consequence of how the Python language
> is defined.
>
> Making a special case for the way names are looked up in a genexp
> means making it harder to learn Python and to understand programs
> written in Python.

Absolutely: it should NOT be about specialcasing genexp. Rather, it
would be some new rule such as:

"""
If a built-in name that is used within the body of a function F is
rebound or unbound (in the builtins' module or in F's own module),
after 'def F' executes and builds a function object F', and before any
call to F' has finished executing, the resulting effect is undefined.
"""

This gives a future Python compiler a fighting chance to optimize
builtins' access and use -- quite independently from specialcases such
as genexps. (Limiting the optimization to functions is, I believe,
quite fine, because similar limitations apply to optimization of
local-variable access; IOW, people who care about the speed of some
piece of code had better make that code part of some function body,
already:-).

Alex

From exarkun at divmod.com  Fri Jun 22 23:13:26 2007
From: exarkun at divmod.com (Jean-Paul Calderone)
Date: Fri, 22 Jun 2007 17:13:26 -0400
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: 
Message-ID: <20070622211326.4947.551402273.divmod.quotient.3575@ohm>

On Fri, 22 Jun 2007 14:06:07 -0700, Alex Martelli wrote:
>On 6/22/07, Jean-Paul Calderone wrote:
> ...
>> >> Calling abs() could change locals()['abs'], in which case a different
>> >> function would be called the next time through. You lookup 'abs' each
>> >> time just in case it's changed.
>> >>
>> >
>> >I can't think of a reason to allow that outside of something like an
>> >obfuscated Python code contest. I'm sure there exists someone who thinks
>> >differently...
>>
>>The perfectly good reason to allow it is that it is a completely
>>predictable, unsurprising consequence of how the Python language
>>is defined.
>>
>>Making a special case for the way names are looked up in a genexp
>>means making it harder to learn Python and to understand programs
>>written in Python.
>
>Absolutely: it should NOT be about specialcasing genexp. Rather, it
>would be some new rule such as:
>"""
>If a built-in name that is used within the body of a function F is
>rebound or unbound (in the builtins' module or in F's own module),
>after 'def F' executes and builds a function object F', and before any
>call to F' has finished executing, the resulting effect is undefined.
>"""
>This gives a future Python compiler a fighting chance to optimize
>builtins' access and use -- quite independently from specialcases such
>as genexps. (Limiting the optimization to functions is, I believe,
>quite fine, because similar limitations apply to optimization of
>local-variable access; IOW, people who care about the speed of some
>piece of code had better make that code part of some function body,
>already:-).
>

This is more reasonable, but it's still a new rule (and I personally
find rules which include undefined behavior to be distasteful -- but
your suggestion could be modified so that the name change is never
respected to achieve roughly the same consequence). And it's not even
And it's not even a rule imposed for a good reason (good reasons are reasons of semantic simplicity, consistency, etc), it's just imposed to make it easier to optimize the runtime. If the common case is to read a name repeatedly and not care about writes to the name, then leave the language alone and just optimize reading of names. For example, have the runtime set up observers for the names used in a function and require any write to a name to notify those observers. Now lookups are fast, the semantics are unchanged, and there are no new rules. No, I'm not volunteering to implement this, but if someone else is interested in spending time speeding up CPython, then this is worth trying first (and it is worth trying to think of other ideas that don't complicate the language). Jean-Paul From aleaxit at gmail.com Fri Jun 22 23:28:12 2007 From: aleaxit at gmail.com (Alex Martelli) Date: Fri, 22 Jun 2007 14:28:12 -0700 Subject: [Python-3000] Python 3000 Status Update (Long!) In-Reply-To: <20070622211326.4947.551402273.divmod.quotient.3575@ohm> References: <20070622211326.4947.551402273.divmod.quotient.3575@ohm> Message-ID: On 6/22/07, Jean-Paul Calderone wrote: ... > This is more reasonable, but it's still a new rule (and I personally > find rules which include undefined behavior to be distasteful -- but > your suggestion could be modified so that the name change is never > respected to achieve roughly the same consequence). And it's not even That would put a potentially heavy burden on Python compilers that may not be interested in producing speedy code but in compiling faster. Specifically asserting that some _weird_ behavior is undefined in order to allow the writing of compilers without excessive burden is quite sensible to me, in general. For example, what happens to 'import foo' statements if some foo.py appears, disappears, and/or changes somewhere on sys.path during the program run IS "de facto" undefined (for filesystems with sufficiently flaky behavior, such as remote ones:-) -- I'd like that to be stated outright in the docs, to allow a sensible and compliant import system to perform some caching (e.g. ensuring os.listdir is called no more than once per directory in sys.path) without lingering feelings of guilt or trickiness. > a rule imposed for a good reason (good reasons are reasons of semantic > simplicity, consistency, etc), it's just imposed to make it easier to > optimize the runtime. If the common case is to read a name repeatedly > and not care about writes to the name, then leave the language alone and > just optimize reading of names. For example, have the runtime set up > observers for the names used in a function and require any write to a > name to notify those observers. Now lookups are fast, the semantics > are unchanged, and there are no new rules. However, this would not afford the same level of optimization (e.g. special opcodes for very lightweight builtins such as len), and if it involved making all dicts richer to support 'observers on key rebinds' might possibly slow dicts by enough to more than counteract the benefits (of course it might be possible to get away with replacing builtin and modules' dicts with instances of an "observabledict" subclass -- possibly worthwhile, but a HUGE workload to undertake in order to let some weirdo reassign 'len' in builtins at random times). Practicality beats purity. 
Alex

From exarkun at divmod.com  Fri Jun 22 23:44:11 2007
From: exarkun at divmod.com (Jean-Paul Calderone)
Date: Fri, 22 Jun 2007 17:44:11 -0400
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: 
Message-ID: <20070622214411.4947.1977577066.divmod.quotient.3580@ohm>

On Fri, 22 Jun 2007 14:28:12 -0700, Alex Martelli wrote:
>On 6/22/07, Jean-Paul Calderone wrote:
> ...
>>This is more reasonable, but it's still a new rule (and I personally
>>find rules which include undefined behavior to be distasteful -- but
>>your suggestion could be modified so that the name change is never
>>respected to achieve roughly the same consequence). And it's not even
>
>That would put a potentially heavy burden on Python compilers that may
>not be interested in producing speedy code but in compiling faster.
>Specifically asserting that some _weird_ behavior is undefined in
>order to allow the writing of compilers without excessive burden is
>quite sensible to me, in general. For example, what happens to
>'import foo' statements if some foo.py appears, disappears, and/or
>changes somewhere on sys.path during the program run IS "de facto"
>undefined (for filesystems with sufficiently flaky behavior, such as
>remote ones:-) -- I'd like that to be stated outright in the docs, to
>allow a sensible and compliant import system to perform some caching
>(e.g. ensuring os.listdir is called no more than once per directory in
>sys.path) without lingering feelings of guilt or trickiness.

Could be. I don't find many of my programs to be bottlenecked on
compilation time or import time, so these optimizations look like
pure lose to me.

>>a rule imposed for a good reason (good reasons are reasons of semantic
>>simplicity, consistency, etc), it's just imposed to make it easier to
>>optimize the runtime. If the common case is to read a name repeatedly
>>and not care about writes to the name, then leave the language alone and
>>just optimize reading of names. For example, have the runtime set up
>>observers for the names used in a function and require any write to a
>>name to notify those observers. Now lookups are fast, the semantics
>>are unchanged, and there are no new rules.
>
>However, this would not afford the same level of optimization (e.g.
>special opcodes for very lightweight builtins such as len), and if it
>involved making all dicts richer to support 'observers on key rebinds'
>might possibly slow dicts by enough to more than counteract the
>benefits (of course it might be possible to get away with replacing
>builtin and modules' dicts with instances of an "observabledict"
>subclass -- possibly worthwhile, but a HUGE workload to undertake in
>order to let some weirdo reassign 'len' in builtins at random times).
>Practicality beats purity.
>

I also don't find much of my code bottlenecked on local name lookup.
Function call and attribute lookup overhead is a much bigger killer,
but I can still write apps where the Python VM isn't the bottleneck
without really trying, and when something is slow, giving attention
to the miniscule fraction of my overall codebase which causes the
problem is itself not much of a problem.

Is it neat when CPython gets faster overall? Sure. Is it worth
complications to the language for what is ultimately a tiny
speedup? Not on my balance sheet.

Jean-Paul

From mike.klaas at gmail.com  Sat Jun 23 00:26:37 2007
From: mike.klaas at gmail.com (Mike Klaas)
Date: Fri, 22 Jun 2007 15:26:37 -0700
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <20070622214411.4947.1977577066.divmod.quotient.3580@ohm>
References: <20070622214411.4947.1977577066.divmod.quotient.3580@ohm>
Message-ID: 

On 22-Jun-07, at 2:44 PM, Jean-Paul Calderone wrote:
> On Fri, 22 Jun 2007 14:28:12 -0700, Alex Martelli wrote:
>
> Could be. I don't find many of my programs to be bottlenecked on
> compilation time or import time, so these optimizations look like
> pure lose to me.

Nor do mine, though this complaint is common (python startup time in
general, esp for short scripts).

>> However, this would not afford the same level of optimization (e.g.
>> special opcodes for very lightweight builtins such as len), and if it
>> involved making all dicts richer to support 'observers on key
>> rebinds'
>> might possibly slow dicts by enough to more than counteract the
>> benefits (of course it might be possible to get away with replacing
>> builtin and modules' dicts with instances of an "observabledict"
>> subclass -- possibly worthwhile, but a HUGE workload to undertake in
>> order to let some weirdo reassign 'len' in builtins at random times).
>> Practicality beats purity.
>>
>
> I also don't find much of my code bottlenecked on local name lookup.
> Function call and attribute lookup overhead is a much bigger killer,
> but I can still write apps where the Python VM isn't the bottleneck
> without really trying, and when something is slow, giving attention
> to the miniscule fraction of my overall codebase which causes the
> problem is itself not much of a problem.
>
> Is it neat when CPython gets faster overall? Sure. Is it worth
> complications to the language for what is ultimately a tiny
> speedup? Not on my balance sheet.

I agree that making CPython .5% faster is not compelling, but there is
value in knowing that certain patterns of code are optimized in
certain ways, so that less mental effort and tweaking is necessary in
those bottleneck functions. Further, it allows the code to remain
clearer and truer to the original intent (rebinding globals to locals
is _ugly_).

It is like constant folding: I don't expect that it produces much by
way of general CPython speedup, but it allows clearer code to be
written without micro-worries about micro-optimization.

    s = 0
    for x in xrange(10):
        s += 10*1024*1024  # add ten MB

I _like_ being able to write that, knowing that my preferred way of
writing the code is not costing me anything. It would, of course, be
even better if the whole loop disappeared :)

-Mike

From nicko at nicko.org  Sat Jun 23 10:12:30 2007
From: nicko at nicko.org (Nicko van Someren)
Date: Sat, 23 Jun 2007 09:12:30 +0100
Subject: [Python-3000] Python 3000 Status Update (Long!)
In-Reply-To: <07Jun21.132117pdt."57996"@synergy1.parc.xerox.com>
References: <77903.26658.qm@web33507.mail.mud.yahoo.com> <07Jun21.091906pdt."57996"@synergy1.parc.xerox.com> <467AB63A.7050505@cornell.edu> <07Jun21.132117pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <189666FC-E9D9-4A84-9934-DDA0B51BF958@nicko.org>

On 21 Jun 2007, at 21:21, Bill Janssen wrote:
>>> It should amount to "map(+, operands)".
>>
>> Or, to be pedantic, this:
>>
>> reduce(lambda x, y: x.__add__(y), operands)
>
> Don't you mean:
>
> reduce(lambda x, y: x.__add__(y), operands[1:], operands[0])

In the absence of a "start" value reduce "does the right thing", so
you don't need to do that. My original post was asking for sum to
behave as Joel wrote.
At the moment sum is more like:

    def sum(operands, start=0):
        return reduce(lambda x,y: x+y, operands, start)

Since the start value defaults to 0, if you don't specify a start
value and your items can't be added to zero you run into a problem. I
was proposing something that behaved more like:

    def sum(operands, start=None):
        if start is None:
            operands, start = operands[1:], operands[0]
        return reduce(lambda x,y: x+y, operands, start)

The best argument against this so far however is the one from Gareth
about what type is returned if no start value is given and the list is
also empty. Unless one is happy with the idea that sum([]) == None
then I concede that the current behaviour is probably the best
compromise. That said, I still think that the special case rejection
of strings is ugly!

Cheers,
Nicko

From alexandre at peadrop.com  Sat Jun 23 17:53:35 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Sat, 23 Jun 2007 11:53:35 -0400
Subject: [Python-3000] StringIO/BytesIO in io.py doesn't over-seek properly
Message-ID: 

Hello,

I think I found a bug in the implementation of StringIO/BytesIO in the
new io module. I would like to fix it, but I am not sure what should
be the correct behavior. Any hint on this?

And one more thing, the close method on StringIO/BytesIO objects
doesn't work. I will try to fix that too.

Thanks,
-- Alexandre

Python 3.0x (py3k-struni:56080M, Jun 22 2007, 17:18:04)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> s1 = io.StringIO()
>>> s1.seek(10)
10
>>> s1.write('hello')
5
>>> s1.getvalue()
'hello'
>>> s1.seek(0)
0
>>> s1.write('abc')
3
>>> s1.getvalue()
'abclo'
>>> import StringIO
>>> s2 = StringIO.StringIO()
>>> s2.seek(10)
>>> s2.write('hello')
>>> s2.getvalue()
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00hello'
>>> s2.seek(0)
>>> s2.write('abc')
>>> s2.getvalue()
'abc\x00\x00\x00\x00\x00\x00\x00hello'

From guido at python.org  Sat Jun 23 19:52:19 2007
From: guido at python.org (Guido van Rossum)
Date: Sat, 23 Jun 2007 10:52:19 -0700
Subject: [Python-3000] StringIO/BytesIO in io.py doesn't over-seek properly
In-Reply-To: 
References: 
Message-ID: 

On 6/23/07, Alexandre Vassalotti wrote:
> Hello,
>
> I think I found a bug in the implementation of StringIO/BytesIO in the
> new io module. I would like to fix it, but I am not sure what should
> be the correct behavior. Any hint on this?

BytesIO should behave the way Unix files work: just seeking only sets
the read/write position, but writing inserts null bytes between the
existing end of the file and the new write position. (Writing zero
bytes doesn't count; I've just experimentally verified this.)

I think however that for StringIO this should not be allowed -- seek()
on StringIO is only allowed to accept cookies returned by tell() on
the same file object.

> And one more thing, the close method on StringIO/BytesIO objects
> doesn't work. I will try to fix that too.

What do you want it to do? I'm thinking perhaps it doesn't need to do
anything.

--Guido

> Thanks,
> -- Alexandre
>
> Python 3.0x (py3k-struni:56080M, Jun 22 2007, 17:18:04)
> [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import io
> >>> s1 = io.StringIO()
> >>> s1.seek(10)
> 10
> >>> s1.write('hello')
> 5
> >>> s1.getvalue()
> 'hello'
> >>> s1.seek(0)
> 0
> >>> s1.write('abc')
> 3
> >>> s1.getvalue()
> 'abclo'
> >>> import StringIO
> >>> s2 = StringIO.StringIO()
> >>> s2.seek(10)
> >>> s2.write('hello')
> >>> s2.getvalue()
> '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00hello'
> >>> s2.seek(0)
> >>> s2.write('abc')
> >>> s2.getvalue()
> 'abc\x00\x00\x00\x00\x00\x00\x00hello'

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

From alexandre at peadrop.com  Sat Jun 23 20:24:14 2007
From: alexandre at peadrop.com (Alexandre Vassalotti)
Date: Sat, 23 Jun 2007 14:24:14 -0400
Subject: [Python-3000] StringIO/BytesIO in io.py doesn't over-seek properly
In-Reply-To: 
References: 
Message-ID: 

On 6/23/07, Guido van Rossum wrote:
> On 6/23/07, Alexandre Vassalotti wrote:
> > I think I found a bug in the implementation of StringIO/BytesIO in the
> > new io module. I would like to fix it, but I am not sure what should
> > be the correct behavior. Any hint on this?
>
> BytesIO should behave the way Unix files work: just seeking only sets
> the read/write position, but writing inserts null bytes between the
> existing end of the file and the new write position. (Writing zero
> bytes doesn't count; I've just experimentally verified this.)

I agree with this. I will try to write a patch to fix io.BytesIO.

> I think however that for StringIO this should not be allowed -- seek()
> on StringIO is only allowed to accept cookies returned by tell() on
> the same file object.

I am not sure what you mean by "cookies" here. So, do you mean
StringIO would not be allowed to seek beyond the end-of-file?

> > And one more thing, the close method on StringIO/BytesIO objects
> > doesn't work. I will try to fix that too.
>
> What do you want it to do? I'm thinking perhaps it doesn't need to do
> anything.

Free the resources held by the object, and make all methods of the
object raise a ValueError if they are used.

-- Alexandre

From guido at python.org  Sat Jun 23 20:48:11 2007
From: guido at python.org (Guido van Rossum)
Date: Sat, 23 Jun 2007 11:48:11 -0700
Subject: [Python-3000] StringIO/BytesIO in io.py doesn't over-seek properly
In-Reply-To: 
References: 
Message-ID: 

On 6/23/07, Alexandre Vassalotti wrote:
> On 6/23/07, Guido van Rossum wrote:
> > On 6/23/07, Alexandre Vassalotti wrote:
> > > I think I found a bug in the implementation of StringIO/BytesIO in the
> > > new io module. I would like to fix it, but I am not sure what should
> > > be the correct behavior. Any hint on this?
> >
> > BytesIO should behave the way Unix files work: just seeking only sets
> > the read/write position, but writing inserts null bytes between the
> > existing end of the file and the new write position. (Writing zero
> > bytes doesn't count; I've just experimentally verified this.)
>
> I agree with this. I will try to write a patch to fix io.BytesIO.

Great!

> > I think however that for StringIO this should not be allowed -- seek()
> > on StringIO is only allowed to accept cookies returned by tell() on
> > the same file object.
>
> I am not sure what you mean by "cookies" here. So, do you mean
> StringIO would not be allowed to seek beyond the end-of-file?
tell() returns a number that isn't necessarily a byte offset. It's an
abstract value that only seek() knows what to do with. TextIOBase in
general doesn't support arbitrary seeks at all.

I just realized that a different implementation of StringIO could use
"code unit" offsets and then it could be allowed to seek beyond EOF.
But IMO it's not required to do that (and the current implementation
doesn't work that way -- it's a TextIOWrapper on top of a BytesIO).

> > > And one more thing, the close method on StringIO/BytesIO objects
> > > doesn't work. I will try to fix that too.
> >
> > What do you want it to do? I'm thinking perhaps it doesn't need to do
> > anything.
>
> Free the resources held by the object, and make all methods of the
> object raise a ValueError if they are used.

I'm not sure what the use case for that is (even though the 2.x
StringIO does this).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
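To make the over-seek behavior discussed in this thread concrete, here
is a short snippet against io.BytesIO as it eventually behaved -- a
sketch of the intended semantics rather than output captured from the
2007 branch:

    import io

    b = io.BytesIO()
    b.seek(5)              # seeking past EOF only moves the position
    b.write(b'abc')        # the write pads the gap with null bytes
    print(b.getvalue())    # b'\x00\x00\x00\x00\x00abc'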
Consider a URL with one or more underscore- >>> prefixed path segment elements (because prefixing a filename with an >>> underscore is a perfectly reasonable thing to do on a filesystem, and >>> path elements are often named after file names) fed to a traversal >>> algorithm that attempts to resolve each path element into an object >>> by calling __getitem__ against the parent found by the last path >>> element's traversal result. Perhaps this is poor design and >>> __getitem__ should not be consulted here, but I doubt that highly >>> because there's nothing particularly special about calling a method >>> named __getitem__ as opposed to some method named "traverse". > > This is trying to make a technical argument, but the 'consenting > adults' policy might be enough. In my experience, zope forbiding > access to _ prefixed attributes just led to work around the > limitation, thus adding more useless indirection to an already crufty > code base. The result is more obfuscation and probably even less > security (as in auditability of the code). > >>> The only precedent within Python 2 for this sort of behavior is >>> limiting access to variables that begin with __ and which do not end >>> with __ to the scope defined by a class and its instances. I >>> personally don't believe this is a very useful feature, but it's >>> still only an advisory policy and you can worm around it with enough >>> gyrations. > > FWIW I've come to never use __attrs. The obfuscation feature seems to > bring nothing but pain (the few times I've fell into that trap as a > beginner python programmer). > >>> Given that security is a concern at all, the only truly reasonable >>> way to "limit security issues" is to disallow item and attribute >>> access completely within the string templating expression syntax. It >>> seems gratuituous to me to encourage string templating expressions >>> with item/attribute access, given that you could do it within the >>> format arguments just as easily in the 99% case, and we've (well... >>> I've) happily been living with that restriction for years now. >>> >>> But if this syntax is preserved, there really should be no *default* >>> restrictions on the traversable names within an expression because >>> this will almost certainly become a hard-to-explain, hard-to-justify >>> bug magnet as it has become in Zope. > > I'd add that Zope in general looks to me like a giant collection of > python anti-patterns and as such can be used as a clue source about > what not to do, especially what not to include in Py3k. > > I don't want to offense people, well no more than necessary (imho zope > *is* an offense to common sense in many ways), but that's the opinion > from someone who earns its living mostly from zope/plone products > dev. and maintenance (these days, anyway). > > Regards, > Aur?lien. 
From brett at python.org  Sun Jun 24 05:30:40 2007
From: brett at python.org (Brett Cannon)
Date: Sat, 23 Jun 2007 20:30:40 -0700
Subject: [Python-3000] [Python-Dev] Issues with PEP 3101 (string formatting)
In-Reply-To: <3cdcefb80706201000t91f5fd4ia8fac97d1dbc3d@mail.gmail.com>
References: <3cdcefb80706201000t91f5fd4ia8fac97d1dbc3d@mail.gmail.com>
Message-ID: 

On 6/20/07, Greg Falcon wrote:
> On 6/19/07, Chris McDonough wrote:
> > Given that security is a concern at all, the only truly reasonable
> > way to "limit security issues" is to disallow item and attribute
> > access completely within the string templating expression syntax. It
> > seems gratuitous to me to encourage string templating expressions
> > with item/attribute access, given that you could do it within the
> > format arguments just as easily in the 99% case, and we've (well...
> > I've) happily been living with that restriction for years now.
> >
> > But if this syntax is preserved, there really should be no *default*
> > restrictions on the traversable names within an expression because
> > this will almost certainly become a hard-to-explain, hard-to-justify
> > bug magnet as it has become in Zope.
>
> This sounds exactly right to me. I don't have strong feelings either
> way about attribute lookups in formatting strings, or the security
> problems they raise. But while it seems a reasonable stance that
> user-injected getattr()s may pose a security problem, what seems
> indefensible is the stance that user-injected getattr()s are okay
> precisely when the attribute being looked up doesn't start with an
> underscore.
>
> A single underscore prefix is a hint to human readers, not to the
> language itself, and things should stay that way.

Since Talin said he wanted to see what others had to say, I am going
to say I agree with this sentiment. I want string formatting to be
dead-simple. That means either leaving out overly fancy formatting
abilities and keeping it simple, or making it very intuitive with as
few special cases as possible.

-Brett

From talin at acm.org  Sun Jun 24 08:01:17 2007
From: talin at acm.org (Talin)
Date: Sat, 23 Jun 2007 23:01:17 -0700
Subject: [Python-3000] Issues with PEP 3101 (string formatting)
In-Reply-To: 
References: <46793E85.4000402@gmail.com>
Message-ID: <467E08AD.8020703@acm.org>

Chris McDonough wrote:
> Allowing attribute and/or item access within templating expressions
> has historically been the domain of full-on templating languages
> (which invariably also have a way to do repeats, conditionals,
> arbitrary method calls, etc).
>
> I think it should probably stay that way because to me, at least,
> there's not much more compelling about being able to do item/
> attribute access within a template expression than there is to be
> able to do replacements using results from arbitrary method calls.
> It's fairly arbitrary to allow calls to __getitem__ and __getattr__
> but prevent, say, calls to "traverse", at least if the format
> arguments are not restricted to plain lists/tuples/dicts.

I don't buy this argument - in that I don't think it's arbitrary. You
are correct that 3101 is not intended to be a full-on templating
language, but that doesn't mean that we can't extend it beyond what,
say, printf can do.
The current design is a mid-point between Perl's interpolated strings
(which can contain arbitrary expressions) and C-style printf. The
guiding rule is to allow expressions which increase convenience and
expressiveness, and which are likely to be useful, while disallowing
most of the types of expressions which would be likely to have side
effects. Since this is Python, we can't guarantee that there are no
side effects, but we can make a pretty good guess based on the
assumption that most Python programmers are reasonable and sane.

From an implementation standpoint, this is not where the complexity
lies. (The most complex part of the code is the part dealing with
details of conversion specifiers and formatting of numbers.)

> That's not to say that maybe an extended templating thingy shouldn't
> ship within the stdlib though, maybe even one that extends the
> default interpolation syntax in these sorts of ways.
>
> - C
>
> On Jun 20, 2007, at 10:49 AM, Nick Coghlan wrote:
>
>> Chris McDonough wrote:
>>> Wrt http://www.python.org/dev/peps/pep-3101/
>>> PEP 3101 says Py3K should allow item and attribute access syntax
>>> within string templating expressions but "to limit potential
>>> security issues", access to underscore prefixed names within
>>> attribute/item access expressions will be disallowed.
>> Personally, I'd be fine with leaving at least the embedded
>> attribute access out of the initial implementation of the PEP. I'd
>> even be OK with leaving out the embedded item access, but if we
>> leave it in "vars(obj)" and the embedded item access would still
>> provide a shorthand notation for access to instance variable
>> attributes in a format string.
>>
>> So +1 for leaving out embedded attribute access from the initial
>> implementation of PEP 3101, and -0 for leaving out the embedded
>> item access.
>>
>> Cheers,
>> Nick.
>>
>> --
>> Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
>> ---------------------------------------------------------------
>> http://www.boredomandlaziness.org
>>

From chrism at plope.com  Sun Jun 24 08:32:13 2007
From: chrism at plope.com (Chris McDonough)
Date: Sun, 24 Jun 2007 02:32:13 -0400
Subject: [Python-3000] Issues with PEP 3101 (string formatting)
In-Reply-To: <467E08AD.8020703@acm.org>
References: <46793E85.4000402@gmail.com> <467E08AD.8020703@acm.org>
Message-ID: 

On Jun 24, 2007, at 2:01 AM, Talin wrote:
> The current design is a mid-point between Perl's interpolated
> strings (which can contain arbitrary expressions) and C-style
> printf. The guiding rule is to allow expressions which increase
> convenience and expressiveness, and which are likely to be useful,
> while disallowing most of the types of expressions which would be
> likely to have side effects. Since this is Python, we can't
> guarantee that there are no side effects, but we can make a pretty
> good guess based on the assumption that most Python programmers are
> reasonable and sane.

Of course it's a judgment call whether the benefit of being able to
do attribute/item lookup within formatting expressions is "worth it".
At the very least it means I'll need to be more careful when supplying
formatting arguments in order to prevent inappropriate data exposure.
And I won't be able to allow untrusted users to compose plain strings
with formatting expressions in them, at least without imposing some
restricted execution model within the objects fed to the formatter.
Zope currently does this inasmuch as it allows people to compose
dynamic TALES expressions, which is "safe" right now, but will become
unsafe. Frankly I'd rather just not think about it, because leaving
this feature out is way easier than dealing with restricted execution
or coming up with a mini templating language to replace the current
string formatting stuff, which works fine.

But, that aside, at the very least, we shouldn't restrict the names
available to be looked up by default to those not starting with an
underscore (for the reasons I mentioned in the original post in this
thread).

>
> From an implementation standpoint, this is not where the complexity
> lies. (The most complex part of the code is the part dealing with
> details of conversion specifiers and formatting of numbers.)

I know it's not very complex, I just don't believe it's terribly
beneficial to have in the base string formatting implementation, and
it's potentially harmful. Particularly to web programmers, at least to
dumb ones like me.

- C

From p.f.moore at gmail.com  Sun Jun 24 21:10:43 2007
From: p.f.moore at gmail.com (Paul Moore)
Date: Sun, 24 Jun 2007 20:10:43 +0100
Subject: [Python-3000] [Python-Dev] Issues with PEP 3101 (string formatting)
In-Reply-To: 
References: <3cdcefb80706201000t91f5fd4ia8fac97d1dbc3d@mail.gmail.com>
Message-ID: <79990c6b0706241210x2d17a37pfa48c3346e9b7da8@mail.gmail.com>

On 24/06/07, Brett Cannon wrote:
> On 6/20/07, Greg Falcon wrote:
> > This sounds exactly right to me. I don't have strong feelings either
> > way about attribute lookups in formatting strings, or the security
> > problems they raise. But while it seems a reasonable stance that
> > user-injected getattr()s may pose a security problem, what seems
> > indefensible is the stance that user-injected getattr()s are okay
> > precisely when the attribute being looked up doesn't start with an
> > underscore.
> >
> > A single underscore prefix is a hint to human readers, not to the
> > language itself, and things should stay that way.
>
> Since Talin said he wanted to see what others had to say, I am going
> to say I agree with this sentiment. I want string formatting to be
> dead-simple. That means either leaving out overly fancy formatting
> abilities and keeping it simple, or making it very intuitive with as
> few special cases as possible.

Again, I agree. I'd prefer to see attribute access stay, but I'm not
too bothered; I'm very strongly against any restrictions based on the
form of name.

Count me as +0 on allowing a.b, and -1 on allowing a.b unless b
contains leading underscores.

Paul.

From p.f.moore at gmail.com  Sun Jun 24 21:13:50 2007
From: p.f.moore at gmail.com (Paul Moore)
Date: Sun, 24 Jun 2007 20:13:50 +0100
Subject: [Python-3000] [Python-Dev] Issues with PEP 3101 (string formatting)
In-Reply-To: <79990c6b0706241210x2d17a37pfa48c3346e9b7da8@mail.gmail.com>
References: <3cdcefb80706201000t91f5fd4ia8fac97d1dbc3d@mail.gmail.com> <79990c6b0706241210x2d17a37pfa48c3346e9b7da8@mail.gmail.com>
Message-ID: <79990c6b0706241213o41e395den6687e7fa9af3c189@mail.gmail.com>

On 24/06/07, Paul Moore wrote:
> Count me as +0 on allowing a.b, and -1 on allowing a.b unless b
> contains leading underscores.

Rereading that, the second part didn't make sense.
Assuming a.b is allowed, I'm -1 on putting restrictions on b,
specifically on not allowing it to start with an underscore.

Heck, the fact that I find it so hard to describe argues that it's a
misguided restriction (ignoring the possibility that I simply can't
express myself in my native language :-))

Paul.

From talin at acm.org  Sun Jun 24 21:51:51 2007
From: talin at acm.org (Talin)
Date: Sun, 24 Jun 2007 12:51:51 -0700
Subject: [Python-3000] [Python-Dev] Issues with PEP 3101 (string formatting)
In-Reply-To: 
References: 
Message-ID: <467ECB57.8080209@acm.org>

Georg Brandl wrote:
> Another question w.r.t. new string formatting:
>
> Assuming the %-operator for strings goes away as you said in the recent blog
> post, how are we going to convert string formatting (which I daresay is a very
> common operation in Python modules) in the 2to3 tool?
>
> Of course, "abc" % anything can be converted easily.
>
> name % tuple_or_dict can only be converted to name.format(tuple_or_dict),
> without correcting the format string.
>
> name % name can not be converted at all without type inference.
>
> Though probably the first type of application is the most frequent one,
> pre-building (or just loading from elsewhere) of format strings is not so
> uncommon when it comes to localization, where the format string likely
> has a _() wrapped around it.
>
> Of course, converting format strings manually is a PITA, mainly because it's
> so common.
>
> Georg

Actually, I was presuming that '%' would stick around for the time
being, although it might be officially deprecated. Given that writing
a 2to3 converter for format strings would be a project in itself, I
think it's probably best to remain backwards compatible for now.

-- Talin

From jcarlson at uci.edu  Sun Jun 24 23:05:30 2007
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 24 Jun 2007 14:05:30 -0700
Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
In-Reply-To: <87bqfcj97n.fsf@ten22.rhodesmill.org>
References: <4677E097.5060205@online.de> <87bqfcj97n.fsf@ten22.rhodesmill.org>
Message-ID: <20070624132756.7998.JCARLSON@uci.edu>

Brandon Craig Rhodes writes:
> Joachim König writes:
>
> > ... could someone enlighten me why
> >
> >     {,}
> >
> > can't be used for the empty set, analogous to the empty tuple (,)?
>
> And now that someone else has broken the ice regarding questions that
> have probably been exhausted already, I want to comment that Python 3k
> seems to perpetuate a vast asymmetry. Observe:

Since no one seems to have responded to this, I will go ahead and do
so (I just got back from vacation).

> (a) Syntactic constructors
>
>     [ 1,2,3 ]                        works
>     { 1,2,3 }                        works
>     { 1:1, 2:4, 3:9 }                works
>
> (b) Generators + constructor functions
>
>     list(i for i in (1,2,3))         works
>     set(i for i in (1,2,3))          works
>     dict((i,i*i) for i in (1,2,3))   works
>
> (c) Comprehensions
>
>     [ i for i in (1,2,3) ]           works
>     { i for i in (1,2,3) }           works
>     { i:i*i for i in (1,2,3) }       returns a SyntaxError!

But you forgot tuples!

    ( 1,2,3 )
    tuple(i for i in (1,2,3))
    (i for i in (1,2,3))

Oops, that last one isn't a tuple, it is a generator expression
wrapped up in parentheses. Really though, there are two exceptions to
the rule. Honestly, if you are that concerned about teaching students
the language (to the point that they have difficulty figuring out the
*two* exceptions to the rule), teach them the single form that always
works; generators + constructors.
They may see the three different comprehensions/expressions (list,
set, generator), but it should be fairly easy to explain that they are
equivalent to the generator + constructor version.

> Given that Python 3k is making such strides in other areas where cruft
> and asymmetry needed to be removed, it would seem a shame to leave the
> container types in such disarray.

And one could make the argument that TOOTDI says that literals and
generators + constructors are the only reasonable options.
Comprehensions save perhaps 5 characters over the constructor method,
and may be a bit faster, but result in the asymmetry above. But I will
admit that comprehension syntax is not likely to be going anywhere,
and dictionary comprehensions are not likely to be added (and neither
are tuple comprehensions).

- Josiah

From jimjjewett at gmail.com  Mon Jun 25 16:07:26 2007
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 25 Jun 2007 10:07:26 -0400
Subject: [Python-3000] [Python-Dev] Issues with PEP 3101 (string formatting)
In-Reply-To: <79990c6b0706241210x2d17a37pfa48c3346e9b7da8@mail.gmail.com>
References: <3cdcefb80706201000t91f5fd4ia8fac97d1dbc3d@mail.gmail.com> <79990c6b0706241210x2d17a37pfa48c3346e9b7da8@mail.gmail.com>
Message-ID: 

On 6/24/07, Paul Moore wrote:
> Count me as +0 on allowing a.b, and -1 on allowing a.b
> unless b contains leading underscores.

FWIW, I do want to allow a.b, because it means I can more easily pass
locals(), instead of creating a one-use near-boilerplate dictionary,
such as

    {"a": a, "b": b, "name": c.name}

I do like the "no attributes with leading underscores" restriction as
the default; these shouldn't be part of the public API. If they are
needed, there should be an alias, and if there isn't an alias, then
... make it easy to override the policy.

If the restriction were actually "no magic attributes", so that
_myfile was fine, but __file__ wasn't, that would work even better --
except that it would encourage people to use __attributes__ when they
shouldn't, just to get the protection.

-jJ

From python3now at gmail.com  Mon Jun 25 18:36:52 2007
From: python3now at gmail.com (James Thiele)
Date: Mon, 25 Jun 2007 09:36:52 -0700
Subject: [Python-3000] Bug(s) in 2to3
Message-ID: <8f01efd00706250936u18dcaa7x918fd5bebdf33b1@mail.gmail.com>

After checking out the subversion repository of 2to3 yesterday I found
two cases where refactor.py failed.

It didn't like this line:

    example.py:322: print h.iterkeys().next()

throwing:

    AttributeError: 'DelayedStrNode' object has no attribute 'set_prefix'

The attached file "dict_ex.py" is a short example which also gets this
error.

refactor.py also didn't like:

    lineno, line = lineno+1, f.next()

also throwing:

    AttributeError: 'DelayedStrNode' object has no attribute 'get_prefix'

The attached file "tup.py" is a short example which also gets this
error. The attached file "no_tup.py" comments out the offending line
and doesn't throw the exception.

The attached file "transcript" contains a shell session with full
tracebacks. The line numbers in the tracebacks may vary slightly from
the repository versions due to debug code used to isolate the problem.
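For context, the first failing idiom has a direct Python 3 spelling;
the sketch below shows the translation that 2to3's dict and next
fixers are intended to produce once the bug is fixed (assuming h is a
plain dict):

    h = {'a': 1, 'b': 2}

    # Python 2 idiom that made refactor.py crash:
    #     print h.iterkeys().next()
    # Intended Python 3 output:
    print(next(iter(h.keys())))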
Name: dict_ex.py Type: application/octet-stream Size: 33 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070625/3d2b45a6/attachment-0001.obj -------------- next part -------------- A non-text attachment was scrubbed... Name: tup.py Type: application/octet-stream Size: 226 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070625/3d2b45a6/attachment-0002.obj -------------- next part -------------- A non-text attachment was scrubbed... Name: no_tup.py Type: application/octet-stream Size: 228 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070625/3d2b45a6/attachment-0003.obj From collinw at gmail.com Mon Jun 25 18:42:50 2007 From: collinw at gmail.com (Collin Winter) Date: Mon, 25 Jun 2007 09:42:50 -0700 Subject: [Python-3000] Bug(s) in 2to3 In-Reply-To: <8f01efd00706250936u18dcaa7x918fd5bebdf33b1@mail.gmail.com> References: <8f01efd00706250936u18dcaa7x918fd5bebdf33b1@mail.gmail.com> Message-ID: <43aa6ff70706250942s23c0bd5fve015164de07bbbff@mail.gmail.com> On 6/25/07, James Thiele wrote: > After checking out subversion repository of 2to3 yesterday I found two > cases where refactor.py failed. I'll fix this. Thanks for the bug report. Collin Winter From alexandre at peadrop.com Mon Jun 25 19:18:25 2007 From: alexandre at peadrop.com (Alexandre Vassalotti) Date: Mon, 25 Jun 2007 13:18:25 -0400 Subject: [Python-3000] Fix for Lib/pydoc.py in p3yk Message-ID: Hi, I found two small bugs in pydoc.py. The patch is rather simple, so I doubt I have to explain it. Note, I removed the -*- coding: -*- tag, since the encoding of pydoc.py is actually utf-8, not Latin-1 (at least, that's what Emacs told me). -- Alexandre -------------- next part -------------- A non-text attachment was scrubbed... Name: pydoc-fix.patch Type: text/x-patch Size: 825 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070625/344fee3d/attachment.bin From g.brandl at gmx.net Mon Jun 25 19:47:54 2007 From: g.brandl at gmx.net (Georg Brandl) Date: Mon, 25 Jun 2007 19:47:54 +0200 Subject: [Python-3000] Fix for Lib/pydoc.py in p3yk In-Reply-To: References: Message-ID: Alexandre Vassalotti schrieb: > Hi, > > I found two small bugs in pydoc.py. The patch is rather simple, so I doubt > I have to explain it. You've submitted this before; I've already committed it to SVN. > Note, I removed the -*- coding: -*- tag, since > the encoding of pydoc.py is actually utf-8, not Latin-1 (at least, that's > what Emacs told me). AFAICS, the file doesn't have any non-ascii characters in it, so actually it's both latin1 and utf8 :) Georg From alexandre at peadrop.com Mon Jun 25 20:14:16 2007 From: alexandre at peadrop.com (Alexandre Vassalotti) Date: Mon, 25 Jun 2007 14:14:16 -0400 Subject: [Python-3000] StringIO/BytesIO in io.py doesn't over-seek properly In-Reply-To: References: Message-ID: On 6/23/07, Guido van Rossum wrote: > On 6/23/07, Alexandre Vassalotti wrote: > > I agree with this. I will try to write a patch to fix io.BytesIO. > > Great! I got the patch (it's attached to this email). The fix was simpler than I thought. I would like to write a unittest for it, but I am not sure where it should go in test_io.py. From what I see, MemorySeekTestMixin is for testing read/seek operation common to BytesIO and StringIO, so I can't put it there. And I don't really like the idea of adding another test in IOTest.test_raw_bytes_io. 
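For the record, the kind of test I have in mind would look something like this (the class name and its placement are made up, not part of the patch):

    import io
    import unittest

    class OverSeekTest(unittest.TestCase):
        def test_write_after_overseek_zero_fills(self):
            b = io.BytesIO()
            b.write(b'abc')
            # Seek past the current end, then write; the gap should
            # come back as null bytes, just as with a real file.
            b.seek(6)
            b.write(b'x')
            self.assertEqual(b.getvalue(), b'abc\x00\x00\x00x')

    if __name__ == '__main__':
        unittest.main()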
By the way, I am having the same problem for the tests of _string_io and _bytes_io -- i.e., I don't know exactly how to organize them with the rest of the tests in test_io.py. > > Free the resources held by the object, and make all methods of the > > object raise a ValueError if they are used. > > I'm not sure what the use case for that is (even though the 2.x > StringIO does this). > It seems the close method on TextIOWrapper objects is broken too (or at least, bizarre): >>> f = open('test', 'w') >>> f.write('hello') 5 >>> f.close() >>> f.write('hello') 5 >>> ^D $ hd test 00000000 68 65 6c 6c 6f |hello| 00000005 -- Alexandre -------------- next part -------------- A non-text attachment was scrubbed... Name: overseek-bytesio.patch Type: text/x-patch Size: 608 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070625/25c904d4/attachment.bin From alexandre at peadrop.com Mon Jun 25 20:22:15 2007 From: alexandre at peadrop.com (Alexandre Vassalotti) Date: Mon, 25 Jun 2007 14:22:15 -0400 Subject: [Python-3000] Fix for Lib/pydoc.py in p3yk In-Reply-To: References: Message-ID: On 6/25/07, Georg Brandl wrote: > Alexandre Vassalotti schrieb: > > I found two small bugs in pydoc.py. The patch is rather simple, so I doubt > > I have to explain it. > > You've submitted this before; I've already committed it to SVN. > Really??? I don't remember this ... My last patch was against pdb.py, not pydoc.py > > Note, I removed the -*- coding: -*- tag, since > > the encoding of pydoc.py is actually utf-8, not Latin-1 (at least, that's > > what Emacs told me). > > AFAICS, the file doesn't have any non-ascii characters in it, so actually > it's both latin1 and utf8 :) Ah! -- Alexandre From alexandre at peadrop.com Mon Jun 25 20:27:55 2007 From: alexandre at peadrop.com (Alexandre Vassalotti) Date: Mon, 25 Jun 2007 14:27:55 -0400 Subject: [Python-3000] Fix for Lib/pydoc.py in p3yk In-Reply-To: References: Message-ID: Meanwhile, I found another division/range combination that could be problematic. I attached an updated patch. -- Alexandre On 6/25/07, Alexandre Vassalotti wrote: > Hi, > > I found two small bugs in pydoc.py. The patch is rather simple, so I doubt > I have to explain it. Note, I removed the -*- coding: -*- tag, since > the encoding of pydoc.py is actually utf-8, not Latin-1 (at least, that's > what Emacs told me). > -------------- next part -------------- A non-text attachment was scrubbed... Name: pydoc-fix-2.patch Type: text/x-patch Size: 1225 bytes Desc: not available Url : http://mail.python.org/pipermail/python-3000/attachments/20070625/ccddc33d/attachment-0001.bin From alexandre at peadrop.com Mon Jun 25 20:44:42 2007 From: alexandre at peadrop.com (Alexandre Vassalotti) Date: Mon, 25 Jun 2007 14:44:42 -0400 Subject: [Python-3000] Fix for Lib/pydoc.py in p3yk In-Reply-To: References: Message-ID: On 6/25/07, Alexandre Vassalotti wrote: > On 6/25/07, Georg Brandl wrote: > > You've submitted this before; I've already committed it to SVN. > > Really??? I don't remember this ... My last patch was against pdb.py, > not pydoc.py Nevermind, I just found out someone else already sent a patch (Patch #1739659). Sorry for the noise, -- Alexandre From rowen at cesmail.net Mon Jun 25 21:55:53 2007 From: rowen at cesmail.net (Russell E. Owen) Date: Mon, 25 Jun 2007 12:55:53 -0700 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!)
References: <4677E097.5060205@online.de> <87bqfcj97n.fsf@ten22.rhodesmill.org> <20070624132756.7998.JCARLSON@uci.edu> Message-ID: In article <20070624132756.7998.JCARLSON at uci.edu>, Josiah Carlson wrote: > ...one could make the argument that TOOTDI says that literals and > generators + constructors are the only reasonable options. > Comprehensions save perhaps 5 characters over the constructor method, > and may be a bit faster, but result in the asymmetry above. But I will > admit that comprehension syntax is not likely to be going anywhere, and > dictionary comprehensions are not likely to be added (and neither are > tuple comprehensions). OK, I'll bite. Does Python really need both list comprehensions and generator expressions? Perhaps list comprehensions should go away in Python 3000? I'm sure it's been discussed (I'm late to this list) and a google search showed a few blog entries but nothing more. -- Russell From jcarlson at uci.edu Tue Jun 26 05:05:41 2007 From: jcarlson at uci.edu (Josiah Carlson) Date: Mon, 25 Jun 2007 20:05:41 -0700 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: References: <20070624132756.7998.JCARLSON@uci.edu> Message-ID: <20070625193345.79AA.JCARLSON@uci.edu> "Russell E. Owen" wrote: > In article <20070624132756.7998.JCARLSON at uci.edu>, > Josiah Carlson wrote: > > > ...one could make the argument that TOOTDI says that literals and > > generators + constructors are the only reasonable options. > > Comprehensions save perhaps 5 characters over the constructor method, > > and may be a bit faster, but result in the asymmetry above. But I will > > admit that comprehension syntax is not likely to be going anywhere, and > > dictionary comprehensions are not likely to be added (and neither are > > tuple comprehensions). > > OK, I'll bite. Does Python really need both list comprehensions and > generator expressions? Perhaps list comprehensions should go away in > Python 3000? I'm sure it's been discussed (I'm late to this list) and a > google search showed a few blog entries but nothing more. If list comprehensions went away, then it would make sense for set comprehensions to go away too (being that list comprehensions arguably have far more example uses in real code, and perhaps more use-cases). - Josiah From cspencer at cinci.rr.com Tue Jun 26 20:47:14 2007 From: cspencer at cinci.rr.com (Chris Spencer) Date: Tue, 26 Jun 2007 14:47:14 -0400 Subject: [Python-3000] An impassioned plea for true multithreading Message-ID: <1sm2835aicn1fisghnmq3nt1cqd9crsj12@4ax.com> I know this is probably futile, but I'm going to ask anyway. Since I have not the time (or ability) to code this, I am not even submitting a PEP. I'm throwing this out there on the wind. Since we're doing a lot of work that breaks backwards compatibility, one more piece of breakage needs to happen. We need to have true multithreading. Reasons: 1. Most people who bought a computer in the past year bought a dual-core processor with it. Quad-cores are going to take over the market in 2008. To not be able to take advantage of these extra cores is an inherent language disadvantage. Yes, you can run more than one process and do some sort of IPC, but it requires a lot more work for the coder and a lot more complexity in the code (i.e., more bugs). 2. It makes writing servers so much easier on Windows systems (you know, the OS without an effective "fork" mechanism).
To simply stick your fingers in your ears and yell "LA LA LA" in the hopes Windows will go away is not effective language design. 3. C# and Java have true multithreading. Ruby doesn't. Let's get it before Ruby does. 4. It will actually speed up the Python interpreter. Not at first, but I'm certain there's a level of parallelism in the Python bytecode that can be exploited by threaded branch prediction and concurrent processing. For example, a generator that figures out its next value BEFORE being called, so that it simply returns the value when the next item is requested. I speculate that with true multitasking, an optimized python interpreter will appear within a year to take advantage of these possibilities. I hope the thoughts behind this email aren't outweighed by the fact that it didn't go through the proper channels. Thank you for your time. Christoper L. Spencer CTO ROXX, LLC 4515 Leslie Ave. Cincinnati, OH 45242 TEL: 513-545-7057 EMAIL: cspencer at cinci.rr.com From ronaldoussoren at mac.com Tue Jun 26 21:23:00 2007 From: ronaldoussoren at mac.com (Ronald Oussoren) Date: Tue, 26 Jun 2007 12:23:00 -0700 Subject: [Python-3000] An impassioned plea for true multithreading In-Reply-To: <1sm2835aicn1fisghnmq3nt1cqd9crsj12@4ax.com> References: <1sm2835aicn1fisghnmq3nt1cqd9crsj12@4ax.com> Message-ID: On Tuesday, June 26, 2007, at 08:49PM, "Chris Spencer" wrote: > I know this is probably futile, but I'm going to ask anyway. >Since I have not the time (or ability) to code this, I am not even >submitting a PEP. I'm throwing this out there on the wind. > Since we're doing a lot of work that breaks backwards >compatibility, one more piece of breakage needs to happen. We need to >have true multithreading. This request comes up from time to time and the standard OSS mantra applies here: show us the code. None of the core developers is interested enough to work on this and it is far from certain that removing the GIL can be done without massive restructuring of the core interpreter or loss of performance (and possibly both). Someone tried to remove the GIL several years ago (Google should be able to tell you about this) and ended up with a working but significantly slower interpreter. > >Reasons: [snip the same old reasons] Ronald From rowen at cesmail.net Wed Jun 27 02:23:13 2007 From: rowen at cesmail.net (Russell E. Owen) Date: Tue, 26 Jun 2007 17:23:13 -0700 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) References: <20070624132756.7998.JCARLSON@uci.edu> <20070625193345.79AA.JCARLSON@uci.edu> Message-ID: In article <20070625193345.79AA.JCARLSON at uci.edu>, Josiah Carlson wrote: > "Russell E. Owen" wrote: > > In article <20070624132756.7998.JCARLSON at uci.edu>, > > Josiah Carlson wrote: > > > > > ...one could make the argument that TOOTDI says that literals and > > > generators + constructors are the only reasonable options. > > > Comprehensions save perhaps 5 characters over the constructor method, > > > and may be a bit faster, but result in the asymmetry above. But I will > > > admit that comprehension syntax is not likely to be going anywhere, and > > > dictionary comprehensions are not likely to be added (and neither are > > > tuple comprehensions). > > > > OK, I'll bite. Does Python really need both list comprehensions and > > generator expressions? Perhaps list comprehensions should go away in > > Python 3000? I'm sure it's been discussed (I'm late to this list) and a > > google search showed a few blog entries but nothing more.
> > If list comprehensions went away, then it would make sense for set > comprehensions to go away too (being that list comprehensions arguably > have far more example uses in real code, and perhaps more use-cases). I would personally be happy to lose set comprehensions and just use generator expressions for all comprehension-like tasks. -- Russell From martin at v.loewis.de Wed Jun 27 05:43:54 2007 From: martin at v.loewis.de ("Martin v. Löwis") Date: Wed, 27 Jun 2007 05:43:54 +0200 Subject: [Python-3000] An impassioned plea for true multithreading In-Reply-To: <1sm2835aicn1fisghnmq3nt1cqd9crsj12@4ax.com> References: <1sm2835aicn1fisghnmq3nt1cqd9crsj12@4ax.com> Message-ID: <4681DCFA.4050000@v.loewis.de> Chris Spencer schrieb: > I know this is probably futile, but I'm going to ask anyway. > Since I have not the time (or ability) to code this, I am not even > submitting a PEP. I'm throwing this out there on the wind. Just to second Ronald's sentiment: it won't happen unless somebody does it, and it is highly unlikely that somebody will. > Since we're doing a lot of work that breaks backwards > compatibility, one more piece of breakage needs to happen. We need to > have true multithreading. Be careful when using the pronoun "we"; in the first sentence, it seems to not include yourself, and in the second sentence, it does not include myself. Regards, Martin From rasky at develer.com Wed Jun 27 11:26:59 2007 From: rasky at develer.com (Giovanni Bajo) Date: Wed, 27 Jun 2007 11:26:59 +0200 Subject: [Python-3000] An impassioned plea for true multithreading In-Reply-To: <1sm2835aicn1fisghnmq3nt1cqd9crsj12@4ax.com> References: <1sm2835aicn1fisghnmq3nt1cqd9crsj12@4ax.com> Message-ID: On 26/06/2007 20.47, Chris Spencer wrote: > 1. Most people who bought a computer in the past year bought a > dual-core processor with it. Quad-cores are going to take over the > market in 2008. To not be able to take advantage of these extra cores > is an inherent language disadvantage. Yes, you can run more than one > process and do some sort of IPC, but it requires a lot more work for > the coder and a lot more complexity in the code (ie more bugs). In my experience, it's multi-threading that gives you endless bugs without any hope of getting debugged and fixed. Multi-processing (carefully coupled with event-based programming) instead gives you a solid program with small parts which can be run and tested individually. In fact, I am *happy* that Python does not have true multithreading: this forces people to design programs the right way from the beginning (unless you want the typical quick, non-performance-sensitive, fast-hack thread, and in that case Python's multithreading with GIL is more than enough). So please don't say that Python isn't able to exploit quad-cores: it's a false statement. On the contrary: it lets you use them CORRECTLY, without shared memory issues. Have a look at the package called "processing" in PyPI. You might find it interesting. -- Giovanni Bajo From gproux+py3000 at gmail.com Wed Jun 27 12:44:51 2007 From: gproux+py3000 at gmail.com (Guillaume Proux) Date: Wed, 27 Jun 2007 19:44:51 +0900 Subject: [Python-3000] An impassioned plea for true multithreading In-Reply-To: References: <1sm2835aicn1fisghnmq3nt1cqd9crsj12@4ax.com> Message-ID: <19dd68ba0706270344q27fe5e7fg3bc15f70336db23d@mail.gmail.com> My 2 cents... I have really felt the need for real multithreading when I have tried programming multimedia with python (pygame).
Doing scene management at the same time as other processing that requires quasi-realtime performance (video decoding) is just basically impossible (to say nothing of the garbage collector kicking in when the bad guy is about to shoot you!) Of course, one solution is to make a multithreaded scene-graph engine in C++ and control that engine from Python but then it just proves the point that not everything can be scaled up through increasing the number of processes. Some things just cannot be scaled up when it is required to have simultaneous access to the same dataset. Regards, Guillaume From greg.ewing at canterbury.ac.nz Thu Jun 28 02:37:20 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Thu, 28 Jun 2007 12:37:20 +1200 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: References: <20070624132756.7998.JCARLSON@uci.edu> <20070625193345.79AA.JCARLSON@uci.edu> Message-ID: <468302C0.3050808@canterbury.ac.nz> Russell E. Owen wrote: > I would personally be happy to lose set comprehensions and just use > generator expressions for all comprehension-like tasks. One advantage of the comprehension syntaxes is that the body can be inlined instead of relegated to a lambda, saving the overhead of a Python function call per loop. It would be difficult to do that optimisation with a generator unless things like list(generator) were recognised and special-cased somehow. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing at canterbury.ac.nz +--------------------------------------+ From ncoghlan at gmail.com Thu Jun 28 15:56:25 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 28 Jun 2007 23:56:25 +1000 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: <468302C0.3050808@canterbury.ac.nz> References: <20070624132756.7998.JCARLSON@uci.edu> <20070625193345.79AA.JCARLSON@uci.edu> <468302C0.3050808@canterbury.ac.nz> Message-ID: <4683BE09.4010702@gmail.com> Greg Ewing wrote: > Russell E. Owen wrote: >> I would personally be happy to lose set comprehensions and just use >> generator expressions for all comprehension-like tasks. > > One advantage of the comprehension syntaxes is that the > body can be inlined instead of relegated to a lambda, > saving the overhead of a Python function call per > loop. I'm not sure what you mean by "function call per loop" in this paragraph. There is no function call per loop even when using a generator expression - a generator function is implicitly defined, and then called once to instantiate the generator. Iterating over this suspends and resumes the generator to retrieve each item, rather than making a Python function call as such - is that behaviour what you were referring to? Regardless, what the list and set comprehension syntax saves you is that instead of having to suspend/resume a generator multiple times while iterating over it to fill the container, the implicitly defined function creates and populates the desired container type directly. These operations are also compiled to use special opcodes, so they should be significantly faster than the corresponding pure Python code would be.
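A rough way to check this yourself is dis: look for a LIST_APPEND opcode in the first case (possibly inside a nested code object, depending on the version), versus a generator plus a call to list() in the second. A minimal sketch:

    import dis
    dis.dis(compile("[x*x for x in data]", "<demo>", "eval"))
    dis.dis(compile("list(x*x for x in data)", "<demo>", "eval"))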
(I'd provide some timing figures, but my Py3k checkout is somewhat stale, so the timeit module isn't working for me at the moment) To get back to the original question, I believe the point of adding set literal and comprehension syntax is to make it possible to easily speed up membership tests for items in known groups - the existing list literals are fast to create, but slow to search. Using a set literal instead of a list literal is also a good way to make it explicit that the order in which the items are added to the container is arbitrary and coincidental, rather than having any significant meaning. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From alexandre at peadrop.com Thu Jun 28 16:37:01 2007 From: alexandre at peadrop.com (Alexandre Vassalotti) Date: Thu, 28 Jun 2007 10:37:01 -0400 Subject: [Python-3000] StringIO/BytesIO in io.py doesn't over-seek properly In-Reply-To: References: Message-ID: Can someone, other than Guido, review my patch? He is on vacation right now, so he probably won't have the time to review and submit it until August. Thanks, -- Alexandre On 6/25/07, Alexandre Vassalotti wrote: > On 6/23/07, Guido van Rossum wrote: > > On 6/23/07, Alexandre Vassalotti wrote: > > > I agree with this. I will try to write a patch to fix io.BytesIO. > > > > Great! > > I got the patch (it's attached to this email). The fix was simpler > than I thought. > > I would like to write a unittest for it, but I am not sure where it > should go in test_io.py. From what I see, MemorySeekTestMixin is for > testing read/seek operation common to BytesIO and StringIO, so I can't > put it there. And I don't really like the idea of adding another test > in IOTest.test_raw_bytes_io. > > By the way, I am having the same problem for the tests of _string_io > and _bytes_io -- i.e., I don't know exactly how to organize them with > the rest of the tests in test_io.py. > > > > Free the resources held by the object, and make all methods of the > > > object raise a ValueError if they are used. > > > > I'm not sure what the use case for that is (even though the 2.x > > StringIO does this). > > > > It seems the close method on TextIOWrapper objects is broken too (or at > least, bizarre): > > >>> f = open('test', 'w') > >>> f.write('hello') > 5 > >>> f.close() > >>> f.write('hello') > 5 > >>> ^D > $ hd test > 00000000 68 65 6c 6c 6f |hello| > 00000005 > > > -- Alexandre > > -- Alexandre Vassalotti From tav at espians.com Thu Jun 28 17:59:31 2007 From: tav at espians.com (tav) Date: Thu, 28 Jun 2007 16:59:31 +0100 Subject: [Python-3000] pimp; restructuring the standard library Message-ID: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com> rehi all, First of all, I'd like to say "fucking great work!". Whilst initially skeptical about Python 3.0, I really love all the decisions that have been made so far. Python 3.0 is looking like it's going to be a great language! Thank you ever so much to all those who've put in their time and effort. Now, one of the killer features of Python has always been its batteries-included standard library. However, until recently, this has been somewhat neglected. With 3.0, we have a chance to rectify this and bring it up-to-date with the modern era. I don't think what PEP 3001 currently suggests goes far enough in this regard. It seems to be treating the change as a usual python 2.x -> 2.x+1 change.
I'd like to suggest a complete overhaul of the standard library, and along with it, perhaps some changes to the import mechanism. * Structured hierarchy (this seems to be something that already has support). * Abandoning of unit tests and replacing with full doctest coverage in the style perfected by Jim Fulton and PJE. Integration with py.test. * Ability to import from remote networked sources, e.g. import http://foo.com/blah/ * Authentication of sources with configurable crypto. * Full integration with setuptools + eggs. * Pluggable integration support for version control systems like svn/bzr. * Builtin versioning support for modules. * Live-update of modules/code support (in the vein of Erlang). * Rewrite of standard library to be more adaptable, concurrent, and pertaining to object capability. This way, we can have a secure, composable and parallelisable standard library! * Integration of "best-of" libraries out there. (Obviously subjective...) * Grouped imports/exports, e.g. from module import :api, instead of the current all or nothing, from module import * Now, this might seem a bit much but if done well, I think it can provide Python a huge leap over other languages... I have already worked on this for my own projects by implementing an import replacement called ``pimp`` in python 2.x. See: https://svn.espnow.net/24weeks/trunk/source/python/importer/pimp/pimp.py And, have been working on structuring code for my own uses under: https://svn.espnow.net/24weeks/trunk/source/python/ Hope this all makes some kind of sense... your thoughts will be much appreciated. Thanks! -- love, tav founder and ceo, esp metanational llp plex:espians/tav | tav at espians.com | +44 (0) 7809 569 369 From pje at telecommunity.com Thu Jun 28 18:41:30 2007 From: pje at telecommunity.com (Phillip J. Eby) Date: Thu, 28 Jun 2007 12:41:30 -0400 Subject: [Python-3000] pimp; restructuring the standard library In-Reply-To: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.co m> References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com> Message-ID: <20070628163922.A40063A40AF@sparrow.telecommunity.com> At 04:59 PM 6/28/2007 +0100, tav wrote: >* Abandoning of unit tests and replacing with full doctest coverage in >the style perfected by Jim Fulton and PJE. Integration with py.test. I believe that the origination credit for this rightly falls to Tim Peters. (I just copied Jim, myself.) Meanwhile, there are quite a few stdlib doctests now, and unittests still more than have their place. Indeed, I'm also wary of breaking backward compatibility of unittest or doctest in Python 3.0, because that will make it even harder to port code over. How will 2.x users run their existing test suites to verify their code has been ported correctly, if they can't keep using unittest? As it is, they'll have to run them through 2to3, which AFAIK doesn't do doctests currently. >* Ability to import from remote networked sources, e.g. import >http://foo.com/blah/ A strong -1 on any import system that breaks down the current strict separation between module names and module *locations*. Too many people confuse these concepts already, and we already have a nicely field-tested mechanism for specifying locations and turning them into importer objects. >* Authentication of sources with configurable crypto. > >* Full integration with setuptools + eggs. > >* Pluggable integration support for version control systems like svn/bzr. > >* Builtin versioning support for modules. 
> >* Live-update of modules/code support (in the vein of Erlang). > >* Rewrite of standard library to be more adaptable, concurrent, and >pertaining to object capability. This way, we can have a secure, >composable and parallelisable standard library! Um, and who are you volunteering to do all this work? i.e., "you and what army?" :) >Hope this all makes some kind of sense... your thoughts will be much >appreciated. Thanks! My thought is that you've just proposed several major PEPs that are already too late for Python 3.0 and would probably have been rejected or deferred anyway. I also think it's more likely that your ideas would find more interest/support with the PyPy project than with mainline Python, as some of them at least vaguely resemble some things they are working on, and/or would be easier to implement using PyPy object spaces. From tav at espians.com Thu Jun 28 19:03:31 2007 From: tav at espians.com (tav) Date: Thu, 28 Jun 2007 18:03:31 +0100 Subject: [Python-3000] pimp; restructuring the standard library In-Reply-To: <20070628163922.A40063A40AF@sparrow.telecommunity.com> References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com> <20070628163922.A40063A40AF@sparrow.telecommunity.com> Message-ID: <95d8c0810706281003l362e5a7ne2108132cb1a0065@mail.gmail.com> > Indeed, I'm also wary of breaking backward compatibility of unittest > or doctest in Python 3.0, because that will make it even harder to > port code over. How will 2.x users run their existing test suites to > verify their code has been ported correctly, if they can't keep using > unittest? As it is, they'll have to run them through 2to3, which > AFAIK doesn't do doctests currently. Ah, wasn't suggesting dumping the unittest module. Just that tests in the "standard library" should be doctest-based as these are much nicer and more useful! > A strong -1 on any import system that breaks down the current strict > separation between module names and module *locations*. Too many > people confuse these concepts already, and we already have a nicely > field-tested mechanism for specifying locations and turning them into > importer objects. I agree with your -1. Let me rephrase that as being able to use any character in a str for import as opposed to the current limited set in 2.x. I understand that python literals are much broader in 3.0, how does that impact import? > >* Authentication of sources with configurable crypto. > > > >* Full integration with setuptools + eggs. > > > >* Pluggable integration support for version control systems like svn/bzr. > > > >* Builtin versioning support for modules. > > > >* Live-update of modules/code support (in the vein of Erlang). > > > >* Rewrite of standard library to be more adaptable, concurrent, and > >pertaining to object capability. This way, we can have a secure, > >composable and parallelisable standard library! > > Um, and who are you volunteering to do all this work? i.e., "you and > what army?" :) Well, all that code being added to PyPI and the ASPN Python cookbook ain't being done by imagination alone... ;p Seriously, with: a). clear 'lead by example' initial set of how modules should work (with the above mentioned feature sets) b). a provisional incentive model (say, via a gift economy model for all those who have contributed to the standard library) c).
a simple hook in the importer which keeps track of which modules/code are used, and which is used to remunerate the army (if anyone ever contributes financially to it) ;p > My thought is that you've just proposed several major PEPs that are > already too late for Python 3.0 and would probably have been rejected > or deferred anyway. I understood that issues relating to the standard library were still not fixed-in-stone for python 3.0? > I also think it's more likely that your ideas would find more > interest/support with the PyPy project than with mainline Python, as > some of them at least vaguely resemble some things they are working > on, and/or would be easier to implement using PyPy object spaces. This is definitely true. But I want to see all this in Python 3.0... And, I don't see any of the changes proposed requiring any changes to the language.. besides the __subclasses__ and func_closure thing we discussed on python-dev. -- love, tav founder and ceo, esp metanational llp plex:espians/tav | tav at espians.com | +44 (0) 7809 569 369 From brett at python.org Thu Jun 28 19:29:03 2007 From: brett at python.org (Brett Cannon) Date: Thu, 28 Jun 2007 10:29:03 -0700 Subject: [Python-3000] pimp; restructuring the standard library In-Reply-To: <95d8c0810706281003l362e5a7ne2108132cb1a0065@mail.gmail.com> References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com> <20070628163922.A40063A40AF@sparrow.telecommunity.com> <95d8c0810706281003l362e5a7ne2108132cb1a0065@mail.gmail.com> Message-ID: On 6/28/07, tav wrote: > > Indeed, I'm also wary of breaking backward compatibility of unittest > > or doctest in Python 3.0, because that will make it even harder to > > port code over. How will 2.x users run their existing test suites to > > verify their code has been ported correctly, if they can't keep using > > unittest? As it is, they'll have to run them through 2to3, which > > AFAIK doesn't do doctests currently. > > Ah, wasn't suggesting dumping the unittest module. Just that tests in > the "standard library" should be doctest-based as these are much nicer > and more useful! But that is your opinion. I personally prefer unittest and find it just as useful. If you want to get more doctest usage then you can convert some tests over from the old stdout-comparison style to doctest. > > > A strong -1 on any import system that breaks down the current strict > > separation between module names and module *locations*. Too many > > people confuse these concepts already, and we already have a nicely > > field-tested mechanism for specifying locations and turning them into > > importer objects. > > I agree with your -1. > > Let me rephrase that as being able to use any character in a str for > import as opposed to the current limited set in 2.x. I understand that > python literals are much broader in 3.0, how does that impact import? > You would need to change the grammar to get this to work. And at that point I would say you are better off developing a function to handle the import than tweaking the grammar. > > >* Authentication of sources with configurable crypto. > > > > > >* Full integration with setuptools + eggs. > > > > > >* Pluggable integration support for version control systems like svn/bzr. > > > > > >* Builtin versioning support for modules. > > > > > >* Live-update of modules/code support (in the vein of Erlang). > > > > > >* Rewrite of standard library to be more adaptable, concurrent, and > > >pertaining to object capability.
This way, we can have a secure, > > >composable and parallelisable standard library! > > > > Um, and who are you volunteering to do all this work? i.e., "you and > > what army?" :) > > Well, all that code being added to PyPI and the ASPN Python cookbook > ain't being done by imagination alone... ;p > > Seriously, with: > > a). clear 'lead by example' initial set of how modules should work > (with the above mentioned feature sets) What is that supposed to mean? Modules work how they work. If you are after specific style guidelines in terms of structure you are not going to get one since each module has its own needs. And taking volunteer code already is hard enough; forcing a specific structure just makes getting help that much harder. > > b). a provisional incentive model (say, via a gift economy model for > all those who have contributed to the standard library) > Are we going to give gifts to everyone who has already contributed, with interest? And where is this money coming from? I just see people tossing stuff at us just to get the money, and I don't want that. > c). a simple hook in the importer which keeps track of which modules/code > are used, and which is used to remunerate the army (if anyone ever > contributes financially to it) ;p > So you want every execution of a Python program to know who contributed code to its execution? It's called Misc/ACKS. > > My thought is that you've just proposed several major PEPs that are > > already too late for Python 3.0 and would probably have been rejected > > or deferred anyway. > > I understood that issues relating to the standard library were still > not fixed-in-stone for python 3.0? > Right, but that is mostly the reorganization, renaming, etc. It does not include changes to how import works, etc. that would require a PEP. > > I also think it's more likely that your ideas would find more > > interest/support with the PyPy project than with mainline Python, as > > some of them at least vaguely resemble some things they are working > > on, and/or would be easier to implement using PyPy object spaces. > > This is definitely true. But I want to see all this in Python 3.0... =) Well, everyone wants to see everything they want in the next version of Python. Even core developers don't always get what they want in a release. -Brett From theller at ctypes.org Thu Jun 28 19:11:00 2007 From: theller at ctypes.org (Thomas Heller) Date: Thu, 28 Jun 2007 19:11:00 +0200 Subject: [Python-3000] Py3k doesn't understand octal literals (on Windows) Message-ID: In a break from real work, I wanted to play a little with 3.0. Did svn update, and built on Windows. Unfortunately, the resulting Python does not understand the new octal literals like 0o777, so importing 'os' fails: 'import site' failed; use -v for traceback Python 3.0x (p3yk:55071M, May 2 2007, 13:50:45) [MSC v.1310 32 bit (Intel)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> import os Traceback (most recent call last): File "", line 1, in File "c:\svn\p3yk\lib\os.py", line 150 def makedirs(name, mode=0o777): ^ SyntaxError: invalid syntax >>> Any hints on what I have to do to make this work? Thanks, Thomas From theller at ctypes.org Thu Jun 28 20:33:57 2007 From: theller at ctypes.org (Thomas Heller) Date: Thu, 28 Jun 2007 20:33:57 +0200 Subject: [Python-3000] Py3k doesn't understand octal literals (on Windows) In-Reply-To: References: Message-ID: Thomas Heller schrieb: > In a break from real work, I wanted to play a little with 3.0.
Did svn update, and built on Windows. > Unfortunately, the resulting Python does not understand the new octal literals like 0o777, so > importing 'os' fails: > > 'import site' failed; use -v for traceback > Python 3.0x (p3yk:55071M, May 2 2007, 13:50:45) [MSC v.1310 32 bit (Intel)] on win32 > Type "help", "copyright", "credits" or "license" for more information. >>>> import os > Traceback (most recent call last): > File "", line 1, in > File "c:\svn\p3yk\lib\os.py", line 150 > def makedirs(name, mode=0o777): > ^ > SyntaxError: invalid syntax >>>> > > Any hints on what I have to do to make this work? > > Thanks, > Thomas > Sorry for the false alarm, it was all my fault. Thomas From chrism at plope.com Thu Jun 28 22:04:20 2007 From: chrism at plope.com (Chris McDonough) Date: Thu, 28 Jun 2007 16:04:20 -0400 Subject: [Python-3000] doctests vs. unittests (was Re: pimp; restructuring the standard library) In-Reply-To: <20070628163922.A40063A40AF@sparrow.telecommunity.com> References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com> <20070628163922.A40063A40AF@sparrow.telecommunity.com> Message-ID: On Jun 28, 2007, at 12:41 PM, Phillip J. Eby wrote: > At 04:59 PM 6/28/2007 +0100, tav wrote: >> * Abandoning of unit tests and replacing with full doctest >> coverage in >> the style perfected by Jim Fulton and PJE. Integration with py.test. > > I believe that the origination credit for this rightly falls to Tim > Peters. (I just copied Jim, myself.) Meanwhile, there are quite a > few stdlib doctests now, and unittests still more than have their > place. > > Indeed, I'm also wary of breaking backward compatibility of unittest > or doctest in Python 3.0, because that will make it even harder to > port code over. How will 2.x users run their existing test suites to > verify their code has been ported correctly, if they can't keep using > unittest? As it is, they'll have to run them through 2to3, which > AFAIK doesn't do doctests currently. I've historically not been a huge fan of doctests because (these things may have changed since last I used doctest in anger): a) If one of your fixture calls or an assertion fails for some reason, the rest of the test trips over itself trying to complete, usually without success because an invariant hasn't been met, and you need to scroll through a bunch of decoy output to see where the actual problem began. b) I often use test bodies as convenient points to put a pdb.set_trace call if I want to debug something. This wasn't very well supported when I was trying to use doctest. As a result, I still use unittest pretty much exclusively to write tests. I'd be sad if it went away. - C From fdrake at acm.org Thu Jun 28 22:20:31 2007 From: fdrake at acm.org (Fred L. Drake, Jr.) Date: Thu, 28 Jun 2007 16:20:31 -0400 Subject: [Python-3000] doctests vs. unittests (was Re: pimp; restructuring the standard library) In-Reply-To: References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com> <20070628163922.A40063A40AF@sparrow.telecommunity.com> Message-ID: <200706281620.32051.fdrake@acm.org> On Thursday 28 June 2007, Chris McDonough wrote: > a) If one of your fixture calls or an assertion fails for some > reason, the rest of the test > trips over itself trying to complete, usually without success > because an invariant > hasn't been met, and you need to scroll through a bunch of decoy > output to > see where the actual problem began.
The testrunner in zope.testing handles this by providing an option to hide the secondary failures, so only one traceback shows up per document. > b) I often use test bodies as convenient points to put a > pdb.set_trace call if I want to > debug something. This wasn't very well supported when I was > trying to use doctest. The doctest in zope.testing supports this; hopefully someone sufficiently in-the-know can unfork that version. > As a result, I still use unittest pretty much exclusively to write > tests. I'd be sad if it went away. Yes; there's definitely a place for unittest, or something very like it. -Fred -- Fred L. Drake, Jr. From pje at telecommunity.com Thu Jun 28 22:57:04 2007 From: pje at telecommunity.com (Phillip J. Eby) Date: Thu, 28 Jun 2007 16:57:04 -0400 Subject: [Python-3000] doctests vs. unittests (was Re: pimp; restructuring the standard library) In-Reply-To: References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com> <20070628163922.A40063A40AF@sparrow.telecommunity.com> Message-ID: <20070628205456.A94EA3A40A8@sparrow.telecommunity.com> At 04:04 PM 6/28/2007 -0400, Chris McDonough wrote: >a) If one of your fixture calls or an assertion fails for some >reason, the rest of the test > trips over itself trying to complete, usually without success >because an invariant > hasn't been met, and you need to scroll through a bunch of decoy >output to > see where the actual problem began. Use the REPORT_ONLY_FIRST_FAILURE option: http://python.org/doc/2.4.1/lib/doctest-options.html >b) I often use test bodies as convenient points to put a >pdb.set_trace call if I want to > debug something. This wasn't very well supported when I was >trying to use doctest. I believe this was fixed in 2.4. And I *know* it's fixed in 2.5. :) From greg.ewing at canterbury.ac.nz Fri Jun 29 01:26:13 2007 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Fri, 29 Jun 2007 11:26:13 +1200 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: <4683BE09.4010702@gmail.com> References: <20070624132756.7998.JCARLSON@uci.edu> <20070625193345.79AA.JCARLSON@uci.edu> <468302C0.3050808@canterbury.ac.nz> <4683BE09.4010702@gmail.com> Message-ID: <46844395.4000802@canterbury.ac.nz> Nick Coghlan wrote: > There is no function call per loop even when using a > generator expression - a generator function is implicitly defined, and > then called once to instantiate the generator. You're right -- I must have been half-thinking of map() at the time. Resuming the generator ought to be faster than a function call. But still a bit slower than in-line code, perhaps. > I believe the point of adding set > literal and comprehension syntax is to make it possible to easily speed > up membership tests for items in known groups Yes, but set(generator) would do that just as well as {generator} if it weren't any slower. So the reasons for keeping the comprehension notations are (a) slightly more convenient syntax and (b) maybe a bit faster. -- Greg From barry at python.org Fri Jun 29 05:46:05 2007 From: barry at python.org (Barry Warsaw) Date: Thu, 28 Jun 2007 23:46:05 -0400 Subject: [Python-3000] doctests vs.
unittests (was Re: pimp; restructuring the standard library) In-Reply-To: References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com> <20070628163922.A40063A40AF@sparrow.telecommunity.com> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jun 28, 2007, at 4:04 PM, Chris McDonough wrote: > I've historically not been a huge fan of doctests because (these > things may have changed since last I used doctest in anger): I used to think the same thing, but I've gotten the doctest religion. I'm using them almost exclusively in the new Mailman code, and we use them at work (though both still have traditional Python unit tests). The thing that convinced me was the realization (assisted by my colleagues) that doctests are first and foremost documentation. They are testable documentation sure, but the unit tests are secondary. There's no question that for things like system documentation, the narrative that weaves the testable bits together in a well written doctest is much more valuable than the tests. Most unittest based tests have little or no comments, and nothing approaching the narrative in a good doctest, so it's clear that unittests are tests first and probably not documentation at all. I've even experimented with writing a PEP for my enum package (as yet unsubmitted) that is nothing more than a doctest. It seemed almost completely natural. A good test suite can benefit from both doctests and unittests and I don't think unittest will ever go away (nor should it), but in my latest work I'm opting more and more for doctests. That Tim Peters is a smart guy after all I guess. :) - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRoSAfnEjvBPtnXfVAQLSrQP/criiWjS2RdChwq5CVw1BbYbS5LP8WI7b 4SY6BRLFFWH218IrihVa8kZh8cvrTb1PHxVqiuEQIj3qcHo3SuMO6A1MKYZJhuCN vOINQkseaP1jGn5/b85/Q3OSUGbVfdWS+E7Yri5qCva/GaTNwCNNHNTT9+K7LBqE 7AA937O2oa8= =97lr -----END PGP SIGNATURE----- From chrism at plope.com Fri Jun 29 07:40:39 2007 From: chrism at plope.com (Chris McDonough) Date: Fri, 29 Jun 2007 01:40:39 -0400 Subject: [Python-3000] doctests vs. unittests (was Re: pimp; restructuring the standard library) In-Reply-To: References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com> <20070628163922.A40063A40AF@sparrow.telecommunity.com> Message-ID: On Jun 28, 2007, at 11:46 PM, Barry Warsaw wrote: > The thing that convinced me was the realization (assisted by my > colleagues) that doctests are first and foremost documentation. > They are testable documentation sure, but the unit tests are > secondary. There's no question that for things like system > documentation, the narrative that weaves the testable bits together > in a well written doctest is much more valuable than the tests. I suspect it would be even more valuable as documentation if it didn't give good coverage. > Most unittest based tests have little or no comments, and nothing > approaching the narrative in a good doctest, so it's clear that > unittests are tests first and probably not documentation at all. This probably isn't the place for this discussion but I'll give an explanation about why I think that's actually a good thing. I find that I only get good test coverage when I have more test code for a component than the implementation of the component I'm trying to test. At least that's been my experience. I haven't been able to make the tests more succinct while still testing things adequately. When coverage gets good, "documentation-ness" of tests suffers.
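A contrived illustration of what I mean, made up on the spot using only the stdlib -- half of the lines below are fixture and teardown, not anything a reader needs:

    >>> import os, tempfile                        # fixture
    >>> d = tempfile.mkdtemp()                     # fixture
    >>> path = os.path.join(d, 'x.ini')            # fixture
    >>> f = open(path, 'w'); f.write('[a]\nk = 1\n'); f.close()
    >>> import ConfigParser                        # the part being documented
    >>> cp = ConfigParser.ConfigParser()
    >>> cp.read(path) == [path]
    True
    >>> cp.get('a', 'k')
    '1'
    >>> os.remove(path); os.rmdir(d)               # teardown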
You can get good test coverage with any sort of tests. But once you get good test coverage, whatever framework you've chosen to write them in, the tests are no longer very good as narrative documentation because they're littered with bits of fixture code, edge case assertions, etc. I don't mind doctest at all really (I just use unittest out of inertia and personal preference, I'd probably be just as happy with nose or whatever). I just don't like it when folks advertise the same doctest as both a comprehensive set of tests and a component's only source of documentation, because I don't think it's possible for it to be both at the same time with any sort of quality in both directions simultaneously. That said, having testable documentation is a good thing! I'd just prefer that that documentation did not include lots of fixture noise. > A good test suite can benefit from both doctests and unittests and > I don't think unittest will ever go away (nor should it), but in my > latest work I'm opting more and more for doctests. That Tim Peters > is a smart guy after all I guess. :) I miss "uncle Timmy". :-( - C From mhammond at skippinet.com.au Fri Jun 29 07:27:53 2007 From: mhammond at skippinet.com.au (Mark Hammond) Date: Fri, 29 Jun 2007 15:27:53 +1000 Subject: [Python-3000] doctests vs. unittests (was Re: pimp; restructuring the standard library) In-Reply-To: Message-ID: <003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local> Barry writes: > On Jun 28, 2007, at 4:04 PM, Chris McDonough wrote: > > > I've historically not been a huge fan of doctests because (these > > things may have changed since last I used doctest in anger): > > I used to think the same thing, but I've gotten the doctest > religion. I'm using them almost exclusively in the new > Mailman code, > and we use them at work (though both still have traditional Python > unit tests). > > The thing that convinced me was the realization (assisted by my > colleagues) that doctests are first and foremost > documentation. They > are testable documentation sure, but the unit tests are secondary. I admit I'm still yet to get the doctest religion to that degree - but for exactly the same reasons :) My problem is that too quickly, doctests go way beyond documentation - they turn into a full-blown test framework, and this tends to work against the clarity of the resulting documentation. I like doctests that give you a general introduction to what is being tested. They can operate as a kind of 'tutorial', allowing someone with no experience in the code to quickly see the basics of how it is used - that is very useful indeed. But IMO, these too quickly morph into the territory of unittests - they start testing all corner cases. The simple tutorial quality gets lost as the doctests start including reams of test data and testing against invariants that are important to the developer of the library, but mere noise to a casual user of it. Another key feature of unittests is their utility in helping you *find* bugs in the first place. When a bug is identified "in the field", unit tests make it easy to find a "smallest possible" reproduction of a bug in order to identify the root cause - which is then checked in when the bug is fixed. If only doctests are available, then either that obscure bug is also added to the doctests (making even more noise), or a test case is extracted to a temporary program and discarded once the bug is fixed.
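For instance, the checked-in reproduction I mean is typically no more than this (the bug and the names here are invented for illustration):

    import unittest

    class Bug1234Regression(unittest.TestCase):
        # Smallest-possible reproduction of a field report:
        # splitting an empty string must give a one-element
        # list, not raise.
        def test_empty_input(self):
            self.assertEqual(''.split(','), [''])

    if __name__ == '__main__':
        unittest.main()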
> A good test suite can benefit from both doctests and unittests and I > don't think unittest will ever go away (nor should it), but in my > latest work I'm opting more and more for doctests. I find myself opting for doctests when working with "new" code, but quickly leaving the doctests in their pristine state and moving to unittests once the bugs get a bit curlier, or coverage.py directs me to write tests I'd never dreamt of, etc... > That Tim Peters is a smart guy after all I guess. :) Indeed he is - which is exactly why I use them as I described - that is my interpretation of what he intended. Mark From ncoghlan at gmail.com Fri Jun 29 09:41:57 2007 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 29 Jun 2007 17:41:57 +1000 Subject: [Python-3000] [Python-Dev] Python 3000 Status Update (Long!) In-Reply-To: <46844395.4000802@canterbury.ac.nz> References: <20070624132756.7998.JCARLSON@uci.edu> <20070625193345.79AA.JCARLSON@uci.edu> <468302C0.3050808@canterbury.ac.nz> <4683BE09.4010702@gmail.com> <46844395.4000802@canterbury.ac.nz> Message-ID: <4684B7C5.8020307@gmail.com> Greg Ewing wrote: > So the reasons for keeping the comprehension notations > are (a) slightly more convenient syntax and (b) maybe > a bit faster. Yes, I was actually agreeing with you on that point (I just got sidetracked on a couple of technical quibbles, so my agreement may not have been clear...) Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org From guido at python.org Fri Jun 29 16:49:05 2007 From: guido at python.org (Guido van Rossum) Date: Fri, 29 Jun 2007 07:49:05 -0700 Subject: [Python-3000] doctests vs. unittests (was Re: pimp; restructuring the standard library) In-Reply-To: <003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local> References: <003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local> Message-ID: If I have any say in it, unittest isn't going away (unless replaced by something very similar, and doctest ain't it). Religion is all fine and well, as long as there's room for other views. I personally find using unit tests a lot easier than using doctest, for many of the things I tend to do (and most of my co-workers at Google see it that way, too). That said, I hope that the doctest community will contribute a better way for the 2to3 tool to find and fix doctests; the -d option is too cumbersome to use. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From pje at telecommunity.com Fri Jun 29 17:37:51 2007 From: pje at telecommunity.com (Phillip J. Eby) Date: Fri, 29 Jun 2007 11:37:51 -0400 Subject: [Python-3000] doctests vs. unittests (was Re: pimp; restructuring the standard library) In-Reply-To: References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com> <20070628163922.A40063A40AF@sparrow.telecommunity.com> Message-ID: <20070629153542.467A03A40BF@sparrow.telecommunity.com> At 01:40 AM 6/29/2007 -0400, Chris McDonough wrote: >When coverage gets good, "documentation-ness" of tests suffers. The question is more one of, "documentation for whom?". You can write separate documents for library users and for library extenders/developers. I don't put doctests in docstrings, but if I did, I'd probably only put user doctests there. As it is, I normally split my doctests into multiple files for different audiences, or under different headings in one large file.
For example, if you look at the BytecodeAssembler documentation: http://peak.telecommunity.com/DevCenter/BytecodeAssembler You'll see that the assertion and invariant testing is mostly relegated to a separate section. Another library I'm working on has two doctest files for users (a quick intro and a developer guide/reference) and a separate file that tests all the innards. So, there are a lot of ways to use doctests effectively, at least if you're doing them in text files, rather than in your docstrings. I've actually never put a doctest in a docstring; it always seems like overkill to me. (Especially since reST doctests can be made into nice HTML pages like the above!) From barry at python.org Fri Jun 29 18:12:28 2007 From: barry at python.org (Barry Warsaw) Date: Fri, 29 Jun 2007 12:12:28 -0400 Subject: [Python-3000] doctests vs. unittests (was Re: pimp; restructuring the standard library) In-Reply-To: <20070629153542.467A03A40BF@sparrow.telecommunity.com> References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com> <20070628163922.A40063A40AF@sparrow.telecommunity.com> <20070629153542.467A03A40BF@sparrow.telecommunity.com> Message-ID: <41E11C29-D7BF-45C5-8E0F-FEE3FB1CE150@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Since this has stopped being on-topic for this mailing list, just one last follow-up from me. On Jun 29, 2007, at 11:37 AM, Phillip J. Eby wrote: > The question is more one of, "documentation for whom?". You can > write separate documents for library users and for library > extenders/developers. I don't put doctests in docstrings, but if I > did, I'd probably only put user doctests there. As it is, I > normally split my doctests into multiple files for different > audiences, or under different headings in one large file. > > Another library I'm working on has two doctest files for users (a > quick intro and a developer guide/reference) and a separate file > that tests all the innards. So, there are a lot of ways to use > doctests effectively, at least if you're doing them in text files, > rather than in your docstrings. I've actually never put a doctest > in a docstring; it always seems like overkill to me. (Especially > since reST doctests can be made into nice HTML pages like the above!) I concur with Phillip about two important points. First, I also never put doctests in docstrings. I find them unreadable, difficult to edit, and not conducive to a cohesive narrative. I always put my doctests in a separate file, usually under a 'docs' directory actually. Maybe this will make a difference for people considering doctests as a complement to traditional unittests. Second, I agree that you can achieve a high degree of coverage with doctests if you stop to answer Phillip's question: "documentation for whom?" I typically put documentation for users in a separate file from documentation for extenders/developers, but that's just personal taste. The important insight is that explaining how to use a library often covers most if not all the corner cases of its use. Lastly, an observation: I've found that using doctests has had the surprising consequence of making test-driven development enjoyable. I've found myself starting a new task by writing the documentation first, which is a great way to design how you want the code to work. Because the documentation is testable, you're left with a simple matter of coding until the doctest passes.
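For what it's worth, the whole edit-run loop needs nothing more than this (the file name here is made up):

    import doctest
    failures, tried = doctest.testfile('docs/enum.txt')
    # keep writing code until failures == 0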
Cheers,
-Barry

From srichter at cosmos.phy.tufts.edu  Fri Jun 29 18:39:34 2007
From: srichter at cosmos.phy.tufts.edu (Stephan Richter)
Date: Fri, 29 Jun 2007 12:39:34 -0400
Subject: [Python-3000] doctests vs. unittests (was Re: pimp; restructuring the standard library)
In-Reply-To: 
References: <95d8c0810706280859x20984153k50469e6979ba265a@mail.gmail.com>
Message-ID: <200706291239.34485.srichter@cosmos.phy.tufts.edu>

On Friday 29 June 2007 01:40, Chris McDonough wrote:
> I don't mind doctest at all really (I just use unittest out of
> inertia and personal preference, I'd probably be just as happy with
> nose or whatever). I just don't like when folks advertise the same
> doctest as both a comprehensive set of tests and a component's only
> source of documentation, because I don't think it's possible for it
> to be both at the same time with any sort of quality in both
> directions simultaneously.

I could not disagree more. My personal rule is that any released code
should be 100% coverage tested. And I never write regular unittests
anymore, except for some super-specific cases. Also, people compliment
me on good documentation all the time. Have a look at
http://svn.zope.org/z3c.form/trunk/src/z3c/form/. The documentation is
example-driven, yet still covers all of the API.

Having said that, writing comprehensive doctests that do not read like
a CS thesis is very hard. It took me the last 5 years developing Zope 3
to learn how to do that right.

BTW, I do agree with what Phillip and Barry wrote. I always consider it
a challenge to see how many lines of testable documentation I can write
before writing one line of code -- I max out at about 2k right now.

Regards,
Stephan

-- 
Stephan Richter
CBU Physics & Chemistry (B.S.) / Tufts Physics (Ph.D. student)
Web2k - Web Software Design, Development and Training

From santagada at gmail.com  Fri Jun 29 19:03:19 2007
From: santagada at gmail.com (Leonardo Santagada)
Date: Fri, 29 Jun 2007 14:03:19 -0300
Subject: [Python-3000] doctests vs. unittests (was Re: pimp; restructuring the standard library)
In-Reply-To: 
References: <003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local>
Message-ID: <3915BC8B-CC39-4C02-9A96-4225A7678062@gmail.com>

On 29/06/2007, at 11:49, Guido van Rossum wrote:

> If I have any say in it, unittest isn't going away (unless replaced by
> something very similar, and doctest ain't it). Religion is all fine
> and well, as long as there's room for other views. I personally find
> using unit tests a lot easier than using doctest, for many of the
> things I tend to do (and most of my co-workers at Google see it that
> way, too).

py.test is similar enough to replace unittest?

-- 
Leonardo Santagada

From guido at python.org  Fri Jun 29 19:09:19 2007
From: guido at python.org (Guido van Rossum)
Date: Fri, 29 Jun 2007 10:09:19 -0700
Subject: [Python-3000] doctests vs. unittests (was Re: pimp; restructuring the standard library)
In-Reply-To: <3915BC8B-CC39-4C02-9A96-4225A7678062@gmail.com>
References: <003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local> <3915BC8B-CC39-4C02-9A96-4225A7678062@gmail.com>
Message-ID: 

On 6/29/07, Leonardo Santagada wrote:
>
> On 29/06/2007, at 11:49, Guido van Rossum wrote:
>
> > If I have any say in it, unittest isn't going away (unless replaced by
> > something very similar, and doctest ain't it). Religion is all fine
> > and well, as long as there's room for other views. I personally find
> > using unit tests a lot easier than using doctest, for many of the
> > things I tend to do (and most of my co-workers at Google see it that
> > way, too).
>
> py.test is similar enough to replace unittest?

I've never looked at py.test, so I can't tell. There needs to be a
100% backwards-compatible API so existing unittests don't need to be
changed (as they are the cornerstone of any transition to Python 3000).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rrr at ronadam.com  Sat Jun 30 01:19:54 2007
From: rrr at ronadam.com (Ron Adam)
Date: Fri, 29 Jun 2007 18:19:54 -0500
Subject: [Python-3000] doctests vs. unittests (was Re: pimp; restructuring the standard library)
In-Reply-To: <003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local>
References: <003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local>
Message-ID: <4685939A.9090803@ronadam.com>

Mark Hammond wrote:
> Barry writes:
>
>> On Jun 28, 2007, at 4:04 PM, Chris McDonough wrote:
>> A good test suite can benefit from both doctests and unittests and I
>> don't think unittest will ever go away (nor should it), but in my
>> latest work I'm opting more and more for doctests.
>
> I find myself opting for doctests when working with "new" code, but quickly
> leaving the doctests in their pristine state and moving to unittests once
> the bugs get a bit curlier, or coverage.py directs me to write tests I'd
> never dreamt of, etc...

I agree with this completely. Doctests are very useful for getting the
basics down and working while the code is being written.

After that, unittests are much better for testing edge cases and making
sure everything works including the kitchen sink, the pipes to the sink,
the quality of water, etc... ;-)

If there is a problem, I don't think it is in the exact execution of
doctests or unittests, but in the organization of them relative to the
modules and how they are run.

Currently the unittest test suite runs tests that are in a known place
with a known name. There can be modules in a distribution that are
completely untested and you would not know unless you manually checked
for this.

I'd like to see this turned around a bit so that the test suite runner
first scans the modules and then looks for tests for each of them. And
if no test for a particular module is found, give some sort of warning.

Possibly a module could have a __tests__ 'list' attribute with locations
of tests? So an automatic test runner might start by first importing a
module and then running the test modules listed in __tests__. And yes,
even the tests can have tests. ;-)

A "__tests__ = None" could explicitly turn that off, where a
"__tests__ = []" would indicate a module that does not yet have tests
but needs them. This could also reduce the boilerplate needed to run
unittests as a side bonus.
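To make that concrete, here is a rough sketch of such a runner --
entirely hypothetical, since nothing defines __tests__ today, and all
the names are invented:

    import unittest

    _MISSING = object()

    def suite_for(module_name):
        # Hypothetical runner for the proposed __tests__ convention.
        # Top-level module names only, to keep the sketch short.
        module = __import__(module_name)
        tests = getattr(module, '__tests__', _MISSING)
        if tests is _MISSING:
            print('warning: %s does not declare __tests__' % module_name)
            return unittest.TestSuite()
        if tests is None:    # explicitly opted out of testing
            return unittest.TestSuite()
        if not tests:        # declared but empty: tests still needed
            print('warning: %s has no tests yet' % module_name)
            return unittest.TestSuite()
        loader = unittest.TestLoader()
        return unittest.TestSuite(loader.loadTestsFromName(name)
                                  for name in tests)

The scanning step would then just walk the distribution, call
suite_for() on each module it finds, and hand the combined suite to a
test runner.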
There have been a few times where I started writing doctests for a module
with less than 100 lines of code, and by the time I was done with the
doctests it became a module of 500 lines or more. The actual code then
starts to get lost in the file.

It would be cool if the documentation files could also contain the
doctests instead of them being in the source code. I'm sure this could
be done now, but there isn't a standard way to do it. Currently I create
a separate test module which unclutters the program modules, but then it
isn't clear these are meant to be documentation first.

Cheers,
Ron

From percivall at gmail.com  Sat Jun 30 01:39:43 2007
From: percivall at gmail.com (Simon Percivall)
Date: Sat, 30 Jun 2007 01:39:43 +0200
Subject: [Python-3000] doctests vs. unittests (was Re: pimp; restructuring the standard library)
In-Reply-To: <4685939A.9090803@ronadam.com>
References: <003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local> <4685939A.9090803@ronadam.com>
Message-ID: 

On 30 jun 2007, at 01.19, Ron Adam wrote:
> It would be cool if the documentation files could also contain the
> doctests instead of them being in the source code. I'm sure this
> could be done now, but there isn't a standard way to do it.
> Currently I create a separate test module which unclutters the
> program modules, but then it isn't clear these are meant to be
> documentation first.

Well, there is doctest.testfile, which should do that. It's been in
doctest since 2.4.

//Simon

From benji at benjiyork.com  Sat Jun 30 04:20:01 2007
From: benji at benjiyork.com (Benji York)
Date: Fri, 29 Jun 2007 22:20:01 -0400
Subject: [Python-3000] doctests vs. unittests (was Re: pimp; restructuring the standard library)
In-Reply-To: <4685939A.9090803@ronadam.com>
References: <003501c7ba0e$3e78ef90$0200a8c0@enfoldsystems.local> <4685939A.9090803@ronadam.com>
Message-ID: <4685BDD1.6010906@benjiyork.com>

Being off topic, I'm just going to do a drive-by and urge people who
are interested in following up to visit the TIP (testing in Python)
list at http://lists.idyll.org/listinfo/testing-in-python.

Ron Adam wrote:
> I agree with this completely. Doctests are very useful for getting the
> basics down and working while the code is being written.
>
> After that, unittests are much better for testing edge cases and making
> sure everything works including the kitchen sink, the pipes to the sink,
> the quality of water, etc... ;-)

In the code bases I'm involved in right now, we use doctests almost
exclusively, including for the "kitchen sink" tests. We find the slight
tendency toward more and better prose in doctests is especially nice
when trying to discern what exactly some obscure test code is actually
trying to verify (particularly important when the test fails).

> Currently the unittest test suite runs tests that are in a known place
> with a known name. There can be modules in a distribution that are
> completely untested and you would not know unless you manually checked
> for this.

Most test runners have coverage reporting options for both unit tests
and doctests.

> There have been a few times where I started writing doctests for a
> module with less than 100 lines of code, and by the time I was done
> with the doctests it became a module of 500 lines or more. The actual
> code then starts to get lost in the file.
>
> It would be cool if the documentation files could also contain the
> doctests instead of them being in the source code.

As mentioned later in this thread, this is already possible.
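For instance -- the file name here is invented, but doctest.testfile
itself is the real API, available since 2.4:

    import doctest

    if __name__ == '__main__':
        # Run every interactive example in the prose file as a test;
        # doctest prints any failures and returns the counts.
        failed, attempted = doctest.testfile('docs/queue.txt')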
Having separate files for them (one of which is usually named
README.txt) is quite a bit nicer. If you write your "whole file"
doctests in ReST, you can also render them to HTML as is done for the
packages we put in pypi (here's a short example:
http://cheeseshop.python.org/pypi/zc.queue/1.1, ReST source at
http://svn.zope.org/*checkout*/zc.queue/trunk/src/zc/queue/queue.txt).

-- 
Benji York
http://benjiyork.com

From g.brandl at gmx.net  Sat Jun 30 09:33:22 2007
From: g.brandl at gmx.net (Georg Brandl)
Date: Sat, 30 Jun 2007 09:33:22 +0200
Subject: [Python-3000] Fix for Lib/pydoc.py in p3yk
In-Reply-To: 
References: 
Message-ID: 

Alexandre Vassalotti schrieb:
> Meanwhile, I found another division/range combination that could be
> problematic. I attached an updated patch.

Thanks, committed.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no
less. Four shall be the number of spaces thou shalt indent, and the
number of thy indenting shall be four. Eight shalt thou not indent,
nor either indent thou two, excepting that thou then proceed to four.
Tabs are right out.

From matt-python at theory.org  Sat Jun 30 22:54:44 2007
From: matt-python at theory.org (Matt Chisholm)
Date: Sat, 30 Jun 2007 13:54:44 -0700
Subject: [Python-3000] Announcing PEP 3136
Message-ID: <20070630205444.GD22221@theory.org>

Hi all. I've created and submitted a new PEP proposing support for
labels in Python's break and continue statements. Georg Brandl has
graciously added it to the PEP list as PEP 3136:

http://www.python.org/dev/peps/pep-3136/

(A short example of the kind of code this targets is in the P.S.
below.)

I understand that the deadline for submitting features for Python 3.0
has passed, so this PEP targets Python 3.1. I also expect that people
might not want to take time off from the Python 3.0 effort to discuss
features that are even further off in the future.

Thanks for your time, and thanks for letting me contribute an idea to
Python.

-matt
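P.S. To illustrate -- this example is invented, and the labelled
spelling in the comment is hypothetical rather than the PEP's settled
syntax:

    # Today, leaving two nested loops at once takes a helper function
    # (or a flag variable, or a custom exception):
    def find(grid, target):
        for i, row in enumerate(grid):
            for j, cell in enumerate(row):
                if cell == target:
                    return i, j   # 'return' doubles as a two-level break
        return None

    # With labels, the search could stay inline; one possible spelling:
    #
    #     for row in grid:            # label: outer
    #         for cell in row:
    #             if cell == target:
    #                 break outer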