From greg at krypto.org Fri May 1 00:44:12 2009 From: greg at krypto.org (Gregory P. Smith) Date: Thu, 30 Apr 2009 15:44:12 -0700 Subject: [Python-Dev] Proposed: a new function-based C API for declaring Python types In-Reply-To: <49F7C37C.5090305@hastings.org> References: <49F7C37C.5090305@hastings.org> Message-ID: <52dc1c820904301544m649b78acr6238b66d9a63be61@mail.gmail.com> On Tue, Apr 28, 2009 at 8:03 PM, Larry Hastings wrote: > > EXECUTIVE SUMMARY > > I've written a patch against py3k trunk creating a new function-based > API for creating extension types in C. This allows PyTypeObject to > become a (mostly) private structure. > > > THE PROBLEM > > Here's how you create an extension type using the current API. > > * First, find some code that already has a working type declaration. > Copy and paste their fifty-line PyTypeObject declaration, then > hack it up until it looks like what you need. > > * Next--hey! There *is* no next, you're done. You can immediately > create an object using your type and pass it into the Python > interpreter and it would work fine. You are encouraged to call > PyType_Ready(), but this isn't required and it's often skipped. > > This approach causes two problems. > > 1) The Python interpreter *must support* and *cannot change* > the PyTypeObject structure, forever. Any meaningful change to > the structure will break every extension. This has many > consequences: > a) Fields that are no longer used must be left in place, > forever, as ignored placeholders if need be. Py3k cleaned > up a lot of these, but it's already picked up a new one > ("tp_compare" is now "tp_reserved"). > b) Internal implementation details of the type system must > be public. > c) The interpreter can't even use a different structure > internally, because extensions are free to pass in objects > using PyTypeObjects the interpreter has never seen before. > > 2) As a programming interface this lacks a certain gentility. It > clearly *works*, but it requires programmers to copy and paste > with a large structure mostly containing NULLs, which they must > pick carefully through to change just a few fields. > > > THE SOLUTION > > My patch creates a new function-based extension type definition API. > You create a type by calling PyType_New(), then call various accessor > functions on the type (PyType_SetString and the like), and when your > type has been completely populated you must call PyType_Activate() > to enable it for use. > > With this API available, extension authors no longer need to directly > see the innards of the PyTypeObject structure. Well, most of the > fields anyway. There are a few shortcut macros in CPython that need > to continue working for performance reasons, so the "tp_flags" and > "tp_dealloc" fields need to remain publically visible. > > One feature worth mentioning is that the API is type-safe. Many such > APIs would have had one generic "PyType_SetPointer", taking an > identifier for the field and a void * for its value, but this would > have lost type safety. Another approach would have been to have one > accessor per field ("PyType_SetAddFunction"), but this would have > exploded the number of functions in the API. My API splits the > difference: each distinct *type* has its own set of accessors > ("PyType_GetSSizeT") which takes an identifier specifying which > field you wish to get or set. > > > SIDE-EFFECTS OF THE API > > The major change resulting from this API: all PyTypeObjects must now > be *pointers* rather than static instances. 
For example, the external > declaration of PyType_Type itself changes from this: > PyAPI_DATA(PyTypeObject) PyType_Type; > to this: > PyAPI_DATA(PyTypeObject *) PyType_Type; > > This gives rise to the first headache caused by the API: type casts > on type objects. It took me a day and a half to realize that this, > from Modules/_weakref.c: > PyModule_AddObject(m, "ref", > (PyObject *) &_PyWeakref_RefType); > really needed to be this: > PyModule_AddObject(m, "ref", > (PyObject *) _PyWeakref_RefType); > > Hopefully I've already found most of these in CPython itself, but > this sort of code surely lurks in extensions yet to be touched. > > (Pro-tip: if you're working with this patch, and you see a crash, > and gdb shows you something like this at the top of the stack: > #0 0x081056d8 in visit_decref (op=0x8247aa0, data=0x0) > at Modules/gcmodule.c:323 > 323 if (PyObject_IS_GC(op)) { > your problem is an errant &, likely on a type object you're passing > in to the interpreter. Think--what did you touch recently? Or debug > it by salting your code with calls to collect(NUM_GENERATIONS-1).) > > > Another irksome side-effect of the API: because of "tp_flags" and > "tp_dealloc", I now have two declarations of PyTypeObject. There's > the externally-visible one in Include/object.h, which lets external > parties see "tp_dealloc" and "tp_flags". Then there's the internal > one in Objects/typeprivate.h which is the real structure. Since > declaring a type twice is a no-no, the external one is gated on > #ifndef PY_TYPEPRIVATE > If you're a normal Python extension programmer, you'd include Python.h > as normal: > #include "Python.h" > Python implementation files that need to see the real PyTypeObject > structure now look like this: > #define PY_TYPEPRIVATE > #include "Python.h" > #include "../Objects/typeprivate.h" > > Also, since the structure of PyTypeObject hasn't yet changed, there > are a bunch of fields in PyTypeObject that are externally visible that > I don't want to be visible. To ensure no one was using them, I renamed > them to "mysterious_object_0" and "mysterious_object_1" and the like. > Before this patch gets accepted, I want to reorder the fields in > PyTypeObject (which we can! because it's private!) so that these public > fields are at the top of the both the external and internal structures. > > > THE UPGRADE PATH > > Python internally declares a great many types, and I haven't attempted > to convert them all. Instead there's an conversion header file that > does most of the work for you. Here's how one would apply it to an > existing type. > > 1. Where your file currently has this: > #include "Python.h" > change it to this: > #define PY_TYPEPRIVATE > #include "Python.h" > #include "pytypeconvert.h" > > 2. Whenever you declare a type, change it from this: > static PyTypeObject YourExtension_Type = { > to this: > static PyTypeObject *YourExtension_Type; > static PyTypeObject _YourExtension_Type = { > > Use NULL for your metaclass. For example, change this: > PyObject_HEAD_INIT(&PyType_Type), > to this: > PyObject_HEAD_INIT(NULL), > > Also use NULL for your baseclass. For example, change this: > &PyDict_Type, /* tp_base */ > to this: > NULL, /* tp_base */ > setting it to NULL instead. > > 3. In your module's init function, add this: > CONVERT_TYPE(YourExtension_Type, > metaclass, baseclass, "description of type"); > "metaclass" and "baseclass" should be the metaclass and baseclass > for your type, the ones you just set to NULL in step 3. 
If you > had NULL before the baseclass, use NULL here too. > > 4. If you have any static object declarations, set their ob_type to > NULL in the static declaration, then set it explicitly in your > init function. If your object uses a locally-defined type, > be sure to do this *after* the CONVERT_TYPE line for that type. > (See _Py_EllipsisObject for an example.) > > 5. Anywhere you're using existing Python type declarations > you must remove the & from the front. > > The conversion header file *also* redefines PyTypeObject. But this > time it redefines it to the existing definition, and that definition > will stay the same forever. That's the whole point: if you have an > existing Python 3.0 extension, it won't have to change if we change > the internal definition of PyTypeObject. > > (Why bother with this conversion process, with few py3k extensions > in the wild? This patch was started quite a while ago, when it > seemed plausible the API would get backported to 2.x. Now I'm not > so sure that will happen.) > > > > > THE CURRENT PATCH > > I've uploaded a patch to the tracker: > http://bugs.python.org/issue5872 > It applies cleanly to py3k/trunk (r72081). But the code is awfully > grubby. > > * I haven't dealt with any types I can't build, and I can't build > a lot of the extensions. I'm using Linux, and I don't have the > dev headers for many libraries on my laptop, and I haven't touched > Windows or Mac stuff. > > * I created some new build warnings which should obviously be fixed. > > * With the patch installed, py3k trunk builds and installs. It does > *not* pass the regression test suite. (It crashes.) I don't think > this'll be too bad, it's just taken me this long to get it as far > as I have. > > * There are some internal scaffolds and hacks that should be purged > by the final patch. > > * There's no documentation. If you'd like to see how you'd use the > new API, currently the best way to learn is to read > Include/pytypeconvert.h. > > * I don't like the PY_TYPEPRIVATE hack. I only used it 'cause it > sucks less than the other approaches I've thought of. I welcome > your suggestions. > > The second-best approach I've come up with: make PyTypeObject > genuinely private, and declare a different structure containing just > the head of PyTypeObject. Let's call it PyTypeObjectHead. Then, > for the convenience macros that use "dealloc" and "flags", cast the > object to PyTypeObjectHead before dereferencing. This abandons type > safety, and given my longing for type safety while developing this > patch I'd prefer to not make loss of type safety an official API. > > THE FEEDBACK I SEEK > > My understanding is that the feature-freeze for Python 3.1 is in a > little over a week. Given the current stability level and untestedness > of the patch, and the lateness of the hour... is there any chance this > would be accepted into Python 3.1? If so, I'll need to act fast. If > not, I might as well take it relax, huh. > > > My thanks to Neal Norwitz for suggesting this project, and Brett Cannon > for some recent encouragement. (And another person who I discussed it > with so long ago I forgot who it was... maybe Fredik Lundh?) > > > /larry/ +1 I haven't looked at your code so I can't comment on the API itself... But awesome. I like the general idea. Exposing structures has hampered us for quite a while with forwards API compatability. 
I predict not enough people are available to drive this to adoption and use for Python 3.1 given the time frame (the beta feature freeze happens this Saturday I believe?) but we should make this happen for 3.2 and get it stable and into in trunk soon after release-31maint branch is created. Whats needed? Perhaps a PEP describing a lot of what you started to write up in this email: the new extension module API with sections on the upgrade path and backwards compatibillity story. Extension modules are often maintained such that they work on all versions of Python from 2.3 or 2.4 on up to 3.x. We should provide a decent way to do that. Could some of these API functions be provided as a rarely changing add on .c/.h file for extension module authors to bundle as part of their extension modules for use with older versions of python to avoid big #ifdefs around structure definitions vs initialization API calls? -gps -------------- next part -------------- An HTML attachment was scrubbed... URL: From skippy.hammond at gmail.com Fri May 1 02:20:52 2009 From: skippy.hammond at gmail.com (Mark Hammond) Date: Fri, 01 May 2009 10:20:52 +1000 Subject: [Python-Dev] Proposed: add support for UNC paths to all functions in ntpath In-Reply-To: <49F9FCD0.80208@hastings.org> References: <49F8B222.7070204@hastings.org> <49F8D9A0.7000104@voidspace.org.uk> <49F8DBCD.6050504@trueblade.com> <49F9FCD0.80208@hastings.org> Message-ID: <49FA4064.5000508@gmail.com> Larry Hastings wrote: > > > Counting the votes for http://bugs.python.org/issue5799 : > > +1 from Mark Hammond (via private mail) > +1 from Paul Moore (via the tracker) > +1 from Tim Golden (in Python-ideas, though what he literally said > was "I'm up for it") > +1 from Michael Foord > +1 from Eric Smith > > There have been no other votes. > > Is that enough consensus for it to go in? If so, are there any core > developers who could help me get it in before the 3.1 feature freeze? > The patch should be in good shape; it has unit tests and updated > documentation. I've taken the liberty of explicitly CCing Martin just incase he missed the thread with all the noise regarding PEP383. If there are no objections from Martin or anyone else here, please feel free to assign it to me (and mail if I haven't taken action by the day before the beta freeze...) Cheers, Mark From steve at pearwood.info Fri May 1 04:40:14 2009 From: steve at pearwood.info (Steven D'Aprano) Date: Fri, 1 May 2009 12:40:14 +1000 Subject: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces In-Reply-To: <7e51d15d0904301355u2268bf0te06769792f697cc7@mail.gmail.com> References: <20090427211447.GA4291@cskk.homeip.net> <7e51d15d0904301355u2268bf0te06769792f697cc7@mail.gmail.com> Message-ID: <200905011240.14428.steve@pearwood.info> On Fri, 1 May 2009 06:55:48 am Thomas Breuel wrote: > You can get the same error on Linux: > > $ python > Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) > [GCC 4.3.3] on linux2 > Type "help", "copyright", "credits" or "license" for more > information. > > >>> f=open(chr(255),'w') > > Traceback (most recent call last): > File "", line 1, in > IOError: [Errno 22] invalid mode ('w') or filename: '\xff' Works for me under Fedora using ext3 as the file system. $ python2.6 Python 2.6.1 (r261:67515, Dec 24 2008, 00:33:13) [GCC 4.1.2 20070502 (Red Hat 4.1.2-12)] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> f=open(chr(255),'w') >>> f.close() >>> import os >>> os.remove(chr(255)) >>> Given that chr(255) is a valid filename on my file system, I would consider it a bug if Python couldn't deal with a file with that name. -- Steven D'Aprano From ronaldoussoren at mac.com Fri May 1 07:41:16 2009 From: ronaldoussoren at mac.com (Ronald Oussoren) Date: Fri, 01 May 2009 07:41:16 +0200 Subject: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces In-Reply-To: References: <20090427211447.GA4291@cskk.homeip.net> <49F658A5.7080807@g.nevcal.com> <79990c6b0904280220x5a1352b6u153edc7487c737f9@mail.gmail.com> <79990c6b0904280457g3c8b1153p84624b3ab1ef04be@mail.gmail.com> <49F6F09E.2020506@voidspace.org.uk> <1209A1AB-1A80-4E46-88B3-5F545476ADFA@mac.com> Message-ID: <67A75595-8D07-4D65-A234-301A8B45FB29@mac.com> On 30 Apr, 2009, at 21:33, Piet van Oostrum wrote: >>>>>> Ronald Oussoren (RO) wrote: > >> RO> For what it's worth, the OSX API's seem to behave as follows: >> RO> * If you create a file with an non-UTF8 name on a HFS+ >> filesystem the >> RO> system automaticly encodes the name. > >> RO> That is, open(chr(255), 'w') will silently create a file named >> '%FF' >> RO> instead of the name you'd expect on a unix system. > > Not for me (I am using Python 2.6.2). > >>>> f = open(chr(255), 'w') > Traceback (most recent call last): > File "", line 1, in > IOError: [Errno 22] invalid mode ('w') or filename: '\xff' >>>> That's odd. Which version of OSX do you use? ronald at Rivendell-2[0]$ sw_vers ProductName: Mac OS X ProductVersion: 10.5.6 BuildVersion: 9G55 [~/testdir] ronald at Rivendell-2[0]$ /usr/bin/python Python 2.5.1 (r251:54863, Jan 13 2009, 10:26:13) [GCC 4.0.1 (Apple Inc. build 5465)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import os >>> os.listdir('.') [] >>> open(chr(255), 'w').write('x') >>> os.listdir('.') ['%FF'] >>> And likewise with python 2.6.1+ (after cleaning the directory): [~/testdir] ronald at Rivendell-2[0]$ python2.6 Python 2.6.1+ (release26-maint:70603, Mar 26 2009, 08:38:03) [GCC 4.0.1 (Apple Inc. build 5493)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import os >>> os.listdir('.') [] >>> open(chr(255), 'w').write('x') >>> os.listdir('.') ['%FF'] >>> > > I once got a tar file from a Linux system which contained a file > with a > non-ASCII, ISO-8859-1 encoded filename. The tar file refused to be > unpacked on a HFS+ filesystem. > -- > Piet van Oostrum > URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4] > Private email: piet at vanoostrum.org -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2224 bytes Desc: not available URL: From zookog at gmail.com Fri May 1 07:44:36 2009 From: zookog at gmail.com (Zooko O'Whielacronx) Date: Thu, 30 Apr 2009 23:44:36 -0600 Subject: [Python-Dev] PEP 383 and GUI libraries In-Reply-To: References: <49F965DB.6050601@v.loewis.de> <49F96770.4080206@g.nevcal.com> <49F96B80.5090808@v.loewis.de> Message-ID: Folks: My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary binary names from the filesystem and store them so that I can regenerate the same byte string later, but it also requires that I *know* whether what I got was a valid string in the expected encoding (which might be utf-8) or whether it was not and I need to fall back to storing the bytes. 
So far, it looks like PEP 383 doesn't provide both of these requirements, so I am going to have to continue working-around the Python API even after PEP 383. In fact, it might actually increase the amount of working-around that I have to do. If I understand correctly, .decode(encoding, 'strict') will not be changed by PEP 383. A new error handler is added, so .decode('utf-8', 'python-escape') performs the utf-8b decoding. Am I right so far? Therefore if I have a string of bytes, I can attempt to decode it with 'strict', and if that fails I can set the flag showing that it was not a valid byte string in the expected encoding, and then I can invoke .decode('utf-8', 'python-escape') on it. So far, so good. (Note that I never want to do .decode(expected_encoding, 'python-escape') -- if it wasn't a valid bytestring in the expected_encoding, then I want to decode it with utf-8b, regardless of what the expected encoding was.) Anyway, I can use it like this: class FName: def __init__(self, name, failed_decode=False): self.name = name self.failed_decode = failed_decode def fs_to_unicode(bytes): try: return FName(bytes.decode(sys.getfilesystemencoding(), 'strict')) except UnicodeDecodeError: return FName(fn.decode('utf-8', 'python-escape'), failed_decode=True) And what about unicode-oriented APIs such as os.listdir()? Uh-oh, the PEP says that on systems with locale 'utf-8', it will automatically be changed to 'utf-8b'. This means I can't reliably find out whether the entries in the directory *were* named with valid encodings in utf-8? That's not acceptable for my use case. I would have to refrain from using the unicode-oriented os.listdir() on POSIX, and instead do something like this: if platform.system() in ('Windows', 'Darwin'): def listdir(d): return [FName(n) for n in os.listdir(d)] elif platform.system() in ('Linux', 'SunOs'): def listdir(d): bytesd = d.encode(sys.getfilesystemencoding()) return [fs_to_unicode(n) for n in os.listdir(bytesd)] else: raise NotImplementedError("Please classify platform.system() == %s \ as either unicode-safe or unicode-unsafe." % platform.system()) In fact, if 'utf-8' gets automatically converted to 'utf-8b' when *decoding* as well as encoding, then I would have to change my fs_to_unicode() function to check for that and make sure to use strict utf-8 in the first attempt: def fs_to_unicode(bytes): fse = sys.getfilesystemencoding() if fse == 'utf-8b': fse = 'utf-8' try: return FName(bytes.decode(fse, 'strict')) except UnicodeDecodeError: return FName(fn.decode('utf-8', 'python-escape'), failed_decode=True) Would it be possible for Python unicode objects to have a flag indicating whether the 'python-escape' error handler was present? That would serve the same purpose as my "failed_decode" flag above, and would basically allow me to use the Python APIs directory and make all this work-around code disappear. Failing that, I can't see any way to use the os.listdir() in its unicode-oriented mode to satisfy Tahoe's requirements. If you take the above code and then add the fact that you want to use the failed_decode flag when *encoding* the d argument to os.listdir(), then you get this code: [2]. 
Oh, I just realized that I *could* use the PEP 383 os.listdir(), like this: def listdir(d): fse = sys.getfilesystemencoding() if fse == 'utf-8b': fse = 'utf-8' ns = [] for fn in os.listdir(d): bytes = fn.encode(fse, 'python-escape') try: ns.append(FName(bytes.decode(fse, 'strict'))) except UnicodeDecodeError: ns.append(FName(fn.decode('utf-8', 'python-escape'), failed_decode=True)) return ns (And I guess I could define listdir() like this only on the non-unicode-safe platforms, as above.) However, that strikes me as even more horrible than the previous "listdir()" work-around, in part because it means decoding, re-encoding, and re-decoding every name, so I think I would stick with the previous version. Oh, one more note: for Tahoe's purposes you can, in all of the code above, replace ".decode('utf-8', 'python-replace')" with ".decode('windows-1252')" and it works just as well. While UTF-8b seems like a really cool hack, and it would produce more legible results if utf-8-encoded strings were partially corrupted, I guess I should just use 'windows-1252' which is already implemented in Python 2 (as well as in all other software in the world). I guess this means that PEP 383, which I have approved of and liked so far in this discussion, would actually not help Tahoe at all and would in fact harm Tahoe -- I would have to remember to detect and work-around the automatic 'utf-8b' filesystem encoding when porting Tahoe to Python 3. If anyone else has a concrete, real use case which would be helped by PEP 383, I would like to hear about it. Perhaps Tahoe can learn something from it. Oh, if this PEP could be extended to add a flag to each unicode object indicating whether it was created with the python-escape handler or not, then it would be useful to me. Regards, Zooko [1] http://mail.python.org/pipermail/python-dev/2009-April/089020.html [2] http://allmydata.org/trac/tahoe/attachment/ticket/534/fsencode.3.py From martin at v.loewis.de Fri May 1 08:25:34 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 01 May 2009 08:25:34 +0200 Subject: [Python-Dev] Proposed: add support for UNC paths to all functions in ntpath In-Reply-To: <49FA4064.5000508@gmail.com> References: <49F8B222.7070204@hastings.org> <49F8D9A0.7000104@voidspace.org.uk> <49F8DBCD.6050504@trueblade.com> <49F9FCD0.80208@hastings.org> <49FA4064.5000508@gmail.com> Message-ID: <49FA95DE.8060409@v.loewis.de> > I've taken the liberty of explicitly CCing Martin just incase he missed > the thread with all the noise regarding PEP383. > > If there are no objections from Martin It's fine with me - I just won't have time to look into the details of that change. Regards, Martin From fuzzyman at voidspace.org.uk Fri May 1 11:06:08 2009 From: fuzzyman at voidspace.org.uk (Michael Foord) Date: Fri, 01 May 2009 10:06:08 +0100 Subject: [Python-Dev] PEP 383 and GUI libraries In-Reply-To: References: <49F965DB.6050601@v.loewis.de> <49F96770.4080206@g.nevcal.com> <49F96B80.5090808@v.loewis.de> Message-ID: <49FABB80.8050301@voidspace.org.uk> Zooko O'Whielacronx wrote: > [snip...] > Would it be possible for Python unicode objects to have a flag > indicating whether the 'python-escape' error handler was present? That > would serve the same purpose as my "failed_decode" flag above, and would > basically allow me to use the Python APIs directory and make all this > work-around code disappear. > > Failing that, I can't see any way to use the os.listdir() in its > unicode-oriented mode to satisfy Tahoe's requirements. 
> > If you take the above code and then add the fact that you want to use > the failed_decode flag when *encoding* the d argument to os.listdir(), > then you get this code: [2]. > > Oh, I just realized that I *could* use the PEP 383 os.listdir(), like > this: > > def listdir(d): > fse = sys.getfilesystemencoding() > if fse == 'utf-8b': > fse = 'utf-8' > ns = [] > for fn in os.listdir(d): > bytes = fn.encode(fse, 'python-escape') > try: > ns.append(FName(bytes.decode(fse, 'strict'))) > except UnicodeDecodeError: > ns.append(FName(fn.decode('utf-8', 'python-escape'), > failed_decode=True)) > return ns > > (And I guess I could define listdir() like this only on the > non-unicode-safe platforms, as above.) > > However, that strikes me as even more horrible than the previous > "listdir()" work-around, in part because it means decoding, re-encoding, > and re-decoding every name, so I think I would stick with the previous > version. > The current unicode mode would skip the filenames you are interested (those that fail to decode correctly) - so you would have been forced to use the bytes mode. If you need access to the original bytes then you should continue to do this. PEP-383 is entirely neutral for your use case as far as I can see. Michael > Oh, one more note: for Tahoe's purposes you can, in all of the code > above, replace ".decode('utf-8', 'python-replace')" with > ".decode('windows-1252')" and it works just as well. While UTF-8b seems > like a really cool hack, and it would produce more legible results if > utf-8-encoded strings were partially corrupted, I guess I should just > use 'windows-1252' which is already implemented in Python 2 (as well as > in all other software in the world). > > I guess this means that PEP 383, which I have approved of and liked so > far in this discussion, would actually not help Tahoe at all and would > in fact harm Tahoe -- I would have to remember to detect and work-around > the automatic 'utf-8b' filesystem encoding when porting Tahoe to Python > 3. > > If anyone else has a concrete, real use case which would be helped by > PEP 383, I would like to hear about it. Perhaps Tahoe can learn > something from it. > > Oh, if this PEP could be extended to add a flag to each unicode object > indicating whether it was created with the python-escape handler or not, > then it would be useful to me. > > Regards, > > Zooko > > [1] http://mail.python.org/pipermail/python-dev/2009-April/089020.html > [2] http://allmydata.org/trac/tahoe/attachment/ticket/534/fsencode.3.py > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/fuzzyman%40voidspace.org.uk > -- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog From rdmurray at bitdance.com Fri May 1 13:13:24 2009 From: rdmurray at bitdance.com (R. David Murray) Date: Fri, 1 May 2009 07:13:24 -0400 (EDT) Subject: [Python-Dev] PEP 383 and GUI libraries In-Reply-To: References: <49F965DB.6050601@v.loewis.de> <49F96770.4080206@g.nevcal.com> <49F96B80.5090808@v.loewis.de> Message-ID: On Thu, 30 Apr 2009 at 23:44, Zooko O'Whielacronx wrote: > Would it be possible for Python unicode objects to have a flag > indicating whether the 'python-escape' error handler was present? That Unless I'm misunderstanding something, couldn't you implement what you need by looking in a given string for the half surrogates? 
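A rough sketch of such a check (assuming PEP 383's scheme of smuggling each
undecodable byte as a lone surrogate in the range U+DC80..U+DCFF; the helper
name here is made up for illustration):

    def was_escaped(name):
        # True if the string contains any of the half surrogates that the
        # 'python-escape' handler produces for undecodable bytes.
        return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)
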
If you find one, you have a string python-escape modified, if you don't, it didn't. What does Tahoe do on Windows when it gets a filename that is not valid Unicode? You might not even have to conditionalize the above code on platform (ie: instead you have a generalized is_valid_unicode test function that you always use). --David From martin at v.loewis.de Fri May 1 17:16:16 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 01 May 2009 17:16:16 +0200 Subject: [Python-Dev] Deferring PEP 382 Message-ID: <49FB1240.50403@v.loewis.de> During Guido's review, we discovered that PEP 382 doesn't deal with PEP 302 loaders; I believe that it should, though. Rather than coming up with an ad-hoc design, I propose to defer the PEP to Python 3.2 - unless somebody can propose a straight-forward design with not too many new interfaces. FWIW, my own approach would be to add two new interfaces to loaders: 1. extend the package path according to .pth files available to the loader (alternatively, provide the contents of the .pth files of the package in question) 2. search for and execute a package initialization module. Regards, Martin From stephen at xemacs.org Fri May 1 17:36:39 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 02 May 2009 00:36:39 +0900 Subject: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces In-Reply-To: <36EBC80A-EBF2-4C4E-B948-48AA30E63911@fuhm.net> References: <49EEBE2E.3090601@v.loewis.de> <49F184C6.8000905@g.nevcal.com> <49F30083.5050506@v.loewis.de> <49F559A4.8050400@g.nevcal.com> <49F60A8A.8090603@v.loewis.de> <49F63B19.7010306@g.nevcal.com> <49F6799F.5030208@v.loewis.de> <875E02B9-00AA-47E0-AA68-66C2B62DBF33@fuhm.net> <49F6A71A.3020809@v.loewis.de> <873CC8F9-879C-4146-91D5-072ACA4D4D9B@fuhm.net> <49F97275.3010307@v.loewis.de> <36EBC80A-EBF2-4C4E-B948-48AA30E63911@fuhm.net> Message-ID: <87skjoj0mw.fsf@uwakimon.sk.tsukuba.ac.jp> James Y Knight writes: > in python. It seems like the most common reason why people want to use > SJIS is to make old pre-unicode apps work right in WINE -- in which > case it doesn't actually affect unix python at all. Mounting external drives, especially USB memory sticks which tend to be FAT-initialized by the manufacturers, is another common case. But I don't understand why PEP 383 needs to care at all. From zookog at gmail.com Fri May 1 17:31:01 2009 From: zookog at gmail.com (Zooko O'Whielacronx) Date: Fri, 1 May 2009 09:31:01 -0600 Subject: [Python-Dev] PEP 383 and GUI libraries In-Reply-To: References: <49F965DB.6050601@v.loewis.de> <49F96770.4080206@g.nevcal.com> <49F96B80.5090808@v.loewis.de> Message-ID: Following-up to my own post to correct a major error: On Thu, Apr 30, 2009 at 11:44 PM, Zooko O'Whielacronx wrote: > Folks: > > My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary > binary names from the filesystem and store them so that I can regenerate > the same byte string later, but it also requires that I *know* whether > what I got was a valid string in the expected encoding (which might be > utf-8) or whether it was not and I need to fall back to storing the > bytes. Okay, I am wrong about this. Having a flag to remember whether I had to fall back to the utf-8b trick is one method to implement my requirement, but my actual requirement is this: Requirement: either the unicode string or the bytes are faithfully transmitted from one system to another. 
That is: if you read a filename from the filesystem, and transmit that filename to another system and use it, then there are two cases: Requirement 1: the byte string was valid in the encoding of source system, in which case the unicode name is faithfully transmitted (i.e. the bytes that finally land on the target system are the result of sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding). Requirement 2: the byte string was not valid in the encoding of source system, in which case the bytes are faithfully transmitted (i.e. the bytes that finally land on the target system are the same as the bytes that originated in the source system). Now I finally understand how fiendishly clever MvL's PEP 383 generalization of Markus Kuhn's utf-8b trick is! The only thing necessary to achieve both of those requirements above is that the 'python-escape' error handler is used on the target system .encode() as well as on the source system .decode()! Well, I'm going to have to let this sink in and maybe write some code to see if I really understand it. But if this is right, then I can do away with some of the mechanism that I've built up, and instead: Backport PEP 383 to Python 2. And, document the PEP 383 trick in some generic, widely respected format such as an Internet Draft so that I can explain to other users of the Tahoe data (many of whom use other languages than Python) what they have to do if they find invalid utf-8 in the data. Oh good, I just realized that Tahoe emits only utf-8, so all I have to do is point them to the utf-8b documents (such as they are) and explain that to read filenames produced by Tahoe they have to implement utf-8b. That's really good that they don't have to implement MvL's generalization of that trick to other encodings, since utf-8b is already understood by some folks. Okay, I find it surprisingly easy to make subtle errors in this encoding stuff, so please let me know if you spot one. Is it true that srcbytes.encode(srcencoding, 'python-escape').decode('utf-8', 'python-escape') will always produce srcbytes ? That is my Requirement 2. Regards, Zooko From google at mrabarnett.plus.com Fri May 1 17:33:47 2009 From: google at mrabarnett.plus.com (MRAB) Date: Fri, 01 May 2009 16:33:47 +0100 Subject: [Python-Dev] Oddity PEP 0 key Message-ID: <49FB165B.9070909@mrabarnett.plus.com> I've just noticed an oddity in the key in PEP 0. Most letters are used more than once. Wouldn't it be clearer if different letters were used for "Accepted" and "Active" instead of them both being 'A', for example? 
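A small sketch of the round-trip that does hold -- decode and then re-encode
with the same codec and the same error handler (shown with the
'surrogateescape' spelling the handler eventually took in CPython 3.1; the PEP
draft calls it 'python-escape'):

    src = b'ok \xc3\xa9 then junk \xff'              # not valid UTF-8 as a whole
    text = src.decode('utf-8', 'surrogateescape')    # the \xff byte becomes U+DCFF
    assert text.encode('utf-8', 'surrogateescape') == src   # original bytes come back
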
-> A - Accepted proposal -> R - Rejected proposal W - Withdrawn proposal -> D - Deferred proposal F - Final proposal -> A - Active proposal -> D - Draft proposal -> R - Replaced proposal From google at mrabarnett.plus.com Fri May 1 17:52:50 2009 From: google at mrabarnett.plus.com (MRAB) Date: Fri, 01 May 2009 16:52:50 +0100 Subject: [Python-Dev] PEP 383 and GUI libraries In-Reply-To: References: <49F965DB.6050601@v.loewis.de> <49F96770.4080206@g.nevcal.com> <49F96B80.5090808@v.loewis.de> Message-ID: <49FB1AD2.9010704@mrabarnett.plus.com> Zooko O'Whielacronx wrote: > Following-up to my own post to correct a major error: > > > On Thu, Apr 30, 2009 at 11:44 PM, Zooko O'Whielacronx wrote: >> Folks: >> >> My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary >> binary names from the filesystem and store them so that I can regenerate >> the same byte string later, but it also requires that I *know* whether >> what I got was a valid string in the expected encoding (which might be >> utf-8) or whether it was not and I need to fall back to storing the >> bytes. > > Okay, I am wrong about this. Having a flag to remember whether I had to > fall back to the utf-8b trick is one method to implement my requirement, > but my actual requirement is this: > > Requirement: either the unicode string or the bytes are faithfully > transmitted from one system to another. > > That is: if you read a filename from the filesystem, and transmit that > filename to another system and use it, then there are two cases: > > Requirement 1: the byte string was valid in the encoding of source > system, in which case the unicode name is faithfully transmitted > (i.e. the bytes that finally land on the target system are the result of > sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding). > > Requirement 2: the byte string was not valid in the encoding of source > system, in which case the bytes are faithfully transmitted (i.e. the > bytes that finally land on the target system are the same as the bytes > that originated in the source system). > > Now I finally understand how fiendishly clever MvL's PEP 383 > generalization of Markus Kuhn's utf-8b trick is! The only thing > necessary to achieve both of those requirements above is that the > 'python-escape' error handler is used on the target system .encode() as > well as on the source system .decode()! > > Well, I'm going to have to let this sink in and maybe write some code to > see if I really understand it. > > But if this is right, then I can do away with some of the mechanism that > I've built up, and instead: > > Backport PEP 383 to Python 2. > > And, document the PEP 383 trick in some generic, widely respected format > such as an Internet Draft so that I can explain to other users of the > Tahoe data (many of whom use other languages than Python) what they have > to do if they find invalid utf-8 in the data. Oh good, I just realized > that Tahoe emits only utf-8, so all I have to do is point them to the > utf-8b documents (such as they are) and explain that to read filenames > produced by Tahoe they have to implement utf-8b. That's really good > that they don't have to implement MvL's generalization of that trick to > other encodings, since utf-8b is already understood by some folks. > > > Okay, I find it surprisingly easy to make subtle errors in this encoding > stuff, so please let me know if you spot one. Is it true that > srcbytes.encode(srcencoding, 'python-escape').decode('utf-8', > 'python-escape') will always produce srcbytes ? 
That is my Requirement > 2. > No, but srcbytes.encode('utf-8', 'python-escape').decode('utf-8', 'python-escape') == srcbytes. The encodings on both ends need to be the same. For example: >>> b'\x80'.decode('windows-1252') u'\u20ac' >>> u'\u20ac'.encode('utf-8') '\xe2\x82\xac' Currently: >>> b'\x80'.decode('utf-8') Traceback (most recent call last): File "", line 1, in b'\x80'.decode('utf-8') File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: unexpected code byte But under this PEP: >>> b'x80'.decode('utf-8', 'python-escape') u'\xdc80' >>> u'\xdc80'.encode('utf-8', 'python-escape') '\x80' From status at bugs.python.org Fri May 1 18:07:30 2009 From: status at bugs.python.org (Python tracker) Date: Fri, 1 May 2009 18:07:30 +0200 (CEST) Subject: [Python-Dev] Summary of Python tracker Issues Message-ID: <20090501160730.695547822F@psf.upfronthosting.co.za> ACTIVITY SUMMARY (04/24/09 - 05/01/09) Python tracker at http://bugs.python.org/ To view or respond to any of the issues listed below, click on the issue number. Do NOT respond to this message. 2190 open (+34) / 15527 closed (+29) / 17717 total (+63) Open issues with patches: 861 Average duration of open issues: 645 days. Median duration of open issues: 394 days. Open Issues Breakdown open 2156 (+33) pending 33 ( +1) Issues Created Or Reopened (63) _______________________________ os.path.walk fails to descend into a directory whose name ends w 04/24/09 CLOSED http://bugs.python.org/issue5832 created linuxelf readline update 04/24/09 http://bugs.python.org/issue5833 created jrevans1 patch The word "error" used instead of "failure" 04/25/09 CLOSED http://bugs.python.org/issue5834 created kurtmckee Deprecate PyOS_ascii_formatd 04/25/09 CLOSED http://bugs.python.org/issue5835 created eric.smith Clean up float parsing code for nans and infs 04/25/09 CLOSED http://bugs.python.org/issue5836 created marketdickinson support.EnvironmentVarGuard broken 04/25/09 CLOSED http://bugs.python.org/issue5837 created doerwalter easy Test issue 04/25/09 CLOSED http://bugs.python.org/issue5838 created ajaksu2 RegOpenKeyEx key failed on Vista 64Bit with return 2 04/25/09 http://bugs.python.org/issue5839 created makursi "Thread State and the Global Interpreter Lock" section of the do 04/25/09 http://bugs.python.org/issue5840 created exarkun patch add py3k warnings to commands 04/25/09 CLOSED http://bugs.python.org/issue5841 created dsm001 patch Move test outside of urlparse module 04/25/09 http://bugs.python.org/issue5842 created Merwok Possible normalization error in urlparse.urlunparse 04/25/09 http://bugs.python.org/issue5843 created Merwok internal error on write while reading 04/25/09 http://bugs.python.org/issue5844 created dsm001 patch rlcompleter should be enabled automatically 04/25/09 http://bugs.python.org/issue5845 created cben Deprecate obsolete functions in unittest 04/25/09 http://bugs.python.org/issue5846 created michael.foord IDLE/Win Installer: drop -n switch for 2.7/3.1; install 3.1 as i 04/26/09 http://bugs.python.org/issue5847 created kbk Minor unittest doc patch 04/26/09 CLOSED http://bugs.python.org/issue5848 created michael.foord patch, patch, easy, needs review Idle 3.01 - invalid syntec error 04/26/09 CLOSED http://bugs.python.org/issue5849 created r2d2floyd Full example for emulating a container type 04/27/09 CLOSED http://bugs.python.org/issue5850 created yaneurabeya Add a stream parameter to gc.set_debug 
04/27/09 http://bugs.python.org/issue5851 created nicdumz can't use "glog" to find the path with square bracket 04/27/09 CLOSED http://bugs.python.org/issue5852 created winterTTr mimetypes.guess_type() hits recursion limit 04/27/09 CLOSED http://bugs.python.org/issue5853 created djc logging module's __all__ attribute not in sync with documentatio 04/27/09 CLOSED http://bugs.python.org/issue5854 created flub easy Perhaps exponential performance of sum(listoflists, []) 04/27/09 CLOSED http://bugs.python.org/issue5855 created sjohn Minor typo in traceback example 04/27/09 CLOSED http://bugs.python.org/issue5856 created nielsdevos patch Return namedtuples from tokenize token generator 04/27/09 CLOSED http://bugs.python.org/issue5857 created mallyvai needs review Make complex repr and str more like float repr and str 04/27/09 http://bugs.python.org/issue5858 created marketdickinson Remove implicit '%f' -> '%g' switch from float formatting. 04/27/09 CLOSED http://bugs.python.org/issue5859 created marketdickinson patch TextIOWrapper: bad error reporting when write() is forbidden 04/27/09 CLOSED http://bugs.python.org/issue5860 created pitrou test_urllib fails on windows 04/28/09 http://bugs.python.org/issue5861 created ocean-city multiprocessing 'using a remote manager' example errors and poss 04/28/09 http://bugs.python.org/issue5862 created r.david.murray bz2.BZ2File should accept other file-like objects. 04/28/09 http://bugs.python.org/issue5863 created MizardX format(1234.5, '.4') gives misleading result 04/28/09 http://bugs.python.org/issue5864 created marketdickinson patch mathmodule.c fails to compile due to missing math_log1p() functi 04/28/09 CLOSED http://bugs.python.org/issue5865 created alanh cPickle defect with tuples and different from pickle output 04/28/09 http://bugs.python.org/issue5866 created jelle No way to create an abstract classmethod 04/28/09 http://bugs.python.org/issue5867 created della mimetypes.MAGIC_FUNCTION initialization not thread-safe in Pytho 04/28/09 CLOSED http://bugs.python.org/issue5868 created apoirier 100th character truncation in 2.4 tarfile.py 04/28/09 CLOSED http://bugs.python.org/issue5869 created neville.bagnall patch subprocess.DEVNULL 04/28/09 http://bugs.python.org/issue5870 created MrJean1 email.header.Header allow to embed raw newlines into a message 04/28/09 http://bugs.python.org/issue5871 created jwilk New C API for declaring Python types 04/29/09 http://bugs.python.org/issue5872 created larry patch Minidom: parsestring() error 04/29/09 CLOSED http://bugs.python.org/issue5873 created naf305 distutils.tests.test_config_cmd is locale-sensitive 04/29/09 CLOSED http://bugs.python.org/issue5874 created georg.brandl test_distutils failing on OpenSUSE 10.3, Py3k 04/29/09 http://bugs.python.org/issue5875 created ShuaibKhan __repr__ returning unicode doesn't work when called implicitly 04/29/09 http://bugs.python.org/issue5876 created liori Add a function for updating URL query parameters 04/29/09 http://bugs.python.org/issue5877 created mrts Regular Expression instances 04/29/09 CLOSED http://bugs.python.org/issue5878 created ecasbas multiprocessing - example "pool of http servers " fails on windo 04/29/09 http://bugs.python.org/issue5879 created ghum Remove unneeded "context" pointer from getters and setters 04/29/09 http://bugs.python.org/issue5880 created larry patch Remove extraneous backwards-compatibility attributes from some m 04/29/09 http://bugs.python.org/issue5881 created larry patch __repr__ is ignored when formatting exceptions 04/29/09 
CLOSED http://bugs.python.org/issue5882 created ellisj detach() implementation 04/29/09 http://bugs.python.org/issue5883 created benjamin.peterson patch pydoc to return error status code 04/30/09 http://bugs.python.org/issue5884 created mixmastamyk uuid.uuid1() is too slow 04/30/09 http://bugs.python.org/issue5885 created wangchun curses/__init__.py: global name '_os' is not defined 04/30/09 CLOSED http://bugs.python.org/issue5886 created andrix patch mmap.write_byte out of bounds - no error, position gets screwed 04/30/09 http://bugs.python.org/issue5887 created bmearns mmap ehancement - resize with sequence notation 04/30/09 http://bugs.python.org/issue5888 created bmearns Extra comma in enum - fails on AIX 04/30/09 CLOSED http://bugs.python.org/issue5889 created srid Subclassing property doesn't preserve the auto __doc__ behavior 04/30/09 http://bugs.python.org/issue5890 created gsakkis strange list.sort() behavior on import, del and inport again 05/01/09 CLOSED http://bugs.python.org/issue5891 created dstemmer strange list.sort() behavior on import, del and inport again 05/01/09 CLOSED http://bugs.python.org/issue5892 created dstemmer Add support to pydoc to output .rst restructured text 05/01/09 http://bugs.python.org/issue5893 created gregory.p.smith Lookup of localised language name by ISO 639 language code and r 05/01/09 http://bugs.python.org/issue5894 created pander Issues Now Closed (104) _______________________ pyvm module patch 515 days http://bugs.python.org/issue1522 benjamin.peterson patch Bad OOB data management when using asyncore with select.poll() 514 days http://bugs.python.org/issue1541 georg.brandl patch str.format() wrongly formats complex() numbers (Py30a2) 505 days http://bugs.python.org/issue1588 eric.smith patch sqlite3 docs should mention utf8 requirement 434 days http://bugs.python.org/issue2127 georg.brandl patch, easy aifc cannot handle unrecognised chunk type "CHAN" 419 days http://bugs.python.org/issue2245 r.david.murray easy float compared to decimal is silently incorrect. 34 days http://bugs.python.org/issue2531 jdunck patch 3.0 pickle docs -- what about old-style classes? 
385 days http://bugs.python.org/issue2572 georg.brandl PyString_FromStringAndSize() to be considered unsafe 384 days http://bugs.python.org/issue2587 iankko Python does not accept unicode keywords 375 days http://bugs.python.org/issue2646 ajaksu2 26backport ctypes defines global symbols 316 days http://bugs.python.org/issue3102 theller patch Wish: disable tests in unittest 304 days http://bugs.python.org/issue3202 benjamin.peterson patch various doc typos 291 days http://bugs.python.org/issue3320 georg.brandl patch file.readline: bad exception recovery 260 days http://bugs.python.org/issue3521 benjamin.peterson patch, easy Tuple comparison masking exception 226 days http://bugs.python.org/issue3829 rhettinger idle should be installed as idle3.0 220 days http://bugs.python.org/issue3896 ajaksu2 smtplib cannot sendmail over TLS 217 days http://bugs.python.org/issue3921 ajaksu2 patch, easy Python 2.6 Doc/tools folder bigger than in 2.6rc2 205 days http://bugs.python.org/issue4013 georg.brandl C/API documentation: request for documentation of change to Py_s 196 days http://bugs.python.org/issue4129 asmodai patch Email example should use SMTP.quit() rather than SMTP.close() 181 days http://bugs.python.org/issue4239 asmodai ctypes could include data type limits 145 days http://bugs.python.org/issue4538 theller Need to rework the dbm lib/include selection process 144 days http://bugs.python.org/issue4587 doko patch, needs review Idle for Python 3.0 is default even without doing make fullinsta 129 days http://bugs.python.org/issue4693 ajaksu2 failure in test_httpservers 101 days http://bugs.python.org/issue4951 tarek patch Incorrect title case 98 days http://bugs.python.org/issue4971 loewis Specifying common controls DLL in manifest 97 days http://bugs.python.org/issue5019 robind ctypes unwilling to allow pickling wide character 90 days http://bugs.python.org/issue5049 theller patch Inadequate documentation of the built-in function open 91 days http://bugs.python.org/issue5061 georg.brandl IDLE improve Subprocess Startup Error message 91 days http://bugs.python.org/issue5065 ajaksu2 Avoid redundant call to FormatError() 88 days http://bugs.python.org/issue5078 theller patch indentation in IDLE 2.6 different from IDLE 2.5, 2.4 or vim 82 days http://bugs.python.org/issue5129 kbk patch, 26backport wrong paths for ctypes cleanup 78 days http://bugs.python.org/issue5161 theller setting __class__ in __del__ is bad. mmkay. negative ref count! 67 days http://bugs.python.org/issue5283 benjamin.peterson patch email/base64mime.py cannot work 67 days http://bugs.python.org/issue5304 ajaksu2 easy ctypes configuration fails on mips-linux (and probably Irix) 41 days http://bugs.python.org/issue5507 theller test_math.testFsum failure on release30-maint 26 days http://bugs.python.org/issue5593 marketdickinson file "" on disk creates garbage output in stack trace 26 days http://bugs.python.org/issue5668 ajaksu2 shutils test fails on ZFS (on FUSE, on Linux) 27 days http://bugs.python.org/issue5676 benjamin.peterson patch inspect.findsource() should look only for sources 13 days http://bugs.python.org/issue5742 ajaksu2 patch idle pydoc et al removed from 3.1 without versioned replacements 11 days http://bugs.python.org/issue5756 kbk IDLE cannot find windows chm file 8 days http://bugs.python.org/issue5783 kbk patch, 26backport Rationalize isdigit / isalpha / tolower / ... 
uses throughout Py 8 days http://bugs.python.org/issue5793 eric.smith easy test_distutils fails - sysconfig._config_vars is None 3 days http://bugs.python.org/issue5810 tarek Fix five small bugs in the bininstall and altbininstall pseudota 3 days http://bugs.python.org/issue5818 benjamin.peterson patch Documentation: mention 'close' and iteration for tarfile.TarFile 2 days http://bugs.python.org/issue5821 georg.brandl patch new unittest function listed as assertIsNotNot() instead of asse 2 days http://bugs.python.org/issue5826 michael.foord Invalid behavior of unicode.lower 1 days http://bugs.python.org/issue5828 loewis patch heapq item comparison problematic with sched's events 0 days http://bugs.python.org/issue5830 rhettinger os.path.walk fails to descend into a directory whose name ends w 0 days http://bugs.python.org/issue5832 potten The word "error" used instead of "failure" 0 days http://bugs.python.org/issue5834 georg.brandl Deprecate PyOS_ascii_formatd 2 days http://bugs.python.org/issue5835 eric.smith Clean up float parsing code for nans and infs 2 days http://bugs.python.org/issue5836 marketdickinson support.EnvironmentVarGuard broken 0 days http://bugs.python.org/issue5837 doerwalter easy Test issue 0 days http://bugs.python.org/issue5838 marketdickinson add py3k warnings to commands 0 days http://bugs.python.org/issue5841 georg.brandl patch Minor unittest doc patch 1 days http://bugs.python.org/issue5848 georg.brandl patch, patch, easy, needs review Idle 3.01 - invalid syntec error 0 days http://bugs.python.org/issue5849 doerwalter Full example for emulating a container type 2 days http://bugs.python.org/issue5850 rhettinger can't use "glog" to find the path with square bracket 0 days http://bugs.python.org/issue5852 amaury.forgeotdarc mimetypes.guess_type() hits recursion limit 1 days http://bugs.python.org/issue5853 pitrou logging module's __all__ attribute not in sync with documentatio 0 days http://bugs.python.org/issue5854 vsajip easy Perhaps exponential performance of sum(listoflists, []) 0 days http://bugs.python.org/issue5855 pitrou Minor typo in traceback example 0 days http://bugs.python.org/issue5856 georg.brandl patch Return namedtuples from tokenize token generator 1 days http://bugs.python.org/issue5857 rhettinger needs review Remove implicit '%f' -> '%g' switch from float formatting. 
4 days http://bugs.python.org/issue5859 marketdickinson patch TextIOWrapper: bad error reporting when write() is forbidden 0 days http://bugs.python.org/issue5860 benjamin.peterson mathmodule.c fails to compile due to missing math_log1p() functi 0 days http://bugs.python.org/issue5865 marketdickinson mimetypes.MAGIC_FUNCTION initialization not thread-safe in Pytho 0 days http://bugs.python.org/issue5868 pitrou 100th character truncation in 2.4 tarfile.py 0 days http://bugs.python.org/issue5869 neville.bagnall patch Minidom: parsestring() error 0 days http://bugs.python.org/issue5873 georg.brandl distutils.tests.test_config_cmd is locale-sensitive 0 days http://bugs.python.org/issue5874 tarek Regular Expression instances 0 days http://bugs.python.org/issue5878 georg.brandl __repr__ is ignored when formatting exceptions 0 days http://bugs.python.org/issue5882 benjamin.peterson curses/__init__.py: global name '_os' is not defined 0 days http://bugs.python.org/issue5886 amaury.forgeotdarc patch Extra comma in enum - fails on AIX 1 days http://bugs.python.org/issue5889 srid strange list.sort() behavior on import, del and inport again 0 days http://bugs.python.org/issue5891 loewis strange list.sort() behavior on import, del and inport again 0 days http://bugs.python.org/issue5892 loewis Fix for bugs relating to ntpath.expanduser() 210 days http://bugs.python.org/issue957650 gjb1002 patch urllib2 http auth 1689 days http://bugs.python.org/issue1025540 gregory.p.smith endianness detection fails on IRIX 5.3 1617 days http://bugs.python.org/issue1070140 ajaksu2 proposed patch for tls wrapped ssl support added to smtplib 1417 days http://bugs.python.org/issue1217246 ajaksu2 patch MSI installer does not pass values as SecureProperty from UI 1311 days http://bugs.python.org/issue1298962 ajaksu2 Integer bit operations performance improvement. 1073 days http://bugs.python.org/issue1492860 marketdickinson easy test_float segfaults with SIGFPE on FreeBSD 6.0 / Alpha 1066 days http://bugs.python.org/issue1496032 marketdickinson Use dynload_shlib on newer HP-UX versions 1026 days http://bugs.python.org/issue1516897 ajaksu2 Allowing multiple instances of IDLE with sub-processes 1004 days http://bugs.python.org/issue1529142 kbk patch Tracing and profiling functions can cause hangs in threads 999 days http://bugs.python.org/issue1531859 ajaksu2 patch Tru64 make install failure 954 days http://bugs.python.org/issue1558802 ajaksu2 Install on WinXP always goes to C:\ 943 days http://bugs.python.org/issue1565468 ajaksu2 Modules/readline.c fails to compile on AIX 4.2 891 days http://bugs.python.org/issue1597798 ajaksu2 Would you mind renaming object.h to pyobject.h? 
844 days http://bugs.python.org/issue1626545 ajaksu2 patch Python 2.5 gets curses.h warning on HPUX 824 days http://bugs.python.org/issue1642054 ajaksu2 proxy_bypass in urllib handling of macro 821 days http://bugs.python.org/issue1648102 orsenthil patch, easy HP-UX: compiler warnings: alignment 815 days http://bugs.python.org/issue1649011 ajaksu2 Python package support not properly documented 715 days http://bugs.python.org/issue1719423 georg.brandl Document effects of PY_SSIZE_T_CLEAN on argument parsing 693 days http://bugs.python.org/issue1729742 loewis Solaris 64 bit LD_LIBRARY_PATH_64 needs to be set 687 days http://bugs.python.org/issue1733484 ajaksu2 Modules/ld_so_aix needs to strip path off of whichcc call 687 days http://bugs.python.org/issue1733509 ajaksu2 zlib configure behaves differently than main configure 687 days http://bugs.python.org/issue1733513 ajaksu2 HP shared object option 687 days http://bugs.python.org/issue1733523 ajaksu2 HP automatic build of zlib 687 days http://bugs.python.org/issue1733532 ajaksu2 HP 64 bit does not run 687 days http://bugs.python.org/issue1733544 ajaksu2 AIX shared object build of python 2.5 does not work 687 days http://bugs.python.org/issue1733546 ajaksu2 Fast path for unicodedata.normalize() 688 days http://bugs.python.org/issue1734234 pitrou patch Python - Operation time out problem 628 days http://bugs.python.org/issue1768858 ajaksu2 Top Issues Most Discussed (10) ______________________________ 28 str.format() wrongly formats complex() numbers (Py30a2) 505 days closed http://bugs.python.org/issue1588 12 support.EnvironmentVarGuard broken 0 days closed http://bugs.python.org/issue5837 10 mathmodule.c fails to compile due to missing math_log1p() funct 0 days closed http://bugs.python.org/issue5865 10 format(1234.5, '.4') gives misleading result 3 days open http://bugs.python.org/issue5864 9 Invalid behavior of unicode.lower 1 days closed http://bugs.python.org/issue5828 9 IDLE cannot find windows chm file 8 days closed http://bugs.python.org/issue5783 8 failure in test_httpservers 101 days closed http://bugs.python.org/issue4951 7 mimetypes.guess_type() hits recursion limit 1 days closed http://bugs.python.org/issue5853 7 C/API documentation: request for documentation of change to Py_ 196 days closed http://bugs.python.org/issue4129 6 detach() implementation 2 days open http://bugs.python.org/issue5883 From chris at simplistix.co.uk Fri May 1 18:26:29 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Fri, 01 May 2009 17:26:29 +0100 Subject: [Python-Dev] .pth files are evil In-Reply-To: <49E60832.8030806@egenix.com> References: <49D4DA72.60401@v.loewis.de> <49D52115.6020001@egenix.com> <49D66C6E.3090602@v.loewis.de> <49DB475B.8060504@egenix.com> <20090407140317.EBD383A4063@sparrow.telecommunity.com> <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> Message-ID: <49FB22B5.3040507@simplistix.co.uk> M.-A. Lemburg wrote: > """ > If the package really requires adding one or more directories on sys.path (e.g. > because it has not yet been structured to support dotted-name import), a "path > configuration file" named package.pth can be placed in either the site-python or > site-packages directory. > ... 
> A typical installation should have no or very few .pth files or something is > wrong, and if you need to play with the search order, something is very wrong. > """ I'll say! I think .pth files are absolute evil and I wish they could just be banned. +1 on anything that makes them closer to going away or reduces the possibility of yet another similar feature from hurting the comprehensibility of a python setup. Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From chris at simplistix.co.uk Fri May 1 18:30:16 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Fri, 01 May 2009 17:30:16 +0100 Subject: [Python-Dev] PEP 382: little help for stupid people? In-Reply-To: <49E60832.8030806@egenix.com> References: <49D4DA72.60401@v.loewis.de> <49D52115.6020001@egenix.com> <49D66C6E.3090602@v.loewis.de> <49DB475B.8060504@egenix.com> <20090407140317.EBD383A4063@sparrow.telecommunity.com> <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> Message-ID: <49FB2398.5000708@simplistix.co.uk> M.-A. Lemburg wrote: > The much more common use case is that of wanting to have a base package > installation which optional add-ons that live in the same logical > package namespace. > > The PEP provides a way to solve this use case by giving both developers > and users a standard at hand which they can follow without having to > rely on some non-standard helpers and across Python implementations. > > My proposal tries to solve this without adding yet another .pth > file like mechanism - hopefully in the spirit of the original Python > package idea. Okay, I need to issue a plea for a little help. I think I kinda get what this PEP is about now, and as someone who wants to ship a base package with several add-ons that live in the same logical package namespace, I'm very interested. However, despite trying to follow this thread *and* having tried to read the PEP a couple of times, I still don't know how I'd go about doing this. I did give some examples from what I'd be looking to do much earlier. I'll ask again in the vague hope of you or someone else explaining things to me like I'm a 5 year old - something I'm mentally equipped to be well ;-) In either of the proposals on the table, what code would I write and where to have a base package with a set of add-on packages? Simple examples would be greatly appreciated, and might bring things into focus for some of the less mentally able bystanders - like myself! 
cheers, Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From chris at simplistix.co.uk Fri May 1 18:32:14 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Fri, 01 May 2009 17:32:14 +0100 Subject: [Python-Dev] PEP 382: Namespace Packages In-Reply-To: <20090415175704.966B13A4100@sparrow.telecommunity.com> References: <49D4DA72.60401@v.loewis.de> <49D52115.6020001@egenix.com> <49D66C6E.3090602@v.loewis.de> <49DB475B.8060504@egenix.com> <20090407140317.EBD383A4063@sparrow.telecommunity.com> <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <20090415175704.966B13A4100@sparrow.telecommunity.com> Message-ID: <49FB240E.8030905@simplistix.co.uk> P.J. Eby wrote: > At 06:15 PM 4/15/2009 +0200, M.-A. Lemburg wrote: >> The much more common use case is that of wanting to have a base package >> installation which optional add-ons that live in the same logical >> package namespace. > > Please see the large number of Zope and PEAK distributions on PyPI as > minimal examples that disprove this being the common use case. If you mean "the common use case as opposed to having code in the __init__.py of the namespace package", I think you'll find that's because people (especially me!) don't know how to do this, not because we don't want to! Chris - who would actually like to know how to do this, with or without the PEP, and how to indicate interdependencies in situations like this to setuptools... -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From chris at simplistix.co.uk Fri May 1 18:35:43 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Fri, 01 May 2009 17:35:43 +0100 Subject: [Python-Dev] PEP 382: Namespace Packages In-Reply-To: <20090415192021.558E53A4119@sparrow.telecommunity.com> References: <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <20090415175704.966B13A4100@sparrow.telecommunity.com> <20090415185221.GB13696@amk-desktop.matrixgroup.net> <20090415192021.558E53A4119@sparrow.telecommunity.com> Message-ID: <49FB24DF.2020701@simplistix.co.uk> P.J. Eby wrote: > It's unclear, however, who is using base packages besides mx.* and ll.*, > although I'd guess from the PyPI listings that perhaps Django is. (It > seems that "base" packages are more likely to use a 'base-extension' > naming pattern, vs. the 'namespace.project' pattern used by "pure" > packages.) I'll stress it again in case you missed it the first time: I think the main reason people use "pure namespace" versus "base namespace" packages is because hardly anyone know how to do the latter, not because there is no desire to do so! I, for one, have been trying to figure out how to do "base namespace" packages for years... 
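For illustration, a minimal sketch of the closest thing that works without PEP 382: ship a real __init__.py in the base distribution and let it extend its own __path__ with pkgutil. The package name "simplistix", the module names and the constant are made-up examples, and this only approximates a base package, since just one __init__.py ever gets executed.

    # simplistix/__init__.py as shipped by the hypothetical *base* distribution.
    # Because this __init__ holds real code, it acts as a "base" package
    # rather than a pure namespace package.
    from pkgutil import extend_path

    # Pull in any other 'simplistix' directories found along sys.path, so
    # add-on distributions can contribute simplistix.feature1, feature2, ...
    __path__ = extend_path(__path__, __name__)

    DEFAULT_TIMEOUT = 10  # example of base-package code living next to the add-ons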
Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From martin at v.loewis.de Fri May 1 18:38:46 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 01 May 2009 18:38:46 +0200 Subject: [Python-Dev] PEP 383 and GUI libraries In-Reply-To: References: <49F965DB.6050601@v.loewis.de> <49F96770.4080206@g.nevcal.com> <49F96B80.5090808@v.loewis.de> Message-ID: <49FB2596.1090706@v.loewis.de> > Okay, I am wrong about this. Having a flag to remember whether I had to > fall back to the utf-8b trick is one method to implement my requirement, > but my actual requirement is this: > > Requirement: either the unicode string or the bytes are faithfully > transmitted from one system to another. I don't understand this requirement very well, in particular not the "faithfully" part. > That is: if you read a filename from the filesystem, and transmit that > filename to another system and use it, then there are two cases: What do you mean by "use it"? Things like opening files? How does that work? In general, a file name valid on one system is invalid on a different system - or, at least, refers to a different file over there. This is independent of encodings. > Requirement 1: the byte string was valid in the encoding of source > system, in which case the unicode name is faithfully transmitted > (i.e. the bytes that finally land on the target system are the result of > sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding). In all your descriptions, I'm puzzled as to where exactly you get the source bytes from. If you use the PEP 383 interfaces, you will start with character strings, not byte strings, always. > Okay, I find it surprisingly easy to make subtle errors in this encoding > stuff, so please let me know if you spot one. Is it true that > srcbytes.encode(srcencoding, 'python-escape').decode('utf-8', > 'python-escape') will always produce srcbytes ? I think you mixed up bytes and unicode here: if srcbytes is indeed a bytes object, then you can't apply .encode to it. Regards, Martin From martin at v.loewis.de Fri May 1 18:41:03 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 01 May 2009 18:41:03 +0200 Subject: [Python-Dev] PEP 382: little help for stupid people? In-Reply-To: <49FB2398.5000708@simplistix.co.uk> References: <49D4DA72.60401@v.loewis.de> <49D52115.6020001@egenix.com> <49D66C6E.3090602@v.loewis.de> <49DB475B.8060504@egenix.com> <20090407140317.EBD383A4063@sparrow.telecommunity.com> <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB2398.5000708@simplistix.co.uk> Message-ID: <49FB261F.9080306@v.loewis.de> > In either of the proposals on the table, what code would I write and > where to have a base package with a set of add-on packages? I don't quite understand the question. Why would you want to write code (except for the code that actually is in the packages)? PEP 382 is completely declarative - no need to write code. Regards, Martin From chris at simplistix.co.uk Fri May 1 18:58:18 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Fri, 01 May 2009 17:58:18 +0100 Subject: [Python-Dev] PEP 382: little help for stupid people? 
In-Reply-To: <49FB261F.9080306@v.loewis.de> References: <49D4DA72.60401@v.loewis.de> <49D52115.6020001@egenix.com> <49D66C6E.3090602@v.loewis.de> <49DB475B.8060504@egenix.com> <20090407140317.EBD383A4063@sparrow.telecommunity.com> <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB2398.5000708@simplistix.co.uk> <49FB261F.9080306@v.loewis.de> Message-ID: <49FB2A2A.4090606@simplistix.co.uk> Martin v. L?wis wrote: >> In either of the proposals on the table, what code would I write and >> where to have a base package with a set of add-on packages? > > I don't quite understand the question. Why would you want to write code > (except for the code that actually is in the packages)? > > PEP 382 is completely declarative - no need to write code. "code" is anything I need to write to make this work... So, what do I need to do? Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From chris at simplistix.co.uk Fri May 1 19:14:12 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Fri, 01 May 2009 18:14:12 +0100 Subject: [Python-Dev] headers api for email package In-Reply-To: References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <49E08F8C.5030205@simplistix.co.uk> Message-ID: <49FB2DE4.10008@simplistix.co.uk> >>> Where you just want "a damned valid email and stop making my life >>> hard!": >>> >>> Message['Subject']='Some text' >> >> Yes. In which case I propose we guess the encoding as 1) ascii, 2) >> utf-8, 3) wtf? Well, we're talking about Python 3 here right? In which case the above involves only unicode, so why do we need to guess anything? Just use utf-8 and be done with it... > However, it's not supposed to be used by mail composers, who are > expected to know the encoding. It's for mail gateways that are > transforming something and don't know the encoding. I'm not > sure what this means for the email module, which certainly > will be used in a mail gateways....maybe it's the responsibility > of the application code to explicitly say 'unknown encoding'? Indeed, surely this happens when you have bytes and need to do something with it? That's not what my example above is about... >>> Where you care about what encoding is used: >>> >>> Message['Subject']=Header('Some text',encoding='utf-8') >> >> Yes. ...it's covered by this. >>> If you have bytes, for whatever reason: >>> >>> Message['Subject']=b'some bytes'.decode('utf-8') >>> >>> ...because only you know what encoding those bytes use! >> >> So you're saying that __setitem__() should not accept raw bytes? Indeed :-) Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From chris at simplistix.co.uk Fri May 1 19:18:35 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Fri, 01 May 2009 18:18:35 +0100 Subject: [Python-Dev] [Email-SIG] headers api for email package In-Reply-To: <873accv5jr.fsf@xemacs.org> References: <86F681EB-2645-4C8C-B02F-06E9F4344139@python.org> <07025875-59B6-4508-96E5-BAFE4D36FF3B@python.org> <49E08F8C.5030205@simplistix.co.uk> <873accv5jr.fsf@xemacs.org> Message-ID: <49FB2EEB.1000400@simplistix.co.uk> Stephen J. 
Turnbull wrote: > > > str(message['Subject']) > > > > Yes for unstructured headers like Subject. For structured headers... > > hmm. > > Well, suppose we get really radical here. *People* see email as > (rich-)text. So ... message['Subject'] returns an object, partly to > be consistent with more complex headers' APIs, but partly to remind us > that nothing in email is as simple as it seems. Now, > str(message['Subject']) is really for presentation to the user, right? > OK, so let's make it a presentation function! Decode the MIME-words, > optionally unfold folded lines, optionally compress spaces, etc. This > by default returns the subject field as a single, possibly quite long, > line. Then a higher-level API can rewrap it, add fonts etc, for fancy > presentation. This also suggests that we don't the field tag (ie, > "Subject") to be part of this value. > > Of course a *really* smart higher-level API would access structured > headers based on their structure, not on the one-size-fits-all str() > conversion. All sounds good to me. > Then MTAs see email as a string of octets. So guess what: > > > > bytes(message['Subject']) > > gives wire format. Yow! I think I'm just joking. Right? Why? That also sounds fine to me and "feels right"... > > > Where you just want "a damned valid email and stop making my life > > > hard!": > > -1 I mean, yeah, Brother, I feel your pain but it just isn't that > easy. If that were feasible, it would be *criminal* to have a > .set_header() method at all! In fact, Don't agree... > > > Message['Subject']='Some text' > > is going to (a) need to take *only* unicodes, or (b) raise Exceptions > at the slightest provocation when handed bytes. It should only take unicodes and bitch profusely about anything else. > And things only get worse if you try to provide this interface for say > "From" (let alone "Content-Type"). Is it really worth doing the > mapping interface if it's only usable with free-form headers (ie, only > Subject among the commonly used headers)? Sure, for other headers it might *not* accept unicodes... > How do you distinguish "raw" bytes from "encoded bytes"? > __setitem__() shouldn't accept bytes at all. Right on :-) Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From martin at v.loewis.de Fri May 1 19:38:12 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 01 May 2009 19:38:12 +0200 Subject: [Python-Dev] PEP 382: little help for stupid people? In-Reply-To: <49FB2A2A.4090606@simplistix.co.uk> References: <49D4DA72.60401@v.loewis.de> <49D52115.6020001@egenix.com> <49D66C6E.3090602@v.loewis.de> <49DB475B.8060504@egenix.com> <20090407140317.EBD383A4063@sparrow.telecommunity.com> <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB2398.5000708@simplistix.co.uk> <49FB261F.9080306@v.loewis.de> <49FB2A2A.4090606@simplistix.co.uk> Message-ID: <49FB3384.1030106@v.loewis.de> >>> In either of the proposals on the table, what code would I write and >>> where to have a base package with a set of add-on packages? >> >> I don't quite understand the question. Why would you want to write code >> (except for the code that actually is in the packages)? 
>> >> PEP 382 is completely declarative - no need to write code. > > "code" is anything I need to write to make this work... > > So, what do I need to do? Ok, so create three tar files: 1. base.tar, containing simplistix/ simplistix/__init__.py 2. addon1.tar, containing simplistix/addon1.pth (containing a single "*") simplistix/feature1.py 3. addon2.tar, containing simplistix/addon2.pth simplistix/feature2.py Unpack each of them anywhere on sys.path, in any order. Regards, Martin From martin at v.loewis.de Fri May 1 19:41:39 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 01 May 2009 19:41:39 +0200 Subject: [Python-Dev] PEP 382: Namespace Packages In-Reply-To: <49FB24DF.2020701@simplistix.co.uk> References: <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <20090415175704.966B13A4100@sparrow.telecommunity.com> <20090415185221.GB13696@amk-desktop.matrixgroup.net> <20090415192021.558E53A4119@sparrow.telecommunity.com> <49FB24DF.2020701@simplistix.co.uk> Message-ID: <49FB3453.4060906@v.loewis.de> >> It's unclear, however, who is using base packages besides mx.* and >> ll.*, although I'd guess from the PyPI listings that perhaps Django >> is. (It seems that "base" packages are more likely to use a >> 'base-extension' naming pattern, vs. the 'namespace.project' pattern >> used by "pure" packages.) > > I'll stress it again in case you missed it the first time: I think the > main reason people use "pure namespace" versus "base namespace" packages > is because hardly anyone know how to do the latter, not because there is > no desire to do so! > > I, for one, have been trying to figure out how to do "base namespace" > packages for years... You mean, without PEP 382? That won't be possible, unless you can coordinate all addon packages. Base packages are a feature solely of PEP 382. Regards, Martin From pje at telecommunity.com Fri May 1 20:49:40 2009 From: pje at telecommunity.com (P.J. Eby) Date: Fri, 01 May 2009 14:49:40 -0400 Subject: [Python-Dev] PEP 382: Namespace Packages In-Reply-To: <49FB24DF.2020701@simplistix.co.uk> References: <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <20090415175704.966B13A4100@sparrow.telecommunity.com> <20090415185221.GB13696@amk-desktop.matrixgroup.net> <20090415192021.558E53A4119@sparrow.telecommunity.com> <49FB24DF.2020701@simplistix.co.uk> Message-ID: <20090501184706.66ED13A4070@sparrow.telecommunity.com> At 05:35 PM 5/1/2009 +0100, Chris Withers wrote: >P.J. Eby wrote: >>It's unclear, however, who is using base packages besides mx.* and >>ll.*, although I'd guess from the PyPI listings that perhaps Django >>is. (It seems that "base" packages are more likely to use a >>'base-extension' naming pattern, vs. the 'namespace.project' >>pattern used by "pure" packages.) 
> >I'll stress it again in case you missed it the first time: I think >the main reason people use "pure namespace" versus "base namespace" >packages is because hardly anyone know how to do the latter, not >because there is no desire to do so! I didn't say there's *no* desire, however IIRC the only person who *ever* asked on distutils-sig how to do a base package with setuptools was the author of the ll.* packages. And in the case of at least the zope.* peak.* and osaf.* namespace packages it was specifically *not* the intention to have a base __init__. From pje at telecommunity.com Fri May 1 20:51:20 2009 From: pje at telecommunity.com (P.J. Eby) Date: Fri, 01 May 2009 14:51:20 -0400 Subject: [Python-Dev] PEP 382: Namespace Packages In-Reply-To: <49FB3453.4060906@v.loewis.de> References: <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <20090415175704.966B13A4100@sparrow.telecommunity.com> <20090415185221.GB13696@amk-desktop.matrixgroup.net> <20090415192021.558E53A4119@sparrow.telecommunity.com> <49FB24DF.2020701@simplistix.co.uk> <49FB3453.4060906@v.loewis.de> Message-ID: <20090501184843.D08E43A4070@sparrow.telecommunity.com> At 07:41 PM 5/1/2009 +0200, Martin v. L?wis wrote: > >> It's unclear, however, who is using base packages besides mx.* and > >> ll.*, although I'd guess from the PyPI listings that perhaps Django > >> is. (It seems that "base" packages are more likely to use a > >> 'base-extension' naming pattern, vs. the 'namespace.project' pattern > >> used by "pure" packages.) > > > > I'll stress it again in case you missed it the first time: I think the > > main reason people use "pure namespace" versus "base namespace" packages > > is because hardly anyone know how to do the latter, not because there is > > no desire to do so! > > > > I, for one, have been trying to figure out how to do "base namespace" > > packages for years... > >You mean, without PEP 382? > >That won't be possible, unless you can coordinate all addon packages. >Base packages are a feature solely of PEP 382. Actually, if you are using only the distutils, you can do this by listing only modules in the addon projects; this is how the ll.* tools are doing it. That only works if the packages are all being installed in the same directory, though, not as eggs. 
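To make the intent of Martin's tar-file example above concrete, here is a sketch of how the unpacked portions are meant to combine under PEP 382 as proposed; this describes the proposal's intended behaviour, not anything current Python already does.

    # After unpacking base.tar, addon1.tar and addon2.tar anywhere on sys.path,
    # the simplistix/*.pth files in the add-on portions tell the import system
    # to treat every simplistix directory on sys.path as part of one package:

    import simplistix            # __init__.py comes from the base portion
    import simplistix.feature1   # feature1.py comes from the addon1 portion
    import simplistix.feature2   # feature2.py comes from the addon2 portion

    # As P.J. Eby notes, plain distutils can only achieve this when every
    # portion installs into one and the same simplistix/ directory.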
From martin at v.loewis.de Fri May 1 20:58:28 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 01 May 2009 20:58:28 +0200 Subject: [Python-Dev] PEP 382: Namespace Packages In-Reply-To: <20090501184843.D08E43A4070@sparrow.telecommunity.com> References: <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <20090415175704.966B13A4100@sparrow.telecommunity.com> <20090415185221.GB13696@amk-desktop.matrixgroup.net> <20090415192021.558E53A4119@sparrow.telecommunity.com> <49FB24DF.2020701@simplistix.co.uk> <49FB3453.4060906@v.loewis.de> <20090501184843.D08E43A4070@sparrow.telecommunity.com> Message-ID: <49FB4654.9000408@v.loewis.de> > Actually, if you are using only the distutils, you can do this by > listing only modules in the addon projects; this is how the ll.* tools > are doing it. That only works if the packages are all being installed > in the same directory, though, not as eggs. Right: if all portions install into the same directory, you can have base packages already. Regards, Martin From benjamin at python.org Fri May 1 21:32:18 2009 From: benjamin at python.org (Benjamin Peterson) Date: Fri, 1 May 2009 14:32:18 -0500 Subject: [Python-Dev] Oddity PEP 0 key In-Reply-To: <49FB165B.9070909@mrabarnett.plus.com> References: <49FB165B.9070909@mrabarnett.plus.com> Message-ID: <1afaf6160905011232j2fee6103t1b25075733c39bf8@mail.gmail.com> 2009/5/1 MRAB : > I've just noticed an oddity in the key in PEP 0. Most letters are used > more than once. Wouldn't it be clearer if different letters were used > for "Accepted" and "Active" instead of them both being 'A', for example? > > -> A - Accepted proposal > -> R - Rejected proposal > ? W - Withdrawn proposal > -> D - Deferred proposal > ? F - Final proposal > -> A - Active proposal > -> D - Draft proposal > -> R - Replaced proposal Yes, that makes more sense. Would you like to submit a patch against the PEP 0 generator? (It's in peps/pep0) -- Regards, Benjamin From tjreedy at udel.edu Fri May 1 22:21:36 2009 From: tjreedy at udel.edu (Terry Reedy) Date: Fri, 01 May 2009 16:21:36 -0400 Subject: [Python-Dev] PEP 383 and GUI libraries In-Reply-To: References: <49F965DB.6050601@v.loewis.de> <49F96770.4080206@g.nevcal.com> <49F96B80.5090808@v.loewis.de> Message-ID: Zooko O'Whielacronx wrote: > Following-up to my own post to correct a major error: > Is it true that > srcbytes.encode(srcencoding, 'python-escape').decode('utf-8', > 'python-escape') will always produce srcbytes ? That is my Requirement If you start with bytes, decode with utf-8b to unicode (possibly 'invalid'), and encode the result back to bytes with utf-8b, you should get the original bytes, regardless of what they were. That is the point of PEP 383 -- to reliably roundtrip file 'names' that start as bytes and must end as the same bytes but which may not otherwise have a unicode decoding. If you start with invalid unicode text, encode to bytes with utf-8b, and decode back to unicode, you might instead get a different and valid unicode text. An example was given in the discussion. I believe this would be hard to avoid. An any case, it does not matter for the use case of starting with bytes that one wants to temporarily but surely work with as text. 
Terry Jan Reedy From cs at zip.com.au Fri May 1 23:39:28 2009 From: cs at zip.com.au (Cameron Simpson) Date: Sat, 2 May 2009 07:39:28 +1000 Subject: [Python-Dev] PEP 383 and GUI libraries In-Reply-To: <49FB2596.1090706@v.loewis.de> Message-ID: <20090501213928.GA15679@cskk.homeip.net> On 01May2009 18:38, Martin v. L?wis wrote: | > Okay, I am wrong about this. Having a flag to remember whether I had to | > fall back to the utf-8b trick is one method to implement my requirement, | > but my actual requirement is this: | > | > Requirement: either the unicode string or the bytes are faithfully | > transmitted from one system to another. | | I don't understand this requirement very well, in particular not | the "faithfully" part. | | > That is: if you read a filename from the filesystem, and transmit that | > filename to another system and use it, then there are two cases: | | What do you mean by "use it"? Things like opening files? How does | that work? In general, a file name valid on one system is invalid | on a different system - or, at least, refers to a different file | over there. This is independent of encodings. I think he's doing a file transfer of some kind and needs to preserve the names. Or I would guess the two systems are not both UNIX or there is some subtlety not yet mentioned, or he'd just use tar or some other byte-level UNIX tool. | > Requirement 1: the byte string was valid in the encoding of source | > system, in which case the unicode name is faithfully transmitted | > (i.e. the bytes that finally land on the target system are the result of | > sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding). | | In all your descriptions, I'm puzzled as to where exactly you get | the source bytes from. If you use the PEP 383 interfaces, you will | start with character strings, not byte strings, always. But if both system do present POSIX layers, it's bytes underneath and the system tools will natively use bytes. He wants to ensure that he can read using python, using listdir, and elsewhere when he writing using python, preserve the bytes layer. I think. In fact it sounds like he may be translating valid unicode and carefully not altering byte names that don't decode. That in turn implies that the codec may be different on the two systems. | > Okay, I find it surprisingly easy to make subtle errors in this encoding | > stuff, so please let me know if you spot one. Is it true that | > srcbytes.encode(srcencoding, 'python-escape').decode('utf-8', | > 'python-escape') will always produce srcbytes ? | | I think you mixed up bytes and unicode here: if srcbytes is indeed | a bytes object, then you can't apply .encode to it. I think he has encode/decode swapped (I did too back in the uber-thread; if your mapping is one-to-one the distinction is almost arbitrary). However, his assertion/hope is true only if srcencoding == 'utf-8'. The PEP itself says that it works if the decode and encode use the same mapping. -- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ "How do you know I'm Mad?" asked Alice. "You must be," said the Cat, "or you wouldn't have come here." 
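And a two-line illustration of Cameron's closing point, that the decode/encode round trip only preserves the bytes when the same mapping is used on both sides (the byte value is an arbitrary example):

    src = b'\xf6'
    assert src.decode('latin-1').encode('latin-1') == src        # same codec: round-trips
    assert src.decode('latin-1').encode('utf-8') == b'\xc3\xb6'  # different codec: bytes change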
From google at mrabarnett.plus.com Fri May 1 23:52:02 2009 From: google at mrabarnett.plus.com (MRAB) Date: Fri, 01 May 2009 22:52:02 +0100 Subject: [Python-Dev] Oddity PEP 0 key In-Reply-To: <1afaf6160905011232j2fee6103t1b25075733c39bf8@mail.gmail.com> References: <49FB165B.9070909@mrabarnett.plus.com> <1afaf6160905011232j2fee6103t1b25075733c39bf8@mail.gmail.com> Message-ID: <49FB6F02.7050204@mrabarnett.plus.com> Benjamin Peterson wrote: > 2009/5/1 MRAB : >> I've just noticed an oddity in the key in PEP 0. Most letters are used >> more than once. Wouldn't it be clearer if different letters were used >> for "Accepted" and "Active" instead of them both being 'A', for example? >> >> -> A - Accepted proposal >> -> R - Rejected proposal >> W - Withdrawn proposal >> -> D - Deferred proposal >> F - Final proposal >> -> A - Active proposal >> -> D - Draft proposal >> -> R - Replaced proposal > > Yes, that makes more sense. Would you like to submit a patch against > the PEP 0 generator? (It's in peps/pep0) > I'm still trying to think which letters to use! From fuzzyman at voidspace.org.uk Fri May 1 23:55:16 2009 From: fuzzyman at voidspace.org.uk (Michael Foord) Date: Fri, 01 May 2009 22:55:16 +0100 Subject: [Python-Dev] Oddity PEP 0 key In-Reply-To: <49FB6F02.7050204@mrabarnett.plus.com> References: <49FB165B.9070909@mrabarnett.plus.com> <1afaf6160905011232j2fee6103t1b25075733c39bf8@mail.gmail.com> <49FB6F02.7050204@mrabarnett.plus.com> Message-ID: <49FB6FC4.1030800@voidspace.org.uk> MRAB wrote: > Benjamin Peterson wrote: >> 2009/5/1 MRAB : >>> I've just noticed an oddity in the key in PEP 0. Most letters are used >>> more than once. Wouldn't it be clearer if different letters were used >>> for "Accepted" and "Active" instead of them both being 'A', for >>> example? >>> >>> -> A - Accepted proposal >>> -> R - Rejected proposal >>> W - Withdrawn proposal >>> -> D - Deferred proposal >>> F - Final proposal >>> -> A - Active proposal >>> -> D - Draft proposal >>> -> R - Replaced proposal >> >> Yes, that makes more sense. Would you like to submit a patch against >> the PEP 0 generator? (It's in peps/pep0) >> > I'm still trying to think which letters to use! P for Proposal (to replace Active Proposal)? Every active PEP is a proposal... Michael > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > http://mail.python.org/mailman/options/python-dev/fuzzyman%40voidspace.org.uk > -- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog From barry at python.org Fri May 1 23:59:49 2009 From: barry at python.org (Barry Warsaw) Date: Fri, 1 May 2009 17:59:49 -0400 Subject: [Python-Dev] Oddity PEP 0 key In-Reply-To: <49FB6FC4.1030800@voidspace.org.uk> References: <49FB165B.9070909@mrabarnett.plus.com> <1afaf6160905011232j2fee6103t1b25075733c39bf8@mail.gmail.com> <49FB6F02.7050204@mrabarnett.plus.com> <49FB6FC4.1030800@voidspace.org.uk> Message-ID: On May 1, 2009, at 5:55 PM, Michael Foord wrote: > P for Proposal (to replace Active Proposal)? Every active PEP is a > proposal... +1 Maybe even s/Active/Proposed/g ? -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From google at mrabarnett.plus.com Sat May 2 00:24:32 2009 From: google at mrabarnett.plus.com (MRAB) Date: Fri, 01 May 2009 23:24:32 +0100 Subject: [Python-Dev] Oddity PEP 0 key In-Reply-To: <49FB6FC4.1030800@voidspace.org.uk> References: <49FB165B.9070909@mrabarnett.plus.com> <1afaf6160905011232j2fee6103t1b25075733c39bf8@mail.gmail.com> <49FB6F02.7050204@mrabarnett.plus.com> <49FB6FC4.1030800@voidspace.org.uk> Message-ID: <49FB76A0.7030909@mrabarnett.plus.com> Michael Foord wrote: > MRAB wrote: >> Benjamin Peterson wrote: >>> 2009/5/1 MRAB : >>>> I've just noticed an oddity in the key in PEP 0. Most letters are used >>>> more than once. Wouldn't it be clearer if different letters were used >>>> for "Accepted" and "Active" instead of them both being 'A', for >>>> example? >>>> >>>> -> A - Accepted proposal >>>> -> R - Rejected proposal >>>> W - Withdrawn proposal >>>> -> D - Deferred proposal >>>> F - Final proposal >>>> -> A - Active proposal >>>> -> D - Draft proposal >>>> -> R - Replaced proposal >>> >>> Yes, that makes more sense. Would you like to submit a patch against >>> the PEP 0 generator? (It's in peps/pep0) >>> >> I'm still trying to think which letters to use! > > P for Proposal (to replace Active Proposal)? Every active PEP is a > proposal... > The full list is: S - Standards Track PEP I - Informational PEP P - Process PEP A - Accepted proposal R - Rejected proposal W - Withdrawn proposal D - Deferred proposal F - Final proposal A - Active proposal D - Draft proposal R - Replaced proposal using one letter from each set. From looking more closely at the code: Only 'Informational' or 'Process' PEPs can be 'Active'. 'Draft' and 'Active' are shown as a single space instead of 'D' or 'A'. Therefore: S - Standards Track PEP I - Informational PEP P - Process PEP A - Accepted proposal R - Rejected proposal W - Withdrawn proposal D - Deferred proposal F - Final proposal [A - Active proposal # blank, so can be omitted from key] [D - Draft proposal # blank, so can be omitted from key] R - Replaced proposal leaving just 'Rejected' and 'Replaced' to be disambiguated. From eric at trueblade.com Sat May 2 00:55:04 2009 From: eric at trueblade.com (Eric Smith) Date: Fri, 01 May 2009 18:55:04 -0400 Subject: [Python-Dev] svn down? Message-ID: <49FB7DC8.9060508@trueblade.com> When checking in, I get: Transmitting file data .svn: Commit failed (details follow): svn: Can't create directory '/data/repos/projects/db/transactions/72186-1.txn': Read-only file system With 'svn up', I get: svn: Can't find a temporary directory: Internal error From benjamin at python.org Sat May 2 01:12:23 2009 From: benjamin at python.org (Benjamin Peterson) Date: Fri, 1 May 2009 18:12:23 -0500 Subject: [Python-Dev] svn down? In-Reply-To: <49FB7DC8.9060508@trueblade.com> References: <49FB7DC8.9060508@trueblade.com> Message-ID: <1afaf6160905011612n22ccf803hde0b02deb1e6ef57@mail.gmail.com> 2009/5/1 Eric Smith : > When checking in, I get: > > Transmitting file data .svn: Commit failed (details follow): > svn: Can't create directory > '/data/repos/projects/db/transactions/72186-1.txn': Read-only file system > > With 'svn up', I get: > > svn: Can't find a temporary directory: Internal error I get that, too. In addition, I can't ssh to dinsdale. 
-- Regards, Benjamin From benjamin at python.org Sat May 2 03:27:48 2009 From: benjamin at python.org (Benjamin Peterson) Date: Fri, 1 May 2009 20:27:48 -0500 Subject: [Python-Dev] yield from? Message-ID: <1afaf6160905011827l132a0014o6b1032e20a08552c@mail.gmail.com> What's the status of yield from? There's still a small window open for a patch to be checked into 3.1's branch. I haven't been following the python-ideas threads, so I'm not sure if it's ready yet. -- Regards, Benjamin From zookog at gmail.com Sat May 2 03:42:47 2009 From: zookog at gmail.com (Zooko O'Whielacronx) Date: Fri, 1 May 2009 19:42:47 -0600 Subject: [Python-Dev] PEP 383 and GUI libraries In-Reply-To: <49FB2596.1090706@v.loewis.de> References: <49F965DB.6050601@v.loewis.de> <49F96770.4080206@g.nevcal.com> <49F96B80.5090808@v.loewis.de> <49FB2596.1090706@v.loewis.de> Message-ID: Folks: Being new to the use of gmail, I accidentally sent the following only to MvL and not to the list. He promptly replied with a helpful counterexample showing that my design can suffer collisions. :-) Regards, Zooko On Fri, May 1, 2009 at 10:38 AM, "Martin v. L?wis" wrote: >> >> Requirement: either the unicode string or the bytes are faithfully >> transmitted from one system to another. > > I don't understand this requirement very well, in particular not > the "faithfully" part. > >> That is: if you read a filename from the filesystem, and transmit that >> filename to another system and use it, then there are two cases: > > What do you mean by "use it"? Things like opening files? How does > that work? In general, a file name valid on one system is invalid > on a different system - or, at least, refers to a different file > over there. This is independent of encodings. Tahoe is a backup and filesharing program, so you might for example, execute "tahoe cp -r Mot?rhead tahoe:" to copy all the contents of your "Mot?rhead" directory to your Tahoe filesystem. Later you or a friend, might execute "tahoe cp -r tahoe:Mot?rhead ." to copy everything from that directory within your Tahoe filesystem to your local filesystem. So in this case the flow of information is local_system_1 -> Tahoe -> local_system_2. The Requirement 1 is that for each filename encountered which is a valid encoding in local_system_1, then the resulting (unicode) name is transmitted through the Tahoe filesystem and then written out into local_system_2 in the expected way (i.e. just by using the Python unicode APIs and passing the unicode object to them). Requirement 2 is that for each filename encountered which is not a valid encoding in local_system_1, then the original bytes are transmitted through the Tahoe filesystem and then, if the target system is a byte-oriented system such as Linux, the original bytes are written into the target filesystem. (If the target is not Linux then mojibake! but we don't have to go into that now.) Does that make sense? > In all your descriptions, I'm puzzled as to where exactly you get > the source bytes from. If you use the PEP 383 interfaces, you will > start with character strings, not byte strings, always. On Mac and Windows, we use the Python unicode APIs e.g. os.listdir(u"Mot?rhead"). On Linux and Solaris, we use the Python bytestring APIs e.g. os.listdir("Mot?rhead".encode(sys.getfilesystemencoding())). >> Okay, I find it surprisingly easy to make subtle errors in this encoding >> stuff, so please let me know if you spot one. 
Is it true that >> srcbytes.encode(srcencoding, 'python-escape').decode('utf-8', >> 'python-escape') will always produce srcbytes ? > > I think you mixed up bytes and unicode here: if srcbytes is indeed > a bytes object, then you can't apply .encode to it. Yep, I reversed the order of encode() and decode(). However, my whole statement was utterly wrong and shows that I still didn't fully get it yet. I have flip-flopped again and currently think that PEP 383 is useless for this use case and that my original plan [1] is still the way to go. Please let me know if you spot a flaw in my plan or a ridiculousity in my requirements, or if you see a way that PEP 383 can help me. Thank you very much. Regards, Zooko [1] http://allmydata.org/trac/tahoe/ticket/534#comment:47 From guido at python.org Sat May 2 04:10:47 2009 From: guido at python.org (Guido van Rossum) Date: Fri, 1 May 2009 19:10:47 -0700 Subject: [Python-Dev] yield from? In-Reply-To: <1afaf6160905011827l132a0014o6b1032e20a08552c@mail.gmail.com> References: <1afaf6160905011827l132a0014o6b1032e20a08552c@mail.gmail.com> Message-ID: Alas, I haven't been following it either recently. Too bad, really, because before I left (now three weeks ago) it was already pretty close. We could perhaps even check in Greg's patch (which I tried and looked like a solid implementation of his proposal at the time) and finagle it for b2. One problem though is that Greg's code is based on 2.6... On Fri, May 1, 2009 at 6:27 PM, Benjamin Peterson wrote: > What's the status of yield from? There's still a small window open for > a patch to be checked into 3.1's branch. I haven't been following the > python-ideas threads, so I'm not sure if it's ready yet. -- --Guido van Rossum (home page: http://www.python.org/~guido/) From foom at fuhm.net Sat May 2 04:12:15 2009 From: foom at fuhm.net (James Y Knight) Date: Fri, 1 May 2009 22:12:15 -0400 Subject: [Python-Dev] PEP 383 and GUI libraries In-Reply-To: References: <49F965DB.6050601@v.loewis.de> <49F96770.4080206@g.nevcal.com> <49F96B80.5090808@v.loewis.de> <49FB2596.1090706@v.loewis.de> Message-ID: <51167066-A162-4AAF-B40D-52C1918032D8@fuhm.net> On May 1, 2009, at 9:42 PM, Zooko O'Whielacronx wrote: > Yep, I reversed the order of encode() and decode(). However, my whole > statement was utterly wrong and shows that I still didn't fully get it > yet. I have flip-flopped again and currently think that PEP 383 is > useless for this use case and that my original plan [1] is still the > way to go. Please let me know if you spot a flaw in my plan or a > ridiculousity in my requirements, or if you see a way that PEP 383 can > help me. If I were designing a new system such as this, I'd probably just go for utf8b *always*. That is, set the filesystem encoding to utf-8b. The end. All files always keep the same bytes transferring between unix systems. Thus, for the 99% of the world that uses either windows or a utf-8 locale, they get useful filenames inside tahoe. The other 1% of the world that uses something like latin-1, EUC_JP, etc. on their local system sees mojibake filenames in tahoe, but will see the same filename that they put in when they take it back out. Gnome already uses only utf-8 for filename displays for a few years now, for example, so this isn't exactly an unheard-of position to take... But if you don't do that, then, I still don't see what purpose your requirements serve. 
If I have two systems: one with a UTF-8 locale, and one with a Latin-1 locale, why should transmitting filenames from system 1 to system 2 through tahoe preserve the raw bytes, but doing the reverse *not* preserve the raw bytes? (all byte-sequences are valid in latin-1, remember, so they'll all decode into unicode without error, and then be reencoded in utf-8...). This seems rather a useless behavior to me. James From alexander.belopolsky at gmail.com Sat May 2 04:46:00 2009 From: alexander.belopolsky at gmail.com (Alexander Belopolsky) Date: Fri, 1 May 2009 22:46:00 -0400 Subject: [Python-Dev] Oddity PEP 0 key In-Reply-To: <49FB76A0.7030909@mrabarnett.plus.com> References: <49FB165B.9070909@mrabarnett.plus.com> <1afaf6160905011232j2fee6103t1b25075733c39bf8@mail.gmail.com> <49FB6F02.7050204@mrabarnett.plus.com> <49FB6FC4.1030800@voidspace.org.uk> <49FB76A0.7030909@mrabarnett.plus.com> Message-ID: .. > leaving just 'Rejected' and 'Replaced' to be disambiguated. 'X' or 'Z' for "Rejected"? Looks like a perfect start for a bikeshed discussion. :-) From stephen at xemacs.org Sat May 2 07:34:15 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 02 May 2009 14:34:15 +0900 Subject: [Python-Dev] Oddity PEP 0 key In-Reply-To: References: <49FB165B.9070909@mrabarnett.plus.com> <1afaf6160905011232j2fee6103t1b25075733c39bf8@mail.gmail.com> <49FB6F02.7050204@mrabarnett.plus.com> <49FB6FC4.1030800@voidspace.org.uk> Message-ID: <87ljpghxuw.fsf@uwakimon.sk.tsukuba.ac.jp> Barry Warsaw writes: > On May 1, 2009, at 5:55 PM, Michael Foord wrote: > > > P for Proposal (to replace Active Proposal)? Every active PEP is a > > proposal... > > +1 > > Maybe even s/Active/Proposed/g ? Shouldn't that be s/Active/Proposed/ From stephen at xemacs.org Sat May 2 07:49:34 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 02 May 2009 14:49:34 +0900 Subject: [Python-Dev] Oddity PEP 0 key In-Reply-To: References: <49FB165B.9070909@mrabarnett.plus.com> <1afaf6160905011232j2fee6103t1b25075733c39bf8@mail.gmail.com> <49FB6F02.7050204@mrabarnett.plus.com> <49FB6FC4.1030800@voidspace.org.uk> <49FB76A0.7030909@mrabarnett.plus.com> Message-ID: <87k550hx5d.fsf@uwakimon.sk.tsukuba.ac.jp> Alexander Belopolsky writes: > .. > > leaving just 'Rejected' and 'Replaced' to be disambiguated. > > 'X' or 'Z' for "Rejected"? Looks like a perfect start for a bikeshed > discussion. :-) The Japanese contingent suggests O (UPPERCASE LATIN LETTER O) for accepted and X for rejected. (Actually these should be U+25EF and U+00D7, respectively.) From arfrever.fta at gmail.com Sat May 2 12:34:05 2009 From: arfrever.fta at gmail.com (Arfrever Frehtes Taifersar Arahesis) Date: Sat, 2 May 2009 12:34:05 +0200 Subject: [Python-Dev] Oddity PEP 0 key In-Reply-To: <87ljpghxuw.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FB165B.9070909@mrabarnett.plus.com> <87ljpghxuw.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <200905021234.08766.Arfrever.FTA@gmail.com> 2009-05-02 07:34:15 Stephen J. Turnbull napisa?(a): > Barry Warsaw writes: > > On May 1, 2009, at 5:55 PM, Michael Foord wrote: > > > > > P for Proposal (to replace Active Proposal)? Every active PEP is a > > > proposal... > > > > +1 > > > > Maybe even s/Active/Proposed/g ? > > Shouldn't that be > > s/Active/Proposed/ No. From `info sed 'sed Programs' 'The "s" Command'`: > The `s' Command > =============== > > The syntax of the `s' (as in substitute) command is > `s/REGEXP/REPLACEMENT/FLAGS'. 
The `/' characters may be uniformly > replaced by any other single character within any given `s' command. > The `/' character (or whatever other character is used in its stead) > can appear in the REGEXP or REPLACEMENT only if it is preceded by a `\' > character. > ... > The `s' command can be followed by zero or more of the following > FLAGS: > > `g' > Apply the replacement to _all_ matches to the REGEXP, not just the > first. -- Arfrever Frehtes Taifersar Arahesis -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: This is a digitally signed message part. URL: From aahz at pythoncraft.com Sat May 2 14:34:04 2009 From: aahz at pythoncraft.com (Aahz) Date: Sat, 2 May 2009 05:34:04 -0700 Subject: [Python-Dev] FWD: svn down? Message-ID: <20090502123404.GA27305@panix.com> ----- Forwarded message from "\"Martin v. L?wis\"" ----- > Date: Sat, 02 May 2009 08:18:56 +0200 > From: "\"Martin v. L?wis\"" > To: Aahz > CC: pydotorg at python.org > Subject: Re: [Pydotorg] FWD: [Python-Dev] svn down? > >> Benjamin Peterson reports being unable to ssh to dinsdale > > I have rebooted the machine; it seems now to be working again. > > Regards, > Martin ----- End forwarded message ----- -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ "Typing is cheap. Thinking is expensive." --Roy Smith From google at mrabarnett.plus.com Sat May 2 16:12:07 2009 From: google at mrabarnett.plus.com (MRAB) Date: Sat, 02 May 2009 15:12:07 +0100 Subject: [Python-Dev] Oddity PEP 0 key In-Reply-To: References: <49FB165B.9070909@mrabarnett.plus.com> <1afaf6160905011232j2fee6103t1b25075733c39bf8@mail.gmail.com> <49FB6F02.7050204@mrabarnett.plus.com> <49FB6FC4.1030800@voidspace.org.uk> <49FB76A0.7030909@mrabarnett.plus.com> Message-ID: <49FC54B7.8010807@mrabarnett.plus.com> Alexander Belopolsky wrote: > .. >> leaving just 'Rejected' and 'Replaced' to be disambiguated. > > 'X' or 'Z' for "Rejected"? Looks like a perfect start for a bikeshed > discussion. :-) > Are there Unicode codepoints for smilies? I'm thinking of :-) for 'Accepted' and :-( for 'Rejected'. :-) From ajaksu at gmail.com Sat May 2 17:11:49 2009 From: ajaksu at gmail.com (Daniel Diniz) Date: Sat, 2 May 2009 12:11:49 -0300 Subject: [Python-Dev] Oddity PEP 0 key In-Reply-To: <49FC54B7.8010807@mrabarnett.plus.com> References: <49FB165B.9070909@mrabarnett.plus.com> <1afaf6160905011232j2fee6103t1b25075733c39bf8@mail.gmail.com> <49FB6F02.7050204@mrabarnett.plus.com> <49FB6FC4.1030800@voidspace.org.uk> <49FB76A0.7030909@mrabarnett.plus.com> <49FC54B7.8010807@mrabarnett.plus.com> Message-ID: <2d75d7660905020811p1bdd2b5k51030ef1f8ab046f@mail.gmail.com> MRAB wrote: > Are there Unicode codepoints for smilies? I'm thinking of :-) for > 'Accepted' and :-( for 'Rejected'. :-) Yes there are, but we'd need to set the font size to 'humongous' to see the smilies: ? ?. In py3k: print(chr(0x2639), chr(0x263a)) In trunk: print(unichr(0x2639), unichr(0x263a)) -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smilies.png Type: image/png Size: 3574 bytes Desc: not available URL: From ijmorlan at uwaterloo.ca Sat May 2 17:04:22 2009 From: ijmorlan at uwaterloo.ca (Isaac Morland) Date: Sat, 2 May 2009 11:04:22 -0400 (EDT) Subject: [Python-Dev] Oddity PEP 0 key In-Reply-To: <49FC54B7.8010807@mrabarnett.plus.com> References: <49FB165B.9070909@mrabarnett.plus.com> <1afaf6160905011232j2fee6103t1b25075733c39bf8@mail.gmail.com> <49FB6F02.7050204@mrabarnett.plus.com> <49FB6FC4.1030800@voidspace.org.uk> <49FB76A0.7030909@mrabarnett.plus.com> <49FC54B7.8010807@mrabarnett.plus.com> Message-ID: On Sat, 2 May 2009, MRAB wrote: > Alexander Belopolsky wrote: >> .. >>> leaving just 'Rejected' and 'Replaced' to be disambiguated. >> >> 'X' or 'Z' for "Rejected"? Looks like a perfect start for a bikeshed >> discussion. :-) >> > Are there Unicode codepoints for smilies? I'm thinking of :-) for > 'Accepted' and :-( for 'Rejected'. :-) U+2639 WHITE FROWNING FACE U+263A WHITE SMILING FACE Also, U+2694 CROSSED SWORDS for "vehement discussion on mailing list", U+2696 SCALES for "BDFL is considering", and U+2678 BLACK UNIVERSAL RECYCLING SYMBOL for "proposal previously rejected is being re-proposed due to changed circumstances". For code don't forget great math operator symbols like U+2264 LESS-THAN OR EQUAL TO and U+222A UNION. But I doubt if anybody would want to bake in an absolute requirement for Unicode support in order to be able to read or write Python code. Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist From benjamin at python.org Sat May 2 20:41:51 2009 From: benjamin at python.org (Benjamin Peterson) Date: Sat, 2 May 2009 13:41:51 -0500 Subject: [Python-Dev] yield from? In-Reply-To: References: <1afaf6160905011827l132a0014o6b1032e20a08552c@mail.gmail.com> Message-ID: <1afaf6160905021141m68b4b25cm7e60aaf6f5dce4e3@mail.gmail.com> 2009/5/1 Guido van Rossum : > Alas, I haven't been following it either recently. Too bad, really, > because before I left (now three weeks ago) it was already pretty > close. We could perhaps even check in Greg's patch (which I tried and > looked like a solid implementation of his proposal at the time) and > finagle it for b2. One problem though is that Greg's code is based on > 2.6... I don't believe the compiler has changed between 2.6 and the trunk, so a patch against the trunk would probably not be too hard. I volunteer to review it if it is produced. -- Regards, Benjamin From g.brandl at gmx.net Sat May 2 21:01:28 2009 From: g.brandl at gmx.net (Georg Brandl) Date: Sat, 02 May 2009 21:01:28 +0200 Subject: [Python-Dev] multi-with statement Message-ID: Hi, this is just a short notice that Mattias Br?ndstr?m and I have finished a patch to implement the previously discussed and mostly warmly welcomed extension to with's syntax, allowing with A() as a, B() as b: to be written instead of with A() as a: with B() as b: This syntax was chosen (over "with A(), B() as a, b:") because it has more syntactical similarity to the written-out version. Also, our current uses of "as" all have only one expression on the right. The patch implements it as a simple AST transformation, which guarantees semantic equivalence. It is at . If there is no strong opposition, I will commit it and port it to py3k before 3.1 enters beta stage. cheers, Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. 
Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. From fredrik.johansson at gmail.com Sat May 2 21:26:14 2009 From: fredrik.johansson at gmail.com (Fredrik Johansson) Date: Sat, 2 May 2009 21:26:14 +0200 Subject: [Python-Dev] multi-with statement In-Reply-To: References: Message-ID: <3d0cebfb0905021226y501a5990q5b3ccc016255cdef@mail.gmail.com> On Sat, May 2, 2009 at 9:01 PM, Georg Brandl wrote: > Hi, > > this is just a short notice that Mattias Br?ndstr?m and I have finished a > patch to implement the previously discussed and mostly warmly welcomed > extension to with's syntax, allowing > > ? with A() as a, B() as b: > > to be written instead of > > ? with A() as a: > ? ? ? with B() as b: > > This syntax was chosen (over "with A(), B() as a, b:") because it has more > syntactical similarity to the written-out version. ?Also, our current uses > of "as" all have only one expression on the right. > > The patch implements it as a simple AST transformation, which guarantees > semantic equivalence. ?It is at . > > If there is no strong opposition, I will commit it and port it to py3k > before 3.1 enters beta stage. > > cheers, > Georg I was hoping for the other syntax in order to be able to create a nested context in advance as a simple tuple: with A, B: pass context = A, B with context: pass (I.e. a tuple, or perhaps any iterable, would be a valid context manager.) With the syntax in the patch, I will still have to implement a custom nesting context manager to do this, which sort of defeats the purpose. Fredrik From aleaxit at gmail.com Sat May 2 21:44:06 2009 From: aleaxit at gmail.com (Alex Martelli) Date: Sat, 2 May 2009 12:44:06 -0700 Subject: [Python-Dev] multi-with statement In-Reply-To: <3d0cebfb0905021226y501a5990q5b3ccc016255cdef@mail.gmail.com> References: <3d0cebfb0905021226y501a5990q5b3ccc016255cdef@mail.gmail.com> Message-ID: FWIW, I prefer Fredrik's wish too. Alex On Sat, May 2, 2009 at 12:26 PM, Fredrik Johansson < fredrik.johansson at gmail.com> wrote: > On Sat, May 2, 2009 at 9:01 PM, Georg Brandl wrote: > > Hi, > > > > this is just a short notice that Mattias Br?ndstr?m and I have finished a > > patch to implement the previously discussed and mostly warmly welcomed > > extension to with's syntax, allowing > > > > with A() as a, B() as b: > > > > to be written instead of > > > > with A() as a: > > with B() as b: > > > > This syntax was chosen (over "with A(), B() as a, b:") because it has > more > > syntactical similarity to the written-out version. Also, our current > uses > > of "as" all have only one expression on the right. > > > > The patch implements it as a simple AST transformation, which guarantees > > semantic equivalence. It is at . > > > > If there is no strong opposition, I will commit it and port it to py3k > > before 3.1 enters beta stage. > > > > cheers, > > Georg > > I was hoping for the other syntax in order to be able to create a > nested context in advance as a simple tuple: > > with A, B: > pass > > context = A, B > with context: > pass > > (I.e. a tuple, or perhaps any iterable, would be a valid context manager.) > > With the syntax in the patch, I will still have to implement a custom > nesting context manager to do this, which sort of defeats the purpose. 
> > Fredrik > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > http://mail.python.org/mailman/options/python-dev/aleaxit%40gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Sat May 2 21:45:47 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 2 May 2009 19:45:47 +0000 (UTC) Subject: [Python-Dev] CVE-2008-5983 "untrusted python modules search path" Message-ID: Hello, I don't think it has already posted to the list, apologies if it has. Some Linux tools and vendors have been hit by an alleged "security hole" where an embedded Python interpreter will prepend the current working directory to sys.path as soon as PySys_SetArgv() is called by the embedding application. This means, for example, that a Python file in the working directory can break plugins or extensions written for that application if the Python file happens to shadow another module. Regardless of whether this is a security hole or not, it certainly can make things disturbingly surprising when the situation arises. In the bug report (http://bugs.python.org/issue5753), I suggested we add a new function PySys_SetArgvEx() which would take an additional parameter telling whether to touch sys.path or not (in the same spirit as Py_InitializeEx() providing a more flexible API than Py_Initialize()). On the other hand, I don't think we can change the default behaviour of PySys_SetArgv(), since there are probably tools and applications relying on it (the obvious use case which comes to my mind is a third-party interactive interpreter). Any opinions? Regards Antoine. From g.brandl at gmx.net Sat May 2 22:12:10 2009 From: g.brandl at gmx.net (Georg Brandl) Date: Sat, 02 May 2009 22:12:10 +0200 Subject: [Python-Dev] multi-with statement In-Reply-To: <3d0cebfb0905021226y501a5990q5b3ccc016255cdef@mail.gmail.com> References: <3d0cebfb0905021226y501a5990q5b3ccc016255cdef@mail.gmail.com> Message-ID: Fredrik Johansson schrieb: > On Sat, May 2, 2009 at 9:01 PM, Georg Brandl wrote: >> Hi, >> >> this is just a short notice that Mattias Br?ndstr?m and I have finished a >> patch to implement the previously discussed and mostly warmly welcomed >> extension to with's syntax, allowing >> >> with A() as a, B() as b: >> >> to be written instead of >> >> with A() as a: >> with B() as b: > I was hoping for the other syntax in order to be able to create a > nested context in advance as a simple tuple: > > with A, B: > pass > > context = A, B > with context: > pass > > (I.e. a tuple, or perhaps any iterable, would be a valid context manager.) I see; you want to construct your context manager programmatically and pass it to "with" without knowing what is in there. While this would be possible, we have to be aware that with this we would effectively change the context manager protocol, rather like the iterator protocol's __getitem__ alternate realization. This muddies the definition of a context manager. (The interesting thing is that you could already implement *that* version without any new syntactic support, by giving tuples an __enter__/__exit__ method pair.) > With the syntax in the patch, I will still have to implement a custom > nesting context manager to do this, which sort of defeats the purpose. Not really. 
Having an unknown number of stacked context managers is not the purpose -- for that, I'd still say a custom nesting context manager is better, because it is also more explicit when created not at the "with" site. (You could even write it as a tuple subclass, if you like the tuple interface.) Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. From rdmurray at bitdance.com Sun May 3 00:33:15 2009 From: rdmurray at bitdance.com (R. David Murray) Date: Sat, 2 May 2009 18:33:15 -0400 (EDT) Subject: [Python-Dev] multi-with statement In-Reply-To: References: <3d0cebfb0905021226y501a5990q5b3ccc016255cdef@mail.gmail.com> Message-ID: On Sat, 2 May 2009 at 22:12, Georg Brandl wrote: > I see; you want to construct your context manager programmatically and pass > it to "with" without knowing what is in there. > > While this would be possible, we have to be aware that with this we would > effectively change the context manager protocol, rather like the iterator > protocol's __getitem__ alternate realization. This muddies the definition > of a context manager. > > (The interesting thing is that you could already implement *that* version > without any new syntactic support, by giving tuples an __enter__/__exit__ > method pair.) > >> With the syntax in the patch, I will still have to implement a custom >> nesting context manager to do this, which sort of defeats the purpose. > > Not really. Having an unknown number of stacked context managers is not > the purpose -- for that, I'd still say a custom nesting context manager > is better, because it is also more explicit when created not at the "with" > site. (You could even write it as a tuple subclass, if you like the tuple > interface.) As I understand it, the primary problem the patch Georg is talking about solves is the fact that currently if you pass multiple contexts to contextlib.nested, and one of the later items in the argument list throws an error, the context(s) from the earlier context manager(s) does not get cleaned up properly. This patch solves that problem very neatly. I'm +1 on the patch, including preferring the syntax over the alternative. Georg, maybe you should post the link to the python-ideas discussion? --David From ben+python at benfinney.id.au Sun May 3 01:54:38 2009 From: ben+python at benfinney.id.au (Ben Finney) Date: Sun, 03 May 2009 09:54:38 +1000 Subject: [Python-Dev] Oddity PEP 0 key References: <49FB165B.9070909@mrabarnett.plus.com> <87ljpghxuw.fsf@uwakimon.sk.tsukuba.ac.jp> <200905021234.08766.Arfrever.FTA@gmail.com> Message-ID: <871vr7m56p.fsf@benfinney.id.au> Arfrever Frehtes Taifersar Arahesis writes: > 2009-05-02 07:34:15 Stephen J. Turnbull napisał(a): > > Barry Warsaw writes: > > > Maybe even s/Active/Proposed/g ? > > > > Shouldn't that be > > > > s/Active/Proposed/ > > No. > From `info sed 'sed Programs' 'The "s" Command'`: Stephen was, I suspect, feeling a little frisky when he wrote that, and attempted a joke (the shortcut ?? is often used in this forum for "insert a silly grin here"). Knowing him, I grade the joke "4 out of 10, could do better". -- \ "Think for yourselves and let others enjoy the privilege to do | `\ so too." --Voltaire, _Essay On Tolerance_ | _o__) | Ben Finney -------------- next part -------------- A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 196 bytes Desc: not available URL: From zookog at gmail.com Sun May 3 06:33:54 2009 From: zookog at gmail.com (Zooko O'Whielacronx) Date: Sat, 2 May 2009 22:33:54 -0600 Subject: [Python-Dev] PEP 383 and GUI libraries In-Reply-To: <51167066-A162-4AAF-B40D-52C1918032D8@fuhm.net> References: <49F965DB.6050601@v.loewis.de> <49F96770.4080206@g.nevcal.com> <49F96B80.5090808@v.loewis.de> <49FB2596.1090706@v.loewis.de> <51167066-A162-4AAF-B40D-52C1918032D8@fuhm.net> Message-ID: [cross-posting to python-dev and tahoe-dev] On Fri, May 1, 2009 at 8:12 PM, James Y Knight wrote: > > If I were designing a new system such as this, I'd probably just go for > utf8b *always*. Ah, this would be a very tempting possibility -- abandon all unix users who are slow to embrace our utf-8b future! However, it is moot because Tahoe is not a new system. It is currently at v1.4.1, has a strong policy of backwards-compatibility, and already has lots of data, lots of users, and programmers building on top of it. It currently uses utf-8 for its internal storage (note: nothing to do with reading or writing files from external sources -- only for storing filenames in the decentralized storage system which is accessed by Tahoe clients), and we can't start putting non-utf-8-valid sequences in the "filename" slot because other Tahoe clients would then get a UnicodeDecodeError exception when trying to read those directories. We *could* create a new metadata entry to hold things other than utf-8. Current Tahoe clients would never look at that entry (the metadata is a JSON-serialized dictionary, so we can add a new key name into it without disturbing the existing clients), but future Tahoe clients could look for that new key. That is where it is possible that future versions of Tahoe might be able to benefit from utf-8b or PEP 383, although what PEP 383 offers for this use case remains unclear to me. > But if you don't do that, then, I still don't see what purpose your > requirements serve. If I have two systems: one with a UTF-8 locale, and one > with a Latin-1 locale, why should transmitting filenames from system 1 to > system 2 through tahoe preserve the raw bytes, but doing the reverse *not* > preserve the raw bytes? (all byte-sequences are valid in latin-1, remember, > so they'll all decode into unicode without error, and then be reencoded in > utf-8...). This seems rather a useless behavior to me. I see I'm not explaining the Tahoe requirements clearly. It's probably that I'm not understanding them clearly myself. Hopefully the following will help. There are two different things stored in Tahoe for each directory entry: the filename and the metadata. Suppose you have run "tahoe cp -r myfiles/ tahoe:" on a Linux system and then you inspect the files in the Tahoe filesystem, such as by examining the web interface [1] or by running "tahoe ls", either of which you could do either from the same machine where you ran "tahoe cp" or from a different machine (which could be using any operating system). We have the following requirements about what ends up in your Tahoe directory after that cp -r. Requirement 1 (unicode): Each filename that you see needs to be valid unicode (it is stored internally in utf-8). This eliminates utf-8b and PEP 383 from being directly applicable to the filename part, although perhaps they could be useful for the metadata part (about which more below). 
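(A minimal illustration of that point, not Tahoe code -- it assumes the PEP 383 escape mechanism is available as a codec error handler under the name 'surrogateescape':

    raw = b'\xff.txt'                              # not valid utf-8
    name = raw.decode('utf-8', 'surrogateescape')  # '\udcff.txt' -- a str containing a lone surrogate
    try:
        name.encode('utf-8')                       # strict utf-8 refuses the lone surrogate, so this
    except UnicodeEncodeError:                     # name cannot be stored in a slot that must hold
        pass                                       # strictly-encodable unicode
    assert name.encode('utf-8', 'surrogateescape') == raw   # it only round-trips with the handler

)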
Requirement 2 (faithful if unicode): For each filename (byte string) in your myfiles directory, if that bytestring is the valid encoding of some string in your stated locale, then the resulting filename in Tahoe is that (unicode) string. Nobody ever doesn't want this, right? Well, maybe some people don't want this sometimes, because it could be that the locale was wrong for this byte string and the resulting successfully-decoded unicode name is gibberish. This is especially acute if the locale is an 8-bit encoding such as latin-1 or windows-1252. However, what's the alternative? Guessing that their locale shouldn't be set to latin-1 and instead decoding their bytes some other way? It seems like we're not going to do better than requirement 2 (faithful if unicode). Requirement 3 (no file left behind): For each filename (byte string) in your myfiles directory, whether or not that byte string is the valid encoding of anything in your stated locale, then that file will be added into the Tahoe filesystem under *some* name (a good candidate would be mojibake, e.g. decode the bytes with latin-1, but that is not the only possibility). I have heard some developers say that they don't want to support this requirement and would rather tell the users to fix their filenames before they can back up or share those files through Tahoe. On the other hand, users have said that they require this and they are not going to go mucking about with all their filenames just so that they can use my backup and filesharing tool. Now already we can say that these three requirements mean that there can be collisions -- for example a directory could have two entries, one of which is not a valid encoding in the locale, and whatever unicode string we invent to name it with in order to satisfy requirements 3 (no file left behind) and 1 (unicode) might happen to be the same as the (correctly-encoded) name of the other file. Therefore these three requirements imply that we have to detect such collisions and deal with them somehow. (Thanks to Martin v. L?wis for reminding me of this.) Possible Requirement 4 (faithful bytes if not unicode, a.k.a. "round-tripping"): Suppose you have a directory with some files with Japanese names, encoded using shift-jis, and some files with Russian names, encoded using koi8-r. Suppose your locale is set to shift-jis, and then you do "tahoe cp -r myfiles/ tahoe:". Then suppose you or someone else does "tahoe cp -r tahoe: copy_of_myfiles/". The "round-tripping" feature is that the files with Russian names that did not accidentally decode cleanly with shift-jis still have the same bytes in their names as they did in the original myfiles directory. As I write this, I am becoming skeptical of this (faithful bytes if not unicode, a.k.a. "round-tripping"), thanks in part to criticism from James Knight, MvL, Thomas Breuel, and others. One reason to be skeptical is that about a third of the Russian files will happen to decode cleanly as shift-jis anyway, and will therefore come out as something entirely different if the target filesystem's encoding is something other than shift-jis. But an even worse problem -- the show-stopper for me -- is that I don't want what Tahoe shows when you do "tahoe ls" or view it in a web browser to differ from what it writes out when you do "tahoe cp -r tahoe: newfiles/". So I'm ready to reject this one. Now about the "metadata" part which is separate from the filename itself. 
I have another requirement: Requirement 5 (no loss of information): I don't want Tahoe to destroy information -- every transformation should be (in principle) reversible by some future computer-augmented archaeologist. For example, if a bytestring decodes cleanly with the locale's suggested encoding, and we use the resulting unicode as the filename, then we also store the original byte string in the metadata since we don't know if the locale's suggested encoding was good. This allows the later invention of a tool which shows the user what the filename would have been with other encodings and let the user choose one that makes sense. It is important to note that this does not impose any requirement on the *filename* itself -- all such information can be stored in the metadata. Okay, in light of the above four requirements and the rejection of #4, I hereby propose to change from the previous Tahoe design [2] to the following: To copy an entry from a local filesystem into Tahoe: 1. On Windows or Mac read the filename with the unicode APIs. Normalize the string with filename = unicodedata.normalize('NFC', filename). Leave the "original_bytes" key and the "failed_decode" flag out of the metadata. 2. On Linux or Solaris read the filename with the string APIs, and store the result in the "original_bytes" part of the metadata. Call sys.getfilesystemencoding() to get an alleged_encoding. Then, call bytes.decode(alleged_encoding, 'strict') to try to get a unicode object. 2.a. If this decoding succeeds then normalize the unicode filename with filename = unicodedata.normalize('NFC', filename), store the resulting filename and leave the "failed_decode" flag out of the metadata. 2.b. If this decoding fails, then we decode it again with bytes.decode('latin-1', 'strict'). Do not normalize it. Store the resulting unicode object into the "filename" part, set the "failed_decode" flag to True. This is mojibake! 3. (handling collisions) In either case 2.a or 2.b the resulting unicode string may already be present in the directory. If so, check the failed_decode flags on the current entry and the new entry. If they are both set or both unset then the new entry overwrites the old entry -- they had the same name. If the failed_decode flags differ then this is a case of collision -- the old entry and the new entry had (as far as we are concerned) different names that accidentally generated the same unicode. Alter the new entry's name, for example by appending "~1" and then trying again and incrementing the number until it doesn't match any extant entry. To copy an entry from Tahoe into a local filesystem: Always use the Python unicode API. The original_bytes field and the failed_decode field in the metadata are not consulted. Now a question for python-dev people: could utf-8b or PEP 383 be useful for requirements like the four requirements listed above? If not, what requirements does PEP 383 help with? I'm sure that if can help with the use case of "I'm doing os.listdir() and then I'm going to turn around and use the resulting unicode objects on the same local filesystem in the same Python process". I'm not sure that it can help if you are going to store the results of your os.listdir() persistently or if you are going to transmit them over a network. Indeed, using the results that way could lead to unpleasant surprises. Does that sound right to you? Perhaps this could be documented somehow to help other programmers along the way. Thanks very much for your help, everyone. 
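For concreteness, here is a rough sketch of steps 2 and 3 above -- illustrative only, not Tahoe's actual implementation, and the helper names are invented:

    import sys, unicodedata

    def name_and_flag(raw_bytes):
        """Steps 2.a/2.b: return (unicode name, failed_decode flag)."""
        alleged_encoding = sys.getfilesystemencoding()
        try:
            name = raw_bytes.decode(alleged_encoding, 'strict')
        except UnicodeDecodeError:
            # 2.b: mojibake fallback, deliberately not normalized
            return raw_bytes.decode('latin-1', 'strict'), True
        # 2.a: successful decode, normalize
        return unicodedata.normalize('NFC', name), False

    def add_entry(directory, raw_bytes):
        """Step 3: collision handling; 'directory' is a plain dict standing in for a Tahoe directory."""
        name, failed = name_and_flag(raw_bytes)
        candidate = name
        if candidate in directory and directory[candidate]['failed_decode'] != failed:
            # two different names accidentally produced the same unicode
            n = 1
            candidate = '%s~%d' % (name, n)
            while candidate in directory:
                n += 1
                candidate = '%s~%d' % (name, n)
        directory[candidate] = {'original_bytes': raw_bytes, 'failed_decode': failed}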
Regards, Zooko [1] http://testgrid.allmydata.org:3567/uri/URI%3ADIR2%3Adjrdkfawoqihigoett4g6auz6a%3Ajx5mplfpwexnoqff7y5e4zjus4lidm76dcuarpct7cckorh2dpgq/ [2] http://allmydata.org/trac/tahoe/ticket/534#comment:47 From greg.ewing at canterbury.ac.nz Sun May 3 09:47:17 2009 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sun, 03 May 2009 19:47:17 +1200 Subject: [Python-Dev] yield from? In-Reply-To: <1afaf6160905011827l132a0014o6b1032e20a08552c@mail.gmail.com> References: <1afaf6160905011827l132a0014o6b1032e20a08552c@mail.gmail.com> Message-ID: <49FD4C05.2020301@canterbury.ac.nz> Benjamin Peterson wrote: > What's the status of yield from? There's still a small window open for > a patch to be checked into 3.1's branch. I haven't been following the > python-ideas threads, so I'm not sure if it's ready yet. The PEP itself seems to have settle down, and is awaiting a verdict from Guido. The prototype implementation doesn't quite match the PEP in some of the fine details yet. Also it's for 2.6 rather than 3.x; someone with more knowledge of 3.x internals would be better placed than me to convert it. -- Greg From martin at v.loewis.de Sun May 3 10:17:04 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 03 May 2009 10:17:04 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler Message-ID: <49FD5300.6010906@v.loewis.de> With issue 3672 resolved, it is now unnecessary to introduce an utf-8b codec, since the utf-8 codec will properly report errors for all byte sequences invalid in UTF-8, including lone surrogates. Therefore, utf-8b can be implemented solely through the error handler. Glenn Linderman suggested that the name "python-escape" is not very descriptive, so I've changed the name to "utf8b". I've updated the PEP accordingly. Regards, Martin From stephen at xemacs.org Sun May 3 11:32:38 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sun, 03 May 2009 18:32:38 +0900 Subject: [Python-Dev] PEP 383 and Tahoe [was: GUI libraries] In-Reply-To: References: <49F965DB.6050601@v.loewis.de> <49F96770.4080206@g.nevcal.com> <49F96B80.5090808@v.loewis.de> <49FB2596.1090706@v.loewis.de> <51167066-A162-4AAF-B40D-52C1918032D8@fuhm.net> Message-ID: <877i0yilah.fsf@uwakimon.sk.tsukuba.ac.jp> Zooko O'Whielacronx writes: > However, it is moot because Tahoe is not a new system. It is currently > at v1.4.1, has a strong policy of backwards-compatibility, and already > has lots of data, lots of users, and programmers building on top of > it. Cool! Question: is there a way to negotiate versions, or better yet, features? > I see I'm not explaining the Tahoe requirements clearly. It's probably > that I'm not understanding them clearly myself. Well, it's a high-dimensional problem. Keeping track of all the variables is hard. That's why something like PEP 383 can be important to you even though it's only a partial solution; it eliminates one variable. > Suppose you have run "tahoe cp -r myfiles/ tahoe:" on a Linux system > and then you inspect the files in the Tahoe filesystem, such as by > examining the web interface [1] or by running "tahoe ls", either of > which you could do either from the same machine where you ran "tahoe > cp" or from a different machine (which could be using any operating > system). We have the following requirements about what ends up in your > Tahoe directory after that cp -r. Whoa! Slow down! Where's "my" "Tahoe directory"? Do you mean the directory listing? A copy to whatever system I'm on? 
The bytes that the Tahoe host has just loaded into a network card buffer to tell me about it? The bytes on disk at the Tahoe host? You'll find it a lot easier to explain things if you adopt a precise, consistent terminology. > Requirement 1 (unicode): Each filename that you see needs to be valid > unicode What does "see" mean? In directory listings? Under what circumstances, if any, can what I see be different from what I get? > Requirement 2 (faithful if unicode): For each filename (byte string) > in your myfiles directory, My local myfiles directory, or my Tahoe myfiles directory? > if that bytestring is the valid encoding of some string in your > stated locale, Who stated the locale? How? Are you referring to what getfilesystemencoding returns? This is a "(unicode) string", right? > then the resulting filename in Tahoe is that (unicode) > string. Nobody ever doesn't want this, right? Well, maybe some > people don't want this sometimes, [...]. However, what's the > alternative? Guessing that their locale shouldn't be set to > latin-1 and instead decoding their bytes some other way? Sure. Emacsen do that, you know. Of course it's hard to guess something else if ISO-8859/1 is the preferred encoding, but it does happen. This probably cannot be done accurately enough for Tahoe, though. > It seems like we're not going to do better than > requirement 2 (faithful if unicode). > > Requirement 3 (no file left behind): For each filename (byte string) > in your myfiles directory, whether or not that byte string is the > valid encoding of anything in your stated locale, then that file will > be added into the Tahoe filesystem under *some* name (a good candidate > would be mojibake, e.g. decode the bytes with latin-1, but that is not > the only possibility). That's not even a possibility, actually. Technically, Latin-1 has a "hole" from U+0080 to U+009F. You need to add the C1 controls to fill in that gap. (I don't think it actually matters in practice, everybody seems to implement ISO-8859/1 as though it contained the control characters ... except when detecting encodings ... but it pays to be precise in these things ....) > Now already we can say that these three requirements mean that there > can be collisions -- for example a directory could have two entries, > one of which is not a valid encoding in the locale, and whatever > unicode string we invent to name it with in order to satisfy > requirements 3 (no file left behind) and 1 (unicode) might happen to > be the same as the (correctly-encoded) name of the other file. This is false with rather high probability, but you need some extra structure to deal with it. First, claim the Unicode private planes for Tahoe. Then allocate characters from the private planes on demand as encountered, *including* such characters encountered in external file names to be stored in Tahoe *and* the surrogates used by PEP 383. "Display names" using these private characters would be valid Unicode, but not very useful. However, an algorithmically generated font (like the 4-hex-digit-square used to give a glyph to unknown code points in the BMP) could be used by those who care. Also store mappings from (system encoding, UTF-8b representation) to private char and back. For simplicity, that could be global on your server (IIRC, there are at least two private planes up there, so you'd need to run into almost 128Ki *unique* such characters to run out). 
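A rough sketch of such a registry (purely illustrative; the class and its names are invented, and a real one would have to be persistent and shared):

    class PrivateCharRegistry:
        # Private-use code points of planes 15 and 16, excluding the
        # noncharacters at the end of each plane: 2 * 65534, i.e. almost 128Ki.
        PLANES = ((0xF0000, 0xFFFFD), (0x100000, 0x10FFFD))

        def __init__(self):
            self.by_key = {}    # (encoding, original bytes) -> private-use character
            self.by_char = {}   # private-use character -> (encoding, original bytes)

        def char_for(self, encoding, raw_bytes):
            key = (encoding, raw_bytes)
            if key not in self.by_key:
                n = len(self.by_key)
                for lo, hi in self.PLANES:
                    if n <= hi - lo:
                        ch = chr(lo + n)
                        break
                    n -= hi - lo + 1
                else:
                    raise RuntimeError('private-use planes exhausted')
                self.by_key[key] = ch
                self.by_char[ch] = key
            return self.by_key[key]

        def bytes_for(self, ch):
            """Reverse mapping, e.g. for writing the original bytes back out."""
            return self.by_char[ch]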
I guess you'd be subject to a DOS attack where somebody decided to map all of 80000-odd CNS characters into private space, and then write 80000 files, each with a different 1-character name .... Note that Martin does *not* do this in PEP 383 because PEP 383 only cares about the semantics that a filename read from a directory can be used to access the file associated with it in that directory. For that, a private, non-Unicode encoding is perfectly acceptable. But you want valid Unicode. This scheme gives it to you. The registry of characters is somewhat unpleasant, but it does allow you to detect filenames that are the same reliably. > Possible Requirement 4 (faithful bytes if not unicode, a.k.a. > "round-tripping"): PEP 383 gives you this, but you must store the encoding used for each such file name. > One reason to be skeptical is that about a third of the Russian > files will happen to decode cleanly as shift-jis anyway, and will > therefore come out as something entirely different if the target > filesystem's encoding is something other than shift-jis. The only way to handle this is to store the encoding used to convert to Unicode as part of *every* file's metadata. This could be also used in Tahoe to warn the user that the current system encoding does not match the alleged_encoding used to make the backup. Some users might prefer to use the alleged_encoding on restore. > But an even worse problem -- the show-stopper for me -- is that I > don't want what Tahoe shows when you do "tahoe ls" or view it in a > web browser to differ from what it writes out when you do "tahoe cp > -r tahoe: newfiles/". But as a requirement, that's incoherent. What you are "seeing" is Unicode, what it will write out is bytes. That means that if multiple locales are in use on both the backup and restore systems, and the nominal system encodings are different, people whose personal default locales are not the same as the system's will see what they expect on the backup system (using system ls), mojibake on Tahoe (using tahoe ls), and *different* mojibake on the restore system (system ls, again). Note that "use Tahoe, not system, ls" doesn't help at all (unless the weirdo has learned to read mojibake, which actually does happen, but it's not worth betting on). How likely is that? Hate to tell you this: if you need the "unknown bytes scheme at all, this scenerio is *extremely* likely. How do you think that KOI8-R got into a directory on a Shift-JIS system in the first place? Yup, a Russian visiting professor in Tokyo who set his personal locale to ru_RU.KOI8-R wrote it there. And he's very likely to have the same personal locale on a very up-to-date system with a UTF-8 system encoding when he gets back to Moscow. Bingo! it's mojibake all the way to Moscow. > Now about the "metadata" part which is separate from the filename > itself. I have another requirement: > > Requirement 5 (no loss of information): I don't want Tahoe to destroy > information -- every transformation should be (in principle) > reversible by some future computer-augmented archaeologist. For > example, if a bytestring decodes cleanly with the locale's suggested > encoding, and we use the resulting unicode as the filename, then we > also store the original byte string in the metadata since we don't > know if the locale's suggested encoding was good. UTF-8b would be just as good for storing the original bytestring, as long as you keep the original encoding. 
It's actually probably preferable if PEP 383 can be assumed to be implemented in the versions of Python you use. > This allows the later invention of a tool It will be called "Emacs", by the way. > which shows the user what the filename would > have been with other encodings and let the user choose one that makes > sense. > To copy an entry from a local filesystem into Tahoe: > > 1. On Windows or Mac read the filename with the unicode APIs. > Normalize the string with filename = unicodedata.normalize('NFC', > filename). Leave the "original_bytes" key and the "failed_decode" flag > out of the metadata. NFD is probably better for fuzzy matching and display on legacy terminals. > 2. On Linux or Solaris read the filename with the string APIs, and > store the result in the "original_bytes" part of the metadata. Call > sys.getfilesystemencoding() to get an alleged_encoding. Then, call > bytes.decode(alleged_encoding, 'strict') to try to get a unicode > object. > > 2.a. If this decoding succeeds then normalize the unicode filename > with filename = unicodedata.normalize('NFC', filename), store the > resulting filename and leave the "failed_decode" flag out of the > metadata. Per the koi8-lucky example, you don't know if it succeeded for the right reason or the wrong reason. You really should store the alleged_encoding used in the metadata, always. Note that you should *also* store the failed_decode flag, because the presence of multiple fail_decodes is a very strong indication that some of the users had default encoding != system encoding. If you use the scheme I propose above, of course you have the same information by scanning the file name for Tahoe-only private use characters, but that would be relatively expensive. > 2.b. If this decoding fails, then we decode it again with > bytes.decode('latin-1', 'strict'). Do not normalize it. Store the > resulting unicode object into the "filename" part, set the > "failed_decode" flag to True. This is mojibake! Not necessarily. Most ISO-8859/X names will fail to decode if the alleged_encoding is UTF-8, for example, but many (even for X != 1) will be correctly readable because of the policy of trying to share code points across Latin-X encodings. Certainly ISO-8859/1 (and much ISO-8859/15) will be correct. > 3. (handling collisions) In either case 2.a or 2.b the resulting > unicode string may already be present in the directory. If so, check > the failed_decode flags on the current entry and the new entry. If > they are both set or both unset then the new entry overwrites the old > entry -- they had the same name. If both are set, you're OK, because you are forcing ISO-8859/1. If both are unset, however, you don't know for sure because alleged_encoding is not necessarily a constant. > To copy an entry from Tahoe into a local filesystem: > > Always use the Python unicode API. The original_bytes field and the > failed_decode field in the metadata are not consulted. > > Now a question for python-dev people: could utf-8b or PEP 383 be > useful for requirements like the four requirements listed above? If > not, what requirements does PEP 383 help with? By giving you a standard, invertible way to represent anything that the OS can throw at you, it helps with all of them. > I'm not sure that it can help if you are going to store the results > of your os.listdir() persistently or if you are going to transmit > them over a network. Indeed, using the results that way could lead > to unpleasant surprises. 
No more than any other system for giving a canonical Unicode spelling to the results of an OS call. From l.mastrodomenico at gmail.com Sun May 3 15:29:27 2009 From: l.mastrodomenico at gmail.com (Lino Mastrodomenico) Date: Sun, 3 May 2009 15:29:27 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <49FD5300.6010906@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> Message-ID: 2009/5/3 "Martin v. Löwis" : > With issue 3672 resolved, it is now unnecessary to introduce > an utf-8b codec, since the utf-8 codec will properly report errors > for all byte sequences invalid in UTF-8, including lone surrogates. > Therefore, utf-8b can be implemented solely through the error handler. That's even nicer. One minor detail though, in the sentence: "non-decodable bytes >128 will be represented as lone half surrogate" ">" should be ">=". -- Lino Mastrodomenico From solipsis at pitrou.net Sun May 3 15:43:06 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 3 May 2009 13:43:06 +0000 (UTC) Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler References: <49FD5300.6010906@v.loewis.de> Message-ID: Martin v. Löwis <martin at v.loewis.de> writes: > > Glenn Linderman suggested that the name "python-escape" is not very > descriptive, so I've changed the name to "utf8b". If the error handler is supposed to be used for codecs other than utf-8, perhaps it should renamed something more generic, e.g. "surrogate-escape"? Also, if utf8-b is not provided as a codec, will there be an easy way for user code to use the same encoding as the IO layer does? (e.g. os.fsdecode/os.fsencode)? From ncoghlan at gmail.com Sun May 3 17:09:47 2009 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 4 May 2009 01:09:47 +1000 Subject: [Python-Dev] multi-with statement In-Reply-To: References: <3d0cebfb0905021226y501a5990q5b3ccc016255cdef@mail.gmail.com> Message-ID: <972FCC04-5F53-4098-8AFA-FC70CDF55BEB@gmail.com> (I still don't really have net access back after moving house - just chiming in briefly via my mobile) Anyway, I think there is one very good reason for NOT defining a multi-with statement in terms of an existing tuple: it gains us nothing except speed over contextlib.nested. The whole point of the new syntactic support is to execute each expression inside the context of the preceding managers. That requirement precludes the idea of using an intermediate tuple, since every expression would have to be evaluated before the tuple could be created. I'm still not 100% convinced the saving in indentation levels due to this change would be worth the increase in complexity and ambiguity though. -- Nick Coghlan, Brisbane, Australia On 03/05/2009, at 6:12 AM, Georg Brandl wrote: > Fredrik Johansson schrieb: >> On Sat, May 2, 2009 at 9:01 PM, Georg Brandl >> wrote: >>> Hi, >>> >>> this is just a short notice that Mattias Brändström and I have >>> finished a >>> patch to implement the previously discussed and mostly warmly >>> welcomed >>> extension to with's syntax, allowing >>> >>> with A() as a, B() as b: >>> >>> to be written instead of >>> >>> with A() as a: >>> with B() as b: > >> I was hoping for the other syntax in order to be able to create a >> nested context in advance as a simple tuple: >> >> with A, B: >> pass >> >> context = A, B >> with context: >> pass >> >> (I.e. a tuple, or perhaps any iterable, would be a valid context >> manager.) > > I see; you want to construct your context manager programmatically > and pass > it to "with" without knowing what is in there.
> > While this would be possible, we have to be aware that with this we > would > effectively change the context manager protocol, rather like the > iterator > protocol's __getitem__ alternate realization. This muddies the > definition > of a context manager. > > (The interesting thing is that you could already implement *that* > version > without any new syntactic support, by giving tuples an __enter__/ > __exit__ > method pair.) > >> With the syntax in the patch, I will still have to implement a custom >> nesting context manager to do this, which sort of defeats the >> purpose. > > Not really. Having an unknown number of stacked context managers is > not > the purpose -- for that, I'd still say a custom nesting context > manager > is better, because it is also more explicit when created not at the > "with" > site. (You could even write it as a tuple subclass, if you like the > tuple > interface.) > > Georg > > -- > Thus spake the Lord: Thou shalt indent with four spaces. No more, no > less. > Four shall be the number of spaces thou shalt indent, and the number > of thy > indenting shall be four. Eight shalt thou not indent, nor either > indent thou > two, excepting that thou then proceed to four. Tabs are right out. > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com From murman at gmail.com Sun May 3 17:35:16 2009 From: murman at gmail.com (Michael Urman) Date: Sun, 3 May 2009 10:35:16 -0500 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> Message-ID: On Sun, May 3, 2009 at 08:43, Antoine Pitrou wrote: > Also, if utf8-b is not provided as a codec, will there be an easy way for user > code to use the same encoding as the IO layer does? (e.g. > os.fsdecode/os.fsencode)? I like the idea of fsencode/fsdecode functions, but we need to be careful deciding what they accept and produce on Windows. I'd expect them to be identity functions, but then the difference in platform behavior suggests perhaps they should be in os.path. Unicode to Unicode on Windows would further mean fsencode wouldn't be useful for sending filenames over sockets, and "utf8" will be prone to exceptions on the very names we're trying to support right now. Is there an advantage to not providing the the "utf8b" behavior as a registered codec? -- Michael Urman From martin at v.loewis.de Sun May 3 19:32:47 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 03 May 2009 19:32:47 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> Message-ID: <49FDD53F.9080101@v.loewis.de> > That's even nicer. One minor detail though, in the sentence: > > "non-decodable bytes >128 will be represented as lone half surrogate" > > ">" should be ">=". Thanks, fixed. Martin From martin at v.loewis.de Sun May 3 19:39:41 2009 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Sun, 03 May 2009 19:39:41 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> Message-ID: <49FDD6DD.6050808@v.loewis.de> > If the error handler is supposed to be used for codecs other than utf-8, > perhaps it should renamed something more generic, e.g. "surrogate-escape"? Perhaps. 
However, utf-8b doesn't really have to do anything with utf-8 - it's an algorithm based on 16-bit or 32-bit code points. > Also, if utf8-b is not provided as a codec, will there be an easy way for user > code to use the same encoding as the IO layer does? s.encode(os.getfilesystemencoding(), "utf8b") will do just that (in fact, that's exactly what the IO layer does). Regards, Martin From greg at krypto.org Sun May 3 21:20:07 2009 From: greg at krypto.org (Gregory P. Smith) Date: Sun, 3 May 2009 12:20:07 -0700 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <49FDD6DD.6050808@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <49FDD6DD.6050808@v.loewis.de> Message-ID: <52dc1c820905031220l2f0671b0u425660b85e20d12f@mail.gmail.com> On Sun, May 3, 2009 at 10:39 AM, "Martin v. L?wis" wrote: > > If the error handler is supposed to be used for codecs other than utf-8, > > perhaps it should renamed something more generic, e.g. > "surrogate-escape"? > > Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - > it's an algorithm based on 16-bit or 32-bit code points. To me that lack of relationship with utf8 suggests that it should not be called utf8b... But I don't have any good suggestions. > > > Also, if utf8-b is not provided as a codec, will there be an easy way for > user > > code to use the same encoding as the IO layer does? > > s.encode(os.getfilesystemencoding(), "utf8b") will do just that (in > fact, that's exactly what the IO layer does). > > Regards, > Martin > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > http://mail.python.org/mailman/options/python-dev/greg%40krypto.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From martin at v.loewis.de Sun May 3 22:27:59 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 03 May 2009 22:27:59 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <52dc1c820905031220l2f0671b0u425660b85e20d12f@mail.gmail.com> References: <49FD5300.6010906@v.loewis.de> <49FDD6DD.6050808@v.loewis.de> <52dc1c820905031220l2f0671b0u425660b85e20d12f@mail.gmail.com> Message-ID: <49FDFE4F.30200@v.loewis.de> > > If the error handler is supposed to be used for codecs other than > utf-8, > > perhaps it should renamed something more generic, e.g. > "surrogate-escape"? > > Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - > it's an algorithm based on 16-bit or 32-bit code points. > > > To me that lack of relationship with utf8 suggests that it should not be > called utf8b Perhaps. However, giving it that name was Markus Kuhn's choice - and while it may be confusing, it's (IMO) useful to be consistent with this background. Regards, Martin From greg at krypto.org Sun May 3 23:11:51 2009 From: greg at krypto.org (Gregory P. Smith) Date: Sun, 3 May 2009 14:11:51 -0700 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <49FDFE4F.30200@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <49FDD6DD.6050808@v.loewis.de> <52dc1c820905031220l2f0671b0u425660b85e20d12f@mail.gmail.com> <49FDFE4F.30200@v.loewis.de> Message-ID: <52dc1c820905031411x488c7d51u4f068a9d419b0318@mail.gmail.com> On Sun, May 3, 2009 at 1:27 PM, "Martin v. 
L?wis" wrote: > > > If the error handler is supposed to be used for codecs other than > > utf-8, > > > perhaps it should renamed something more generic, e.g. > > "surrogate-escape"? > > > > Perhaps. However, utf-8b doesn't really have to do anything with > utf-8 - > > it's an algorithm based on 16-bit or 32-bit code points. > > > > > > To me that lack of relationship with utf8 suggests that it should not be > > called utf8b > > Perhaps. However, giving it that name was Markus Kuhn's choice - and > while it may be confusing, it's (IMO) useful to be consistent with this > background. > > Regards, > Martin > > Ah, right. My original searches for utf8b didn't turn up much but searching on his name turns some up. Good choice of name then. http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html http://bsittler.livejournal.com/10381.html http://hyperreal.org/~est/utf-8b/ -gps -------------- next part -------------- An HTML attachment was scrubbed... URL: From benjamin at python.org Mon May 4 00:50:29 2009 From: benjamin at python.org (Benjamin Peterson) Date: Sun, 3 May 2009 17:50:29 -0500 Subject: [Python-Dev] yield from? In-Reply-To: <49FD4C05.2020301@canterbury.ac.nz> References: <1afaf6160905011827l132a0014o6b1032e20a08552c@mail.gmail.com> <49FD4C05.2020301@canterbury.ac.nz> Message-ID: <1afaf6160905031550h59af1bbaoc298b0f97f7c25c8@mail.gmail.com> 2009/5/3 Greg Ewing : > Benjamin Peterson wrote: >> >> What's the status of yield from? There's still a small window open for >> a patch to be checked into 3.1's branch. I haven't been following the >> python-ideas threads, so I'm not sure if it's ready yet. > > The PEP itself seems to have settle down, and is > awaiting a verdict from Guido. Guido is now on vacation until the 18th, so I think this will have to be deferred until 2.7/3.2. > > The prototype implementation doesn't quite match > the PEP in some of the fine details yet. Also > it's for 2.6 rather than 3.x; someone with more > knowledge of 3.x internals would be better placed > than me to convert it. -- Regards, Benjamin From jimjjewett at gmail.com Mon May 4 06:36:05 2009 From: jimjjewett at gmail.com (Jim Jewett) Date: Mon, 4 May 2009 00:36:05 -0400 Subject: [Python-Dev] PEP 383 and GUI libraries Message-ID: (sent only to python-dev, as I am not a subscriber of tahoe-dev) Zooko wrote: > [Tahoe] currently uses utf-8 for its internal storage (note: nothing to > do with reading or writing files from external sources -- only for > storing filenames in the decentralized storage system which is > accessed by Tahoe clients), and we can't start putting non-utf-8-valid > sequences in the "filename" slot because other Tahoe clients would > then get a UnicodeDecodeError exception when trying to read those > directories. So what do you do when someone has an existing file whose name is supposed to be in utf-8, but whose actual bytes are not valid utf-8? If you have somehow solved that problem, then you're already done -- the PEP's encoding is a no-op on anything that isn't already invalid unicode. If you have not solved that problem, then those clients will already be getting a UnicodeDecodeError; all the PEP does is make it at least possible for them to recover. ... > Requirement 1 (unicode): Each filename that you see needs to be valid > unicode (it is stored internally in utf-8). (repeating) What does Tahoe do if this is violated? Do you throw an exception right there and not let them copy the file to tahoe? 
If so, then that same error correction means that utf8b will never differ from utf-8, and you have nothing to worry about. > Requirement 2 (faithful if unicode): Doesn't the PEP meet this? > Requirement 3 (no file left behind): Doesn't the PEP also meet this? I thought the concern was just that the name used would not be valid unicode, unless the original name was itself valid unicode. > Possible Requirement 4 (faithful bytes if not unicode, a.k.a. > "round-tripping"): Doesn't the PEP also support this? (Only) the invalid bytes get escaped and therefore must be unescaped, but the escapement is reversible. > 3. (handling collisions) In either case 2.a or 2.b the resulting > unicode string may already be present in the directory. This collision is what the use of half-surrogates (as the escape characters) avoids. Such collisions can't be present unless the data was invalid unicode, in which case it was the result of an escapement (unless something other than python is creating new invalid filenames). -jJ From larry at hastings.org Mon May 4 11:10:51 2009 From: larry at hastings.org (Larry Hastings) Date: Mon, 04 May 2009 02:10:51 -0700 Subject: [Python-Dev] Proposed: drop unnecessary "context" pointer from PyGetSetDef Message-ID: <49FEB11B.2040304@hastings.org> I should have brought this up to python-dev before--sorry for being so slow. It's already in the tracker for a couple of days: http://bugs.python.org/issue5880 The idea: PyGetSetDef has this "void *closure" field that acts like a context pointer. You stick it in the PyGetSetDef, and it gets passed back to you when your getter or setter is called. It's a reasonable API design, but in practice you almost never need it. Meanwhile, it clutters up CPython, particularly typeobject.c; there are all these function calls that end with ", NULL);", just to satisfy the getter/setter prototype internally. Most of the time, the "closure" parameter is not only unused, it is skipped. PyGetSetDef definitions generally skip it, and often getter and setter implementations omit it. The "closure" was only actually *used* once in CPython, a silly use in Objects/longobject.c where it was abused as an integer value. And yes, I said "was": inspired by this discussion, Mark Dickinson removed this use in r72202 (trunk) and r72203 (py3k). So the "closure" field is now 100% unused in the python and py3k trunks. Mr. Dickinson also located an extension using the "closure" pointer, pyephem, which... *also* uses it to store an integer. Indeed, I have yet to see a use where someone stores a pointer in "closure". Anyone who needed functionality like this could roll it themselves with stub functions: PyObject *my_getter_with_context(PyObject *self, void *context) { /* ... */ } PyObject *my_getter_A(PyObject *self) { return my_getter_with_context(self, "A"); } PyObject *my_getter_B(PyObject *self) { return my_getter_with_context(self, "B"); } /* etc. */ (Although it'd make my example more realistic if "context" were an int!) So: you don't need it, it clutters up our code (particularly typeobject.c), and it adds overhead. The only good reason to keep it is backwards compatibility, which I admit is a fine reason. Whaddya think? To be honest I'd be surprised if you guys went for this. But I thought it was worth suggesting. 
/larry/ From eric at trueblade.com Mon May 4 13:37:33 2009 From: eric at trueblade.com (Eric Smith) Date: Mon, 04 May 2009 07:37:33 -0400 Subject: [Python-Dev] Changing float.__format__ Message-ID: <49FED37D.30906@trueblade.com> In issue 5920, Mark Dickinson raises an issue having to do with float.__format__ and how it handles the default format presentation type (that is, none of 'f', 'g', or 'e') versus how str() works on floats: http://bugs.python.org/issue5920 I agree with him that the current behavior is confusing and should be changed. I'm going to make this change, unless anyone objects. Please comment on the issue itself if you have any feedback. Eric. From dickinsm at gmail.com Mon May 4 14:13:25 2009 From: dickinsm at gmail.com (Mark Dickinson) Date: Mon, 4 May 2009 13:13:25 +0100 Subject: [Python-Dev] Proposed: drop unnecessary "context" pointer from PyGetSetDef In-Reply-To: <49FEB11B.2040304@hastings.org> References: <49FEB11B.2040304@hastings.org> Message-ID: <5c6f2a5d0905040513t42f167f9pf44d4a28d355df47@mail.gmail.com> On Mon, May 4, 2009 at 10:10 AM, Larry Hastings wrote: > So: you don't need it, it clutters up our code (particularly typeobject.c), > and it adds overhead. ?The only good reason to keep it is backwards > compatibility, which I admit is a fine reason. Presumably whoever added the context field had a reason for doing so. Does anyone remember what the intended use was? Trawling through the history, all I could find was this comment, attached to revision 23270: [Modified Thu Sep 20 21:45:26 2001 UTC (7 years, 7 months ago) by gvanrossum] """ Add optional docstrings to getset descriptors. Fortunately, there's no backwards compatibility to worry about, so I just pushed the 'closure' struct member to the back -- it's never used in the current code base (I may eliminate it, but that's more work because the getter and setter signatures would have to change.) """ Still, binary compatibility seems like a fairly strong reason not to remove the closure field. Mark From gregor.lingl at aon.at Mon May 4 16:33:58 2009 From: gregor.lingl at aon.at (Gregor Lingl) Date: Mon, 04 May 2009 16:33:58 +0200 Subject: [Python-Dev] turtle.py update for 3.1 Message-ID: <49FEFCD6.1040001@aon.at> Hi, Encouraged by a conversation with Martin at PyCon 2009 I've prepared a version 1.1b of the turtle module and I'd like to get some advice or assistance to get it into the beta as explained below. Thus I'd appreciate very much if also the release manager would take notice of this posting. python 2.0 had the version 1.0 and for now I'll give a terse summary of the changes I did: 1. a few bugfixes, with 1 - 5 lines of code changed for each; these concern bugs that prevented turtle to run correctly 2. I've added four methods to the class TurtleScreeenBase: _onkeypress(fun, key) (supplementing _onkeyrelease) mainloop() (which is now a Screen-method and a function) textinput(title, prompt) numinput(title, prompt, default, minval, maxval) the latter two remedy the complete lack of input methods _onkey, an internal method name is changed to _onkeyrelease 3. I've added one method to the class TurtleScreen: onkeypress(fun, key=None) implemented in analogy to the already present onkey() which got onkeyrelease as an alias. 4. I've changed several portions of the code that affect the representation of the turtleshape thus making it more compact (by removing some duplicated code) and more powerful, i. e. 
by adding the possibility to apply shearings to turtleshapes (in addition to the already present scaling and rotating transformations). Thus now the full range of (non singular) linear transformations is available. New methods in class RawTurtle: shearfactor(shear=None) set or get the shearfactor shapetransform(t11, t12, t21, t22) set or get the shape transform directly get_shapepoly() return the polygon of the current shape I've enhanced the functionality of tiltangle(angle=None) to contain also that of settiltangle and I propose to declare settiltangle as deprecated. 5. I've removed a lot of codelines that were commented out during the process of transferring the module from 2.6 to 3.0 6. I've implemented the bugfix for http://bugs.python.org/issue4117 according do my proposition there and I strongly recommend this change again, as the bug described is very annoying, the fix is easy and no one proposed a better solution. 7. I've tested the present version 1.1 extensivly. It runs all the demo scripts without problems and many others too (some of them significantly better than version 1.1). I'd like to add two additional scripts to the demo directory, one of them using new features so it only runs with this new version. I've *not* touched the issue of the Screen singleton, so that remains unchanged as it was as a result of Martins patch. Thus, as a summary, this update does some bugfixes and eliminates three deficiencies of the module: (1) accept keypress event, (2) provide user input functions and (3) complement scaling and rotating of turtleshapes by shearing, thus providing the full range of linear transforms. HOW TO PROCEED NOW? (1) Submit the new version as a single file (2) submit a unified diff containing all the changes (3) Divide the changes into several chunks of related changes and submit the according diffs separately That would pose the problems, that there are lines in the code that are affected by several changes, e. g. those lines that define __all__ And also: does the order of applying the patches matter? How do I have to account for this? (4) Some other approach? I'd appreciate to discuss open issues as needed and I'm prepared to give more elaborate explanations and rationales as wanted or as needed. Docs for the changes are (to a large extent) contained in the docstrings and I'm going to update the Documentation of the turtle module (on the basis of theses docstrings) now. Thanks in advance for your support Gregor From phd at phd.pp.ru Mon May 4 17:07:49 2009 From: phd at phd.pp.ru (Oleg Broytmann) Date: Mon, 4 May 2009 19:07:49 +0400 Subject: [Python-Dev] PyPI copyright Message-ID: <20090504150749.GG16721@phd.pp.ru> http://pypi.python.org/pypi "Copyright ? 1990-2007, Python Software Foundation" :s/2007/2009/ Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd at phd.pp.ru Programmers don't die, they just GOSUB without RETURN. From mail at apexo.de Mon May 4 17:28:54 2009 From: mail at apexo.de (Christian Schubert) Date: Mon, 4 May 2009 17:28:54 +0200 Subject: [Python-Dev] RFC: Threading-Aware Profiler for Python Message-ID: <200905041728.55350.mail@apexo.de> Hi, Python ships with a profiler module which, unfortunately, is almost useless in a multi-threaded environment. * I've created an alternative profiler module which queries per-thread CPU usage via netlink/taskstats, which limits the applicability to Linux (which shouldn't be much of an issue, profiling is usually not done by end users). 
It implements two modes: a "sampling" (does CPU time accounting based on stack frames 100 times per second, by default) and a "deterministic" profiler (does CPU time accounting on each function call/return, based on the sys.setprofile interface). The deterministic profiler is currently implemented in pure python (except for taskstats interface) and much slower than the sampling profiler. Usage (don't forget to run make to build the C module): python >> from Profiler import * >> def f(): do_something() >> sampling_profiler(f) or >> deterministic_profiler(f) Output is currently in the form of annotated source code (xyz.py.html, in the same directory where xyz.py resides). Before the *_profiler function returns, it iterates over all code objects it encountered and annotates the source files with 2 columns in front: - 1st column: real time - 2nd column: CPU time numbers are log2(time_in_ns), colors are green-to-yellow for below-average and yellow-to-red for above-average metrics (relative to the average metric for all lines of the code object with a metric > 0). Is there common need for such a module? Is it possible to have this included in the standard CPython distribution? Which functional changes (besides a modification of the annotation output which shouldn't spread its result all over the FS) would be required to get this included? Which non-functional changes would be required to get this included? Please direct traffic regarding this subject to pyprof-devel at lists.sourceforge.net (no I'm not subscribed to python-dev). SF project page: https://sourceforge.net/projects/pyprof/ git repository: git://pyprof.git.sourceforge.net/gitroot/pyprof Regards, Christian *) to be more exact there are at least three profiler modules: profile, cProfile, and hotshot, while I only tried (and failed) to use profile in a multi-threaded environment (by manually setting threading.profile to the profiling function), glancing at the source, I'm pretty sure that cProfile behaves similarly; I didn't test the hotshot module, but it does some other trade-offs (space-for-time), so I think that "pyprof" still adds some value -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part. URL: From aahz at pythoncraft.com Mon May 4 17:56:04 2009 From: aahz at pythoncraft.com (Aahz) Date: Mon, 4 May 2009 08:56:04 -0700 Subject: [Python-Dev] RFC: Threading-Aware Profiler for Python In-Reply-To: <200905041728.55350.mail@apexo.de> References: <200905041728.55350.mail@apexo.de> Message-ID: <20090504155604.GA21330@panix.com> On Mon, May 04, 2009, Christian Schubert wrote: > > Python ships with a profiler module which, unfortunately, is almost > useless in a multi-threaded environment. * > > I've created an alternative profiler module which queries per-thread > CPU usage via netlink/taskstats, which limits the applicability to > Linux (which shouldn't be much of an issue, profiling is usually > not done by end users). It implements two modes: a "sampling" (does > CPU time accounting based on stack frames 100 times per second, by > default) and a "deterministic" profiler (does CPU time accounting > on each function call/return, based on the sys.setprofile interface). The > deterministic profiler is currently implemented in pure python (except > for taskstats interface) and much slower than the sampling profiler. If you want to discuss this, please subscribe to python-ideas and repost your message.
Generally speaking, in order to include modules like this, they need to prove themselves over time and may require PEP approval. If you choose to move the discussion to python-ideas, it would help if you mention known uses of your module. -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ "It is easier to optimize correct code than to correct optimized code." --Bill Harlan From fumanchu at aminus.org Mon May 4 18:15:24 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Mon, 4 May 2009 09:15:24 -0700 Subject: [Python-Dev] RFC: Threading-Aware Profiler for Python In-Reply-To: <200905041728.55350.mail@apexo.de> References: <200905041728.55350.mail@apexo.de> Message-ID: Christian Schubert wrote: > I've created an alternative profiler module which queries per-thread > CPU usage via netlink/taskstats, which limits the applicability to > Linux (which shouldn't be much of an issue, profiling is usually not > done by end users). One of the uses for a profiling module is to compare runs on various platforms. And please, stop perpetuating the myth that only end-users use anything but Linux. Robert Brewer fumanchu at aminus.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From janssen at parc.com Mon May 4 18:19:26 2009 From: janssen at parc.com (Bill Janssen) Date: Mon, 4 May 2009 09:19:26 PDT Subject: [Python-Dev] RFC: Threading-Aware Profiler for Python In-Reply-To: <200905041728.55350.mail@apexo.de> References: <200905041728.55350.mail@apexo.de> Message-ID: <38623.1241453966@parc.com> Hi, Christian. Christian Schubert wrote: > I've created an alternative profiler module which queries per-thread > CPU usage via netlink/taskstats, which limits the applicability to > Linux (which shouldn't be much of an issue, profiling is usually not > done by end users). A surprisingly large # of developers are running on OS X these days, though. I suggest make it work there, too. Bill From larry at hastings.org Mon May 4 19:08:12 2009 From: larry at hastings.org (Larry Hastings) Date: Mon, 04 May 2009 10:08:12 -0700 Subject: [Python-Dev] Proposed: drop unnecessary "context" pointer from PyGetSetDef In-Reply-To: <5c6f2a5d0905040513t42f167f9pf44d4a28d355df47@mail.gmail.com> References: <49FEB11B.2040304@hastings.org> <5c6f2a5d0905040513t42f167f9pf44d4a28d355df47@mail.gmail.com> Message-ID: <49FF20FC.2060202@hastings.org> Mark Dickinson wrote: > Still, binary compatibility seems like a fairly strong reason not to > remove the closure field. My understanding is that there a) 2.x extension modules are not binary compatible with 3.x, and b) there are essentially no 3.x extension modules in the field. Is that accurate? If we don't have an installed base (yet) to worry about, now's the time to make this change. /larry/ From amauryfa at gmail.com Mon May 4 19:17:15 2009 From: amauryfa at gmail.com (Amaury Forgeot d'Arc) Date: Mon, 4 May 2009 19:17:15 +0200 Subject: [Python-Dev] Proposed: drop unnecessary "context" pointer from PyGetSetDef In-Reply-To: <49FF20FC.2060202@hastings.org> References: <49FEB11B.2040304@hastings.org> <5c6f2a5d0905040513t42f167f9pf44d4a28d355df47@mail.gmail.com> <49FF20FC.2060202@hastings.org> Message-ID: Hi, Larry Hastings wrote: > > Mark Dickinson wrote: >> >> Still, binary compatibility seems like a fairly strong reason not to >> remove the closure field. > > My understanding is that there a) 2.x extension modules are not binary > compatible with 3.x, and b) there are essentially no 3.x extension modules > in the field. ?Is that accurate? 
?If we don't have an installed base (yet) > to worry about, now's the time to make this change. cx_Oracle at least uses this closure field, and has already been ported to 3.x: http://www.google.com/codesearch?q=Connection_SetOCIAttr+trunk -- Amaury Forgeot d'Arc From larry at hastings.org Mon May 4 21:04:55 2009 From: larry at hastings.org (Larry Hastings) Date: Mon, 04 May 2009 12:04:55 -0700 Subject: [Python-Dev] Proposed: drop unnecessary "context" pointer from PyGetSetDef In-Reply-To: References: <49FEB11B.2040304@hastings.org> <5c6f2a5d0905040513t42f167f9pf44d4a28d355df47@mail.gmail.com> <49FF20FC.2060202@hastings.org> Message-ID: <49FF3C57.6030106@hastings.org> Amaury Forgeot d'Arc wrote: > Larry Hastings wrote: > >> My understanding is that there a) 2.x extension modules are not binary >> compatible with 3.x, and b) there are essentially no 3.x extension modules >> in the field. Is that accurate? If we don't have an installed base (yet) >> to worry about, now's the time to make this change. >> > cx_Oracle at least uses this closure field, and has already been ported to 3.x: > http://www.google.com/codesearch?q=Connection_SetOCIAttr+trunk And they're using it as a pointer, too! Nice to see it not abused for once. If it helps, I volunteer to port cx_Oracle to the new PyGetSetDef if my patch is accepted. The resulting code would be backwards-compatible with Python 3.0, so it could be incorporated immediately. Given the lack of interest in the proposal so far, this is an easy vow to make! /larry/ From daniel at stutzbachenterprises.com Mon May 4 21:11:06 2009 From: daniel at stutzbachenterprises.com (Daniel Stutzbach) Date: Mon, 4 May 2009 14:11:06 -0500 Subject: [Python-Dev] Proposed: drop unnecessary "context" pointer from PyGetSetDef In-Reply-To: <49FEB11B.2040304@hastings.org> References: <49FEB11B.2040304@hastings.org> Message-ID: On Mon, May 4, 2009 at 4:10 AM, Larry Hastings wrote: > So: you don't need it, it clutters up our code (particularly typeobject.c), > and it adds overhead. The only good reason to keep it is backwards > compatibility, which I admit is a fine reason. > If you make the change, will 3rd party code that relies on it fail in unexpected ways, or will they just get a compile error? -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.f.moore at gmail.com Mon May 4 21:52:02 2009 From: p.f.moore at gmail.com (Paul Moore) Date: Mon, 4 May 2009 20:52:02 +0100 Subject: [Python-Dev] RFC: Threading-Aware Profiler for Python In-Reply-To: <38623.1241453966@parc.com> References: <200905041728.55350.mail@apexo.de> <38623.1241453966@parc.com> Message-ID: <79990c6b0905041252p42650f89s90a1fe1da284b556@mail.gmail.com> 2009/5/4 Bill Janssen : > Hi, Christian. > > Christian Schubert wrote: > >> I've created an alternative profiler module which queries per-thread >> CPU usage via netlink/taskstats, which limits the applicability to >> Linux (which shouldn't be much of an issue, profiling is usually not >> done by end users). > > A surprisingly large # of developers are running on OS X these days, > though. ?I suggest make it work there, too. And Windows. I doubt that the various Windows-specific modules available were developed on Linux. And I wouldn't assume that all of the platform-neutral modules are developed on Linux, or even that the developers have access to Linux. (I know I don't, short of building a brand new virtual machine...) Paul. 
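For what it's worth, the portability limit in Christian's module sits in the netlink/taskstats CPU accounting, not in the hook itself: the "deterministic" mode he describes is driven by the interpreter's per-call profiling hook (presumably sys.setprofile), which is available on every platform. A minimal sketch of just that mechanism, charging wall-clock rather than per-thread CPU time and using purely illustrative names:

    # Minimal sketch of a call/return-based ("deterministic") profiler built
    # on sys.setprofile.  This is not Christian's pyprof: it charges
    # wall-clock time per code object instead of per-thread CPU time from
    # taskstats, so it runs on any platform.
    import sys
    import time
    from collections import defaultdict

    class TinyDeterministicProfiler:
        def __init__(self):
            self.totals = defaultdict(float)   # code object -> inclusive seconds
            self._stack = []                   # (code, start_time) pairs

        def _hook(self, frame, event, arg):
            now = time.time()
            if event == "call":
                self._stack.append((frame.f_code, now))
            elif event == "return" and self._stack:
                code, started = self._stack.pop()
                self.totals[code] += now - started

        def run(self, func, *args, **kwargs):
            sys.setprofile(self._hook)
            try:
                return func(*args, **kwargs)
            finally:
                sys.setprofile(None)

    def do_something():
        return sum(i * i for i in range(100000))

    prof = TinyDeterministicProfiler()
    prof.run(do_something)
    for code, seconds in sorted(prof.totals.items(), key=lambda item: -item[1]):
        print("%-20s %.6fs" % (code.co_name, seconds))
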
From dickinsm at gmail.com Mon May 4 22:00:23 2009 From: dickinsm at gmail.com (Mark Dickinson) Date: Mon, 4 May 2009 21:00:23 +0100 Subject: [Python-Dev] Proposed: drop unnecessary "context" pointer from PyGetSetDef In-Reply-To: References: <49FEB11B.2040304@hastings.org> Message-ID: <5c6f2a5d0905041300qe500a21vc90b72382883236a@mail.gmail.com> On Mon, May 4, 2009 at 8:11 PM, Daniel Stutzbach wrote: > If you make the change, will 3rd party code that relies on it fail in > unexpected ways, or will they just get a compile error? I *think* that third party code that's recompiled for 3.1 and that doesn't use the closure field will either just work, or will produce an easily-fixed compile error. Larry, does this sound right? But I guess the bigger issue is that extensions already compiled against 3.0 that use PyGetSetDef (even if they don't make use of the closure field) won't work with 3.1 without a recompile: they'll segfault, or otherwise behave unpredictably. If that's not considered a problem, then surely we ought to be getting rid of tp_reserved? Mark From daniel at stutzbachenterprises.com Mon May 4 22:07:50 2009 From: daniel at stutzbachenterprises.com (Daniel Stutzbach) Date: Mon, 4 May 2009 15:07:50 -0500 Subject: [Python-Dev] Proposed: drop unnecessary "context" pointer from PyGetSetDef In-Reply-To: <5c6f2a5d0905041300qe500a21vc90b72382883236a@mail.gmail.com> References: <49FEB11B.2040304@hastings.org> <5c6f2a5d0905041300qe500a21vc90b72382883236a@mail.gmail.com> Message-ID: On Mon, May 4, 2009 at 3:00 PM, Mark Dickinson wrote: > But I guess the bigger issue is that extensions already compiled against > 3.0 > that use PyGetSetDef (even if they don't make use of the closure field) > won't work with 3.1 without a recompile: they'll segfault, or otherwise > behave > unpredictably. > I was under the impression that binary compatibility was only guaranteed within a minor revision (e.g., 2.6.1 must run code compiled for 2.6.0, but 2.7.0 doesn't have to). I've been wrong before, though. Certainly the C extension module I maintain is sprinkled with #ifdef's so it will compile under 2.5, 2.6, and 3.0. ;-) -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Mon May 4 22:15:21 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 4 May 2009 20:15:21 +0000 (UTC) Subject: [Python-Dev] =?utf-8?q?Proposed=3A_drop_unnecessary_=22context=22?= =?utf-8?q?_pointer_from=09PyGetSetDef?= References: <49FEB11B.2040304@hastings.org> <5c6f2a5d0905041300qe500a21vc90b72382883236a@mail.gmail.com> Message-ID: Mark Dickinson gmail.com> writes: > > I *think* that third party code that's recompiled for 3.1 and that > doesn't use the closure field will either just work, or will produce an > easily-fixed compile error. Larry, does this sound right? This doesn't sound right. The functions in the third party code will get compiled with the wrong signature, so they can crash (or behave unexpectedly) when called by Python. 
From dickinsm at gmail.com Mon May 4 22:18:20 2009 From: dickinsm at gmail.com (Mark Dickinson) Date: Mon, 4 May 2009 21:18:20 +0100 Subject: [Python-Dev] Proposed: drop unnecessary "context" pointer from PyGetSetDef In-Reply-To: References: <49FEB11B.2040304@hastings.org> <5c6f2a5d0905041300qe500a21vc90b72382883236a@mail.gmail.com> Message-ID: <5c6f2a5d0905041318x504b83f5re90cafe5db099c89@mail.gmail.com> On Mon, May 4, 2009 at 9:15 PM, Antoine Pitrou wrote: > Mark Dickinson gmail.com> writes: >> >> I *think* that third party code that's recompiled for 3.1 and that >> doesn't use the closure field will either just work, or will produce an >> easily-fixed compile error. ?Larry, does this sound right? > > This doesn't sound right. The functions in the third party code will get > compiled with the wrong signature, so they can crash (or behave unexpectedly) > when called by Python. Yes, of course the signature of the getters and setters changes. Please ignore me. :-) Mark From larry at hastings.org Mon May 4 22:29:19 2009 From: larry at hastings.org (Larry Hastings) Date: Mon, 04 May 2009 13:29:19 -0700 Subject: [Python-Dev] Proposed: drop unnecessary "context" pointer from PyGetSetDef In-Reply-To: References: <49FEB11B.2040304@hastings.org> <5c6f2a5d0905041300qe500a21vc90b72382883236a@mail.gmail.com> Message-ID: <49FF501F.9040503@hastings.org> Mark Dickinson wrote: > I *think* that third party code that's recompiled for 3.1 and that > doesn't use the closure field will either just work, or will produce an > easily-fixed compile error. Larry, does this sound right? > Yep. > But I guess the bigger issue is that extensions already compiled against 3.0 > that use PyGetSetDef (even if they don't make use of the closure field) > won't work with 3.1 without a recompile: they'll segfault, or otherwise behave > unpredictably. > Well, I think they'd work if they didn't use the closure and they had only one entry in their array of PyGetSetDefs. But more than one, and yes it would behave unpredictably. Probably segfault. > If that's not considered a problem, then surely we ought to be getting rid of > tp_reserved? In principle they are equivalent, but in practice removing tp_reserved is a much bigger change. Removing the closure field would result in obvious compile errors, and plenty of folks wouldn't even experience those. Removing tp_reserved would affect everybody, with inscrutable compiler errors. Personally I'd be up for removing tp_reserved. But I lack the caution regarding backwards compatibility that has served Python so well, so you're ill-advised to listen to me. Daniel Stutzbach wrote: > I was under the impression that binary compatibility was only > guaranteed within a minor revision (e.g., 2.6.1 must run code compiled > for 2.6.0, but 2.7.0 doesn't have to). I've been wrong before, though. My understanding is that that's the explicit guarantee. However Python has been well-served by being much more cautious than that, a policy with which I cannot find fault. > Certainly the C extension module I maintain is sprinkled with #ifdef's > so it will compile under 2.5, 2.6, and 3.0. ;-) Happily this is one change where you could maintain backwards compatibility without #ifdefs. If you use the closure field, change your code to use stub functions and pass the closure data in yourself. /larry/ From greg at krypto.org Tue May 5 00:42:15 2009 From: greg at krypto.org (Gregory P. 
Smith) Date: Mon, 4 May 2009 15:42:15 -0700 Subject: [Python-Dev] turtle.py update for 3.1 In-Reply-To: <49FEFCD6.1040001@aon.at> References: <49FEFCD6.1040001@aon.at> Message-ID: <52dc1c820905041542k365221d8t41d324ee5a169724@mail.gmail.com> On Mon, May 4, 2009 at 7:33 AM, Gregor Lingl wrote: > Hi, > > Encouraged by a conversation with Martin at PyCon 2009 > I've prepared a version 1.1b of the turtle module and I'd like to > get some advice or assistance to get it into the beta as explained > below. Thus I'd appreciate very much if also the release manager > would take notice of this posting. > > python 2.0 had the version 1.0 and for now I'll give a terse > summary of the changes I did: > > 1. a few bugfixes, with 1 - 5 lines of code changed for each; > these concern bugs that prevented turtle to run correctly > > 2. I've added four methods to the class TurtleScreeenBase: > _onkeypress(fun, key) (supplementing _onkeyrelease) > mainloop() (which is now a Screen-method and a function) > textinput(title, prompt) > numinput(title, prompt, default, minval, maxval) > the latter two remedy the complete lack of input methods > > _onkey, an internal method name is changed to _onkeyrelease > > 3. I've added one method to the class TurtleScreen: > onkeypress(fun, key=None) implemented in analogy to the already > present onkey() > which got onkeyrelease as an alias. > > 4. I've changed several portions of the code that affect > the representation of the turtleshape thus making it > more compact (by removing some duplicated code) and more > powerful, i. e. by adding the possibility to apply > shearings to turtleshapes (in addition to the already present > scaling and rotating transformations). Thus now the full > range of (non singular) linear transformations is available. > > New methods in class RawTurtle: > shearfactor(shear=None) set or get the shearfactor > shapetransform(t11, t12, t21, t22) > set or get the shape transform directly > get_shapepoly() return the polygon of the current shape > > I've enhanced the functionality of tiltangle(angle=None) > to contain also that of settiltangle and I propose to > declare settiltangle as deprecated. > 5. I've removed a lot of codelines that were commented out > during the process of transferring the module from 2.6 > to 3.0 > > 6. I've implemented the bugfix for http://bugs.python.org/issue4117 > according do my proposition there and I strongly > recommend this change again, as the bug described is very > annoying, the fix is easy and no one proposed a better > solution. > > 7. I've tested the present version 1.1 extensivly. It runs > all the demo scripts without problems and many others > too (some of them significantly better than version 1.1). > I'd like to add two additional scripts to the demo > directory, one of them using new features so it only runs > with this new version. > > I've *not* touched the issue of the Screen singleton, so that > remains unchanged as it was as a result of Martins patch. > > Thus, as a summary, this update does some bugfixes and eliminates > three deficiencies of the module: (1) accept keypress event, > (2) provide user input functions and (3) complement scaling > and rotating of turtleshapes by shearing, thus providing > the full range of linear transforms. > > HOW TO PROCEED NOW? 
> > (1) Submit the new version as a single file > (2) submit a unified diff containing all the changes > (3) Divide the changes into several chunks of > related changes and submit the according diffs separately > That would pose the problems, that there are lines > in the code that are affected by several changes, > e. g. those lines that define __all__ > And also: does the order of applying the patches matter? > How do I have to account for this? > (4) Some other approach? I'm happy with option #1. If you find it reasonable to break things into mutliple changes, feel free to do it, but at this point the turtle module hasn't had a much love in ages so a large update in one commit is not a problem IMHO. > > > I'd appreciate to discuss open issues as needed and I'm > prepared to give more elaborate explanations and rationales > as wanted or as needed. > > Docs for the changes are (to a large extent) contained in the > docstrings and I'm going to update the Documentation of the > turtle module (on the basis of theses docstrings) now. > > Thanks in advance for your support > > Gregor > > > > > > > > > > > > > > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > http://mail.python.org/mailman/options/python-dev/greg%40krypto.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From greg.ewing at canterbury.ac.nz Tue May 5 02:27:36 2009 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Tue, 05 May 2009 12:27:36 +1200 Subject: [Python-Dev] Building types programmatically (was: drop unnecessary "context" pointer from PyGetSetDef) In-Reply-To: <49FF501F.9040503@hastings.org> References: <49FEB11B.2040304@hastings.org> <5c6f2a5d0905041300qe500a21vc90b72382883236a@mail.gmail.com> <49FF501F.9040503@hastings.org> Message-ID: <49FF87F8.7060201@canterbury.ac.nz> Larry Hastings wrote: > > Removing tp_reserved would affect everybody, with inscrutable > compiler errors. This would have to be considered in conjunction with the proposed programmatic type-building API, I think. I'd like to see a migration towards something like that, BTW. Recently I had occasion to do some work on a Ruby extension module, and I was struck by how much more pleasant it was to be able to create a class and add a few functions to it using calls, rather than having to wrestle with a huge static struct declaration. While I like the Python language better than Ruby, I think Ruby's extension API is ahead in this particular area. -- Greg From zookog at gmail.com Tue May 5 05:36:50 2009 From: zookog at gmail.com (Zooko O'Whielacronx) Date: Mon, 4 May 2009 21:36:50 -0600 Subject: [Python-Dev] PEP 383 and Tahoe [was: GUI libraries] In-Reply-To: <877i0yilah.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49F965DB.6050601@v.loewis.de> <49FB2596.1090706@v.loewis.de> <51167066-A162-4AAF-B40D-52C1918032D8@fuhm.net> <877i0yilah.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: Thank you for sharing your extensive knowledge of these issues, SJT. On Sun, May 3, 2009 at 3:32 AM, Stephen J. Turnbull wrote: > Zooko O'Whielacronx writes: > > > However, it is moot because Tahoe is not a new system. It is > > currently at v1.4.1, has a strong policy of backwards- > > compatibility, and already has lots of data, lots of users, and > > programmers building on top of it. > > Cool! Thanks! 
Actually yes it is extremely cool that it really does this encryption, erasure-encoding, capability-based access control, and decentralized topology all in a fully functional, stable system. If you're interested in such stuff then you should definitely check it out! > Question: is there a way to negotiate versions, or better yet, > features? For the peer-to-peer protocol there is, but the persistent storage is an inherently one-way communication. A Tahoe client writes down information, and at a later point a Tahoe client, possibly of a different version, reads it. There is no way for the original writer to ask what versions or features the readers may eventually have. But, the writer can write down optional information which will be invisible to readers that don't know to look for it, but adding it into the "metadata" dictionary. For example: http://testgrid.allmydata.org:3567/uri/URI%3ADIR2%3Adjrdkfawoqihigoett4g6auz6a%3Ajx5mplfpwexnoqff7y5e4zjus4lidm76dcuarpct7cckorh2dpgq/?t=json renders the directory contents into json and results in this: "r\u00e9sum\u00e9.html": [ "filenode", { "mutable": false, "verify_uri": "URI:CHK-Verifier:63y4b5bziddi73jc6cmyngyqdq:5p7cxw7ofacblmctmjtgmhi6jq7g5wf77tx6befn2rjsfpedzkia:3:10:8328", "metadata": { "ctime": 1241365319.0695441, "mtime": 1241365319.0695441 }, "ro_uri": "URI:CHK:no2l46woyeri6xmhcrhhomgr5a:5p7cxw7ofacblmctmjtgmhi6jq7g5wf77tx6befn2rjsfpedzkia:3:10:8328", "size": 8328 } ], A new version of Tahoe writing entries like this is constrained to making the primary key (the filename) be a valid unicode string (if it wants older Tahoe clients to be able to read the directory at all). However, it is not constrained about what new keys it may add to the "metadata" dict, which is where we propose to add the "failed_decode" flag and the "original_bytes". > Well, it's a high-dimensional problem. Keeping track of all the > variables is hard. Well put. > That's why something like PEP 383 can be important > to you even though it's only a partial solution; it eliminates one > variable. Would that it were so! The possibility that PEP 383 could help me or other like me is why I am trying so hard to explain what kind of help I need. :-) > > Suppose you have run "tahoe cp -r myfiles/ tahoe:" on a Linux > > system and then you inspect the files in the Tahoe filesystem, > > such as by examining the web interface [1] or by running > > "tahoe ls", either of which you could do either from the same > > machine where you ran "tahoe cp" or from a different machine > > (which could be using any operating system). We have the > > following requirements about what ends up in your Tahoe directory > > after that cp -r. > > Whoa! Slow down! Where's "my" "Tahoe directory"? Do you mean the > directory listing? A copy to whatever system I'm on? The bytes that > the Tahoe host has just loaded into a network card buffer to tell me > about it? The bytes on disk at the Tahoe host? You'll find it a lot > easier to explain things if you adopt a precise, consistent > terminology. Okay here's some more detail. There exists a Tahoe directory, the bytes of which are encrypted, erasure-coded, and spread out over multiple Tahoe servers. (To the servers it is utterly opaque, since it is encrypted with a symmetric encryption key that they don't have.) A Tahoe client has the decryption key and it recovers the cleartext bytes. 
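To make the metadata idea concrete, here is a sketch of how a reader of the JSON rendering above might consume such entries. The "failed_decode" and "original_bytes" keys are the ones proposed in this thread; the base64 representation of the raw bytes and the placeholder child name are assumptions made here purely for illustration.

    # Sketch of a reader for the proposed metadata keys.  Assumes (for the
    # sake of the example) that original_bytes is carried as base64 in the
    # JSON rendering.
    import base64
    import json

    sample = json.loads("""
    {
      "r\\u00e9sum\\u00e9.html": [
        "filenode",
        {"metadata": {"ctime": 1241365319.0695441,
                      "mtime": 1241365319.0695441},
         "size": 8328}
      ],
      "badly_encoded_filename_#1": [
        "filenode",
        {"metadata": {"failed_decode": true,
                      "original_bytes": "culzdW3pLmh0bWw="},
         "size": 8328}
      ]
    }
    """)

    for childname, (_, info) in sample.items():
        meta = info.get("metadata", {})
        if meta.get("failed_decode"):
            raw = base64.b64decode(meta["original_bytes"])
            print("%-28s -> original bytes %r" % (childname, raw))
        else:
            print("%-28s -> decoded normally" % childname)
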
(Note: the internal storage format is not the json encoding shown above -- it is a custom format -- the json format above is what is produced to be exported through the API, and it serves as a useful example for e-mail discussions.) Then for each bytestring childname in the directory it decodes it with utf-8 to get the unicode childname. Does that all make sense? > > Requirement 1 (unicode): Each filename that you see needs to be valid > > unicode > > What does "see" mean? In directory listings? Yes, either with "tahoe ls", with a FUSE plugin, wht the web UI. Remove the trailing "?t=json" from the URL above to see an example. > Under what > circumstances, if any, can what I see be different from what I get? This a good question! In the previous iteration of the Tahoe design, you could sometimes get something from "tahoe cp" which is different from what you saw with "tahoe ls". In the current design -- http://allmydata.org/trac/tahoe/ticket/534#comment:66 , this is no longer the case, because we abandon the requirement to have "round-trip fidelity of bytes". > > Requirement 2 (faithful if unicode): For each filename (byte > > string) in your myfiles directory, > > My local myfiles directory, or my Tahoe myfiles directory? The local one. > > if that bytestring is the valid encoding of some string in your > > stated locale, > > Who stated the locale? How? Are you referring to what > getfilesystemencoding returns? This is a "(unicode) string", right? Yes, and yes. > > Requirement 3 (no file left behind): For each filename (byte > > string) in your myfiles directory, whether or not that byte > > string is the valid encoding of anything in your stated locale, > > then that file will be added into the Tahoe filesystem under > > *some* name (a good candidate would be mojibake, e.g. decode the > > bytes with latin-1, but that is not the only possibility). > > That's not even a possibility, actually. Technically, Latin-1 has a > "hole" from U+0080 to U+009F. You need to add the C1 controls to fill > in that gap. (I don't think it actually matters in practice, > everybody seems to implement ISO-8859/1 as though it contained the > control characters ... except when detecting encodings ... but it pays > to be precise in these things ....) Perhaps windows-1252 would be a better codec for this purpose? However it would be clearer for the purposes of this discussion, and also perhaps for actual users of Tahoe, if instead of decoding with windows-1252 in order to get a mojibake name, Tahoe would simply generate a name like "badly_encoded_filename_#1". Let's run with that. For clarity, assume that the arbitrary unicode filename that Tahoe comes up with is "badly_encoded_filename_#1". This doesn't change anything in this story. In particular it doesn't change the fact that there might already be an entry in the directory which is named "badly_encoded_filename_#1" even though it was *not* a badly encoded filename, but a correctly encoded one. > > Now already we can say that these three requirements mean that > > there can be collisions -- for example a directory could have two > > entries, one of which is not a valid encoding in the locale, and > > whatever unicode string we invent to name it with in order to > > satisfy requirements 3 (no file left behind) and 1 (unicode) > > might happen to be the same as the (correctly-encoded) name of > > the other file. > > This is false with rather high probability, but you need some extra > structure to deal with it. First, claim the Unicode private planes > for Tahoe. 
[snip on long and intriguin instructions to perform unicode magic that I don't understand] Wait, wait. What good would this do? The current plan is that if the filenames collide we increment the number at the end "#$NUMBER", if we are just naming them "badly_encoded_filename_#1", or that we append "~1" if we are naming them by mojibake. And the current plan is that the original bytes are saved in the metadata for future cyborg archaeologists. How would this complex unicode magic that I don't understand improve the current plan? Would it provide filenames that are more meaningful or useful to the users than the "badly_encoded_filename_#1" or the mojibake? > The registry of characters is somewhat unpleasant, but it does allow > you to detect filenames that are the same reliably. There is no server, so to implement such a registry we would probably have to include a copy of the registry inside each (encrypted, erasure-encoded) directory. > > Possible Requirement 4 (faithful bytes if not unicode, a.k.a. > > "round-tripping"): > > PEP 383 gives you this, but you must store the encoding used for each > such file name. Well, at this point this has become an anti-requirement because it causes the filename as displayed when examining the directory to be different from the filename that results when cp'ing the directory. Also I don't see why PEP 383's implementation of this would be better than the previous iteration of the design in which this was accomplished by simply storing the original bytes and then writing them back out again on demand, or the design before that in which this was accomplished by mojibake'ing the bytes (by decoding them with windows-1252) and setting a flag indicating that this has been done. I think I understand now that PEP 383 is better for the case that you can't store extra metadata (such as our failed_decode flag or our original_bytes), but you can ensure that the encoding that will be used later matches the one that was used for decoding now. Neither of these two criteria apply to Tahoe, and I suspect that neither of them apply to most uses other than the entirely local and non-persistent "for x in os.listdir(): open(x)". > > But an even worse problem -- the show-stopper for me -- is that I > > don't want what Tahoe shows when you do "tahoe ls" or view it in a > > web browser to differ from what it writes out when you do > > "tahoe cp -r tahoe: newfiles/". > > But as a requirement, that's incoherent. What you are "seeing" is > Unicode, what it will write out is bytes. In the new plan, we write the unicode filename out using Python's unicode filesystem APIs, so Python will attempt to encode it into the appropriate filesystem encoding (raising UnicodeEncodeError if it won't fit). > That means that if multiple > locales are in use on both the backup and restore systems, and the > nominal system encodings are different, people whose personal default > locales are not the same as the system's will see what they expect on > the backup system (using system ls), mojibake on Tahoe (using tahoe > ls), and *different* mojibake on the restore system (system ls, > again). Let's see... Tahoe is a user-space program and lets Python determine what the appropriate "sys.getfilesystemencoding()" is based on what the user's locale was at Python startup. So I don't think what you wrote above is correct. 
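A simplified model of the decode step being described, assuming the fallback is a numbered placeholder (one of the schemes floated in this thread) rather than mojibake; this is an illustration of the plan, not Tahoe's actual code, and the helper name is made up:

    # Try the local filesystem encoding first; fall back to a numbered
    # placeholder name when the bytes do not decode, bumping the counter if
    # the placeholder would collide with an existing entry.
    import sys

    def decode_child_name(raw, taken, encoding=None):
        encoding = encoding or sys.getfilesystemencoding()
        try:
            return raw.decode(encoding), False
        except UnicodeDecodeError:
            n = 1
            name = "badly_encoded_filename_#%d" % n
            while name in taken:
                n += 1
                name = "badly_encoded_filename_#%d" % n
            return name, True

    taken = set()
    # Encoding pinned to UTF-8 here only to make the demo deterministic.
    for raw in [b"r\xc3\xa9sum\xc3\xa9.html", b"r\xe9sum\xe9.html", b"\xff\xfe"]:
        name, failed = decode_child_name(raw, taken, encoding="utf-8")
        taken.add(name)
        print("%r -> %r%s" % (raw, name, "  (failed_decode)" if failed else ""))
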
I think that in the first transition, from source system to Tahoe, that either the name will be correctly transcoded (i.e., it looks the same to the user as long as the locale they are using to "look" at it, e.g. with "ls" or Nautilus or whatever is the same as the locale that was set when their Python process started up), or else it will be undecodable under their current locale and instead will be replaced with either mojibake or "badly_encoded_filename_#1". Hm, here is a good argument in favor of using mojibake to generate the arbitrary unicode name instead of naming it "badly_encoded_filename_#1": because that's probably what ls and Nautilus will show! Let me try that... Oh, cool, Nautilus and GNU ls both replace invalid chars with U+FFFD (like the 'replace' error handler does in Python's decode()) and append " (invalid encoding)" to the end. That sounds like an even better way to handle it than either mojibake or "badly_encoded_filename_#1", and it also means that it will look the same in Tahoe as it does in GNU ls and Nautilus. Excellent. On the next transition, from Tahoe to system, Tahoe uses the Python unicode API, which will attempt to encode the unicode filename into the local filesystem encoding and raise UnicodeEncodeError if it can't. > > Requirement 5 (no loss of information): I don't want Tahoe to > > destroy information -- every transformation should be (in > > principle) reversible by some future computer-augmented > > archaeologist. ... > UTF-8b would be just as good for storing the original bytestring, as > long as you keep the original encoding. It's actually probably > preferable if PEP 383 can be assumed to be implemented in the > versions of Python you use. It isn't -- Tahoe doesn't run on Python 3. Also Tahoe is increasingly interoperating with tools written in completely different languages. It is much easier for to tell all of those programmers (in my documentation) that in the filename slot is the (normal, valid, standard) unicode, and in the metadata slot there are the bytes than to tell them about utf-8b (which is not even implemented in their tools: JavaScript, JSON, C#, C, and Ruby). I imagine that it would be a deal-killer for many or most of them if I said they couldn't use Tahoe reliably without first implementing utf-8b for their toolsets. > > 1. On Windows or Mac read the filename with the unicode APIs. > > Normalize the string with filename = unicodedata.normalize('NFC', ... > NFD is probably better for fuzzy matching and display on legacy > terminals. I don't know anything about them, other than that Macintosh uses NFD and everything else uses NFC. Should I specify NFD? What are these "legacy terminals" of which you speak? Will NFD make it look better when I cat it to my vt102? (Just kidding -- I don't have one.) > Per the koi8-lucky example, you don't know if it succeeded for the > right reason or the wrong reason. You really should store the > alleged_encoding used in the metadata, always. Right -- got it. > > 2.b. If this decoding fails, then we decode it again with > > bytes.decode('latin-1', 'strict'). Do not normalize it. Store the > > resulting unicode object into the "filename" part, set the > > "failed_decode" flag to True. This is mojibake! > > Not necessarily. Most ISO-8859/X names will fail to decode if the > alleged_encoding is UTF-8, for example, but many (even for X != 1) > will be correctly readable because of the policy of trying to share > code points across Latin-X encodings. 
Certainly ISO-8859/1 (and > much ISO-8859/15) will be correct. Ah. What is the Japanese word for "word with some characters right and other characters mojibake!"? :-) > > Now a question for python-dev people: could utf-8b or PEP 383 be > > useful for requirements like the four requirements listed above? If > > not, what requirements does PEP 383 help with? > > By giving you a standard, invertible way to represent anything that > the OS can throw at you, it helps with all of them. So, it is invertible only if you can assume that the same encoding will be used on the second leg of the trip, right? Which you can do by writing down what encoding was used on this leg of the trip and forcing it to use the same encoding on the other leg. Except that we can't force that to happen on Windows at all as far as I understand, which is a show-stopper right there. But even if we could, this would require us to write down a bit of information and transmit it to the other side and use it to do the encoding. And if we are going to do that, why don't we just transmit the original bytes? Okay, maybe because that would roughly double the amount of data we have to transmit, and maybe we are stingy. But if we are stingy we could instead transmit a single added bit to indicate whether the name is normal or mojibake, and then use windows-1252 to stuff the bytes into the name. One of those options has the advantage of simplicity to the programmer ("There is the unicode, and there are the bytes."), and the other has the advantage of good compression. Both of them have the advantage that nobody involved has to understand and possibly implement a non-standard unicode hack. I'm trying not to be too pushy about this (heaven knows I've been completely wrong about things a dozen times in a row so far in this design process), but as far as I can understand it, PEP 383 can be used only when you can force the same encoding on both sides (the PEP says that encoding "only 'works' if the data get converted back to bytes with the python-escape error handler also"). That happens naturally when both sides are in the same Python process, so PEP 383 naturally looks good in that context. However, if the filenames are going to be stored persistently or transmitted over a network, then it seems simpler, easier, and more portable to use some other method than PEP 383 to handle badly encoded names. > > I'm not sure that it can help if you are going to store the results > > of your os.listdir() persistently or if you are going to transmit > > them over a network. Indeed, using the results that way could lead > > to unpleasant surprises. > > No more than any other system for giving a canonical Unicode spelling > to the results of an OS call. I think PEP 383 yields more surprises than the alternative of decoding with error handler 'replace' and then including the original bytes along with the unicode. During the course of this process I have also considered using two other mechanisms instead of decoding with error handler 'replace' -- mojibake using windows-1252 or a simple placeholder like "badly_encoded_filename_#1". Any of these three seem to be less surprising and similarly functional to PEP 383. I have to admit that they are not as elegant. Utf-8b is a really neat hack, and MvL's generalization of it to all unicode encodings is, too. I'm still being surprised by it after trying to understand it for many days now. 
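The handler discussed here as "utf-8b"/"python-escape" later shipped in Python 3.1 as the 'surrogateescape' error handler; assuming that handler, the round-trip property being weighed, and the way it breaks down once a strict encoder gets involved, look like this:

    # Round trip: undecodable bytes become lone low surrogates and come back
    # out unchanged only if the same handler is used on the way back out.
    raw = b"r\xe9sum\xe9.html"             # latin-1 bytes, not valid UTF-8

    name = raw.decode("utf-8", "surrogateescape")
    print(ascii(name))                      # 'r\udce9sum\udce9.html'

    assert name.encode("utf-8", "surrogateescape") == raw

    # A strict encoder refuses the same string, which is why such strings
    # must not leak into ordinary interchange.
    try:
        name.encode("utf-8")
    except UnicodeEncodeError as exc:
        print("strict encode fails:", exc.reason)
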
For example, what happens if you decode a filename with PEP 383, store that filename somewhere, and then later try to write a file under that name on Windows? If it only 'works' if the data get converted back to bytes with the python-escape error handler, then can you use the python-escape error handler when trying to, say, create a new file on Windows? Regards, Zooko From jmillikin at gmail.com Tue May 5 07:19:36 2009 From: jmillikin at gmail.com (John Millikin) Date: Mon, 4 May 2009 22:19:36 -0700 Subject: [Python-Dev] Undocumented change / bug in Python3's PyMapping_Check Message-ID: <3283f7fe0905042219r23113ca6ud6dd3840d7462f37@mail.gmail.com> In Python 2, PyMapping_Check will return 0 for list objects. In Python 3, it returns 1. Obviously, this makes it rather difficult to differentiate between mappings and other sized iterables. In addition, it differs from the behavior of the ``collections.Mapping`` ABC -- isinstance([], collections.Mapping) returns False. I believe the new behavior is erroneous, but would like to confirm that before filing a bug. The behavior can be seen from a C extension, or if you're lazy, using ctypes: Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) [GCC 4.3.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import ctypes >>> ctypes.CDLL('libpython2.6.so').PyMapping_Check(ctypes.py_object([])) 0 Python 3.0.1+ (r301:69556, Apr 15 2009, 15:59:22) [GCC 4.3.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import ctypes >>> ctypes.CDLL('libpython3.0.so').PyMapping_Check(ctypes.py_object([])) 1 From larry at hastings.org Tue May 5 09:24:38 2009 From: larry at hastings.org (Larry Hastings) Date: Tue, 05 May 2009 00:24:38 -0700 Subject: [Python-Dev] Proposed: drop unnecessary "context" pointer from PyGetSetDef In-Reply-To: <5c6f2a5d0905041318x504b83f5re90cafe5db099c89@mail.gmail.com> References: <49FEB11B.2040304@hastings.org> <5c6f2a5d0905041300qe500a21vc90b72382883236a@mail.gmail.com> <5c6f2a5d0905041318x504b83f5re90cafe5db099c89@mail.gmail.com> Message-ID: <49FFE9B6.4040609@hastings.org> Mark Dickinson wrote: >> This doesn't sound right. The functions in the third party code will get >> compiled with the wrong signature, so they can crash (or behave unexpectedly) >> when called by Python. >> > Yes, of course the signature of the getters and setters changes. Please > ignore me. :-) If they don't use the closure field, then either they won't compile due to type mismatches or they'll work fine. There's a lot of code in CPython that didn't need to be changed for my remove-closure patch; the functions didn't bother taking the "void * closure" that they were going to ignore anyway, and then they cast the function pointer in the PyGetSetDef to make the compiler shut up. Worked fine. And, in nearly all cases, the static PyGetSetDefs omit the closure member, which means C initializes them with a 0. /larry/ From mal at egenix.com Tue May 5 10:40:51 2009 From: mal at egenix.com (M.-A. Lemburg) Date: Tue, 05 May 2009 10:40:51 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <49FDD6DD.6050808@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <49FDD6DD.6050808@v.loewis.de> Message-ID: <49FFFB93.7020105@egenix.com> On 2009-05-03 19:39, Martin v. L?wis wrote: >> If the error handler is supposed to be used for codecs other than utf-8, >> perhaps it should renamed something more generic, e.g. "surrogate-escape"? > > Perhaps. 
However, utf-8b doesn't really have to do anything with utf-8 - > it's an algorithm based on 16-bit or 32-bit code points. If the error handler doesn't have anything to do with UTF-8, then why do you use "utf8" in the name. Please use a more descriptive name for the handler which does not cause confusion with a existing codec. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 05 2009) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2009-06-29: EuroPython 2009, Birmingham, UK 54 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From tjreedy at udel.edu Tue May 5 10:57:03 2009 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 05 May 2009 04:57:03 -0400 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <49FFFB93.7020105@egenix.com> References: <49FD5300.6010906@v.loewis.de> <49FDD6DD.6050808@v.loewis.de> <49FFFB93.7020105@egenix.com> Message-ID: M.-A. Lemburg wrote: > On 2009-05-03 19:39, Martin v. L?wis wrote: >>> If the error handler is supposed to be used for codecs other than utf-8, >>> perhaps it should renamed something more generic, e.g. "surrogate-escape"? >> Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - >> it's an algorithm based on 16-bit or 32-bit code points. > > If the error handler doesn't have anything to do with UTF-8, then why > do you use "utf8" in the name. > > Please use a more descriptive name for the handler which does not cause > confusion with a existing codec. Having already been confused, I agree. From eric at trueblade.com Tue May 5 11:13:58 2009 From: eric at trueblade.com (Eric Smith) Date: Tue, 05 May 2009 05:13:58 -0400 Subject: [Python-Dev] Proposed: add support for UNC paths to all functions in ntpath In-Reply-To: <49FA4064.5000508@gmail.com> References: <49F8B222.7070204@hastings.org> <49F8D9A0.7000104@voidspace.org.uk> <49F8DBCD.6050504@trueblade.com> <49F9FCD0.80208@hastings.org> <49FA4064.5000508@gmail.com> Message-ID: <4A000356.30408@trueblade.com> Mark Hammond wrote: >> Is that enough consensus for it to go in? If so, are there any core >> developers who could help me get it in before the 3.1 feature freeze? >> The patch should be in good shape; it has unit tests and updated >> documentation. > > I've taken the liberty of explicitly CCing Martin just incase he missed > the thread with all the noise regarding PEP383. > > If there are no objections from Martin or anyone else here, please feel > free to assign it to me (and mail if I haven't taken action by the day > before the beta freeze...) Mark: I've reviewed this and it looks okay to me. It passes all the tests on Windows and Linux. But if you could take a look at it before the release tomorrow, I'd appreciate it. I feel good enough about it to check it in if no one else gets to it. Eric. From supreet.sethi at gmail.com Tue May 5 12:41:22 2009 From: supreet.sethi at gmail.com (s|s) Date: Tue, 5 May 2009 16:11:22 +0530 Subject: [Python-Dev] using help function in Py3k Message-ID: Hello, I Ran Python 3.0 for the first time. 
I used help() function and wrote "modules hash". It issues an error. Traceback (most recent call last): File "", line 1, in File "/home/ss/eproj/xapian/INST//lib/python3.0/site.py", line 427, in __call__ return pydoc.help(*args, **kwds) File "/home/ss/eproj/xapian/INST//lib/python3.0/pydoc.py", line 1675, in __call__ self.interact() File "/home/ss/eproj/xapian/INST//lib/python3.0/pydoc.py", line 1693, in interact self.help(request) File "/home/ss/eproj/xapian/INST//lib/python3.0/pydoc.py", line 1711, in help self.listmodules(request.split()[1]) File "/home/ss/eproj/xapian/INST//lib/python3.0/pydoc.py", line 1799, in listmodules apropos(key) File "/home/ss/eproj/xapian/INST//lib/python3.0/pydoc.py", line 1913, in apropos ModuleScanner().run(callback, key, onerror=onerror) File "/home/ss/eproj/xapian/INST//lib/python3.0/pydoc.py", line 1875, in run source = loader.get_source(modname) File "/home/ss/eproj/xapian/INST/lib/python3.0/pkgutil.py", line 293, in get_source self.source = self.file.read() File "/home/ss/eproj/xapian/INST//lib/python3.0/io.py", line 1720, in read decoder = self._decoder or self._get_decoder() File "/home/ss/eproj/xapian/INST//lib/python3.0/io.py", line 1506, in _get_decoder make_decoder = codecs.getincrementaldecoder(self._encoding) File "/home/ss/eproj/xapian/INST//lib/python3.0/codecs.py", line 960, in getincrementaldecoder decoder = lookup(encoding).incrementaldecoder LookupError: unknown encoding: uft-8 The reason for errors is test/ directory which has got tests for python parser are installed in Lib directory. I propose that these files should be installed by default in some other directory. Preferably in /share or /share/doc part of the tree. regards -- ~preet~ From aahz at pythoncraft.com Tue May 5 13:47:18 2009 From: aahz at pythoncraft.com (Aahz) Date: Tue, 5 May 2009 04:47:18 -0700 Subject: [Python-Dev] using help function in Py3k In-Reply-To: References: Message-ID: <20090505114718.GA16437@panix.com> On Tue, May 05, 2009, s|s wrote: > > I Ran Python 3.0 for the first time. I used help() function and wrote > "modules hash". It issues an error. Please file a report on bugs.python.org -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ "It is easier to optimize correct code than to correct optimized code." --Bill Harlan From stephen at xemacs.org Tue May 5 15:09:25 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Tue, 05 May 2009 22:09:25 +0900 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <49FFFB93.7020105@egenix.com> References: <49FD5300.6010906@v.loewis.de> <49FDD6DD.6050808@v.loewis.de> <49FFFB93.7020105@egenix.com> Message-ID: <87eiv3hf22.fsf@uwakimon.sk.tsukuba.ac.jp> M.-A. Lemburg writes: > On 2009-05-03 19:39, Martin v. L?wis wrote: > >> If the error handler is supposed to be used for codecs other than utf-8, > >> perhaps it should renamed something more generic, e.g. "surrogate-escape"? > > > > Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - > > it's an algorithm based on 16-bit or 32-bit code points. I don't understand this phrasing. The algorithm is only applicable to ASCII-compatible octet streams. It results in code points by a simple displacement of octet -> octet + 0xDC00. It cannot be used on (say) UTF-32 to deal with embedded surrogates. Certainly, the computation requires (at least) 16 bit numbers, but the input must be restricted to a stream of 8-bit code points, while the output is 16- or 32-bit code points. 
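The octet -> octet + 0xDC00 displacement can also be written out by hand to see the arithmetic; the sketch below uses a naive longest-prefix decoder as a stand-in for a real codec and is only meant to illustrate the mapping, not to match the PEP's handler byte-for-byte:

    # Each byte that cannot be decoded is shifted into the low surrogates
    # (U+DC00 + byte) and shifted back on encoding.  With an ASCII-compatible
    # encoding only bytes >= 0x80 ever take this path.
    def escape_decode(raw, encoding="utf-8"):
        out = []
        i = 0
        while i < len(raw):
            for j in range(len(raw), i, -1):      # longest decodable prefix
                try:
                    out.append(raw[i:j].decode(encoding))
                except UnicodeDecodeError:
                    continue
                i = j
                break
            else:
                out.append(chr(0xDC00 + raw[i]))  # octet -> octet + 0xDC00
                i += 1
        return "".join(out)

    def escape_encode(text, encoding="utf-8"):
        out = bytearray()
        for ch in text:
            if 0xDC80 <= ord(ch) <= 0xDCFF:       # escaped byte: undo the shift
                out.append(ord(ch) - 0xDC00)
            else:
                out.extend(ch.encode(encoding))
        return bytes(out)

    raw = b"abc\xffdef\xe9"
    print(ascii(escape_decode(raw)))              # 'abc\udcffdef\udce9'
    assert escape_encode(escape_decode(raw)) == raw
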
> Please use a more descriptive name [than "utf-8b"] for the handler > which does not cause confusion with a existing codec. But please don't use "surrogate-escape" or (as in the current PEP) "python-escape"; it's not an escaping (quotation) mechanism. "surrogate-replace", "surrogate-substitute", or "surrogate-translate" would be better names. From daniel at stutzbachenterprises.com Tue May 5 15:43:57 2009 From: daniel at stutzbachenterprises.com (Daniel Stutzbach) Date: Tue, 5 May 2009 08:43:57 -0500 Subject: [Python-Dev] using help function in Py3k In-Reply-To: References: Message-ID: On Tue, May 5, 2009 at 5:41 AM, s|s wrote: > LookupError: unknown encoding: uft-8 > uft-8? Looks like a variation of Issue 4540 (or a duplicate? I can't tell) -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric at trueblade.com Tue May 5 16:08:34 2009 From: eric at trueblade.com (Eric Smith) Date: Tue, 05 May 2009 10:08:34 -0400 Subject: [Python-Dev] [Fwd: [Python-checkins] r72331 - python/branches/py3k/Modules/posixmodule.c] Message-ID: <4A004862.5070605@trueblade.com> Modules/posixmodule.c now compiles for me, but I get a Bus Error in test_lchflags when running test_posixmodule on Mac OS X 10.5. I'll open a release blocker bug on this. -------- Original Message -------- Subject: [Python-checkins] r72331 - python/branches/py3k/Modules/posixmodule.c Date: Tue, 5 May 2009 15:07:31 +0200 (CEST) From: eric.smith To: python-checkins at python.org Author: eric.smith Date: Tue May 5 15:07:30 2009 New Revision: 72331 Log: Added missing semicolon. Modified: python/branches/py3k/Modules/posixmodule.c Modified: python/branches/py3k/Modules/posixmodule.c ============================================================================== --- python/branches/py3k/Modules/posixmodule.c (original) +++ python/branches/py3k/Modules/posixmodule.c Tue May 5 15:07:30 2009 @@ -1928,7 +1928,7 @@ if (!PyArg_ParseTuple(args, "O&i:lchmod", PyUnicode_FSConverter, &opath, &i)) return NULL; - path = bytes2str(opath, 1) + path = bytes2str(opath, 1); Py_BEGIN_ALLOW_THREADS res = lchmod(path, i); Py_END_ALLOW_THREADS _______________________________________________ Python-checkins mailing list Python-checkins at python.org http://mail.python.org/mailman/listinfo/python-checkins From stephen at xemacs.org Tue May 5 16:57:36 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Tue, 05 May 2009 23:57:36 +0900 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <49FD5300.6010906@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> Message-ID: <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> "Martin v. L?wis" writes: > I've updated the PEP accordingly. I have three substantive comments. First, although consequences for Python 3 byte interfaces (ie, "none") are explicitly stated, as far as I can see this PEP could apply to Python 2 as well. I don't think it's intended that way. Either way, I think you should clarify that point. Second, I suggest "surrogate-replace" as the name of the error handler rather than "utf8b". (Elsewhere I've suggested others, but I think this is the best of the bunch.) Third, it is not clear to me why non-decodable ASCII should be an error. There are plenty of low surrogates for the purpose. Is there another technical reason? Stupid or not, Shift-JIS- and Big5-encoded file systems are quite common in Asia still (including non-rewritable media). 
I think surrogate-replacement of ASCII should at least be an option. I don't think "people shouldn't be using non-ASCII-compatible encodings for locale encodings" is a sufficient rationale for a hard error here. I mean, of course they *should* be using UTF-8. Maybe Python 3.1 should just go ahead and error on any other encoding on POSIX platforms? I have a number of nitpicking comments and technical clarifications on the PEP. Rationale is in footnotes. There were also a few typos I noticed. 1. There is no such thing as a "half-surrogate" in Unicode. "Lone surrogate" is clear enough. Or for somewhat fancier English, "isolated surrogate" or "non-syntactic surrogate". To emphasize that Python codecs will only produce them in contexts where a Unicode character or high surrogate (for UTF-16 Python) is syntactically required, "isolated low surrogate" or "isolated trailing surrogate" might be good.[1] 2. The specification should state, and the discussion emphasize, that strings which were produced by surrogate replacement *must not* be used in data interchange with systems that do not specifically accept such strings, and that this is the responsibility of the application.[2] Rather than saying that "dealing with such conflicts is out of scope of this PEP", I would say """Dealing with such conflicts is the responsibility of the application. Since this PEP's mechanism produces valid Unicode where possible, and produces *invalid* code points only via the error handler, one strategy is for the application to validate all other sources of strings as Unicode conforming. There may be other useful application-specific strategies, as well.""" 3. In the discussion, the transition from the example of alternative use of 'python-escape' to discussion of the error handler interface extension is a bit abrupt. I suggest rewriting as: """The extension to the encode error handler interface proposed by this PEP is necessary to implement the 'utf8b' error handler, because there are required byte sequences which cannot be generated from replacement Unicode. However, the encode error handler interface presently requires replacement Unicode to be provided in lieu of the non-encodable Unicode from the source string. Then it promptly encodes that replacement Unicode. In some error handlers, such as the 'utf8b' proposed here, it is also simpler and more efficient for the error handler to provide a pre-encoded replacement byte string, rather than forcing it to calculating Unicode from which the encoder would create the desired bytes.""" Typos (line references are to pep-0383.txt svn r72332): l. 86: "Byte-orientied" -> "Byte-oriented" l. 98, 118, 124, 127, 132, 136: "python-escape" -> "utf8b" l. 130: "provide" -> "provided" l. 134: "calculating" -> "calculate" Footnotes: [1] Unicode 5.0 uses the terms "high-half" and "low-half" at least once, in section 16.6, but the context is such that I take it to refer to "half of the surrogate area". Section 3.8 doesn't use these, instead noting that "leading" and "trailing" are sometimes used instead of "high" and "low". Better to avoid the word "half" in PEP 383, I think. [2] Since this error handler is going to be the default for POSIX I/O, of course people are going to mostly ignore that restriction. The point is, passing such strings to systems that don't expect them is a bug, and the PEP should make it clear that it's the app's bug, not the other system's. 
On the other hand, using those strings in a context of consenting adults (and I do mean double-opt-in here) is perfectly acceptable. I'm specifically thinking of use in the Tahoe protocol discussed by Zooko O'Whielacronx; it may not be usable there for backward compatibility reasons, but "Unicode conformance" is not an issue in principle. This does imply that programs that take advantage of the error handler specified in this PEP are on their own if they accept data from any sources that are not known to be Unicode-conforming. OTOH, as far as I can see if other sources are known to be Unicode conformant, it's reasonably (but not perfectly) safe to combine them with strings from this PEP (and of course use either 'utf8b' or 'strict', as appropriate, when passing data out of Python). From zookog at gmail.com Tue May 5 17:18:29 2009 From: zookog at gmail.com (Zooko O'Whielacronx) Date: Tue, 5 May 2009 09:18:29 -0600 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Tue, May 5, 2009 at 8:57 AM, Stephen J. Turnbull wrote: > > 2. ?The specification should state, and the discussion emphasize, that > ? ?strings which were produced by surrogate replacement *must not* be > ? ?used in data interchange with systems that do not specifically > ? ?accept such strings, and that this is the responsibility of the > ? ?application.[2] That sounds like a useful statement to make. How would an application make sure that they were producing only valid unicode? How about add an option to os.listdir() named "errors" with default value 'utf8b' (or 'surrogate-replace', or whatever the name is)? Then applications which need to produce only valid unicode strings could pass errors=strict, errors=ignore, or errors=replace? (If anyone really wants behavior like Python 3.0 then we could perhaps also add a new one just for os.listdir() named errors=skipfilename.) My most recent plan for Tahoe, as of the letter that I sent last night, is to emulate the behavior of Nautilus and GNU ls by using the 'replace' error handler and (emulating Nautilus) to append " (invalid encoding)" to the end of the string. (screenshot: http://zooko.com/Nautilus_vs_invalid_encoding.png ) So if I could ask os.listdir to return filenames with U+FFFD in place of undecodable characters, then I could subsequently do something like: for f in os.listdir(d, errors='replace'): if u"\ufffd" in f: f += " (invalid encoding)" (On top of that I would have to check for collisions, but that's out of scope.) Regards, Zooko From google at mrabarnett.plus.com Tue May 5 17:25:46 2009 From: google at mrabarnett.plus.com (MRAB) Date: Tue, 05 May 2009 16:25:46 +0100 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4A005A7A.7070501@mrabarnett.plus.com> Stephen J. Turnbull wrote: > "Martin v. L?wis" writes: > > > I've updated the PEP accordingly. > > I have three substantive comments. First, although consequences for > Python 3 byte interfaces (ie, "none") are explicitly stated, as far as > I can see this PEP could apply to Python 2 as well. I don't think > it's intended that way. Either way, I think you should clarify that > point. 
> > Second, I suggest "surrogate-replace" as the name of the error handler > rather than "utf8b". (Elsewhere I've suggested others, but I think > this is the best of the bunch.) > +1 > Third, it is not clear to me why non-decodable ASCII should be an > error. There are plenty of low surrogates for the purpose. Is there > another technical reason? Stupid or not, Shift-JIS- and Big5-encoded > file systems are quite common in Asia still (including non-rewritable > media). I think surrogate-replacement of ASCII should at least be an > option. > > I don't think "people shouldn't be using non-ASCII-compatible > encodings for locale encodings" is a sufficient rationale for a hard > error here. I mean, of course they *should* be using UTF-8. Maybe > Python 3.1 should just go ahead and error on any other encoding on > POSIX platforms? > I don't see why the error handler couldn't in principle be used with encodings other than UTF-8, although in that case all of the low surrogates should be open to use. > I have a number of nitpicking comments and technical clarifications on > the PEP. Rationale is in footnotes. There were also a few typos I > noticed. > > 1. There is no such thing as a "half-surrogate" in Unicode. "Lone > surrogate" is clear enough. Or for somewhat fancier English, > "isolated surrogate" or "non-syntactic surrogate". To emphasize > that Python codecs will only produce them in contexts where a > Unicode character or high surrogate (for UTF-16 Python) is > syntactically required, "isolated low surrogate" or "isolated > trailing surrogate" might be good.[1] > > 2. The specification should state, and the discussion emphasize, that > strings which were produced by surrogate replacement *must not* be > used in data interchange with systems that do not specifically > accept such strings, and that this is the responsibility of the > application.[2] > > Rather than saying that "dealing with such conflicts is out of > scope of this PEP", I would say > > """Dealing with such conflicts is the responsibility of the > application. Since this PEP's mechanism produces valid Unicode > where possible, and produces *invalid* code points only via the > error handler, one strategy is for the application to validate all > other sources of strings as Unicode conforming. There may be > other useful application-specific strategies, as well.""" > > 3. In the discussion, the transition from the example of alternative > use of 'python-escape' to discussion of the error handler > interface extension is a bit abrupt. I suggest rewriting as: > > """The extension to the encode error handler interface proposed by > this PEP is necessary to implement the 'utf8b' error handler, > because there are required byte sequences which cannot be > generated from replacement Unicode. However, the encode error > handler interface presently requires replacement Unicode to be > provided in lieu of the non-encodable Unicode from the source > string. Then it promptly encodes that replacement Unicode. In > some error handlers, such as the 'utf8b' proposed here, it is also > simpler and more efficient for the error handler to provide a > pre-encoded replacement byte string, rather than forcing it to > calculating Unicode from which the encoder would create the > desired bytes.""" > > Typos (line references are to pep-0383.txt svn r72332): > > l. 86: "Byte-orientied" -> "Byte-oriented" > l. 98, 118, 124, 127, 132, 136: "python-escape" -> "utf8b" > l. 130: "provide" -> "provided" > l. 
134: "calculating" -> "calculate" > > > Footnotes: > [1] Unicode 5.0 uses the terms "high-half" and "low-half" at least > once, in section 16.6, but the context is such that I take it to > refer to "half of the surrogate area". Section 3.8 doesn't use > these, instead noting that "leading" and "trailing" are sometimes > used instead of "high" and "low". Better to avoid the word "half" > in PEP 383, I think. > "Leading" and "trailing" simply state the order, not the set ("high" or "low"), so are not good terms to use. > [2] Since this error handler is going to be the default for POSIX I/O, > of course people are going to mostly ignore that restriction. The > point is, passing such strings to systems that don't expect them > is a bug, and the PEP should make it clear that it's the app's > bug, not the other system's. On the other hand, using those > strings in a context of consenting adults (and I do mean > double-opt-in here) is perfectly acceptable. I'm specifically > thinking of use in the Tahoe protocol discussed by Zooko > O'Whielacronx; it may not be usable there for backward > compatibility reasons, but "Unicode conformance" is not an issue > in principle. > > This does imply that programs that take advantage of the error > handler specified in this PEP are on their own if they accept data > from any sources that are not known to be Unicode-conforming. > OTOH, as far as I can see if other sources are known to be Unicode > conformant, it's reasonably (but not perfectly) safe to combine > them with strings from this PEP (and of course use either 'utf8b' > or 'strict', as appropriate, when passing data out of Python). > Should there be a function or method to check for conformance and lone surrogates? From stephen at xemacs.org Tue May 5 18:32:03 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 06 May 2009 01:32:03 +0900 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <8763gfh5oc.fsf@uwakimon.sk.tsukuba.ac.jp> Zooko O'Whielacronx writes: > How would an application make sure that they were producing only > valid unicode? That's very difficult. There are a couple of sources that I can think of, in Python: C modules, chr(), \u literals, and now codecs with the 'utf8b'. There may be others. You'd need to review your own code for all of them very carefully, and you'd have to validate all strings returned by non-validating APIs (which is all of them in Python now, although many of them can probably be trusted, such as codecs not using the 'utf8b' error handler). > How about add an option to os.listdir() named "errors" with default > value 'utf8b' Seems reasonable to me, but Martin's probably thought more carefully about it. I don't think its applicable to your use case, though, because you want to be able to *access* those files as well as display the names to the users, right? You won't be able to access those files if you receive the names already munged by the error handler. From stephen at xemacs.org Tue May 5 19:31:28 2009 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Wed, 06 May 2009 02:31:28 +0900 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A005A7A.7070501@mrabarnett.plus.com> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A005A7A.7070501@mrabarnett.plus.com> Message-ID: <874ovzh2xb.fsf@uwakimon.sk.tsukuba.ac.jp> MRAB writes: > > I don't think "people shouldn't be using non-ASCII-compatible > > encodings for locale encodings" is a sufficient rationale for a hard > > error here. I mean, of course they *should* be using UTF-8. Maybe > > Python 3.1 should just go ahead and error on any other encoding on > > POSIX platforms? > > > I don't see why the error handler couldn't in principle be used with > encodings other than UTF-8, although in that case all of the low > surrogates should be open to use. I should have been more clear here, I guess. The error handler *can*, and in the PEP *will be* by default, used with all "sane" locale encodings on POSIX. It occurs to me that the PEP maybe should say that it is an error to have your POSIX locale set to UTF-16 or something like that. What "sane" means in this context is 1. ASCII NUL is the bytearray terminator, and can't be used as a byte in a file name. This rules out UTF-16, UTF-32, and widechar EUC encodings, as well as some very rare ones. 2. An ASCII character always translates to the Unicode character with the same code (ie, "to itself"). It is not a part of other sequences (control sequences, or a trailing byte). This rules out EBCDIC, ISO-2022-*, Shift JIS, and Big5, among the encodings I'm familiar with. EBCDIC because only by accident will an EBCDIC character map to the same ASCII character with the same code. The ISO-2022-* encodings are out because ASCII characters are used in escape sequences. Shift JIS and Big5 because in those encodings, a high-bit-set octet signals the start of a multibyte sequence, and some of the trailing bytes may be in the ASCII range. What's left? Well, UTF-8, all of the ISO-8859 sets, several national standards (such as the KOI8 family for Cyrillic), IBM and Microsoft "code pages", and the "packed" EUC encodings used for Japanese, Chinese, and Korean. These all have the character that ASCII is ASCII, and all non-ASCII characters are encoded using only high-bit-set octets. In fact, in practice, on Unix these are invariably what you encounter. So what's the problem? Backward compatibility for Microsoft OSes, which not only used to use MBCS national character sets, but "cleverly" packed more characters into the encoding by using ASCII as trailing bytes. Ie, the aforementioned "insane" Shift JIS (which is mandated by the leading Japanese cellphone service provider even today) and Big5 (the leading encoding for Chinese until very recently). These are very commonly found on archival media, and even on USB keys and so on which tend to be FAT-formatted. This doesn't prevent usage of the Unicode APIs, but up to Windows 2000 most Japanese vendors' OEM version of Windows used FAT format and Shift JIS as the file system encoding, and I know of Japanese offices where Windows 98 systems were in use as recently as early 2007. It's the removable media which are the problem, because on Windows you just use the Unicode APIs. But they're not available on Unix, so you need the byte-oriented APIs. Is this a real problem? I don't know, I don't do Windows, I don't do computing with my cellphone, and I don't need to get Japanese (that might be mixed with Russian ones!!) 
filenames off of ancient media or CIFS fileshares using Shift JIS. I guess it's possible that cellphones do everything *except* add filenames to directories in Shift JIS, but the filenames are in UTF-16. OTOH, it seems to me that an *optional* extension to handling error on ASCII is technically feasible and would be nearly trivial to add to the PEP. The biggest cost would be adding the error argument to various functions (as Zooko requested) so that surrogate-replace-extended could be specified if needed. > > Footnotes: > > [1] Unicode 5.0 uses the terms "high-half" and "low-half" at least > > once, in section 16.6, but the context is such that I take it to > > refer to "half of the surrogate area". Section 3.8 doesn't use > > these, instead noting that "leading" and "trailing" are sometimes > > used instead of "high" and "low". Better to avoid the word "half" > > in PEP 383, I think. > > > "Leading" and "trailing" simply state the order, not the set ("high" or > "low"), so are not good terms to use. But it's the order that's important. If you've just finished reading a character, and encounter a trailing surrogate, then it was produced by the 'utf8b' error handler; nothing else in a Python codec can do that. If you've just finished reading a character, are in a UTF-16 Python, and encounter a leading surrogate, then you immediately gobble the following code, which must be a trailing surrogate, and combine them to produce a character. The remaining case is that you encounter a valid character. Anything else is an error, and (assuming no bugs), no Python codec will produce anything else. > > This does imply that programs that take advantage of the error > > handler specified in this PEP are on their own if they accept data > > from any sources that are not known to be Unicode-conforming. > > OTOH, as far as I can see if other sources are known to be Unicode > > conformant, it's reasonably (but not perfectly) safe to combine > > them with strings from this PEP (and of course use either 'utf8b' > > or 'strict', as appropriate, when passing data out of Python). > > > Should there be a function or method to check for conformance and > lone surrogates? string.encode('utf-8',errors=strict) will do for now. From google at mrabarnett.plus.com Tue May 5 19:45:45 2009 From: google at mrabarnett.plus.com (MRAB) Date: Tue, 05 May 2009 18:45:45 +0100 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <874ovzh2xb.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A005A7A.7070501@mrabarnett.plus.com> <874ovzh2xb.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4A007B49.5000001@mrabarnett.plus.com> Stephen J. Turnbull wrote: > MRAB writes: > > > > I don't think "people shouldn't be using non-ASCII-compatible > > > encodings for locale encodings" is a sufficient rationale for a hard > > > error here. I mean, of course they *should* be using UTF-8. Maybe > > > Python 3.1 should just go ahead and error on any other encoding on > > > POSIX platforms? > > > > > I don't see why the error handler couldn't in principle be used with > > encodings other than UTF-8, although in that case all of the low > > surrogates should be open to use. > > I should have been more clear here, I guess. The error handler *can*, > and in the PEP *will be* by default, used with all "sane" locale > encodings on POSIX. 
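Picking up MRAB's question above about checking for conformance and lone surrogates: Stephen's str.encode('utf-8', errors='strict') answer can be wrapped into a small helper. A sketch, assuming Python 3.1 or later, where the strict UTF-8 codec refuses lone surrogates (the helper name is made up):

    def is_unicode_conforming(s):
        # Strict UTF-8 encoding fails on lone surrogates, so it doubles as
        # a cheap conformance check for strings produced by the new handler.
        try:
            s.encode("utf-8", errors="strict")
        except UnicodeEncodeError:
            return False
        return True

    assert is_unicode_conforming("abc")
    assert not is_unicode_conforming("abc\udc80")    # contains an escaped byte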
> > It occurs to me that the PEP maybe should say that it is an error > to have your POSIX locale set to UTF-16 or something like that. > > What "sane" means in this context is > > 1. ASCII NUL is the bytearray terminator, and can't be used as a byte > in a file name. This rules out UTF-16, UTF-32, and widechar EUC > encodings, as well as some very rare ones. > [snip] It might be slightly OT, but sometimes strict UTF-8 encoding is violated by encoding U+0000 using 2 bytes (0xC0 0x80) so that 0x00 can be used as a terminator. I think I read that Microsoft sometimes does this. From stephen at xemacs.org Tue May 5 20:09:54 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 06 May 2009 03:09:54 +0900 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A007B49.5000001@mrabarnett.plus.com> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A005A7A.7070501@mrabarnett.plus.com> <874ovzh2xb.fsf@uwakimon.sk.tsukuba.ac.jp> <4A007B49.5000001@mrabarnett.plus.com> Message-ID: <87y6tbfmkt.fsf@uwakimon.sk.tsukuba.ac.jp> MRAB writes: > [snip] > It might be slightly OT, but sometimes strict UTF-8 encoding is violated > by encoding U+0000 using 2 bytes (0xC0 0x80) so that 0x00 can be used as > a terminator. I think I read that Microsoft sometimes does this. Nice hack! as long as you don't let it escape. But if 'strict' errors on this, then PEP 383 'utf8b' will do the right thing, I think. From l.mastrodomenico at gmail.com Tue May 5 20:16:03 2009 From: l.mastrodomenico at gmail.com (Lino Mastrodomenico) Date: Tue, 5 May 2009 20:16:03 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: 2009/5/5 Stephen J. Turnbull : > Third, it is not clear to me why non-decodable ASCII should be an > error. The PEP originally allowed the conversion to U+DCxx of bytes below 128 that cannot be decoded by the encoding used, but this creates potential security problems. See: -- Lino Mastrodomenico From martin at v.loewis.de Tue May 5 22:46:26 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 05 May 2009 22:46:26 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <87eiv3hf22.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FD5300.6010906@v.loewis.de> <49FDD6DD.6050808@v.loewis.de> <49FFFB93.7020105@egenix.com> <87eiv3hf22.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4A00A5A2.2050400@v.loewis.de> > > > Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - > > > it's an algorithm based on 16-bit or 32-bit code points. > > I don't understand this phrasing. The algorithm is only applicable to > ASCII-compatible octet streams. It results in code points by a simple > displacement of octet -> octet + 0xDC00. It cannot be used on (say) > UTF-32 to deal with embedded surrogates. > > Certainly, the computation requires (at least) 16 bit numbers, but the > input must be restricted to a stream of 8-bit code points, while the > output is 16- or 32-bit code points. Right - the algorithm maps between bytes and 16/32-bit code units. It works, in particular, for UTF-8, and was originally proposed to apply to UTF-8 - but it can work in any other place that converts bytes to 16/32-bit code units as well. 
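The byte/code-unit mapping Martin describes is easiest to see as a round trip. A short sketch using 'surrogateescape', the name under which the handler eventually shipped in Python 3.1 (the PEP draft under discussion still calls it 'utf8b'); MRAB's overlong-NUL bytes are included to show they are escaped rather than smuggled through as U+0000:

    raw = b"caf\xe9"                                    # not valid UTF-8
    s = raw.decode("utf-8", "surrogateescape")
    assert s == "caf\udce9"                             # 0xE9 -> U+DCE9
    assert s.encode("utf-8", "surrogateescape") == raw  # original bytes restored

    # MRAB's 0xC0 0x80 trick: the handler escapes both bytes instead of
    # letting them decode to a NUL character.
    assert b"\xc0\x80".decode("utf-8", "surrogateescape") == "\udcc0\udc80"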
Regards, Martin From martin at v.loewis.de Tue May 5 23:01:49 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 05 May 2009 23:01:49 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4A00A93D.3030204@v.loewis.de> > I have three substantive comments. First, although consequences for > Python 3 byte interfaces (ie, "none") are explicitly stated, as far as > I can see this PEP could apply to Python 2 as well. I don't think > it's intended that way. Either way, I think you should clarify that > point. Done: the Python-Version header already clarifies that point. > Second, I suggest "surrogate-replace" as the name of the error handler > rather than "utf8b". I think this is bike-shedding. > Third, it is not clear to me why non-decodable ASCII should be an > error. There are plenty of low surrogates for the purpose. Is there > another technical reason? Stupid or not, Shift-JIS- and Big5-encoded > file systems are quite common in Asia still (including non-rewritable > media). I think surrogate-replacement of ASCII should at least be an > option. It's a security risk. If U+DCXX would map to \xXX, then somebody could embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets sanitized, nobody would expect that this will actually access ../ > 1. There is no such thing as a "half-surrogate" in Unicode. "Lone > surrogate" is clear enough. Or for somewhat fancier English, > "isolated surrogate" or "non-syntactic surrogate". To emphasize > that Python codecs will only produce them in contexts where a > Unicode character or high surrogate (for UTF-16 Python) is > syntactically required, "isolated low surrogate" or "isolated > trailing surrogate" might be good.[1] Fixed. I removed the world "half" everywhere. It really doesn't mean anything to me (it could have been called sunnygate instead, making no difference). I tried to understand "surrogate", and it was explained to me that "surrogate" is something that stands for something - but then I would argue that the two subsequence codes form a surrogate - they stand for something else. The individual surrogate code (in Unicode terminology) doesn't stand for anything. So don't you agree that it is the Unicode terminology that is in error, not the PEP? > 2. The specification should state, and the discussion emphasize, that > strings which were produced by surrogate replacement *must not* be > used in data interchange with systems that do not specifically > accept such strings, and that this is the responsibility of the > application.[2] No. The specification puts no requirements on applications whatsoever. So if you propose to use MUST NOT in the RFC 2119 sense, I strongly disagree. Applications that desire mojibake are free to produce it; we are consenting adults; and all that. > 3. In the discussion, the transition from the example of alternative > use of 'python-escape' to discussion of the error handler > interface extension is a bit abrupt. I suggest rewriting as: > > """The extension to the encode error handler interface proposed by > this PEP is necessary to implement the 'utf8b' error handler, > because there are required byte sequences which cannot be > generated from replacement Unicode. 
However, the encode error > handler interface presently requires replacement Unicode to be > provided in lieu of the non-encodable Unicode from the source > string. Then it promptly encodes that replacement Unicode. In > some error handlers, such as the 'utf8b' proposed here, it is also > simpler and more efficient for the error handler to provide a > pre-encoded replacement byte string, rather than forcing it to > calculating Unicode from which the encoder would create the > desired bytes.""" Unfortunately, I failed to understand where you want this text to go. What paragraphs should I remove, or (if none), after which paragraph should I insert this text? Regards, Martin From martin at v.loewis.de Tue May 5 23:44:25 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 05 May 2009 23:44:25 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <874ovzh2xb.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A005A7A.7070501@mrabarnett.plus.com> <874ovzh2xb.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4A00B339.5050305@v.loewis.de> > It occurs to me that the PEP maybe should say that it is an error > to have your POSIX locale set to UTF-16 or something like that. No. It is *impossible* to have UTF-16 as the locale character set, not an error. Your statement is like saying "it is an error to breathe in the vacuum". In any case, the discussion says # Encodings that are not compatible with ASCII are not supported by # this specification; bytes in the ASCII range that fail to decode # will cause an exception. It is widely agreed that such encodings # should not be used as locale charsets. Regards, Martin From mal at egenix.com Wed May 6 02:26:31 2009 From: mal at egenix.com (M.-A. Lemburg) Date: Wed, 06 May 2009 02:26:31 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A00A93D.3030204@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> Message-ID: <4A00D937.6080403@egenix.com> Martin v. L?wis wrote: >> I have three substantive comments. First, although consequences for >> Python 3 byte interfaces (ie, "none") are explicitly stated, as far as >> I can see this PEP could apply to Python 2 as well. I don't think >> it's intended that way. Either way, I think you should clarify that >> point. > > Done: the Python-Version header already clarifies that point. > >> Second, I suggest "surrogate-replace" as the name of the error handler >> rather than "utf8b". > > I think this is bike-shedding. The name "utf8b" suggested in the PEP is not in line with the codec design and causes confusion with an existing codec of a similar name. Error handlers and codecs are two different things, so the namespaces need to be clearly separate. Please change the name of the error handler to a different name that does not resemble or cause confusion with a codec name and fits the scheme of error handler names we already have in place in Python for replacing error handlers, i.e. "XYZreplace". Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 06 2009) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... 
http://python.egenix.com/ ________________________________________________________________________ 2009-06-29: EuroPython 2009, Birmingham, UK 53 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From stephen at xemacs.org Wed May 6 07:10:41 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 06 May 2009 14:10:41 +0900 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A00B339.5050305@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A005A7A.7070501@mrabarnett.plus.com> <874ovzh2xb.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00B339.5050305@v.loewis.de> Message-ID: <87tz3yg6jy.fsf@uwakimon.sk.tsukuba.ac.jp> "Martin v. L?wis" writes: > > It occurs to me that the PEP maybe should say that it is an error > > to have your POSIX locale set to UTF-16 or something like that. > > No. It is *impossible* to have UTF-16 as the locale character set, > not an error. Your statement is like saying "it is an error to > breathe in the vacuum". I realize this is not useful, so maybe you don't need to mention it. However, it certainly is possible to set LANG with an absurd, or merely dangerous, encoding. > In any case, the discussion says > > # Encodings that are not compatible with ASCII are not supported by > # this specification; bytes in the ASCII range that fail to decode > # will cause an exception. It is widely agreed that such encodings > # should not be used as locale charsets. Which is your excuse for not supporting Shift JIS fully. It doesn't stop people from setting LC_ALL=ja_JP.shift_jis, or using Shift JIS as the default encoding for certain media. From stephen at xemacs.org Wed May 6 07:35:30 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 06 May 2009 14:35:30 +0900 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87skjig5el.fsf@uwakimon.sk.tsukuba.ac.jp> Lino Mastrodomenico writes: > 2009/5/5 Stephen J. Turnbull : > > Third, it is not clear to me why non-decodable ASCII should be an > > error. > > The PEP originally allowed the conversion to U+DCxx of bytes below 128 > that cannot be decoded by the encoding used, but this creates > potential security problems. > > See: Yeah, yeah, this is the same old same old from PEP 3131. Anything that handles the various attacks based on ASCII-alike characters should at least rule out invalid Unicode, too! And where is this U+DC2F supposed to be coming from, anyway? The user's *local* environment or the user's *local* filesystem! Codecs not using 'utf8b' can't produce it, so the only other cases are chr() and \u literals in the *local* process, or an already broken module in your code. I really can't imagine that any sane programmer these days would be using 'utf8b' on bytes received from the Internet! Of course I can't prove that there's no vector for an exploit here (in fact, I'm sure there is one with sufficiently careless handling of input), but I think "consenting adults" covers the Shift JIS use case. Make it an option, but it should be explicitly part of the PEP. From stephen at xemacs.org Wed May 6 08:06:07 2009 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Wed, 06 May 2009 15:06:07 +0900 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A00A93D.3030204@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> Message-ID: <87r5z2g3zk.fsf@uwakimon.sk.tsukuba.ac.jp> "Martin v. L?wis" writes: > Done: the Python-Version header already clarifies that point. Ah, OK. I wish my day job required reading more PEPs so I'd be more familiar with these formalities. :-) > > Second, I suggest "surrogate-replace" as the name of the error handler > > rather than "utf8b". > > I think this is bike-shedding. I don't personally care (I already was aware of UTF-8B), but there are plenty of others who do. I think that's a good name to make Marc-Andre and Terry happier. You have to fix the existing uses of the obsolete "python-escape", anyway. > It's a security risk. If U+DCXX would map to \xXX, then somebody could > embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets > sanitized, nobody would expect that this will actually access ../ The odds that anybody will actually take notice of U+002E U+002E U+002F in a string are sufficiently small that any number of exploits have already been based on it. I agree that there is some additional risk from this if people make the check for "../" before they prepend "\ucd2e\udc2e\udc2f", but I think that risk is very small compared to the pain of having a error handler whose raison d'etre is to not raise exceptions go ahead and raise them anyway. See also my reply to Lino Mastrodomenico. Again, an option is good enough for my purposes as long as interfaces for os.listdir() and the like support setting the error handler (cf. Zooko's proposal), but I think the option should be available. > I tried to understand "surrogate", and it was explained to me that > "surrogate" is something that stands for something - but then I > would argue that the two subsequence codes form a surrogate - they > stand for something else. The individual surrogate code (in Unicode > terminology) doesn't stand for anything. So don't you agree that > it is the Unicode terminology that is in error, not the PEP? Plausibly so. Keep making comments like that and nobody will ever let you off the hook for being a non-native speaker! However, "surrogate" in English is typically used in situation that are too complex to be covered by simply "substitution." I've always read "surrogate" as "alternative form of encoding", and "surrogate code point" as "code point in that alternative form of encoding". Where it's an alternative to code-point-is-scalar-value. I think probably the authors of the terminology just made the best of a bad situation, I can't think of a better single word for this. > No. The specification puts no requirements on applications whatsoever. > So if you propose to use MUST NOT in the RFC 2119 sense, I strongly > disagree. I do propose that. But you're writing the PEP, so this battle will have to be deferred. Eventually Python will have to take a stand on Unicode conformance, but it's not urgent yet. > > 3. In the discussion, the transition from the example of alternative > > use of 'python-escape' to discussion of the error handler > > interface extension is a bit abrupt. 
I suggest rewriting as: > > > > """The extension to the encode error handler interface proposed by > > this PEP is necessary to implement the 'utf8b' error handler, > > because there are required byte sequences which cannot be > > generated from replacement Unicode. However, the encode error > > handler interface presently requires replacement Unicode to be > > provided in lieu of the non-encodable Unicode from the source > > string. Then it promptly encodes that replacement Unicode. In > > some error handlers, such as the 'utf8b' proposed here, it is also > > simpler and more efficient for the error handler to provide a > > pre-encoded replacement byte string, rather than forcing it to > > calculating Unicode from which the encoder would create the > > desired bytes.""" > > Unfortunately, I failed to understand where you want this text to > go. What paragraphs should I remove, or (if none), after which > paragraph should I insert this text? Sorry! I suggest substituting the paragraph above for the paragraph which begins "The encode error handler interface presentlyrequires..." at line 129. I think I forgot to do this before: "I hereby dedicate all text I suggest for inclusion in the PEP to the public domain." From martin at v.loewis.de Wed May 6 09:31:00 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 06 May 2009 09:31:00 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A00D937.6080403@egenix.com> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> Message-ID: <4A013CB4.9010204@v.loewis.de> > The name "utf8b" suggested in the PEP is not in line with the codec > design Where is that design documented, and how exactly violates the name the design (chapter and verse, please). > Error handlers and codecs are two different things, so the namespaces > need to be clearly separate. They *are* separate naemspaces; that's guaranteed by the implementation. Regards, Martin From martin at v.loewis.de Wed May 6 09:36:01 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 06 May 2009 09:36:01 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <87tz3yg6jy.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A005A7A.7070501@mrabarnett.plus.com> <874ovzh2xb.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00B339.5050305@v.loewis.de> <87tz3yg6jy.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4A013DE1.5000401@v.loewis.de> Stephen J. Turnbull wrote: > "Martin v. L?wis" writes: > > > It occurs to me that the PEP maybe should say that it is an error > > > to have your POSIX locale set to UTF-16 or something like that. > > > > No. It is *impossible* to have UTF-16 as the locale character set, > > not an error. Your statement is like saying "it is an error to > > breathe in the vacuum". > > I realize this is not useful, so maybe you don't need to mention it. > However, it certainly is possible to set LANG with an absurd, or > merely dangerous, encoding. How so? The C library will filter it out. > > In any case, the discussion says > > > > # Encodings that are not compatible with ASCII are not supported by > > # this specification; bytes in the ASCII range that fail to decode > > # will cause an exception. It is widely agreed that such encodings > > # should not be used as locale charsets. 
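The restriction quoted above has a simple observable consequence with a UTF-8 file system encoding: every byte below 128 decodes to itself, and only bytes 0x80-0xFF are ever escaped, so the handler's output stays within U+DC80..U+DCFF. A sketch, again using the shipped name 'surrogateescape':

    for b in range(256):
        ch = bytes([b]).decode("utf-8", "surrogateescape")
        if b < 0x80:
            assert ch == chr(b)               # ASCII always decodes to itself
        else:
            assert ch == chr(0xDC00 + b)      # escaped into U+DC80..U+DCFF

This is what keeps code points such as U+DC2E and U+DC2F out of reach of the decoder, which is the basis of the security argument above.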
> > Which is your excuse for not supporting Shift JIS fully. It doesn't > stop people from setting LC_ALL=ja_JP.shift_jis, Well, it *does* stop them from doing so if their systems don't support the locale setting. In any case, if they do this, PEP 383 will not support them. > or using Shift JIS as the default encoding for certain media. I fail to see how this could ever matter. If, by "media", you mean things like removable disks, and the file name encoding used on them, it's fairly irrelevant for the PEP, since Python won't start using Shift JIS as its file system encoding just because that's the encoding used on the disk. Regards, Martin From martin at v.loewis.de Wed May 6 09:53:33 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 06 May 2009 09:53:33 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <87r5z2g3zk.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <87r5z2g3zk.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4A0141FD.2050307@v.loewis.de> > > > Second, I suggest "surrogate-replace" as the name of the error handler > > > rather than "utf8b". > > > > I think this is bike-shedding. > > I don't personally care (I already was aware of UTF-8B), but there are > plenty of others who do. I think it is a fairly bad name, because it is easy to confuse it with the "surrogates" error handler (unless you suggest to rename that also). > You have to fix the existing uses of > the obsolete "python-escape", anyway. Indeed - but only in the PEP. In the implementation, it's already utf8b throughout. Now it is also in the PEP; thanks for pointing that out. > > It's a security risk. If U+DCXX would map to \xXX, then somebody could > > embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets > > sanitized, nobody would expect that this will actually access ../ > > The odds that anybody will actually take notice of U+002E U+002E > U+002F in a string are sufficiently small that any number of exploits > have already been based on it. I agree that there is some additional > risk from this if people make the check for "../" before they prepend > "\ucd2e\udc2e\udc2f", but I think that risk is very small compared to > the pain of having a error handler whose raison d'etre is to not raise > exceptions go ahead and raise them anyway. The problem is that functions like normpath will recognize ../, and that applications rely on them for file name sanitation. If they could be tricked into writing outside of their target folders, this would be a huge security risk. OTOH, I don't care breaking applications on misconfigured systems. People using SJIS as their locale encodings have bigger problems than Python raising exceptions. > See also my reply to Lino Mastrodomenico. URL? > But you're writing the PEP, so this battle will have to be deferred. > Eventually Python will have to take a stand on Unicode conformance, > but it's not urgent yet. I think it's always applications that are conforming or not, rather than libraries. Libraries should allow to write conforming applications. They may refuse to write certain non-conforming applications (although users then replace the library with one that does allow them to do what they want). Libraries can never enforce that applications conform to some standard. > Sorry! 
I suggest substituting the paragraph above for the paragraph > which begins "The encode error handler interface presentlyrequires..." > at line 129. Ah, ok. This was Glen Linderman's text before - now it's yours :-) > I think I forgot to do this before: "I hereby dedicate all text > I suggest for inclusion in the PEP to the public domain." :-) Martin From martin at v.loewis.de Wed May 6 10:03:47 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 06 May 2009 10:03:47 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <87skjig5el.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <87skjig5el.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4A014463.4070109@v.loewis.de> > Yeah, yeah, this is the same old same old from PEP 3131. Anything > that handles the various attacks based on ASCII-alike characters > should at least rule out invalid Unicode, too! > > And where is this U+DC2F supposed to be coming from, anyway? The > user's *local* environment or the user's *local* filesystem! Why is that not a threat? Suppose you have a setuid application, and you pass some string on the command line that decodes to /../. Then the setuid application will be tricked into modifying files it didn't mean to modify. Likewise, it might come from a relational database. Use a relational database that supports unicode code units, or lone surrogates through utf-8, and fill in some bogus data. Then have the Python application (running as root) read it. > Of course I can't prove that there's no vector for an exploit here (in > fact, I'm sure there is one with sufficiently careless handling of > input), but I think "consenting adults" covers the Shift JIS use case. > Make it an option, but it should be explicitly part of the PEP. Nothing is lost at the moment. If users complain, we can still think of ways to enhance the experience. In any case, Python 3.1b1 may get released today, so it's way too late for new features in the PEP. They can wait for Python 3.2. Regards, Martin From ziade.tarek at gmail.com Wed May 6 11:01:14 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Wed, 6 May 2009 11:01:14 +0200 Subject: [Python-Dev] Help on issue 5941 Message-ID: <94bdd2610905060201s2590144dp386d33773338d923@mail.gmail.com> Hello, I need some help on http://bugs.python.org/issue5941 The bug is quite simple: the Distutils unixcompiler used to set the archiver command to "ar -rc". For quite a while now, this behavior has changed in order to be able to customize the compiler behavior from the environment. That introduced a regression because the mechanism in Distutils that looks for the AR variable in the environment also looks into the Makefile of Python. (in the Makefile then is os.environ) And as a matter of fact, AR is set to "ar" in there, so the -cr option is not set anymore. So my question is : should I make a change into the Makefile by adding for example a variable called AR_OPTIONS then build the ar command with AR + AR_OPTIONS *or* that doesn't make sense and I just need to change the behavior so it doesn't look for AR into the Makefile. (just in os.environ) Thanks Tarek -- Tarek Ziad? 
| http://ziade.org From solipsis at pitrou.net Wed May 6 11:17:43 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 6 May 2009 09:17:43 +0000 (UTC) Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <87r5z2g3zk.fsf@uwakimon.sk.tsukuba.ac.jp> <4A0141FD.2050307@v.loewis.de> Message-ID: Martin v. L?wis v.loewis.de> writes: > > > I don't personally care (I already was aware of UTF-8B), but there are > > plenty of others who do. > > I think it is a fairly bad name, because it is easy to confuse it with > the "surrogates" error handler (unless you suggest to rename that also). I didn't bother to say it at the time, but I think "surrogates" is a pretty bad name. It should be more indicative of what it does, e.g. "surrogates-pass", or "surrogates-accept". > > > It's a security risk. If U+DCXX would map to \xXX, then somebody could > > > embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets > > > sanitized, nobody would expect that this will actually access ../ Agreed this is an annoying security breach. The whole point of the PEP is that application developers do not have to care about filename encoding issues, which is defeated is they have to check for strange (illegal) combinations of characters. By the way, what are the ASCII characters that are not suppported by Shift-JIS? Not many I suppose? (if I read the Wikipedia entry correctly, it's only the backslash and the tilde). Regards Antoine. From stephen at xemacs.org Wed May 6 11:39:02 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 06 May 2009 18:39:02 +0900 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A013DE1.5000401@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A005A7A.7070501@mrabarnett.plus.com> <874ovzh2xb.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00B339.5050305@v.loewis.de> <87tz3yg6jy.fsf@uwakimon.sk.tsukuba.ac.jp> <4A013DE1.5000401@v.loewis.de> Message-ID: <87my9qfu4p.fsf@uwakimon.sk.tsukuba.ac.jp> "Martin v. L?wis" writes: > I fail to see how this could ever matter. If, by "media", you mean > things like removable disks, and the file name encoding used on them, > it's fairly irrelevant for the PEP, since Python won't start using > Shift JIS as its file system encoding just because that's the encoding > used on the disk. I'm sorry for the lack of clarity of my posts, but somehow you're completely missing the point. The point is precisely that Python *won't* use Shift JIS as the file system encoding (if it did there would be no problem with reading Shift JIS), but the people who created the media *did*. Now, with Python's file system encoding == UTF-8 or any packed EUC, and more than a handful of Shift JIS or Big5 characters in file names, one is *almost certain* to encounter ASCII as the second byte of a multibyte sequence. PEP 383 can't handle this, but it is sure to be the most common use case for PEP 383 in East Asia. From mal at egenix.com Wed May 6 11:53:12 2009 From: mal at egenix.com (M.-A. 
Lemburg) Date: Wed, 06 May 2009 11:53:12 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A013CB4.9010204@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> Message-ID: <4A015E08.5000203@egenix.com> Martin v. L?wis wrote: >> The name "utf8b" suggested in the PEP is not in line with the codec >> design > > Where is that design documented, and how exactly violates the name > the design (chapter and verse, please). Martin, I designed the whole Python codec machinery, so even if this is not explicitly written down somewhere, you can take my word for it. I don't want users to be confused by such an error handler name, so please change it ! Here's a list of the currently available error handlers (taken from codecs.py): The .encode()/.decode() methods may use different error handling schemes by providing the errors argument. These string values are predefined: 'strict' - raise a ValueError error (or a subclass) 'ignore' - ignore the character and continue with the next 'replace' - replace with a suitable replacement character; Python will use the official U+FFFD REPLACEMENT CHARACTER for the builtin Unicode codecs on decoding and '?' on encoding. 'xmlcharrefreplace' - Replace with the appropriate XML character reference (only for encoding). 'backslashreplace' - Replace with backslashed escape sequences (only for encoding). The set of allowed values can be extended via register_error. >> Error handlers and codecs are two different things, so the namespaces >> need to be clearly separate. > > They *are* separate naemspaces; that's guaranteed by the implementation. In the implementation, yes, but not in the head of a typical user: the 'utf8b' looks more like a codec name than an error handler name. I want to avoid any such confusion with Python codecs and don't understand why you are making a problem out of this. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 06 2009) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2009-06-29: EuroPython 2009, Birmingham, UK 53 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From google at mrabarnett.plus.com Wed May 6 12:08:45 2009 From: google at mrabarnett.plus.com (MRAB) Date: Wed, 06 May 2009 11:08:45 +0100 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A015E08.5000203@egenix.com> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> Message-ID: <4A0161AD.6000605@mrabarnett.plus.com> M.-A. Lemburg wrote: > Martin v. L?wis wrote: >>> The name "utf8b" suggested in the PEP is not in line with the codec >>> design >> Where is that design documented, and how exactly violates the name >> the design (chapter and verse, please). 
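The separation being argued about here is visible in the codecs module itself: codecs and error handlers are registered and looked up through different functions, so a handler name cannot shadow a codec name. A small sketch using register_error from the list Marc-Andre quotes (the handler name 'underscore' is made up):

    import codecs

    codecs.lookup("utf-8")              # codec registry
    codecs.lookup_error("replace")      # error-handler registry, separate from codecs

    def underscore(exc):
        # toy handler: replace each unencodable character with '_'
        return ("_" * (exc.end - exc.start), exc.end)

    codecs.register_error("underscore", underscore)
    assert "abc\u00e9".encode("ascii", "underscore") == b"abc_"

Whether the user-visible names are confusable is of course the point in dispute, not whether the registries are distinct.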
> > Martin, I designed the whole Python codec machinery, so even if > this is not explicitly written down somewhere, you can take my > word for it. > > I don't want users to be confused by such an error handler > name, so please change it ! > > Here's a list of the currently available error handlers (taken from > codecs.py): > > The .encode()/.decode() methods may use different error > handling schemes by providing the errors argument. These > string values are predefined: > > 'strict' - raise a ValueError error (or a subclass) > 'ignore' - ignore the character and continue with the next > 'replace' - replace with a suitable replacement character; > Python will use the official U+FFFD REPLACEMENT > CHARACTER for the builtin Unicode codecs on > decoding and '?' on encoding. > 'xmlcharrefreplace' - Replace with the appropriate XML > character reference (only for encoding). > 'backslashreplace' - Replace with backslashed escape sequences > (only for encoding). > > The set of allowed values can be extended via register_error. > >>> Error handlers and codecs are two different things, so the namespaces >>> need to be clearly separate. >> They *are* separate naemspaces; that's guaranteed by the implementation. > > In the implementation, yes, but not in the head of a typical user: > the 'utf8b' looks more like a codec name than an error handler > name. > Judging by the existing names, I think that 'surrogate' would be reasonable. It already contains the meaning of substitute, it's not too long, and the codes which act as replacements are already called surrogates. > I want to avoid any such confusion with Python codecs and don't > understand why you are making a problem out of this. > From solipsis at pitrou.net Wed May 6 12:11:56 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 6 May 2009 10:11:56 +0000 (UTC) Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> Message-ID: MRAB mrabarnett.plus.com> writes: > > Judging by the existing names, I think that 'surrogate' would be > reasonable. It already contains the meaning of substitute, Only if you are a native English-speaker I suppose... For me it's just a technical term denoting a certain class of unicode code points (I'm not sure of the latter terminology ;-)). Regards Antoine. From l.mastrodomenico at gmail.com Wed May 6 12:22:50 2009 From: l.mastrodomenico at gmail.com (Lino Mastrodomenico) Date: Wed, 6 May 2009 12:22:50 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <87r5z2g3zk.fsf@uwakimon.sk.tsukuba.ac.jp> <4A0141FD.2050307@v.loewis.de> Message-ID: 2009/5/6 Antoine Pitrou : > By the way, what are the ASCII characters that are not suppported by Shift-JIS? > Not many I suppose? (if I read the Wikipedia entry correctly, it's only the > backslash and the tilde). The biggest problem with Shift-JIS is that a perfectly valid unicode character above 127 can be encoded to a byte sequence that includes bytes in range(128). E.g. the character ? (a.k.a. '\u639b') when encoded with Shift-JIS becomes the two bytes sequence b'\x8a|'. 
Notice that the second byte is 124, which on POSIX is usually interpreted as the pipe character and can have security implications. It's a know problem with Shift-JIS and was fixed in UTF-8. -- Lino Mastrodomenico From regebro at gmail.com Wed May 6 12:28:22 2009 From: regebro at gmail.com (Lennart Regebro) Date: Wed, 6 May 2009 12:28:22 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A013CB4.9010204@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> Message-ID: <319e029f0905060328s5f3446a1j92c52d7d6cc140ae@mail.gmail.com> On Wed, May 6, 2009 at 09:31, "Martin v. L?wis" wrote: > They *are* separate naemspaces; that's guaranteed by the implementation. Yes. But utf8b *sounds like* an encoding. When it isn't. I sure thought it was when it was first mentioned. I agree that it would be better to find another name. 'utf8-binary-replace'? Is it only usable with utf8 as an encoding? -- Lennart Regebro: Python, Zope, Plone, Grok http://regebro.wordpress.com/ +33 661 58 14 64 From stephen at xemacs.org Wed May 6 13:39:18 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 06 May 2009 20:39:18 +0900 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <87r5z2g3zk.fsf@uwakimon.sk.tsukuba.ac.jp> <4A0141FD.2050307@v.loewis.de> Message-ID: <87ljpafok9.fsf@uwakimon.sk.tsukuba.ac.jp> Lino Mastrodomenico writes: > It's a know problem with Shift-JIS and was fixed in UTF-8. It was fixed in EUC before Shift-JIS was invented by Microsoft or Big5 was invented by the Taiwanese clone makers. Guido's not the only language designer with a time machine.... From stephen at xemacs.org Wed May 6 15:33:17 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 06 May 2009 22:33:17 +0900 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A014463.4070109@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <87skjig5el.fsf@uwakimon.sk.tsukuba.ac.jp> <4A014463.4070109@v.loewis.de> Message-ID: <87k54ufjaa.fsf@uwakimon.sk.tsukuba.ac.jp> "Martin v. L?wis" writes: > > Yeah, yeah, this is the same old same old from PEP 3131. Anything > > that handles the various attacks based on ASCII-alike characters > > should at least rule out invalid Unicode, too! > > > > And where is this U+DC2F supposed to be coming from, anyway? The > > user's *local* environment or the user's *local* filesystem! > > Why is that not a threat? Suppose you have a setuid application, and > you pass some string on the command line that decodes to /../. Then > the setuid application will be tricked into modifying files it didn't > mean to modify. Of course this is a threat, assuming that the application takes no precautions. But first, it should be stopped by any of several standard precautions. For example, applying os.path.realpath (come to think of it, PEP 383 should say something about realpath, shouldn't it?) and os.path.normpath (PEP 383 should definitely say something about this function; maybe PEP 3131 should, too) before checking access restrictions. If you're not running your paths through those, you're already vulnerable to symlink attacks, and maybe other forms of spoofing. 
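The normpath behaviour that Martin and Stephen keep returning to is easy to check directly. A sketch using posixpath (POSIX path semantics regardless of host OS): literal '..' components are collapsed, while escaped surrogates are just opaque characters that the path functions ignore:

    import posixpath

    assert posixpath.normpath("a/b/../c") == "a/c"   # real dots are recognized
    escaped = "a/b/\udc2e\udc2e\udc2fc"              # escaped '.', '.', '/'
    assert posixpath.normpath(escaped) == escaped    # left untouched

String-level sanitation, in other words, only protects against bytes it can actually see.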
Second, it's a threat already enabled by your restricted version of PEP 383. Access control applies to subdirectories as well as to parent directories. Since you can insert arbitrary non-ASCII bytes into the path using the current definition of 'utf8b', name-based access restrictions can be bypassed in exactly the same way for any directory whose name is not 100.00% ASCII, and the setuid application will be tricked into modifying files it didn't mean to modify. Also, on Mac OS X, system directories, including directories containing system libraries, frameworks, and executables, may be accessible via locale-specific names (I don't have a Japanese- localized Mac at hand to check, but I'm pretty sure in my old Mac the Japanese names appeared in ls in Terminal.app, which means it may be possible to access system directories containing libraries, frameworks, and executables this way). Those can be spoofed in exactly the same way. > Nothing is lost at the moment. Nothing is lost compared to 'strict', true, but under the PEP as it is a large fraction of Shift JIS and Big5 filenames cannot be read under ASCII-compatible file system encodings using 'utf8b'. Yet it is those users who are placed at risk by PEP 383. > In any case, Python 3.1b1 may get released today, so it's way too late > for new features in the PEP. They can wait for Python 3.2. You have convinced me that the PEP should wait as well. In its current form it is incomplete and dangerous. From solipsis at pitrou.net Wed May 6 15:40:16 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 6 May 2009 13:40:16 +0000 (UTC) Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <87skjig5el.fsf@uwakimon.sk.tsukuba.ac.jp> <4A014463.4070109@v.loewis.de> <87k54ufjaa.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: Stephen J. Turnbull xemacs.org> writes: > > Nothing is lost compared to 'strict', true, but under the PEP as it is > a large fraction of Shift JIS and Big5 filenames cannot be read under > ASCII-compatible file system encodings using 'utf8b'. You should really be more specific. I'm not sure about others, but I don't understand what filenames you are talking about. From rdmurray at bitdance.com Wed May 6 15:55:16 2009 From: rdmurray at bitdance.com (R. David Murray) Date: Wed, 6 May 2009 09:55:16 -0400 (EDT) Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <87skjig5el.fsf@uwakimon.sk.tsukuba.ac.jp> <4A014463.4070109@v.loewis.de> <87k54ufjaa.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Wed, 6 May 2009 at 13:40, Antoine Pitrou wrote: > Stephen J. Turnbull xemacs.org> writes: >> >> Nothing is lost compared to 'strict', true, but under the PEP as it is >> a large fraction of Shift JIS and Big5 filenames cannot be read under >> ASCII-compatible file system encodings using 'utf8b'. > > You should really be more specific. I'm not sure about others, but I don't > understand what filenames you are talking about. Seems to me that the best thing to do would be to file a bug report with test cases that demonstrate the problems when run against the current py3k trunk. Especially the security issues you cite (which I don't understand). 
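A concrete test case along the lines Antoine and David are asking for, reusing the Shift JIS character Lino cites above (U+639B encodes to b'\x8a|', with an ASCII pipe as the trailing byte). Assuming a UTF-8 file system encoding and the handler as it shipped ('surrogateescape'), only the high byte is escaped and the name still round-trips:

    raw = "\u639b".encode("shift_jis")                  # b'\x8a|'
    s = raw.decode("utf-8", "surrogateescape")
    assert s == "\udc8a|"                               # the pipe survives as ASCII
    assert s.encode("utf-8", "surrogateescape") == raw  # bytes restored intact

So the name can be listed and re-opened, but its str form contains a real '|', which is the readability and safety trade-off being debated.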
--David From zooko at zooko.com Wed May 6 15:48:57 2009 From: zooko at zooko.com (Zooko Wilcox-O'Hearn) Date: Wed, 6 May 2009 07:48:57 -0600 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <87k54ufjaa.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <87skjig5el.fsf@uwakimon.sk.tsukuba.ac.jp> <4A014463.4070109@v.loewis.de> <87k54ufjaa.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4D13A827-2FC4-43F8-99CD-7188F832EA2A@zooko.com> On May 6, 2009, at 7:33 AM, Stephen J. Turnbull wrote: > You have convinced me that the PEP should wait as well. > > In its current form it is incomplete and dangerous. +1 on delaying PEP 383 I think PEP 383 is a good idea in principle, but I'm still struggling to understand it myself, and it seems to offer new hazards for the unwary programmer. On the other hand, maybe the wary programmers are waiting for Python 3.2 anyway . On the gripping hand, if PEP 383 is released in Python 3.1, will that obligate python-dev to support it indefinitely, at least in backwards- compatibility mode? I'm not thinking of API compatibility as much as data compatibility -- someone used Python 3.1 to write down some filenames, and now a few years later they are trying to use the latest and greatest Python release to read those filenames... Regards, Zooko From foom at fuhm.net Wed May 6 16:41:53 2009 From: foom at fuhm.net (James Y Knight) Date: Wed, 6 May 2009 10:41:53 -0400 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <87my9qfu4p.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A005A7A.7070501@mrabarnett.plus.com> <874ovzh2xb.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00B339.5050305@v.loewis.de> <87tz3yg6jy.fsf@uwakimon.sk.tsukuba.ac.jp> <4A013DE1.5000401@v.loewis.de> <87my9qfu4p.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On May 6, 2009, at 5:39 AM, Stephen J. Turnbull wrote: > Now, with Python's file system encoding == UTF-8 or any packed EUC, > and more than a handful of Shift JIS or Big5 characters in file names, > one is *almost certain* to encounter ASCII as the second byte of a > multibyte sequence. PEP 383 can't handle this Hm, I haven't tried the implementation, but I thought that what would happen is: '\x85a'.decode('utf-8', 'utf8b/surrogate-replace/whateveritscalled') - > u'\uDC85a' If that indeed doesn't happen, that's certainly a defect and should be remedied. > , but it is sure to be > the most common use case for PEP 383 in East Asia. Yes. James From ncoghlan at gmail.com Wed May 6 16:59:30 2009 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 07 May 2009 00:59:30 +1000 Subject: [Python-Dev] Undocumented change / bug in Python3's PyMapping_Check In-Reply-To: <3283f7fe0905042219r23113ca6ud6dd3840d7462f37@mail.gmail.com> References: <3283f7fe0905042219r23113ca6ud6dd3840d7462f37@mail.gmail.com> Message-ID: <4A01A5D2.4030803@gmail.com> John Millikin wrote: > In Python 2, PyMapping_Check will return 0 for list objects. In Python > 3, it returns 1. Obviously, this makes it rather difficult to > differentiate between mappings and other sized iterables. In addition, > it differs from the behavior of the ``collections.Mapping`` ABC -- > isinstance([], collections.Mapping) returns False. > > I believe the new behavior is erroneous, but would like to confirm > that before filing a bug. It's not a bug. 
PyMapping_Check just tells you if a type has an entry in the tp_as_mapping->mp_subscript slot. In 2.x, it used to have an additional condition that the tp_as_sequence->sq_slice slot be empty, but that has gone away in Py3k because the sq_slice slot has been removed. Even in 2.x that test wasn't a reliable way of telling if something was a mapping or a sequence - it happened to get it right for lists and tuples (since they define __getslice__ and __setslice__), but this is not the case for new-style user defined sequences: >>> from operator import isMappingType >>> class MySeq(object): ... def __getitem__(self, idx): ... # Is this a mapping or an unsliceable sequence? ... return idx*2 ... >>> isMappingType(MySeq()) True Using the new collections module ABCs to check for sequences and mappings. That's what they're for, and they will give you a much more reliable answer than the C level checks (which are really just an implementation detail). Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- From solipsis at pitrou.net Wed May 6 18:54:37 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 6 May 2009 16:54:37 +0000 (UTC) Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <87skjig5el.fsf@uwakimon.sk.tsukuba.ac.jp> <4A014463.4070109@v.loewis.de> <87k54ufjaa.fsf@uwakimon.sk.tsukuba.ac.jp> <4D13A827-2FC4-43F8-99CD-7188F832EA2A@zooko.com> Message-ID: Zooko Wilcox-O'Hearn zooko.com> writes: > > I'm not thinking of API compatibility as much as > data compatibility -- someone used Python 3.1 to write down some > filenames, and now a few years later they are trying to use the > latest and greatest Python release to read those filenames... Well, if the filenames are generated by Python (as opposed to read from an existing directory on disk), they should be regular unicode objects without any lone surrogates, so I don't see the compatibility problem. Regards Antoine. From v+python at g.nevcal.com Wed May 6 19:05:01 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 06 May 2009 10:05:01 -0700 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <87k54ufjaa.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <87skjig5el.fsf@uwakimon.sk.tsukuba.ac.jp> <4A014463.4070109@v.loewis.de> <87k54ufjaa.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4A01C33D.3030906@g.nevcal.com> On approximately 5/6/2009 6:33 AM, came the following characters from the keyboard of Stephen J. Turnbull: > "Martin v. L?wis" writes: > > In any case, Python 3.1b1 may get released today, so it's way too late > > for new features in the PEP. They can wait for Python 3.2. > > You have convinced me that the PEP should wait as well. > > In its current form it is incomplete and dangerous. I see nothing in this thread that suggests that the PEP is dangerous in its current form. While I (still) think that more readable transcodings could have been used, and while I had difficulty fully understanding the PEP at first, now that I think I do understand the PEP, and it has been somewhat clarified and amended, I cannot see how it could be dangerous. A specific case of danger should be included with such a statement. Regarding incomplete, I agree it won't brush my teeth for me, but I think it does solve the problem it sets out to solve. 
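For anyone still trying to pin down what the handler actually does, the escaping scheme fits in a few lines of pure Python. This is only an illustration: the registered name 'pep383-demo' is invented, and the encode half assumes the extended error-callback interface the PEP introduces (a handler may hand back bytes):

    import codecs

    def pep383_demo(exc):
        if isinstance(exc, UnicodeDecodeError):
            # Every byte the codec rejected becomes a lone surrogate.
            bad = exc.object[exc.start:exc.end]
            return ''.join(chr(0xDC00 + b) for b in bad), exc.end
        if isinstance(exc, UnicodeEncodeError):
            chars = exc.object[exc.start:exc.end]
            if all(0xDC80 <= ord(c) <= 0xDCFF for c in chars):
                # ...and each such surrogate turns back into its byte.
                return bytes(ord(c) - 0xDC00 for c in chars), exc.end
        raise exc

    codecs.register_error('pep383-demo', pep383_demo)

    assert b'ab\xff'.decode('utf-8', 'pep383-demo') == 'ab\udcff'
    assert 'ab\udcff'.encode('utf-8', 'pep383-demo') == b'ab\xff'

The round trip in the last two lines is the problem the PEP sets out to solve.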
-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From v+python at g.nevcal.com Wed May 6 19:08:22 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 06 May 2009 10:08:22 -0700 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A0161AD.6000605@mrabarnett.plus.com> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> Message-ID: <4A01C406.3030004@g.nevcal.com> On approximately 5/6/2009 3:08 AM, came the following characters from the keyboard of MRAB: > M.-A. Lemburg wrote: >> Martin v. L?wis wrote: > Judging by the existing names, I think that 'surrogate' would be > reasonable. It already contains the meaning of substitute, it's not too > long, and the codes which act as replacements are already called > surrogates. > >> I want to avoid any such confusion with Python codecs and don't >> understand why you are making a problem out of this. +1 for "surrogate" as the name for the error handler. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From v+python at g.nevcal.com Wed May 6 19:11:15 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 06 May 2009 10:11:15 -0700 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A0141FD.2050307@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <87r5z2g3zk.fsf@uwakimon.sk.tsukuba.ac.jp> <4A0141FD.2050307@v.loewis.de> Message-ID: <4A01C4B3.9050905@g.nevcal.com> On approximately 5/6/2009 12:53 AM, came the following characters from the keyboard of Martin v. L?wis: >> Sorry! I suggest substituting the paragraph above for the paragraph >> which begins "The encode error handler interface presentlyrequires..." >> at line 129. > > Ah, ok. This was Glen Linderman's text before - now it's yours :-) Which is fine by me. Stephen's is more explanatory than mine, but says the same thing. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From tjreedy at udel.edu Wed May 6 21:13:55 2009 From: tjreedy at udel.edu (Terry Reedy) Date: Wed, 06 May 2009 15:13:55 -0400 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A01C406.3030004@g.nevcal.com> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> Message-ID: Glenn Linderman wrote: > On approximately 5/6/2009 3:08 AM, came the following characters from > the keyboard of MRAB: >> M.-A. Lemburg wrote: >>> Martin v. L?wis wrote: > >> Judging by the existing names, I think that 'surrogate' would be >> reasonable. It already contains the meaning of substitute, it's not too >> long, and the codes which act as replacements are already called >> surrogates. 
>> >>> I want to avoid any such confusion with Python codecs and don't >>> understand why you are making a problem out of this. > > > +1 for "surrogate" as the name for the error handler. > > +1 from me also From zooko at zooko.com Wed May 6 21:18:03 2009 From: zooko at zooko.com (Zooko Wilcox-O'Hearn) Date: Wed, 6 May 2009 13:18:03 -0600 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <87skjig5el.fsf@uwakimon.sk.tsukuba.ac.jp> <4A014463.4070109@v.loewis.de> <87k54ufjaa.fsf@uwakimon.sk.tsukuba.ac.jp> <4D13A827-2FC4-43F8-99CD-7188F832EA2A@zooko.com> Message-ID: On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote: > Zooko Wilcox-O'Hearn zooko.com> writes: >> >> I'm not thinking of API compatibility as much as data >> compatibility -- someone used Python 3.1 to write down some >> filenames, and now a few years later they are trying to use the >> latest and greatest Python release to read those filenames... > > Well, if the filenames are generated by Python (as opposed to read > from an existing directory on disk), they should be regular unicode > objects without any lone surrogates, so I don't see the > compatibility problem. I meant that the application reads filenames from an existing directory on disk, saves those filenames, and then later, using a future version of Python, wants to read them and use them. I'm not saying that I know this would be a problem. I'm saying that I personally can't tell whether it would be a problem or not, and the extensive discussions so far have not convinced me that there is anyone who both understands PEP 383 and considers this use case. Many people who apparently understand encoding issues well have said something to the effect that there is no problem, but those people haven't yet managed to get through my thick skull how I would use PEP 383 safely for this sort of use case -- the one where data generated by os.listdir() travels forward in time or the one were that data travels sideways to other systems, including Windows or other systems that validate incoming unicode. That's why I am a bit uncomfortable about PEP 383 being quickly implemented and deployed in Python 3.1. By the way, much of the detailed discussion about what Tahoe requires and how that may or may not benefit from PEP 383 has now moved to the tahoe-dev mailing list: http://allmydata.org/cgi-bin/mailman/listinfo/ tahoe-dev . Regards, Zooko From v+python at g.nevcal.com Wed May 6 22:17:05 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 06 May 2009 13:17:05 -0700 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <87skjig5el.fsf@uwakimon.sk.tsukuba.ac.jp> <4A014463.4070109@v.loewis.de> <87k54ufjaa.fsf@uwakimon.sk.tsukuba.ac.jp> <4D13A827-2FC4-43F8-99CD-7188F832EA2A@zooko.com> Message-ID: <4A01F041.9000709@g.nevcal.com> On approximately 5/6/2009 12:18 PM, came the following characters from the keyboard of Zooko Wilcox-O'Hearn: > On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote: > >> Zooko Wilcox-O'Hearn zooko.com> writes: >>> >>> I'm not thinking of API compatibility as much as data compatibility >>> -- someone used Python 3.1 to write down some filenames, and now a >>> few years later they are trying to use the latest and greatest Python >>> release to read those filenames... 
>> >> Well, if the filenames are generated by Python (as opposed to read >> from an existing directory on disk), they should be regular unicode >> objects without any lone surrogates, so I don't see the compatibility >> problem. > > I meant that the application reads filenames from an existing directory > on disk, saves those filenames, and then later, using a future version > of Python, wants to read them and use them. Regarding future versions of Python. In the worst case, even if Python's default behavior changes, the transcoding done by PEP 383 can be done in other software too... it is a straightforward, fully specified, 1-to-1, reversible transcoding process, affecting and generating only invalid byte encodings on one side, and invalid Unicode sequences on the other. So if Python's default behavior should change, the transcoding implemented by PEP 383 could be easily reimplemented to enable a future version of a Python application to manipulate the transcoded, saved, filenames. By easily, I mean that I could code it in a couple hours, max. > I'm not saying that I know this would be a problem. I'm saying that I > personally can't tell whether it would be a problem or not, and the > extensive discussions so far have not convinced me that there is anyone > who both understands PEP 383 and considers this use case. Does the above help? > Many people who apparently understand encoding issues well have said > something to the effect that there is no problem, but those people > haven't yet managed to get through my thick skull how I would use PEP > 383 safely for this sort of use case -- the one where data generated by > os.listdir() travels forward in time or the one were that data travels > sideways to other systems, including Windows or other systems that > validate incoming unicode. Regarding data traveling sideways, some comments: 1) PEP 383's effect could be recoded in other languages as easily as it is in Python (or the C in which Python is implmented). So that could be a solution. 2) You mention "Windows" and "other systems that validate incoming unicode" in the same phrase, as if you think that "Windows" qualifies as an "other systems that validate incoming unicode", but it does not (at least not universally). > That's why I am a bit uncomfortable about PEP 383 being quickly > implemented and deployed in Python 3.1. Does the above help? > By the way, much of the detailed discussion about what Tahoe requires > and how that may or may not benefit from PEP 383 has now moved to the > tahoe-dev mailing list: > http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev . I have no background with Tahoe, nor particular interest, although it sounds like a useful project... so I won't be joining that list. I have no idea if there is an installed base of existing Tahoe file systems, my suggestions below assume that there is not, and that you are presently inventing them. Therefore, I provide no migration path, although I could invent one, but it would take longer to describe. However, since I'm responding here, and have read what you have posted here, it seems like the following could be true. Assumptions from your emails: A) Tahoe wants to provide a UTF-8 file name system B) Tahoe wants to interface to POSIX systems that use (and do not validate) byte interfaces. C) Tahoe wants to interface to non-POSIX systems that use 16-bit file name interfaces, with no validation. D) Tahoe wants to interface to non-POSIX systems that use 16-bit file name interfaces, with validation. 
Uncertainties: I'm not clear on what your goals are for Tahoe filenames. There seem to be 2 possibilities: 1) you want to reject attempts to use non-validating Unicode, be it from a 16-bit interface, or a bytes interface. 2) you don't want to reject non-validating Unicode, but you want to convert it to valid Unicode for (D) systems. 3) Orthogonally, you might want to store only Valid Unicode in the names, or you might not care, if you can meet the other goals. Truisms: If you want to support (D), and (2), then you must transform names at some point, using some scheme, because not all names supplied by (B) systems will be acceptable to (D) systems. You can choose to do this transformation when a (B) system provides an invalid (per Unicode) name, or you can choose to do the transformation when a (D) system accesses a file with an invalid (per Unicode) name. If the (B) and (D) systems talk to each other outside of Tahoe, they will have to do similar transformations, or, if they both access the same Tahoe system, they will have to do the identical transformation, to be sure that they can access the same file. All transcoding schemes have the possibility of data puns between non-transcoded names and transcoded names. In order to successfully and properly manipulate a name, you must know whether or not it has been transcoded, and how. PEP 383 limits its transcoding to names that are invalid (per Unicode). Names that cannot be properly decoded to Unicode are decoded to invalid Unicode. Names that are invalid Unicode are encoded to invalid byte sequences (per the encoding scheme specified). For PEP 383 and Python, transcoded names can be distinguished by checking for the existence of lone surrogates in the str form of the filename, or by attempting to do a strict decoding of the bytes form of the filename, depending on what you have (generally, the former). For PEP 383 and Python, the names will round trip from the POSIX bytes interfaces to the program, and back to POSIX bytes interfaces, as long as only Python wrappers of system functions are used, and the filesystem encoding is not changed between calls (or is restored). Passing them to 3rd party libraries or other systems requires extra work, if there is a desire to manipulate files with names that are not decodeable to Unicode by the standard decoding algorithm for that encoding. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From martin at v.loewis.de Wed May 6 22:40:13 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 06 May 2009 22:40:13 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A015E08.5000203@egenix.com> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> Message-ID: <4A01F5AD.4000404@v.loewis.de> >>> The name "utf8b" suggested in the PEP is not in line with the codec >>> design >> Where is that design documented, and how exactly violates the name >> the design (chapter and verse, please). > > Martin, I designed the whole Python codec machinery Not true. PEP 293 was written and designed by Walter D?rwald. > so even if > this is not explicitly written down somewhere, you can take my > word for it. 
If the design was specified in writing somewhere, I would probably challenge it as obsolete. If it isn't described anywhere, I'll have to ignore it. > I want to avoid any such confusion with Python codecs and don't > understand why you are making a problem out of this. Because utf8b (or, perhaps "UTF-8b") is the official name for this algorithm: http://hyperreal.org/~est/utf-8b/ Regards, Martin From martin at v.loewis.de Wed May 6 22:34:53 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 06 May 2009 22:34:53 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <87my9qfu4p.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A005A7A.7070501@mrabarnett.plus.com> <874ovzh2xb.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00B339.5050305@v.loewis.de> <87tz3yg6jy.fsf@uwakimon.sk.tsukuba.ac.jp> <4A013DE1.5000401@v.loewis.de> <87my9qfu4p.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4A01F46D.50105@v.loewis.de> > I'm sorry for the lack of clarity of my posts, but somehow you're > completely missing the point. The point is precisely that Python > *won't* use Shift JIS as the file system encoding (if it did there > would be no problem with reading Shift JIS), but the people who > created the media *did*. > > Now, with Python's file system encoding == UTF-8 or any packed EUC, > and more than a handful of Shift JIS or Big5 characters in file names, > one is *almost certain* to encounter ASCII as the second byte of a > multibyte sequence. PEP 383 can't handle this Not true. PEP 383 handles this very example just fine, with no problems that I can see. Can you propose a specific example that you think might cause problems? By "specific", I mean: what file names (exact bytes, please), what locale charset, what API calls. Regards, Martin From martin at v.loewis.de Wed May 6 22:41:11 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 06 May 2009 22:41:11 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A0161AD.6000605@mrabarnett.plus.com> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> Message-ID: <4A01F5E7.7030401@v.loewis.de> > Judging by the existing names, I think that 'surrogate' would be > reasonable MAL's list of existing names is incomplete. "surrogates" is already an existing name, also, and it means something different (similar, but different). Regards, Martin From martin at v.loewis.de Wed May 6 22:42:03 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 06 May 2009 22:42:03 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> Message-ID: <4A01F61B.1000203@v.loewis.de> Terry Reedy wrote: > Glenn Linderman wrote: >> On approximately 5/6/2009 3:08 AM, came the following characters from >> the keyboard of MRAB: >>> M.-A. Lemburg wrote: >>>> Martin v. L?wis wrote: >> >>> Judging by the existing names, I think that 'surrogate' would be >>> reasonable. 
It already contains the meaning of substitute, it's not too >>> long, and the codes which act as replacements are already called >>> surrogates. >>> >>>> I want to avoid any such confusion with Python codecs and don't >>>> understand why you are making a problem out of this. >> >> >> +1 for "surrogate" as the name for the error handler. >> >> > +1 from me also Despite there being also an error handler called "surrogates". Are you serious? Regards, Martin From martin at v.loewis.de Wed May 6 22:44:09 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 06 May 2009 22:44:09 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <319e029f0905060328s5f3446a1j92c52d7d6cc140ae@mail.gmail.com> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <319e029f0905060328s5f3446a1j92c52d7d6cc140ae@mail.gmail.com> Message-ID: <4A01F699.6050408@v.loewis.de> > Is it only usable with utf8 as an encoding? No, it applies to any codec which potentially cannot decode all bytes >127. Regards, Martin From solipsis at pitrou.net Wed May 6 22:48:15 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 6 May 2009 20:48:15 +0000 (UTC) Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> Message-ID: Martin v. L?wis v.loewis.de> writes: > > Despite there being also an error handler called "surrogates". People, perhaps we could end all the bikeshedding and call one of those handlers "surrogates-pass" and the other "surrogates-escape", which sounds quite faithful to what they actually /do/? Regards Antoine. From martin at v.loewis.de Wed May 6 22:48:34 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 06 May 2009 22:48:34 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <87k54ufjaa.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <87skjig5el.fsf@uwakimon.sk.tsukuba.ac.jp> <4A014463.4070109@v.loewis.de> <87k54ufjaa.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4A01F7A2.5080603@v.loewis.de> > But first, it should be stopped by any of several > standard precautions. For example, applying os.path.realpath (come to > think of it, PEP 383 should say something about realpath, shouldn't > it?) Why do you think so? I think the existing documentation of realpath is correct and complete. > and os.path.normpath (PEP 383 should definitely say something > about this function Precisely what? > maybe PEP 3131 should, too) How can this be of relevance? > > Nothing is lost at the moment. > > Nothing is lost compared to 'strict', true, but under the PEP as it is > a large fraction of Shift JIS and Big5 filenames cannot be read under > ASCII-compatible file system encodings using 'utf8b'. Yet it is those > users who are placed at risk by PEP 383. I think this statement is incorrect. Those filenames *can* be read just fine. 
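A concrete case, using the katakana 'so' that gives Shift JIS its notorious trailing-backslash problem ('utf8b' below is simply the handler name the current patch registers; only the behavior matters here):

    # Shift JIS encodes the katakana 'so' as b'\x83\x5c'; the second
    # byte is the ASCII backslash.
    raw = b'\x83\x5c'

    # Under a UTF-8 file system encoding, 0x83 cannot start a valid
    # sequence, so it is escaped to a lone surrogate, while 0x5c still
    # decodes as plain ASCII.
    name = raw.decode('utf-8', 'utf8b')
    assert name == '\udc83\\'

    # Encoding the escaped string restores the original bytes, so the
    # file can be listed, opened and removed even though its name never
    # decodes cleanly.
    assert name.encode('utf-8', 'utf8b') == raw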
Regards, Martin From martin at v.loewis.de Wed May 6 22:56:34 2009 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Wed, 06 May 2009 22:56:34 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> Message-ID: <4A01F982.2030205@v.loewis.de> Antoine Pitrou wrote: > Martin v. L?wis v.loewis.de> writes: >> Despite there being also an error handler called "surrogates". > > People, perhaps we could end all the bikeshedding and call one of those handlers > "surrogates-pass" and the other "surrogates-escape", which sounds quite faithful > to what they actually /do/? The problem with these bike-shedding discussions is that you cannot stop them with a proposal. People will counter-propose. I would be willing to accept a ruling from someone who a) is a native speaker of English, and b) has demonstrated to fully understand what these do, and c) has understood why I insist on calling it utf8b. Regards, Martin From tjreedy at udel.edu Wed May 6 23:47:05 2009 From: tjreedy at udel.edu (Terry Reedy) Date: Wed, 06 May 2009 17:47:05 -0400 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A01F61B.1000203@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> Message-ID: Martin v. L?wis wrote: >>> +1 for "surrogate" as the name for the error handler. >>> >>> >> +1 from me also > > Despite there being also an error handler called "surrogates". Given that additional information which MAL apparently omitted, I would revise. > Are you serious? Are you? ;-? You are the one naming a codec-agnostic error handler (if I understand correctly, and correct me if I do not) after a particular codec, and denying that that could cause confusion. See other message. Terry Jan Reedy From p.f.moore at gmail.com Thu May 7 00:01:23 2009 From: p.f.moore at gmail.com (Paul Moore) Date: Wed, 6 May 2009 23:01:23 +0100 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> Message-ID: <79990c6b0905061501u753042c4y337b92605578020e@mail.gmail.com> 2009/5/6 Antoine Pitrou : > Martin v. L?wis v.loewis.de> writes: >> >> Despite there being also an error handler called "surrogates". > > People, perhaps we could end all the bikeshedding and call one of those handlers > "surrogates-pass" and the other "surrogates-escape", which sounds quite faithful > to what they actually /do/? We could also stop the bikeshedding by sticking with the name utf8b. Martin's comment that it is the official name for this algorithm seems compelling to me (even if it is confusing because of its similarity with utf-8). Paul. 
From tjreedy at udel.edu Thu May 7 00:03:57 2009 From: tjreedy at udel.edu (Terry Reedy) Date: Wed, 06 May 2009 18:03:57 -0400 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A01F5AD.4000404@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A01F5AD.4000404@v.loewis.de> Message-ID: Martin v. L?wis wrote: > Because utf8b (or, perhaps "UTF-8b") is the official name for this > algorithm: > http://hyperreal.org/~est/utf-8b/ Thank you for the link. It starts: "This directory contains a C implementation of a UTF-8b codec. A Python codec based on it is provided as well." 'RTF-8b' consists, obviously, 'UTF-8' plus 'b', with the 'b' signifying a variation of or addition to UTF-8. The 'b', and only the 'b', refers to the innovative error-handler that was added to the existing 'UTF-8' codec/algorithm. The name of the combined whole is not the name of the part. If you were incorporating the Python-wrapped utf-8b *codec* as a codec, which is what I once thought *because you used that name*, then calling it 'utf-8b' would be fine. But you apparently instead proposed and implemented an *error-handler*, which seems to me to be something else, and which will not be specific to utf-8 but usable with any codec. Hence some of us think it should have a different name. I gather that you lifted the error-handler part of the algorithm and propose to use it with *any* ascii-respecting codec. I could claim that the 'official name' of that part is 'b', but I think we can find a better name. Terry Jan Reedy From tjreedy at udel.edu Thu May 7 00:33:11 2009 From: tjreedy at udel.edu (Terry Reedy) Date: Wed, 06 May 2009 18:33:11 -0400 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A01F982.2030205@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A01F982.2030205@v.loewis.de> Message-ID: Martin v. L?wis wrote: > Antoine Pitrou wrote: >> Martin v. L?wis v.loewis.de> writes: >>> Despite there being also an error handler called "surrogates". >> People, perhaps we could end all the bikeshedding and call one of those handlers >> "surrogates-pass" and the other "surrogates-escape", which sounds quite faithful >> to what they actually /do/? > > The problem with these bike-shedding discussions is that you cannot stop > them with a proposal. People will counter-propose. > > I would be willing to accept a ruling from someone who a) is a native > speaker of English, and b) has demonstrated to fully understand what > these do, and c) has understood why I insist on calling it utf8b. I qualify with a). I believe I understand c) but, as explained in my other post, I do not think your reason applies. In fact, I think concern for naming rights might suggest that you *not* reuse the name for something different. I would have to learn more about the existing 'surrogates' handler to judge Antione's suggestion 'surrogates-pass'. 'Surrogates-escape' is pretty good for the new handler since, to my understanding, it 'escapes' 'bad bytes' by prefixing them with bits that push them to the surrogates plane. 
I have been supportive of the idea and, as well as I understood them, the particulars of your proposal, from the beginning. Reusing the name of a codec as the name of an error-handler confused me and I believe it will confuse others, even though, but also because, the error handler was extracted and generalized from the codec. Terry Jan Reedy From martin at v.loewis.de Thu May 7 00:59:18 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Thu, 07 May 2009 00:59:18 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> Message-ID: <4A021646.8030904@v.loewis.de> >> Are you serious? > > Are you? ;-? You are the one naming a codec-agnostic error handler (if > I understand correctly, and correct me if I do not) after a particular > codec, and denying that that could cause confusion. See other message. I can only repeat what I said before: I call it utf8b because that's the established name for the algorithm it implements. That algorithm was originally designed with UTF-8 in mind (and only meant to be applied for UTF-8), however, it remains the same algorithm even though PEP 383 widens its application. Regards, Martin From google at mrabarnett.plus.com Thu May 7 01:06:24 2009 From: google at mrabarnett.plus.com (MRAB) Date: Thu, 07 May 2009 00:06:24 +0100 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> Message-ID: <4A0217F0.4070004@mrabarnett.plus.com> Antoine Pitrou wrote: > Martin v. L?wis v.loewis.de> writes: >> Despite there being also an error handler called "surrogates". > > People, perhaps we could end all the bikeshedding and call one of those handlers > "surrogates-pass" and the other "surrogates-escape", which sounds quite faithful > to what they actually /do/? > After having read about the existing error handler called "surrogates" and having thought about it, I've decided that calling one just "surrogates" isn't very helpful to the user; it has something to do with surrogates, but what? So +1 for Antoine's suggestion from me. From martin at v.loewis.de Thu May 7 01:16:18 2009 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Thu, 07 May 2009 01:16:18 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A01F982.2030205@v.loewis.de> Message-ID: <4A021A42.4060509@v.loewis.de> > I qualify with a). I believe I understand c) but, as explained in my > other post, I do not think your reason applies. In fact, I think > concern for naming rights might suggest that you *not* reuse the name > for something different. 
I would have to learn more about the existing > 'surrogates' handler to judge Antione's suggestion 'surrogates-pass'. > 'Surrogates-escape' is pretty good for the new handler since, to my > understanding, it 'escapes' 'bad bytes' by prefixing them with bits that > push them to the surrogates plane. See issue 3672. In essence, in python 2.5: py> u"\ud800".encode("utf-8") '\xed\xa0\x80' py> '\xed\xa0\x80'.decode("utf-8") u'\ud800' In 3.1, py> "\ud800".encode("utf-8") Traceback (most recent call last): File "", line 1, in UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed py> "\ud800".encode("utf-8","surrogates") b'\xed\xa0\x80' py> b'\xed\xa0\x80'.decode("utf-8") Traceback (most recent call last): File "", line 1, in UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: illegal encoding py> b'\xed\xa0\x80'.decode("utf-8","surrogates") '\ud800' Regards, Martin From solipsis at pitrou.net Thu May 7 01:27:00 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 6 May 2009 23:27:00 +0000 (UTC) Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A01F982.2030205@v.loewis.de> <4A021A42.4060509@v.loewis.de> Message-ID: Martin v. L?wis v.loewis.de> writes: > py> b'\xed\xa0\x80'.decode("utf-8","surrogates") > '\ud800' The point is, "surrogates" does not mean anything intuitive for an /error handler/. You seem to be the only one who finds this name explicit enough, perhaps because you chose it. Most other handlers' names have verbs in them ("ignore", "replace", "xmlcharrefreplace", etc.). Regards Antoine. From skippy.hammond at gmail.com Thu May 7 01:38:47 2009 From: skippy.hammond at gmail.com (Mark Hammond) Date: Thu, 07 May 2009 09:38:47 +1000 Subject: [Python-Dev] Proposed: add support for UNC paths to all functions in ntpath In-Reply-To: <4A000356.30408@trueblade.com> References: <49F8B222.7070204@hastings.org> <49F8D9A0.7000104@voidspace.org.uk> <49F8DBCD.6050504@trueblade.com> <49F9FCD0.80208@hastings.org> <49FA4064.5000508@gmail.com> <4A000356.30408@trueblade.com> Message-ID: <4A021F87.8030905@gmail.com> Eric Smith wrote: > Mark: I've reviewed this and it looks okay to me. Thanks Eric - I've now applied that patch. As you mentioned in a followup to the bug: | Thanks for looking at this, Mark. If we could only assign issues to | Python 3.2 and 3.3 to change the pending deprecation warning to a real | one, and to remove the function entirely, we'd be all set! I'm always | worried we'll forget these things. (for reference; the patch introduces a PendingDeprecationWarning for ntpath.uncpath) The bug tracker doesn't have these future versions available yet - is there some other way these things should be tracked? I fear simply opening a new bug without a reasonable 'trigger' will linger way beyond the next few versions... 
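For readers unfamiliar with the mechanism: a pending deprecation of this sort ordinarily amounts to a warnings call at the top of the function. A rough sketch, not the actual patch:

    import warnings

    def uncpath(path):
        # Illustrative only; the real code and message live in ntpath.py.
        warnings.warn("ntpath.uncpath() is deprecated",
                      PendingDeprecationWarning, stacklevel=2)
        ...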
Thanks, Mark From murman at gmail.com Thu May 7 03:05:42 2009 From: murman at gmail.com (Michael Urman) Date: Wed, 6 May 2009 20:05:42 -0500 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A01F61B.1000203@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> Message-ID: On Wed, May 6, 2009 at 15:42, "Martin v. L?wis" wrote: > Despite there being also an error handler called "surrogates". Not that I have to be, but I'm not sold on the previous UTF-8 codec behavior becoming an error handler of the name "surrogates" for two reasons (I do respect the obvious PBP argument for the implementation, and have no better name - "lenient"?). First, unless there's a way to stack error handlers, there's no way to access the old behavior combined with the "replace" handler. Second, errors="surrogates" reads like surrogates should be an error, not an additionally allowed pattern. Neither of these are deal breakers or hard to learn, but they are non-obvious. I think the utf8b behavior makes a lot more sense with the name "surrogates", through the mnemonic that errors become surrogates. The stacking argument also applies to the new utf8b behavior on encode (only, as it handles all errors on decode). This may be a YAGNI, but for a non-UTF-8 encode, it may be useful to allow "xmlcharrefreplace" handling for unavailable non-surrogate-escaped characters. But without stacking that's unmaintainable, as we clearly don't want ${codec}b for all current codecs. I'd be perfectly happy with utf8b or UTF-8b, as either a codec or an error handler (do we want both? YAGNI?). So what if it smells a little inaccurate as a handler when used with codecs other than UTF-8, no big deal. I could also see something like errors="roundtrip" which explains the intention of the handler rather than the algorithm, but is awkward on encode when it encounters unavailable Unicode characters. -- Michael Urman From mal at egenix.com Thu May 7 03:06:05 2009 From: mal at egenix.com (M.-A. Lemburg) Date: Thu, 07 May 2009 03:06:05 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A01F5AD.4000404@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A01F5AD.4000404@v.loewis.de> Message-ID: <4A0233FD.6010509@egenix.com> Martin v. L?wis wrote: >>>> The name "utf8b" suggested in the PEP is not in line with the codec >>>> design >>> Where is that design documented, and how exactly violates the name >>> the design (chapter and verse, please). >> Martin, I designed the whole Python codec machinery > > Not true. PEP 293 was written and designed by Walter D?rwald. Walter added the generic error handler callback mechanism and we both worked on their design. I designed and wrote the codec implementation back in 2000, which included the whole idea of having codec error handlers in the first place. The original implementation only allowed per-codec error handlers. Walter extended this to build general-purpose handlers that could be used by many codecs. His original motivation was to be able to do XML character reference escaping. 
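That general-purpose handler is the one registered as "xmlcharrefreplace"; a one-line illustration:

    >>> 'price: \u20ac99'.encode('ascii', 'xmlcharrefreplace')
    b'price: &#8364;99'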
If you don't believe me, go look this up in the repository, the mailing list archives and the trackers. >> so even if >> this is not explicitly written down somewhere, you can take my >> word for it. > > If the design was specified in writing somewhere, I would probably > challenge it as obsolete. If it isn't described anywhere, I'll have > to ignore it. Ah, lovely attitude. >> I want to avoid any such confusion with Python codecs and don't >> understand why you are making a problem out of this. > > Because utf8b (or, perhaps "UTF-8b") is the official name for this > algorithm: > > http://hyperreal.org/~est/utf-8b/ That's a codec implementing the escaping idea proposed by Markus Kuhn, not an official reference. AFAIK, the term "UTF-8B" originated from a "UTF-8 + binary" codec written for iconv: http://mail.nl.linux.org/linux-utf8/2006-04/msg00002.html If it were the official name of an escape algorithm, as you are suggesting, the inventor Markus Kuhn would probably have chosen it, but he hasn't... the only reference to it is an email where it is described as option D for ways of dealing with malformed UTF-8 data in a decoder: http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html Note that this escape method is not applicable for data that you decode from UTF-8 and then e.g. encode as Latin-1. It only works as general purpose method if you are decoding and encoding using the same codec, since it is specifically designed to assure round-trip safety. Martin, please stop being silly and just change the name. Or drop the idea of using an error handler altogether and just let people use the utf-8b codec you referenced above to solve their problems whereever and if needed. Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 07 2009) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2009-06-29: EuroPython 2009, Birmingham, UK 52 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From benjamin at python.org Thu May 7 03:14:06 2009 From: benjamin at python.org (Benjamin Peterson) Date: Wed, 6 May 2009 20:14:06 -0500 Subject: [Python-Dev] test - please ignore Message-ID: <1afaf6160905061814t61b81148y68ccec09cfee1853@mail.gmail.com> Some of my messages appear not to have gotten through. -- Regards, Benjamin From benjamin at python.org Thu May 7 03:32:47 2009 From: benjamin at python.org (Benjamin Peterson) Date: Wed, 6 May 2009 20:32:47 -0500 Subject: [Python-Dev] [RELEASED] Python 3.1 beta 1 Message-ID: <1afaf6160905061832xfc295e3y881c7c8e81083ee6@mail.gmail.com> On behalf of the Python development team, I'm thrilled to announce the first and only beta release of Python 3.1. Python 3.1 focuses on the stabilization and optimization of features and changes Python 3.0 introduced. For example, the new I/O system has been rewritten in C for speed. File system APIs that use unicode strings now handle paths with undecodable bytes in them. [1] Other features include an ordered dictionary implementation and support for ttk Tile in Tkinter. 
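The ordered dictionary mentioned above is collections.OrderedDict, which remembers insertion order; a minimal example:

    >>> from collections import OrderedDict
    >>> list(OrderedDict([('b', 2), ('a', 1)]))
    ['b', 'a']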
For a more extensive list of changes in 3.1, see http://doc.python.org/dev/py3k/whatsnew/3.1.html or Misc/NEWS in the Python distribution. Please note that this is a beta release, and as such is not suitable for production environments. We continue to strive for a high degree of quality, but there are still some known problems and the feature sets have not been finalized. This beta is being released to solicit feedback and hopefully discover bugs, as well as allowing you to determine how changes in 3.1 might impact you. If you find things broken or incorrect, please submit a bug report at http://bugs.python.org For more information and downloadable distributions, see the Python 3.1 website: http://www.python.org/download/releases/3.1/ See PEP 375 for release schedule details: http://www.python.org/dev/peps/pep-0375/ Enjoy, -- Benjamin Benjamin Peterson benjamin at python.org Release Manager (on behalf of the entire python-dev team and 3.1's contributors) From stephen at xemacs.org Thu May 7 04:35:52 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Thu, 07 May 2009 11:35:52 +0900 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A01F46D.50105@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A005A7A.7070501@mrabarnett.plus.com> <874ovzh2xb.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00B339.5050305@v.loewis.de> <87tz3yg6jy.fsf@uwakimon.sk.tsukuba.ac.jp> <4A013DE1.5000401@v.loewis.de> <87my9qfu4p.fsf@uwakimon.sk.tsukuba.ac.jp> <4A01F46D.50105@v.loewis.de> Message-ID: <87iqkdfxmf.fsf@uwakimon.sk.tsukuba.ac.jp> "Martin v. L?wis" writes: > > Now, with Python's file system encoding == UTF-8 or any packed EUC, > > and more than a handful of Shift JIS or Big5 characters in file names, > > one is *almost certain* to encounter ASCII as the second byte of a > > multibyte sequence. PEP 383 can't handle this Ah, I see. Of course, the algorithm not only has to handle the ASCII octet which is erroneous because it can't be a trailing byte, but *also the leading byte that signalled to expect a trailing byte >127*. So the algorithm backs up to the character boundary (which is well-defined for all the "sane" encodings), encode the high byte(s) in the character with lone surrogates, and encode the ASCII as itself (promoted to a Unicode code point). Sorry, you're right, I was just confused. I withdraw the objection as completely mistaken, and apologize for not thinking more carefully in the first place. From tjreedy at udel.edu Thu May 7 05:48:38 2009 From: tjreedy at udel.edu (Terry Reedy) Date: Wed, 06 May 2009 23:48:38 -0400 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A021646.8030904@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A021646.8030904@v.loewis.de> Message-ID: Martin v. L?wis wrote: >>> Are you serious? >> Are you? ;-? You are the one naming a codec-agnostic error handler (if >> I understand correctly, and correct me if I do not) after a particular >> codec, and denying that that could cause confusion. See other message. > > I can only repeat what I said before: I call it What, specifically, is 'it'? > utf8b because that's > the established name for the algorithm Which algorithm? > it implements. 
Again, what is 'it'? As *I* read the sentence above, it is not true. I went to the site you referred to as the source of your reasoning and specifically http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/utf_8b.c The algorithm called utf-8b *IS* utf-8 with the addition or replacement (of an error return) of essentially one line in each direction: # encode if 0xDC00 <= codepoint <= 0xDCFF: byte = codepoint - 0xDC00 #encode Note: for security concerns, you are increasing the lower limit to 0xDC80. The comment at the top of the utf_8b.c, suggests that that is what it should be and should have been in the file, with the other half of that surrogate area an error along with the other surrogate area. #decode if (0x80 <= byte <= 0xFF) and utf-8-invalid(byte): codepoint = byte + 0xDC00 # decode > That algorithm was originally designed with UTF-8 in mind (and only > meant to be applied for UTF-8), however, it remains the same algorithm > even though PEP 383 widens its application. The error handler designed with utf-8 in mind has no name in the encode direction and is called "utf_8b_decoder_invalid_bytes" in the decode direction. By your reasoning, *that* should be its name in Python. The encoding error handler would then be named analogously "utf_8b_encoder_invalid_codepoints". Even these, to me, would be better than confusing giving them the same name as the codec. Terry Jan Reedy From v+python at g.nevcal.com Thu May 7 06:16:02 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 06 May 2009 21:16:02 -0700 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A0233FD.6010509@egenix.com> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A01F5AD.4000404@v.loewis.de> <4A0233FD.6010509@egenix.com> Message-ID: <4A026082.2030508@g.nevcal.com> On approximately 5/6/2009 6:06 PM, came the following characters from the keyboard of M.-A. Lemburg: > Martin, please stop being silly and just change the name. Yes, please. If indeed Marc-Andre invented the codec business as he claims, he would be an appropriate person to give a fiat name to the error handler. > Or drop the idea of using an error handler altogether and just let > people use the utf-8b codec you referenced above to solve their > problems whereever and if needed. The design as an error handler is clever in leveraging the same error handler for multiple codecs, which cannot be done by using utf-8b alone, if I understand correctly. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From martin at v.loewis.de Thu May 7 07:43:30 2009 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Thu, 07 May 2009 07:43:30 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> Message-ID: <4A027502.5000901@v.loewis.de> Michael Urman wrote: > On Wed, May 6, 2009 at 15:42, "Martin v. 
L?wis" wrote: >> Despite there being also an error handler called "surrogates". > > Not that I have to be, but I'm not sold on the previous UTF-8 codec > behavior becoming an error handler of the name "surrogates" for two > reasons (I do respect the obvious PBP argument for the implementation, > and have no better name - "lenient"?). PBP? > First, unless there's a way to stack error handlers, there's no way to > access the old behavior combined with the "replace" handler. Well, there is a way to stack error handlers, although it's not pretty: _surrogates = codecs.lookup_errors("surrogates") _replace = codecs.lookup_errors("replace") def surrogates_then_replace(exc): try: return _surrogates(exc) except UnicodeError: return _replace(exc) codecs.register_error("surrogates_then_replace", surrogates_then_replace) > The stacking argument also applies to the new utf8b behavior on encode > (only, as it handles all errors on decode). This may be a YAGNI Indeed - in particular, as, in the primary application of this error handler (i.e. file IO operations), there is no way of specifying an addition error handler anyway. Regards, Martin From martin at v.loewis.de Thu May 7 07:53:07 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Thu, 07 May 2009 07:53:07 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A021646.8030904@v.loewis.de> Message-ID: <4A027743.2050500@v.loewis.de> > The error handler designed with utf-8 in mind has no name in the encode > direction and is called "utf_8b_decoder_invalid_bytes" in the decode > direction. By your reasoning, *that* should be its name in Python. The > encoding error handler would then be named analogously > "utf_8b_encoder_invalid_codepoints". Even these, to me, would be better > than confusing giving them the same name as the codec. So are you proposing that I should rename the PEP 383 handler to "utf_8b_encoder_invalid_codepoints"? Regards, Martin From martin at v.loewis.de Thu May 7 08:10:16 2009 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Thu, 07 May 2009 08:10:16 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <87r5z2g3zk.fsf@uwakimon.sk.tsukuba.ac.jp> <4A0141FD.2050307@v.loewis.de> Message-ID: <4A027B48.5060208@v.loewis.de> > By the way, what are the ASCII characters that are not suppported by Shift-JIS? > Not many I suppose? (if I read the Wikipedia entry correctly, it's only the > backslash and the tilde). The problem with this encoding is that bytes below 128 appear as second bytes of a two-byte encoding: py> "\x81@".decode("shift-jis") u'\u3000' py> "\x81A".decode("shift-jis") u'\u3001' So in on decoding, it may be the second byte (i.e. 
the ASCII byte) that causes a problem: py> "\x81/".decode("shift-jis") Traceback (most recent call last): File "", line 1, in UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 0-1: illegal multibyte sequence For the shift-jis codec, that's actually not a problem, though: py> b"\x81/".decode("shift-jis","utf8b") '\udc81/' so the utf8b error handler will escape the first of the two bytes, and then pass the second byte to the codec again, which then decodes as ASCII. Regards, Martin From martin at v.loewis.de Thu May 7 08:16:11 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Thu, 07 May 2009 08:16:11 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A027904.7040602@g.nevcal.com> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A021646.8030904@v.loewis.de> <4A027743.2050500@v.loewis.de> <4A027904.7040602@g.nevcal.com> Message-ID: <4A027CAB.5070708@v.loewis.de> >> So are you proposing that I should rename the PEP 383 handler >> to "utf_8b_encoder_invalid_codepoints"? > > > No, he's saying that your algorithm for choosing the PEP 383 handler > should have come up with that name, rather than utf8b. But since PEP > 383 applies to other codecs besides UTF-8, it should have a different > name. And one that is less cumbersome than > "utf_8b_encoder_invalid_codepoints" I'm still at a loss what name to give it, though. I understand that I have to rename both error handlers, but I'm uncertain what I should rename them to. So proposals that rename only one of them aren't that helpful. It would be helpful if people would indicate support for Antoine's proposal. Regards, Martin From v+python at g.nevcal.com Thu May 7 08:00:36 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 06 May 2009 23:00:36 -0700 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A027743.2050500@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A021646.8030904@v.loewis.de> <4A027743.2050500@v.loewis.de> Message-ID: <4A027904.7040602@g.nevcal.com> On approximately 5/6/2009 10:53 PM, came the following characters from the keyboard of Martin v. L?wis: >> The error handler designed with utf-8 in mind has no name in the encode >> direction and is called "utf_8b_decoder_invalid_bytes" in the decode >> direction. By your reasoning, *that* should be its name in Python. The >> encoding error handler would then be named analogously >> "utf_8b_encoder_invalid_codepoints". Even these, to me, would be better >> than confusing giving them the same name as the codec. > > So are you proposing that I should rename the PEP 383 handler > to "utf_8b_encoder_invalid_codepoints"? No, he's saying that your algorithm for choosing the PEP 383 handler should have come up with that name, rather than utf8b. But since PEP 383 applies to other codecs besides UTF-8, it should have a different name. 
And one that is less cumbersome than "utf_8b_encoder_invalid_codepoints" -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From martin at v.loewis.de Thu May 7 08:37:36 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Thu, 07 May 2009 08:37:36 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A028090.6060405@g.nevcal.com> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A021646.8030904@v.loewis.de> <4A027743.2050500@v.loewis.de> <4A027904.7040602@g.nevcal.com> <4A027CAB.5070708@v.loewis.de> <4A028090.6060405@g.nevcal.com> Message-ID: <4A0281B0.9070303@v.loewis.de> > Wouldn't renaming the existing "surrogates" handler be an incompatible > change, and thus inappropriate? No - it's new in Python 3.1. So what do you think about Antoine's proposal? Regards, Martin From v+python at g.nevcal.com Thu May 7 08:32:48 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 06 May 2009 23:32:48 -0700 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A027CAB.5070708@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A021646.8030904@v.loewis.de> <4A027743.2050500@v.loewis.de> <4A027904.7040602@g.nevcal.com> <4A027CAB.5070708@v.loewis.de> Message-ID: <4A028090.6060405@g.nevcal.com> On approximately 5/6/2009 11:16 PM, came the following characters from the keyboard of Martin v. L?wis: >>> So are you proposing that I should rename the PEP 383 handler >>> to "utf_8b_encoder_invalid_codepoints"? >> >> No, he's saying that your algorithm for choosing the PEP 383 handler >> should have come up with that name, rather than utf8b. But since PEP >> 383 applies to other codecs besides UTF-8, it should have a different >> name. And one that is less cumbersome than >> "utf_8b_encoder_invalid_codepoints" > > I'm still at a loss what name to give it, though. I understand that > I have to rename both error handlers, but I'm uncertain what I should > rename them to. So proposals that rename only one of them aren't > that helpful. It would be helpful if people would indicate support > for Antoine's proposal. Wouldn't renaming the existing "surrogates" handler be an incompatible change, and thus inappropriate? I assume that is the second handler you are referring to? "bytes-as-lone-surrogates" That would be very descriptive of the decode case for PEP 383, but very long. One problem with the word "surrogates" is that anything you add to it makes it too long. "bytes-ls" This is short, but a meaningless as is -- however, adding the understanding via documentation that "ls" means "lone surrogates" would make it meaningful, and mnemonic. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. 
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From mal at egenix.com Thu May 7 11:21:28 2009 From: mal at egenix.com (M.-A. Lemburg) Date: Thu, 07 May 2009 11:21:28 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A01F982.2030205@v.loewis.de> <4A021A42.4060509@v.loewis.de> Message-ID: <4A02A818.4000204@egenix.com> Antoine Pitrou wrote: > Martin v. L?wis v.loewis.de> writes: >> py> b'\xed\xa0\x80'.decode("utf-8","surrogates") >> '\ud800' > > The point is, "surrogates" does not mean anything intuitive for an /error > handler/. You seem to be the only one who finds this name explicit enough, > perhaps because you chose it. > Most other handlers' names have verbs in them ("ignore", "replace", > "xmlcharrefreplace", etc.). Correct. The purpose of an error handler name is to indicate to the user what it does, hence the use of verbs. Walter started with "xmlcharrefreplace", ie. no space names, so "surrogatereplace" would be the logically correct name for the "replace with lone surrogates" scheme invented by Markus Kuhn. The error handler for undoing this operation (ie. when converting a Unicode string to some other encoding) should probably use the same name based on symmetry and the fact that the escaping scheme is meant to be used for enabling round-trip safety. BTW: It would also be appropriate to reference Markus Kuhn in the PEP as the inventor of the escaping scheme. Even if only to give the reader an idea of how that scheme works and why (the PEP on python.org currently doesn't explain this). It should also explain that the scheme is meant to assure round-trip safety and doesn't necessarily work when using transcoding, ie. reading using one encoding, writing using another. Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 07 2009) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2009-06-29: EuroPython 2009, Birmingham, UK 52 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From cournape at gmail.com Thu May 7 11:50:18 2009 From: cournape at gmail.com (David Cournapeau) Date: Thu, 7 May 2009 18:50:18 +0900 Subject: [Python-Dev] Help on issue 5941 In-Reply-To: <94bdd2610905060201s2590144dp386d33773338d923@mail.gmail.com> References: <94bdd2610905060201s2590144dp386d33773338d923@mail.gmail.com> Message-ID: <5b8d13220905070250m694f62d1uf311fde0f5203e8d@mail.gmail.com> On Wed, May 6, 2009 at 6:01 PM, Tarek Ziad? wrote: > Hello, > > I need some help on http://bugs.python.org/issue5941 > > The bug is quite simple: the Distutils unixcompiler used to set the > archiver command to "ar -rc". 
> > For quite a while now, this behavior has changed in order to be able > to customize the compiler behavior from > the environment. That introduced a regression because the mechanism in > Distutils that looks for the > AR variable in the environment also looks into the Makefile of Python. > (in the Makefile then is os.environ) > > And as a matter of fact, AR is set to "ar" in there, so the -cr option > is not set anymore. > > So my question is : should I make a change into the Makefile by adding > for example a variable called AR_OPTIONS > then build the ar command with AR + AR_OPTIONS I think for consistency, it could be named ARFLAGS (this is the name usually used by configure scripts), and both should be overridable, like the other variables, in distutils.sysconfig.customize_compiler. Those flags should be used in Makefile.pre as well, instead of the hardcoded cr as currently used. Here is what I would try: - check for AR (already done in the configure script AFAICT) - if ARFLAGS is defined in the environment, use those, otherwise set ARFLAGS to cr - use ARFLAGS in the makefile Then, in the customize_compiler function, set archiver to $AR + $ARFLAGS. IOW, just copying the logic used for e.g. ldshared, I can prepare a patch if you want, cheers, David From ziade.tarek at gmail.com Thu May 7 12:07:01 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Thu, 7 May 2009 12:07:01 +0200 Subject: [Python-Dev] Help on issue 5941 In-Reply-To: <5b8d13220905070250m694f62d1uf311fde0f5203e8d@mail.gmail.com> References: <94bdd2610905060201s2590144dp386d33773338d923@mail.gmail.com> <5b8d13220905070250m694f62d1uf311fde0f5203e8d@mail.gmail.com> Message-ID: <94bdd2610905070307g5eec595cw9f3de6c296e70acc@mail.gmail.com> On Thu, May 7, 2009 at 11:50 AM, David Cournapeau wrote: > Then, in the customize_compiler function, set archiver to $AR + > $ARFLAGS. IOW, just copying the logic used for e.g. ldshared, > > I can prepare a patch if you want, I am ok on Distutils side, but I wouldn't mind some help on the makefile/configure side. Even if I could mimic what's in there, I am not confident enough yet. Please do so, by attaching your patch in the issue, Thanks Tarek -- Tarek Ziadé | http://ziade.org From ziade.tarek at gmail.com Thu May 7 13:49:36 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Thu, 7 May 2009 13:49:36 +0200 Subject: [Python-Dev] Help on issue 5941 In-Reply-To: <5b8d13220905070437x18bdf332m737e6a934d40566c@mail.gmail.com> References: <94bdd2610905060201s2590144dp386d33773338d923@mail.gmail.com> <5b8d13220905070250m694f62d1uf311fde0f5203e8d@mail.gmail.com> <94bdd2610905070307g5eec595cw9f3de6c296e70acc@mail.gmail.com> <5b8d13220905070437x18bdf332m737e6a934d40566c@mail.gmail.com> Message-ID: <94bdd2610905070449l5e565091ve2524f4d5e6522f1@mail.gmail.com> On Thu, May 7, 2009 at 1:37 PM, David Cournapeau wrote: > On Thu, May 7, 2009 at 7:07 PM, Tarek Ziadé wrote: >> On Thu, May 7, 2009 at 11:50 AM, David Cournapeau wrote: >>> Then, in the customize_compiler function, set archiver to $AR + >>> $ARFLAGS. IOW, just copying the logic used for e.g. ldshared, >>> >>> I can prepare a patch if you want, >> >> I am ok on Distutils side, but I wouldn't mind some help on the >> makefile/configure side > > Ok, I ended up making a patch for everything. I tested it on Linux, > where it fixed the issue while keeping the customization (both AR and > ARFLAGS can be customized through environment variables). 
> > numpy now builds under python 2.7, > > cheers, > > David > ok thanks David, I'll complete your patch with the test I have written for this issue and commit it so it's included in 2.7/3.1. Notice that from the beginning, the unixcompiler class options are never used if the option has been customized in distutils.sysconfig and present in the Makefile, so we need to clean this behavior as well at some point, and document the customization features. By the way, do you happen to have a buildbot or something that builds numpy ? If not it'll be very interesting: I wouldn't mind having one numpy track running on the Python trunk and receiving mails if something is broken. Regards Tarek -- Tarek Ziad? | http://ziade.org From cournape at gmail.com Thu May 7 14:11:46 2009 From: cournape at gmail.com (David Cournapeau) Date: Thu, 7 May 2009 21:11:46 +0900 Subject: [Python-Dev] Help on issue 5941 In-Reply-To: <94bdd2610905070449l5e565091ve2524f4d5e6522f1@mail.gmail.com> References: <94bdd2610905060201s2590144dp386d33773338d923@mail.gmail.com> <5b8d13220905070250m694f62d1uf311fde0f5203e8d@mail.gmail.com> <94bdd2610905070307g5eec595cw9f3de6c296e70acc@mail.gmail.com> <5b8d13220905070437x18bdf332m737e6a934d40566c@mail.gmail.com> <94bdd2610905070449l5e565091ve2524f4d5e6522f1@mail.gmail.com> Message-ID: <5b8d13220905070511q1f9f5d61u136c34dabefc0ca4@mail.gmail.com> On Thu, May 7, 2009 at 8:49 PM, Tarek Ziad? wrote: > > Notice that from the beginning, the unixcompiler class options are > never used if the option has been customized > in distutils.sysconfig and present in the Makefile, so we need to > clean this behavior as well at some point, and document > the customization features. Indeed, I have never bothered much with this part, though. Flags customization with distutils is too awkward to be useful in general for something like numpy IMHO, I just use scons instead when I need fine grained control. > By the way, do you happen to have a buildbot or something that builds numpy ? We have a buildbot: http://buildbot.scipy.org/ But I don't know if that's easy to set up such as both python and numpy are built from sources. > If not it'll be very interesting: ?I wouldn't mind having one numpy > track running on the Python trunk and receiving > mails if something is broken. Well, I would not mind either :) David From ziade.tarek at gmail.com Thu May 7 14:25:01 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Thu, 7 May 2009 14:25:01 +0200 Subject: [Python-Dev] Help on issue 5941 In-Reply-To: <5b8d13220905070511q1f9f5d61u136c34dabefc0ca4@mail.gmail.com> References: <94bdd2610905060201s2590144dp386d33773338d923@mail.gmail.com> <5b8d13220905070250m694f62d1uf311fde0f5203e8d@mail.gmail.com> <94bdd2610905070307g5eec595cw9f3de6c296e70acc@mail.gmail.com> <5b8d13220905070437x18bdf332m737e6a934d40566c@mail.gmail.com> <94bdd2610905070449l5e565091ve2524f4d5e6522f1@mail.gmail.com> <5b8d13220905070511q1f9f5d61u136c34dabefc0ca4@mail.gmail.com> Message-ID: <94bdd2610905070525k2f8392ecm3cd3ba2225a8d461@mail.gmail.com> On Thu, May 7, 2009 at 2:11 PM, David Cournapeau wrote: > But I don't know if that's easy to set up such as both python and > numpy are built from sources. 
I don't know about the numpy part, but the PyBots project code could be a source of inspiration for the Python part http://code.google.com/p/pybots/source/browse/trunk/master/community.cfg From benjamin at python.org Thu May 7 01:01:25 2009 From: benjamin at python.org (Benjamin Peterson) Date: Wed, 6 May 2009 18:01:25 -0500 Subject: [Python-Dev] [RELEASED] Python 3.1 beta 1 Message-ID: <1afaf6160905061601l1fac114ei4ffd0f4f35826640@mail.gmail.com> On behalf of the Python development team, I'm thrilled to announce the first and only beta release of Python 3.1. Python 3.1 focuses on the stabilization and optimization of features and changes Python 3.0 introduced. For example, the new I/O system has been rewritten in C for speed. File system APIs that use unicode strings now handle paths with undecodable bytes in them. [1] Other features include an ordered dictionary implementation and support for ttk Tile in Tkinter. For a more extensive list of changes in 3.1, see http://doc.python.org/dev/py3k/whatsnew/3.1.html or Misc/NEWS in the Python distribution. Please note that this is a beta release, and as such is not suitable for production environments. We continue to strive for a high degree of quality, but there are still some known problems and the feature sets have not been finalized. This beta is being released to solicit feedback and hopefully discover bugs, as well as allowing you to determine how changes in 3.1 might impact you. If you find things broken or incorrect, please submit a bug report at http://bugs.python.org For more information and downloadable distributions, see the Python 3.1 website: http://www.python.org/download/releases/3.1/ See PEP 375 for release schedule details: http://www.python.org/dev/peps/pep-0375/ Enjoy, -- Benjamin Benjamin Peterson benjamin at python.org Release Manager (on behalf of the entire python-dev team and 3.1's contributors) From walter at livinglogic.de Thu May 7 15:20:07 2009 From: walter at livinglogic.de (=?ISO-8859-1?Q?Walter_D=F6rwald?=) Date: Thu, 07 May 2009 15:20:07 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A02A818.4000204@egenix.com> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A01F982.2030205@v.loewis.de> <4A021A42.4060509@v.loewis.de> <4A02A818.4000204@egenix.com> Message-ID: <4A02E007.9070308@livinglogic.de> M.-A. Lemburg wrote: > Antoine Pitrou wrote: >> Martin v. L?wis v.loewis.de> writes: >>> py> b'\xed\xa0\x80'.decode("utf-8","surrogates") >>> '\ud800' >> The point is, "surrogates" does not mean anything intuitive for an /error >> handler/. You seem to be the only one who finds this name explicit enough, >> perhaps because you chose it. >> Most other handlers' names have verbs in them ("ignore", "replace", >> "xmlcharrefreplace", etc.). > > Correct. > > The purpose of an error handler name is to indicate to the user > what it does, hence the use of verbs. > > Walter started with "xmlcharrefreplace", ie. no space names, so > "surrogatereplace" would be the logically correct name for the > "replace with lone surrogates" scheme invented by Markus Kuhn. "surrogatepass" (for the "don't complain about lone half surrogates" handler) and "surrogatereplace" sound OK to me. 
However the other "...replace" handlers are destructive (i.e. when such a "...replace" handler is used for encoding, decoding will not produce the original unicode string). The purpose of the PEP 383 error handler however is to be roundtrip safe, so maybe we should choose a slightly different name? How about "surrogateescape"? > The error handler for undoing this operation (ie. when converting > a Unicode string to some other encoding) should probably use the > same name based on symmetry and the fact that the escaping > scheme is meant to be used for enabling round-trip safety. We have only one error handler registry, but we *can* have one error handler for both directions (encoding and decoding) as the error handler can simply check whether it got passed a UnicodeEncodeError or UnicodeDecodeError object. > BTW: It would also be appropriate to reference Markus Kuhn in the PEP > as the inventor of the escaping scheme. > > Even if only to give the reader an idea of how that scheme works and > why (the PEP on python.org currently doesn't explain this). > > It should also explain that the scheme is meant to assure round-trip > safety and doesn't necessarily work when using transcoding, ie. > reading using one encoding, writing using another. Servus, Walter From google at mrabarnett.plus.com Thu May 7 15:47:13 2009 From: google at mrabarnett.plus.com (MRAB) Date: Thu, 07 May 2009 14:47:13 +0100 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A0281B0.9070303@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A021646.8030904@v.loewis.de> <4A027743.2050500@v.loewis.de> <4A027904.7040602@g.nevcal.com> <4A027CAB.5070708@v.loewis.de> <4A028090.6060405@g.nevcal.com> <4A0281B0.9070303@v.loewis.de> Message-ID: <4A02E661.9040306@mrabarnett.plus.com> Martin v. L?wis wrote: >> Wouldn't renaming the existing "surrogates" handler be an incompatible >> change, and thus inappropriate? > > No - it's new in Python 3.1. > > So what do you think about Antoine's proposal? > +1 Although it looks like it would be without the '-' for consistency with existing error handlers. From murman at gmail.com Thu May 7 16:18:31 2009 From: murman at gmail.com (Michael Urman) Date: Thu, 7 May 2009 09:18:31 -0500 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A027502.5000901@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A027502.5000901@v.loewis.de> Message-ID: On Thu, May 7, 2009 at 00:43, "Martin v. L?wis" wrote: > Michael Urman wrote: >> On Wed, May 6, 2009 at 15:42, "Martin v. L?wis" wrote: >>> Despite there being also an error handler called "surrogates". >> >> Not that I have to be, but I'm not sold on the previous UTF-8 codec >> behavior becoming an error handler of the name "surrogates" for two >> reasons (I do respect the obvious PBP argument for the implementation, >> and have no better name - "lenient"?). > > PBP? Practicality beats purity. From a purity standpoint, the legacy invalid utf-8 seems more like an encoding than an error handler to me. 
From a practicality standpoint, it's presumably much more convenient to implement it on top of the new valid UTF-8 codec's behavior. And then any error handler needs a name. > Well, there is a way to stack error handlers, although it's not pretty: > [...] > codecs.register_error("surrogates_then_replace", > surrogates_then_replace) That mitigates my arguments significantly, although I'd rather see something like errors=('surrogates', 'replace') chain the handlers without additional registrations. But that's a different PEP or arbitrary change. :) >> The stacking argument also applies to the new utf8b behavior on encode >> (only, as it handles all errors on decode). This may be a YAGNI > > Indeed - in particular, as, in the primary application of this error > handler (i.e. file IO operations), there is no way of specifying > an additional error handler anyway. Would it be useful to allow setting this somewhere? It'd be analogous to setfsencoding, perhaps a setfsencodingerrors. It's not hard to imagine an application working on Windows where all Unicode characters are valid, and constructing backup filenames by adding some arbitrary character, or receiving them from a user who doesn't understand encodings. When this application is taken to a non-Unicode filesystem, without the ability to say "I really want a valid filename: so replace", that could get messy. But it may still be a YAGNI, or a "don't do that." -- Michael Urman From murman at gmail.com Thu May 7 16:31:11 2009 From: murman at gmail.com (Michael Urman) Date: Thu, 7 May 2009 09:31:11 -0500 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A027CAB.5070708@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A021646.8030904@v.loewis.de> <4A027743.2050500@v.loewis.de> <4A027904.7040602@g.nevcal.com> <4A027CAB.5070708@v.loewis.de> Message-ID: On Thu, May 7, 2009 at 01:16, "Martin v. Löwis" wrote: > I'm still at a loss what name to give it, though. I understand that > I have to rename both error handlers, but I'm uncertain what I should > rename them to. So proposals that rename only one of them aren't > that helpful. It would be helpful if people would indicate support > for Antoine's proposal. Part of the problem is they both allow byte sequences to decode to invalid Unicode strings, and in particular they both affect the same byte subsequences, and that brought us to the crossroads where we wanted to name both of them "surrogates". So I'll offer a few more colors, and try to get out of the way of choosing between them or the other proposed ones. :) I haven't come up with anything I like better than errors="lenient" for the old utf8 behavior handler; would errors="nonvalidating" be correct? It still seems to me that a new codec, perhaps "utf8-lenient", reads better. For the utf8b error handler, I could see any of errors="roundtrip", errors="roundtripreplace", errors="tosurrogate", errors="surrogatereplace", errors="surrogateescape", errors="binaryreplace", errors="binaryescape". This includes Antoine's proposal (sans hyphen). 
-- Michael Urman From walter at livinglogic.de Thu May 7 16:33:21 2009 From: walter at livinglogic.de (=?UTF-8?B?V2FsdGVyIETDtnJ3YWxk?=) Date: Thu, 07 May 2009 16:33:21 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A027502.5000901@v.loewis.de> Message-ID: <4A02F131.7020408@livinglogic.de> Michael Urman wrote: > [...] >> Well, there is a way to stack error handlers, although it's not pretty: >> [...] >> codecs.register_error("surrogates_then_replace", >> surrogates_then_replace) > > That mitigates my arguments significantly, although I'd rather see > something like errors=('surrogates', 'replace') chain the handlers > without additional registrations. But that's a different PEP or > arbitrary change. :) The first version of PEP 293 changed the errors argument to be a string or callable. This would have simplified handler stacking somewhat (because you don't have to register or lookup handlers) but it had the disadvantage that many "char *" arguments in the C API would have had to changed to "PyObject *". Changing the errors argument to a list of strings would have the same problem. Servus, Walter From google at mrabarnett.plus.com Thu May 7 17:08:49 2009 From: google at mrabarnett.plus.com (MRAB) Date: Thu, 07 May 2009 16:08:49 +0100 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A02F131.7020408@livinglogic.de> References: <49FD5300.6010906@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A027502.5000901@v.loewis.de> <4A02F131.7020408@livinglogic.de> Message-ID: <4A02F981.2080504@mrabarnett.plus.com> Walter D?rwald wrote: > Michael Urman wrote: > >> [...] >>> Well, there is a way to stack error handlers, although it's not pretty: >>> [...] >>> codecs.register_error("surrogates_then_replace", >>> surrogates_then_replace) >> That mitigates my arguments significantly, although I'd rather see >> something like errors=('surrogates', 'replace') chain the handlers >> without additional registrations. But that's a different PEP or >> arbitrary change. :) > > The first version of PEP 293 changed the errors argument to be a string > or callable. This would have simplified handler stacking somewhat > (because you don't have to register or lookup handlers) but it had the > disadvantage that many "char *" arguments in the C API would have had to > changed to "PyObject *". Changing the errors argument to a list of > strings would have the same problem. > A comma-separated or space-separated string, eg 'surrogates replace' or 'surrogates,replace'? It could be treated as handler stacking internally. 
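A minimal sketch of the stacking idea under discussion, using only the existing codecs machinery (the combine_errors helper and the handler names in the example are illustrative, not taken from any posted patch):

    import codecs

    def combine_errors(*names):
        # Chain existing error handlers by name: try each one in turn and
        # let the first that does not re-raise a UnicodeError win.
        handlers = [codecs.lookup_error(name) for name in names]
        combined = "+".join(names)

        def handler(exc):
            for h in handlers[:-1]:
                try:
                    return h(exc)
                except UnicodeError:
                    pass
            return handlers[-1](exc)  # the last handler decides (or raises)

        codecs.register_error(combined, handler)
        return combined

    # e.g. data.decode("utf-8", combine_errors("surrogates", "replace"))

Such a helper can live outside the standard library; nothing in the codec machinery itself needs to change for it to work.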
From martin at v.loewis.de Thu May 7 19:21:58 2009 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Thu, 07 May 2009 19:21:58 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A027502.5000901@v.loewis.de> Message-ID: <4A0318B6.6030808@v.loewis.de> >> Well, there is a way to stack error handlers, although it's not pretty: >> [...] >> codecs.register_error("surrogates_then_replace", >> surrogates_then_replace) > > That mitigates my arguments significantly, although I'd rather see > something like errors=('surrogates', 'replace') chain the handlers > without additional registrations. But that's a different PEP or > arbitrary change. :) I think you can provide something like errors=combine_errors('surrogates', 'replace') as a library function, and it doesn't have to be part of the standard library. >>> The stacking argument also applies to the new utf8b behavior on encode >>> (only, as it handles all errors on decode). This may be a YAGNI >> Indeed - in particular, as, in the primary application of this error >> handler (i.e. file IO operations), there is no way of specifying >> an addition error handler anyway. > > Would it be useful to allow setting this somewhere? I'm deliberately not proposing this as part of the PEP. First, it has enough features already, and is approved as-is; plus YAGNI. Regards, Martin From martin at v.loewis.de Thu May 7 19:23:57 2009 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Thu, 07 May 2009 19:23:57 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A021646.8030904@v.loewis.de> <4A027743.2050500@v.loewis.de> <4A027904.7040602@g.nevcal.com> <4A027CAB.5070708@v.loewis.de> Message-ID: <4A03192D.9000101@v.loewis.de> > I haven't come up with anything I like better than errors="lenient" > for the old utf8 behavior handler; would errors="nonvalidating" be > correct? I think either is fairly unspecific. > For the utf8b error handler, I could see any of errors="roundtrip", > errors="roundtripreplace", errors="tosurrogate", > errors="surrogatereplace", errors="surrogateescape", > errors="binaryreplace", errors="binaryescape". This includes Antoine's > proposal (sans hyphen). Giving multiple choices does not exactly make this proposal readily implementable :-) Regards, Martin From martin at v.loewis.de Thu May 7 19:27:07 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Thu, 07 May 2009 19:27:07 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A02A818.4000204@egenix.com> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A01F982.2030205@v.loewis.de> <4A021A42.4060509@v.loewis.de> <4A02A818.4000204@egenix.com> Message-ID: <4A0319EB.4040508@v.loewis.de> > The error handler for undoing this operation (ie. 
when converting > a Unicode string to some other encoding) should probably use the > same name based on symmetry and the fact that the escaping > scheme is meant to be used for enabling round-trip safety. Could you please familiarize yourself with the implementation before commenting further? Thanks, Martin From stephen at xemacs.org Thu May 7 20:20:59 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 08 May 2009 03:20:59 +0900 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A02E007.9070308@livinglogic.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A01F982.2030205@v.loewis.de> <4A021A42.4060509@v.loewis.de> <4A02A818.4000204@egenix.com> <4A02E007.9070308@livinglogic.de> Message-ID: <87bpq4g4fo.fsf@uwakimon.sk.tsukuba.ac.jp> Walter D?rwald writes: > "surrogatepass" (for the "don't complain about lone half surrogates" > handler) and "surrogatereplace" sound OK to me. However the other > "...replace" handlers are destructive (i.e. when such a "...replace" > handler is used for encoding, decoding will not produce the original > unicode string). That doesn't bother me in the slightest. "Replace" does not connote "destructive" or "non-destructive" to me; it connotes "substitution". The fact that other error handlers happen to be destructive doesn't affect that at all for me. YMMV. > The purpose of the PEP 383 error handler however is to be roundtrip > safe, so maybe we should choose a slightly different name? How > about "surrogateescape"? To me, "escape" has a strong connotation of a multicharacter representation of a single character, and that's not true here. How about "surrogatetranslate"? I still prefer "surrogatereplace", as it's slightly easier for me to type. From ndbecker2 at gmail.com Thu May 7 20:42:29 2009 From: ndbecker2 at gmail.com (Neal Becker) Date: Thu, 07 May 2009 14:42:29 -0400 Subject: [Python-Dev] typo in 8.1.3.1. Format Specification Mini-Language? Message-ID: "format_spec ::= [[fill]align][sign][#][0][width][.precision][type]" "The precision is ignored for integer values." In [36]: '%3x' % 10 Out[36]: ' a' In [37]: '%.3x' % 10 Out[37]: '00a' Apparently, precision is _not_ ignored? From tjreedy at udel.edu Thu May 7 20:57:56 2009 From: tjreedy at udel.edu (Terry Reedy) Date: Thu, 07 May 2009 14:57:56 -0400 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A027CAB.5070708@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A021646.8030904@v.loewis.de> <4A027743.2050500@v.loewis.de> <4A027904.7040602@g.nevcal.com> <4A027CAB.5070708@v.loewis.de> Message-ID: Martin v. L?wis wrote: >>> So are you proposing that I should rename the PEP 383 handler >>> to "utf_8b_encoder_invalid_codepoints"? >> >> No, he's saying that your algorithm for choosing the PEP 383 handler >> should have come up with that name, rather than utf8b. But since PEP >> 383 applies to other codecs besides UTF-8, it should have a different >> name. 
And one that is less cumbersome than >> "utf_8b_encoder_invalid_codepoints" Correct. Thank you Glenn. > > I'm still at a loss what name to give it, though. I understand that > I have to rename both error handlers, but I'm uncertain what I should > rename them to. So proposals that rename only one of them aren't > that helpful. It would be helpful if people would indicate support > for Antoine's proposal. Given your explanation of what the new 'surrogates' handler does (pass rather than reject erroneous surrogates), I think 'surrogates_pass' is fine. Thus, I considoer that and 'surrogates_excape' the best proposal the best so far and suggest that you make this pair the current status quo to be argued against and improved ... or not. tjr From eric at trueblade.com Thu May 7 21:25:50 2009 From: eric at trueblade.com (Eric Smith) Date: Thu, 07 May 2009 15:25:50 -0400 Subject: [Python-Dev] typo in 8.1.3.1. Format Specification Mini-Language? In-Reply-To: References: Message-ID: <4A0335BE.2020603@trueblade.com> Neal Becker wrote: > "format_spec ::= [[fill]align][sign][#][0][width][.precision][type]" > "The precision is ignored for integer values." > > In [36]: '%3x' % 10 > Out[36]: ' a' > > In [37]: '%.3x' % 10 > Out[37]: '00a' > > Apparently, precision is _not_ ignored? That section is talking about this: >>> format(10, '.3x') Traceback (most recent call last): File "", line 1, in ValueError: Precision not allowed in integer format specifier From eric at trueblade.com Thu May 7 21:27:27 2009 From: eric at trueblade.com (Eric Smith) Date: Thu, 07 May 2009 15:27:27 -0400 Subject: [Python-Dev] typo in 8.1.3.1. Format Specification Mini-Language? In-Reply-To: <4A0335BE.2020603@trueblade.com> References: <4A0335BE.2020603@trueblade.com> Message-ID: <4A03361F.9070204@trueblade.com> Eric Smith wrote: > Neal Becker wrote: >> "format_spec ::= [[fill]align][sign][#][0][width][.precision][type]" >> "The precision is ignored for integer values." >> >> In [36]: '%3x' % 10 >> Out[36]: ' a' >> >> In [37]: '%.3x' % 10 >> Out[37]: '00a' >> >> Apparently, precision is _not_ ignored? > > That section is talking about this: > > >>> format(10, '.3x') > Traceback (most recent call last): > File "", line 1, in > ValueError: Precision not allowed in integer format specifier So I guess it shouldn't say "is ignored", it should be "is not allowed". From tjreedy at udel.edu Thu May 7 21:35:11 2009 From: tjreedy at udel.edu (Terry Reedy) Date: Thu, 07 May 2009 15:35:11 -0400 Subject: [Python-Dev] typo in 8.1.3.1. Format Specification Mini-Language? In-Reply-To: References: Message-ID: Neal Becker wrote: > "format_spec ::= [[fill]align][sign][#][0][width][.precision][type]" > "The precision is ignored for integer values." > > In [36]: '%3x' % 10 > Out[36]: ' a' > > In [37]: '%.3x' % 10 > Out[37]: '00a' > > Apparently, precision is _not_ ignored? Apparent typo reports should go to the tracker, along with version information. In this case, the Format Specification Mini-Language is for the new str.format() and format() facilities, not for % formatting, which is described in Old String Formatting Operations. Ironically, you report does point to a doc problem: precision is actually not allowed for integer types. 
3.0.1 >> format(10, '3x') ' a' >>> format(10, '.3x') Traceback (most recent call last): File "", line 1, in format(10, '.3x') ValueError: Precision not allowed in integer format specifier >>> '{0:3x}'.format(10) ' a' >>> '{0:.3x}'.format(10) Traceback (most recent call last): File "", line 1, in '{0:.3x}'.format(10) ValueError: Precision not allowed in integer format specifier http://bugs.python.org/issue5963 Terry Jan Reedy From martin at v.loewis.de Thu May 7 21:39:12 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Thu, 07 May 2009 21:39:12 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A021646.8030904@v.loewis.de> <4A027743.2050500@v.loewis.de> <4A027904.7040602@g.nevcal.com> <4A027CAB.5070708@v.loewis.de> Message-ID: <4A0338E0.6070202@v.loewis.de> > Given your explanation of what the new 'surrogates' handler does (pass > rather than reject erroneous surrogates), I think 'surrogates_pass' is > fine. Thus, I considoer that and 'surrogates_excape' the best proposal > the best so far and suggest that you make this pair the current status > quo to be argued against and improved ... or not. That's exactly what I want to avoid: more bike-shedding. If this is now changed, it cannot be possibly be argued against and improved - it would be final, end of discussion (please!!!). So I'm happy to make it "surrogatepass" and "surrogateescape" as proposed by Walter. I'm sure you didn't really mean the spelling of "excape" to be taken literally - whether or not you meant the plural and the underscore literally, I cannot tell. Stephen Turnbull approved singular, so that's good enough for me. Regards, Martin From greg at krypto.org Thu May 7 22:26:08 2009 From: greg at krypto.org (Gregory P. Smith) Date: Thu, 7 May 2009 13:26:08 -0700 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A0338E0.6070202@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <4A01F61B.1000203@v.loewis.de> <4A021646.8030904@v.loewis.de> <4A027743.2050500@v.loewis.de> <4A027904.7040602@g.nevcal.com> <4A027CAB.5070708@v.loewis.de> <4A0338E0.6070202@v.loewis.de> Message-ID: <52dc1c820905071326y208062bfv64d631dc2d7fadfa@mail.gmail.com> On Thu, May 7, 2009 at 12:39 PM, "Martin v. L?wis" wrote: >> Given your explanation of what the new 'surrogates' handler does (pass >> rather than reject erroneous surrogates), I think 'surrogates_pass' is >> fine. ?Thus, I considoer that and 'surrogates_excape' the best proposal >> the best so far and suggest that you make this pair the current status >> quo to be argued against and improved ... or not. > > That's exactly what I want to avoid: more bike-shedding. If this is now > changed, it cannot be possibly be argued against and improved - it would > be final, end of discussion (please!!!). > > So I'm happy to make it "surrogatepass" and "surrogateescape" as > proposed by Walter. I'm sure you didn't really mean the spelling of > "excape" to be taken literally - whether or not you meant the plural > and the underscore literally, I cannot tell. Stephen Turnbull approved > singular, so that's good enough for me. singular is good. +1 on these names. 
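For readers following the thread, this is roughly the behaviour that has just been given the names "surrogateescape" and "surrogatepass" (a sketch of the intended semantics as described above, not output from the committed patch):

    raw = b"abc\xff\xfe"                     # not valid UTF-8
    text = raw.decode("utf-8", "surrogateescape")
    # each undecodable byte becomes a lone low surrogate: 'abc\udcff\udcfe'
    assert text.encode("utf-8", "surrogateescape") == raw   # round-trips

    # "surrogatepass" simply lets lone surrogates through instead of raising:
    assert "\ud800".encode("utf-8", "surrogatepass") == b"\xed\xa0\x80"
    assert b"\xed\xa0\x80".decode("utf-8", "surrogatepass") == "\ud800"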
From eric at trueblade.com Thu May 7 23:36:08 2009 From: eric at trueblade.com (Eric Smith) Date: Thu, 07 May 2009 17:36:08 -0400 Subject: [Python-Dev] py3k build broken Message-ID: <4A035448.6010008@trueblade.com> Tarek: With your ARFLAGS change, I now get the following error on a 32 bit Fedora 6 box. I've done "make distclean" and "./configure": $ make ... gcc -pthread -fno-strict-aliasing -g -Wall -Wstrict-prototypes -I. -IInclude -I./Include -DPy_BUILD_CORE -I./Modules/_io -c ./Modules/_io/textio.c -o Modules/textio.o gcc -pthread -fno-strict-aliasing -g -Wall -Wstrict-prototypes -I. -IInclude -I./Include -DPy_BUILD_CORE -I./Modules/_io -c ./Modules/_io/stringio.c -o Modules/stringio.o gcc -pthread -fno-strict-aliasing -g -Wall -Wstrict-prototypes -I. -IInclude -I./Include -DPy_BUILD_CORE -c ./Modules/zipimport.c -o Modules/zipimport.o ./Modules/zipimport.c: In function ‘get_module_code’: ./Modules/zipimport.c:1132: warning: format ‘%c’ expects type ‘int’, but argument 3 has type ‘long int’ gcc -pthread -fno-strict-aliasing -g -Wall -Wstrict-prototypes -I. -IInclude -I./Include -DPy_BUILD_CORE -c ./Modules/symtablemodule.c -o Modules/symtablemodule.o gcc -pthread -fno-strict-aliasing -g -Wall -Wstrict-prototypes -I. -IInclude -I./Include -DPy_BUILD_CORE -c ./Modules/xxsubtype.c -o Modules/xxsubtype.o gcc -pthread -c -fno-strict-aliasing -g -Wall -Wstrict-prototypes -I. -IInclude -I./Include -DPy_BUILD_CORE -DSVNVERSION=\"`LC_ALL=C svnversion .`\" -o Modules/getbuildinfo.o ./Modules/getbuildinfo.c rm -f libpython3.1.a ar @ARFLAGS@ libpython3.1.a Modules/getbuildinfo.o ar: illegal option -- @ Usage: ar [emulation options] [-]{dmpqrstx}[abcfilNoPsSuvV] [member-name] [count] archive-file file... ar -M [ - read options from emulation options: No emulation specific options ar: supported targets: elf32-i386 a.out-i386-linux efi-app-ia32 elf32-little elf32-big srec symbolsrec tekhex binary ihex trad-core make: *** [libpython3.1.a] Error 1 
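The literal @ARFLAGS@ in the ar invocation is the giveaway: configure did not substitute the new variable into the generated Makefile. A quick sanity check from Python (illustrative only; get_config_var simply reports what ended up in the Makefile):

    from distutils import sysconfig
    print(sysconfig.get_config_var("AR"))       # e.g. 'ar'
    print(sysconfig.get_config_var("ARFLAGS"))  # a literal '@ARFLAGS@' here means
                                                # autoconf was not re-run after the change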
L?wis wrote: >> Given your explanation of what the new 'surrogates' handler does (pass >> rather than reject erroneous surrogates), I think 'surrogates_pass' is >> fine. Thus, I considoer that and 'surrogates_excape' the best proposal >> the best so far and suggest that you make this pair the current status >> quo to be argued against and improved ... or not. > > That's exactly what I want to avoid: more bike-shedding. If this is now > changed, it cannot be possibly be argued against and improved - it would > be final, end of discussion (please!!!). > > So I'm happy to make it "surrogatepass" and "surrogateescape" as > proposed by Walter. I'm sure you didn't really mean the spelling of > "excape" to be taken literally - whether or not you meant the plural > and the underscore literally, I cannot tell. Stephen Turnbull approved > singular, so that's good enough for me. Those minor tweaks for consistency with existing names are what I meant by 'improve' (with good arguments) and I approve of them also. +1 on stopping here. From eric at trueblade.com Thu May 7 23:51:32 2009 From: eric at trueblade.com (Eric Smith) Date: Thu, 07 May 2009 17:51:32 -0400 Subject: [Python-Dev] py3k build broken In-Reply-To: <94bdd2610905071446r6bc60c57j6784c8c34d268437@mail.gmail.com> References: <4A035448.6010008@trueblade.com> <94bdd2610905071446r6bc60c57j6784c8c34d268437@mail.gmail.com> Message-ID: <4A0357E4.4050209@trueblade.com> Tarek Ziad? wrote: > On Thu, May 7, 2009 at 11:36 PM, Eric Smith wrote: >> With you ARFLAGS change, I now get the following error on a 32 bit Fedora 6 >> box. I've done "make distclean" and "./configure": > > Sorry yes, I am on it now, the produced Makefile is broken, until then > you can change it ... No problem. I'll wait. From tjreedy at udel.edu Thu May 7 23:51:10 2009 From: tjreedy at udel.edu (Terry Reedy) Date: Thu, 07 May 2009 17:51:10 -0400 Subject: [Python-Dev] typo in 8.1.3.1. Format Specification Mini-Language? In-Reply-To: <4A03361F.9070204@trueblade.com> References: <4A0335BE.2020603@trueblade.com> <4A03361F.9070204@trueblade.com> Message-ID: Eric Smith wrote: > Eric Smith wrote: >> Neal Becker wrote: >>> "format_spec ::= [[fill]align][sign][#][0][width][.precision][type]" >>> "The precision is ignored for integer values." >>> >>> In [36]: '%3x' % 10 >>> Out[36]: ' a' >>> >>> In [37]: '%.3x' % 10 >>> Out[37]: '00a' >>> >>> Apparently, precision is _not_ ignored? >> >> That section is talking about this: >> >> >>> format(10, '.3x') >> Traceback (most recent call last): >> File "", line 1, in >> ValueError: Precision not allowed in integer format specifier > > So I guess it shouldn't say "is ignored", it should be "is not allowed". My exact suggestion in http://bugs.python.org/issue5963 From ziade.tarek at gmail.com Fri May 8 00:23:10 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Fri, 8 May 2009 00:23:10 +0200 Subject: [Python-Dev] py3k build broken In-Reply-To: <4A0357E4.4050209@trueblade.com> References: <4A035448.6010008@trueblade.com> <94bdd2610905071446r6bc60c57j6784c8c34d268437@mail.gmail.com> <4A0357E4.4050209@trueblade.com> Message-ID: <94bdd2610905071523h3c3f07o2e740c8051525f39@mail.gmail.com> On Thu, May 7, 2009 at 11:51 PM, Eric Smith wrote: > Tarek Ziad? wrote: >> >> On Thu, May 7, 2009 at 11:36 PM, Eric Smith wrote: >>> >>> With you ARFLAGS change, I now get the following error on a 32 bit Fedora >>> 6 >>> box. 
I've done "make distclean" and "./configure": >> >> Sorry yes, I am on it now, the produced Makefile is broken, until then >> you can change it > > ... > > No problem. I'll wait. I have fixed configure by runing autoconf, everything should be fine now Sorry for the inconvenience. Tarek From google at mrabarnett.plus.com Fri May 8 00:27:08 2009 From: google at mrabarnett.plus.com (MRAB) Date: Thu, 07 May 2009 23:27:08 +0100 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A021646.8030904@v.loewis.de> <4A027743.2050500@v.loewis.de> <4A027904.7040602@g.nevcal.com> <4A027CAB.5070708@v.loewis.de> <4A0338E0.6070202@v.loewis.de> Message-ID: <4A03603C.40008@mrabarnett.plus.com> Terry Reedy wrote: > Martin v. L?wis wrote: >>> Given your explanation of what the new 'surrogates' handler does (pass >>> rather than reject erroneous surrogates), I think 'surrogates_pass' is >>> fine. Thus, I considoer that and 'surrogates_excape' the best proposal >>> the best so far and suggest that you make this pair the current status >>> quo to be argued against and improved ... or not. >> >> That's exactly what I want to avoid: more bike-shedding. If this is now >> changed, it cannot be possibly be argued against and improved - it would >> be final, end of discussion (please!!!). >> >> So I'm happy to make it "surrogatepass" and "surrogateescape" as >> proposed by Walter. I'm sure you didn't really mean the spelling of >> "excape" to be taken literally - whether or not you meant the plural >> and the underscore literally, I cannot tell. Stephen Turnbull approved >> singular, so that's good enough for me. > > Those minor tweaks for consistency with existing names are what I meant > by 'improve' (with good arguments) and I approve of them also. +1 on > stopping here. > We argue because we care. :-) From eric at trueblade.com Fri May 8 00:49:21 2009 From: eric at trueblade.com (Eric Smith) Date: Thu, 07 May 2009 18:49:21 -0400 Subject: [Python-Dev] py3k build broken In-Reply-To: <94bdd2610905071523h3c3f07o2e740c8051525f39@mail.gmail.com> References: <4A035448.6010008@trueblade.com> <94bdd2610905071446r6bc60c57j6784c8c34d268437@mail.gmail.com> <4A0357E4.4050209@trueblade.com> <94bdd2610905071523h3c3f07o2e740c8051525f39@mail.gmail.com> Message-ID: <4A036571.1070101@trueblade.com> Tarek Ziad? wrote: > I have fixed configure by runing autoconf, everything should be fine now And indeed, it's working fine now, thanks. > Sorry for the inconvenience. Not a problem. Anyone who volunteers for autoconf work gets a free pass from me. Eric. From mal at egenix.com Fri May 8 00:50:21 2009 From: mal at egenix.com (M.-A. 
Lemburg) Date: Fri, 08 May 2009 00:50:21 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A0319EB.4040508@v.loewis.de> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A01F982.2030205@v.loewis.de> <4A021A42.4060509@v.loewis.de> <4A02A818.4000204@egenix.com> <4A0319EB.4040508@v.loewis.de> Message-ID: <4A0365AD.5070506@egenix.com> Martin v. L?wis wrote: >> The error handler for undoing this operation (ie. when converting >> a Unicode string to some other encoding) should probably use the >> same name based on symmetry and the fact that the escaping >> scheme is meant to be used for enabling round-trip safety. > > Could you please familiarize yourself with the implementation > before commenting further? I did and it already uses the same (wrong) name for both encoding and decoding handlers which is good. The reason for my above comment was that the thread mentions two different names for the handler depending on the direction, e.g. "surrogatereplace" and "surrogatepass". I guess that "surrogatepass" was just an attempt to find a new name for the "surrogates" error handler (which also doesn't match the naming scheme) and that got me confused. I'd use "allowlonesurrogates" as name for the "surrogates" error handler and "lonesurrogatereplace" for the "utf8b" one. Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 08 2009) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2009-06-29: EuroPython 2009, Birmingham, UK 51 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From brett at python.org Fri May 8 01:29:51 2009 From: brett at python.org (Brett Cannon) Date: Thu, 7 May 2009 16:29:51 -0700 Subject: [Python-Dev] Easy way to detect filesystem case-sensitivity? Message-ID: [my python-dev sabbatical is still in effect, so make sure I am at least cc'ed on any replies to this email] I cannot be the only person who has a need to run tests conditionally based on whether the file system is case-sensitive or not, so I feel like I am re-inventing the wheel for issue 5442 to handle OS X with a case-sensitive filesystem. Is there a boolean somewhere that I can simply check or get to know whether the filesystem is case-sensitive? -Brett -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Fri May 8 01:39:41 2009 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 07 May 2009 18:39:41 -0500 Subject: [Python-Dev] Easy way to detect filesystem case-sensitivity? 
In-Reply-To: References: Message-ID: <4A03713D.1020407@gmail.com> On 2009-05-07 18:29, Brett Cannon wrote: > [my python-dev sabbatical is still in effect, so make sure I am at least > cc'ed on any replies to this email] > > I cannot be the only person who has a need to run tests conditionally > based on whether the file system is case-sensitive or not, so I feel > like I am re-inventing the wheel for issue 5442 to handle OS X with a > case-sensitive filesystem. Is there a boolean somewhere that I can > simply check or get to know whether the filesystem is case-sensitive? Since one may have more than one filesystem side-by-side, this can't be just be a system-wide boolean somewhere. One would have to query the target directory for this information. I am not aware of the existence of code that does such a query, though. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco From solipsis at pitrou.net Fri May 8 01:48:29 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Thu, 7 May 2009 23:48:29 +0000 (UTC) Subject: [Python-Dev] Easy way to detect filesystem case-sensitivity? References: <4A03713D.1020407@gmail.com> Message-ID: Robert Kern gmail.com> writes: > > Since one may have more than one filesystem side-by-side, this can't be just be > a system-wide boolean somewhere. One would have to query the target directory > for this information. I am not aware of the existence of code that does such a > query, though. Or you can just be practical and test for it. Create a file "foobar" and see if you can open "FOOBAR" in read mode... Regards Antoine. From ziade.tarek at gmail.com Fri May 8 02:36:51 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Fri, 8 May 2009 02:36:51 +0200 Subject: [Python-Dev] Adding a "sysconfig" module in the stdlib Message-ID: <94bdd2610905071736wa6a86awa1a7cb30a6f6e775@mail.gmail.com> Hello, I am trying to refactor distutils.log in order to use logging but I have been bugged by the fact that site.py uses distutils.util.get_platform() in "addbuilddir". The problem is the order of imports at initialization time : importing "logging" into distutils will make the initialization/build fail because site.py wil break when trying to import "logging", then "time". Anyways, So why site.py looks into distutils ? because distutils has a few functions to get some info about the platform and about the Makefile and some other header files like pyconfig.h etc. But I don't think it's the best place for this, and I have a proposal : let's create a dedicated "sysconfig" module in the standard library that will provide all the (refactored) functions located in distutils.sysconfig (but not customize_compiler) and disutils.util.get_platform. This module can be used by site.py, by distutils, and others, and will focus on this role. Regards Tarek -- Tarek Ziad? | http://ziade.org From andrew at bemusement.org Fri May 8 02:24:05 2009 From: andrew at bemusement.org (Andrew Bennetts) Date: Fri, 8 May 2009 10:24:05 +1000 Subject: [Python-Dev] Easy way to detect filesystem case-sensitivity? In-Reply-To: References: <4A03713D.1020407@gmail.com> Message-ID: <20090508002405.GI10211@steerpike.home.puzzling.org> Antoine Pitrou wrote: > Robert Kern gmail.com> writes: > > > > Since one may have more than one filesystem side-by-side, this can't be just > be > > a system-wide boolean somewhere. 
One would have to query the target directory > > for this information. I am not aware of the existence of code that does such > a > > query, though. > > Or you can just be practical and test for it. Create a file "foobar" and see if > you can open "FOOBAR" in read mode... Agreed. That is how Bazaar's test suite detects this, and it works well. -Andrew. From v+python at g.nevcal.com Fri May 8 02:33:02 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 07 May 2009 17:33:02 -0700 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A03603C.40008@mrabarnett.plus.com> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A021646.8030904@v.loewis.de> <4A027743.2050500@v.loewis.de> <4A027904.7040602@g.nevcal.com> <4A027CAB.5070708@v.loewis.de> <4A0338E0.6070202@v.loewis.de> <4A03603C.40008@mrabarnett.plus.com> Message-ID: <4A037DBE.6000701@g.nevcal.com> On approximately 5/7/2009 3:27 PM, came the following characters from the keyboard of MRAB: > Terry Reedy wrote: >> Martin v. L?wis wrote: >>> So I'm happy to make it "surrogatepass" and "surrogateescape" as These seem adequate. It is not what I would choose or suggest, but it is adequate, and it is unlikely you can delight everyone with your choice of names, or even someone else's choice of names. These at least have a logical justification for their meaning, and can be documented reasonably. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From aahz at pythoncraft.com Fri May 8 03:22:35 2009 From: aahz at pythoncraft.com (Aahz) Date: Thu, 7 May 2009 18:22:35 -0700 Subject: [Python-Dev] Adding a "sysconfig" module in the stdlib In-Reply-To: <94bdd2610905071736wa6a86awa1a7cb30a6f6e775@mail.gmail.com> References: <94bdd2610905071736wa6a86awa1a7cb30a6f6e775@mail.gmail.com> Message-ID: <20090508012235.GA25029@panix.com> On Fri, May 08, 2009, Tarek Ziad? wrote: > > This module can be used by site.py, by distutils, and others, and will > focus on this role. This should get kicked around on python-ideas; I don't think it will require a full-blown PEP unless there's disagreement about what it should contain. -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ "It is easier to optimize correct code than to correct optimized code." --Bill Harlan From john.arbash.meinel at gmail.com Fri May 8 03:56:02 2009 From: john.arbash.meinel at gmail.com (John Arbash Meinel) Date: Thu, 07 May 2009 20:56:02 -0500 Subject: [Python-Dev] Easy way to detect filesystem case-sensitivity? In-Reply-To: <20090508002405.GI10211@steerpike.home.puzzling.org> References: <4A03713D.1020407@gmail.com> <20090508002405.GI10211@steerpike.home.puzzling.org> Message-ID: <4A039132.2030006@gmail.com> Andrew Bennetts wrote: > Antoine Pitrou wrote: >> Robert Kern gmail.com> writes: >>> Since one may have more than one filesystem side-by-side, this can't be just >> be >>> a system-wide boolean somewhere. One would have to query the target directory >>> for this information. I am not aware of the existence of code that does such >> a >>> query, though. >> Or you can just be practical and test for it. 
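(A minimal sketch of that practical test, for readers following along -- it assumes the directory being probed is writable, and the helper and probe-file names here are invented:)

    import os

    def filesystem_is_case_sensitive(path):
        # Create a lower-case probe file, then look for an upper-case
        # spelling of the same name; a case-insensitive filesystem
        # will report that the upper-case name exists too.
        probe = os.path.join(path, 'case-probe.tmp')
        open(probe, 'w').close()
        try:
            return not os.path.exists(os.path.join(path, 'CASE-PROBE.TMP'))
        finally:
            os.remove(probe)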
Create a file "foobar" and see if >> you can open "FOOBAR" in read mode... > > Agreed. That is how Bazaar's test suite detects this, and it works well. > > -Andrew. Actually, I believe we do: open('format', 'wb').close() try: os.lstat('FoRmAt') except IOError, e: if e.errno == errno.ENOENT: ... I don't know that it really matters, just wanted to indicate we use 'lstat' rather than 'open()' to check. I could be wrong about the test suite, but I know that is what we do for 'live' files. (We always create a format file, so we know it is there to 'stat' it via a different name.) John =:-> From cournape at gmail.com Fri May 8 05:25:53 2009 From: cournape at gmail.com (David Cournapeau) Date: Fri, 8 May 2009 12:25:53 +0900 Subject: [Python-Dev] Adding a "sysconfig" module in the stdlib In-Reply-To: <94bdd2610905071736wa6a86awa1a7cb30a6f6e775@mail.gmail.com> References: <94bdd2610905071736wa6a86awa1a7cb30a6f6e775@mail.gmail.com> Message-ID: <5b8d13220905072025m522ce6e5pbfad73ebe18e3f30@mail.gmail.com> On Fri, May 8, 2009 at 9:36 AM, Tarek Ziad? wrote: > Hello, > > I am trying to refactor distutils.log in order to use logging but I > have been bugged by the fact that site.py uses > distutils.util.get_platform() in "addbuilddir". > The problem is the order of imports at initialization time : importing > "logging" into distutils will make the initialization/build fail > because site.py wil break when > trying to import "logging", then "time". > > Anyways, > So why site.py looks into distutils ? ?because distutils has a few > functions to get some info about the platform and about the Makefile > and some > other header files like pyconfig.h etc. > > But I don't think it's the best place for this, and I have a proposal : > > let's create a dedicated "sysconfig" module in the standard library > that will provide all the (refactored) functions located in > distutils.sysconfig (but not customize_compiler) > and disutils.util.get_platform. If we are talking about putting this into the stdlib proper, I would suggest thinking about putting information for every platform in sysconfig, instead of just Unix. I understand it is not an easy problem (because windows builds are totally different than every other platform), but it would really help for interoperability with other build tools. If sysconfig is to become independent of distutils, it should be cross platform and not unix specific. cheers, David From turnbull at sk.tsukuba.ac.jp Fri May 8 09:04:34 2009 From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull) Date: Fri, 08 May 2009 16:04:34 +0900 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <4A0365AD.5070506@egenix.com> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A01F982.2030205@v.loewis.de> <4A021A42.4060509@v.loewis.de> <4A02A818.4000204@egenix.com> <4A0319EB.4040508@v.loewis.de> <4A0365AD.5070506@egenix.com> Message-ID: <8763gcf531.fsf@uwakimon.sk.tsukuba.ac.jp> M.-A. Lemburg writes: > I'd use "allowlonesurrogates" as name for the "surrogates" error > handler and "lonesurrogatereplace" for the "utf8b" one. 
+1 From cournape at gmail.com Fri May 8 10:31:33 2009 From: cournape at gmail.com (David Cournapeau) Date: Fri, 8 May 2009 17:31:33 +0900 Subject: [Python-Dev] py3k build broken In-Reply-To: <94bdd2610905071523h3c3f07o2e740c8051525f39@mail.gmail.com> References: <4A035448.6010008@trueblade.com> <94bdd2610905071446r6bc60c57j6784c8c34d268437@mail.gmail.com> <4A0357E4.4050209@trueblade.com> <94bdd2610905071523h3c3f07o2e740c8051525f39@mail.gmail.com> Message-ID: <5b8d13220905080131i552914bfn241a374b9b6c9d2f@mail.gmail.com> On Fri, May 8, 2009 at 7:23 AM, Tarek Ziad? wrote: > I have fixed configure by runing autoconf, everything should be fine now > > Sorry for the inconvenience. I am the one responsible for this - I did not realize that the generated configure/Makefile were also in the trunk, and my patch did not include the generated files. My apologies, cheers, David From walter at livinglogic.de Fri May 8 10:34:19 2009 From: walter at livinglogic.de (=?ISO-8859-1?Q?Walter_D=F6rwald?=) Date: Fri, 08 May 2009 10:34:19 +0200 Subject: [Python-Dev] PEP 383 update: utf8b is now the error handler In-Reply-To: <87bpq4g4fo.fsf@uwakimon.sk.tsukuba.ac.jp> References: <49FD5300.6010906@v.loewis.de> <87d4anha1r.fsf@uwakimon.sk.tsukuba.ac.jp> <4A00A93D.3030204@v.loewis.de> <4A00D937.6080403@egenix.com> <4A013CB4.9010204@v.loewis.de> <4A015E08.5000203@egenix.com> <4A0161AD.6000605@mrabarnett.plus.com> <4A01C406.3030004@g.nevcal.com> <4A01F61B.1000203@v.loewis.de> <4A01F982.2030205@v.loewis.de> <4A021A42.4060509@v.loewis.de> <4A02A818.4000204@egenix.com> <4A02E007.9070308@livinglogic.de> <87bpq4g4fo.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <4A03EE8B.8040400@livinglogic.de> Stephen J. Turnbull wrote: > Walter D?rwald writes: > > > "surrogatepass" (for the "don't complain about lone half surrogates" > > handler) and "surrogatereplace" sound OK to me. However the other > > "...replace" handlers are destructive (i.e. when such a "...replace" > > handler is used for encoding, decoding will not produce the original > > unicode string). > > That doesn't bother me in the slightest. "Replace" does not connote > "destructive" or "non-destructive" to me; it connotes "substitution". > The fact that other error handlers happen to be destructive doesn't > affect that at all for me. YMMV. > > > The purpose of the PEP 383 error handler however is to be roundtrip > > safe, so maybe we should choose a slightly different name? How > > about "surrogateescape"? > > To me, "escape" has a strong connotation of a multicharacter > representation of a single character, and that's not true here. > > How about "surrogatetranslate"? I still prefer "surrogatereplace", as > it's slightly easier for me to type. I like "surrogatetranslate" better than "surrogateescape" better than "surrogatereplace". But I'll stop bikesheding now and let Martin decide. Servus, alter From kristjan at ccpgames.com Fri May 8 11:47:22 2009 From: kristjan at ccpgames.com (=?iso-8859-1?Q?Kristj=E1n_Valur_J=F3nsson?=) Date: Fri, 8 May 2009 09:47:22 +0000 Subject: [Python-Dev] feature request 5804 Message-ID: <930F189C8A437347B80DF2C156F7EC7F056E2D8E32@exchis.ccp.ad.local> Hello there. I have sumitted the following patch: Add an 'offset' argument to zlib.decompress http://bugs.python.org/issue5804 I'd be interested on getting some more feedback on it. Kristj?n -------------- next part -------------- An HTML attachment was scrubbed... 
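(For readers who have not opened the issue: the patch presumably targets the copy one pays for today when the compressed data sits at an offset inside a larger buffer -- a small sketch of the status quo, with invented framing:)

    import zlib

    header = b"hdr:"                      # made-up framing, for illustration only
    data = header + zlib.compress(b"payload")

    # Without an 'offset' argument the compressed part must first be
    # sliced out, which copies everything from the offset to the end:
    payload = zlib.decompress(data[len(header):])
    assert payload == b"payload"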
URL: From status at bugs.python.org Fri May 8 18:07:06 2009 From: status at bugs.python.org (Python tracker) Date: Fri, 8 May 2009 18:07:06 +0200 (CEST) Subject: [Python-Dev] Summary of Python tracker Issues Message-ID: <20090508160706.B81A6785D3@psf.upfronthosting.co.za> ACTIVITY SUMMARY (05/01/09 - 05/08/09) Python tracker at http://bugs.python.org/ To view or respond to any of the issues listed below, click on the issue number. Do NOT respond to this message. 2188 open (+45) / 15604 closed (+30) / 17792 total (+75) Open issues with patches: 848 Average duration of open issues: 646 days. Median duration of open issues: 396 days. Open Issues Breakdown open 2153 (+45) pending 34 ( +0) Issues Created Or Reopened (75) _______________________________ socketmodule.c on HPUX ia64 without _XOPEN_SOURCE_EXTENDED comp 05/01/09 http://bugs.python.org/issue5895 created ntai timeit documentation 05/01/09 http://bugs.python.org/issue5896 created hrfeels No library reference tree in chm help file 05/01/09 CLOSED http://bugs.python.org/issue5897 created suraj Hang in Popen.wait() when another process has been created 05/01/09 http://bugs.python.org/issue5898 created farialima test_capi crashes when called more than once 05/01/09 CLOSED http://bugs.python.org/issue5899 created pitrou Ensure RUNPATH is added to extension modules with RPATH if GNU l 05/01/09 http://bugs.python.org/issue5900 created flub patch missing meta-info in documentation pdf 05/02/09 http://bugs.python.org/issue5901 created ZeD Stricter codec names 05/02/09 http://bugs.python.org/issue5902 created ezio.melotti strftime fails in non UTF-8 locale 05/02/09 http://bugs.python.org/issue5903 created barry-scott strftime docs do not explain locale affect on result string 05/02/09 http://bugs.python.org/issue5904 created barry-scott strptime fails in non-UTF locale 05/02/09 http://bugs.python.org/issue5905 created pitrou Risk of confusion in multiprocessing module - daemonic processes 05/02/09 http://bugs.python.org/issue5906 created pakal repr of time.struct_time type does not eval 05/02/09 http://bugs.python.org/issue5907 created jwm I need to import the module in the same thread 05/02/09 http://bugs.python.org/issue5908 created tyoc Segfault in typeobject.c 05/02/09 CLOSED http://bugs.python.org/issue5909 created gbritton kqueue for more than one event is broken. 05/02/09 http://bugs.python.org/issue5910 created Erik Gorset patch built-in compile() should take encoding option. 05/03/09 http://bugs.python.org/issue5911 created naoki import deadlocks when using fork 05/03/09 http://bugs.python.org/issue5912 created abaron On Windows os.listdir('') -> cwd and os.listdir(u'') -> C:\ 05/03/09 CLOSED http://bugs.python.org/issue5913 created ezio.melotti patch Add PyOS_string_to_double function to C API 05/03/09 CLOSED http://bugs.python.org/issue5914 created marketdickinson patch PEP 383 implementation 05/03/09 CLOSED http://bugs.python.org/issue5915 created loewis patch Wrong function referenced in documentation of socket.inet_aton 05/03/09 CLOSED http://bugs.python.org/issue5916 created phihag patch Reference platform-independent alternative in socket.inet_ntop d 05/03/09 CLOSED http://bugs.python.org/issue5917 created phihag patch test_parser crashes when run after some other tests 05/04/09 http://bugs.python.org/issue5918 created pitrou patch pygettext documentation 05/04/09 CLOSED http://bugs.python.org/issue5919 created efrerich Confusing float formatting for empty presentation type. 
05/04/09 CLOSED http://bugs.python.org/issue5920 created marketdickinson patch PEP 362 can be marked as finished? 05/04/09 CLOSED http://bugs.python.org/issue5921 created stutzbach Multi-with patch 05/04/09 http://bugs.python.org/issue5922 created georg.brandl patch turtle.py update: 1.0 --> 1.1 05/04/09 CLOSED http://bugs.python.org/issue5923 created gregorlingl patch When setting complete PYTHONPATH on Python 3.x, paths in the PYT 05/04/09 http://bugs.python.org/issue5924 created fabioz Odd formatting differences of keywords in reference 05/04/09 CLOSED http://bugs.python.org/issue5925 created MLModel bdist_msi - add support for minimum Python version for pure Pyth 05/04/09 http://bugs.python.org/issue5926 created atuining Typo in library on xmlrpc 05/04/09 CLOSED http://bugs.python.org/issue5927 created JonathansCorner.com Missing space after period in xmlrpc library documentation 05/04/09 CLOSED http://bugs.python.org/issue5928 created JonathansCorner.com warnings in unicodeobject.c 05/04/09 CLOSED http://bugs.python.org/issue5929 created pitrou patch Transient error in multiprocessing 05/04/09 http://bugs.python.org/issue5930 created pitrou Python runtime name hardcoded in wsgiref.simple_server 05/04/09 http://bugs.python.org/issue5931 created thijs _json: _convertPyInt_AsSsize_t() never raise any error 05/05/09 CLOSED http://bugs.python.org/issue5932 created haypo patch fix gcc -Wextra warnings (compare signed/unsigned) 05/05/09 http://bugs.python.org/issue5933 created haypo patch fix gcc warnings: explicit type conversion for uid/gid in posix 05/05/09 http://bugs.python.org/issue5934 created haypo patch Better documentation of use of BROWSER environment variable 05/05/09 http://bugs.python.org/issue5935 created Eddie E Add MSI suport for uninstalling individual versions 05/05/09 http://bugs.python.org/issue5936 created bethard Problems with dbm documentation 05/05/09 http://bugs.python.org/issue5937 created MLModel Noddy examples haven't been updated to match PEP 3123 05/05/09 CLOSED http://bugs.python.org/issue5938 created larry Ensure that PyCapsule_GetPointer calls in ctypes handle errors a 05/05/09 http://bugs.python.org/issue5939 created larry Wrong type check in check_library_list 05/05/09 CLOSED http://bugs.python.org/issue5940 created cdavid customize_compiler broken 05/05/09 CLOSED http://bugs.python.org/issue5941 created cdavid patch Ambiguity in dbm.open flag documentation 05/05/09 http://bugs.python.org/issue5942 created MLModel Bus error in test_posix on Mac OS 05/05/09 CLOSED http://bugs.python.org/issue5943 created eric.smith patch test_os failure on OS X, probably related to PEP 383 05/05/09 CLOSED http://bugs.python.org/issue5944 created marketdickinson patch PyMapping_Check returns 1 for lists 05/05/09 http://bugs.python.org/issue5945 created jmillikin Fix spelling error in Capsule docs 05/05/09 CLOSED http://bugs.python.org/issue5946 created larry patch Deprecate CObject 05/05/09 CLOSED http://bugs.python.org/issue5947 created larry patch setlocale regression 05/06/09 CLOSED http://bugs.python.org/issue5948 created Kerfred IMAP4_SSL spin because of SSLSocket.suppress_ragged_eofs 05/06/09 http://bugs.python.org/issue5949 created kevinwatters zimport doesn't work with zipfile containing comments 05/06/09 http://bugs.python.org/issue5950 created dsamersoff email.message : get_payload args's documentation is confusing 05/06/09 http://bugs.python.org/issue5951 created trolldbois AttributeError exception in urllib.urlopen 05/07/09 CLOSED http://bugs.python.org/issue5952 
created sprigogin Add to "whats new": range(n) != range(n) 05/07/09 http://bugs.python.org/issue5953 created MLModel PyFrame_GetLineNumber 05/07/09 http://bugs.python.org/issue5954 created jyasskin patch, needs review aifc: close() does not close the underlying file 05/07/09 CLOSED http://bugs.python.org/issue5955 reopened amaury.forgeotdarc test_distutils fails for Python 3.1b1 on MacOS X 05/07/09 http://bugs.python.org/issue5956 created MrJean1 Possible mistake regarding writeback in documentation of shelve. 05/07/09 http://bugs.python.org/issue5957 created MLModel Typo in documentation of shelve.sync 05/07/09 CLOSED http://bugs.python.org/issue5958 created MLModel PyCode_NewEmpty 05/07/09 http://bugs.python.org/issue5959 created jyasskin patch, needs review Windows Installer Error 1722 when opting for compilation at inst 05/07/09 http://bugs.python.org/issue5960 created keldonin Missing labelside option for Tix option menu (fix included) 05/07/09 http://bugs.python.org/issue5961 created caryr Ambiguity about the semantics of sys.exit() and os._exit() in mu 05/07/09 http://bugs.python.org/issue5962 created pakal Doc error: integer precision in formats 05/07/09 CLOSED http://bugs.python.org/issue5963 created tjreedy WeakSet cmp methods 05/08/09 http://bugs.python.org/issue5964 created schuppenies Format Specs: doc 's' and implicit conversions 05/08/09 http://bugs.python.org/issue5965 created tjreedy unnecessary hardlink 05/08/09 http://bugs.python.org/issue5966 created exe PyList_GetSlice does not indicate negative ranges dont work as i 05/08/09 http://bugs.python.org/issue5967 created ideasman42 patch Generator expression bug? 05/08/09 CLOSED http://bugs.python.org/issue5968 created svenrahmann setup build with Platform SDK, finding vcvarsall.bat 05/08/09 http://bugs.python.org/issue5969 created MarcMarc Issues Now Closed (80) ______________________ str.format() wrongly formats complex() numbers (Py30a2) 510 days http://bugs.python.org/issue1588 marketdickinson patch shutil.copyfile blocks indefinitely on named pipes 337 days http://bugs.python.org/issue3002 pitrou patch FD leak in urllib2 328 days http://bugs.python.org/issue3066 gregory.p.smith IDLE opens window too low on Windows 303 days http://bugs.python.org/issue3286 gpolo Option to not-exit on test 290 days http://bugs.python.org/issue3379 michael.foord patch Ill-formed surrogates not treated as errors during encoding/deco 251 days http://bugs.python.org/issue3672 benjamin.peterson patch unicode-internal encoder reports wrong length 249 days http://bugs.python.org/issue3739 haypo patch Add Google's ipaddr.py to the stdlib 220 days http://bugs.python.org/issue3959 gregory.p.smith merge json library with latest simplejson 2.0.x 46 days http://bugs.python.org/issue4136 benjamin.peterson patch ctypes fails to build on mipsel-linux-gnu (detects mips instead 172 days http://bugs.python.org/issue4305 theller patch [PATCH] Better stacklevel for GzipFile.filename DeprecationWarni 170 days http://bugs.python.org/issue4351 pjenvey patch UTF7 encoding of slash (character 47) is incorrect 160 days http://bugs.python.org/issue4425 haypo UTF7 decoding is far too strict 160 days http://bugs.python.org/issue4426 pitrou patch, needs review Patch for better thread support in hashlib 128 days http://bugs.python.org/issue4751 gregory.p.smith patch Curses Unicode Support 129 days http://bugs.python.org/issue4787 asmodai find_library can return directories instead of files 118 days http://bugs.python.org/issue4875 theller unpickling does not intern 
attribute names 95 days http://bugs.python.org/issue5084 pitrou patch Invalid UTF-8 ("%s") length in PyUnicode_FromFormatV() 94 days http://bugs.python.org/issue5108 haypo patch pdb feature request: Ability to skip standard lib modules and ot 91 days http://bugs.python.org/issue5142 georg.brandl patch bdist_msi generates version number for pure Python packages 75 days http://bugs.python.org/issue5311 bethard patch Multicast example mcast.py is outdated and ugly 66 days http://bugs.python.org/issue5379 gregory.p.smith patch msvcrt bytes cleanup 59 days http://bugs.python.org/issue5410 benjamin.peterson patch Create alternative CObject API that is safe and clean 35 days http://bugs.python.org/issue5630 benjamin.peterson patch test__locale fails with RADIXCHAR on Windows 34 days http://bugs.python.org/issue5643 benjamin.peterson patch cleanUp stack for unittest 29 days http://bugs.python.org/issue5679 yaneurabeya patch test_zipfile fails under Windows 30 days http://bugs.python.org/issue5692 pitrou patch os.getpwent returns unsigned 32bit value, os.setuid refuses it 28 days http://bugs.python.org/issue5705 gregory.p.smith 64bit msi.py still tries to copy non-existent test/README 28 days http://bugs.python.org/issue5721 loewis patch 2.6.2c1 fails to pass test_cmath on Solaris10 26 days http://bugs.python.org/issue5724 marketdickinson patch ld_so_aix does exit successfully even in case of failure 22 days http://bugs.python.org/issue5726 pitrou patch Support telling TestResult objects a test run has finished 23 days http://bugs.python.org/issue5728 michael.foord patch Change ntpath functions to implicitly support UNC paths 16 days http://bugs.python.org/issue5799 eric.smith patch, needs review IDLE/Win Installer: drop -n switch for 2.7/3.1; install 3.1 as i 11 days http://bugs.python.org/issue5847 benjamin.peterson Full example for emulating a container type 6 days http://bugs.python.org/issue5850 yaneurabeya Make complex repr and str more like float repr and str 5 days http://bugs.python.org/issue5858 marketdickinson test_urllib fails on windows 8 days http://bugs.python.org/issue5861 orsenthil Remove extraneous backwards-compatibility attributes from some m 5 days http://bugs.python.org/issue5881 benjamin.peterson patch detach() implementation 2 days http://bugs.python.org/issue5883 benjamin.peterson patch mmap.write_byte out of bounds - no error, position gets screwed 6 days http://bugs.python.org/issue5887 bmearns Extra comma in enum - fails on AIX 1 days http://bugs.python.org/issue5889 georg.brandl Subclassing property doesn't preserve the auto __doc__ behavior 4 days http://bugs.python.org/issue5890 r.david.murray patch, needs review Add support to pydoc to output .rst restructured text 0 days http://bugs.python.org/issue5893 georg.brandl No library reference tree in chm help file 0 days http://bugs.python.org/issue5897 georg.brandl test_capi crashes when called more than once 4 days http://bugs.python.org/issue5899 benjamin.peterson Segfault in typeobject.c 5 days http://bugs.python.org/issue5909 amaury.forgeotdarc On Windows os.listdir('') -> cwd and os.listdir(u'') -> C:\ 1 days http://bugs.python.org/issue5913 ezio.melotti patch Add PyOS_string_to_double function to C API 0 days http://bugs.python.org/issue5914 marketdickinson patch PEP 383 implementation 1 days http://bugs.python.org/issue5915 loewis patch Wrong function referenced in documentation of socket.inet_aton 1 days http://bugs.python.org/issue5916 georg.brandl patch Reference platform-independent alternative in 
socket.inet_ntop d 1 days http://bugs.python.org/issue5917 georg.brandl patch pygettext documentation 1 days http://bugs.python.org/issue5919 georg.brandl Confusing float formatting for empty presentation type. 1 days http://bugs.python.org/issue5920 eric.smith patch PEP 362 can be marked as finished? 0 days http://bugs.python.org/issue5921 georg.brandl turtle.py update: 1.0 --> 1.1 1 days http://bugs.python.org/issue5923 georg.brandl patch Odd formatting differences of keywords in reference 0 days http://bugs.python.org/issue5925 georg.brandl Typo in library on xmlrpc 0 days http://bugs.python.org/issue5927 georg.brandl Missing space after period in xmlrpc library documentation 0 days http://bugs.python.org/issue5928 georg.brandl warnings in unicodeobject.c 1 days http://bugs.python.org/issue5929 georg.brandl patch _json: _convertPyInt_AsSsize_t() never raise any error 0 days http://bugs.python.org/issue5932 georg.brandl patch Noddy examples haven't been updated to match PEP 3123 0 days http://bugs.python.org/issue5938 larry Wrong type check in check_library_list 1 days http://bugs.python.org/issue5940 tarek customize_compiler broken 3 days http://bugs.python.org/issue5941 tarek patch Bus error in test_posix on Mac OS 0 days http://bugs.python.org/issue5943 loewis patch test_os failure on OS X, probably related to PEP 383 0 days http://bugs.python.org/issue5944 marketdickinson patch Fix spelling error in Capsule docs 0 days http://bugs.python.org/issue5946 georg.brandl patch Deprecate CObject 0 days http://bugs.python.org/issue5947 georg.brandl patch setlocale regression 0 days http://bugs.python.org/issue5948 georg.brandl AttributeError exception in urllib.urlopen 1 days http://bugs.python.org/issue5952 amaury.forgeotdarc aifc: close() does not close the underlying file 1 days http://bugs.python.org/issue5955 georg.brandl Typo in documentation of shelve.sync 1 days http://bugs.python.org/issue5958 MLModel Doc error: integer precision in formats 1 days http://bugs.python.org/issue5963 eric.smith Generator expression bug? 
0 days http://bugs.python.org/issue5968 r.david.murray urllib doesn't correct server returned urls 1875 days http://bugs.python.org/issue918368 orsenthil patch linecache.py::updatecache strips directory info from files 1629 days http://bugs.python.org/issue1068477 georg.brandl http_error_302() crashes with 'HTTP/1.1 400 Bad Request 1528 days http://bugs.python.org/issue1153027 orsenthil easy PEP 349: allow str() to return unicode 1350 days http://bugs.python.org/issue1266570 haypo patch linecache module returns wrong results 1313 days http://bugs.python.org/issue1309567 georg.brandl patch locale.getpreferredencoding() dies when setlocale fails 1158 days http://bugs.python.org/issue1443504 asmodai patch mailbox.Maildir re-reads directory too often 881 days http://bugs.python.org/issue1607951 akuchling patch linecache package handling 659 days http://bugs.python.org/issue1754483 georg.brandl patch Top Issues Most Discussed (10) ______________________________ 20 CVE-2008-5983 python: untrusted python modules search path 24 days open http://bugs.python.org/issue5753 19 test_asynchat fails on Mac OSX 18 days open http://bugs.python.org/issue5798 18 locale.getpreferredencoding() dies when setlocale fails 1158 days closed http://bugs.python.org/issue1443504 17 Change ntpath functions to implicitly support UNC paths 16 days closed http://bugs.python.org/issue5799 13 bdist_msi generates version number for pure Python packages 75 days closed http://bugs.python.org/issue5311 10 turtle.py update: 1.0 --> 1.1 1 days closed http://bugs.python.org/issue5923 9 test_os failure on OS X, probably related to PEP 383 0 days closed http://bugs.python.org/issue5944 9 customize_compiler broken 3 days closed http://bugs.python.org/issue5941 9 Add Google's ipaddr.py to the stdlib 220 days closed http://bugs.python.org/issue3959 9 Ill-formed surrogates not treated as errors during encoding/dec 251 days closed http://bugs.python.org/issue3672 From brett at python.org Fri May 8 18:52:40 2009 From: brett at python.org (Brett Cannon) Date: Fri, 8 May 2009 09:52:40 -0700 Subject: [Python-Dev] Easy way to detect filesystem case-sensitivity? In-Reply-To: <4A039132.2030006@gmail.com> References: <4A03713D.1020407@gmail.com> <20090508002405.GI10211@steerpike.home.puzzling.org> <4A039132.2030006@gmail.com> Message-ID: On Thu, May 7, 2009 at 18:56, John Arbash Meinel < john.arbash.meinel at gmail.com> wrote: > Andrew Bennetts wrote: > > Antoine Pitrou wrote: > >> Robert Kern gmail.com> writes: > >>> Since one may have more than one filesystem side-by-side, this can't be > just > >> be > >>> a system-wide boolean somewhere. One would have to query the target > directory > >>> for this information. I am not aware of the existence of code that does > such > >> a > >>> query, though. > >> Or you can just be practical and test for it. Create a file "foobar" and > see if > >> you can open "FOOBAR" in read mode... > > > > Agreed. That is how Bazaar's test suite detects this, and it works well. > > > > -Andrew. > > > Actually, I believe we do: > > open('format', 'wb').close() > try: > os.lstat('FoRmAt') > except IOError, e: > if e.errno == errno.ENOENT: > ... > > I don't know that it really matters, just wanted to indicate we use > 'lstat' rather than 'open()' to check. I could be wrong about the test > suite, but I know that is what we do for 'live' files. (We always create > a format file, so we know it is there to 'stat' it via a different name.) Thanks for the help to everyone. 
I ended up simply taking __file__, making it all uppercase (or lowercase if it is already uppercase) and then doing os.path.exists() on the modified name. Seems to work. -Brett -------------- next part -------------- An HTML attachment was scrubbed... URL: From google at mrabarnett.plus.com Fri May 8 19:01:55 2009 From: google at mrabarnett.plus.com (MRAB) Date: Fri, 08 May 2009 18:01:55 +0100 Subject: [Python-Dev] Easy way to detect filesystem case-sensitivity? In-Reply-To: References: <4A03713D.1020407@gmail.com> <20090508002405.GI10211@steerpike.home.puzzling.org> <4A039132.2030006@gmail.com> Message-ID: <4A046583.7060504@mrabarnett.plus.com> Brett Cannon wrote: > > > On Thu, May 7, 2009 at 18:56, John Arbash Meinel > > wrote: > > Andrew Bennetts wrote: > > Antoine Pitrou wrote: > >> Robert Kern gmail.com > writes: > >>> Since one may have more than one filesystem side-by-side, this > can't be just > >> be > >>> a system-wide boolean somewhere. One would have to query the > target directory > >>> for this information. I am not aware of the existence of code > that does such > >> a > >>> query, though. > >> Or you can just be practical and test for it. Create a file > "foobar" and see if > >> you can open "FOOBAR" in read mode... > > > > Agreed. That is how Bazaar's test suite detects this, and it > works well. > > > > -Andrew. > > > Actually, I believe we do: > > open('format', 'wb').close() > try: > os.lstat('FoRmAt') > except IOError, e: > if e.errno == errno.ENOENT: > ... > > I don't know that it really matters, just wanted to indicate we use > 'lstat' rather than 'open()' to check. I could be wrong about the test > suite, but I know that is what we do for 'live' files. (We always create > a format file, so we know it is there to 'stat' it via a different > name.) > > > Thanks for the help to everyone. I ended up simply taking __file__, > making it all uppercase (or lowercase if it is already uppercase) and > then doing os.path.exists() on the modified name. Seems to work. > Alternatively, use swapcase() and then os.path.exists(). From phd at phd.pp.ru Fri May 8 19:17:15 2009 From: phd at phd.pp.ru (Oleg Broytmann) Date: Fri, 8 May 2009 21:17:15 +0400 Subject: [Python-Dev] Easy way to detect filesystem case-sensitivity? In-Reply-To: References: <4A03713D.1020407@gmail.com> <20090508002405.GI10211@steerpike.home.puzzling.org> <4A039132.2030006@gmail.com> Message-ID: <20090508171715.GC3920@phd.pp.ru> On Fri, May 08, 2009 at 09:52:40AM -0700, Brett Cannon wrote: > Thanks for the help to everyone. I ended up simply taking __file__, making > it all uppercase (or lowercase if it is already uppercase) and then doing > os.path.exists() on the modified name. Seems to work. What if __file__ is on a different filesystem with different rules (consider NFS, SMB/CIFS, etc.)? Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd at phd.pp.ru Programmers don't die, they just GOSUB without RETURN. From casey at pandora.com Fri May 8 19:19:24 2009 From: casey at pandora.com (Casey Duncan) Date: Fri, 8 May 2009 11:19:24 -0600 Subject: [Python-Dev] Proposed: drop unnecessary "context" pointer from PyGetSetDef In-Reply-To: <49FEB11B.2040304@hastings.org> References: <49FEB11B.2040304@hastings.org> Message-ID: <7323CF3F-FC5D-4C62-9C45-22E4FFBBA857@pandora.com> On May 4, 2009, at 3:10 AM, Larry Hastings wrote: > > I should have brought this up to python-dev before--sorry for being > so slow. 
It's already in the tracker for a couple of days: > > http://bugs.python.org/issue5880 > > The idea: PyGetSetDef has this "void *closure" field that acts like > a context pointer. You stick it in the PyGetSetDef, and it gets > passed back to you when your getter or setter is called. It's a > reasonable API design, but in practice you almost never need it. > Meanwhile, it clutters up CPython, particularly typeobject.c; there > are all these function calls that end with ", NULL);", just to > satisfy the getter/setter prototype internally. I think this is an important feature, which allows you to define generic, reusable getter and setter functions and pass static metadata to them at runtime. Admittedly I have never needed the full pointer, my typical usage is to pass in an offset. I think this should only be removed if a suitable mechanism replaces it, if not it will require some needless duplication of code in extensions that use it (in particular my own) 8^) -Casey From benjamin at python.org Fri May 8 20:09:56 2009 From: benjamin at python.org (Benjamin Peterson) Date: Fri, 8 May 2009 13:09:56 -0500 Subject: [Python-Dev] special method lookup: how much do we care? Message-ID: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> A while ago, Guido declared that all special method lookups on new-style classes bypass __getattr__ and __getattribute__. This almost completely consistent now, and I've been working on patching up a few incorrect cases. I've know hit __enter__ and __exit__. The compiler generates LOAD_ATTR instructions for these, so it uses the normal lookup. The only way I can see to fix this is add a new opcode which uses _PyObject_LookupSpecial, but I don't think we really care this much. Opinions? -- Regards, Benjamin From larry at hastings.org Fri May 8 21:43:06 2009 From: larry at hastings.org (Larry Hastings) Date: Fri, 08 May 2009 12:43:06 -0700 Subject: [Python-Dev] Proposed: drop unnecessary "context" pointer from PyGetSetDef In-Reply-To: <7323CF3F-FC5D-4C62-9C45-22E4FFBBA857@pandora.com> References: <49FEB11B.2040304@hastings.org> <7323CF3F-FC5D-4C62-9C45-22E4FFBBA857@pandora.com> Message-ID: <4A048B4A.608@hastings.org> Casey Duncan wrote: > I think this is an important feature, which allows you to define > generic, reusable getter and setter functions and pass static metadata > to them at runtime. Admittedly I have never needed the full pointer, > my typical usage is to pass in an offset. > > I think this should only be removed if a suitable mechanism replaces > it, if not it will require some needless duplication of code in > extensions that use it (in particular my own) 8^) I disagree; I think it is a minor convenience feature, and one which encourages a lack of type safety. A suitable replacement mechanism already exists in C: static PyObject *generic_getter(PyObject *o, int context) { /* your generic code goes here */ } static PyObject *getter_with_context_1(o) { return generic_getter(o, 1); } static PyObject *getter_with_context_2(o) { return generic_getter(o, 2); } static PyObject *getter_with_context_3(o) { return generic_getter(o, 3); } You would then use "getter_with_context_1" &c in your PyGetSetDef. With a clever optimizing compiler this should result in no detectable slowdown or code bloat. However, you will be happy to learn there wasn't much support for this change, so it didn't make it into Python 3.1. /larry/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tjreedy at udel.edu Sat May 9 00:41:14 2009 From: tjreedy at udel.edu (Terry Reedy) Date: Fri, 08 May 2009 18:41:14 -0400 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> Message-ID: Benjamin Peterson wrote: > A while ago, Guido declared that all special method lookups on > new-style classes bypass __getattr__ and __getattribute__. This almost > completely consistent now, and I've been working on patching up a few > incorrect cases. I've know hit __enter__ and __exit__. The compiler > generates LOAD_ATTR instructions for these, so it uses the normal > lookup. The only way I can see to fix this is add a new opcode which > uses _PyObject_LookupSpecial, but I don't think we really care this > much. Opinions? 1.More consistent attribute lookup is, to me, a feature of 3.x and I appreciate you working on this. 2. I am puzzled why those two methods should be extra special, but don't know enough to say more. 3. If there are only those two or a couple of other exceptions, I'd like them listed in the 'Special method lookup' ref doc section. tjr From benjamin at python.org Sat May 9 00:54:23 2009 From: benjamin at python.org (Benjamin Peterson) Date: Fri, 8 May 2009 17:54:23 -0500 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> Message-ID: <1afaf6160905081554p10aa7a63ue54744ea138689ef@mail.gmail.com> 2009/5/8 Terry Reedy : > 2. I am puzzled why those two methods should be extra special, but don't > know enough to say more. They're not supposed to be special, which is the reason for this message. :) Currently the interpreter will call __getattr__ when looking them up. This is not the way it should be. -- Regards, Benjamin From daniel at stutzbachenterprises.com Sat May 9 01:10:53 2009 From: daniel at stutzbachenterprises.com (Daniel Stutzbach) Date: Fri, 8 May 2009 18:10:53 -0500 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> Message-ID: On Fri, May 8, 2009 at 1:09 PM, Benjamin Peterson wrote: > I've know hit __enter__ and __exit__. The compiler > generates LOAD_ATTR instructions for these, so it uses the normal > lookup. The only way I can see to fix this is add a new opcode which > uses _PyObject_LookupSpecial, but I don't think we really care this > much. Opinions? > Why does this problem arise only with __enter__ and __exit__? -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC -------------- next part -------------- An HTML attachment was scrubbed... URL: From benjamin at python.org Sat May 9 01:14:12 2009 From: benjamin at python.org (Benjamin Peterson) Date: Fri, 8 May 2009 18:14:12 -0500 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> Message-ID: <1afaf6160905081614o33443c85v51d5574807ada8d7@mail.gmail.com> 2009/5/8 Daniel Stutzbach : > On Fri, May 8, 2009 at 1:09 PM, Benjamin Peterson > wrote: >> >> I've know hit __enter__ and __exit__. The compiler >> generates LOAD_ATTR instructions for these, so it uses the normal >> lookup. 
The only way I can see to fix this is add a new opcode which >> uses _PyObject_LookupSpecial, but I don't think we really care this >> much. Opinions? > > Why does this problem arise only with __enter__ and __exit__? Normally special methods use slots of the PyTypeObject struct. typeobject.c looks up all those methods on Python classes correctly. In the case of __enter__ and __exit__, the compiler generates bytecode to look them up, and that bytecode use PyObject_Getattr. -- Regards, Benjamin From daniel at stutzbachenterprises.com Sat May 9 02:36:44 2009 From: daniel at stutzbachenterprises.com (Daniel Stutzbach) Date: Fri, 8 May 2009 19:36:44 -0500 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: <1afaf6160905081614o33443c85v51d5574807ada8d7@mail.gmail.com> References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> <1afaf6160905081614o33443c85v51d5574807ada8d7@mail.gmail.com> Message-ID: On Fri, May 8, 2009 at 6:14 PM, Benjamin Peterson wrote: > Normally special methods use slots of the PyTypeObject struct. > typeobject.c looks up all those methods on Python classes correctly. > In the case of __enter__ and __exit__, the compiler generates bytecode > to look them up, and that bytecode use PyObject_Getattr. Would this problem apply to all special methods that don't use a slot in PyTypeObject, then? I know of several other examples: __reduce__ __setstate__ __reversed__ __length_hint__ __sizeof__ (unless I misunderstand the definition of "special methods", which is possible) -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC -------------- next part -------------- An HTML attachment was scrubbed... URL: From benjamin at python.org Sat May 9 02:37:45 2009 From: benjamin at python.org (Benjamin Peterson) Date: Fri, 8 May 2009 19:37:45 -0500 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> <1afaf6160905081614o33443c85v51d5574807ada8d7@mail.gmail.com> Message-ID: <1afaf6160905081737t5329e27ax6757892230b75ea0@mail.gmail.com> 2009/5/8 Daniel Stutzbach : > On Fri, May 8, 2009 at 6:14 PM, Benjamin Peterson > wrote: >> >> Normally special methods use slots of the PyTypeObject struct. >> typeobject.c looks up all those methods on Python classes correctly. >> In the case of __enter__ and __exit__, the compiler generates bytecode >> to look them up, and that bytecode use PyObject_Getattr. > > Would this problem apply to all special methods that don't use a slot in > PyTypeObject, then?? I know of several other examples: Yes. I didn't think of those. > > __reduce__ > __setstate__ > __reversed__ > __length_hint__ > __sizeof__ > > (unless I misunderstand the definition of "special methods", which is > possible) -- Regards, Benjamin From tjreedy at udel.edu Sat May 9 02:56:25 2009 From: tjreedy at udel.edu (Terry Reedy) Date: Fri, 08 May 2009 20:56:25 -0400 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: <1afaf6160905081554p10aa7a63ue54744ea138689ef@mail.gmail.com> References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> <1afaf6160905081554p10aa7a63ue54744ea138689ef@mail.gmail.com> Message-ID: Benjamin Peterson wrote: > 2009/5/8 Terry Reedy : >> 2. I am puzzled why those two methods should be extra special, but don't >> know enough to say more. > > They're not supposed to be special, which is the reason for this > message. :) Currently the interpreter will call __getattr__ when > looking them up. 
This is not the way it should be. I was trying to ask the same question as Daniel did more clearly, and which you answered: they are special special methods because they are not in the PyTypeObject struct like the other special (name) methods. And that, I presume, is because they are specific to context manager objects, while all other 'special' methods (that I notice in 'Special method names') are more general in being applicable to multiple types. Since built-in functions are compiled to load_global, call_function and operations to various special op codes, I could imagine that .__enter__ and .__exit__ are currently the only implicitly invoked special names that explicitly appear in code objects. I can see why you ask before burning an opcode (with parameter) to avoid that. There are two issues: 1) bypass instance lookup; 2) bypass .__getattribute__() calling. I presume you have or can do at least the first with a custom .__getattribute__ method. Terry Jan Reedy From tjreedy at udel.edu Sat May 9 03:47:32 2009 From: tjreedy at udel.edu (Terry Reedy) Date: Fri, 08 May 2009 21:47:32 -0400 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: <1afaf6160905081737t5329e27ax6757892230b75ea0@mail.gmail.com> References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> <1afaf6160905081614o33443c85v51d5574807ada8d7@mail.gmail.com> <1afaf6160905081737t5329e27ax6757892230b75ea0@mail.gmail.com> Message-ID: Benjamin Peterson wrote: > 2009/5/8 Daniel Stutzbach : >> On Fri, May 8, 2009 at 6:14 PM, Benjamin Peterson >> wrote: >>> Normally special methods use slots of the PyTypeObject struct. >>> typeobject.c looks up all those methods on Python classes correctly. >>> In the case of __enter__ and __exit__, the compiler generates bytecode >>> to look them up, and that bytecode use PyObject_Getattr. >> Would this problem apply to all special methods that don't use a slot in >> PyTypeObject, then? I know of several other examples: > > Yes. I didn't think of those. > >> __reduce__ >> __setstate__ >> __reversed__ >> __length_hint__ >> __sizeof__ >> >> (unless I misunderstand the definition of "special methods", which is >> possible) __reversed__, at least, is called by the reversed() builtin, so there is no LOAD_ATTR k (__reversed__) byte code. So for that, the problem is reduced to accessing type(it).__reversed__ without going thru type(it).__getattribute__. I would think that a function that did that would work for the others on the list (all 4?) that also have no LOAD_ATTR bytecode. Would a modified version of object.__getattribute__ work? tjr From benjamin at python.org Sat May 9 03:52:24 2009 From: benjamin at python.org (Benjamin Peterson) Date: Fri, 8 May 2009 20:52:24 -0500 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> <1afaf6160905081614o33443c85v51d5574807ada8d7@mail.gmail.com> <1afaf6160905081737t5329e27ax6757892230b75ea0@mail.gmail.com> Message-ID: <1afaf6160905081852g5323d307g54148d01adc4faca@mail.gmail.com> 2009/5/8 Terry Reedy : > Benjamin Peterson wrote: >> >> 2009/5/8 Daniel Stutzbach : >>> >>> On Fri, May 8, 2009 at 6:14 PM, Benjamin Peterson >>> wrote: >>>> >>>> Normally special methods use slots of the PyTypeObject struct. >>>> typeobject.c looks up all those methods on Python classes correctly. >>>> In the case of __enter__ and __exit__, the compiler generates bytecode >>>> to look them up, and that bytecode use PyObject_Getattr. 
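(To make the distinction concrete, a small self-contained sketch of the lookup rule being discussed -- the class and values are invented, and the comment about 'with' describes the pre-fix behaviour covered in this thread:)

    class Demo(object):
        def __getattr__(self, name):
            # Pretend the instance can conjure up any special method.
            if name in ('__len__', '__enter__', '__exit__'):
                return lambda *args: 0
            raise AttributeError(name)

    d = Demo()
    print(d.__len__())        # 0: explicit attribute access does go through __getattr__
    try:
        len(d)                # implicit lookup consults type(d) only, so the
    except TypeError:         # __getattr__-supplied __len__ is never seen
        print("len() bypassed __getattr__, as intended")
    # By contrast, 'with d: pass' on the interpreter described above finds
    # the fake __enter__/__exit__ through __getattr__, because the compiler
    # emits plain LOAD_ATTR for them -- the inconsistency under discussion.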
>>> >>> Would this problem apply to all special methods that don't use a slot in >>> PyTypeObject, then? ?I know of several other examples: >> >> Yes. I didn't think of those. >> >>> __reduce__ >>> __setstate__ >>> __reversed__ >>> __length_hint__ >>> __sizeof__ >>> >>> (unless I misunderstand the definition of "special methods", which is >>> possible) > > __reversed__, at least, is called by the reversed() builtin, so there is no > LOAD_ATTR k (__reversed__) byte code. ?So for that, the problem is reduced > to accessing type(it).__reversed__ without going thru > type(it).__getattribute__. ?I would think that a function that did that > would work for the others on the list (all 4?) that also have no LOAD_ATTR > bytecode. ?Would a modified version of object.__getattribute__ work? No, it's easier to just use _PyObject_LookupSpecial there. -- Regards, Benjamin From tjreedy at udel.edu Sat May 9 08:26:58 2009 From: tjreedy at udel.edu (Terry Reedy) Date: Sat, 09 May 2009 02:26:58 -0400 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: <1afaf6160905081852g5323d307g54148d01adc4faca@mail.gmail.com> References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> <1afaf6160905081614o33443c85v51d5574807ada8d7@mail.gmail.com> <1afaf6160905081737t5329e27ax6757892230b75ea0@mail.gmail.com> <1afaf6160905081852g5323d307g54148d01adc4faca@mail.gmail.com> Message-ID: Benjamin Peterson wrote: >>>> __reduce__ >>>> __setstate__ >>>> __reversed__ >>>> __length_hint__ >>>> __sizeof__ > No, it's easier to just use _PyObject_LookupSpecial there. Does that mean that the above 5 'work correctly' (or can easily be made to do so)? Leaving just __entry__ and __exit__ as problems? From chris at simplistix.co.uk Sat May 9 11:02:21 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Sat, 09 May 2009 10:02:21 +0100 Subject: [Python-Dev] PEP 382: Namespace Packages In-Reply-To: <20090501184706.66ED13A4070@sparrow.telecommunity.com> References: <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <20090415175704.966B13A4100@sparrow.telecommunity.com> <20090415185221.GB13696@amk-desktop.matrixgroup.net> <20090415192021.558E53A4119@sparrow.telecommunity.com> <49FB24DF.2020701@simplistix.co.uk> <20090501184706.66ED13A4070@sparrow.telecommunity.com> Message-ID: <4A05469D.6090301@simplistix.co.uk> P.J. Eby wrote: > I didn't say there's *no* desire, however IIRC the only person who > *ever* asked on distutils-sig how to do a base package with setuptools > was the author of the ll.* packages. 
I've asked before ;-) Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From chris at simplistix.co.uk Sat May 9 11:03:53 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Sat, 09 May 2009 10:03:53 +0100 Subject: [Python-Dev] PEP 382: Namespace Packages In-Reply-To: <49FB3453.4060906@v.loewis.de> References: <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <20090415175704.966B13A4100@sparrow.telecommunity.com> <20090415185221.GB13696@amk-desktop.matrixgroup.net> <20090415192021.558E53A4119@sparrow.telecommunity.com> <49FB24DF.2020701@simplistix.co.uk> <49FB3453.4060906@v.loewis.de> Message-ID: <4A0546F9.30108@simplistix.co.uk> Martin v. L?wis wrote: >> I, for one, have been trying to figure out how to do "base namespace" >> packages for years... > > You mean, without PEP 382? > > That won't be possible, unless you can coordinate all addon packages. > Base packages are a feature solely of PEP 382. Marc-Andre has achieved this, I think, without the PEP, but I never really understood how :-S Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From chris at simplistix.co.uk Sat May 9 11:06:52 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Sat, 09 May 2009 10:06:52 +0100 Subject: [Python-Dev] PEP 382: little help for stupid people? In-Reply-To: <49FB3384.1030106@v.loewis.de> References: <49D4DA72.60401@v.loewis.de> <49D52115.6020001@egenix.com> <49D66C6E.3090602@v.loewis.de> <49DB475B.8060504@egenix.com> <20090407140317.EBD383A4063@sparrow.telecommunity.com> <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB2398.5000708@simplistix.co.uk> <49FB261F.9080306@v.loewis.de> <49FB2A2A.4090606@simplistix.co.uk> <49FB3384.1030106@v.loewis.de> Message-ID: <4A0547AC.7060103@simplistix.co.uk> Martin v. L?wis wrote: > Ok, so create three tar files: > > 1. base.tar, containing > > simplistix/ > simplistix/__init__.py So this __init__.py can have code in it? And base.tar can have other modules and subpackages in it? What happens if the base and an addon both define a package called simplistix.somepackage? > 2. addon1.tar, containing > > simplistix/addon1.pth (containing a single "*") What does that * mean? I thought .pth files just had python in them? > Unpack each of them anywhere on sys.path, in any order. How would this work if base, addon1 and addon2 were eggs managed by buildout or setuptools? cheers, Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From martin at v.loewis.de Sat May 9 11:27:22 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 09 May 2009 11:27:22 +0200 Subject: [Python-Dev] PEP 382: little help for stupid people? 
In-Reply-To: <4A0547AC.7060103@simplistix.co.uk> References: <49D4DA72.60401@v.loewis.de> <49D52115.6020001@egenix.com> <49D66C6E.3090602@v.loewis.de> <49DB475B.8060504@egenix.com> <20090407140317.EBD383A4063@sparrow.telecommunity.com> <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB2398.5000708@simplistix.co.uk> <49FB261F.9080306@v.loewis.de> <49FB2A2A.4090606@simplistix.co.uk> <49FB3384.1030106@v.loewis.de> <4A0547AC.7060103@simplistix.co.uk> Message-ID: <4A054C7A.8020806@v.loewis.de> >> Ok, so create three tar files: >> >> 1. base.tar, containing >> >> simplistix/ >> simplistix/__init__.py > > So this __init__.py can have code in it? That's the point, yes. > And base.tar can have other modules and subpackages in it? Certainly, yes. > What happens if the base and an addon both define a package called > simplistix.somepackage? Depends on whether simplistix.somepackage is a namespace package (it should). If so, they get merged just as any other namespace package. >> 2. addon1.tar, containing >> >> simplistix/addon1.pth (containing a single "*") > > What does that * mean? See PEP 382 (search for "*"). > I thought .pth files just had python in them? Not at all - they never did. They have paths in them. >> Unpack each of them anywhere on sys.path, in any order. > > How would this work if base, addon1 and addon2 were eggs managed by > buildout or setuptools? What is a managed egg (i.e. what kind of management does buildout or setuptools apply to it)? Regards, Martin From asmodai at in-nomine.org Sat May 9 13:24:55 2009 From: asmodai at in-nomine.org (Jeroen Ruigrok van der Werven) Date: Sat, 9 May 2009 13:24:55 +0200 Subject: [Python-Dev] PEP 382: Namespace Packages In-Reply-To: <49FB4654.9000408@v.loewis.de> References: <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <20090415175704.966B13A4100@sparrow.telecommunity.com> <20090415185221.GB13696@amk-desktop.matrixgroup.net> <20090415192021.558E53A4119@sparrow.telecommunity.com> <49FB24DF.2020701@simplistix.co.uk> <49FB3453.4060906@v.loewis.de> <20090501184843.D08E43A4070@sparrow.telecommunity.com> <49FB4654.9000408@v.loewis.de> Message-ID: <20090509112455.GL24353@nexus.in-nomine.org> -On [20090501 20:59], "Martin v. L?wis" (martin at v.loewis.de) wrote: >Right: if all portions install into the same directory, you can have >base packages already. Speaking as a user of packages, this use case is one I hardly ever encounter with the Python software/modules/packages I use. The only ones that spring to mind are the mx.* and ll.* packages. The rest simply create their own namespace as .*, but there's nothing that uses that same namespace and installs separately from the base package that I know of. -- Jeroen Ruigrok van der Werven / asmodai ????? ?????? ??? ?? ?????? http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B Knowledge was inherent in all things. The world was a library... 
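(For reference, the way such a shared namespace is usually spelled today, without PEP 382: every portion ships the same one-line stub __init__.py, here reusing the "simplistix" name from Martin's example -- a sketch only:)

    # simplistix/__init__.py, shipped identically by *every* portion
    from pkgutil import extend_path
    __path__ = extend_path(__path__, __name__)

    # setuptools-based distributions use the equivalent:
    #     from pkg_resources import declare_namespace
    #     declare_namespace(__name__)

Because every portion has to ship the same stub, none of them can safely put real code in __init__.py -- which is exactly the "base package" limitation PEP 382 is trying to lift.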
From martin at v.loewis.de Sat May 9 13:40:48 2009 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Sat, 09 May 2009 13:40:48 +0200 Subject: [Python-Dev] PEP 382: Namespace Packages In-Reply-To: <20090509112455.GL24353@nexus.in-nomine.org> References: <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <20090415175704.966B13A4100@sparrow.telecommunity.com> <20090415185221.GB13696@amk-desktop.matrixgroup.net> <20090415192021.558E53A4119@sparrow.telecommunity.com> <49FB24DF.2020701@simplistix.co.uk> <49FB3453.4060906@v.loewis.de> <20090501184843.D08E43A4070@sparrow.telecommunity.com> <49FB4654.9000408@v.loewis.de> <20090509112455.GL24353@nexus.in-nomine.org> Message-ID: <4A056BC0.60606@v.loewis.de> >> Right: if all portions install into the same directory, you can have >> base packages already. > > Speaking as a user of packages, this use case is one I hardly ever encounter > with the Python software/modules/packages I use. The only ones that spring > to mind are the mx.* and ll.* packages. The rest simply create their own > namespace as .*, but there's nothing that uses that same namespace > and installs separately from the base package that I know of. There are a few others, though: zope.*, repoze.*, redturtle.*, iw.*, plone.*, pycopia.*, p4a.*, plonehrm.*, plonetheme.*, pbp.*, lovely.*, xm.*, paste.*, Products.*, buildout.*, five.*, silva.*, tl.*, tw.*, themerubber.*, themetweaker.*, zc.*, z3c.*, zgeo.*, z3ext.*, etc. Regards, Martin From asmodai at in-nomine.org Sat May 9 13:50:37 2009 From: asmodai at in-nomine.org (Jeroen Ruigrok van der Werven) Date: Sat, 9 May 2009 13:50:37 +0200 Subject: [Python-Dev] PEP 382: Namespace Packages In-Reply-To: <4A056BC0.60606@v.loewis.de> References: <49E60832.8030806@egenix.com> <20090415175704.966B13A4100@sparrow.telecommunity.com> <20090415185221.GB13696@amk-desktop.matrixgroup.net> <20090415192021.558E53A4119@sparrow.telecommunity.com> <49FB24DF.2020701@simplistix.co.uk> <49FB3453.4060906@v.loewis.de> <20090501184843.D08E43A4070@sparrow.telecommunity.com> <49FB4654.9000408@v.loewis.de> <20090509112455.GL24353@nexus.in-nomine.org> <4A056BC0.60606@v.loewis.de> Message-ID: <20090509115037.GM24353@nexus.in-nomine.org> -On [20090509 13:40], "Martin v. L?wis" (martin at v.loewis.de) wrote: >There are a few others, though: zope.*, repoze.*, redturtle.*, iw.*, >plone.*, pycopia.*, p4a.*, plonehrm.*, plonetheme.*, pbp.*, lovely.*, >xm.*, paste.*, Products.*, buildout.*, five.*, silva.*, tl.*, tw.*, >themerubber.*, themetweaker.*, zc.*, z3c.*, zgeo.*, z3ext.*, etc. Can be fairly said, though, that the majority of those you just named are related to Zope? That would explain why I won't know of them as I avoid Zope like the plague. -- Jeroen Ruigrok van der Werven / asmodai ????? ?????? ??? ?? ?????? http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B Hope is a letter that never arrives, delivered by the postman of my fear... 
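For readers who have not run into them, the distributions Martin lists share a top-level name through the setuptools namespace-package machinery, which looks roughly like this (a sketch, not code taken from any of the listed projects):

    # zc/__init__.py -- effectively the only content of the shared package
    __import__('pkg_resources').declare_namespace(__name__)

    # or, without setuptools, the stdlib equivalent:
    # from pkgutil import extend_path
    # __path__ = extend_path(__path__, __name__)

    # setup.py of each distribution contributing a zc.* subpackage
    from setuptools import setup, find_packages
    setup(
        name='zc.example',              # invented project name
        packages=find_packages(),
        namespace_packages=['zc'],      # marks 'zc' as a shared namespace
    )

The catch, discussed elsewhere in this thread, is that such an __init__.py must stay effectively empty -- which is exactly the "base package" capability that PEP 382 adds.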
From zookog at gmail.com Sat May 9 15:49:13 2009 From: zookog at gmail.com (Zooko O'Whielacronx) Date: Sat, 9 May 2009 07:49:13 -0600 Subject: [Python-Dev] .pth files are evil In-Reply-To: <49FB22B5.3040507@simplistix.co.uk> References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> Message-ID: .pth files are why I can't easily use GNU stow with easy_install. If installing a Python package involved writing new files into the filesystem, but did not require reading, updating, and re-writing any extant files such as .pth files, then GNU stow would Just Work with easy_install the way it Just Works with most things. Regards, Zooko From chris at simplistix.co.uk Sat May 9 16:07:01 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Sat, 09 May 2009 15:07:01 +0100 Subject: [Python-Dev] PEP 382: Namespace Packages In-Reply-To: <20090509115037.GM24353@nexus.in-nomine.org> References: <49E60832.8030806@egenix.com> <20090415175704.966B13A4100@sparrow.telecommunity.com> <20090415185221.GB13696@amk-desktop.matrixgroup.net> <20090415192021.558E53A4119@sparrow.telecommunity.com> <49FB24DF.2020701@simplistix.co.uk> <49FB3453.4060906@v.loewis.de> <20090501184843.D08E43A4070@sparrow.telecommunity.com> <49FB4654.9000408@v.loewis.de> <20090509112455.GL24353@nexus.in-nomine.org> <4A056BC0.60606@v.loewis.de> <20090509115037.GM24353@nexus.in-nomine.org> Message-ID: <4A058E05.9070908@simplistix.co.uk> Jeroen Ruigrok van der Werven wrote: > -On [20090509 13:40], "Martin v. L?wis" (martin at v.loewis.de) wrote: >> There are a few others, though: zope.*, repoze.*, redturtle.*, iw.*, >> plone.*, pycopia.*, p4a.*, plonehrm.*, plonetheme.*, pbp.*, lovely.*, >> xm.*, paste.*, Products.*, buildout.*, five.*, silva.*, tl.*, tw.*, >> themerubber.*, themetweaker.*, zc.*, z3c.*, zgeo.*, z3ext.*, etc. > > Can be fairly said, though, that the majority of those you just named are > related to Zope? They're also all pure namespace packages rather than base + addons, which is what we've been discussing... > That would explain why I won't know of them as I avoid Zope like the plague. More fool you... Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From chris at simplistix.co.uk Sat May 9 16:10:23 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Sat, 09 May 2009 15:10:23 +0100 Subject: [Python-Dev] PEP 382: little help for stupid people? 
In-Reply-To: <4A054C7A.8020806@v.loewis.de> References: <49D4DA72.60401@v.loewis.de> <49D52115.6020001@egenix.com> <49D66C6E.3090602@v.loewis.de> <49DB475B.8060504@egenix.com> <20090407140317.EBD383A4063@sparrow.telecommunity.com> <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB2398.5000708@simplistix.co.uk> <49FB261F.9080306@v.loewis.de> <49FB2A2A.4090606@simplistix.co.uk> <49FB3384.1030106@v.loewis.de> <4A0547AC.7060103@simplistix.co.uk> <4A054C7A.8020806@v.loewis.de> Message-ID: <4A058ECF.6050203@simplistix.co.uk> Martin v. L?wis wrote: >> So this __init__.py can have code in it? > > That's the point, yes. > >> And base.tar can have other modules and subpackages in it? > > Certainly, yes. Great, when is the PEP due to land in 2.x? ;-) >> What happens if the base and an addon both define a package called >> simplistix.somepackage? > > Depends on whether simplistix.somepackage is a namespace package > (it should). If so, they get merged just as any other namespace > package. Sorry, I was looking at potential bug cases here. What happens if it's not a namespace package? > See PEP 382 (search for "*"). > >> I thought .pth files just had python in them? > > Not at all - they never did. They have paths in them. I've certainly seen them with python in, and that's what I hate about them... >>> Unpack each of them anywhere on sys.path, in any order. >> How would this work if base, addon1 and addon2 were eggs managed by >> buildout or setuptools? > > What is a managed egg (i.e. what kind of management does buildout > or setuptools apply to it)? Sorry, bad wording on my part... I guess I meant more how would buildout/setuptools go about installing/uninstalling/etc packages thatconform to PEP 382? Would setuptools/buildout need modification or would the changes take effect lower down in the stack? cheers, Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk From asmodai at in-nomine.org Sat May 9 16:14:34 2009 From: asmodai at in-nomine.org (Jeroen Ruigrok van der Werven) Date: Sat, 9 May 2009 16:14:34 +0200 Subject: [Python-Dev] PEP 382: Namespace Packages In-Reply-To: <4A058E05.9070908@simplistix.co.uk> References: <20090415185221.GB13696@amk-desktop.matrixgroup.net> <20090415192021.558E53A4119@sparrow.telecommunity.com> <49FB24DF.2020701@simplistix.co.uk> <49FB3453.4060906@v.loewis.de> <20090501184843.D08E43A4070@sparrow.telecommunity.com> <49FB4654.9000408@v.loewis.de> <20090509112455.GL24353@nexus.in-nomine.org> <4A056BC0.60606@v.loewis.de> <20090509115037.GM24353@nexus.in-nomine.org> <4A058E05.9070908@simplistix.co.uk> Message-ID: <20090509141434.GN24353@nexus.in-nomine.org> -On [20090509 16:07], Chris Withers (chris at simplistix.co.uk) wrote: >They're also all pure namespace packages rather than base + addons, >which is what we've been discussing... But from Martin's email I understood it more as being base packages. Unless I misunderstood, of course. If correct, which is it? >More fool you... Maybe, used/worked with it and don't care for it one iota. But that's a whole different discussion. -- Jeroen Ruigrok van der Werven / asmodai ????? ?????? ??? ?? ?????? 
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B Naritai jibun wo surikaetemo egao wa itsudemo suteki desuka... From martin at v.loewis.de Sat May 9 16:18:44 2009 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Sat, 09 May 2009 16:18:44 +0200 Subject: [Python-Dev] .pth files are evil In-Reply-To: References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> Message-ID: <4A0590C4.1020904@v.loewis.de> Zooko O'Whielacronx wrote: > .pth files are why I can't easily use GNU stow with easy_install. > If installing a Python package involved writing new files into the > filesystem, but did not require reading, updating, and re-writing any > extant files such as .pth files, then GNU stow would Just Work with > easy_install the way it Just Works with most things. Please understand that this is the fault of easy_install, not of .pth files. There is no technical need for easy_install to rewrite .pth files on installation. It could just as well have created new .pth files, rather than modifying existing ones. If you always use --single-version-externally-managed with easy_install, it will stop editing .pth files on installation. Regards, Martin From martin at v.loewis.de Sat May 9 16:32:39 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 09 May 2009 16:32:39 +0200 Subject: [Python-Dev] PEP 382: little help for stupid people? In-Reply-To: <4A058ECF.6050203@simplistix.co.uk> References: <49D4DA72.60401@v.loewis.de> <49D52115.6020001@egenix.com> <49D66C6E.3090602@v.loewis.de> <49DB475B.8060504@egenix.com> <20090407140317.EBD383A4063@sparrow.telecommunity.com> <49DB6A1F.50801@egenix.com> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB2398.5000708@simplistix.co.uk> <49FB261F.9080306@v.loewis.de> <49FB2A2A.4090606@simplistix.co.uk> <49FB3384.1030106@v.loewis.de> <4A0547AC.7060103@simplistix.co.uk> <4A054C7A.8020806@v.loewis.de> <4A058ECF.6050203@simplistix.co.uk> Message-ID: <4A059407.2060803@v.loewis.de> Chris Withers wrote: > Martin v. L?wis wrote: >>> So this __init__.py can have code in it? >> >> That's the point, yes. >> >>> And base.tar can have other modules and subpackages in it? >> >> Certainly, yes. > > Great, when is the PEP due to land in 2.x? ;-) Most likely, never - it probably will be implemented only after the last feature release of 2.x was made. >>> What happens if the base and an addon both define a package called >>> simplistix.somepackage? >> >> Depends on whether simplistix.somepackage is a namespace package >> (it should). If so, they get merged just as any other namespace >> package. > > Sorry, I was looking at potential bug cases here. What happens if it's > not a namespace package? Then it will be imported as a regular child package. >>>> Unpack each of them anywhere on sys.path, in any order. 
>>> How would this work if base, addon1 and addon2 were eggs managed by >>> buildout or setuptools? >> >> What is a managed egg (i.e. what kind of management does buildout >> or setuptools apply to it)? > > Sorry, bad wording on my part... I guess I meant more how would > buildout/setuptools go about installing/uninstalling/etc packages > thatconform to PEP 382? Would setuptools/buildout need modification or > would the changes take effect lower down in the stack? Unfortunately, I don't know precisely what they do, so I don't know whether any of it needs modification. All I can say is that if they want to install namespace packages using the mechanism of PEP 382, they will have to produce the file layout specified in the PEP. For distutils (which is the only library in that area that I do know), I think just installing any .pth files inside a package would be sufficient. Regards, Martin From martin at v.loewis.de Sat May 9 16:34:28 2009 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Sat, 09 May 2009 16:34:28 +0200 Subject: [Python-Dev] PEP 382: Namespace Packages In-Reply-To: <20090509141434.GN24353@nexus.in-nomine.org> References: <20090415185221.GB13696@amk-desktop.matrixgroup.net> <20090415192021.558E53A4119@sparrow.telecommunity.com> <49FB24DF.2020701@simplistix.co.uk> <49FB3453.4060906@v.loewis.de> <20090501184843.D08E43A4070@sparrow.telecommunity.com> <49FB4654.9000408@v.loewis.de> <20090509112455.GL24353@nexus.in-nomine.org> <4A056BC0.60606@v.loewis.de> <20090509115037.GM24353@nexus.in-nomine.org> <4A058E05.9070908@simplistix.co.uk> <20090509141434.GN24353@nexus.in-nomine.org> Message-ID: <4A059474.4090704@v.loewis.de> Jeroen Ruigrok van der Werven wrote: > -On [20090509 16:07], Chris Withers (chris at simplistix.co.uk) wrote: >> They're also all pure namespace packages rather than base + addons, >> which is what we've been discussing... > > But from Martin's email I understood it more as being base packages. Unless > I misunderstood, of course. > > If correct, which is it? The list I gave you was a list of distributions that include namespace packages (using the setuptools mechanism). I don't think that any of them has the notion of a base package, as the setuptools mechanism doesn't support base packages. Regards, Martin From pje at telecommunity.com Sat May 9 16:41:02 2009 From: pje at telecommunity.com (P.J. Eby) Date: Sat, 09 May 2009 10:41:02 -0400 Subject: [Python-Dev] .pth files are evil In-Reply-To: <4A0590C4.1020904@v.loewis.de> References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> Message-ID: <20090509143829.17F293A4080@sparrow.telecommunity.com> At 04:18 PM 5/9/2009 +0200, Martin v. L??wis wrote: >Zooko O'Whielacronx wrote: > > .pth files are why I can't easily use GNU stow with easy_install. > > If installing a Python package involved writing new files into the > > filesystem, but did not require reading, updating, and re-writing any > > extant files such as .pth files, then GNU stow would Just Work with > > easy_install the way it Just Works with most things. > >Please understand that this is the fault of easy_install, not of .pth >files. 
There is no technical need for easy_install to rewrite .pth >files on installation. It could just as well have created new .pth >files, rather than modifying existing ones. > >If you always use --single-version-externally-managed with easy_install, >it will stop editing .pth files on installation. It's --multi-version (-m) that does that. --single-version-externally-managed is a "setup.py install" option. Both have the effect of not editing .pth files, but they do so in different ways. The "setup.py install" option causes it to install in a distutils-compatible layout, whereas --multi-version simply drops .egg files or directories in the target location and leaves it to the user (or the generated script wrappers) to add them to sys.path. From martin at v.loewis.de Sat May 9 16:42:01 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 09 May 2009 16:42:01 +0200 Subject: [Python-Dev] .pth files are evil In-Reply-To: <20090509143829.17F293A4080@sparrow.telecommunity.com> References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> Message-ID: <4A059639.7040505@v.loewis.de> >> If you always use --single-version-externally-managed with easy_install, >> it will stop editing .pth files on installation. > > It's --multi-version (-m) that does that. > --single-version-externally-managed is a "setup.py install" option. > > Both have the effect of not editing .pth files, but they do so in > different ways. The "setup.py install" option causes it to install in a > distutils-compatible layout, whereas --multi-version simply drops .egg > files or directories in the target location and leaves it to the user > (or the generated script wrappers) to add them to sys.path. Ah, ok. Is there also an easy_install invocation that unpacks the zip file into some location of sys.path (which then wouldn't require editing sys.path)? Regards, Martin From pje at telecommunity.com Sat May 9 17:39:52 2009 From: pje at telecommunity.com (P.J. Eby) Date: Sat, 09 May 2009 11:39:52 -0400 Subject: [Python-Dev] .pth files are evil In-Reply-To: <4A059639.7040505@v.loewis.de> References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> Message-ID: <20090509153716.D44633A4080@sparrow.telecommunity.com> At 04:42 PM 5/9/2009 +0200, Martin v. L?wis wrote: > >> If you always use --single-version-externally-managed with easy_install, > >> it will stop editing .pth files on installation. > > > > It's --multi-version (-m) that does that. > > --single-version-externally-managed is a "setup.py install" option. > > > > Both have the effect of not editing .pth files, but they do so in > > different ways. 
The "setup.py install" option causes it to install in a > > distutils-compatible layout, whereas --multi-version simply drops .egg > > files or directories in the target location and leaves it to the user > > (or the generated script wrappers) to add them to sys.path. > >Ah, ok. Is there also an easy_install invocation that unpacks the zip >file into some location of sys.path (which then wouldn't require >editing sys.path)? Not as yet. I'm sort of waiting to see what comes out of PEP 376 discussions re: an installation manifest... but then, if I actually had time to work on it right now, I'd probably just implement something. Currently, you can use pip to do that, though, as long as the packages you want are in source form. pip doesn't unzip eggs as yet. It would be really straightforward, though, for someone to implement an easy_install variant that does this. Just invoke "easy_install -Zmaxd /some/tmpdir packagelist" to get a full set of unpacked .egg directories in /some/tmpdir, and then move the contents of the resulting .egg subdirs to the target location, renaming EGG-INFO subdirs to projectname-version.egg-info subdirs. (Of course, this ignores the issue of uninstalling previous versions, or overwriting of conflicting files in the target -- does pip handle these?) From benjamin at python.org Sat May 9 17:52:11 2009 From: benjamin at python.org (Benjamin Peterson) Date: Sat, 9 May 2009 10:52:11 -0500 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> <1afaf6160905081614o33443c85v51d5574807ada8d7@mail.gmail.com> <1afaf6160905081737t5329e27ax6757892230b75ea0@mail.gmail.com> <1afaf6160905081852g5323d307g54148d01adc4faca@mail.gmail.com> Message-ID: <1afaf6160905090852y3af447a9ofb5b7840f44a8a97@mail.gmail.com> 2009/5/9 Terry Reedy : > Benjamin Peterson wrote: > >>>>> __reduce__ >>>>> __setstate__ >>>>> __reversed__ >>>>> __length_hint__ >>>>> __sizeof__ > >> No, it's easier to just use _PyObject_LookupSpecial there. > > Does that mean that the above 5 'work correctly' (or can easily be made to > do so)? ?Leaving just __entry__ and __exit__ as problems? Yes, __enter__ and __exit__ are the tricky ones. -- Regards, Benjamin From p.f.moore at gmail.com Sat May 9 18:03:20 2009 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 9 May 2009 17:03:20 +0100 Subject: [Python-Dev] PEP 382: little help for stupid people? In-Reply-To: <4A058ECF.6050203@simplistix.co.uk> References: <49D4DA72.60401@v.loewis.de> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB2398.5000708@simplistix.co.uk> <49FB261F.9080306@v.loewis.de> <49FB2A2A.4090606@simplistix.co.uk> <49FB3384.1030106@v.loewis.de> <4A0547AC.7060103@simplistix.co.uk> <4A054C7A.8020806@v.loewis.de> <4A058ECF.6050203@simplistix.co.uk> Message-ID: <79990c6b0905090903o19e11505w353cfe62f4f67071@mail.gmail.com> 2009/5/9 Chris Withers : > Martin v. L?wis wrote: >>> I thought .pth files just had python in them? >> >> Not at all - they never did. They have paths in them. > > I've certainly seen them with python in, and that's what I hate about > them... AIUI, there was a small special case that lines starting with "import" are executed (see the source of site.py for details). This exception has been exploited (some would say "abused", but I'm trying to be unbiased here) by setuptools, at least, to do path manipulations and such. 
PEP 382 does not provide the import exception: "Unlike .pth files on the top level, lines starting with "import" are not supported in per-package .pth files". It's not clear to me what impact this would have on setuptools (probably none, as top-level .pth files aren't changed). Paul. From g.brandl at gmx.net Sat May 9 19:16:55 2009 From: g.brandl at gmx.net (Georg Brandl) Date: Sat, 09 May 2009 19:16:55 +0200 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> Message-ID: Benjamin Peterson schrieb: > A while ago, Guido declared that all special method lookups on > new-style classes bypass __getattr__ and __getattribute__. This almost > completely consistent now, and I've been working on patching up a few > incorrect cases. I've know hit __enter__ and __exit__. The compiler > generates LOAD_ATTR instructions for these, so it uses the normal > lookup. The only way I can see to fix this is add a new opcode which > uses _PyObject_LookupSpecial, but I don't think we really care this > much. Opinions? It's easier to introduce a separate opcode like SETUP_WITH; the compilation of a with statement produces quite a lot of bytecode which could be made more efficient that way. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. From greg.ewing at canterbury.ac.nz Sun May 10 03:10:53 2009 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sun, 10 May 2009 13:10:53 +1200 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: <1afaf6160905081852g5323d307g54148d01adc4faca@mail.gmail.com> References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> <1afaf6160905081614o33443c85v51d5574807ada8d7@mail.gmail.com> <1afaf6160905081737t5329e27ax6757892230b75ea0@mail.gmail.com> <1afaf6160905081852g5323d307g54148d01adc4faca@mail.gmail.com> Message-ID: <4A06299D.4030403@canterbury.ac.nz> Are we solving an actual problem by changing the behaviour here, or is it just a case of foolish consistency? Seems to me that trying to pin down exactly what constitutes a "special method" is a fool's errand, especially if you want it to include __enter__ and __exit__ but not __reduce__, etc. -- Greg From benjamin at python.org Sun May 10 03:25:28 2009 From: benjamin at python.org (Benjamin Peterson) Date: Sat, 9 May 2009 20:25:28 -0500 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: <4A06299D.4030403@canterbury.ac.nz> References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> <1afaf6160905081614o33443c85v51d5574807ada8d7@mail.gmail.com> <1afaf6160905081737t5329e27ax6757892230b75ea0@mail.gmail.com> <1afaf6160905081852g5323d307g54148d01adc4faca@mail.gmail.com> <4A06299D.4030403@canterbury.ac.nz> Message-ID: <1afaf6160905091825q28006c33sf9fb09d5e8a40cc2@mail.gmail.com> 2009/5/9 Greg Ewing : > Are we solving an actual problem by changing the > behaviour here, or is it just a case of foolish > consistency? "No implementation detail is obscure enough." For example, Maciek Fijalkowski of PyPy told me that he cares about this because someone is bound to eventually rely on it, and PyPy will have to follow CPython. 
> > Seems to me that trying to pin down exactly what > constitutes a "special method" is a fool's errand, > especially if you want it to include __enter__ and > __exit__ but not __reduce__, etc. IMO, if it's a callable that begins with __ and ends with __, it's a special method. -- Regards, Benjamin From zooko at zooko.com Sun May 10 17:41:33 2009 From: zooko at zooko.com (Zooko Wilcox-O'Hearn) Date: Sun, 10 May 2009 09:41:33 -0600 Subject: [Python-Dev] .pth files are evil In-Reply-To: <20090509153716.D44633A4080@sparrow.telecommunity.com> References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> Message-ID: <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> On May 9, 2009, at 9:39 AM, P.J. Eby wrote: > It would be really straightforward, though, for someone to > implement an easy_install variant that does this. Just invoke > "easy_install -Zmaxd /some/tmpdir packagelist" to get a full set of > unpacked .egg directories in /some/tmpdir, and then move the > contents of the resulting .egg subdirs to the target location, > renaming EGG-INFO subdirs to projectname-version.egg-info subdirs. Except for the renaming part, this is exactly what GNU stow does. > (Of course, this ignores the issue of uninstalling previous > versions, or overwriting of conflicting files in the target -- does > pip handle these?) GNU stow does handle these issues. Regards, Zooko From martin at v.loewis.de Sun May 10 19:18:16 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 10 May 2009 19:18:16 +0200 Subject: [Python-Dev] .pth files are evil In-Reply-To: <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> Message-ID: <4A070C58.4070003@v.loewis.de> > GNU stow does handle these issues. If GNU stow solves all your problems, why do you want to use easy_install in the first place? 
Regards, Martin From zooko at zooko.com Sun May 10 20:04:57 2009 From: zooko at zooko.com (Zooko Wilcox-O'Hearn) Date: Sun, 10 May 2009 12:04:57 -0600 Subject: [Python-Dev] how GNU stow is complementary rather than alternative to distutils In-Reply-To: <4A070C58.4070003@v.loewis.de> References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> <4A070C58.4070003@v.loewis.de> Message-ID: On May 10, 2009, at 11:18 AM, Martin v. L?wis wrote: > If GNU stow solves all your problems, why do you want to use > easy_install in the first place? That's a good question. The answer is that there are two separate jobs: building executables and putting them in a directory structure of the appropriate shape for your system is one job, and installing or uninstalling that tree into your system is another. GNU stow does only the latter. The input to GNU stow is a set of executables, library files, etc., in a directory tree that is of the right shape for your system. For example, if you are on a Linux system, then your scripts all need to be in $prefix/bin/, your shared libs should be in $prefix/lib, your Python packages ought to be in $prefix/lib/python$x.$y/site- packages/, etc. GNU stow is blissfully ignorant about all issues of building binaries, and choosing where to place files, etc. -- that's the job of the build system of the package, e.g. the "./configure -- prefix=foo && make && make install" for most C packages, or the "python ./setup.py install --prefix=foo" for Python packages using distutils (footnote 1). Once GNU stow has the well-shaped directory which is the output of the build process, then it follows a very dumb, completely reversible (uninstallable) process of symlinking those files into the system directory structure. It is a beautiful, elegant hack because it is sooo dumb. It is also very nice to use the same tool to manage packages written in any programming language, provided only that they can build a directory tree of the right shape and content. However, there are lots of things that it doesn't do, such as automatically acquiring and building dependencies, or producing executables for the target platform for each of your console scripts. Not to mention creating a directory named "$prefx/lib/python $x.$y/site-packages" and cp'ing your Python files into it. That's why you still need a build system even if you use GNU stow for an install-and-uninstall system. The thing that prevents this from working with setuptools is that setuptools creates a file named easy_install.pth during the "python ./ setup.py install --prefix=foo" if you build two different Python packages this way, they will each create an easy_install.pth file, and then when you ask GNU stow to link the two resulting packages into your system, it will say "You are asking me to install two different packages which both claim that they need to write a file named '/usr/local/lib/python2.5/site-packages/easy_install.pth'. 
I'm too dumb to deal with this conflict, so I give up.". If I understand correctly, your (MvL's) suggestion that easy_install create a .pth file named "easy_install-$PACKAGE-$VERSION.pth" instead of "easy_install.pth" would indeed make it work with GNU stow. Regards, Zooko footnote 1: Aside from the .pth file issue, the other reason that setuptools doesn't work for this use while distutils does is that setuptools tries to hard to save you from making a mistake: maybe you don't know what you are doing if you ask it to install into a previously non-existent prefix dir "foo". This one is easier to fix: http://bugs.python.org/setuptools/issue54 # "be more like distutils with regard to --prefix=" . From martin at v.loewis.de Sun May 10 20:21:48 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 10 May 2009 20:21:48 +0200 Subject: [Python-Dev] how GNU stow is complementary rather than alternative to distutils In-Reply-To: References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> <4A070C58.4070003@v.loewis.de> Message-ID: <4A071B3C.2060808@v.loewis.de> Zooko Wilcox-O'Hearn wrote: > On May 10, 2009, at 11:18 AM, Martin v. L?wis wrote: > >> If GNU stow solves all your problems, why do you want to use >> easy_install in the first place? > > That's a good question. The answer is that there are two separate jobs: > building executables and putting them in a directory structure of the > appropriate shape for your system is one job, and installing or > uninstalling that tree into your system is another. GNU stow does only > the latter. And so does easy_install - it's job is *not* to build the executables and to put them in a directory structure. Instead, it's distutils/setuptools which has this job. The primary purpose of easy_install is to download the files from PyPI (IIUC). > The thing that prevents this from working with setuptools is that > setuptools creates a file named easy_install.pth It will stop doing that if you ask nicely. That's why I recommended earlier that you do ask it not to edit .pth files. > If I understand correctly, > your (MvL's) suggestion that easy_install create a .pth file named > "easy_install-$PACKAGE-$VERSION.pth" instead of "easy_install.pth" would > indeed make it work with GNU stow. My recommendation is that you use the already existing flag to setup.py install that stops it from editing .pth files. 
Regards, Martin From zookog at gmail.com Sun May 10 20:21:57 2009 From: zookog at gmail.com (Zooko O'Whielacronx) Date: Sun, 10 May 2009 12:21:57 -0600 Subject: [Python-Dev] how GNU stow is complementary rather than alternative to distutils In-Reply-To: References: <49D4DA72.60401@v.loewis.de> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> <4A070C58.4070003@v.loewis.de> Message-ID: following-up to my own post to mention one very important reason why anyone cares: On Sun, May 10, 2009 at 12:04 PM, Zooko Wilcox-O'Hearn wrote: > It is a beautiful, elegant hack because it is sooo dumb. ?It is also very > nice to use the same tool to manage packages written in any programming > language, provided only that they can build a directory tree of the right > shape and content. And, you are not relying on the author of the package that you are installing to avoid accidentally or maliciously screwing up your system. You're not even relying on the authors of the *build system* (e.g. the authors of distutils or easy_install). You are relying *only* on GNU stow to avoid accidentally or maliciously screwing up your system, and GNU stow is very dumb, so it is easy to understand what it is going to do and why that isn't going to irreversibly screw up your system. That is: you don't run the "build yourself and install into $prefix" step as root. This is an important consideration for a lot of people, who absolutely refuse on principle to ever run "sudo python ./setup.py" on a system that they care about unless they wrote the "setup.py" script themselves. (Likewise they refuse to run "sudo make install" on packages written in C.) Regards, Zooko From pje at telecommunity.com Sun May 10 20:48:46 2009 From: pje at telecommunity.com (P.J. Eby) Date: Sun, 10 May 2009 14:48:46 -0400 Subject: [Python-Dev] how GNU stow is complementary rather than alternative to distutils In-Reply-To: References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> <4A070C58.4070003@v.loewis.de> Message-ID: <20090510184609.D238E3A4061@sparrow.telecommunity.com> At 12:04 PM 5/10/2009 -0600, Zooko Wilcox-O'Hearn wrote: >The thing that prevents this from working with setuptools is that >setuptools creates a file named easy_install.pth during the "python >./ setup.py install --prefix=foo" if you build two different Python >packages this way, they will each create an easy_install.pth file, >and then when you ask GNU stow to link the two resulting packages >into your system, it will say "You are asking me to install two >different packages which both claim that they need to write a file >named '/usr/local/lib/python2.5/site-packages/easy_install.pth'. 
Adding --record and --single-version-externally-managed to that command line will prevent the .pth file from being used or needed, although I believe you already know this. (What that mode won't do is install dependencies automatically.) From ncoghlan at gmail.com Sun May 10 23:51:32 2009 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 11 May 2009 07:51:32 +1000 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> Message-ID: <4A074C64.5070208@gmail.com> Benjamin Peterson wrote: > A while ago, Guido declared that all special method lookups on > new-style classes bypass __getattr__ and __getattribute__. This almost > completely consistent now, and I've been working on patching up a few > incorrect cases. I've know hit __enter__ and __exit__. The compiler > generates LOAD_ATTR instructions for these, so it uses the normal > lookup. The only way I can see to fix this is add a new opcode which > uses _PyObject_LookupSpecial, but I don't think we really care this > much. Opinions? As Georg pointed out, the expectation was that we would eventually add a SETUP_WITH opcode that used the special method lookup (and hopefully speed with statements up to a point where they're competitive with writing out the associated try statement directly). The current code is the way it is because there is no "LOAD_SPECIAL" opcode and adding type dereferencing logic to the expansion would have been difficult without a custom opcode. For other special methods that are looked up from Python code, the closest we can ever get is to bypass the instance (i.e. using "type(obj).__method__(obj, *args)") to avoid metaclass confusion. The type slots are even *more* special than that because they bypass __getattribute__ and __getattr__ even on the metaclass for speed reasons. There's a reason the docs already say that for a guaranteed override you *must* actually define the special method on the class rather than merely making it accessible via __getattr__ or even __getattribute__. The PyPy guys are right to think that some developer somewhere is going to rely on these implementation details in CPython at some point. However lots of developers rely on CPython ref counting as well, no matter how many times they're told not to do that if they want to support alternative interpreters. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- From fuzzyman at voidspace.org.uk Mon May 11 00:20:01 2009 From: fuzzyman at voidspace.org.uk (Michael Foord) Date: Sun, 10 May 2009 23:20:01 +0100 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: <4A074C64.5070208@gmail.com> References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> <4A074C64.5070208@gmail.com> Message-ID: <4A075311.2050208@voidspace.org.uk> Nick Coghlan wrote: > Benjamin Peterson wrote: > >> A while ago, Guido declared that all special method lookups on >> new-style classes bypass __getattr__ and __getattribute__. This almost >> completely consistent now, and I've been working on patching up a few >> incorrect cases. I've know hit __enter__ and __exit__. The compiler >> generates LOAD_ATTR instructions for these, so it uses the normal >> lookup. The only way I can see to fix this is add a new opcode which >> uses _PyObject_LookupSpecial, but I don't think we really care this >> much. 
Opinions? >> > > As Georg pointed out, the expectation was that we would eventually add a > SETUP_WITH opcode that used the special method lookup (and hopefully > speed with statements up to a point where they're competitive with > writing out the associated try statement directly). The current code is > the way it is because there is no "LOAD_SPECIAL" opcode and adding type > dereferencing logic to the expansion would have been difficult without a > custom opcode. > > For other special methods that are looked up from Python code, the > closest we can ever get is to bypass the instance (i.e. using > "type(obj).__method__(obj, *args)") to avoid metaclass confusion. The > type slots are even *more* special than that because they bypass > __getattribute__ and __getattr__ even on the metaclass for speed reasons. > > There's a reason the docs already say that for a guaranteed override you > *must* actually define the special method on the class rather than > merely making it accessible via __getattr__ or even __getattribute__. > > The PyPy guys are right to think that some developer somewhere is going > to rely on these implementation details in CPython at some point. > However lots of developers rely on CPython ref counting as well, no > matter how many times they're told not to do that if they want to > support alternative interpreters. > It's actually very annoying for things like writing Mock or proxy objects when this behaviour is inconsistent (sorry should have spoken up earlier). The Python interpreter bases some of its decisions on whether these methods exist at all - and when you have objects that provide methods through __getattr__ then you can accidentally get screwed if magic method lookup returns an object unexpectedly when it should have raised an AttributeError. Of course for proxy objects it might be more convenient if *all* attribute access did go through __getattr__ - but with that not the case it is much better for it to be consistent rather than have to put in specific workaround code. All the best, Michael > Cheers, > Nick. > > -- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog From google at mrabarnett.plus.com Mon May 11 00:50:40 2009 From: google at mrabarnett.plus.com (MRAB) Date: Sun, 10 May 2009 23:50:40 +0100 Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: <4A075311.2050208@voidspace.org.uk> References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> <4A074C64.5070208@gmail.com> <4A075311.2050208@voidspace.org.uk> Message-ID: <4A075A40.1080407@mrabarnett.plus.com> Michael Foord wrote: > Nick Coghlan wrote: >> Benjamin Peterson wrote: >> >>> A while ago, Guido declared that all special method lookups on >>> new-style classes bypass __getattr__ and __getattribute__. This almost >>> completely consistent now, and I've been working on patching up a few >>> incorrect cases. I've know hit __enter__ and __exit__. The compiler >>> generates LOAD_ATTR instructions for these, so it uses the normal >>> lookup. The only way I can see to fix this is add a new opcode which >>> uses _PyObject_LookupSpecial, but I don't think we really care this >>> much. Opinions? >>> >> >> As Georg pointed out, the expectation was that we would eventually add a >> SETUP_WITH opcode that used the special method lookup (and hopefully >> speed with statements up to a point where they're competitive with >> writing out the associated try statement directly). 
The current code is >> the way it is because there is no "LOAD_SPECIAL" opcode and adding type >> dereferencing logic to the expansion would have been difficult without a >> custom opcode. >> >> For other special methods that are looked up from Python code, the >> closest we can ever get is to bypass the instance (i.e. using >> "type(obj).__method__(obj, *args)") to avoid metaclass confusion. The >> type slots are even *more* special than that because they bypass >> __getattribute__ and __getattr__ even on the metaclass for speed reasons. >> >> There's a reason the docs already say that for a guaranteed override you >> *must* actually define the special method on the class rather than >> merely making it accessible via __getattr__ or even __getattribute__. >> >> The PyPy guys are right to think that some developer somewhere is going >> to rely on these implementation details in CPython at some point. >> However lots of developers rely on CPython ref counting as well, no >> matter how many times they're told not to do that if they want to >> support alternative interpreters. >> > > It's actually very annoying for things like writing Mock or proxy > objects when this behaviour is inconsistent (sorry should have spoken up > earlier). > > The Python interpreter bases some of its decisions on whether these > methods exist at all - and when you have objects that provide methods > through __getattr__ then you can accidentally get screwed if magic > method lookup returns an object unexpectedly when it should have raised > an AttributeError. > > Of course for proxy objects it might be more convenient if *all* > attribute access did go through __getattr__ - but with that not the case > it is much better for it to be consistent rather than have to put in > specific workaround code. > Suggestion: have something like "from __future__" but affecting compile-time behaviour (like pragmas in some other languages), such as causing Python to generate bytecodes which perform all attribute access through __getattr__. From david.lyon at preisshare.net Mon May 11 03:32:11 2009 From: david.lyon at preisshare.net (David Lyon) Date: Sun, 10 May 2009 21:32:11 -0400 Subject: [Python-Dev] .pth files are evil In-Reply-To: <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> Message-ID: <4c8bd6707712f01ccf3841c2c26169ef@preisshare.net> On Sun, 10 May 2009 09:41:33 -0600, Zooko Wilcox-O'Hearn wrote: >> (Of course, this ignores the issue of uninstalling previous >> versions, or overwriting of conflicting files in the target -- does >> pip handle these?) > > GNU stow does handle these issues. I'm not sure GNU stow will handle the .PTH when deinstalling packages. In easy_install.PTH there will be a list of all the packages installed. This list really needs to be edited once a package is removed. The .PTH files are a really good part of python. Definitely nothing evil about them. 
David From giuott at gmail.com Mon May 11 14:26:49 2009 From: giuott at gmail.com (Giuseppe Ottaviano) Date: Mon, 11 May 2009 14:26:49 +0200 Subject: [Python-Dev] how GNU stow is complementary rather than alternative to distutils In-Reply-To: References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> <4A070C58.4070003@v.loewis.de> Message-ID: <6AA82018-2DCA-4F58-BEF9-28D021553247@gmail.com> Talking of stow, I take advantage of this thread to do some shameless advertising :) Recently I uploaded to PyPI a software of mine, BPT [1], which does the same symlinking trick of stow, but it is written in Python (and with a simple api) and, more importantly, it allows with another trick the relocation of the installation directory (it creates a semi- isolated environment, similar to virtualenv). I find it very convenient when I have to switch between several versions of the same packages (for example during development), or I have to deploy on the same machine software that needs different versions of the dependencies. I am planning to write an integration layer with buildout and easy_install. It should be very easy, since BPT can handle directly tarballs (and directories, in trunk) which contain a setup.py. HTH, Giuseppe [1] http://pypi.python.org/pypi/bpt P.S. I was not aware of stow, I'll add it to the references and see if there are any features that I can steal From aahz at pythoncraft.com Mon May 11 14:46:44 2009 From: aahz at pythoncraft.com (Aahz) Date: Mon, 11 May 2009 05:46:44 -0700 Subject: [Python-Dev] Switchover: mail.python.org Message-ID: <20090511124644.GB19400@panix.com> On Monday 2009-05-11, mail.python.org will be switched to another machine starting roughly at 14:00 UTC. This should be invisible (expected downtime is less than ten minutes). -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ "It is easier to optimize correct code than to correct optimized code." --Bill Harlan From fumanchu at aminus.org Mon May 11 18:53:51 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Mon, 11 May 2009 09:53:51 -0700 Subject: [Python-Dev] py3k, cgi, email, and form-data Message-ID: There's a major change in functionality in the cgi module between Python 2 and Python 3 which I've just run across: the behavior of FieldStorage.read_multi, specifically when an HTTP app accepts a file upload within a multipart/form-data payload. In Python 2, each part would be read in sequence within its own FieldStorage instance. 
This allowed file uploads to be shunted to a TemporaryFile (via make_file) as needed: klass = self.FieldStorageClass or self.__class__ part = klass(self.fp, {}, ib, environ, keep_blank_values, strict_parsing) # Throw first part away while not part.done: headers = rfc822.Message(self.fp) part = klass(self.fp, headers, ib, environ, keep_blank_values, strict_parsing) self.list.append(part) In Python 3 (svn revision 72466), the whole request body is read into memory first via fp.read(), and then broken into separate parts in a second step: klass = self.FieldStorageClass or self.__class__ parser = email.parser.FeedParser() # Create bogus content-type header for proper multipart parsing parser.feed('Content-Type: %s; boundary=%s\r\n\r\n' % (self.type, ib)) parser.feed(self.fp.read()) full_msg = parser.close() # Get subparts msgs = full_msg.get_payload() for msg in msgs: fp = StringIO(msg.get_payload()) part = klass(fp, msg, ib, environ, keep_blank_values, strict_parsing) self.list.append(part) This makes the cgi module in Python 3 somewhat crippled for handling multipart/form-data file uploads of any significant size (and since the client is the one determining the size, opens a server up for an unexpected Denial of Service vector). I *think* the FeedParser is designed to accept incremental writes, but I haven't yet found a way to do any kind of incremental reads from it in order to shunt the fp.read out to a tempfile again. I'm secretly hoping Barry has a one-liner fix for this. ;) Robert Brewer fumanchu at aminus.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From fumanchu at aminus.org Mon May 11 18:40:11 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Mon, 11 May 2009 09:40:11 -0700 Subject: [Python-Dev] py3k, cgi, and form-data Message-ID: <1242060011.19084.20.camel@haku> There's a major change in functionality in the cgi module between Python 2 and Python 3 which I've just run across: the behavior of FieldStorage.read_multi, specifically when an HTTP app accepts a file upload within a multipart/form-data payload. In Python 2, each part would be read in sequence within its own FieldStorage instance. This allowed file uploads to be shunted to a TemporaryFile (via make_file) as needed: klass = self.FieldStorageClass or self.__class__ part = klass(self.fp, {}, ib, environ, keep_blank_values, strict_parsing) # Throw first part away while not part.done: headers = rfc822.Message(self.fp) part = klass(self.fp, headers, ib, environ, keep_blank_values, strict_parsing) self.list.append(part) In Python 3 (svn revision 72466), the whole request body is read into memory first via fp.read(), and then broken into separate parts in a second step: klass = self.FieldStorageClass or self.__class__ parser = email.parser.FeedParser() # Create bogus content-type header for proper multipart parsing parser.feed('Content-Type: %s; boundary=%s\r\n\r\n' % (self.type, ib)) parser.feed(self.fp.read()) full_msg = parser.close() # Get subparts msgs = full_msg.get_payload() for msg in msgs: fp = StringIO(msg.get_payload()) part = klass(fp, msg, ib, environ, keep_blank_values, strict_parsing) self.list.append(part) This makes the cgi module in Python 3 somewhat crippled for handling multipart/form-data file uploads of any significant size (and since the client is the one determining the size, opens a server up for an unexpected Denial of Service vector). 
I *think* the FeedParser is designed to accept incremental writes, but I haven't yet found a way to do any kind of incremental reads from it in order to shunt the fp.read out to a tempfile again. I'm secretly hoping Barry has a one-liner fix for this. ;) Robert Brewer fumanchu at aminus.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From pje at telecommunity.com Mon May 11 18:35:58 2009 From: pje at telecommunity.com (P.J. Eby) Date: Mon, 11 May 2009 12:35:58 -0400 Subject: [Python-Dev] .pth files are evil In-Reply-To: <4A059639.7040505@v.loewis.de> References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> Message-ID: <20090511163321.984D53A4109@sparrow.telecommunity.com> At 04:42 PM 5/9/2009 +0200, Martin v. L?wis wrote: > >> If you always use --single-version-externally-managed with easy_install, > >> it will stop editing .pth files on installation. > > > > It's --multi-version (-m) that does that. > > --single-version-externally-managed is a "setup.py install" option. > > > > Both have the effect of not editing .pth files, but they do so in > > different ways. The "setup.py install" option causes it to install in a > > distutils-compatible layout, whereas --multi-version simply drops .egg > > files or directories in the target location and leaves it to the user > > (or the generated script wrappers) to add them to sys.path. > >Ah, ok. Is there also an easy_install invocation that unpacks the zip >file into some location of sys.path (which then wouldn't require >editing sys.path)? No; you'd have to use the -e option to easy_install to download and extract a source version of the package; then run that package's setup.py, e.g.: easy_install -eb /some/tmpdir SomeProject cd /some/tmpdir/someproject # subdir is always lowercased/normalized setup.py install --single-version-externally-managed --record=... I suspect that this is basically what pip is doing under the hood, as that would explain why it doesn't support .egg files. I previously posted code to the distutils-sig that was an .egg unpacker with appropriate renaming, though. It was untested, and assumes you already checked for collisions in the target directory, and that you're handling any uninstall manifest yourself. It could probably be modified to take a filter function, though, something like: def flatten_egg(egg_filename, extract_dir, filter=lambda s,d: d): eggbase = os.path.filename(egg_filename)+'-info' def file_filter(src, dst): if src.startswith('EGG-INFO/'): src = eggbase+s[8:] dst = os.path.join(extract_dir, *src.split('/')) return filter(src, dst) return unpack_archive(egg_filename, extract_dir, file_filter) Then you could pass in a None-returning filter function to check and accumulate collisions and generate a manifest. A second run with the default filter would do the unpacking. (This function should work with either .egg files or .egg directories as input, btw, since unpack_archive treats a directory input as if it were an archive.) 
Anyway, if you used "easy_install -mxd /some/tmpdir [specs]" to get your target eggs found/built, you could then run this flattening function (with appropriate filter functions) over the *.egg contents of /some/tmpdir to do the actual installation. (The reason for using -mxd instead of -Zmaxd or -zmaxd is that we don't care whether the eggs are zipped or not, and we leave out the -a so that dependencies already present on sys.path aren't copied or re-downloaded to the target; only dependencies we don't already have will get dropped in /some/tmpdir.) Of course, the devil of this is in the details; to handle conflicts and uninstalls properly you would need to know what namespace packages were in the eggs you are installing. But if you don't care about blindly overwriting things (as the distutils does not), then it's actually pretty easy to make such an unpacker. I mainly haven't made one myself because I *do* care about things being blindly overwritten. From asmodai at in-nomine.org Mon May 11 19:29:55 2009 From: asmodai at in-nomine.org (Jeroen Ruigrok van der Werven) Date: Mon, 11 May 2009 19:29:55 +0200 Subject: [Python-Dev] Switchover: mail.python.org In-Reply-To: <20090511124644.GB19400@panix.com> References: <20090511124644.GB19400@panix.com> Message-ID: <20090511172955.GT24353@nexus.in-nomine.org> -On [20090511 14:47], Aahz (aahz at pythoncraft.com) wrote: >On Monday 2009-05-11, mail.python.org will be switched to another machine >starting roughly at 14:00 UTC. This should be invisible (expected >downtime is less than ten minutes). The headers for the python checkins mails are apparently different now. So people might want to adjust any filtering. -- Jeroen Ruigrok van der Werven / asmodai ????? ?????? ??? ?? ?????? http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B The reverse side also has a reverse side... From cesare.dimauro at a-tono.com Mon May 11 20:00:16 2009 From: cesare.dimauro at a-tono.com (Cesare Di Mauro) Date: Mon, 11 May 2009 20:00:16 +0200 (CEST) Subject: [Python-Dev] A wordcode-based Python In-Reply-To: <4c8bd6707712f01ccf3841c2c26169ef@preisshare.net> References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> <4c8bd6707712f01ccf3841c2c26169ef@preisshare.net> Message-ID: <1024.88.149.182.147.1242064816.squirrel@webmail5.pair.com> At the last PyCon3 at Italy I've presented a new Python implementation, which you'll find at http://code.google.com/p/wpython/ WPython is a re-implementation of (some parts of) Python, which drops support for bytecode in favour of a wordcode-based model (where a is word is 16 bits wide). It also implements an hybrid stack-register virtual machine, and adds a lot of other optimizations. The slides are available in the download area, and explain the concept of wordcode, showing also how work some optimizations, comparing them with the current Python (2.6.1). 
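To give a rough idea of what "wordcode" means here (a deliberately simplified illustration; the real encoding and opcode numbering are described in the slides): each instruction is built from one or more 16-bit words, so an opcode and a small argument travel together in a single word instead of CPython's one-byte or three-byte sequences.

    # Simplified illustration only -- not the actual wpython format.
    def pack_word(opcode, arg):
        assert 0 <= opcode < 256 and 0 <= arg < 256
        return opcode | (arg << 8)   # opcode in the low byte, argument in the high byte

    def unpack_word(word):
        return word & 0xFF, word >> 8

    print(unpack_word(pack_word(0x10, 3)))   # -> (16, 3)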
Unfortunately I had not time to make extensive benchmarks with real code, so I've included some that I made with PyStone, PyBench, and a couple of simple recoursive function calls (Fibonacci and Factorial). This is the first release, and another two are scheduled; the first one to make it possibile to select (almost) any optimization to be compiled (so fine grained tests will be possibile). The latter will be a rewrite of the constant folding code (specifically for tuples, lists and dicts), removing a current "hack" to the python type system to make them "hashable" for the constants dictionary used by compile.c. Then I'll start writing some documentation that will explain what parts of code are related to a specific optimization, so that it'll be easier to create patches for other Python implementations, if needed. You'll find a bit more informations in the "README FIRST!" file present into the project's repository. I made so many changes to the source of Python 2.6.1, so feel free to ask me for any information about them. Cheers Cesare From google at mrabarnett.plus.com Mon May 11 20:28:20 2009 From: google at mrabarnett.plus.com (MRAB) Date: Mon, 11 May 2009 19:28:20 +0100 Subject: [Python-Dev] py3k, cgi, email, and form-data In-Reply-To: References: Message-ID: <4A086E44.3020409@mrabarnett.plus.com> Robert Brewer wrote: > There's a major change in functionality in the cgi module between Python > 2 and Python 3 which I've just run across: the behavior of > FieldStorage.read_multi, specifically when an HTTP app accepts a file > upload within a multipart/form-data payload. > > In Python 2, each part would be read in sequence within its own > FieldStorage instance. This allowed file uploads to be shunted to a > TemporaryFile (via make_file) as needed: > > klass = self.FieldStorageClass or self.__class__ > part = klass(self.fp, {}, ib, > environ, keep_blank_values, strict_parsing) > # Throw first part away > while not part.done: > headers = rfc822.Message(self.fp) > part = klass(self.fp, headers, ib, > environ, keep_blank_values, strict_parsing) > self.list.append(part) > > In Python 3 (svn revision 72466), the whole request body is read into > memory first via fp.read(), and then broken into separate parts in a > second step: > > klass = self.FieldStorageClass or self.__class__ > parser = email.parser.FeedParser() > # Create bogus content-type header for proper multipart parsing > parser.feed('Content-Type: %s; boundary=%s\r\n\r\n' % (self.type, ib)) > parser.feed(self.fp.read()) > full_msg = parser.close() > # Get subparts > msgs = full_msg.get_payload() > for msg in msgs: > fp = StringIO(msg.get_payload()) > part = klass(fp, msg, ib, environ, keep_blank_values, > strict_parsing) > self.list.append(part) > > This makes the cgi module in Python 3 somewhat crippled for handling > multipart/form-data file uploads of any significant size (and since > the client is the one determining the size, opens a server up for an > unexpected Denial of Service vector). > > I *think* the FeedParser is designed to accept incremental writes, > but I haven't yet found a way to do any kind of incremental reads > from it in order to shunt the fp.read out to a tempfile again. > I'm secretly hoping Barry has a one-liner fix for this. ;) > It think what it needs is for the email.parser.FeedParser class to have a feed_from_file() method, supported by the class BufferedSubFile. The BufferedSubFile class keeps an internal list of lines. 
Perhaps it could also have a list of files, so that when the list of lines becomes empty it can continue by reading lines from the files instead, dropping a file from the list when it reaches the end, something like this: [Module feedparser.py] ... class BufferedSubFile(object): ... def __init__(self): # The last partial line pushed into this object. self._partial = '' # The list of full, pushed lines, in reverse order self._lines = [] # The list of files. self._files = [] ... ... def readline(self): while not self._lines and self._files: data = self._files[0].read(MAX_DATA_SIZE) if data: self.push(data) else: del self._files[0] if not self._lines: if self._closed: return '' return NeedMoreData ... def push_file(self, data_file): """Push some new data from a file into this object.""" self._files.append(data_file) ... and then: ... class FeedParser: ... def feed(self, data): """Push more data into the parser.""" self._input.push(data) self._call_parse() def feed_from_file(self, data_file): """Push more data from a file into the parser.""" self._input.push_file(data_file) self._call_parse() ... From solipsis at pitrou.net Mon May 11 22:27:54 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 11 May 2009 20:27:54 +0000 (UTC) Subject: [Python-Dev] A wordcode-based Python References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> <4c8bd6707712f01ccf3841c2c26169ef@preisshare.net> <1024.88.149.182.147.1242064816.squirrel@webmail5.pair.com> Message-ID: Hi, > WPython is a re-implementation of (some parts of) Python, which drops > support for bytecode in favour of a wordcode-based model (where a is word > is 16 bits wide). This is great! Have you planned to port in to the py3k branch? Or, at least, to trunk? Some opcode and VM optimizations have gone in after 2.6 was released, although nothing as invasive as you did. About the CISC-y instructions, have you tried merging the fast and const arrays in frame objects? That way, you need less opcode space (since e.g. BINARY_ADD_FAST_FAST will cater with constants as well as local variables). Regards Antoine. 
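To make the merged-array idea concrete, a toy sketch of a single operand index space, with the fast locals first and the constants following (purely illustrative; how a real frame would actually lay this out is a separate question):

    def load_operand(index, fast_locals, consts):
        nlocals = len(fast_locals)
        if index < nlocals:
            return fast_locals[index]        # a fast local slot
        return consts[index - nlocals]       # a constant

    print(load_operand(2, ['x', 'y', 'z'], (1, 2, 3)))   # -> 'z' (local #2)
    print(load_operand(4, ['x', 'y', 'z'], (1, 2, 3)))   # -> 2   (constant #1)

With one shared index space, an opcode such as BINARY_ADD_FAST_FAST can name either kind of operand without needing separate opcode variants for locals and constants.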
From collinw at gmail.com Mon May 11 23:14:44 2009 From: collinw at gmail.com (Collin Winter) Date: Mon, 11 May 2009 14:14:44 -0700 Subject: [Python-Dev] A wordcode-based Python In-Reply-To: <1024.88.149.182.147.1242064816.squirrel@webmail5.pair.com> References: <49D4DA72.60401@v.loewis.de> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> <4c8bd6707712f01ccf3841c2c26169ef@preisshare.net> <1024.88.149.182.147.1242064816.squirrel@webmail5.pair.com> Message-ID: <43aa6ff70905111414n62d20099r9bb2b3ebd52a26ec@mail.gmail.com> Hi Cesare, On Mon, May 11, 2009 at 11:00 AM, Cesare Di Mauro wrote: > At the last PyCon3 at Italy I've presented a new Python implementation, > which you'll find at http://code.google.com/p/wpython/ Good to see some more attention on Python performance! There's quite a bit going on in your changes; do you have an optimization-by-optimization breakdown, to give an idea about how much performance each optimization gives? Looking over the slides, I see that you still need to implement functionality to make test_trace pass, for example; do you have a notion of how much performance it will cost to implement the rest of Python's semantics in these areas? Also, I checked out wpython at head to run Unladen Swallow's benchmarks against it, but it refuses to compile with either gcc 4.0.1 or 4.3.1 on Linux (fails in Python/ast.c). I can send you the build failures off-list, if you're interested. Thanks, Collin Winter From martin at v.loewis.de Mon May 11 23:26:16 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Mon, 11 May 2009 23:26:16 +0200 Subject: [Python-Dev] albatross backup Message-ID: <4A0897F8.4080803@v.loewis.de> Hi Sean, Can you please setup backup for albatross? I gave sudo permissions to the "jafo" user, which has the key jafo at guin.tummy.com authorized. I think the policy now is that root logins to albatross are not allowed. So what might work is this: Create an rsyncbackup user, and give it sudo permission to run rsync (any command line arguments). Put your backup pubkey into rsyncbackup's authorized_keys. Could that actually work? albatross admins: would that be an acceptable setup? As for volumes to backup: I think /srv needs regular backup. Not sure about any of the others (and neither sure what your current strategy is wrt. volumes on the other machines). Compared to /srv, everything else is peanuts, anyway. Regards, Martin P.S. I have removed ~root/.ssh/authorized_keys. It only contained my key, and root logins are disallowed, anyway. P.P.S. You can stop doing regular backups to bag. I think we should keep the machine one for a little while, then turn it off and keep it around for a further while, and then return it to XS4ALL; making a complete dump before returning it. 
From martin at v.loewis.de Tue May 12 00:13:16 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 12 May 2009 00:13:16 +0200 Subject: [Python-Dev] albatross backup In-Reply-To: <4A0897F8.4080803@v.loewis.de> References: <4A0897F8.4080803@v.loewis.de> Message-ID: <4A08A2FC.2020102@v.loewis.de> [please ignore this message - I sent it to the wrong mailing list] Regards, Martin From skip at pobox.com Tue May 12 05:18:25 2009 From: skip at pobox.com (skip at pobox.com) Date: Mon, 11 May 2009 22:18:25 -0500 Subject: [Python-Dev] albatross backup In-Reply-To: <4A0897F8.4080803@v.loewis.de> References: <4A0897F8.4080803@v.loewis.de> Message-ID: <18952.60033.47955.293386@montanaro.dyndns.org> Martin> As for volumes to backup: I think /srv needs regular backup. Martin> Not sure about any of the others .... Backup of /usr/local/spambayes-corpus would be very helpful. Skip From supreet.sethi at gmail.com Tue May 12 08:27:25 2009 From: supreet.sethi at gmail.com (s|s) Date: Tue, 12 May 2009 11:57:25 +0530 Subject: [Python-Dev] using help function in Py3k In-Reply-To: References: Message-ID: On Tue, May 5, 2009 at 7:13 PM, Daniel Stutzbach wrote: > On Tue, May 5, 2009 at 5:41 AM, s|s wrote: >> >> LookupError: unknown encoding: uft-8 > > uft-8? > > Looks like a variation of Issue 4540 (or a duplicate?? I can't tell) > Yes. It is the same issue. I don't think pydoc should be modified. In my humble opinion tests should exist in /usr/share or /usr/share/doc. > -- > Daniel Stutzbach, Ph.D. > President, Stutzbach Enterprises, LLC -- ~preet~ From cesare.dimauro at a-tono.com Tue May 12 08:42:19 2009 From: cesare.dimauro at a-tono.com (Cesare Di Mauro) Date: Tue, 12 May 2009 08:42:19 +0200 (CEST) Subject: [Python-Dev] A wordcode-based Python In-Reply-To: References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> <4c8bd6707712f01ccf3841c2c26169ef@preisshare.net> <1024.88.149.182.147.1242064816.squirrel@webmail5.pair.com> Message-ID: <4022.88.149.182.147.1242110539.squirrel@webmail5.pair.com> On Mon, May 11, 2009 10:27PM, Antoine Pitrou wrote: Hi Antoine > Hi, > >> WPython is a re-implementation of (some parts of) Python, which drops >> support for bytecode in favour of a wordcode-based model (where a is >> word >> is 16 bits wide). > > This is great! > Have you planned to port in to the py3k branch? Or, at least, to trunk? It was my idea too, but first I need to take a deep look at what parts of code are changed from 2.6 to 3.0. That's because I don't know how much work is required for this "forward" port. > Some opcode and VM optimizations have gone in after 2.6 was released, > although > nothing as invasive as you did. :-D Interesting. > About the CISC-y instructions, have you tried merging the fast and const > arrays > in frame objects? That way, you need less opcode space (since e.g. > BINARY_ADD_FAST_FAST will cater with constants as well as local > variables). > > Regards > > Antoine. 
It's an excellent idea, that needs exploration. Running my stats tools against all .py files found in Lib and Tools folders, I discovered that the maximum index used for fast/locals is 79, and 1853 for constants. So if I find a way to easily map locals first and constants following in the same array, your great idea can be implemented saving A LOT of opcodes and reducing ceval.c source code. I'll work on that after the two releases that I planned. Thanks for your precious suggestions! Cesare From cesare.dimauro at a-tono.com Tue May 12 08:54:01 2009 From: cesare.dimauro at a-tono.com (Cesare Di Mauro) Date: Tue, 12 May 2009 08:54:01 +0200 (CEST) Subject: [Python-Dev] A wordcode-based Python In-Reply-To: <43aa6ff70905111414n62d20099r9bb2b3ebd52a26ec@mail.gmail.com> References: <49D4DA72.60401@v.loewis.de> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> <4c8bd6707712f01ccf3841c2c26169ef@preisshare.net> <1024.88.149.182.147.1242064816.squirrel@webmail5.pair.com> <43aa6ff70905111414n62d20099r9bb2b3ebd52a26ec@mail.gmail.com> Message-ID: <4213.88.149.182.147.1242111241.squirrel@webmail5.pair.com> Hi Collin On Mon, May 11, 2009 11:14PM, Collin Winter wrote: > Hi Cesare, > > On Mon, May 11, 2009 at 11:00 AM, Cesare Di Mauro > wrote: >> At the last PyCon3 at Italy I've presented a new Python implementation, >> which you'll find at http://code.google.com/p/wpython/ > > Good to see some more attention on Python performance! There's quite a > bit going on in your changes; do you have an > optimization-by-optimization breakdown, to give an idea about how much > performance each optimization gives? I planned it in the next release that will come may be next week. I'll introduce some #DEFINEs and #IFs in the code, so that only specific optimizations will be enabled. > Looking over the slides, I see that you still need to implement > functionality to make test_trace pass, for example; do you have a > notion of how much performance it will cost to implement the rest of > Python's semantics in these areas? Very little. That's because there are only two tests on test_trace that don't pass. I think that the reason stays in the changes that I made in the loops. With my code SETUP_LOOP and POP_BREAK are completely removed, so the code in settrace will failt to recognize the loop and the virtual machine crashes. I'll fix it in the second release that I have planned. > Also, I checked out wpython at head to run Unladen Swallow's > benchmarks against it, but it refuses to compile with either gcc 4.0.1 > or 4.3.1 on Linux (fails in Python/ast.c). I can send you the build > failures off-list, if you're interested. > > Thanks, > Collin Winter I'm very interested, thanks. That's because I worked only on Windows machines, so I definitely need to test and fix it to let it run on any other platform. 
Cesare From solipsis at pitrou.net Tue May 12 13:40:29 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 12 May 2009 11:40:29 +0000 (UTC) Subject: [Python-Dev] A wordcode-based Python References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> <4c8bd6707712f01ccf3841c2c26169ef@preisshare.net> <1024.88.149.182.147.1242064816.squirrel@webmail5.pair.com> <4022.88.149.182.147.1242110539.squirrel@webmail5.pair.com> Message-ID: Hi Cesare, Cesare Di Mauro a-tono.com> writes: > > It was my idea too, but first I need to take a deep look at what parts > of code are changed from 2.6 to 3.0. > That's because I don't know how much work is required for this > "forward" port. If you have some questions or need some help, send me a message. Regards Antoine. From cesare.dimauro at a-tono.com Tue May 12 13:45:47 2009 From: cesare.dimauro at a-tono.com (Cesare Di Mauro) Date: Tue, 12 May 2009 13:45:47 +0200 (CEST) Subject: [Python-Dev] A wordcode-based Python In-Reply-To: References: <49D4DA72.60401@v.loewis.de> <20090407174355.B62983A4063@sparrow.telecommunity.com> <49E4A58F.70309@egenix.com> <20090414162603.70C843A4100@sparrow.telecommunity.com> <49E4F93B.6010802@egenix.com> <20090415003026.B0A783A4114@sparrow.telecommunity.com> <49E59202.6050809@egenix.com> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB22B5.3040507@simplistix.co.uk> <4A0590C4.1020904@v.loewis.de> <20090509143829.17F293A4080@sparrow.telecommunity.com> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> <4c8bd6707712f01ccf3841c2c26169ef@preisshare.net> <1024.88.149.182.147.1242064816.squirrel@webmail5.pair.com> <4022.88.149.182.147.1242110539.squirrel@webmail5.pair.com> Message-ID: <3846.88.149.182.147.1242128747.squirrel@webmail2.pair.com> On Thu, May 12, 2009 01:40PM, Antoine Pitrou wrote: > > Hi Cesare, > > Cesare Di Mauro a-tono.com> writes: >> >> It was my idea too, but first I need to take a deep look at what parts >> of code are changed from 2.6 to 3.0. >> That's because I don't know how much work is required for this >> "forward" port. > > If you have some questions or need some help, send me a message. > > Regards > > Antoine. OK, thanks. :) Another note. Fredrik Johansson let me note just few minutes ago that I've compiled my sources without PGO optimizations enabled. That's because I used Visual Studio Express Edition. So another gain in performances can be obtained. 
:) cheers Cesare From collinw at gmail.com Tue May 12 17:27:11 2009 From: collinw at gmail.com (Collin Winter) Date: Tue, 12 May 2009 08:27:11 -0700 Subject: [Python-Dev] A wordcode-based Python In-Reply-To: <3846.88.149.182.147.1242128747.squirrel@webmail2.pair.com> References: <49D4DA72.60401@v.loewis.de> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> <4c8bd6707712f01ccf3841c2c26169ef@preisshare.net> <1024.88.149.182.147.1242064816.squirrel@webmail5.pair.com> <4022.88.149.182.147.1242110539.squirrel@webmail5.pair.com> <3846.88.149.182.147.1242128747.squirrel@webmail2.pair.com> Message-ID: <43aa6ff70905120827n15c08468jb5fca2a19aa620fb@mail.gmail.com> On Tue, May 12, 2009 at 4:45 AM, Cesare Di Mauro wrote: > Another note. Fredrik Johansson let me note just few minutes ago that I've > compiled my sources without PGO optimizations enabled. > > That's because I used Visual Studio Express Edition. > > So another gain in performances can be obtained. :) FWIW, Unladen Swallow experimented with gcc 4.4's FDO and got an additional 10-30% (depending on the benchmark). The training load is important, though: some training sets offered better performance than others. I'd be interested in how MSVC's PGO compares to gcc's FDO in terms of overall effectiveness. The results for gcc FDO with our 2009Q1 release are at the bottom of http://code.google.com/p/unladen-swallow/wiki/Releases. Collin Winter From cesare.dimauro at a-tono.com Tue May 12 18:41:45 2009 From: cesare.dimauro at a-tono.com (Cesare Di Mauro) Date: Tue, 12 May 2009 18:41:45 +0200 (CEST) Subject: [Python-Dev] A wordcode-based Python In-Reply-To: <43aa6ff70905120827n15c08468jb5fca2a19aa620fb@mail.gmail.com> References: <49D4DA72.60401@v.loewis.de> <4A059639.7040505@v.loewis.de> <20090509153716.D44633A4080@sparrow.telecommunity.com> <7FF9D9A9-211E-4E5D-BDD0-9C0315123975@zooko.com> <4c8bd6707712f01ccf3841c2c26169ef@preisshare.net> <1024.88.149.182.147.1242064816.squirrel@webmail5.pair.com> <4022.88.149.182.147.1242110539.squirrel@webmail5.pair.com> <3846.88.149.182.147.1242128747.squirrel@webmail2.pair.com> <43aa6ff70905120827n15c08468jb5fca2a19aa620fb@mail.gmail.com> Message-ID: <2743.88.149.182.147.1242146505.squirrel@webmail2.pair.com> On Tue, May 12, 2009 05:27 PM, Collin Winter wrote: > On Tue, May 12, 2009 at 4:45 AM, Cesare Di Mauro > wrote: >> Another note. Fredrik Johansson let me note just few minutes ago that >> I've >> compiled my sources without PGO optimizations enabled. >> >> That's because I used Visual Studio Express Edition. >> >> So another gain in performances can be obtained. :) > > FWIW, Unladen Swallow experimented with gcc 4.4's FDO and got an > additional 10-30% (depending on the benchmark). The training load is > important, though: some training sets offered better performance than > others. I'd be interested in how MSVC's PGO compares to gcc's FDO in > terms of overall effectiveness. The results for gcc FDO with our > 2009Q1 release are at the bottom of > http://code.google.com/p/unladen-swallow/wiki/Releases. > > Collin Winter Unfortunately I can't test PGO, since I use the Express Editions of VS. May be Martin or othe mainteners of the Windows versions can help here. However it'll be difficult to find a good enough profile for the binaries distributed for the official Python. FDO brings to quite different results based on the profile selected. 
cheers, Cesare From asmodai at in-nomine.org Tue May 12 18:43:55 2009 From: asmodai at in-nomine.org (Jeroen Ruigrok van der Werven) Date: Tue, 12 May 2009 18:43:55 +0200 Subject: [Python-Dev] Switchover: mail.python.org In-Reply-To: <0B2CACDB-4C60-4038-91F1-235E7FBD5E37@python.org> References: <20090511124644.GB19400@panix.com> <20090511172955.GT24353@nexus.in-nomine.org> <0B2CACDB-4C60-4038-91F1-235E7FBD5E37@python.org> Message-ID: <20090512164355.GY24353@nexus.in-nomine.org> -On [20090512 18:41], Barry Warsaw (barry at python.org) wrote: >Somehow, personalization got turned off for python-checkins. This >disables VERPing of the headers. I've turned it back on, so please >let me know if that fixes the issue. This did not appear to happen >site-wide, just for python-checkins AFAICT. Yes, the current batches are arriving with personilization again. I don't mind either way, just thought a heads up was warranted. ;) Thanks Barry, -- Jeroen Ruigrok van der Werven / asmodai ????? ?????? ??? ?? ?????? http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B The Idea does not replace the work... From barry at python.org Tue May 12 18:41:19 2009 From: barry at python.org (Barry Warsaw) Date: Tue, 12 May 2009 12:41:19 -0400 Subject: [Python-Dev] Switchover: mail.python.org In-Reply-To: <20090511172955.GT24353@nexus.in-nomine.org> References: <20090511124644.GB19400@panix.com> <20090511172955.GT24353@nexus.in-nomine.org> Message-ID: <0B2CACDB-4C60-4038-91F1-235E7FBD5E37@python.org> On May 11, 2009, at 1:29 PM, Jeroen Ruigrok van der Werven wrote: > -On [20090511 14:47], Aahz (aahz at pythoncraft.com) wrote: >> On Monday 2009-05-11, mail.python.org will be switched to another >> machine >> starting roughly at 14:00 UTC. This should be invisible (expected >> downtime is less than ten minutes). > > The headers for the python checkins mails are apparently different > now. So > people might want to adjust any filtering. Somehow, personalization got turned off for python-checkins. This disables VERPing of the headers. I've turned it back on, so please let me know if that fixes the issue. This did not appear to happen site-wide, just for python-checkins AFAICT. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From jmatejek at suse.cz Tue May 12 20:42:52 2009 From: jmatejek at suse.cz (=?ISO-8859-1?Q?Jan_Mate=28jek?=) Date: Tue, 12 May 2009 20:42:52 +0200 Subject: [Python-Dev] CVE-2008-5983 "untrusted python modules search path" In-Reply-To: References: Message-ID: <4A09C32C.30200@suse.cz> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Antoine Pitrou napsal(a): > Hello, > > I don't think it has already posted to the list, apologies if it has. > > Some Linux tools and vendors have been hit by an alleged "security hole" where > an embedded Python interpreter will prepend the current working directory to > sys.path as soon as PySys_SetArgv() is called by the embedding application. This > means, for example, that a Python file in the working directory can break > plugins or extensions written for that application if the Python file happens to > shadow another module. > > Regardless of whether this is a security hole or not, it certainly can make > things disturbingly surprising when the situation arises. 
In the bug report > (http://bugs.python.org/issue5753), I suggested we add a new function > PySys_SetArgvEx() which would take an additional parameter telling whether to > touch sys.path or not (in the same spirit as Py_InitializeEx() providing a more > flexible API than Py_Initialize()). > > On the other hand, I don't think we can change the default behaviour of > PySys_SetArgv(), since there are probably tools and applications relying on it > (the obvious use case which comes to my mind is a third-party interactive > interpreter). > > Any opinions? yes! Actually, i wanted to propose and implement something like this back when this vulnerability appeared, but i never got to it. I'd propose to create a whole new function, called, say, PySys_FillArgv() (no, i don't think that's a very good name) that would - -only- fill sys.argv and not touch sys.path. In addition to that, there would be a function like PySys_SetScriptPath() that would not fill sys.argv, but prepend the script's directory to sys.path Then i'd reimplement PySys_SetArgv as { PySys_FillArgv(); PySys_SetScriptPath(); } And as a final killing step, i would never ever mention PySys_SetArgv anywhere but in its own documentation ;e) And especially not in the first page of "Embedding Python". My rationale is that the only application deliberately using PySys_SetArgv the way it's written is a Python interpreter. For that, it's desirable to have '.' in sys.path _when no script is being executed_. For *all other applications*, this makes no sense ;e) regards m. > > Regards > > Antoine. > > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/jmatejek%40suse.cz -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.11 (GNU/Linux) Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org iEYEARECAAYFAkoJwywACgkQjBrWA+AvBr8UQwCgmLdu+aq9pYUxbSn/7i7hF1dK lw0AnRo0UCBbszxtzeXNcmmdO7d9sYx4 =0tU7 -----END PGP SIGNATURE----- From solipsis at pitrou.net Wed May 13 00:06:12 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 12 May 2009 22:06:12 +0000 (UTC) Subject: [Python-Dev] Shorter release schedule? Message-ID: Hello, Just food for thought here, but seeing how 3.1 is going to be a real featureful schedule despite being released shortly after 3.0, wouldn't it make sense to tighten future release planning a little? I was thinking something like doing a major release every 12 months (rather than 18 to 24 months as has been heuristically the case lately). This could also imply switching to some kind of loosely time-based release system. If I'm wildly off-base, you can either flame me, ignore me, or assign me annoying release blockers involving memoryviews and weird character encodings :-) Regards Antoine. From google at mrabarnett.plus.com Wed May 13 00:25:26 2009 From: google at mrabarnett.plus.com (MRAB) Date: Tue, 12 May 2009 23:25:26 +0100 Subject: [Python-Dev] Shorter release schedule? In-Reply-To: References: Message-ID: <4A09F756.8000005@mrabarnett.plus.com> Antoine Pitrou wrote: > Hello, > > Just food for thought here, but seeing how 3.1 is going to be a real featureful > schedule despite being released shortly after 3.0, wouldn't it make sense to > tighten future release planning a little? I was thinking something like doing a > major release every 12 months (rather than 18 to 24 months as has been > heuristically the case lately). 
This could also imply switching to some kind of > loosely time-based release system. > > If I'm wildly off-base, you can either flame me, ignore me, or assign me > annoying release blockers involving memoryviews and weird character encodings :-) > Next you'll be saying that they should be named after years. Python 2010, anyone? :-) I think that releases should depend on whether there are enough changes for one. From solipsis at pitrou.net Wed May 13 00:29:23 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 12 May 2009 22:29:23 +0000 (UTC) Subject: [Python-Dev] Shorter release schedule? References: <4A09F756.8000005@mrabarnett.plus.com> Message-ID: MRAB mrabarnett.plus.com> writes: > Next you'll be saying that they should be named after years. Python > 2010, anyone? After py3k, that would be a regression ;) cheers Antoine. From barry at python.org Wed May 13 00:29:43 2009 From: barry at python.org (Barry Warsaw) Date: Tue, 12 May 2009 18:29:43 -0400 Subject: [Python-Dev] Shorter release schedule? In-Reply-To: References: Message-ID: On May 12, 2009, at 6:06 PM, Antoine Pitrou wrote: > Just food for thought here, but seeing how 3.1 is going to be a real > featureful > schedule despite being released shortly after 3.0, wouldn't it make > sense to > tighten future release planning a little? I was thinking something > like doing a > major release every 12 months (rather than 18 to 24 months as has been > heuristically the case lately). This could also imply switching to > some kind of > loosely time-based release system. > > If I'm wildly off-base, you can either flame me, ignore me, or > assign me > annoying release blockers involving memoryviews and weird character > encodings :-) I've been in favor of that for a while now. With the move to a DVCS (how's that coming along?) I think we can have more solid, releasable trunks for longer in the cycle. Then, we'd have feature branches which wouldn't land in trunk until they too are solid and complete (with docs and tests). If a particular feature doesn't make it, it'll just wait until the next release, which would be only 12 months off instead of almost 2 years off. -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 304 bytes Desc: This is a digitally signed message part URL: From collinw at gmail.com Wed May 13 00:35:31 2009 From: collinw at gmail.com (Collin Winter) Date: Tue, 12 May 2009 15:35:31 -0700 Subject: [Python-Dev] Shorter release schedule? In-Reply-To: References: Message-ID: <43aa6ff70905121535j3b930918gdb1ce2d9995427d3@mail.gmail.com> On Tue, May 12, 2009 at 3:06 PM, Antoine Pitrou wrote: > Hello, > > Just food for thought here, but seeing how 3.1 is going to be a real featureful > schedule despite being released shortly after 3.0, wouldn't it make sense to > tighten future release planning a little? I was thinking something like doing a > major release every 12 months (rather than 18 to 24 months as has been > heuristically the case lately). This could also imply switching to some kind of > loosely time-based release system. I'd be in favor of a shorter, 12-month release cycle. I think the limiting resource would be the time and energy of the release managers and the package builders for Windows, etc. Provided it's not a tax on the release staff, I think shorter release cycles would be a benefit to the community. 
My own experience with time-based releases at work is that it greatly helps focus energy and attention, knowing that you can't simply delay the release if you slack off on your features/bugs. Collin From martin at v.loewis.de Wed May 13 05:26:08 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 13 May 2009 05:26:08 +0200 Subject: [Python-Dev] Shorter release schedule? In-Reply-To: References: Message-ID: <4A0A3DD0.8080806@v.loewis.de> > Just food for thought here, but seeing how 3.1 is going to be a real featureful > schedule despite being released shortly after 3.0, wouldn't it make sense to > tighten future release planning a little? Do you have any specific releases in mind that you would like to apply such a tightened schedule to? > I was thinking something like doing a > major release every 12 months (rather than 18 to 24 months as has been > heuristically the case lately). Such a schedule was initially used for the first 2.x releases. We then switched to 18 months because of user complaints: if releases come too frequently, the users are confused as to what release they should be using. Even 24 months is too frequently for some: some people are only starting to move to 2.5 right now - when we have stopped maintaining it already. One question is what would happen to the old releases: would we still maintain them? If so, how many of them? For how long? Regards, Martin From fumanchu at aminus.org Wed May 13 05:43:21 2009 From: fumanchu at aminus.org (Robert Brewer) Date: Tue, 12 May 2009 20:43:21 -0700 Subject: [Python-Dev] [Web-SIG] py3k, cgi, email, and form-data In-Reply-To: <88e286470905121933i6b9dcffj82446098990224cc@mail.gmail.com> References: <88e286470905121933i6b9dcffj82446098990224cc@mail.gmail.com> Message-ID: Graham Dumpleton wrote: > 2009/5/12 Robert Brewer : > > There's a major change in functionality in the cgi module between > Python > > 2 and Python 3 which I've just run across: the behavior of > > FieldStorage.read_multi, specifically when an HTTP app accepts a file > > upload within a multipart/form-data payload. > > > > In Python 2, each part would be read in sequence within its own > > FieldStorage instance. This allowed file uploads to be shunted to a > > TemporaryFile (via make_file) as needed: > > > > ??? klass = self.FieldStorageClass or self.__class__ > > ??? part = klass(self.fp, {}, ib, > > ???????????????? environ, keep_blank_values, strict_parsing) > > ??? # Throw first part away > > ??? while not part.done: > > ??????? headers = rfc822.Message(self.fp) > > ??????? part = klass(self.fp, headers, ib, > > ???????????????????? environ, keep_blank_values, strict_parsing) > > ??????? self.list.append(part) > > > > In Python 3 (svn revision 72466), the whole request body is read into > > memory first via fp.read(), and then broken into separate parts in a > > second step: > > > > ??? klass = self.FieldStorageClass or self.__class__ > > ??? parser = email.parser.FeedParser() > > ??? # Create bogus content-type header for proper multipart parsing > > ??? parser.feed('Content-Type: %s; boundary=%s\r\n\r\n' % (self.type, > ib)) > > ??? parser.feed(self.fp.read()) > > ??? full_msg = parser.close() > > ??? # Get subparts > > ??? msgs = full_msg.get_payload() > > ??? for msg in msgs: > > ??????? fp = StringIO(msg.get_payload()) > > ??????? part = klass(fp, msg, ib, environ, keep_blank_values, > > ???????????????????? strict_parsing) > > ??????? 
self.list.append(part) > > > > This makes the cgi module in Python 3 somewhat crippled for handling > > multipart/form-data file uploads of any significant size (and since > > the client is the one determining the size, opens a server up for an > > unexpected Denial of Service vector). > > > > I *think* the FeedParser is designed to accept incremental writes, > > but I haven't yet found a way to do any kind of incremental reads > > from it in order to shunt the fp.read out to a tempfile again. > > I'm secretly hoping Barry has a one-liner fix for this. ;) > > FWIW, Werkzeug gave up on 'cgi' module for form passing and implements > its own. > > Not sure whether this issue in Python 3.0 was one of the reasons or > not. I know one of the reasons was because cgi.FieldStorage is not > WSGI 1.0 compliant. One of the main reasons that no one actually > adheres to WSGI 1.0 is because of the 'cgi' module. This still hasn't > been addressed by a proper amendment to WSGI 1.0 specification or a > new WSGI 1.1 specification to allow a hint to readline(). > > The Werkzeug form processing module is properly WSGI 1.0 compliant, > meaning that Wekzeug is possibly the only major WSGI framework to be > WSGI compliant. FWIW, I just added a replacement for the cgi module to CherryPy over the weekend for the same reasons. It's in the python3 branch but will get backported to CherryPy 3.2 for Python 2.x. Robert Brewer fumanchu at aminus.org From greg.ewing at canterbury.ac.nz Wed May 13 03:54:23 2009 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 13 May 2009 13:54:23 +1200 Subject: [Python-Dev] Shorter release schedule? In-Reply-To: <4A09F756.8000005@mrabarnett.plus.com> References: <4A09F756.8000005@mrabarnett.plus.com> Message-ID: <4A0A284F.8040907@canterbury.ac.nz> MRAB wrote: > Next you'll be saying that they should be named after years. Python > 2010, anyone? :-) To keep people on their toes, we should switch to a completely random new naming scheme with every release, like Microsoft has been doing with Windows. -- Greg From graham.dumpleton at gmail.com Wed May 13 04:33:02 2009 From: graham.dumpleton at gmail.com (Graham Dumpleton) Date: Wed, 13 May 2009 12:33:02 +1000 Subject: [Python-Dev] [Web-SIG] py3k, cgi, email, and form-data In-Reply-To: References: Message-ID: <88e286470905121933i6b9dcffj82446098990224cc@mail.gmail.com> 2009/5/12 Robert Brewer : > There's a major change in functionality in the cgi module between Python > 2 and Python 3 which I've just run across: the behavior of > FieldStorage.read_multi, specifically when an HTTP app accepts a file > upload within a multipart/form-data payload. > > In Python 2, each part would be read in sequence within its own > FieldStorage instance. This allowed file uploads to be shunted to a > TemporaryFile (via make_file) as needed: > > ??? klass = self.FieldStorageClass or self.__class__ > ??? part = klass(self.fp, {}, ib, > ???????????????? environ, keep_blank_values, strict_parsing) > ??? # Throw first part away > ??? while not part.done: > ??????? headers = rfc822.Message(self.fp) > ??????? part = klass(self.fp, headers, ib, > ???????????????????? environ, keep_blank_values, strict_parsing) > ??????? self.list.append(part) > > In Python 3 (svn revision 72466), the whole request body is read into > memory first via fp.read(), and then broken into separate parts in a > second step: > > ??? klass = self.FieldStorageClass or self.__class__ > ??? parser = email.parser.FeedParser() > ??? 
# Create bogus content-type header for proper multipart parsing > ??? parser.feed('Content-Type: %s; boundary=%s\r\n\r\n' % (self.type, ib)) > ??? parser.feed(self.fp.read()) > ??? full_msg = parser.close() > ??? # Get subparts > ??? msgs = full_msg.get_payload() > ??? for msg in msgs: > ??????? fp = StringIO(msg.get_payload()) > ??????? part = klass(fp, msg, ib, environ, keep_blank_values, > ???????????????????? strict_parsing) > ??????? self.list.append(part) > > This makes the cgi module in Python 3 somewhat crippled for handling > multipart/form-data file uploads of any significant size (and since > the client is the one determining the size, opens a server up for an > unexpected Denial of Service vector). > > I *think* the FeedParser is designed to accept incremental writes, > but I haven't yet found a way to do any kind of incremental reads > from it in order to shunt the fp.read out to a tempfile again. > I'm secretly hoping Barry has a one-liner fix for this. ;) FWIW, Werkzeug gave up on 'cgi' module for form passing and implements its own. Not sure whether this issue in Python 3.0 was one of the reasons or not. I know one of the reasons was because cgi.FieldStorage is not WSGI 1.0 compliant. One of the main reasons that no one actually adheres to WSGI 1.0 is because of the 'cgi' module. This still hasn't been addressed by a proper amendment to WSGI 1.0 specification or a new WSGI 1.1 specification to allow a hint to readline(). The Werkzeug form processing module is properly WSGI 1.0 compliant, meaning that Wekzeug is possibly the only major WSGI framework to be WSGI compliant. Graham From stephen at xemacs.org Wed May 13 06:12:27 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 13 May 2009 13:12:27 +0900 Subject: [Python-Dev] Shorter release schedule? In-Reply-To: References: Message-ID: <87predbpzo.fsf@uwakimon.sk.tsukuba.ac.jp> Antoine Pitrou writes: > Just food for thought here, but seeing how 3.1 is going to be a > real featureful schedule despite being released shortly after 3.0, > wouldn't it make sense to tighten future release planning a little? With all due respect, it's easy and natural to have a short, featureful release schedule immediately after a major release (or should I say "complete rewrite"?) The discussion should focus on what happens as people become relatively satisfied with the core, and development shifts to optimizations, (smallish) bug fixes, and features that appeal to specialized audiences. That is when both the costs and the benefits of a tighter and/or time-based releases appear. > I was thinking something like doing a major release every 12 months > (rather than 18 to 24 months as has been heuristically the case > lately). This could also imply switching to some kind of loosely > time-based release system. I don't wish to express an opinion on either of these, as I'm not even in a position to help with release blockers. But I do hope discussion will focus on the implications for Python 3.7, not Python 3.3. From tleeuwenburg at gmail.com Wed May 13 06:44:52 2009 From: tleeuwenburg at gmail.com (Tennessee Leeuwenburg) Date: Wed, 13 May 2009 14:44:52 +1000 Subject: [Python-Dev] Shorter release schedule? In-Reply-To: <4A0A3DD0.8080806@v.loewis.de> References: <4A0A3DD0.8080806@v.loewis.de> Message-ID: <43c8685c0905122144v41d43701jeee32a561a55f884@mail.gmail.com> On Wed, May 13, 2009 at 1:26 PM, "Martin v. 
L?wis" wrote: > > Just food for thought here, but seeing how 3.1 is going to be a real > featureful > > schedule despite being released shortly after 3.0, wouldn't it make sense > to > > tighten future release planning a little? > > Do you have any specific releases in mind that you would like to apply > such a tightened schedule to? > > > I was thinking something like doing a > > major release every 12 months (rather than 18 to 24 months as has been > > heuristically the case lately). > If I can just respond with a bit of feedback from my workplace, I'd say that slower is better. I'm grimacing as I write that :) because I personally love to be able to take advantage of the new capabilities in each release. Can I ask if there's any sense in pursuing a release schedule which is slow for whatever might be deemed the "most core modules" but faster for "less core modules"? This is really a response to my workplace environment. The pro of that is that it's a real example, but the con is that it may not be best practise :) Something else which would definitely be useful for me personally would be a kind of update egg which I could apply to, say, Python 3.0 to bring it up to 3.1 capabilities. Something that already happens now at work reasonably often is that on my PC I have Python 2.4, 2.5, 2.6 and 3.0 installed. I tend to develop under 2.6 from preference. However, server X only has 2.4 installed or worse, 2.3 which I don't even have. Recently I was bitten by this as my code was relying heavily on some functionality in datetime which had changed. I was faced with having to do some re-architecting that I really didn't want to do. Now, I don't know of course (I found another way around the issue), but suppose the changes to Python I needed were relatively cosmetic, i.e. the kind of thing I could maybe install into a virtualenv wrapper, then it would have been quite easy for me to run my scripts written for Python 2.6. To get to the point, I wonder if it would be possible to release new versions alongside a patch or egg which someone with only user-level privileges could use on a server to avoid being held back by a slower server update cycle. A more frequent update cycle would then be easier to deal with. More features would get out into use more quickly, while the pressures of the lowest-common-denominator would be eased. Just some thoughts... Regards, -Tennessee -------------- next part -------------- An HTML attachment was scrubbed... URL: From stephen at xemacs.org Wed May 13 08:10:48 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 13 May 2009 15:10:48 +0900 Subject: [Python-Dev] Shorter release schedule? In-Reply-To: <43c8685c0905122144v41d43701jeee32a561a55f884@mail.gmail.com> References: <4A0A3DD0.8080806@v.loewis.de> <43c8685c0905122144v41d43701jeee32a561a55f884@mail.gmail.com> Message-ID: <87my9hbkif.fsf@uwakimon.sk.tsukuba.ac.jp> Tennessee Leeuwenburg writes: > Can I ask if there's any sense in pursuing a release schedule which > is slow for whatever might be deemed the "most core modules" but > faster for "less core modules"? I think you need to be more specific about how many levels of "fast" there should be, and why some modules might be deemed more or less "core". For example, this is part of why bsddb (sp?) was removed from the stdlib, because its cycle is different from the core (it's heavily torqued by whatever upstream chooses to throw at it, so it has been the devil to test). 
If you're not familiar with the history, you might try searching the list for "bsddb 'Jesus Cea' stdlib" which should bring up relevant threads. (Make sure you spell the package name right, sorry if I got it wrong!) In short, the answer is "the stuff on a different cycle is already on PyPI". > Something else which would definitely be useful for me personally > would be a kind of update egg which I could apply to, say, Python > 3.0 to bring it up to 3.1 capabilities. But this would have to be considered on a per-feature basis. If that's possible for an individual feature (ie, doesn't involve changes to the interpreter or compiler), almost surely the feature "did hard time" in PyPI. So you can probably get some version there. OTOH, such an egg would have to contain only a subset of features. If there are interdependencies between that subset and those that can't be included, in some sense you will be creating a completely new and *untested* version of Python. So I think that most server admins would really want you to restrict to features you actually need, and therefore the "best" update-egg will be very application-specific. With the new features being proposed for dist-utils, I suppose you (or anybody who feels like doing so) could create a "namespace package" for such updates, pulling in the relevant modules from PyPI. Do you think that could work for you? (See the PEP 382 threads for more info; I think current discussion has moved to distutils-SIG). From dirkjan at ochtman.nl Wed May 13 09:39:17 2009 From: dirkjan at ochtman.nl (Dirkjan Ochtman) Date: Wed, 13 May 2009 09:39:17 +0200 Subject: [Python-Dev] Shorter release schedule? In-Reply-To: References: Message-ID: On Wed, May 13, 2009 at 12:29 AM, Barry Warsaw wrote: > I've been in favor of that for a while now. ?With the move to a DVCS (how's > that coming along?) I've been rewriting PEP 374 about the Mercurial migration. Will post here once it's ready for review. Cheers, Dirkjan From larry.bugbee at boeing.com Wed May 13 10:01:33 2009 From: larry.bugbee at boeing.com (Bugbee, Larry) Date: Wed, 13 May 2009 01:01:33 -0700 Subject: [Python-Dev] Shorter release schedule? In-Reply-To: References: Message-ID: <9418DB6C0B9D434190E54A78E931C3D10961D6C4@XCH-NW-7V1.nw.nos.boeing.com> >From the perspective of this application developer and prototyper... In general, releases should be more frequent when the language is less mature and perhaps lacking. With maturity one seeks stability and less frequency. Python is, for the most part, a mature language. I submit the issue is less a question of frequency, but more a question of the number and value of each of the new features. Too many new features added to a mature language begs the question of simplicity vs complexity. One of Python's original goals, if I recall correctly, was to keep life simple, to have executable psuedocode, be easy to learn and re-learn, and be able to quickly read and grok your code 6-12 months later. Ease of maintenance is a huge advantage of Python. From an application developer's perspective, too many confusing features and the language becomes more and more like C++ and APL. I submit Python is now at the point where new features must not be added just because they are cool, but because they indeed add significant value *without* compromising simplicity and the suite of "easy to" benefits. The alternative is to rethink the long-term goals for the language. That could have large unintended consequences. 
Larry From henning.vonbargen at arcor.de Wed May 13 10:34:55 2009 From: henning.vonbargen at arcor.de (henning.vonbargen at arcor.de) Date: Wed, 13 May 2009 10:34:55 +0200 (CEST) Subject: [Python-Dev] How to build Python 2.6.2 on HP-UX Itanium with thread support? Message-ID: <20497492.1242203695160.JavaMail.ngmail@webmail10.arcor-online.net> How to build Python 2.6.2 on HP-UX Itanium with thread support? Note: I know that the first address to post this question is comp.lang.python, but I posted this question a week ago on comp.lang.python (http://groups.google.com/group/comp.lang.python/browse_thread/thread/c7006ad8e5cf81e8) and unfortunately, I didn't receive any answers. According to Patch 1225212, at least Peter Kropf was able to get Python running with threading support on this platform, though AFAIK he was not using GCC. But I guess it should be possible with GCC as well. Is anyone able to confirm that Python (built with GCC) does or does not work with multi-threading on HP-UX Itanium? Is HP-UX Itanium a supported platform at all? BTW: A search for "supported platforms" at www.python.org does not help! And if it does work, which steps need to be taken to build it, e.g. other libraries/packages, environment variables, configure options, manual modifications? Henning From solipsis at pitrou.net Wed May 13 10:39:03 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 13 May 2009 08:39:03 +0000 (UTC) Subject: [Python-Dev] Shorter release schedule? References: <4A0A3DD0.8080806@v.loewis.de> Message-ID: Martin v. L?wis v.loewis.de> writes: > > Such a schedule was initially used for the first 2.x releases. We then > switched to 18 months because of user complaints: if releases come too > frequently, the users are confused as to what release they should be > using. Even 24 months is too frequently for some: some people are only > starting to move to 2.5 right now - when we have stopped maintaining > it already. Obviously, there are some users who value stability over everything else. While new language features are never critical and can easily be circumvented if you want your code to run on old Python versions, stdlib improvements can be more important for the average user. So perhaps the answer is the split that Brett proposed between core language and stdlib. > One question is what would happen to the old releases: would we still > maintain them? If so, how many of them? For how long? Yes, I realized that's one of the problems with this proposal. If we had to maintain more than one stable branch, it would become a burden. From eric at trueblade.com Wed May 13 10:57:29 2009 From: eric at trueblade.com (Eric Smith) Date: Wed, 13 May 2009 04:57:29 -0400 Subject: [Python-Dev] Shorter release schedule? In-Reply-To: References: <4A0A3DD0.8080806@v.loewis.de> Message-ID: <4A0A8B79.5070203@trueblade.com> Antoine Pitrou wrote: > Yes, I realized that's one of the problems with this proposal. If we had to > maintain more than one stable branch, it would become a burden. Agreed. And since we have 2.x and 3.x now, we already have that problem. I'd like to an acceleration of release schedules (if it even happens) come after 2.x is retired. From ncoghlan at gmail.com Wed May 13 14:31:17 2009 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 13 May 2009 22:31:17 +1000 Subject: [Python-Dev] Shorter release schedule? 
In-Reply-To: References: Message-ID: <4A0ABD95.2060008@gmail.com> Antoine Pitrou wrote: > Hello, > > Just food for thought here, but seeing how 3.1 is going to be a real featureful > schedule despite being released shortly after 3.0, wouldn't it make sense to > tighten future release planning a little? I was thinking something like doing a > major release every 12 months (rather than 18 to 24 months as has been > heuristically the case lately). This could also imply switching to some kind of > loosely time-based release system. > > If I'm wildly off-base, you can either flame me, ignore me, or assign me > annoying release blockers involving memoryviews and weird character encodings :-) I don't think a shorter release cycle makes sense for a programming language. It's already the case that even with 18+ month release cycles some end users will leapfrog releases (e.g. 2.2-> 2.4 -> 2.6) for their environments (speaking from experience there, although the 2.6 part is mere wishful thinking at this stage). It also seems to takes 6-12 months for the complaints about Windows binary compatibility to die down after each release (although that appears to be less of an issue since MS released Visual Studio Express). That said, the 3.1 to 3.2 spacing will probably be shorter than normal (i.e. around 12 months), simply because 3.1 is an "extra" release to iron out some of the major issues with 3.0. This will give 'normal' 18 month spacing for the 2.6 -> 2.7 gap. The other big factor to consider here is the duration of bug fix support for releases. With our policy of "current release and previous release are supported with bug fixes" and the 18-24 month release cycle, that means each release typically receives bug fix updates for 3-4 years. That's a reasonably period of time (and gives plenty of time to shake out even fairly thorny issues). If we were to switch to yearly releases, then either the support policy would have to change to at least "current release and the previous two releases" or we'd have to accept the fact that the support period for each release would now be no more than 2 years. Since 2 years strikes me as an unacceptably short period for maintenance, shorter release cycles would then lead directly to having to maintain more parallel branches (which doesn't strike me as a good use of developer effort). Standardising a time frame for major releases is a fine idea, but I don't think shortening that time frame to 12 months would be wise. Settling on 18 months would probably work though - those that crave stability can then use every alternate version and only upgrade every 3 years, as each branch would be maintained with general bug fixes for at least 3 years and security fixes for a further 3 years after that. I think 24 months would lead to too slow an overall development tempo though - the year-and-a-half approach feels to me like it would strike a better balance. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- From janssen at parc.com Wed May 13 19:08:41 2009 From: janssen at parc.com (Bill Janssen) Date: Wed, 13 May 2009 10:08:41 PDT Subject: [Python-Dev] Shorter release schedule? In-Reply-To: <4A0ABD95.2060008@gmail.com> References: <4A0ABD95.2060008@gmail.com> Message-ID: <4129.1242234521@parc.com> Nick Coghlan wrote: > Settling on 18 months would probably work though - those that crave > stability can then use every alternate version and only upgrade every 3 > years I wonder about that. 
Lots of people are forced to upgrade by new language features: decorators, list comprehensions, set literals, etc., that are required by external libraries that they use. One of the huge strengths of Python is the external library community. Interesting tension there... Bill From hagenf at CoLi.Uni-SB.DE Wed May 13 18:07:50 2009 From: hagenf at CoLi.Uni-SB.DE (=?UTF-8?B?SGFnZW4gRsO8cnN0ZW5hdQ==?=) Date: Wed, 13 May 2009 18:07:50 +0200 Subject: [Python-Dev] Should collections.Counter check for int? Message-ID: <4A0AF056.4090303@coli.uni-saarland.de> I just noticed that while the docs say that "Counts are allowed to be any integer value including zero or negative counts", collections.Counter doesn't perform any check on the types of count values. Instead, non-numerical values will lead to strange behaviour or exceptions later on: >>> c = collections.Counter({'a':'3', 'b':'20', 'c':'100'}) >>> c.most_common(2) [('a', '3'), ('b', '20')] >>> c+c Traceback (most recent call last): File "", line 1, in File "/local/hagenf/lib/python3.1/collections.py", line 467, in __add__ if newcount > 0: TypeError: unorderable types: str() > int() I'd prefer Counter to refuse non-numerical values right away as the present behaviour may hide bugs (e.g. a forgotten string->int conversion). Any opinions? (And what about negative values or floats?) - Hagen From martin at v.loewis.de Wed May 13 21:35:21 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 13 May 2009 21:35:21 +0200 Subject: [Python-Dev] How to build Python 2.6.2 on HP-UX Itanium with thread support? In-Reply-To: <20497492.1242203695160.JavaMail.ngmail@webmail10.arcor-online.net> References: <20497492.1242203695160.JavaMail.ngmail@webmail10.arcor-online.net> Message-ID: <4A0B20F9.1060706@v.loewis.de> > How to build Python 2.6.2 on HP-UX Itanium with thread support? > Note: I know that the first address to post this question is comp.lang.python, but > I posted this question a week ago on comp.lang.python > (http://groups.google.com/group/comp.lang.python/browse_thread/thread/c7006ad8e5cf81e8) > and unfortunately, I didn't receive any answers. That isn't sufficient reason to post to python-dev, though. > Is HP-UX Itanium a supported platform at all? Python does not have a single supported platform (*), so: no. (*) in the sense that anybody is providing "support" for it, ie. guarantees help in case somebody has problems. (**) HP-UX is not a platform that any of the regular Python contributors uses or tests on at a regular basis. Python contributors mostly use Linux, Windows, and OS X; some also use Solaris and *BSD. > And if it does work, which steps need to be taken to build it, > e.g. other libraries/packages, environment variables, > configure options, manual modifications? This really is out of scope for python-dev. In scope would be a proposal to apply a certain patch that you had to write Python work on HP-UX, and discussion whether this patch is the appropriate solution. Regards, Martin (**) There is, of course, ActiveState, which does provide binaries, including for HP-UX, so I suppose they support it - at least if you buy commercial support. From ajaksu at gmail.com Wed May 13 21:29:07 2009 From: ajaksu at gmail.com (Daniel Diniz) Date: Wed, 13 May 2009 16:29:07 -0300 Subject: [Python-Dev] How to build Python 2.6.2 on HP-UX Itanium with thread support? 
In-Reply-To: <20497492.1242203695160.JavaMail.ngmail@webmail10.arcor-online.net> References: <20497492.1242203695160.JavaMail.ngmail@webmail10.arcor-online.net> Message-ID: <2d75d7660905131229q6cd92088n7710281deb17d0eb@mail.gmail.com> Hi Henning, henning.vonbargen wrote: > How to build Python 2.6.2 on HP-UX Itanium with thread support? [snip bit about python-list] I can't give you directions, but if you can describe your issues I might be able to help. I'll respond in python-list, as I think this is OT for python-dev. > Is HP-UX Itanium a supported platform at all? > BTW: A search for "supported platforms" at www.python.org does not help! Now, this looks like python-dev material. PEP 11[0], the information in README[1] and the notes in the downloads pages[2] could be improved and updated. If someone has time to invest in this, a compatibility matrix would be very nice to have. Regards, Daniel [0] http://www.python.org/dev/peps/pep-0011/ [1] http://svn.python.org/view/python/trunk/README?revision=72107&view=markup [2] http://www.python.org/download/source/ and http://www.python.org/download/other/ From martin at v.loewis.de Wed May 13 21:52:30 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 13 May 2009 21:52:30 +0200 Subject: [Python-Dev] How to build Python 2.6.2 on HP-UX Itanium with thread support? In-Reply-To: <2d75d7660905131229q6cd92088n7710281deb17d0eb@mail.gmail.com> References: <20497492.1242203695160.JavaMail.ngmail@webmail10.arcor-online.net> <2d75d7660905131229q6cd92088n7710281deb17d0eb@mail.gmail.com> Message-ID: <4A0B24FE.6090907@v.loewis.de> > Now, this looks like python-dev material. PEP 11[0], the information > in README[1] and the notes in the downloads pages[2] could be > improved and updated. If someone has time to invest in this, a > compatibility matrix would be very nice to have. I don't think HP-UX is ready for PEP 11 yet. It may not work, but that's a bug that could be fixed if users would actually contribute fixes. Likewise, changes to README could be accepted if users contribute them. I'm not sure /download/source is really that useful - perhaps it would be best to remove it. As for /download/other - contributions are welcome. Regards, Martin From dripton at ripton.net Thu May 14 00:22:39 2009 From: dripton at ripton.net (David Ripton) Date: Wed, 13 May 2009 15:22:39 -0700 Subject: [Python-Dev] How to build Python 2.6.2 on HP-UX Itanium with thread support? In-Reply-To: <20497492.1242203695160.JavaMail.ngmail@webmail10.arcor-online.net> References: <20497492.1242203695160.JavaMail.ngmail@webmail10.arcor-online.net> Message-ID: <20090513222239.GB14178@vidar.dreamhost.com> On 2009.05.13 10:34:55 +0200, henning.vonbargen at arcor.de wrote: > How to build Python 2.6.2 on HP-UX Itanium with thread support? > Note: I know that the first address to post this question is comp.lang.python, but > I posted this question a week ago on comp.lang.python > (http://groups.google.com/group/comp.lang.python/browse_thread/thread/c7006ad8e5cf81e8) > and unfortunately, I didn't receive any answers. > > According to Patch 1225212, > at least Peter Kropf was able to get Python running with threading support > on this platform, though AFAIK he was not using GCC. > > But I guess it should be possible with GCC as well. > > Is anyone able to confirm that Python (built with GCC) > does or does not work with multi-threading on HP-UX Itanium? The good news: I did get Python 2.4.x working on HP-UX Itanium, with threading. The compiler was gcc 4.0.x. 
(I also tried building Python with aCC, but failed.) I remember building both 32-bit and 64-bit versions. I don't remember it being that hard. Used the source for the package at hpux.connect.org.uk as a starting point, since it had a lot of good porting tweaks, but it needed some further tweaking. (The main one I remember that is that the shared library extension for Itanium should be .so not .sl There were also a bunch of paths that required appending 32 or 64.) We used that build of Python in production, for very heavily multithreaded code, on multi-CPU boxes. Worked fine. AFAIK they're still using it. I'm not sure why the binary available at hpux.connect.org.uk has threading disabled. I suspect that some older version of HP/UX had pthread bugs that got fixed somewhere along the line. The bad news: I did this about 3.5 years ago, and I don't work there anymore, so I don't have access to that HP-UX hardware anymore, or to the notes I made when I was doing the port. So I can give you encouragement but not step-by-step instructions. Sorry. -- David Ripton dripton at ripton.net From aahz at pythoncraft.com Thu May 14 04:22:35 2009 From: aahz at pythoncraft.com (Aahz) Date: Wed, 13 May 2009 19:22:35 -0700 Subject: [Python-Dev] Should collections.Counter check for int? In-Reply-To: <4A0AF056.4090303@coli.uni-saarland.de> References: <4A0AF056.4090303@coli.uni-saarland.de> Message-ID: <20090514022235.GA28101@panix.com> On Wed, May 13, 2009, Hagen F?rstenau wrote: > > I'd prefer Counter to refuse non-numerical values right away as the > present behaviour may hide bugs (e.g. a forgotten string->int > conversion). Any opinions? (And what about negative values or floats?) Please file a report on bugs.python.org so that there's a record of this issue. -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ "It is easier to optimize correct code than to correct optimized code." --Bill Harlan From cesare.dimauro at a-tono.com Thu May 14 08:27:10 2009 From: cesare.dimauro at a-tono.com (Cesare Di Mauro) Date: Thu, 14 May 2009 08:27:10 +0200 (CEST) Subject: [Python-Dev] special method lookup: how much do we care? In-Reply-To: <4A074C64.5070208@gmail.com> References: <1afaf6160905081109w50b71c7albc4da21965087fdb@mail.gmail.com> <4A074C64.5070208@gmail.com> Message-ID: <4713.88.149.182.147.1242282430.squirrel@webmail2.pair.com> On Sun, May 10, 2009 11:51PM, Nick Coghlan wrote: > However lots of developers rely on CPython ref counting as well, no > matter how many times they're told not to do that if they want to > support alternative interpreters. > > Cheers, > Nick. >From socket.py: # Wrapper around platform socket objects. This implements # a platform-independent dup() functionality. The # implementation currently relies on reference counting # to close the underlying socket object. class _socketobject(object): You don't know how much time I've spent trying to understand why test_httpserver.py hanged indefinitely when I was experimenting with new opcodes in my VM. Cheers, Cesare From rdmurray at bitdance.com Thu May 14 19:30:13 2009 From: rdmurray at bitdance.com (R. David Murray) Date: Thu, 14 May 2009 13:30:13 -0400 (EDT) Subject: [Python-Dev] python -m test.regrtest should pass on an installed python Message-ID: For various reasons I happened to run 'python -m test.regrtest' on my Gentoo installed Python. For 2.5.4 only test_tarfile failed (it tries to write into the read-only installed test directory). On 2.6.2 test_tarfile passes, but other test suites, including test_distutils, do not. 
So this posting is a general reminder that the tests should not make assumptions about the writabilty of the test directory (or, for that matter, of the CWD). When I get time I'll file bugs on the particular failures I'm seeing, after I do an install from checkout. --David From ziade.tarek at gmail.com Fri May 15 00:21:43 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Fri, 15 May 2009 00:21:43 +0200 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure Message-ID: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> Hello I'm proposing this PEP, which has been discussed in Distutils-SIG, for inclusion in Python 2.7 and 3.2 http://www.python.org/dev/peps/pep-0376/ Please comment ! Tarek -- Tarek Ziad? | http://ziade.org From pje at telecommunity.com Fri May 15 07:00:55 2009 From: pje at telecommunity.com (P.J. Eby) Date: Fri, 15 May 2009 01:00:55 -0400 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure Message-ID: <20090515045815.DE1F23A4061@sparrow.telecommunity.com> At 12:21 AM 5/15/2009 +0200, Tarek Ziad? wrote: >Hello > >I'm proposing this PEP, which has been discussed in Distutils-SIG, for >inclusion in Python 2.7 and 3.2 > >http://www.python.org/dev/peps/pep-0376/ > >Please comment ! I'd like to reiterate my suggestion that the uninstall record include size and checksum information, ala PEP 262's "FILES" section. This would allow the uninstall function to validate whether a file has been modified, and thus prevent uninstalling a locally-modified file, or a file installed in some other way. It may also be that providing an uninstall API that simply yields files to be uninstalled, with data about their existence/modification status, would be more useful than a blind uninstall operation with a filter function. Also, the PEP doesn't document what happens if a single file was installed by more than one package. Ideally, a file with identical size/checksum that belongs to more than one project should be silently left alone, and a file installed by more than one project with *different* size/checksum should be warned about and left alone. Next, the doc for the metadata API functions seems quite sparse. ISTR that I've previously commented on such issues as case- and punctuation-insensitivity of project names, and '/' separation in egg_info subpaths, but these don't seem to have been incorporated into the current version of the PEP. These are important considerations in general, btw, because project name and version canonicalization and escaping are an important part of both generating and parsing .egg-info filenemaes. At minimum, the relevant setuptools docs that define these standards should be cited. Finally, the "Definitions" section also claims that a project installs one or more packages, but a project may not contain *any* packages; it may have a standalone module, or just a script, data, or metadata. From asmodai at in-nomine.org Fri May 15 08:32:20 2009 From: asmodai at in-nomine.org (Jeroen Ruigrok van der Werven) Date: Fri, 15 May 2009 08:32:20 +0200 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <20090515045815.DE1F23A4061@sparrow.telecommunity.com> References: <20090515045815.DE1F23A4061@sparrow.telecommunity.com> Message-ID: <20090515063220.GD24353@nexus.in-nomine.org> -On [20090515 06:59], P.J. Eby (pje at telecommunity.com) wrote: >I'd like to reiterate my suggestion that the uninstall record include >size and checksum information, ala PEP 262's "FILES" section. 
This >would allow the uninstall function to validate whether a file has >been modified, and thus prevent uninstalling a locally-modified file, >or a file installed in some other way. Agreed. Within FreeBSD's ports the installed package registration gets a MD5 hash per file recorded. Size is less interesting though, since essentially this information is encapsulated within the hash. Remove one byte from the file and your hash is already different. And the case of a collision for this kind of registration is sufficiently small to need the size information. And if you're worried about the MD5 collision space, which for this use case ought to be large enough, you could always settle for SHA1. -- Jeroen Ruigrok van der Werven / asmodai ????? ?????? ??? ?? ?????? http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B What's one man's poison, is another's meat or drink... From ziade.tarek at gmail.com Fri May 15 08:32:29 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Fri, 15 May 2009 08:32:29 +0200 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <20090515045815.DE1F23A4061@sparrow.telecommunity.com> References: <20090515045815.DE1F23A4061@sparrow.telecommunity.com> Message-ID: <94bdd2610905142332k38595ff4l4e478faf2ca43d25@mail.gmail.com> 2009/5/15 P.J. Eby : > At 12:21 AM 5/15/2009 +0200, Tarek Ziad? wrote: >> >> Hello >> >> I'm proposing this PEP, which has been discussed in Distutils-SIG, for >> inclusion in Python 2.7 and 3.2 >> >> http://www.python.org/dev/peps/pep-0376/ >> >> Please comment ! > > I'd like to reiterate my suggestion that the uninstall record include size > and checksum information, ala PEP 262's "FILES" section. ?This would allow > the uninstall function to validate whether a file has been modified, and > thus prevent uninstalling a locally-modified file, or a file installed in > some other way. good point, I'll re-work that part > > It may also be that providing an uninstall API that simply yields files to > be uninstalled, with data about their existence/modification status, would > be more useful than a blind uninstall operation with a filter function. Sure we could have it in that shape, I'll work on this as well. > > Also, the PEP doesn't document what happens if a single file was installed > by more than one package. It does: "...as long as they are not mentioned in another RECORD file..." > ?Ideally, a file with identical size/checksum that > belongs to more than one project should be silently left alone, and a file > installed by more than one project with *different* size/checksum should be > warned about and left alone. I think the path is the info that should be looked at. And a warning could be raised like you said if a file was manually modified. But I don't think you want to leave alone a file with identical size/checksum that belongs to more than one project when it's not the same absolute path. Here's an example why : if two different packages includes the "feedparser.py" module (from the FeedParser project) for conveniency, and if you remove one package, you *do* want to remove its "feeparser.py" module even if it exists in the other project. So it's rather changing the PEP text like this: "...as long as they are not mentioned in another RECORD file, with the same size/checksum..." > > Next, the doc for the metadata API functions seems quite sparse. 
?ISTR that > I've previously commented on such issues as case- and > punctuation-insensitivity of project names, and '/' separation in egg_info > subpaths, but these don't seem to have been incorporated into the current > version of the PEP. > > These are important considerations in general, btw, because project name and > version canonicalization and escaping are an important part of both > generating and parsing .egg-info filenemaes. ?At minimum, the relevant > setuptools docs that define these standards should be cited. I'll add more info on that part accordingly then, > > Finally, the "Definitions" section also claims that a project installs one > or more packages, but a project may not contain *any* packages; it may have > a standalone module, or just a script, data, or metadata. > > ok Thanks for the feedbacks -- Tarek Ziad? | http://ziade.org From dirkjan at ochtman.nl Fri May 15 09:50:13 2009 From: dirkjan at ochtman.nl (Dirkjan Ochtman) Date: Fri, 15 May 2009 09:50:13 +0200 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <20090515063220.GD24353@nexus.in-nomine.org> References: <20090515045815.DE1F23A4061@sparrow.telecommunity.com> <20090515063220.GD24353@nexus.in-nomine.org> Message-ID: On Fri, May 15, 2009 at 8:32 AM, Jeroen Ruigrok van der Werven wrote: > Agreed. Within FreeBSD's ports the installed package registration gets a MD5 > hash per file recorded. Size is less interesting though, since essentially > this information is encapsulated within the hash. Remove one byte from the > file and your hash is already different. And the case of a collision for > this kind of registration is sufficiently small to need the size > information. Size is nice because it's much cheaper to check. I don't know if mass uninstalls will be so common that this is actually something we have to worry about, though. Cheers, Dirkjan From ncoghlan at gmail.com Fri May 15 12:34:35 2009 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 15 May 2009 20:34:35 +1000 Subject: [Python-Dev] python -m test.regrtest should pass on an installed python In-Reply-To: References: Message-ID: <4A0D453B.1060907@gmail.com> R. David Murray wrote: > So this posting is a general reminder that the tests should not make > assumptions about the writabilty of the test directory (or, for that > matter, of the CWD). Indeed - the tempfile module is very helpful in that regard. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- From status at bugs.python.org Fri May 15 18:07:15 2009 From: status at bugs.python.org (Python tracker) Date: Fri, 15 May 2009 18:07:15 +0200 (CEST) Subject: [Python-Dev] Summary of Python tracker Issues Message-ID: <20090515160715.694947851B@psf.upfronthosting.co.za> ACTIVITY SUMMARY (05/08/09 - 05/15/09) Python tracker at http://bugs.python.org/ To view or respond to any of the issues listed below, click on the issue number. Do NOT respond to this message. 2194 open (+34) / 15658 closed (+26) / 17852 total (+60) Open issues with patches: 855 Average duration of open issues: 647 days. Median duration of open issues: 398 days. Open Issues Breakdown open 2165 (+34) pending 28 ( +0) Issues Created Or Reopened (61) _______________________________ Generator expression bug? 
05/08/09 CLOSED http://bugs.python.org/issue5968 reopened svenrahmann sys.exc_info leaks into a generator 05/08/09 http://bugs.python.org/issue5970 created jyasskin patch logging.Handler.handlerError() may raise IOError in traceback.pr 05/08/09 CLOSED http://bugs.python.org/issue5971 created ryles Failing test_signal.py on Redhat 4.1.2-44 05/08/09 http://bugs.python.org/issue5972 created dmauldin re-usable generators / generator expressions should return itera 05/08/09 CLOSED http://bugs.python.org/issue5973 created svenrahmann unicode decode error due to improperly entered text "Martin v. L 05/08/09 CLOSED http://bugs.python.org/issue5974 created srid patch csv unix file format ('\n' line terminator) 05/08/09 http://bugs.python.org/issue5975 created jtalbot test_os fails if run after test_distutils 05/09/09 CLOSED http://bugs.python.org/issue5976 created tarek distutils build_ext.get_outputs returns wrong result (patch) 05/12/09 http://bugs.python.org/issue5977 reopened ajaksu2 patch cProfile and profile don't work with pygtk/pyqt and sys.exit(0) 05/09/09 http://bugs.python.org/issue5978 created akkana strptime() gives inconsistent exceptions 05/09/09 http://bugs.python.org/issue5979 created ryles Add bug tracker tasks to PEP 101 05/09/09 http://bugs.python.org/issue5980 created ajaksu2 patch float.fromhex bugs 05/09/09 CLOSED http://bugs.python.org/issue5981 created marketdickinson patch classmethod, staticmethod: expose wrapped function 05/09/09 http://bugs.python.org/issue5982 created gsakkis boolean.so no more in _xmlplus/utils 05/09/09 CLOSED http://bugs.python.org/issue5983 created schmirrwurst distutils.command.build_ext.check_extensions_list broken checkin 05/10/09 CLOSED http://bugs.python.org/issue5984 created tarek Implement os.path.samefile and os.path.sameopenfile on Windows 05/10/09 http://bugs.python.org/issue5985 created sandberg Avoid reversed() in Random.shuffle() 05/10/09 CLOSED http://bugs.python.org/issue5986 created haypo patch Broken link to "Curses Programming with Python" 05/10/09 CLOSED http://bugs.python.org/issue5987 created ralph.corderoy Delete PyOS_ascii_formatd, PyOS_ascii_strtod, and PyOS_ascii_ato 05/10/09 http://bugs.python.org/issue5988 created eric.smith easy unittest.TestLoader.loadTestsFromNames should accept module / cl 05/10/09 CLOSED http://bugs.python.org/issue5989 created michael.foord Memory leak in os.rename() and other functions 05/10/09 CLOSED http://bugs.python.org/issue5990 created pitrou patch Add non-command help topics to help completion of cmd.Cmd 05/10/09 http://bugs.python.org/issue5991 created flub patch spurious space after opening parenthesis when auto-completing 05/10/09 http://bugs.python.org/issue5992 created pitrou python produces zombie in webbrowser.open 05/11/09 http://bugs.python.org/issue5993 created dontbugme help(marshal) just gives an outline; no help text provided. 
05/11/09 CLOSED http://bugs.python.org/issue5994 created orsenthil patch unittest command line behaviour 05/11/09 CLOSED http://bugs.python.org/issue5995 created michael.foord patch, patch, easy abstract class instantiable when subclassing dict 05/11/09 http://bugs.python.org/issue5996 created thet strftime is broken 05/11/09 CLOSED http://bugs.python.org/issue5997 created jonathan.cervidae Add __bool__ to threading.Event and multiprocessing.Event 05/11/09 http://bugs.python.org/issue5998 created flub patch compile error on HP-UX 11.22 ia64 - 'mbstate_t' is used as a typ 05/11/09 http://bugs.python.org/issue5999 created srid compile error - PyNumber_InPlaceOr(newfree, allfree) < 0 05/11/09 CLOSED http://bugs.python.org/issue6000 created srid easy Test discovery for unittest 05/11/09 http://bugs.python.org/issue6001 created michael.foord patch, patch, needs review test_urllib2_localnet DigestAuthHandler leaks nonces 05/11/09 http://bugs.python.org/issue6002 created r.david.murray easy ZipFile.writestr "compression_type" argument 05/12/09 http://bugs.python.org/issue6003 created ronaldoussoren ZipFile.writestr "compression_type" argument 05/12/09 CLOSED http://bugs.python.org/issue6004 created ronaldoussoren Bug in socket example 05/12/09 http://bugs.python.org/issue6005 created kiilerix ffi.c compile failures on AIX 5.3 with xlc 05/12/09 http://bugs.python.org/issue6006 created elyeshel distutils tricks you into thinking you can build extensions with 05/12/09 http://bugs.python.org/issue6007 created exarkun Idle should be installed as `idle3.1` and not `idle3` 05/13/09 http://bugs.python.org/issue6008 created srid optparse docs say 'default' keyword is deprecated but uses it in 05/13/09 http://bugs.python.org/issue6009 created mallyvai unable to retrieve latin-1 encoded data from sqlite3 05/13/09 CLOSED http://bugs.python.org/issue6010 created izarf python doesn't build if prefix contains non-ascii characters 05/13/09 http://bugs.python.org/issue6011 created zegreek patch enhance getargs O& to accept cleanup function 05/13/09 http://bugs.python.org/issue6012 created ocean-city patch json slower than simplejson 05/13/09 CLOSED http://bugs.python.org/issue6013 created theller No shell prompt when a graphics window that was started from IDL 05/13/09 http://bugs.python.org/issue6014 created chessweb Scrollbar in Idle os x 10.5 05/13/09 http://bugs.python.org/issue6015 created an is Use shipped zlib if the system version is bad 05/14/09 CLOSED http://bugs.python.org/issue6016 created ajaksu2 patch, patch Dict fails to notice addition and deletion of keys during iterat 05/14/09 CLOSED http://bugs.python.org/issue6017 created stevenjd Fix the output word from "ok" to "OK" when a testcase passes 05/14/09 http://bugs.python.org/issue6018 created Retro Minor typos in ctypes docs 05/14/09 CLOSED http://bugs.python.org/issue6019 created lehmannro patch Create a datetime.timedelta.totalseconds property 05/14/09 CLOSED http://bugs.python.org/issue6020 created mw44118 itertools.grouper 05/14/09 CLOSED http://bugs.python.org/issue6021 created lieryan test_distutils leaves a 'foo' file behind in the cwd 05/14/09 CLOSED http://bugs.python.org/issue6022 created r.david.murray Search does not intelligently handle module.function queries on 05/14/09 http://bugs.python.org/issue6023 created JonathansCorner.com regrtest says refleaks are "ok" 05/14/09 CLOSED http://bugs.python.org/issue6024 created collinwinter patch documentation of xml.dom.minidom.parse signature is wrong 05/14/09 
http://bugs.python.org/issue6025 created phihag patch test_(zipfile|zipimport|gzip|distutils) fail if zlib is not avai 05/15/09 http://bugs.python.org/issue6026 created ezio.melotti test_xmlrpc_net fails when the ISP returns "302 Found" 05/15/09 http://bugs.python.org/issue6027 created ezio.melotti Interpreter crashes when chaining an infinite number of exceptio 05/15/09 CLOSED http://bugs.python.org/issue6028 created yury FAIL: test_longdouble (ctypes.test.test_callbacks.Callbacks) [SP 05/15/09 http://bugs.python.org/issue6029 created illumino Issues Now Closed (58) ______________________ csv input converts \r\n to \n but csv output does not when a fie 531 days http://bugs.python.org/issue1511 ajaksu2 imaplib is not IPv6-capable 514 days http://bugs.python.org/issue1655 pitrou patch nntplib is not IPv6-capable 512 days http://bugs.python.org/issue1664 dmorr patch Cosmetic patch to supress compiler warning 475 days http://bugs.python.org/issue1932 ocean-city patch isinstance(anything, MetaclassThatDefinesInstancecheck) raises i 318 days http://bugs.python.org/issue2325 ajaksu2 patch Check implementation of new buffer interface for PyString in 2.6 413 days http://bugs.python.org/issue2492 pitrou 26backport 64 bit python memory leak usage 388 days http://bugs.python.org/issue2652 pitrou Py3k fails to parse a file with an iso-8859-1 string 384 days http://bugs.python.org/issue2660 benjamin.peterson patch update Lib/test/README 353 days http://bugs.python.org/issue2958 pitrou easy arguments and default path not set in site.py and sitecustomize. 351 days http://bugs.python.org/issue2972 haridsv Python 2.6rc2: Tix ComboBox error 239 days http://bugs.python.org/issue3872 loewis patch inspect.findsource() returns binary data for shared library modu 220 days http://bugs.python.org/issue4050 r.david.murray patch help("modules ftp") fails due to test modules 208 days http://bugs.python.org/issue4135 ajaksu2 Duplicate UTF-16 BOM if a file is open in append mode 115 days http://bugs.python.org/issue5006 pitrou patch backport distutils 3.x changes into 2.7 when appliabl 95 days http://bugs.python.org/issue5164 tarek StringIO can duplicate newlines in universal newlines mode 87 days http://bugs.python.org/issue5265 alexandre.vassalotti test_importlib fails on Mac OSX 10.5.6 w/ case-sensitive file sy 37 days http://bugs.python.org/issue5442 brett.cannon patch TextIOWrapper fails with SystemError when reading HTTPResponse 44 days http://bugs.python.org/issue5628 benjamin.peterson internal error on write while reading 19 days http://bugs.python.org/issue5844 benjamin.peterson patch Ensure RUNPATH is added to extension modules with RPATH if GNU l 8 days http://bugs.python.org/issue5900 tarek patch I need to import the module in the same thread 7 days http://bugs.python.org/issue5908 amaury.forgeotdarc import deadlocks when using fork 11 days http://bugs.python.org/issue5912 benjamin.peterson test_parser crashes when run after some other tests 11 days http://bugs.python.org/issue5918 pitrou patch fix gcc -Wextra warnings (compare signed/unsigned) 4 days http://bugs.python.org/issue5933 marketdickinson patch Add to "whats new": range(n) != range(n) 8 days http://bugs.python.org/issue5953 MLModel Possible mistake regarding writeback in documentation of shelve 4 days http://bugs.python.org/issue5957 r.david.murray patch unnecessary hardlink 1 days http://bugs.python.org/issue5966 orsenthil Generator expression bug? 
0 days http://bugs.python.org/issue5968 tjreedy logging.Handler.handlerError() may raise IOError in traceback.pr 1 days http://bugs.python.org/issue5971 vsajip re-usable generators / generator expressions should return itera 1 days http://bugs.python.org/issue5973 r.david.murray unicode decode error due to improperly entered text "Martin v. L 0 days http://bugs.python.org/issue5974 loewis patch test_os fails if run after test_distutils 0 days http://bugs.python.org/issue5976 tarek float.fromhex bugs 2 days http://bugs.python.org/issue5981 marketdickinson patch boolean.so no more in _xmlplus/utils 0 days http://bugs.python.org/issue5983 loewis distutils.command.build_ext.check_extensions_list broken checkin 0 days http://bugs.python.org/issue5984 tarek Avoid reversed() in Random.shuffle() 1 days http://bugs.python.org/issue5986 rhettinger patch Broken link to "Curses Programming with Python" 1 days http://bugs.python.org/issue5987 ralph.corderoy unittest.TestLoader.loadTestsFromNames should accept module / cl 0 days http://bugs.python.org/issue5989 michael.foord Memory leak in os.rename() and other functions 3 days http://bugs.python.org/issue5990 pitrou patch help(marshal) just gives an outline; no help text provided. 2 days http://bugs.python.org/issue5994 r.david.murray patch unittest command line behaviour 1 days http://bugs.python.org/issue5995 michael.foord patch, patch, easy strftime is broken 0 days http://bugs.python.org/issue5997 jonathan.cervidae compile error - PyNumber_InPlaceOr(newfree, allfree) < 0 1 days http://bugs.python.org/issue6000 benjamin.peterson easy ZipFile.writestr "compression_type" argument 0 days http://bugs.python.org/issue6004 ronaldoussoren unable to retrieve latin-1 encoded data from sqlite3 0 days http://bugs.python.org/issue6010 loewis json slower than simplejson 0 days http://bugs.python.org/issue6013 pitrou Use shipped zlib if the system version is bad 0 days http://bugs.python.org/issue6016 ajaksu2 patch, patch Dict fails to notice addition and deletion of keys during iterat 0 days http://bugs.python.org/issue6017 benjamin.peterson Minor typos in ctypes docs 1 days http://bugs.python.org/issue6019 georg.brandl patch Create a datetime.timedelta.totalseconds property 0 days http://bugs.python.org/issue6020 pitrou itertools.grouper 0 days http://bugs.python.org/issue6021 cvrebert test_distutils leaves a 'foo' file behind in the cwd 0 days http://bugs.python.org/issue6022 tarek regrtest says refleaks are "ok" 0 days http://bugs.python.org/issue6024 collinwinter patch Interpreter crashes when chaining an infinite number of exceptio 0 days http://bugs.python.org/issue6028 pitrou Solaris term.h needs curses.h 2023 days http://bugs.python.org/issue831574 ajaksu2 test_subprocess fails on cygwin 996 days http://bugs.python.org/issue1543469 ajaksu2 patch os.popen with os.close gives error message 945 days http://bugs.python.org/issue1574310 ajaksu2 Problem linking to readline lib on x86(64) Solaris 797 days http://bugs.python.org/issue1676121 ajaksu2 Top Issues Most Discussed (10) ______________________________ 14 WeakSet cmp methods 8 days open http://bugs.python.org/issue5964 12 distutils build_ext.get_outputs returns wrong result (patch) 3 days open http://bugs.python.org/issue5977 11 xml.dom.minidom does not escape CR, LF and TAB characters withi 31 days open http://bugs.python.org/issue5752 10 Add to "whats new": range(n) != range(n) 8 days closed http://bugs.python.org/issue5953 10 Do not assume signed integer overflow behavior 519 days open 
http://bugs.python.org/issue1621 10 Enhance file.readlines by making line separator selectable 443 days open http://bugs.python.org/issue1152248 8 test_asynchat fails on Mac OSX 25 days open http://bugs.python.org/issue5798 7 Add os.link() and os.symlink() and os.path.islink() support for 942 days open http://bugs.python.org/issue1578269 6 test_importlib fails on Mac OSX 10.5.6 w/ case-sensitive file s 37 days closed http://bugs.python.org/issue5442 5 ssl makefile never closes socket 92 days open http://bugs.python.org/issue5238 From python.leojay at gmail.com Fri May 15 18:18:42 2009 From: python.leojay at gmail.com (Leo Jay) Date: Sat, 16 May 2009 00:18:42 +0800 Subject: [Python-Dev] doc error in 2.6.2 Message-ID: <4e307e0f0905150918p722a47d6n11ca391070953db9@mail.gmail.com> There is a syntax error in the client side code of "SocketServer.UDPServer Example" in http://docs.python.org/library/socketserver.html: import socket import sys HOST, PORT = "localhost" data = " ".join(sys.argv[1:]) Obviously, it should be: HOST, PORT = "localhost", 9999 -- Leo Jay From aahz at pythoncraft.com Fri May 15 19:16:27 2009 From: aahz at pythoncraft.com (Aahz) Date: Fri, 15 May 2009 10:16:27 -0700 Subject: [Python-Dev] doc error in 2.6.2 In-Reply-To: <4e307e0f0905150918p722a47d6n11ca391070953db9@mail.gmail.com> References: <4e307e0f0905150918p722a47d6n11ca391070953db9@mail.gmail.com> Message-ID: <20090515171627.GC18871@panix.com> On Sat, May 16, 2009, Leo Jay wrote: > > There is a syntax error in the client side code of > "SocketServer.UDPServer Example" in > http://docs.python.org/library/socketserver.html: Please follow the directions in http://docs.python.org/bugs.html to report this on bugs.python.org -- that ensures that it won't get lost. -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ "In 1968 it took the computing power of 2 C-64's to fly a rocket to the moon. Now, in 1998 it takes the Power of a Pentium 200 to run Microsoft Windows 98. Something must have gone wrong." --/bin/fortune From pje at telecommunity.com Fri May 15 19:52:34 2009 From: pje at telecommunity.com (P.J. Eby) Date: Fri, 15 May 2009 13:52:34 -0400 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <20090515063220.GD24353@nexus.in-nomine.org> References: <20090515045815.DE1F23A4061@sparrow.telecommunity.com> <20090515063220.GD24353@nexus.in-nomine.org> Message-ID: <20090515174953.F06B73A4061@sparrow.telecommunity.com> At 08:32 AM 5/15/2009 +0200, Jeroen Ruigrok van der Werven wrote: >Agreed. Within FreeBSD's ports the installed package registration >gets a MD5 hash per file recorded. Size is less interesting though, >since essentially this information is encapsulated within the hash. >Remove one byte from the file and your hash is already different. Which also means that in that case you can skip computing the MD5. The size allows you to easily notice an overwrite/corruption without further processing. From pje at telecommunity.com Fri May 15 19:56:36 2009 From: pje at telecommunity.com (P.J. Eby) Date: Fri, 15 May 2009 13:56:36 -0400 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <94bdd2610905142332k38595ff4l4e478faf2ca43d25@mail.gmail.co m> References: <20090515045815.DE1F23A4061@sparrow.telecommunity.com> <94bdd2610905142332k38595ff4l4e478faf2ca43d25@mail.gmail.com> Message-ID: <20090515175355.B0AA53A4061@sparrow.telecommunity.com> At 08:32 AM 5/15/2009 +0200, Tarek Ziad? wrote: >2009/5/15 P.J. 
Eby : > > Ideally, a file with identical size/checksum that > > belongs to more than one project should be silently left alone, and a file > > installed by more than one project with *different* size/checksum should be > > warned about and left alone. > >I think the path is the info that should be looked at. By "a file that belongs to more than one project" I meant a single file on *disk* (i.e., one absolute path). >But I don't think you want to leave alone a file with identical >size/checksum that belongs to more than one project when it's not >the same absolute path. That wouldn't be "a file" then, would it? ;-) >Here's an example why : if two different packages includes the >"feedparser.py" module >(from the FeedParser project) for conveniency, and if you remove one package, >you *do* want to remove its "feeparser.py" module even if it exists >in the other >project. Right, that would be *two files*, though, not one file. From tonynelson at georgeanelson.com Sat May 16 00:03:14 2009 From: tonynelson at georgeanelson.com (Tony Nelson) Date: Fri, 15 May 2009 18:03:14 -0400 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <20090515174953.F06B73A4061@sparrow.telecommunity.com> References: <20090515045815.DE1F23A4061@sparrow.telecommunity.com> <20090515063220.GD24353@nexus.in-nomine.org> <20090515174953.F06B73A4061@sparrow.telecommunity.com> Message-ID: At 13:52 -0400 05/15/2009, P.J. Eby wrote: >At 08:32 AM 5/15/2009 +0200, Jeroen Ruigrok van der Werven wrote: >>Agreed. Within FreeBSD's ports the installed package registration >>gets a MD5 hash per file recorded. Size is less interesting though, >>since essentially this information is encapsulated within the hash. >>Remove one byte from the file and your hash is already different. > >Which also means that in that case you can skip computing the >MD5. The size allows you to easily notice an overwrite/corruption >without further processing. In most cases the files will actually match, so the sizes and dates will be the same and the checksum must be computed to verify the match. RPM does this when asked to Verify a package. It is faster than Removing a package, and Verifying all installed packages takes a reasonable amount of time. I don't think Python would be any worse at verifying its own packages, and it would normally have less data to verify, so it should be fast enough. -- ____________________________________________________________________ TonyN.:' ' From hfuerstenau at gmx.net Sat May 16 11:58:16 2009 From: hfuerstenau at gmx.net (=?ISO-8859-1?Q?Hagen_F=FCrstenau?=) Date: Sat, 16 May 2009 11:58:16 +0200 Subject: [Python-Dev] Should collections.Counter check for int? In-Reply-To: <20090514022235.GA28101@panix.com> References: <4A0AF056.4090303@coli.uni-saarland.de> <20090514022235.GA28101@panix.com> Message-ID: <4A0E8E38.4070801@gmx.net> >> I'd prefer Counter to refuse non-numerical values right away as the >> present behaviour may hide bugs (e.g. a forgotten string->int >> conversion). Any opinions? (And what about negative values or floats?) > > Please file a report on bugs.python.org so that there's a record of this > issue. 
Done: http://bugs.python.org/issue6038 - Hagen From chris at simplistix.co.uk Sat May 16 17:00:39 2009 From: chris at simplistix.co.uk (Chris Withers) Date: Sat, 16 May 2009 16:00:39 +0100 Subject: [Python-Dev] .pth files should never contain python In-Reply-To: <79990c6b0905090903o19e11505w353cfe62f4f67071@mail.gmail.com> References: <49D4DA72.60401@v.loewis.de> <20090415144147.6845F3A4100@sparrow.telecommunity.com> <49E60832.8030806@egenix.com> <49FB2398.5000708@simplistix.co.uk> <49FB261F.9080306@v.loewis.de> <49FB2A2A.4090606@simplistix.co.uk> <49FB3384.1030106@v.loewis.de> <4A0547AC.7060103@simplistix.co.uk> <4A054C7A.8020806@v.loewis.de> <4A058ECF.6050203@simplistix.co.uk> <79990c6b0905090903o19e11505w353cfe62f4f67071@mail.gmail.com> Message-ID: <4A0ED517.2050903@simplistix.co.uk> Paul Moore wrote: > 2009/5/9 Chris Withers : >> Martin v. L?wis wrote: >>>> I thought .pth files just had python in them? >>> Not at all - they never did. They have paths in them. >> I've certainly seen them with python in, and that's what I hate about >> them... > > AIUI, there was a small special case that lines starting with "import" > are executed (see the source of site.py for details). This exception > has been exploited (some would say "abused", but I'm trying to be > unbiased here) by setuptools, at least, to do path manipulations and > such. Abused is definitely the right word, I suppose it's too late to correct this bug? How about for Python 3? cheers, Chris From ziade.tarek at gmail.com Sat May 16 18:06:25 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Sat, 16 May 2009 18:06:25 +0200 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> Message-ID: <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> Ok I've changed the PEP with all the points you mentioned, if you want to take a look. 2009/5/15 P.J. Eby : > Next, the doc for the metadata API functions seems quite sparse. ?ISTR that > I've previously commented on such issues as case- and > punctuation-insensitivity of project names, and '/' separation in egg_info > subpaths, but these don't seem to have been incorporated into the current > version of the PEP. > > These are important considerations in general, btw, because project name and > version canonicalization and escaping are an important part of both > generating and parsing .egg-info filenemaes. ?At minimum, the relevant > setuptools docs that define these standards should be cited. I need to find back your comments for this part, I must have missed them. That's the last part I didn't work out yet on the current PEP revision. Regards Tarek -- Tarek Ziad? | http://ziade.org From ziade.tarek at gmail.com Sat May 16 18:39:41 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Sat, 16 May 2009 18:39:41 +0200 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: References: <20090515045815.DE1F23A4061@sparrow.telecommunity.com> <20090515063220.GD24353@nexus.in-nomine.org> Message-ID: <94bdd2610905160939s705181dbp2f88e16cf5f60434@mail.gmail.com> Yes, I don't think it's relevant to optimize install/uninstall code in Python. In the whole PEP 376 proposal, the only part that will need care will be the code that browses sys.path. 
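A minimal sketch, purely for illustration and not the API proposed in the PEP, of what that browsing of sys.path for .egg-info entries could look like:

    import os
    import sys

    def iter_egg_info_dirs():
        # Scan each directory on sys.path and yield any *.egg-info
        # entries found there; this only illustrates the sys.path scan
        # referred to above, nothing more.
        for entry in sys.path:
            if not os.path.isdir(entry):
                continue
            for name in sorted(os.listdir(entry)):
                if name.endswith('.egg-info'):
                    yield os.path.join(entry, name)
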
On Fri, May 15, 2009 at 9:50 AM, Dirkjan Ochtman wrote: > On Fri, May 15, 2009 at 8:32 AM, Jeroen Ruigrok van der Werven > wrote: >> Agreed. Within FreeBSD's ports the installed package registration gets a MD5 >> hash per file recorded. Size is less interesting though, since essentially >> this information is encapsulated within the hash. Remove one byte from the >> file and your hash is already different. And the case of a collision for >> this kind of registration is sufficiently small to need the size >> information. > > Size is nice because it's much cheaper to check. I don't know if mass > uninstalls will be so common that this is actually something we have > to worry about, though. > > Cheers, > > Dirkjan > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/ziade.tarek%40gmail.com > -- Tarek Ziad? | http://ziade.org From pje at telecommunity.com Sat May 16 18:55:44 2009 From: pje at telecommunity.com (P.J. Eby) Date: Sat, 16 May 2009 12:55:44 -0400 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.co m> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> Message-ID: <20090516165302.C71643A4061@sparrow.telecommunity.com> At 06:06 PM 5/16/2009 +0200, Tarek Ziad? wrote: >Ok I've changed the PEP with all the points you mentioned, if you want >to take a look. Some notes: 1. Why ';' separation, instead of tabs as in PEP 262? Aren't semicolons a valid character in filenames? 2. "if the installed file is located in a directory in site-packages" should refer not to site-packages but to the directory containing the .egg-info directory. 3. get_egg_info_file needs to be specified as using '/'-separated paths and converting to OS paths if appropriate. There's also the problem that the mode it opens the file in (binary or text) is unspecified. 4. There should probably be a way to iterate over the projects in a directory, since it's otherwise impossible for an installation tool to find out what project(s) "own" a file that conflicts with something being installed. Alternatively, reshaping the file API to allow querying by path as well as by project might work. 5. If any cache mechanisms are to be used by the API, the API *must* make it possible to bypass or explicitly manage that cache, as otherwise installation tools and tools that manipulate sys.path at runtime may end up using incorrect data. 6. get_files() doesn't document whether the yielded paths are absolute or relative, local or cross-platform, etc. >I need to find back your comments for this part, I must have missed >them. That's >the last part I didn't work out yet on the current PEP revision. Well, if you can't find them, the EggFormats doc explains how these file/dir structures are currently laid out by setuptools, easy_install, pip, etc., and the PEP should probably reference that. Technically, this PEP doesn't so much propose a change to the EggFormats standard, as simply add a RECORD file to it, and propose stdlib support for reading and writing it. So, the PEP really should reference (i.e. link to) the existing standard. 
The EggFormats doc in turn cites pkg_resources doc for lower-level format issues, such as name and version normalization, filename escaping, file parsing, etc. This PEP should also probably be framed as a replacement for PEP 262, proposing to extend the de-facto standard for an installation database with uninstall support, and blessing selected portions of the de facto standard as an official standard. (Since that's pretty much exactly what it is.) From v+python at g.nevcal.com Sat May 16 20:17:10 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Sat, 16 May 2009 11:17:10 -0700 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <20090516165302.C71643A4061@sparrow.telecommunity.com> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> Message-ID: <4A0F0326.7070501@g.nevcal.com> On approximately 5/16/2009 9:55 AM, came the following characters from the keyboard of P.J. Eby: > At 06:06 PM 5/16/2009 +0200, Tarek Ziad? wrote: >> Ok I've changed the PEP with all the points you mentioned, if you want >> to take a look. > > Some notes: > > 1. Why ';' separation, instead of tabs as in PEP 262? Aren't semicolons > a valid character in filenames? Why tabs? Aren't tabs a valid character in filenames? (hint: Both are valid in POSIX filenames, neither are valid in Windows filenames) -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From pje at telecommunity.com Sat May 16 20:58:35 2009 From: pje at telecommunity.com (P.J. Eby) Date: Sat, 16 May 2009 14:58:35 -0400 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <4A0F0326.7070501@g.nevcal.com> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <4A0F0326.7070501@g.nevcal.com> Message-ID: <20090516185556.D8B5F3A4061@sparrow.telecommunity.com> At 11:17 AM 5/16/2009 -0700, Glenn Linderman wrote: >On approximately 5/16/2009 9:55 AM, came the following characters >from the keyboard of P.J. Eby: >>At 06:06 PM 5/16/2009 +0200, Tarek Ziad? wrote: >>>Ok I've changed the PEP with all the points you mentioned, if you want >>>to take a look. >>Some notes: >>1. Why ';' separation, instead of tabs as in PEP 262? Aren't >>semicolons a valid character in filenames? > > >Why tabs? Aren't tabs a valid character in filenames? >(hint: Both are valid in POSIX filenames, neither are valid in >Windows filenames) ";" *is* valid in Windows filenames, actually. Tabs aresn't. 
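One way to make the separator question moot, sketched here only as an illustration and not as the format the PEP specifies, is to let the csv module quote fields, so that a ';', a tab or a comma inside a file name can never be confused with the field separator. The "path, checksum, size" layout below is an assumption for the example:

    import csv

    def write_record(stream, entries):
        # entries is an iterable of (path, checksum, size) tuples
        writer = csv.writer(stream)
        for path, checksum, size in entries:
            writer.writerow([path, checksum, size])

    def read_record(stream):
        # Yields (path, checksum, size) tuples; csv undoes any quoting,
        # so separators inside the path round-trip safely.
        for row in csv.reader(stream):
            if row:
                path, checksum, size = row
                yield path, checksum, int(size)
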
From v+python at g.nevcal.com Sat May 16 21:12:15 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Sat, 16 May 2009 12:12:15 -0700 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <20090516185556.D8B5F3A4061@sparrow.telecommunity.com> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <4A0F0326.7070501@g.nevcal.com> <20090516185556.D8B5F3A4061@sparrow.telecommunity.com> Message-ID: <4A0F100F.3070406@g.nevcal.com> On approximately 5/16/2009 11:58 AM, came the following characters from the keyboard of P.J. Eby: > At 11:17 AM 5/16/2009 -0700, Glenn Linderman wrote: >> On approximately 5/16/2009 9:55 AM, came the following characters from >> the keyboard of P.J. Eby: >>> At 06:06 PM 5/16/2009 +0200, Tarek Ziad? wrote: >>>> Ok I've changed the PEP with all the points you mentioned, if you want >>>> to take a look. >>> Some notes: >>> 1. Why ';' separation, instead of tabs as in PEP 262? Aren't >>> semicolons a valid character in filenames? >> >> >> Why tabs? Aren't tabs a valid character in filenames? >> (hint: Both are valid in POSIX filenames, neither are valid in Windows >> filenames) > > ";" *is* valid in Windows filenames, actually. Tabs aresn't. Oops. Guess I got that crossed with valid email address characters... But I should probably have stated my point... that since there are no characters that are not illegal in file names on every platform, except "/" and NULL, that some mention should be made, that splitting the line on ; (or TAB) isn't necessarily the correct parsing technique... rather that the line should be parsed from the right end, and the remainder used as a the filename, as the numbers at the end would not have ; or TAB as legal characters within them. Or else some escaping mechanism needs to be defined. Or else the ; or TAB will be illegal in names used in the RECORD (which would be limiting, although not significantly so, in my opinion, but others may have other opinions). -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From shigin at rambler-co.ru Sat May 16 21:36:04 2009 From: shigin at rambler-co.ru (Alexander Shigin) Date: Sat, 16 May 2009 23:36:04 +0400 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <20090516185556.D8B5F3A4061@sparrow.telecommunity.com> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <4A0F0326.7070501@g.nevcal.com> <20090516185556.D8B5F3A4061@sparrow.telecommunity.com> Message-ID: <1242502564.4478.3.camel@jenner> ? ???, 16/05/2009 ? 14:58 -0400, P.J. Eby ?????: > ";" *is* valid in Windows filenames, actually. Tabs aresn't. I was sure ';' is separator for PATH in Windows. Do I miss something? If I remember right os.path.pathsep is ';' under Windows. 
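Glenn's alternative above, parsing each line from the right end so that a separator inside the file name does no harm, could look roughly like this, again assuming a hypothetical "path;checksum;size" layout (the checksum and size themselves cannot contain ';'):

    def parse_record_line(line):
        # Split from the right: the last two fields are the checksum and
        # the size, everything before them is the path, ';' included.
        path, checksum, size = line.rstrip('\n').rsplit(';', 2)
        return path, checksum, int(size)

    # parse_record_line('odd;name.py;d41d8cd98f00b204e9800998ecf8427e;0')
    # returns ('odd;name.py', 'd41d8cd98f00b204e9800998ecf8427e', 0)
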
From martin at v.loewis.de Sat May 16 22:08:25 2009 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Sat, 16 May 2009 22:08:25 +0200 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <1242502564.4478.3.camel@jenner> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <4A0F0326.7070501@g.nevcal.com> <20090516185556.D8B5F3A4061@sparrow.telecommunity.com> <1242502564.4478.3.camel@jenner> Message-ID: <4A0F1D39.1070108@v.loewis.de> Alexander Shigin wrote: > ? ???, 16/05/2009 ? 14:58 -0400, P.J. Eby ?????: >> ";" *is* valid in Windows filenames, actually. Tabs aresn't. > > I was sure ';' is separator for PATH in Windows. Do I miss something? Yes, this: http://msdn.microsoft.com/en-us/library/aa365247.aspx Regards, Martin From v+python at g.nevcal.com Sat May 16 22:26:18 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Sat, 16 May 2009 13:26:18 -0700 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <4A0F1D39.1070108@v.loewis.de> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <4A0F0326.7070501@g.nevcal.com> <20090516185556.D8B5F3A4061@sparrow.telecommunity.com> <1242502564.4478.3.camel@jenner> <4A0F1D39.1070108@v.loewis.de> Message-ID: <4A0F216A.4050905@g.nevcal.com> On approximately 5/16/2009 1:08 PM, came the following characters from the keyboard of Martin v. L?wis: > Alexander Shigin wrote: >> ? ???, 16/05/2009 ? 14:58 -0400, P.J. Eby ?????: >>> ";" *is* valid in Windows filenames, actually. Tabs aresn't. >> I was sure ';' is separator for PATH in Windows. Do I miss something? > > Yes, this: > > http://msdn.microsoft.com/en-us/library/aa365247.aspx Well, maybe he was missing that, or maybe he was missing that each entry in the Windows PATH is allowed to be quoted, so that ; characters inside quotes are part of path names, and ; characters outside of quotes are separators. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From google at mrabarnett.plus.com Sun May 17 00:15:46 2009 From: google at mrabarnett.plus.com (MRAB) Date: Sat, 16 May 2009 23:15:46 +0100 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <4A0F100F.3070406@g.nevcal.com> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <4A0F0326.7070501@g.nevcal.com> <20090516185556.D8B5F3A4061@sparrow.telecommunity.com> <4A0F100F.3070406@g.nevcal.com> Message-ID: <4A0F3B12.8080500@mrabarnett.plus.com> Glenn Linderman wrote: > On approximately 5/16/2009 11:58 AM, came the following characters from > the keyboard of P.J. Eby: >> At 11:17 AM 5/16/2009 -0700, Glenn Linderman wrote: >>> On approximately 5/16/2009 9:55 AM, came the following characters >>> from the keyboard of P.J. Eby: >>>> At 06:06 PM 5/16/2009 +0200, Tarek Ziad? 
wrote: >>>>> Ok I've changed the PEP with all the points you mentioned, if you want >>>>> to take a look. >>>> Some notes: >>>> 1. Why ';' separation, instead of tabs as in PEP 262? Aren't >>>> semicolons a valid character in filenames? >>> >>> >>> Why tabs? Aren't tabs a valid character in filenames? >>> (hint: Both are valid in POSIX filenames, neither are valid in >>> Windows filenames) >> >> ";" *is* valid in Windows filenames, actually. Tabs aresn't. > > > Oops. Guess I got that crossed with valid email address characters... > > But I should probably have stated my point... that since there are no > characters that are not illegal in file names on every platform, except > "/" and NULL, that some mention should be made, that splitting the line > on ; (or TAB) isn't necessarily the correct parsing technique... rather > that the line should be parsed from the right end, and the remainder > used as a the filename, as the numbers at the end would not have ; or > TAB as legal characters within them. Or else some escaping mechanism > needs to be defined. Or else the ; or TAB will be illegal in names used > in the RECORD (which would be limiting, although not significantly so, > in my opinion, but others may have other opinions). > FYI, on RISC OS '/' is a valid filename character and '.' is used as the directory separator. I'd probably say that TAB is s reasonable character to use, even though it's OK in POSIX; after all, should anyone really be using a control character in a filename? From solipsis at pitrou.net Sun May 17 00:29:58 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 16 May 2009 22:29:58 +0000 (UTC) Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <4A0F0326.7070501@g.nevcal.com> <20090516185556.D8B5F3A4061@sparrow.telecommunity.com> <4A0F100F.3070406@g.nevcal.com> <4A0F3B12.8080500@mrabarnett.plus.com> Message-ID: MRAB mrabarnett.plus.com> writes: > > I'd probably say that TAB is s reasonable character to use, even though > it's OK in POSIX; after all, should anyone really be using a control > character in a filename? Even newline characters are valid characters in a filename. Why not go for the safe choice of encoding all filenames using e.g. urllib.quote()? (which has the advantage that usual filenames will stay perfectly readable) From shigin at rambler-co.ru Sun May 17 06:39:48 2009 From: shigin at rambler-co.ru (Alexander Shigin) Date: Sun, 17 May 2009 08:39:48 +0400 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <4A0F216A.4050905@g.nevcal.com> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <4A0F0326.7070501@g.nevcal.com> <20090516185556.D8B5F3A4061@sparrow.telecommunity.com> <1242502564.4478.3.camel@jenner> <4A0F1D39.1070108@v.loewis.de> <4A0F216A.4050905@g.nevcal.com> Message-ID: <1242535188.4478.8.camel@jenner> ? ???, 16/05/2009 ? 13:26 -0700, Glenn Linderman ?????: > On approximately 5/16/2009 1:08 PM, came the following characters from > the keyboard of Martin v. 
Löwis: > > Yes, this: > > http://msdn.microsoft.com/en-us/library/aa365247.aspx > > Well, maybe he was missing that, or maybe he was missing that each entry > in the Windows PATH is allowed to be quoted, so that ; characters inside > quotes are part of path names, and ; characters outside of quotes are > separators. Yep, I hadn't thought about that. The MSDN entry makes it clear that ';' is valid in a file name. From shigin at rambler-co.ru Sun May 17 06:52:39 2009 From: shigin at rambler-co.ru (Alexander Shigin) Date: Sun, 17 May 2009 08:52:39 +0400 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <4A0F3B12.8080500@mrabarnett.plus.com> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <4A0F0326.7070501@g.nevcal.com> <20090516185556.D8B5F3A4061@sparrow.telecommunity.com> <4A0F100F.3070406@g.nevcal.com> <4A0F3B12.8080500@mrabarnett.plus.com> Message-ID: <1242535959.4478.12.camel@jenner> On Sat, 16/05/2009 at 23:15 +0100, MRAB wrote: > FYI, on RISC OS '/' is a valid filename character and '.' is used as > the directory separator. > > I'd probably say that TAB is s reasonable character to use, even > though it's OK in POSIX; after all, should anyone really be using a > control character in a filename? The '\0' char is invalid on both Windows and POSIX. I don't know if it's valid on RISC OS. From martin at v.loewis.de Sun May 17 07:03:14 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 17 May 2009 07:03:14 +0200 Subject: [Python-Dev] Cleanup for O& Message-ID: <4A0F9A92.6000508@v.loewis.de> Issue 6012 proposes to add cleanup support for O& converters; a first client for this would be PyUnicode_FSConverter. Cleanup is always necessary if the conversion function allocates memory and a later argument converter fails. The memory allocated must then be released. There are three options currently to provide such a function: 1. Make a code O&& with two function pointers. I find that too tedious to use. 2. Introduce a new code O$, that takes an O&-style function which, in addition, can also be called with a NULL PyObject*, meaning that it should clean up. 3. Extend O& so that its function pointers also support the cleanup mode (NULL first argument). Conversion functions that need cleanup would have to return a special constant rather than the usual value of 1. In addition, there is also the approach introduced in issue 5990: 4. Users of a conversion function that requires cleanup need to initialize the output pointer to NULL, and then release memory explicitly when the argument conversion fails. Which of these do you like best? Regards, Martin From ziade.tarek at gmail.com Sun May 17 14:55:45 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Sun, 17 May 2009 14:55:45 +0200 Subject: [Python-Dev] LZW support in tarfile ? Message-ID: <94bdd2610905170555y5faff2eav8708ec993d13259e@mail.gmail.com> Hello, I want to remove the usage of the "tar" command in Distutils in favor of the "tarfile" module. But, there's an option in Distutils.make_archive to create a tarball using the "compress" [1] program rather than gzip or bzip2. Using tar -Z, it will pipe it to the compress program if present. This program implements the LZW algorithm [2]. LZW used to be patented, but the patent seems to have expired in every country now [3].
On the Distutils side I can work things out so the tar archive created can be piped to an arbitrary compression program when it is not compressed using bzip2 or gzip. But I was wondering if we should we add a LZW support in tarinfo, besides gzip and bzip2 ? Although this compression standard doesn't seem very used these days, Regards Tarek [1] http://en.wikipedia.org/wiki/Compress [2] http://en.wikipedia.org/wiki/LZW [3] http://www.unisys.com/about__unisys/lzw -- Tarek Ziadé | http://ziade.org From google at mrabarnett.plus.com Sun May 17 15:04:06 2009 From: google at mrabarnett.plus.com (MRAB) Date: Sun, 17 May 2009 14:04:06 +0100 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <1242535959.4478.12.camel@jenner> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <4A0F0326.7070501@g.nevcal.com> <20090516185556.D8B5F3A4061@sparrow.telecommunity.com> <4A0F100F.3070406@g.nevcal.com> <4A0F3B12.8080500@mrabarnett.plus.com> <1242535959.4478.12.camel@jenner> Message-ID: <4A100B46.2080003@mrabarnett.plus.com> Alexander Shigin wrote: > On Sat, 16/05/2009 at 23:15 +0100, MRAB wrote: >> FYI, on RISC OS '/' is a valid filename character and '.' is used as >> the directory separator. >> >> I'd probably say that TAB is s reasonable character to use, even >> though it's OK in POSIX; after all, should anyone really be using a >> control character in a filename? > > The '\0' char is invalid on both Windows and POSIX. I don't know if it's > valid on RISC OS. > '\0' isn't a valid filename character on RISC OS. From solipsis at pitrou.net Sun May 17 15:19:44 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 17 May 2009 13:19:44 +0000 (UTC) Subject: [Python-Dev] LZW support in tarfile ? References: <94bdd2610905170555y5faff2eav8708ec993d13259e@mail.gmail.com> Message-ID: Tarek Ziadé gmail.com> writes: > > But I was wondering if we should we add a LZW support in tarinfo, > besides gzip and bzip2 ? > > Although this compression standard doesn't seem very used these days, It would be more useful to add LZMA / xz support. I don't think compress is used anymore, except perhaps on old legacy systems. On my Linux system, I have lots of .gz, .bz2 and .lzma files, but absolutely no .Z file. Regards Antoine. From fuzzyman at voidspace.org.uk Sun May 17 15:23:03 2009 From: fuzzyman at voidspace.org.uk (Michael Foord) Date: Sun, 17 May 2009 14:23:03 +0100 Subject: [Python-Dev] LZW support in tarfile ? In-Reply-To: References: <94bdd2610905170555y5faff2eav8708ec993d13259e@mail.gmail.com> Message-ID: <4A100FB7.6020200@voidspace.org.uk> Antoine Pitrou wrote: > Tarek Ziadé gmail.com> writes: > >> But I was wondering if we should we add a LZW support in tarinfo, >> besides gzip and bzip2 ? >> >> Although this compression standard doesn't seem very used these days, >> > > It would be more useful to add LZMA / xz support. > I don't think compress is used anymore, except perhaps on old legacy systems. > On my Linux system, I have lots of .gz, .bz2 and .lzma files, but absolutely no > .Z file. > I've seen the occasional .Z file in recent years, but never that I recall for a Python package. As plugging in external compression tools is less likely to work cross-platform wouldn't it be both easier and better to deprecate (and not replace) the compress support.
If there is a huge outcry adding LZW support to tarfile can be reconsidered. Michael Foord > Regards > > Antoine. > > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/fuzzyman%40voidspace.org.uk > -- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog From martin at v.loewis.de Sun May 17 17:00:18 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 17 May 2009 17:00:18 +0200 Subject: [Python-Dev] LZW support in tarfile ? In-Reply-To: <94bdd2610905170555y5faff2eav8708ec993d13259e@mail.gmail.com> References: <94bdd2610905170555y5faff2eav8708ec993d13259e@mail.gmail.com> Message-ID: <4A102682.7020207@v.loewis.de> > But, there's an option in Distutils.make_archive to create a tarball > using the "compress" [1] program rather than gzip or bzip2. > Using tar -Z, it will pipe it to the compress program if present. This > program implements the LZW algorithm [2]. As everybody else says: it might be best to just remove that option. For compatibility, perhaps deprecate it in 2.7 and 3.1, and remove in in 3.2. Regards, Martin From piet at cs.uu.nl Sun May 17 21:47:16 2009 From: piet at cs.uu.nl (Piet van Oostrum) Date: Sun, 17 May 2009 21:47:16 +0200 Subject: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces In-Reply-To: (Ned Deily's message of "Thu\, 30 Apr 2009 12\:54\:50 -0700") References: <20090427211447.GA4291@cskk.homeip.net> <49F658A5.7080807@g.nevcal.com> <79990c6b0904280220x5a1352b6u153edc7487c737f9@mail.gmail.com> <79990c6b0904280457g3c8b1153p84624b3ab1ef04be@mail.gmail.com> <49F6F09E.2020506@voidspace.org.uk> <1209A1AB-1A80-4E46-88B3-5F545476ADFA@mac.com> Message-ID: >>>>> Ned Deily (ND) wrote: >ND> In article , Piet van Oostrum >ND> wrote: >>> >>>>> Ronald Oussoren (RO) wrote: >>> >RO> For what it's worth, the OSX API's seem to behave as follows: >>> >RO> * If you create a file with an non-UTF8 name on a HFS+ filesystem the >>> >RO> system automaticly encodes the name. >>> >>> >RO> That is, open(chr(255), 'w') will silently create a file named '%FF' >>> >RO> instead of the name you'd expect on a unix system. >>> >>> Not for me (I am using Python 2.6.2). >>> >>> >>> f = open(chr(255), 'w') >>> Traceback (most recent call last): >>> File "", line 1, in >>> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' >>> >>> >ND> What version of OSX are you using? On Tiger 10.4.11 I see the failure >ND> you see but on Leopard 10.5.6 the behavior Ronald reports. Yes, I am using Tiger (10.4.11). Interesting that it has changed on Leopard. -- Piet van Oostrum URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4] Private email: piet at vanoostrum.org From martin at v.loewis.de Sun May 17 22:54:32 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sun, 17 May 2009 22:54:32 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI Message-ID: <4A107988.3020202@v.loewis.de> Thomas Wouters reminded me of a long-standing idea; I finally found the time to write it down. Please comment! Regards, Martin PEP: 384 Title: Defining a Stable ABI Version: $Revision: 72754 $ Last-Modified: $Date: 2009-05-17 21:14:52 +0200 (So, 17. Mai 2009) $ Author: Martin v. 
Löwis Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 17-May-2009 Python-Version: 3.2 Post-History: Abstract ======== Currently, each feature release introduces a new name for the Python DLL on Windows, and may cause incompatibilities for extension modules on Unix. This PEP proposes to define a stable set of API functions which are guaranteed to be available for the lifetime of Python 3, and which will also remain binary-compatible across versions. Extension modules and applications embedding Python can work with different feature releases as long as they restrict themselves to this stable ABI. Rationale ========= The primary source of ABI incompatibility is changes to the layout of in-memory structures. For example, the way in which string interning works, or the data type used to represent the size of an object, has changed during the life of Python 2.x. As a consequence, extension modules making direct access to fields of strings, lists, or tuples, would break if their code is loaded into a newer version of the interpreter without recompilation: offsets of other fields may have changed, making the extension modules access the wrong data. In some cases, the incompatibilities only affect internal objects of the interpreter, such as frame or code objects. For example, the way line numbers are represented has changed in the 2.x lifetime, as has the way in which local variables are stored (due to the introduction of closures). Even though most applications probably never used these objects, changing them still required changing the PYTHON_API_VERSION. On Linux, changes to the ABI are often not much of a problem: the system will provide a default Python installation, and many extension modules are already provided pre-compiled for that version. If additional modules are needed, or additional Python versions, users can typically compile them themselves on the system, resulting in modules that use the right ABI. On Windows, multiple simultaneous installations of different Python versions are common, and extension modules are compiled by their authors, not by end users. To reduce the risk of ABI incompatibilities, Python currently introduces a new DLL name pythonXY.dll for each feature release, whether or not ABI incompatibilities actually exist. With this PEP, it will be possible to reduce the dependency of binary extension modules on a specific Python feature release, and applications embedding Python can be made to work with different releases. Specification ============= The ABI specification falls into two parts: an API specification, specifying what function (groups) are available for use with the ABI, and a linkage specification specifying what libraries to link with. The actual ABI (layout of structures in memory, function calling conventions) is not specified, but implied by the compiler. A specific ABI is recommended for selected platforms. During the evolution of Python, new ABI functions will be added. Applications using them will then have a requirement on a minimum version of Python; this PEP provides no mechanism for such applications to fall back when the Python library is too old. Terminology ----------- Applications and extension modules that want to use this ABI are collectively referred to as "applications" from here on.
Header Files and Preprocessor Definitions ----------------------------------------- Applications shall only include the header file Python.h (before including any system headers), or, optionally, include pyconfig.h, and then Python.h. During the compilation of applications, the preprocessor macro Py_LIMITED_API must be defined. Doing so will hide all definitions that are not part of the ABI. Structures ---------- Only the following structures and structure fields are accessible to applications: - PyObject (ob_refcnt, ob_type) - PyVarObject (ob_base, ob_size) - Py_buffer (buf, obj, len, itemsize, readonly, ndim, shape, strides, suboffsets, smalltable, internal) - PyMethodDef (ml_name, ml_meth, ml_flags, ml_doc) - PyMemberDef (name, type, offset, flags, doc) - PyGetSetDef (name, get, set, doc, closure) The accessor macros to these fields (Py_REFCNT, Py_TYPE, Py_SIZE) are also available to applications. The following types are available, but opaque (i.e. incomplete): - PyThreadState - PyInterpreterState Type Objects ------------ The structure of type objects is not available to applications; declaration of "static" type objects is not possible anymore (for applications using this ABI). Instead, type objects get created dynamically. To allow an easy creation of types (in particular, to be able to fill out function pointers easily), the following structures and functions are available:: typedef struct{ int slot; /* slot id, see below */ void *pfunc; /* function pointer */ } PyType_Slot; struct{ const char* name; const char* doc; int basicsize; int itemsize; int flags; PyType_Slot *slots; /* terminated by slot==0. */ } PyType_Spec; PyObject* PyType_FromSpec(PyType_Spec*); To specify a slot, a unique slot id must be provided. New Python versions may introduce new slot ids, but slot ids will never be recycled. Slots may get deprecated, but continue to be supported throughout Python 3.x. The slot ids are named like the field names of the structures that hold the pointers in Python 3.1, with an added ``Py_`` prefix (i.e. Py_tp_dealloc instead of just tp_dealloc): - tp_dealloc, tp_print, tp_getattr, tp_setattr, tp_repr, tp_hash, tp_call, tp_str, tp_getattro, tp_setattro, tp_doc, tp_traverse, tp_clear, tp_richcompare, tp_iter, tp_iternext, tp_methods, tp_base, tp_descr_set, tp_descr_set, tp_init, tp_alloc, tp_new, tp_is_gc, tp_bases, tp_del - nb_add nb_subtract nb_multiply nb_remainder nb_divmod nb_power nb_negative nb_positive nb_absolute nb_bool nb_invert nb_lshift nb_rshift nb_and nb_xor nb_or nb_int nb_float nb_inplace_add nb_inplace_subtract nb_inplace_multiply nb_inplace_remainder nb_inplace_power nb_inplace_lshift nb_inplace_rshift nb_inplace_and nb_inplace_xor nb_inplace_or nb_floor_divide nb_true_divide nb_inplace_floor_divide nb_inplace_true_divide nb_index - sq_length sq_concat sq_repeat sq_item sq_ass_item was_sq_ass_slice sq_contains sq_inplace_concat sq_inplace_repeat - mp_length mp_subscript mp_ass_subscript - bf_getbuffer bf_releasebuffer XXX Not supported yet: tp_weaklistoffset, tp_dictoffset The following fields cannot be set during type definition: - tp_dict tp_mro tp_cache tp_subclasses tp_weaklist Functions and function-like Macros ---------------------------------- All functions starting with _Py are not available to applications. Also, all functions that expect parameter types that are unavailable to applications are excluded from the ABI, such as PyAST_FromNode (which expects a ``node*``). All other functions are available, unless excluded below. 
Function-like macros (in particular, field access macros) remain available to applications, but get replaced by function calls (unless their definition only refers to features of the ABI, such as the various _Check macros). ABI function declarations will not change their parameters or return types. If a change to the signature becomes necessary, a new function will be introduced. If the new function is source-compatible (e.g. if just the return type changes), an alias macro may get added to redirect calls to the new function when the application is recompiled. If continued provision of the old function is not possible, it may get deprecated, then removed, in accordance with PEP 7, causing applications that use that function to break. Excluded Functions ------------------ Functions declared in the following header files are not part of the ABI: - cellobject.h - classobject.h - code.h - frameobject.h - funcobject.h - genobject.h - pyarena.h - pydebug.h - symtable.h - token.h - traceback.h Global Variables ---------------- Global variables representing types and exceptions are available to applications. XXX provide a complete list. XXX should restrict list of globals to truly "builtin" stuff, excluding everything that can also be looked up through imports. XXX may specify access to predefined types and exceptions through the interpreter state, with appropriate Get macros. Other Macros ------------ All macros defining symbolic constants are available to applications; the numeric values will not change. In addition, the following macros are available: - Py_BEGIN_ALLOW_THREADS, Py_BLOCK_THREADS, Py_UNBLOCK_THREADS, Py_END_ALLOW_THREADS Linkage ------- On Windows, applications shall link with python3.dll; an import library python3.lib will be available. This DLL will redirect all of its API functions through /export linker options to the full interpreter DLL, i.e. python3y.dll. XXX is it possible to redirect global variables in the same way? If not, python3.dll would have to copy them, and we should verify that all available global variables are read-only. On Unix systems, the ABI is typically provided by the python executable itself. PyModule_Create is changed to pass ``3`` as the API version if the extension module was compiled with Py_LIMITED_API; the version check for the API version will accept either 3 or the current PYTHON_API_VERSION as conforming. If Python is compiled as a shared library, it is installed as both libpython3.so and libpython3.y.so; applications conforming to this PEP should then link to the former. XXX is it possible to make the soname libpython.so.3, and still have some applications link to libpython3.y.so? Implementation Strategy ======================= This PEP will be implemented in a branch, allowing users to check whether their modules conform to the ABI. To simplify this testing, an additional macro Py_LIMITED_API_WITH_TYPES will expose the existing type object layout, to let users postpone rewriting all types. When this branch is merged into the 3.2 code base, this macro will be removed. Copyright ========= This document has been placed in the public domain. From dirkjan at ochtman.nl Sun May 17 23:47:07 2009 From: dirkjan at ochtman.nl (Dirkjan Ochtman) Date: Sun, 17 May 2009 23:47:07 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A107988.3020202@v.loewis.de> References: <4A107988.3020202@v.loewis.de> Message-ID: On Sun, May 17, 2009 at 10:54 PM, "Martin v.
L?wis" wrote: > Excluded Functions > ------------------ > > Functions declared in the following header files are not part > of the ABI: > - cellobject.h > - classobject.h > - code.h > - frameobject.h > - funcobject.h > - genobject.h > - pyarena.h > - pydebug.h > - symtable.h > - token.h > - traceback.h What kind of effect does this have on optimization efforts, for example all the stuff done by Antoine Pitrou over the last few months, and the first few results from unladen? Will it mean we won't get to the good optimizations until 4.0? Or does it just mean unladen swallow takes longer to come back to trunk (until 4.0) and every extension author who wants to be compatible with it will basically have the same burden as now? Cheers, Dirkjan From martin at v.loewis.de Mon May 18 00:07:59 2009 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Mon, 18 May 2009 00:07:59 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: References: <4A107988.3020202@v.loewis.de> Message-ID: <4A108ABF.9060909@v.loewis.de> >> Functions declared in the following header files are not part >> of the ABI: >> - cellobject.h >> - classobject.h >> - code.h >> - frameobject.h >> - funcobject.h >> - genobject.h >> - pyarena.h >> - pydebug.h >> - symtable.h >> - token.h >> - traceback.h > > What kind of effect does this have on optimization efforts, for > example all the stuff done by Antoine Pitrou over the last few months, > and the first few results from unladen? I fail to see the relationship, so: no effect that I can see. Why do you think that optimization efforts could be related to the PEP 384 proposal? Regards, Martin From g.brandl at gmx.net Mon May 18 00:17:31 2009 From: g.brandl at gmx.net (Georg Brandl) Date: Mon, 18 May 2009 00:17:31 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A107988.3020202@v.loewis.de> References: <4A107988.3020202@v.loewis.de> Message-ID: Martin v. L?wis schrieb: > Header Files and Preprocessor Definitions > ----------------------------------------- > > Applications shall only include the header file Python.h (before > including any system headers), or, optionally, include pyconfig.h, and > then Python.h. What about structmember.h? It's not yet included with Python.h AFAICS. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. From dirkjan at ochtman.nl Mon May 18 00:34:52 2009 From: dirkjan at ochtman.nl (Dirkjan Ochtman) Date: Mon, 18 May 2009 00:34:52 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A108ABF.9060909@v.loewis.de> References: <4A107988.3020202@v.loewis.de> <4A108ABF.9060909@v.loewis.de> Message-ID: On Mon, May 18, 2009 at 12:07 AM, "Martin v. L?wis" wrote: > I fail to see the relationship, so: no effect that I can see. > > Why do you think that optimization efforts could be related to > the PEP 384 proposal? It would seem to me that optimizations are likely to require data structure changes, for exactly the kind of core data structures that you're talking about locking down. But that's just a high-level view, I might be wrong. 
Cheers, Dirkjan From martin at v.loewis.de Mon May 18 00:35:06 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Mon, 18 May 2009 00:35:06 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: References: <4A107988.3020202@v.loewis.de> Message-ID: <4A10911A.90704@v.loewis.de> >> Header Files and Preprocessor Definitions >> ----------------------------------------- >> >> Applications shall only include the header file Python.h (before >> including any system headers), or, optionally, include pyconfig.h, and >> then Python.h. > > What about structmember.h? It's not yet included with Python.h AFAICS. Right - I think it should be, though. Is there a reason why it's not included? The only reason I can see is that it isn't completely namespace-safe, e.g. it defines a constant READONLY. Not sure whether the T_ constants would need to be changed as well. So if that's the rationale, I would propose to make it namespace-safe under a different file name, and add alias #defines in structmember.h for compatibility. I also think this should happen independent of PEP 384. See also issue 2897 - perhaps we can even fix it for 3.1. Regards, Martin From solipsis at pitrou.net Mon May 18 00:43:30 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 17 May 2009 22:43:30 +0000 (UTC) Subject: [Python-Dev] PEP 384: Defining a Stable ABI References: <4A107988.3020202@v.loewis.de> <4A108ABF.9060909@v.loewis.de> Message-ID: Dirkjan Ochtman ochtman.nl> writes: > > It would seem to me that optimizations are likely to require data > structure changes, for exactly the kind of core data structures that > you're talking about locking down. But that's just a high-level view, > I might be wrong. Unless I'm misunderstanding something, Martin doesn't advocate locking data structures down (except a couple of outliers such as Py_buffer). An ABI-compliant application mustn't tinker directly with Python's data structures, but use the ABI functions. Regards Antoine. From dirkjan at ochtman.nl Mon May 18 00:46:21 2009 From: dirkjan at ochtman.nl (Dirkjan Ochtman) Date: Mon, 18 May 2009 00:46:21 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: References: <4A107988.3020202@v.loewis.de> <4A108ABF.9060909@v.loewis.de> Message-ID: On Mon, May 18, 2009 at 12:43 AM, Antoine Pitrou wrote: > Unless I'm misunderstanding something, Martin doesn't advocate locking data > structures down (except a couple of outliers such as Py_buffer). An > ABI-compliant application mustn't tinker directly with Python's data structures, > but use the ABI functions. Right. Sorry about the noise, then. Cheers, Dirkjan From martin at v.loewis.de Mon May 18 00:53:00 2009 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Mon, 18 May 2009 00:53:00 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: References: <4A107988.3020202@v.loewis.de> <4A108ABF.9060909@v.loewis.de> Message-ID: <4A10954C.6070401@v.loewis.de> Dirkjan Ochtman wrote: > On Mon, May 18, 2009 at 12:07 AM, "Martin v. L?wis" wrote: >> I fail to see the relationship, so: no effect that I can see. >> >> Why do you think that optimization efforts could be related to >> the PEP 384 proposal? > > It would seem to me that optimizations are likely to require data > structure changes, for exactly the kind of core data structures that > you're talking about locking down. But that's just a high-level view, > I might be wrong. Ah. 
It's exactly the opposite: The purpose of the PEP is not to lock the data structures down, but to allow more flexible evolution of them - by completely hiding them from extension modules. Currently, any data structure change must be weighed for its impact on binary compatibility. With the PEP, changing structures can be done fairly freely - with the exception of the very few structures that do get locked down. In particular, the list of header files that you quoted precisely contains the structures that can be modified with no impact on the ABI. I'm not aware that any of the structures that I propose to lock would be relevant for optimization - but I might be wrong. If so, I'd like to know, and it would be possible to add accessor functions in cases where extension modules might still legitimately want to access certain fields. Certain changes to the VM would definitely be binary-incompatible, such as removal of reference counting. However, such a change would probably have a much wider effect, breaking not just binary compatibility, but also source compatibility. It would be justified to call a Python release that makes such a change 4.0. Regards, Martin From martin at v.loewis.de Mon May 18 01:04:21 2009 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Mon, 18 May 2009 01:04:21 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <1A472770E042064698CB5ADC83A12ACD016E82D6@TK5EX14MBXC116.redmond.corp.microsoft.com> References: <4A107988.3020202@v.loewis.de> <4A108ABF.9060909@v.loewis.de> <1A472770E042064698CB5ADC83A12ACD016E82D6@TK5EX14MBXC116.redmond.corp.microsoft.com> Message-ID: <4A1097F5.2050009@v.loewis.de> Dino Viehland wrote: > Dirkjan Ochtman wrote: >> It would seem to me that optimizations are likely to require data >> structure changes, for exactly the kind of core data structures that >> you're talking about locking down. But that's just a high-level view, >> I might be wrong. >> > > > In particular I would guess that ref counting is the biggest issue here. > I would think not directly exposing the field and having inc/dec ref > Functions (real methods, not macros) for it would give a lot more > ability to change the API in the future. In the context of optimization, I'm skeptical that introducing functions for the reference counting would be useful. Making the INCREF/DECREF macros functions just in case the reference counting goes away is IMO an unacceptable performance cost. Instead, such a change should go through the regular deprecation procedure and/or cause the release of Python 4.0. > It also might make it easier for alternate implementations to support > the same API so some modules could work cross implementation - but I > suspect that's a non-goal of this PEP :). Indeed :-) I'm also skeptical that this would actually allow cross-implementation modules to happen. The list of functions that an alternate implementation would have to provide is fairly long. The memory management APIs in particular also assume a certain layout of Python objects in general, namely that they start with a header whose size is a compile-time constant. Again, making this more flexible "just in case" would also impact performance, and probably fairly badly so. > Other fields directly accessed (via macros or otherwise) might have similar > problems but they don't seem as core as ref counting. Access to the type object reference is probably similar. All the other structs are used "directly" in C code, with no accessor macros. 
Regards, Martin From dinov at microsoft.com Mon May 18 00:48:05 2009 From: dinov at microsoft.com (Dino Viehland) Date: Sun, 17 May 2009 22:48:05 +0000 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: References: <4A107988.3020202@v.loewis.de> <4A108ABF.9060909@v.loewis.de> Message-ID: <1A472770E042064698CB5ADC83A12ACD016E82D6@TK5EX14MBXC116.redmond.corp.microsoft.com> Dirkjan Ochtman wrote: > > It would seem to me that optimizations are likely to require data > structure changes, for exactly the kind of core data structures that > you're talking about locking down. But that's just a high-level view, > I might be wrong. > In particular I would guess that ref counting is the biggest issue here. I would think not directly exposing the field and having inc/dec ref Functions (real methods, not macros) for it would give a lot more ability to change the API in the future. It also might make it easier for alternate implementations to support the same API so some modules could work cross implementation - but I suspect that's a non-goal of this PEP :). Other fields directly accessed (via macros or otherwise) might have similar problems but they don't seem as core as ref counting. From fuzzyman at voidspace.org.uk Mon May 18 01:53:12 2009 From: fuzzyman at voidspace.org.uk (Michael Foord) Date: Mon, 18 May 2009 00:53:12 +0100 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A1097F5.2050009@v.loewis.de> References: <4A107988.3020202@v.loewis.de> <4A108ABF.9060909@v.loewis.de> <1A472770E042064698CB5ADC83A12ACD016E82D6@TK5EX14MBXC116.redmond.corp.microsoft.com> <4A1097F5.2050009@v.loewis.de> Message-ID: <4A10A368.1060406@voidspace.org.uk> Martin v. L?wis wrote: > Dino Viehland wrote: > >> Dirkjan Ochtman wrote: >> >>> It would seem to me that optimizations are likely to require data >>> structure changes, for exactly the kind of core data structures that >>> you're talking about locking down. But that's just a high-level view, >>> I might be wrong. >>> >>> >> In particular I would guess that ref counting is the biggest issue here. >> I would think not directly exposing the field and having inc/dec ref >> Functions (real methods, not macros) for it would give a lot more >> ability to change the API in the future. >> > > In the context of optimization, I'm skeptical that introducing functions > for the reference counting would be useful. Making the INCREF/DECREF > macros functions just in case the reference counting goes away is IMO > an unacceptable performance cost. > > Instead, such a change should go through the regular deprecation > procedure and/or cause the release of Python 4.0. > > >> It also might make it easier for alternate implementations to support >> the same API so some modules could work cross implementation - but I >> suspect that's a non-goal of this PEP :). >> > > Indeed :-) I'm also skeptical that this would actually allow > cross-implementation modules to happen. The list of functions that > an alternate implementation would have to provide is fairly long. > > Just in case you're unaware of it; the company I work for has an open source project called Ironclad. This *is* a reimplementation of the Python C API and gives us binary compatibility with [some subset of] Python C extensions for use from IronPython. http://www.resolversystems.com/documentation/index.php/Ironclad.html It's an ambitious project but it is now at the stage where 1000s of the Numpy and Scipy tests pass when run from IronPython. 
I don't think this PEP impacts the project, but it is not completely unfeasible for the alternative implementations to do this. In particular we have had to address the issue of the GIL and extensions (IronPython has no GIL) and reference counting (which IronPython also doesn't) use. Michael Foord -- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog From foom at fuhm.net Mon May 18 01:35:59 2009 From: foom at fuhm.net (James Y Knight) Date: Sun, 17 May 2009 19:35:59 -0400 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A107988.3020202@v.loewis.de> References: <4A107988.3020202@v.loewis.de> Message-ID: <64C7B6A8-2273-4A41-9CFA-72B7B1D05361@fuhm.net> On May 17, 2009, at 4:54 PM, Martin v. L?wis wrote: > Currently, each feature release introduces a new name for the > Python DLL on Windows, and may cause incompatibilities for extension > modules on Unix. This PEP proposes to define a stable set of API > functions which are guaranteed to be available for the lifetime > of Python 3, and which will also remain binary-compatible across > versions. Extension modules and applications embedding Python > can work with different feature releases as long as they restrict > themselves to this stable ABI. It seems like a good ideal to strive for. But I think this is too strong a promise. IMO it would be better to say that ABI compatibility across releases is a goal. If someone does make a change that breaks the ABI, I'd expect whomever is proposing it to put forth a fairly strong argument towards why it's a worthwhile change. But it should be possible and allowed, given the right circumstances. Because I think it's pretty much inevitable that it *will* need to happen, sometime. (of course there will need to be ABI tests, so that any potential ABI breakages are known about when they occur) Python is much more defined by its source language than its C extension API, so tying the python major version number to the C ABI might not be the best idea from a "marketing" standpoint. (I can see it now..."Python 4.0 major new features: we changed the C method definition struct layout incompatibly" :) James From martin at v.loewis.de Mon May 18 08:00:57 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Mon, 18 May 2009 08:00:57 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A10A368.1060406@voidspace.org.uk> References: <4A107988.3020202@v.loewis.de> <4A108ABF.9060909@v.loewis.de> <1A472770E042064698CB5ADC83A12ACD016E82D6@TK5EX14MBXC116.redmond.corp.microsoft.com> <4A1097F5.2050009@v.loewis.de> <4A10A368.1060406@voidspace.org.uk> Message-ID: <4A10F999.5050106@v.loewis.de> >>> It also might make it easier for alternate implementations to support >>> the same API so some modules could work cross implementation - but I >>> suspect that's a non-goal of this PEP :). >>> >> >> Indeed :-) I'm also skeptical that this would actually allow >> cross-implementation modules to happen. The list of functions that >> an alternate implementation would have to provide is fairly long. >> >> > > Just in case you're unaware of it; the company I work for has an open > source project called Ironclad. I was unaware indeed; thanks for pointing this out. IIUC, it's not just an API emulation, but also an ABI emulation. > In particular we have had to address the issue of the GIL and extensions > (IronPython has no GIL) and reference counting (which IronPython also > doesn't) use. 
I think this somewhat strengthens the point I was trying to make: An alternate implementation that tries to be API compatible has to consider so many things that it is questionable whether making Py_INCREF/DECREF functions would be any simplification. So I just ask: a) Would it help IronClad if it could restrict itself to PEP 384 compatible modules? b) Would further restrictions in the PEP help that cause? Regards, Martin From nick at craig-wood.com Mon May 18 10:06:17 2009 From: nick at craig-wood.com (Nick Craig-Wood) Date: Mon, 18 May 2009 09:06:17 +0100 Subject: [Python-Dev] LZW support in tarfile ? In-Reply-To: <4A100FB7.6020200@voidspace.org.uk> References: <94bdd2610905170555y5faff2eav8708ec993d13259e@mail.gmail.com> <4A100FB7.6020200@voidspace.org.uk> Message-ID: <20090518080621.8A32C14C293@irishsea.home.craig-wood.com> Michael Foord wrote: > Antoine Pitrou wrote: > > Tarek Ziad? gmail.com> writes: > > > >> But I was wondering if we should we add a LZW support in tarinfo, > >> besides gzip and bzip2 ? > >> > >> Although this compression standard doesn't seem very used these days, > >> > > > > It would be more useful to add LZMA / xz support. > > I don't think compress is used anymore, except perhaps on old legacy systems. > > On my Linux system, I have lots of .gz, .bz2 and .lzma files, but absolutely no > > .Z file. > > I've seen the occasional .Z file in recent years, but never that I > recall for a Python package. On my unix filesystem (which has files stretching back over 20 years) I find only two .Z files, one dated 1989 and one 2002. I think you can safely say that compress is gone! The worst you are doing by removing compress support is getting the user of some ancient platform to download one of the binaries here first. http://www.gzip.org/#exe > As plugging in external compression tools is less likely to work > cross-platform wouldn't it be both easier and better to deprecate (and > not replace) the compress support. Agreed. -- Nick Craig-Wood -- http://www.craig-wood.com/nick From ziade.tarek at gmail.com Mon May 18 10:27:58 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Mon, 18 May 2009 10:27:58 +0200 Subject: [Python-Dev] LZW support in tarfile ? In-Reply-To: <20090518080621.8A32C14C293@irishsea.home.craig-wood.com> References: <94bdd2610905170555y5faff2eav8708ec993d13259e@mail.gmail.com> <4A100FB7.6020200@voidspace.org.uk> <20090518080621.8A32C14C293@irishsea.home.craig-wood.com> Message-ID: <94bdd2610905180127x2ae625fbi8800c7a392ef0de5@mail.gmail.com> Ok thanks for all the feedback, I'll remove compress support Tarek On Mon, May 18, 2009 at 10:06 AM, Nick Craig-Wood wrote: > Michael Foord wrote: >> ?Antoine Pitrou wrote: >> > Tarek Ziad? gmail.com> writes: >> > >> >> But I was wondering if we should we add a LZW support in tarinfo, >> >> besides gzip and bzip2 ? >> >> >> >> Although this compression standard doesn't seem very used these days, >> >> >> > >> > It would be more useful to add LZMA / xz support. >> > I don't think compress is used anymore, except perhaps on old legacy systems. >> > On my Linux system, I have lots of .gz, .bz2 and .lzma files, but absolutely no >> > .Z file. >> >> ?I've seen the occasional .Z file in recent years, but never that I >> ?recall for a Python package. > > On my unix filesystem (which has files stretching back over 20 years) > I find only two .Z files, one dated 1989 and one 2002. ?I think you > can safely say that compress is gone! 
> > The worst you are doing by removing compress support is getting the > user of some ancient platform to download one of the binaries here > first. > > ?http://www.gzip.org/#exe > >> ?As plugging in external compression tools is less likely to work >> ?cross-platform wouldn't it be both easier and better to deprecate (and >> ?not replace) the compress support. > > Agreed. > > -- > Nick Craig-Wood -- http://www.craig-wood.com/nick > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/ziade.tarek%40gmail.com > -- Tarek Ziad? | http://ziade.org From fuzzyman at voidspace.org.uk Mon May 18 13:17:37 2009 From: fuzzyman at voidspace.org.uk (Michael Foord) Date: Mon, 18 May 2009 12:17:37 +0100 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A10F999.5050106@v.loewis.de> References: <4A107988.3020202@v.loewis.de> <4A108ABF.9060909@v.loewis.de> <1A472770E042064698CB5ADC83A12ACD016E82D6@TK5EX14MBXC116.redmond.corp.microsoft.com> <4A1097F5.2050009@v.loewis.de> <4A10A368.1060406@voidspace.org.uk> <4A10F999.5050106@v.loewis.de> Message-ID: <4A1143D1.1050806@voidspace.org.uk> Martin v. L?wis wrote: >>>> It also might make it easier for alternate implementations to support >>>> the same API so some modules could work cross implementation - but I >>>> suspect that's a non-goal of this PEP :). >>>> >>>> >>> Indeed :-) I'm also skeptical that this would actually allow >>> cross-implementation modules to happen. The list of functions that >>> an alternate implementation would have to provide is fairly long. >>> >>> >>> >> Just in case you're unaware of it; the company I work for has an open >> source project called Ironclad. >> > > I was unaware indeed; thanks for pointing this out. > > IIUC, it's not just an API emulation, but also an ABI emulation. > > Correct. >> In particular we have had to address the issue of the GIL and extensions >> (IronPython has no GIL) and reference counting (which IronPython also >> doesn't) use. >> > > I think this somewhat strengthens the point I was trying to make: An > alternate implementation that tries to be API compatible has to consider > so many things that it is questionable whether making Py_INCREF/DECREF > functions would be any simplification. > It would actually have been helpful for us, but I understand that it would be a big performance hit. The Ironclad garbage collection mechanism is described here: http://www.voidspace.org.uk/python/weblog/arch_d7_2009_01_24.shtml#e1055 We artificially inflate the refcount of all objects that Ironclad creates to 2 and hold a reference to them on the .NET side to make them ineligible for garbage collection. Because we can't always know when objects have been decreffed back down to 1, there are some circumstances when we have to scan all the objects we are holding onto. If their refcount is only 1 then we no longer need to hold a reference them. When nothing is using them on the IronPython side either normal .NET garbage collection kicks in and the IronPython proxy object has a destructor that calls back into Ironclad and uses the CPython dealloc method. > So I just ask: > a) Would it help IronClad if it could restrict itself to PEP 384 > compatible modules? > b) Would further restrictions in the PEP help that cause? > I've forwarded these questions to the lead developer of Ironclad (William Reade) along with a link to the PEP. 
He isn't on Python-dev so I may have to be a proxy for him in discussion. His initial response was "looks pretty sweet". Michael > Regards, > Martin > -- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog From william at resolversystems.com Tue May 19 11:09:58 2009 From: william at resolversystems.com (William Reade) Date: Tue, 19 May 2009 10:09:58 +0100 Subject: [Python-Dev] [Fwd: Re: PEP 384: Defining a Stable ABI] In-Reply-To: <4A1141B4.4090608@voidspace.org.uk> References: <4A1141B4.4090608@voidspace.org.uk> Message-ID: <4A127766.8000101@resolversystems.com> My perspective is as follows: 1) If PEP-384 had always been in place, my life would now be a lot easier. 2) Since it hasn't always been in place, its introduction won't help me in the short term: there are an awful lot of extension modules that use excluded functions (for example, all(?) PyCxx modules use PyCode_New and PyFrame_New to get nicer tracebacks), and I'll still have to handle all these cases until everyone is up-to-date with whatever version of Python this gets accepted into. 3) Regardless, this PEP makes me very happy, because I can now look forward to the glorious day when all extension modules are 384-compatible (and even *some* modules becoming compatible will make me pretty happy). However, I'm not sure exactly how we can get there from here; I suspect that certain features of certain extensions already depend critically upon implementation details which will become hidden. The most extreme illustrative example I know is from NumPy (in scalarmathmodule.c), and looks like this: PyInt_Type.tp_as_number = PyLongArrType_Type.tp_as_number; PyInt_Type.tp_compare = PyLongArrType_Type.tp_compare; PyInt_Type.tp_richcompare = PyLongArrType_Type.tp_richcompare; ...and I fear that many many similar (if perhaps less frightening) dependencies exist elsewhere. Regardless, in answer to the two specific questions you ask: a) We don't really have that option. However, I would have a much higher degree of confidence in running PEP-384-compatible modules under Ironclad than I do with current modules, simply because I would no longer need to worry about (say) edge cases in which extension writers suddenly try to directly access op->ob_type->tp_as_number->nb_power. b) I can't think of any more useful restrictions. The PEP would solve my biggest current worry, which is that my current implementation allows managed/unmanaged lists to fall out of sync in certain circumstances (but if every list mutation happened via an API call, it wouldn't be an issue). Best Regards William Michael Foord wrote: > The questions from Martin v. Lowis are in the email below. > > The PEP under discussion is: > > http://www.python.org/dev/peps/pep-0384/ > > I can proxy any replies you want to send, or you can join Python-dev. > > All the best, > > Michael > > -------- Original Message -------- > Subject: Re: [Python-Dev] PEP 384: Defining a Stable ABI > Date: Mon, 18 May 2009 08:00:57 +0200 > From: "Martin v. 
L?wis" > To: Michael Foord > CC: Dino Viehland , Python-Dev > , Unladen Swallow > , Python List > References: <4A107988.3020202 at v.loewis.de> > > <4A108ABF.9060909 at v.loewis.de> > > <1A472770E042064698CB5ADC83A12ACD016E82D6 at TK5EX14MBXC116.redmond.corp.microsoft.com> > <4A1097F5.2050009 at v.loewis.de> <4A10A368.1060406 at voidspace.org.uk> > > > > >>> It also might make it easier for alternate implementations to support > >>> the same API so some modules could work cross implementation - but I > >>> suspect that's a non-goal of this PEP :). > >>> > >> > >> Indeed :-) I'm also skeptical that this would actually allow > >> cross-implementation modules to happen. The list of functions that > >> an alternate implementation would have to provide is fairly long. > >> > >> > > > > Just in case you're unaware of it; the company I work for has an open > > source project called Ironclad. > > I was unaware indeed; thanks for pointing this out. > > IIUC, it's not just an API emulation, but also an ABI emulation. > > > In particular we have had to address the issue of the GIL and extensions > > (IronPython has no GIL) and reference counting (which IronPython also > > doesn't) use. > > I think this somewhat strengthens the point I was trying to make: An > alternate implementation that tries to be API compatible has to consider > so many things that it is questionable whether making Py_INCREF/DECREF > functions would be any simplification. > > So I just ask: > a) Would it help IronClad if it could restrict itself to PEP 384 > compatible modules? > b) Would further restrictions in the PEP help that cause? > > Regards, > Martin > > > -- > http://www.ironpythoninaction.com/ > http://www.voidspace.org.uk/blog > > From ronaldoussoren at mac.com Tue May 19 14:59:31 2009 From: ronaldoussoren at mac.com (Ronald Oussoren) Date: Tue, 19 May 2009 14:59:31 +0200 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <4A100B46.2080003@mrabarnett.plus.com> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <4A0F0326.7070501@g.nevcal.com> <20090516185556.D8B5F3A4061@sparrow.telecommunity.com> <4A0F100F.3070406@g.nevcal.com> <4A0F3B12.8080500@mrabarnett.plus.com> <1242535959.4478.12.camel@jenner> <4A100B46.2080003@mrabarnett.plus.com> Message-ID: <06B4F3E6-A8BD-4B4A-87C8-88B52B00F7EF@mac.com> On 17 May, 2009, at 15:04, MRAB wrote: > Alexander Shigin wrote: >> ? ???, 16/05/2009 ? 23:15 +0100, MRAB ?????: >>> FYI, on RISC OS '/' is a valid filename character and '.' is used as >>> the directory separator. >>> >>> I'd probably say that TAB is s reasonable character to use, even >>> though it's OK in POSIX; after all, should anyone really be using a >>> control character in a filename? >> The '\0' char is invalid in both windows and posix. I don't know if >> one >> valid on RISC OS. > '\0' isn't a valid filename character on RISC OS. Wouldn't it be possible to use a CSV file for this? That way we wouldn't have to invent yet another escaping mechanism and there's already good suppport for reading and writing CSV files in the standard library. Ronald -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 2224 bytes Desc: not available URL: From solipsis at pitrou.net Tue May 19 16:03:12 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 19 May 2009 14:03:12 +0000 (UTC) Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <4A0F0326.7070501@g.nevcal.com> <20090516185556.D8B5F3A4061@sparrow.telecommunity.com> <4A0F100F.3070406@g.nevcal.com> <4A0F3B12.8080500@mrabarnett.plus.com> <1242535959.4478.12.camel@jenner> <4A100B46.2080003@mrabarnett.plus.com> <06B4F3E6-A8BD-4B4A-87C8-88B52B00F7EF@mac.com> Message-ID: Ronald Oussoren mac.com> writes: > > Wouldn't it be possible to use a CSV file for this? That way we > wouldn't have to invent yet another escaping mechanism and there's > already good suppport for reading and writing CSV files in the > standard library. +1 We can even customize the delimiter if you want to make it more readable (or if there's a shortage of bikeshed material ;-)). cheers Antoine. From ziade.tarek at gmail.com Tue May 19 16:04:21 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Tue, 19 May 2009 16:04:21 +0200 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <20090516165302.C71643A4061@sparrow.telecommunity.com> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> Message-ID: <94bdd2610905190704m5efdeb4dne1d559e9964331bd@mail.gmail.com> On Sat, May 16, 2009 at 6:55 PM, P.J. Eby wrote: > > 1. Why ';' separation, instead of tabs as in PEP 262? ?Aren't semicolons a > valid character in filenames? I am changing this into a . for now. What about Antoine's idea about doing a quote() on the names ? >From my point of view seems more simple to deal with, if 3rd-party tools want to work on these files without using pkgutil or Python. > > 4. There should probably be a way to iterate over the projects in a > directory, since it's otherwise impossible for an installation tool to find > out what project(s) "own" a file that conflicts with something being > installed. ?Alternatively, reshaping the file API to allow querying by path > as well as by project might work. I am adding a "get_projects" api: get_projects() -> iterator Provides an iterator that will return (name, path) tuples, where `name` is the name of a registered project and `path` the path to its `egg-info` directory. But for the use case you are mentioning, what about an explicit API: get_owners(paths) -> sequence of project names returns a sequence of tuple. For each path in the "paths" list, a tuple of project names is returned > > 5. If any cache mechanisms are to be used by the API, the API *must* make it > possible to bypass or explicitly manage that cache, as otherwise > installation tools and tools that manipulate sys.path at runtime may end up > using incorrect data. work in progress - (I am afraid I have to write an advanced prototype to be able to know exaclty how the cache might work, and so, what API we should have) > > 6. get_files() doesn't document whether the yielded paths are absolute or > relative, local or cross-platform, etc. 
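A short sketch of what a csv-based RECORD could look like with that default dialect; the (path, checksum, size) field layout is only an illustration, not the PEP's final field list:

    import csv

    entries = [
        ("docutils/__init__.py", "md5:a1b2c3d4", "1307"),
        ("docutils/odd,name.py", "md5:e5f6a7b8", "42"),   # ',' in the name
    ]

    # 'wb'/'rb' for the 2.x csv module; on py3k open with 'w'/'r' and newline=''
    with open("RECORD", "wb") as f:
        csv.writer(f).writerows(entries)

    with open("RECORD", "rb") as f:
        for path, checksum, size in csv.reader(f):
            print(path)

The writer quotes the second path automatically, so a ',' in a filename round-trips without any hand-rolled escaping.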
I am fixing this as well >> I need to find back your comments for this part, I must have missed >> them. That's >> the last part I didn't work out yet on the current PEP revision. > > Well, if you can't find them, the EggFormats doc explains how these file/dir > structures are currently laid out by setuptools, easy_install, pip, etc., > and the PEP should probably reference that. work in progress Tarek -- Tarek Ziad? | http://ziade.org From ziade.tarek at gmail.com Tue May 19 16:12:05 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Tue, 19 May 2009 16:12:05 +0200 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <4A0F0326.7070501@g.nevcal.com> <20090516185556.D8B5F3A4061@sparrow.telecommunity.com> <4A0F100F.3070406@g.nevcal.com> <4A0F3B12.8080500@mrabarnett.plus.com> <1242535959.4478.12.camel@jenner> <4A100B46.2080003@mrabarnett.plus.com> <06B4F3E6-A8BD-4B4A-87C8-88B52B00F7EF@mac.com> Message-ID: <94bdd2610905190712h2ac8883fs7fc85224a2fa3ff6@mail.gmail.com> On Tue, May 19, 2009 at 4:03 PM, Antoine Pitrou wrote: > Ronald Oussoren mac.com> writes: >> >> Wouldn't it be possible to use a CSV file for this? That way we >> wouldn't have to invent yet another escaping mechanism and there's >> already good suppport for reading and writing CSV files in the >> standard library. > > +1 > > We can even customize the delimiter if you want to make it more readable (or if > there's a shortage of bikeshed material ;-)). +1 and the default csv delimiter "," makes it perfectly readable From google at mrabarnett.plus.com Tue May 19 16:21:25 2009 From: google at mrabarnett.plus.com (MRAB) Date: Tue, 19 May 2009 15:21:25 +0100 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <94bdd2610905190704m5efdeb4dne1d559e9964331bd@mail.gmail.com> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <94bdd2610905190704m5efdeb4dne1d559e9964331bd@mail.gmail.com> Message-ID: <4A12C065.3000705@mrabarnett.plus.com> Tarek Ziad? wrote: > On Sat, May 16, 2009 at 6:55 PM, P.J. Eby wrote: >> 1. Why ';' separation, instead of tabs as in PEP 262? Aren't semicolons a >> valid character in filenames? > > I am changing this into a . for now. > > What about Antoine's idea about doing a quote() on the names ? > >>From my point of view seems more simple to deal with, if 3rd-party > tools want to work on these files without using pkgutil or Python. > >> 4. There should probably be a way to iterate over the projects in a >> directory, since it's otherwise impossible for an installation tool to find >> out what project(s) "own" a file that conflicts with something being >> installed. Alternatively, reshaping the file API to allow querying by path >> as well as by project might work. > > I am adding a "get_projects" api: > > get_projects() -> iterator > > Provides an iterator that will return (name, path) tuples, where `name` > is the name of a registered project and `path` the path to its `egg-info` > directory. > > But for the use case you are mentioning, what about an explicit API: > > get_owners(paths) -> sequence of project names > > returns a sequence of tuple. 
For each path in the "paths" list, a > tuple of project names > is returned > >> 5. If any cache mechanisms are to be used by the API, the API *must* make it >> possible to bypass or explicitly manage that cache, as otherwise >> installation tools and tools that manipulate sys.path at runtime may end up >> using incorrect data. > > work in progress - (I am afraid I have to write an advanced prototype > to be able to know > exaclty how the cache might work, and so, what API we should have) > >> 6. get_files() doesn't document whether the yielded paths are absolute or >> relative, local or cross-platform, etc. > > I am fixing this as well > > >>> I need to find back your comments for this part, I must have missed >>> them. That's >>> the last part I didn't work out yet on the current PEP revision. >> Well, if you can't find them, the EggFormats doc explains how these file/dir >> structures are currently laid out by setuptools, easy_install, pip, etc., >> and the PEP should probably reference that. > > work in progress > Is it Pythonic for the methods to starts with "get_", or should they be projects(), owners(), etc? From p.f.moore at gmail.com Tue May 19 21:33:50 2009 From: p.f.moore at gmail.com (Paul Moore) Date: Tue, 19 May 2009 20:33:50 +0100 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <94bdd2610905190704m5efdeb4dne1d559e9964331bd@mail.gmail.com> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <94bdd2610905190704m5efdeb4dne1d559e9964331bd@mail.gmail.com> Message-ID: <79990c6b0905191233p7565929ft4bcc90ea29e88b2f@mail.gmail.com> 2009/5/19 Tarek Ziad? : > On Sat, May 16, 2009 at 6:55 PM, P.J. Eby wrote: >> >> 1. Why ';' separation, instead of tabs as in PEP 262? ?Aren't semicolons a >> valid character in filenames? > > I am changing this into a . for now. I'm not following this thread at all, but can I put a strong vote *against* tabs in, please. You're just asking for bug reports from people who edit the file and expand tabs to spaces (either deliberately, or via an automatic editor setting they forgot about) and then can't see why a file that looks the same works differently. OK, so it's not meant to be a human editable file, but that won't stop some people :-) Paul From pje at telecommunity.com Tue May 19 22:36:40 2009 From: pje at telecommunity.com (P.J. Eby) Date: Tue, 19 May 2009 16:36:40 -0400 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <94bdd2610905190704m5efdeb4dne1d559e9964331bd@mail.gmail.co m> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <94bdd2610905190704m5efdeb4dne1d559e9964331bd@mail.gmail.com> Message-ID: <20090519203357.CE9C63A40D7@sparrow.telecommunity.com> At 04:04 PM 5/19/2009 +0200, Tarek Ziad? wrote: >On Sat, May 16, 2009 at 6:55 PM, P.J. Eby wrote: > > > > 1. Why ';' separation, instead of tabs as in PEP 262? Aren't semicolons a > > valid character in filenames? > >I am changing this into a . for now. > >What about Antoine's idea about doing a quote() on the names ? I like the CSV idea better, since the csv module is available in 2.3 and up. We should just pick a dialect with unambiguous quoting rules. 
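Something like this would be enough to pin the format down (a rough sketch only -- the RECORD file name and the (path, checksum, size) row layout are just illustrative here, not something the PEP specifies):

    import csv

    # Quote every field so a delimiter inside a file name can never be ambiguous.
    _CSV_OPTIONS = dict(delimiter=',', quotechar='"',
                        quoting=csv.QUOTE_ALL, lineterminator='\n')

    def write_record(record_path, entries):
        # entries: an iterable of (path, checksum, size) tuples
        f = open(record_path, 'wb')
        try:
            writer = csv.writer(f, **_CSV_OPTIONS)
            for entry in entries:
                writer.writerow(entry)
        finally:
            f.close()

    def read_record(record_path):
        f = open(record_path, 'rb')
        try:
            return [tuple(row) for row in csv.reader(f, **_CSV_OPTIONS)]
        finally:
            f.close()

Any third-party tool, in any language, can then read the file back with a stock CSV parser, which is rather the point.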
> From my point of view seems more simple to deal with, if 3rd-party >tools want to work on these files without using pkgutil or Python. True, but then CSV files are still pretty common. One other possibility that might work is using a vertical bar as a separator. My preference rank at the moment is probably tabs, CSV, or vertical bar. But I don't really care all that much, so let the people who care decide. Personally, though, I don't see much point to cross-language manipulation of the file. System packaging tools have their own way of keeping track of this stuff. So unless somebody's using it to *build* system packages (e.g. making an RPM builder), they don't need this. Now, about the APIs... > > 4. There should probably be a way to iterate over the projects in a > > directory, since it's otherwise impossible for an installation tool to find > > out what project(s) "own" a file that conflicts with something being > > installed. Alternatively, reshaping the file API to allow querying by path > > as well as by project might work. > >I am adding a "get_projects" api: > > get_projects() -> iterator > > Provides an iterator that will return (name, path) tuples, where `name` > is the name of a registered project and `path` the path to its `egg-info` > directory. > >But for the use case you are mentioning, what about an explicit API: > > get_owners(paths) -> sequence of project names > > returns a sequence of tuple. For each path in the "paths" list, a >tuple of project names > is returned > > > > > 5. If any cache mechanisms are to be used by the API, the API > *must* make it > > possible to bypass or explicitly manage that cache, as otherwise > > installation tools and tools that manipulate sys.path at runtime may end up > > using incorrect data. > >work in progress - (I am afraid I have to write an advanced prototype >to be able to know >exaclty how the cache might work, and so, what API we should have) I think it would be simpler to have explicit object types representing things like a directory, a collection of directories, and individual projects, and these object types should be part of the API. Any function-oriented API should just be exposed as the methods of a default singleton. Other Python modules follow this pattern -- and it's what I copied for the pkg_resources design. It gives a nice tradeoff between keeping the simple things simple, and complex things possible, as well as keeping mechanism and policy separate. Right now, the API design you're trying to do is being burdened by using strings and tuples to represent things that could just as easily be objects with their own methods, instead of things you have to pass back into other APIs. This also makes caching more complex, because you can't just have one main object with stuff hanging off; you've got to have a bunch of dictionaries, tuples, lists, sets, etc. From fuzzyman at voidspace.org.uk Wed May 20 00:48:42 2009 From: fuzzyman at voidspace.org.uk (Michael Foord) Date: Tue, 19 May 2009 23:48:42 +0100 Subject: [Python-Dev] IronPython specific code in inspect module Message-ID: <4A13374A.4060404@voidspace.org.uk> Hello all, The inspect module (inspect.get_argspec etc) work fine for Python functions and classes in IronPython, but they don't work on .NET types which don't have the Python function attributes like im_func etc. I have IronPython specific versions of several of these functions which use .NET reflection and inspect could fallback to if sys.platform == 'cli'. Would it be ok for me to add these to the inspect module? 
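The glue in inspect itself would be very small -- roughly along these lines (just a sketch: _argspec_from_reflection is a made-up name standing in for the .NET reflection code, not what my functions are actually called):

    import sys
    import inspect

    def _argspec_from_reflection(func):
        # Stand-in for the IronPython-specific code that walks the .NET
        # MethodInfo overloads; illustration only.
        raise NotImplementedError("needs IronPython / .NET reflection")

    def getargspec(func):
        try:
            # the existing pure-Python path
            return inspect.getargspec(func)
        except TypeError:
            if sys.platform == 'cli':
                return _argspec_from_reflection(func)
            raise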
Obviously the tests would only run on IronPython... The behaviour for CPython would be unaffected. All the best, Michael Foord -- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog From benjamin at python.org Wed May 20 03:26:47 2009 From: benjamin at python.org (Benjamin Peterson) Date: Tue, 19 May 2009 20:26:47 -0500 Subject: [Python-Dev] IronPython specific code in inspect module In-Reply-To: <4A13374A.4060404@voidspace.org.uk> References: <4A13374A.4060404@voidspace.org.uk> Message-ID: <1afaf6160905191826k61a97f31h3ebb1b9e31fe5258@mail.gmail.com> 2009/5/19 Michael Foord : > I have IronPython specific versions of several of these functions which use > .NET reflection and inspect could fallback to if sys.platform == 'cli'. > Would it be ok for me to add these to the inspect module? Obviously the > tests would only run on IronPython... The behaviour for CPython would be > unaffected. I wish we had more of a policy about this. There seems to be a long tradition of special casing other implementations in the stdlib. For example, see types.py and tests/test_support.py for remnants of Jython compatibility. However, I suspect this code has languished with out core-developers using the trunk stdlib with Jython. I suppose this is a good reason why we are going to split the stdlib out of the main repo. However that still leaves the question of how to handle putting code like this in. Should we ask that all code be implementation-independent as much as possible from the original authors? Do all all changes against the stdlib have to be run against several implementations? Should we sprinkle if switches all over the codebase for different implementations, or should new support files be added? -- Regards, Benjamin From fijall at gmail.com Wed May 20 04:09:03 2009 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 19 May 2009 20:09:03 -0600 Subject: [Python-Dev] IronPython specific code in inspect module In-Reply-To: <1afaf6160905191826k61a97f31h3ebb1b9e31fe5258@mail.gmail.com> References: <4A13374A.4060404@voidspace.org.uk> <1afaf6160905191826k61a97f31h3ebb1b9e31fe5258@mail.gmail.com> Message-ID: <693bc9ab0905191909y1cb7183dna364cb8f3ee30626@mail.gmail.com> On Tue, May 19, 2009 at 7:26 PM, Benjamin Peterson wrote: > 2009/5/19 Michael Foord : >> I have IronPython specific versions of several of these functions which use >> .NET reflection and inspect could fallback to if sys.platform == 'cli'. >> Would it be ok for me to add these to the inspect module? Obviously the >> tests would only run on IronPython... The behaviour for CPython would be >> unaffected. > > I wish we had more of a policy about this. There seems to be a long > tradition of special casing other implementations in the stdlib. For > example, see types.py and tests/test_support.py for remnants of Jython > compatibility. However, I suspect this code has languished with out > core-developers using the trunk stdlib with Jython. I suppose this is > a good reason why we are going to split the stdlib out of the main > repo. > > However that still leaves the question of how to handle putting code > like this in. Should we ask that all code be > implementation-independent as much as possible from the original > authors? Do all all changes against the stdlib have to be run against > several implementations? Should we sprinkle if switches all over the > codebase for different implementations, or should new support files be > added? > >From my observation (mostly according to jython), such changes easily get out of sync. 
The net result is that you have one, outdated, version in stdlib and other implementation, like IronPython is maintaining it's own anyway. IMO it's easy enough to maintain clearly implementation-specific parts out of cpython's stdlib. What I would rather like to see is that stdlib does not contain impl specific parts, even for cpython and cpython maintains it's own things outside of stdlib. This would be in line with what we discussed at pycon I think, please correct me if I'm wrong. Cheers, fijal From dstanek at dstanek.com Wed May 20 04:21:27 2009 From: dstanek at dstanek.com (David Stanek) Date: Tue, 19 May 2009 22:21:27 -0400 Subject: [Python-Dev] IronPython specific code in inspect module In-Reply-To: <1afaf6160905191826k61a97f31h3ebb1b9e31fe5258@mail.gmail.com> References: <4A13374A.4060404@voidspace.org.uk> <1afaf6160905191826k61a97f31h3ebb1b9e31fe5258@mail.gmail.com> Message-ID: On Tue, May 19, 2009 at 9:26 PM, Benjamin Peterson wrote: > 2009/5/19 Michael Foord : >> I have IronPython specific versions of several of these functions which use >> .NET reflection and inspect could fallback to if sys.platform == 'cli'. >> Would it be ok for me to add these to the inspect module? Obviously the >> tests would only run on IronPython... The behaviour for CPython would be >> unaffected. > > I wish we had more of a policy about this. There seems to be a long > tradition of special casing other implementations in the stdlib. For > example, see types.py and tests/test_support.py for remnants of Jython > compatibility. However, I suspect this code has languished with out > core-developers using the trunk stdlib with Jython. I suppose this is > a good reason why we are going to split the stdlib out of the main > repo. > > However that still leaves the question of how to handle putting code > like this in. Should we ask that all code be > implementation-independent as much as possible from the original > authors? Do all all changes against the stdlib have to be run against > several implementations? Should we sprinkle if switches all over the > codebase for different implementations, or should new support files be > added? > It seems that using a technique similar to dependency injection could provide some value. DI allows implementations conforming to some interface to be injected into a running application without the messy construction logic. The simple construction-by-hand pattern is to create the dependencies and pass them into the dependent objects. Frameworks build on top of this to allow the dependencies to be wired together without having any construction logic in code, like switch statements, to do the wiring. I think a similar pattern could be used in the standard library. When the interpreter goes through its normal bootstrapping process in can just execute a module provided by the vendor that specifies the platform specific implementations. Some defaults can be provided since Python already has a bunch of platform specific implementations. An over simplified design to make this happen may look like: 1. Create a simple configuration that allows a mapping of interfaces to implementations. This is where the vendor would say when using inspect you really should be using cli.inspect. 2. Add executing this new configuration to the bootstrapping process. 3. Add generic hooks into the library where needed to load the dependency instead of platform specific if statements. 4. 
Rip out the platform specific code that is hidden in the if statements and use that as the basis for the sane injected defaults. 5. Document the interfaces for each component that can be changed by the vendor. -- David blog: http://www.traceback.org twitter: http://twitter.com/dstanek From benjamin at python.org Wed May 20 04:26:55 2009 From: benjamin at python.org (Benjamin Peterson) Date: Tue, 19 May 2009 21:26:55 -0500 Subject: [Python-Dev] IronPython specific code in inspect module In-Reply-To: <693bc9ab0905191909y1cb7183dna364cb8f3ee30626@mail.gmail.com> References: <4A13374A.4060404@voidspace.org.uk> <1afaf6160905191826k61a97f31h3ebb1b9e31fe5258@mail.gmail.com> <693bc9ab0905191909y1cb7183dna364cb8f3ee30626@mail.gmail.com> Message-ID: <1afaf6160905191926l17e73e92ic29250d9407dd075@mail.gmail.com> 2009/5/19 Maciej Fijalkowski : > From my observation (mostly according to jython), such changes easily get out of > sync. The net result is that you have one, outdated, version in stdlib > and other implementation, like IronPython is maintaining it's own > anyway. IMO it's easy enough > to maintain clearly implementation-specific parts out of cpython's stdlib. Hopefully, it will be easier to visualize how this might work once the plan for hg migration is finalized. > > What I would rather like to see is that stdlib does not contain impl > specific parts, > even for cpython and cpython maintains it's own things outside of stdlib. This > would be in line with what we discussed at pycon I think, please correct me if > I'm wrong. I was not present, but that's my impression, too. -- Regards, Benjamin From dinov at microsoft.com Wed May 20 04:36:13 2009 From: dinov at microsoft.com (Dino Viehland) Date: Wed, 20 May 2009 02:36:13 +0000 Subject: [Python-Dev] IronPython specific code in inspect module In-Reply-To: <4A13374A.4060404@voidspace.org.uk> References: <4A13374A.4060404@voidspace.org.uk> Message-ID: <1A472770E042064698CB5ADC83A12ACD0287ED45@TK5EX14MBXC120.redmond.corp.microsoft.com> Michael Foord wrote: > I have IronPython specific versions of several of these functions which > use .NET reflection and inspect could fallback to if sys.platform == > 'cli'. Would it be ok for me to add these to the inspect module? > Obviously the tests would only run on IronPython... The behaviour for > CPython would be unaffected. What about instead defining __argspec__ for built-in functions/method objects and allowing all the implementations to implement it? We could all agree to return: [ (return_type, (arg_types,...)), (return_type, (arg_types,...)), ] Then inspect can check for that attribute and support introspection on built-ins. This would be an easy feature for us to implement and it may also be for Jython as well given that we both get the power of our platforms reflection capabilities. Any platform that implements it lights up w/o new platform specific code. 
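On the inspect side the consuming code would then be close to trivial -- something like this (sketch only: nothing defines __argspec__ today, and the helper name here is invented):

    import inspect

    def getbuiltinargspec(func):
        # __argspec__, if present, is the proposed list of
        # (return_type, (arg_types, ...)) tuples, one per overload.
        overloads = getattr(func, '__argspec__', None)
        if overloads is not None:
            return list(overloads)
        # otherwise fall back to the usual pure-Python introspection
        return inspect.getargspec(func)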
And maybe this needs to go to python-ideas now :)

From ajaksu at gmail.com Wed May 20 05:18:34 2009 From: ajaksu at gmail.com (Daniel Diniz) Date: Wed, 20 May 2009 00:18:34 -0300 Subject: [Python-Dev] IronPython specific code in inspect module In-Reply-To: <1A472770E042064698CB5ADC83A12ACD0287ED45@TK5EX14MBXC120.redmond.corp.microsoft.com> References: <4A13374A.4060404@voidspace.org.uk> <1A472770E042064698CB5ADC83A12ACD0287ED45@TK5EX14MBXC120.redmond.corp.microsoft.com> Message-ID: <2d75d7660905192018h209d27cfsc4a4e74ad9fd1ece@mail.gmail.com> Dino Viehland wrote: > What about instead defining __argspec__ for built-in functions/method > objects and allowing all the implementations to implement it? We could > all agree to return: > > [ > (return_type, (arg_types,...)), > (return_type, (arg_types,...)), > ] > > Then inspect can check for that attribute and support introspection on > built-ins. This would be an easy feature for us to implement and it > may also be for Jython as well given that we both get the power of our > platforms reflection capabilities. Any platform that implements it > lights up w/o new platform specific code. And maybe this needs to go > to python-ideas now :) Curiously, inspect limitations on CPython (can't inspect functools.partial, has issues with some descriptors and decorators) got us chatting about PEP 362: Function Signature Object[0] on #python-dev today. PEP 362 was also brought up in a recent thread where the executive summary was 'it just needs someone to guide it through the last steps'[1], and it would make this kind of introspection nice and clean[2]. It makes even more sense now we have PEP 3107: Function Annotations[3] in place. Cheers, Daniel [0] http://www.python.org/dev/peps/pep-0362/ [1] http://mail.python.org/pipermail/python-dev/2009-April/088517.html [2] http://mail.python.org/pipermail/python-dev/2009-April/088597.html [3] http://www.python.org/dev/peps/pep-3107/#parameters

From chrispl78 at yahoo.com Wed May 20 09:31:00 2009 From: chrispl78 at yahoo.com (Chris Plasun) Date: Wed, 20 May 2009 00:31:00 -0700 Subject: [Python-Dev] Python on PowerPC? Message-ID: <4A13B1B4.5090705@yahoo.com> Hi, I'm to develop console apps on a Linux embedded PowerPC board (Freescale MPC8313). Is there a Python release for the PowerPC platform? Thanks, Chris Plasun

From ziade.tarek at gmail.com Wed May 20 11:48:59 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Wed, 20 May 2009 11:48:59 +0200 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <20090519203357.CE9C63A40D7@sparrow.telecommunity.com> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> <94bdd2610905190704m5efdeb4dne1d559e9964331bd@mail.gmail.com> <20090519203357.CE9C63A40D7@sparrow.telecommunity.com> Message-ID: <94bdd2610905200248k79d15f2ahf5aa036ac1e7a80d@mail.gmail.com> On Tue, May 19, 2009 at 10:36 PM, P.J. Eby wrote: > > Now, about the APIs... > > I think it would be simpler to have explicit object types representing > things like a directory, a collection of directories, and individual > projects, and these object types should be part of the API. > > Any function-oriented API should just be exposed as the methods of a default > singleton. Other Python modules follow this pattern -- and it's what I > copied for the pkg_resources design. 
?It gives a nice tradeoff between > keeping the simple things simple, and complex things possible, as well as > keeping mechanism and policy separate. > > Right now, the API design you're trying to do is being burdened by using > strings and tuples to represent things that could just as easily be objects > with their own methods, instead of things you have to pass back into other > APIs. ?This also makes caching more complex, because you can't just have one > main object with stuff hanging off; you've got to have a bunch of > dictionaries, tuples, lists, sets, etc. I don't know how other people work on building APIs in PEPs, but at this stage I am unable to work them on the paper, without having a prototype to try things out. So I guess I'll start this prototype in bitbucket and come back with it for feedback in Distutils-SIG, for a new PEP 376 round. Tarek -- Tarek Ziad? | http://ziade.org From doug.hellmann at gmail.com Wed May 20 13:13:44 2009 From: doug.hellmann at gmail.com (Doug Hellmann) Date: Wed, 20 May 2009 07:13:44 -0400 Subject: [Python-Dev] IronPython specific code in inspect module In-Reply-To: References: <4A13374A.4060404@voidspace.org.uk> <1afaf6160905191826k61a97f31h3ebb1b9e31fe5258@mail.gmail.com> Message-ID: <54265478-74A9-4105-B4AD-828C795729DD@gmail.com> On May 19, 2009, at 10:21 PM, David Stanek wrote: > On Tue, May 19, 2009 at 9:26 PM, Benjamin Peterson > wrote: >> 2009/5/19 Michael Foord : >>> I have IronPython specific versions of several of these functions >>> which use >>> .NET reflection and inspect could fallback to if sys.platform == >>> 'cli'. >>> Would it be ok for me to add these to the inspect module? >>> Obviously the >>> tests would only run on IronPython... The behaviour for CPython >>> would be >>> unaffected. [...] >> However that still leaves the question of how to handle putting code >> like this in. Should we ask that all code be >> implementation-independent as much as possible from the original >> authors? Do all all changes against the stdlib have to be run against >> several implementations? Should we sprinkle if switches all over the >> codebase for different implementations, or should new support files >> be >> added? >> > > It seems that using a technique similar to dependency injection could > provide some value. DI allows implementations conforming to some > interface to be injected into a running application without the messy > construction logic. The simple construction-by-hand pattern is to > create the dependencies and pass them into the dependent objects. > Frameworks build on top of this to allow the dependencies to be wired > together without having any construction logic in code, like switch > statements, to do the wiring. > > I think a similar pattern could be used in the standard library. When > the interpreter goes through its normal bootstrapping process in can > just execute a module provided by the vendor that specifies the > platform specific implementations. Some defaults can be provided since > Python already has a bunch of platform specific implementations. > > An over simplified design to make this happen may look like: > 1. Create a simple configuration that allows a mapping of interfaces > to implementations. This is where the vendor would say when using > inspect you really should be using cli.inspect. That sounds like a plugin and the "strategy" pattern. Tarek is doing some work on providing a standard plugin mechanism as part of the work he's doing on distutils, isn't he? > 2. 
Add executing this new configuration to the bootstrapping process. Maybe I misunderstand, but wouldn't it make more sense to initialize the platform-specific parts of a module when it is imported rather than bring in everything at startup? Are we only worried about interpreter-implementation-level dependencies, or should there be a way for all platform-specific features to be treated in the same way? There are quite a few checks for Windows that could be moved into the platform-specific modules if there was an easy/standard way to do it. Doug > 3. Add generic hooks into the library where needed to load the > dependency instead of platform specific if statements. > 4. Rip out the platform specific code that is hidden in the if > statements and use that as the basis for the sane injected defaults. > 5. Document the interfaces for each component that can be changed by > the vendor. > > -- > David > blog: http://www.traceback.org > twitter: http://twitter.com/dstanek > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/doug.hellmann%40gmail.com From doug.hellmann at gmail.com Wed May 20 13:14:53 2009 From: doug.hellmann at gmail.com (Doug Hellmann) Date: Wed, 20 May 2009 07:14:53 -0400 Subject: [Python-Dev] Python on PowerPC? In-Reply-To: <4A13B1B4.5090705@yahoo.com> References: <4A13B1B4.5090705@yahoo.com> Message-ID: On May 20, 2009, at 3:31 AM, Chris Plasun wrote: > Hi, > > I'm to develop console apps on a Linux embedded PowerPC board > (Freescale MPC8313). > > Is there a Python release for the PowerPC platform? We used to run a version of the interpreter on PPC for a microcontroller board we had, but we built it ourselves. Have you tried building from source? It might Just Work. Doug From eckhardt at satorlaser.com Wed May 20 13:17:10 2009 From: eckhardt at satorlaser.com (Ulrich Eckhardt) Date: Wed, 20 May 2009 13:17:10 +0200 Subject: [Python-Dev] Python on PowerPC? In-Reply-To: <4A13B1B4.5090705@yahoo.com> References: <4A13B1B4.5090705@yahoo.com> Message-ID: <200905201317.10235.eckhardt@satorlaser.com> On Wednesday 20 May 2009, Chris Plasun wrote: > I'm to develop console apps on a Linux embedded PowerPC board (Freescale > MPC8313). > > Is there a Python release for the PowerPC platform? This has pretty little to do with the development of the Python language itself, so it is rather off topic here. That said, Linux systems are barely thinkable without Python, even when running on PPC, so yes, Python runs on PPC, too, and is included in probably every Linux distro, e.g. Debian. Uli -- Sator Laser GmbH Gesch?ftsf?hrer: Thorsten F?cking, Amtsgericht Hamburg HR B62 932 ************************************************************************************** Sator Laser GmbH, Fangdieckstra?e 75a, 22547 Hamburg, Deutschland Gesch?ftsf?hrer: Thorsten F?cking, Amtsgericht Hamburg HR B62 932 ************************************************************************************** Visit our website at ************************************************************************************** Diese E-Mail einschlie?lich s?mtlicher Anh?nge ist nur f?r den Adressaten bestimmt und kann vertrauliche Informationen enthalten. Bitte benachrichtigen Sie den Absender umgehend, falls Sie nicht der beabsichtigte Empf?nger sein sollten. 
Die E-Mail ist in diesem Fall zu l?schen und darf weder gelesen, weitergeleitet, ver?ffentlicht oder anderweitig benutzt werden. E-Mails k?nnen durch Dritte gelesen werden und Viren sowie nichtautorisierte ?nderungen enthalten. Sator Laser GmbH ist f?r diese Folgen nicht verantwortlich. ************************************************************************************** From dstanek at dstanek.com Wed May 20 13:54:56 2009 From: dstanek at dstanek.com (David Stanek) Date: Wed, 20 May 2009 07:54:56 -0400 Subject: [Python-Dev] IronPython specific code in inspect module In-Reply-To: <54265478-74A9-4105-B4AD-828C795729DD@gmail.com> References: <4A13374A.4060404@voidspace.org.uk> <1afaf6160905191826k61a97f31h3ebb1b9e31fe5258@mail.gmail.com> <54265478-74A9-4105-B4AD-828C795729DD@gmail.com> Message-ID: On Wed, May 20, 2009 at 7:13 AM, Doug Hellmann wrote: > > On May 19, 2009, at 10:21 PM, David Stanek wrote: > >> >> It seems that using a technique similar to dependency injection could >> provide some value. DI allows implementations conforming to some >> interface to be injected into a running application without the messy >> construction logic. The simple construction-by-hand pattern is to >> create the dependencies and pass them into the dependent objects. >> Frameworks build on top of this to allow the dependencies to be wired >> together without having any construction logic in code, like switch >> statements, to do the wiring. >> >> I think a similar pattern could be used in the standard library. When >> the interpreter goes through its normal bootstrapping process in can >> just execute a module provided by the vendor that specifies the >> platform specific implementations. Some defaults can be provided since >> Python already has a bunch of platform specific implementations. >> >> An over simplified design to make this happen may look like: >> 1. Create a simple configuration that allows a mapping of interfaces >> to implementations. This is where the vendor would say when using >> inspect you really should be using cli.inspect. > > That sounds like a plugin and the "strategy" pattern. ?Tarek is doing some > work on providing a standard plugin mechanism as part of the work he's doing > on distutils, isn't he? > Basically yes. What I proposed is more like a service locator with a pinch of DI. Where can I learn more about what Tarek is working on? Is there a branch somewhere? >> 2. Add executing this new configuration to the bootstrapping process. > > Maybe I misunderstand, but wouldn't it make more sense to initialize the > platform-specific parts of a module when it is imported rather than bring in > everything at startup? > By executing I mean figure out the mappings and necessarily create them. This enables errors to happen early if the dependencies are not met. This is really useful if the technique is used for more than just the platform specific code. > Are we only worried about interpreter-implementation-level dependencies, or > should there be a way for all platform-specific features to be treated in > the same way? ? There are quite a few checks for Windows that could be moved > into the platform-specific modules if there was an easy/standard way to do > it. 
> -- David blog: http://www.traceback.org twitter: http://twitter.com/dstanek From sven.schrader at gmail.com Wed May 20 14:31:18 2009 From: sven.schrader at gmail.com (Sven Schrader) Date: Wed, 20 May 2009 14:31:18 +0200 Subject: [Python-Dev] distutils.build_ext path comparison - python 2.5.2 Message-ID: <4A13F816.7050106@gmail.com> Hi, since our python installation is located on a symlink'ed directory, our variables "sys.exec_prefix" and "sys.executable" can have different paths. Therefore, the respective test in build_ext.py fails (line 202) and a wrong library directory is obtained. To fix this issue, I have attached a patch that uses "os.path.samefile" instead, to see whether two files are identical irrespective of its path. Greetings Sven Schrader ps: please CC answers to me, I'm not on the list :-) pps: I hope the attachment isn't inline... -------------- next part -------------- A non-text attachment was scrubbed... Name: python-2.5.2-build_ext-pathcompare.patch Type: text/x-patch Size: 1570 bytes Desc: not available URL: From seb.binet at gmail.com Wed May 20 15:33:43 2009 From: seb.binet at gmail.com (Sebastien Binet) Date: Wed, 20 May 2009 15:33:43 +0200 Subject: [Python-Dev] IronPython specific code in inspect module In-Reply-To: References: <4A13374A.4060404@voidspace.org.uk> <54265478-74A9-4105-B4AD-828C795729DD@gmail.com> Message-ID: <200905201533.43458.binet@cern.ch> On Wednesday 20 May 2009 13:54:56 David Stanek wrote: > On Wed, May 20, 2009 at 7:13 AM, Doug Hellmann wrote: > > On May 19, 2009, at 10:21 PM, David Stanek wrote: > >> It seems that using a technique similar to dependency injection could > >> provide some value. DI allows implementations conforming to some > >> interface to be injected into a running application without the messy > >> construction logic. The simple construction-by-hand pattern is to > >> create the dependencies and pass them into the dependent objects. > >> Frameworks build on top of this to allow the dependencies to be wired > >> together without having any construction logic in code, like switch > >> statements, to do the wiring. > >> > >> I think a similar pattern could be used in the standard library. When > >> the interpreter goes through its normal bootstrapping process in can > >> just execute a module provided by the vendor that specifies the > >> platform specific implementations. Some defaults can be provided since > >> Python already has a bunch of platform specific implementations. > >> > >> An over simplified design to make this happen may look like: > >> 1. Create a simple configuration that allows a mapping of interfaces > >> to implementations. This is where the vendor would say when using > >> inspect you really should be using cli.inspect. > > > > That sounds like a plugin and the "strategy" pattern. Tarek is doing > > some work on providing a standard plugin mechanism as part of the work > > he's doing on distutils, isn't he? > > Basically yes. What I proposed is more like a service locator with a > pinch of DI. Where can I learn more about what Tarek is working on? Is > there a branch somewhere? it is here: http://wiki.python.org/moin/Distutils/PluginSystem and there: http://pypi.python.org/pypi/extensions cheers, sebastien. -- ######################################### # Dr. 
Sebastien Binet # Laboratoire de l'Accelerateur Lineaire # Universite Paris-Sud XI # Batiment 200 # 91898 Orsay ######################################### From jyasskin at gmail.com Wed May 20 17:33:26 2009 From: jyasskin at gmail.com (Jeffrey Yasskin) Date: Wed, 20 May 2009 08:33:26 -0700 Subject: [Python-Dev] Documenting lnotab Message-ID: <5d44f72f0905200833m7615dc64od48c309e8bcee5d6@mail.gmail.com> Hi all. I've got a patch to add some documentation for lnotab and its use in tracing at http://bugs.python.org/issue6042. I think it's correct, but it's complicated so I'm looking for someone who was around when it was designed to check. I'm also proposing a change to the semantics of PyCode_CheckLineNumber and want to know whether I should consider it public. Thanks to anyone who takes a look! Jeffrey From chrispl78 at yahoo.com Wed May 20 17:47:50 2009 From: chrispl78 at yahoo.com (Chris Plasun) Date: Wed, 20 May 2009 08:47:50 -0700 Subject: [Python-Dev] Python on PowerPC? In-Reply-To: <200905201317.10235.eckhardt@satorlaser.com> References: <4A13B1B4.5090705@yahoo.com> <200905201317.10235.eckhardt@satorlaser.com> Message-ID: <4A142626.10502@yahoo.com> Thanks for your reply. Ulrich Eckhardt wrote: > On Wednesday 20 May 2009, Chris Plasun wrote: >> I'm to develop console apps on a Linux embedded PowerPC board (Freescale >> MPC8313). >> >> Is there a Python release for the PowerPC platform? > > This has pretty little to do with the development of the Python language > itself, so it is rather off topic here. This group appeared to be relevant. > That said, Linux systems are barely thinkable without Python, even when > running on PPC, so yes, Python runs on PPC, too, and is included in probably > every Linux distro, e.g. Debian. hmmm, hopefully I can find something to run in an embedded box. Thanks, Chris From jyasskin at gmail.com Wed May 20 18:40:42 2009 From: jyasskin at gmail.com (Jeffrey Yasskin) Date: Wed, 20 May 2009 09:40:42 -0700 Subject: [Python-Dev] [unladen-swallow] PEP 384: Defining a Stable ABI In-Reply-To: <4A107988.3020202@v.loewis.de> References: <4A107988.3020202@v.loewis.de> Message-ID: <5d44f72f0905200940k4a40f638j5637a6b79d075b0@mail.gmail.com> A couple thoughts: I'm with the people who think the refcount should be accessed through functions by apps that want ABI compatibility. In particular, GIL-removal efforts are guaranteed to change how the refcount is modified, but there's a good chance they wouldn't have to change the API. (We have some ideas for how to maintain source compatibility in the absence of a GIL: http://code.google.com/p/unladen-swallow/wiki/ExtensionModules#Reference_Counting) Over an 8-year lifetime for Python 3, Moore's law predicts that desktop systems will have up to 64 cores, at which point even the simplest GIL-removal strategy of making refcounts atomic will be a win, despite the 2x performance loss for a single thread. I wouldn't want an ABI to rule that out. I do think the refcounting macros should remain present in the API (not ABI) for apps that only need source compatibility and want the extra speed. I wonder if it makes sense to specify an API compatibility mode in this PEP too. "Py_LIMITED_API" may not be the right macro name?it didn't imply anything about an ABI when I first saw it. Might it make sense to use Py_ABI_COMPATIBILITY=### instead? (Where ### could be an ISO date like 20090520.) That would put "ABI" in the macro name and make it easier to define new versions later if necessary. 
(New versions would help people compile against a new version of Python and be confident they had something that would run against old versions.) If we never define a new version, defining it to a number instead of just anything doesn't really hurt. It's probably worth pointing out in the PEP that the fact that PyVarObject.ob_size is part of the ABI means that PyObject cannot change size, even by adding fields at the end. Right now, the globals representing types are defined like "PyAPI_DATA(PyTypeObject) PyList_Type;". To allow the core to use the new type creation functions, it might be useful to make the ABI type objects PyTypeObject* constants instead. In general, this looks really good. Thanks! Jeffrey On Sun, May 17, 2009 at 1:54 PM, "Martin v. L?wis" wrote: > > Thomas Wouters reminded me of a long-standing idea; I finally > found the time to write it down. > > Please comment! > > Regards, > Martin > > PEP: 384 > Title: Defining a Stable ABI > Version: $Revision: 72754 $ > Last-Modified: $Date: 2009-05-17 21:14:52 +0200 (So, 17. Mai 2009) $ > Author: Martin v. L?wis > Status: Draft > Type: Standards Track > Content-Type: text/x-rst > Created: 17-May-2009 > Python-Version: 3.2 > Post-History: > > Abstract > ======== > > Currently, each feature release introduces a new name for the > Python DLL on Windows, and may cause incompatibilities for extension > modules on Unix. This PEP proposes to define a stable set of API > functions which are guaranteed to be available for the lifetime > of Python 3, and which will also remain binary-compatible across > versions. Extension modules and applications embedding Python > can work with different feature releases as long as they restrict > themselves to this stable ABI. > > Rationale > ========= > > The primary source of ABI incompatibility are changes to the lay-out > of in-memory structures. For example, the way in which string interning > works, or the data type used to represent the size of an object, have > changed during the life of Python 2.x. As a consequence, extension > modules making direct access to fields of strings, lists, or tuples, > would break if their code is loaded into a newer version of the > interpreter without recompilation: offsets of other fields may have > changed, making the extension modules access the wrong data. > > In some cases, the incompatibilities only affect internal objects of > the interpreter, such as frame or code objects. For example, the way > line numbers are represented has changed in the 2.x lifetime, as has > the way in which local variables are stored (due to the introduction > of closures). Even though most applications probably never used these > objects, changing them had required to change the PYTHON_API_VERSION. > > On Linux, changes to the ABI are often not much of a problem: the > system will provide a default Python installation, and many extension > modules are already provided pre-compiled for that version. If additional > modules are needed, or additional Python versions, users can typically > compile them themselves on the system, resulting in modules that use > the right ABI. > > On Windows, multiple simultaneous installations of different Python > versions are common, and extension modules are compiled by their > authors, not by end users. To reduce the risk of ABI incompatibilities, > Python currently introduces a new DLL name pythonXY.dll for each > feature release, whether or not ABI incompatibilities actually exist. 
> > With this PEP, it will be possible to reduce the dependency of binary > extension modules on a specific Python feature release, and applications > embedding Python can be made work with different releases. > > Specification > ============= > > The ABI specification falls into two parts: an API specification, > specifying what function (groups) are available for use with the > ABI, and a linkage specification specifying what libraries to link > with. The actual ABI (layout of structures in memory, function > calling conventions) is not specified, but implied by the > compiler. As a recommendation, a specific ABI is recommended for > selected platforms. > > During evolution of Python, new ABI functions will be added. > Applications using them will then have a requirement on a minimum > version of Python; this PEP provides no mechanism for such > applications to fall back when the Python library is too old. > > Terminology > ----------- > > Applications and extension modules that want to use this ABI > are collectively referred to as "applications" from here on. > > Header Files and Preprocessor Definitions > ----------------------------------------- > > Applications shall only include the header file Python.h (before > including any system headers), or, optionally, include pyconfig.h, and > then Python.h. > > During the compilation of applications, the preprocessor macro > Py_LIMITED_API must be defined. Doing so will hide all definitions > that are not part of the ABI. > > Structures > ---------- > > Only the following structures and structure fields are accessible to > applications: > > - PyObject (ob_refcnt, ob_type) > - PyVarObject (ob_base, ob_size) > - Py_buffer (buf, obj, len, itemsize, readonly, ndim, shape, > ?strides, suboffsets, smalltable, internal) > - PyMethodDef (ml_name, ml_meth, ml_flags, ml_doc) > - PyMemberDef (name, type, offset, flags, doc) > - PyGetSetDef (name, get, set, doc, closure) > > The accessor macros to these fields (Py_REFCNT, Py_TYPE, Py_SIZE) > are also available to applications. > > The following types are available, but opaque (i.e. incomplete): > > - PyThreadState > - PyInterpreterState > > Type Objects > ------------ > > The structure of type objects is not available to applications; > declaration of "static" type objects is not possible anymore > (for applications using this ABI). > Instead, type objects get created dynamically. To allow an > easy creation of types (in particular, to be able to fill out > function pointers easily), the following structures and functions > are available:: > > ?typedef struct{ > ? ?int slot; ? ?/* slot id, see below */ > ? ?void *pfunc; /* function pointer */ > ?} PyType_Slot; > > ?struct{ > ? ?const char* name; > ? ?const char* doc; > ? ?int basicsize; > ? ?int itemsize; > ? ?int flags; > ? ?PyType_Slot *slots; /* terminated by slot==0. */ > ?} PyType_Spec; > > ?PyObject* PyType_FromSpec(PyType_Spec*); > > To specify a slot, a unique slot id must be provided. New Python > versions may introduce new slot ids, but slot ids will never be > recycled. Slots may get deprecated, but continue to be supported > throughout Python 3.x. > > The slot ids are named like the field names of the structures that > hold the pointers in Python 3.1, with an added ``Py_`` prefix (i.e. 
> Py_tp_dealloc instead of just tp_dealloc): > > - tp_dealloc, tp_print, tp_getattr, tp_setattr, tp_repr, > ?tp_hash, tp_call, tp_str, tp_getattro, tp_setattro, > ?tp_doc, tp_traverse, tp_clear, tp_richcompare, tp_iter, > ?tp_iternext, tp_methods, tp_base, tp_descr_set, tp_descr_set, > ?tp_init, tp_alloc, tp_new, tp_is_gc, tp_bases, tp_del > - nb_add nb_subtract nb_multiply nb_remainder nb_divmod nb_power > ?nb_negative nb_positive nb_absolute nb_bool nb_invert nb_lshift > ?nb_rshift nb_and nb_xor nb_or nb_int nb_float nb_inplace_add > ?nb_inplace_subtract nb_inplace_multiply nb_inplace_remainder > ?nb_inplace_power nb_inplace_lshift nb_inplace_rshift nb_inplace_and > ?nb_inplace_xor nb_inplace_or nb_floor_divide nb_true_divide > ?nb_inplace_floor_divide nb_inplace_true_divide nb_index > - sq_length sq_concat sq_repeat sq_item sq_ass_item was_sq_ass_slice > ?sq_contains sq_inplace_concat sq_inplace_repeat > - mp_length mp_subscript mp_ass_subscript > - bf_getbuffer bf_releasebuffer > > XXX Not supported yet: tp_weaklistoffset, tp_dictoffset > > The following fields cannot be set during type definition: > - tp_dict tp_mro tp_cache tp_subclasses tp_weaklist > > Functions and function-like Macros > ---------------------------------- > > All functions starting with _Py are not available to applications. > Also, all functions that expect parameter types that are unavailable > to applications are excluded from the ABI, such as PyAST_FromNode > (which expects a ``node*``). > > All other functions are available, unless excluded below. > > Function-like macros (in particular, field access macros) remain > available to applications, but get replaced by function calls > (unless their definition only refers to features of the ABI, such > as the various _Check macros) > > ABI function declarations will not change their parameters or return > types. If a change to the signature becomes necessary, a new function > will be introduced. If the new function is source-compatible (e.g. if > just the return type changes), an alias macro may get added to > redirect calls to the new function when the applications is > recompiled. > > If continued provision of the old function is not possible, it may get > deprecated, then removed, in accordance with PEP 7, causing > applications that use that function to break. > > Excluded Functions > ------------------ > > Functions declared in the following header files are not part > of the ABI: > - cellobject.h > - classobject.h > - code.h > - frameobject.h > - funcobject.h > - genobject.h > - pyarena.h > - pydebug.h > - symtable.h > - token.h > - traceback.h > > Global Variables > ---------------- > > Global variables representing types and exceptions are available > to applications. > XXX provide a complete list. > > XXX should restrict list of globals to truly "builtin" stuff, > excluding everything that can also be looked up through imports. > > XXX may specify access to predefined types and exceptions through > the interpreter state, with appropriate Get macros. > > Other Macros > ------------ > > All macros defining symbolic constants are available to applications; > the numeric values will not change. > > In addition, the following macros are available: > > - Py_BEGIN_ALLOW_THREADS, Py_BLOCK_THREADS, Py_UNBLOCK_THREADS, > ?Py_END_ALLOW_THREADS > > Linkage > ------- > > On Windows, applications shall link with python3.dll; an import > library python3.lib will be available. 
This DLL will redirect all of > its API functions through /export linker options to the full > interpreter DLL, i.e. python3y.dll. > > XXX is it possible to redirect global variables in the same way? > If not, python3.dll would have to copy them, and we should verify > that all available global variables are read-only. > > On Unix systems, the ABI is typically provided by the python > executable itself. PyModule_Create is changed to pass ``3`` as the API > version if the extension module was compiled with Py_LIMITED_API; the > version check for the API version will accept either 3 or the current > PYTHON_API_VERSION as conforming. If Python is compiled as a shared > library, it is installed as both libpython3.so, and libpython3.y.so; > applications conforming to this PEP should then link to the former. > > XXX is it possible to make the soname libpython.so.3, and still > have some applications link to libpython3.y.so? > > Implementation Strategy > ======================= > > This PEP will be implemented in a branch, allowing users to check > whether their modules conform to the ABI. To simplify this testing, an > additional macro Py_LIMITED_API_WITH_TYPES will expose the existing > type object layout, to let users postpone rewriting all types. When > the this branch is merged into the 3.2 code base, this macro will > be removed. > > Copyright > ========= > > This document has been placed in the public domain. > From jyasskin at gmail.com Wed May 20 18:49:34 2009 From: jyasskin at gmail.com (Jeffrey Yasskin) Date: Wed, 20 May 2009 09:49:34 -0700 Subject: [Python-Dev] [Fwd: Re: PEP 384: Defining a Stable ABI] In-Reply-To: <4A127766.8000101@resolversystems.com> References: <4A1141B4.4090608@voidspace.org.uk> <4A127766.8000101@resolversystems.com> Message-ID: <5d44f72f0905200949m2524f2d5t492baf24de6eed93@mail.gmail.com> On Tue, May 19, 2009 at 2:09 AM, William Reade wrote: > (for example, all(?) PyCxx modules use PyCode_New and > PyFrame_New to get nicer tracebacks) Specifically for this, I think it'd be nice to expose a function to do this directly. I recently added PyCode_NewEmpty (http://svn.python.org/view?view=rev&revision=72487) to go part of the way here. I didn't go farther because I didn't have a big enough picture. If most uses of PyFrame_New are really just to call into Python with a nice traceback, I think it'd be a good idea to add such a function to ceval.h next to PyEval_Call*(). We can only credibly tell people to use only the ABI functions when we have an ABI replacement for the (sane uses of) non-ABI calls. From theller at ctypes.org Wed May 20 18:52:46 2009 From: theller at ctypes.org (Thomas Heller) Date: Wed, 20 May 2009 18:52:46 +0200 Subject: [Python-Dev] Python on PowerPC? In-Reply-To: <4A142626.10502@yahoo.com> References: <4A13B1B4.5090705@yahoo.com> <200905201317.10235.eckhardt@satorlaser.com> <4A142626.10502@yahoo.com> Message-ID: Chris Plasun schrieb: > Thanks for your reply. > > Ulrich Eckhardt wrote: >> On Wednesday 20 May 2009, Chris Plasun wrote: >>> I'm to develop console apps on a Linux embedded PowerPC board (Freescale >>> MPC8313). >>> >>> Is there a Python release for the PowerPC platform? >> >> This has pretty little to do with the development of the Python language >> itself, so it is rather off topic here. > > This group appeared to be relevant. > >> That said, Linux systems are barely thinkable without Python, even when >> running on PPC, so yes, Python runs on PPC, too, and is included in probably >> every Linux distro, e.g. Debian. 
> > hmmm, hopefully I can find something to run in an embedded box. If you need to cross-compile, I have a build script and working patches to cross-build Python 2.6.2 for an ARM embedded system. Contact me by private mail if you want them. -- Thanks, Thomas From solipsis at pitrou.net Wed May 20 19:14:46 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 20 May 2009 17:14:46 +0000 (UTC) Subject: [Python-Dev] [unladen-swallow] PEP 384: Defining a Stable ABI References: <4A107988.3020202@v.loewis.de> <5d44f72f0905200940k4a40f638j5637a6b79d075b0@mail.gmail.com> Message-ID: Jeffrey Yasskin gmail.com> writes: > > Over an 8-year lifetime for Python 3, Moore's law predicts that > desktop systems will have up to 64 cores, at which point even the > simplest GIL-removal strategy of making refcounts atomic will be a > win, despite the 2x performance loss for a single thread. That's only if you think all workloads parallelize easily (and with little work from the average programmer), which sounds a bit presumptuous. When you have a GUI application and the perceived performance is driven by UI responsivity, spawning dozens of threads can little to improve the picture ("GUI application" here can mean a feature-rich Web application, too). As for desktop systems having 64 cores, that's unless the available die space gets used for something else instead, e.g. an integrated GPU. Or unless the desktop dies in favor of something else (e.g. laptops with small tightly integrated chips). The former is already in AMD's and Intel's plans. The latter could be happening right now. And we're not even talking about embedded platforms, or virtual machines where a 64-core server is partitioned into 64 "single-core" systems. (and then there's the whole threading vs processing debate ;-)) Endly, removing the GIL means you have to make all base types (especially containers) thread-safe without sacrificing their performance. I don't think it's just about reference-counting. That said, the Py_Incref() and Py_Decref() functions already exist. Perhaps they should be advertised a bit more in the documentation. The day a hypothetical Python implementation gets rid of reference-counting while remaining binary compatible with the rest of the API (which rules out PyPy), and gets much faster in the process, I think people will happily suffer a small recompile. Regards Antoine. From jyasskin at gmail.com Wed May 20 19:26:35 2009 From: jyasskin at gmail.com (Jeffrey Yasskin) Date: Wed, 20 May 2009 10:26:35 -0700 Subject: [Python-Dev] [unladen-swallow] PEP 384: Defining a Stable ABI In-Reply-To: References: <4A107988.3020202@v.loewis.de> <5d44f72f0905200940k4a40f638j5637a6b79d075b0@mail.gmail.com> Message-ID: <5d44f72f0905201026v7c9241f4jeaca9a8153a2fc7c@mail.gmail.com> On Wed, May 20, 2009 at 10:14 AM, Antoine Pitrou wrote: > Jeffrey Yasskin gmail.com> writes: >> >> Over an 8-year lifetime for Python 3, Moore's law predicts that >> desktop systems will have up to 64 cores, at which point even the >> simplest GIL-removal strategy of making refcounts atomic will be a >> win, despite the 2x performance loss for a single thread. > > That's only if you think all workloads parallelize easily (and with little work > from the average programmer), which sounds a bit presumptuous. When you have a > GUI application and the perceived performance is driven by UI responsivity, > spawning dozens of threads can little to improve the picture ("GUI application" > here can mean a feature-rich Web application, too). 
> > As for desktop systems having 64 cores, that's unless the available die space > gets used for something else instead, e.g. an integrated GPU. Or unless the > desktop dies in favor of something else (e.g. laptops with small tightly > integrated chips). The former is already in AMD's and Intel's plans. The latter > could be happening right now. > > And we're not even talking about embedded platforms, or virtual machines where a > 64-core server is partitioned into 64 "single-core" systems. > > (and then there's the whole threading vs processing debate ;-)) > > Endly, removing the GIL means you have to make all base types (especially > containers) thread-safe without sacrificing their performance. I don't think > it's just about reference-counting. > > > That said, the Py_Incref() and Py_Decref() functions already exist. Perhaps they > should be advertised a bit more in the documentation. The day a hypothetical > Python implementation gets rid of reference-counting while remaining binary > compatible with the rest of the API (which rules out PyPy), and gets much faster > in the process, I think people will happily suffer a small recompile. Sorry, I didn't mean to get into a GIL debate. All I'm saying is that I don't think changing the definition of Py_INCREF and Py_DECREF justifies going to Python 4.0, so I don't think their definitions should be part of the ABI. If that's not what the ABI means, that's ok too. From solipsis at pitrou.net Wed May 20 19:34:42 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Wed, 20 May 2009 17:34:42 +0000 (UTC) Subject: [Python-Dev] [unladen-swallow] PEP 384: Defining a Stable ABI References: <4A107988.3020202@v.loewis.de> <5d44f72f0905200940k4a40f638j5637a6b79d075b0@mail.gmail.com> <5d44f72f0905201026v7c9241f4jeaca9a8153a2fc7c@mail.gmail.com> Message-ID: Jeffrey Yasskin gmail.com> writes: > > Sorry, I didn't mean to get into a GIL debate. All I'm saying is that > I don't think changing the definition of Py_INCREF and Py_DECREF > justifies going to Python 4.0, so I don't think their definitions > should be part of the ABI. If that's not what the ABI means, that's ok > too. Consider, though, that if Py_INCREF and Py_DECREF are not part of the ABI, enabling the ABI-specific preprocessor symbol will hide them, which might (or might not!) annoy a lot of extension writers. (I don't know if there are extensions out there having reference count increments and decrements in their critical paths) Regards Antoine. From jyasskin at gmail.com Wed May 20 19:41:37 2009 From: jyasskin at gmail.com (Jeffrey Yasskin) Date: Wed, 20 May 2009 10:41:37 -0700 Subject: [Python-Dev] [unladen-swallow] PEP 384: Defining a Stable ABI In-Reply-To: References: <4A107988.3020202@v.loewis.de> <5d44f72f0905200940k4a40f638j5637a6b79d075b0@mail.gmail.com> <5d44f72f0905201026v7c9241f4jeaca9a8153a2fc7c@mail.gmail.com> Message-ID: <5d44f72f0905201041q146bb99dr2f27b47c10522eb6@mail.gmail.com> On Wed, May 20, 2009 at 10:34 AM, Antoine Pitrou wrote: > Jeffrey Yasskin gmail.com> writes: >> >> Sorry, I didn't mean to get into a GIL debate. All I'm saying is that >> I don't think changing the definition of Py_INCREF and Py_DECREF >> justifies going to Python 4.0, so I don't think their definitions >> should be part of the ABI. If that's not what the ABI means, that's ok >> too. > > Consider, though, that if Py_INCREF and Py_DECREF are not part of the ABI, > enabling the ABI-specific preprocessor symbol will hide them, which might (or > might not!) annoy a lot of extension writers. 
Yes, that's my intention. (Well, not the annoying part, but making them use Py_IncRef instead for ABI compatibility is, I think, a good thing.) If they don't want ABI compatibility, they shouldn't ask for it. Giving them something else useful to ask for is why I mentioned an API compatibility mode. To decrease the annoyance of having to change source code, we could have Py_INCREF(x) expand to Py_IncRef(x) in ABI-compatibility mode. From aahz at pythoncraft.com Wed May 20 21:34:22 2009 From: aahz at pythoncraft.com (Aahz) Date: Wed, 20 May 2009 12:34:22 -0700 Subject: [Python-Dev] distutils.build_ext path comparison - python 2.5.2 In-Reply-To: <4A13F816.7050106@gmail.com> References: <4A13F816.7050106@gmail.com> Message-ID: <20090520193422.GB29309@panix.com> On Wed, May 20, 2009, Sven Schrader wrote: > > since our python installation is located on a symlink'ed directory, > our variables "sys.exec_prefix" and "sys.executable" can have > different paths. Therefore, the respective test in build_ext.py fails > (line 202) and a wrong library directory is obtained. > > To fix this issue, I have attached a patch that uses > "os.path.samefile" instead, to see whether two files are identical > irrespective of its path. Please post this patch to bugs.python.org so it can be tracked. Note that Python 2.5 is now accepting only security patches, so please check whether 2.6 and trunk need it. -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ "A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines." --Ralph Waldo Emerson From ncoghlan at gmail.com Wed May 20 22:07:08 2009 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 21 May 2009 06:07:08 +1000 Subject: [Python-Dev] [unladen-swallow] PEP 384: Defining a Stable ABI In-Reply-To: <5d44f72f0905201041q146bb99dr2f27b47c10522eb6@mail.gmail.com> References: <4A107988.3020202@v.loewis.de> <5d44f72f0905200940k4a40f638j5637a6b79d075b0@mail.gmail.com> <5d44f72f0905201026v7c9241f4jeaca9a8153a2fc7c@mail.gmail.com> <5d44f72f0905201041q146bb99dr2f27b47c10522eb6@mail.gmail.com> Message-ID: <4A1462EC.9050007@gmail.com> Jeffrey Yasskin wrote: > Yes, that's my intention. (Well, not the annoying part, but making > them use Py_IncRef instead for ABI compatibility is, I think, a good > thing.) If they don't want ABI compatibility, they shouldn't ask for > it. Giving them something else useful to ask for is why I mentioned an > API compatibility mode. > > To decrease the annoyance of having to change source code, we could > have Py_INCREF(x) expand to Py_IncRef(x) in ABI-compatibility mode. Forcing developers to choose between the speed of the INCREF/DECREF macros and the proposed ABI compatibility mode for the benefit of an as yet hypothetical GIL-less CPython API implementation seems more like a way to kill adoption of the ABI compatibility mode rather than a way to encourage the use of the IncRef/Decref functions. The idea of allow an extension to explicitly version the stable ABI they're using with a macro like Py_ABI_VERSION is a good one though. I'd suggest using the Python version in hex (e.g. 0x020700 and 0x030200) rather than an ISO date though. That way an extension developer that wanted to ensure there code worked with a particular Python version and later could just define the right Py_ABI_VERSION rather than have to specifically compile against that earliest version. Cheers, Nick. 
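To sketch what that could look like from the extension author's side (a rough illustration only: Py_ABI_VERSION is the macro proposed in this thread, not an existing CPython symbol, and the 0x030200 value just reuses the example from the paragraph above):

    /* Hypothetical opt-in to the stable ABI as it stood in 3.2.
       Py_ABI_VERSION is the macro suggested in this thread, not a
       real CPython define; Py_LIMITED_API is the one from the PEP. */
    #define Py_ABI_VERSION 0x030200
    #define Py_LIMITED_API
    #include "Python.h"

Defining the version once, before including Python.h, would be all an extension needs to say "give me the ABI from this release onwards".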
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- From ncoghlan at gmail.com Wed May 20 22:10:48 2009 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 21 May 2009 06:10:48 +1000 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A107988.3020202@v.loewis.de> References: <4A107988.3020202@v.loewis.de> Message-ID: <4A1463C8.9090200@gmail.com> Martin v. L?wis wrote: > Functions and function-like Macros > ---------------------------------- > > All functions starting with _Py are not available to applications. > Also, all functions that expect parameter types that are unavailable > to applications are excluded from the ABI, such as PyAST_FromNode > (which expects a ``node*``). > > All other functions are available, unless excluded below. > > Function-like macros (in particular, field access macros) remain > available to applications, but get replaced by function calls > (unless their definition only refers to features of the ABI, such > as the various _Check macros) > > ABI function declarations will not change their parameters or return > types. If a change to the signature becomes necessary, a new function > will be introduced. If the new function is source-compatible (e.g. if > just the return type changes), an alias macro may get added to > redirect calls to the new function when the applications is > recompiled. > > If continued provision of the old function is not possible, it may get > deprecated, then removed, in accordance with PEP 7, causing > applications that use that function to break. Something I haven't seen explicitly mentioned as yet (in the PEP or the python-dev list discussion) are the memory management APIs and the FILE* APIs which can cause the MSVCRT versioning issues on Windows. Those would either need to be excluded from the stable ABI or else changed to use opaque pointers. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- From ziade.tarek at gmail.com Wed May 20 22:58:50 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Wed, 20 May 2009 22:58:50 +0200 Subject: [Python-Dev] distutils.build_ext path comparison - python 2.5.2 In-Reply-To: <4A13F816.7050106@gmail.com> References: <4A13F816.7050106@gmail.com> Message-ID: <94bdd2610905201358h712a5becm1ad59adf4ff6c683@mail.gmail.com> Hi Sven can you add an issue with your patch in http://bugs.python.org/ Thanks in advance Tarek On Wed, May 20, 2009 at 2:31 PM, Sven Schrader wrote: > Hi, > > since our python installation is located on a symlink'ed directory, > our variables "sys.exec_prefix" and "sys.executable" can have different > paths. Therefore, the respective test in build_ext.py fails (line 202) > and a wrong > library directory is obtained. > > To fix this issue, I have attached a patch that uses "os.path.samefile" > instead, > to see whether two files are identical irrespective of its path. > > > Greetings > > Sven Schrader > > > > ps: please CC answers to me, I'm not on the list :-) > pps: I hope the attachment isn't inline... > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/ziade.tarek%40gmail.com > > -- Tarek Ziad? 
| http://ziade.org From skip at pobox.com Wed May 20 22:59:32 2009 From: skip at pobox.com (skip at pobox.com) Date: Wed, 20 May 2009 15:59:32 -0500 Subject: [Python-Dev] [unladen-swallow] PEP 384: Defining a Stable ABI In-Reply-To: <4A1462EC.9050007@gmail.com> References: <4A107988.3020202@v.loewis.de> <5d44f72f0905200940k4a40f638j5637a6b79d075b0@mail.gmail.com> <5d44f72f0905201026v7c9241f4jeaca9a8153a2fc7c@mail.gmail.com> <5d44f72f0905201041q146bb99dr2f27b47c10522eb6@mail.gmail.com> <4A1462EC.9050007@gmail.com> Message-ID: <18964.28469.271.781515@montanaro.dyndns.org> Nick> Jeffrey Yasskin wrote: >> To decrease the annoyance of having to change source code, we could >> have Py_INCREF(x) expand to Py_IncRef(x) in ABI-compatibility mode. Nick> Forcing developers to choose between the speed of the Nick> INCREF/DECREF macros and the proposed ABI compatibility mode for Nick> the benefit of an as yet hypothetical GIL-less CPython API Nick> implementation seems more like a way to kill adoption of the ABI Nick> compatibility mode rather than a way to encourage the use of the Nick> IncRef/Decref functions. I suspect it's not really germane to this discussion but if the incref/decref functions were defined as inline would that effectively be like using the macro versions vis a vis ABI compatibility? Skip From benjamin at python.org Wed May 20 23:01:23 2009 From: benjamin at python.org (Benjamin Peterson) Date: Wed, 20 May 2009 16:01:23 -0500 Subject: [Python-Dev] [unladen-swallow] PEP 384: Defining a Stable ABI In-Reply-To: <18964.28469.271.781515@montanaro.dyndns.org> References: <4A107988.3020202@v.loewis.de> <5d44f72f0905200940k4a40f638j5637a6b79d075b0@mail.gmail.com> <5d44f72f0905201026v7c9241f4jeaca9a8153a2fc7c@mail.gmail.com> <5d44f72f0905201041q146bb99dr2f27b47c10522eb6@mail.gmail.com> <4A1462EC.9050007@gmail.com> <18964.28469.271.781515@montanaro.dyndns.org> Message-ID: <1afaf6160905201401o550b6a44u179c6a31a53d56bf@mail.gmail.com> 2009/5/20 : > I suspect it's not really germane to this discussion but if the > incref/decref functions were defined as inline would that effectively be > like using the macro versions vis a vis ABI compatibility? The code would be inlined into applications defeating the point of being able to change the implementation. :) -- Regards, Benjamin From stephen at xemacs.org Thu May 21 02:40:56 2009 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Thu, 21 May 2009 09:40:56 +0900 Subject: [Python-Dev] [unladen-swallow] PEP 384: Defining a Stable ABI In-Reply-To: <1afaf6160905201401o550b6a44u179c6a31a53d56bf@mail.gmail.com> References: <4A107988.3020202@v.loewis.de> <5d44f72f0905200940k4a40f638j5637a6b79d075b0@mail.gmail.com> <5d44f72f0905201026v7c9241f4jeaca9a8153a2fc7c@mail.gmail.com> <5d44f72f0905201041q146bb99dr2f27b47c10522eb6@mail.gmail.com> <4A1462EC.9050007@gmail.com> <18964.28469.271.781515@montanaro.dyndns.org> <1afaf6160905201401o550b6a44u179c6a31a53d56bf@mail.gmail.com> Message-ID: <87my97fftz.fsf@uwakimon.sk.tsukuba.ac.jp> Benjamin Peterson writes: > 2009/5/20 : > > > I suspect it's not really germane to this discussion but if the > > incref/decref functions were defined as inline would that effectively be > > like using the macro versions vis a vis ABI compatibility? > > The code would be inlined into applications defeating the point of > being able to change the implementation. :) Hang on, are you sure Skip isn't on to something? 
If the A*P*Is are defined in such way that by making them *function calls* they preserve A*B*I compatibility, while making them inline gives performance, then the user (in this case, I really mean the vendor of an application that contains C modules, I guess) can choose which route to go, right? I suppose that Python itself could be built with inlined code internally, but also provide the ABI (at a cost in size, of course). I don't know if this complexity is manageable or worth trying to manage, but isn't it conceivable that it could work? I guess that's for the advocates of extending the promise of ABI compatibility to these APIs to show, though. I don't need it myself. From foom at fuhm.net Thu May 21 02:48:01 2009 From: foom at fuhm.net (James Y Knight) Date: Wed, 20 May 2009 20:48:01 -0400 Subject: [Python-Dev] [unladen-swallow] PEP 384: Defining a Stable ABI In-Reply-To: <4A1462EC.9050007@gmail.com> References: <4A107988.3020202@v.loewis.de> <5d44f72f0905200940k4a40f638j5637a6b79d075b0@mail.gmail.com> <5d44f72f0905201026v7c9241f4jeaca9a8153a2fc7c@mail.gmail.com> <5d44f72f0905201041q146bb99dr2f27b47c10522eb6@mail.gmail.com> <4A1462EC.9050007@gmail.com> Message-ID: On May 20, 2009, at 4:07 PM, Nick Coghlan wrote: > Forcing developers to choose between the speed of the INCREF/DECREF > macros and the proposed ABI compatibility mode for the benefit of an > as > yet hypothetical GIL-less CPython API implementation seems more like a > way to kill adoption of the ABI compatibility mode rather than a way > to > encourage the use of the IncRef/Decref functions. Indeed, and if the promise of "no-ABI-breakages-till-4.0" is removed, this would be a non-issue. Keep Py_INCREF macros in the current ABI, and then break the ABI when someone wants to remove the GIL someday. That's certainly going to be a big enough change to justify changing the ABI. James From benjamin at python.org Thu May 21 03:48:53 2009 From: benjamin at python.org (Benjamin Peterson) Date: Wed, 20 May 2009 20:48:53 -0500 Subject: [Python-Dev] [unladen-swallow] PEP 384: Defining a Stable ABI In-Reply-To: <87my97fftz.fsf@uwakimon.sk.tsukuba.ac.jp> References: <4A107988.3020202@v.loewis.de> <5d44f72f0905200940k4a40f638j5637a6b79d075b0@mail.gmail.com> <5d44f72f0905201026v7c9241f4jeaca9a8153a2fc7c@mail.gmail.com> <5d44f72f0905201041q146bb99dr2f27b47c10522eb6@mail.gmail.com> <4A1462EC.9050007@gmail.com> <18964.28469.271.781515@montanaro.dyndns.org> <1afaf6160905201401o550b6a44u179c6a31a53d56bf@mail.gmail.com> <87my97fftz.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <1afaf6160905201848g13b943ddg62b6d2435786b55@mail.gmail.com> 2009/5/20 Stephen J. Turnbull : > Benjamin Peterson writes: > ?> 2009/5/20 ?: > ?> > ?> > I suspect it's not really germane to this discussion but if the > ?> > incref/decref functions were defined as inline would that effectively be > ?> > like using the macro versions vis a vis ABI compatibility? > ?> > ?> The code would be inlined into applications defeating the point of > ?> being able to change the implementation. :) > > Hang on, are you sure Skip isn't on to something? ?If the A*P*Is are > defined in such way that by making them *function calls* they preserve > A*B*I compatibility, while making them inline gives performance, then > the user (in this case, I really mean the vendor of an application > that contains C modules, I guess) can choose which route to go, right? In that case, they might as well be macros because changing would require recompiling. 
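To make the distinction concrete, here is a rough sketch of the two forms under discussion (deliberately simplified; the real CPython macros also do debug-build bookkeeping, and MY_INCREF is an invented name):

    /* Macro or inline form: the field access is compiled into the
       extension, so the extension binary bakes in the current object
       layout and refcounting scheme. */
    #define MY_INCREF(op) (((PyObject *)(op))->ob_refcnt++)

    /* Function form: only a call ends up in the extension; whatever
       interpreter is loaded at run time decides what an incref really
       does.  These already exist in the C API: */
    PyAPI_FUNC(void) Py_IncRef(PyObject *);
    PyAPI_FUNC(void) Py_DecRef(PyObject *);

An inline function behaves like the macro here: its body is compiled into the caller, so it buys back the speed but not the binary compatibility.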
-- Regards, Benjamin From william at resolversystems.com Fri May 22 12:33:02 2009 From: william at resolversystems.com (William Reade) Date: Fri, 22 May 2009 11:33:02 +0100 Subject: [Python-Dev] [Fwd: Re: PEP 384: Defining a Stable ABI] In-Reply-To: <4A127766.8000101@resolversystems.com> References: <4A1141B4.4090608@voidspace.org.uk> <4A127766.8000101@resolversystems.com> Message-ID: <4A167F5E.2050704@resolversystems.com> William Reade wrote: > 2) Since it hasn't always been in place, its introduction won't help > me in the short term: there are an awful lot of extension modules that > use excluded functions (for example, all(?) PyCxx modules use > PyCode_New and PyFrame_New to get nicer tracebacks), and I'll still > have to handle all these cases until everyone is up-to-date with > whatever version of Python this gets accepted into. It seems that where I should have said Pyrex, I actually said PyCxx. Sorry for the confusion. Thanks to Barry Scott for pointing it out. From status at bugs.python.org Fri May 22 18:06:56 2009 From: status at bugs.python.org (Python tracker) Date: Fri, 22 May 2009 18:06:56 +0200 (CEST) Subject: [Python-Dev] Summary of Python tracker Issues Message-ID: <20090522160656.341DA785EC@psf.upfronthosting.co.za> ACTIVITY SUMMARY (05/15/09 - 05/22/09) Python tracker at http://bugs.python.org/ To view or respond to any of the issues listed below, click on the issue number. Do NOT respond to this message. 2195 open (+35) / 15716 closed (+24) / 17911 total (+59) Open issues with patches: 863 Average duration of open issues: 650 days. Median duration of open issues: 400 days. Open Issues Breakdown open 2168 (+35) pending 27 ( +0) Issues Created Or Reopened (64) _______________________________ bsddb memory leak on ubuntu 05/18/09 CLOSED http://bugs.python.org/issue3541 reopened ajaksu2 idle pydoc et al removed from 3.1 without versioned replacements 05/22/09 http://bugs.python.org/issue5756 reopened nad Dict fails to notice addition and deletion of keys during iterat 05/16/09 CLOSED http://bugs.python.org/issue6017 reopened stevenjd documentation of xml.dom.minidom.parse signature is wrong 05/16/09 CLOSED http://bugs.python.org/issue6025 reopened phihag patch Interpreter crashes when chaining an infinite number of exceptio 05/15/09 http://bugs.python.org/issue6028 reopened amaury.forgeotdarc patch io.BufferedWriter C module missing _write_lock 05/15/09 CLOSED http://bugs.python.org/issue6030 created jroesslein BaseServer.shutdown documentation is incomplete 05/15/09 http://bugs.python.org/issue6031 created gagenellina Fix refleaks in test_urllib2_localnet 05/16/09 CLOSED http://bugs.python.org/issue6032 created collinwinter patch LOOKUP_METHOD and CALL_METHOD optimization 05/16/09 http://bugs.python.org/issue6033 created benjamin.peterson patch Fix object.__reversed__ doc 05/16/09 CLOSED http://bugs.python.org/issue6034 created tjreedy test_poplib Bus error with gcc-4.4 on OS X 05/16/09 CLOSED http://bugs.python.org/issue6035 created marketdickinson Clean up test_posixpath.py 05/16/09 http://bugs.python.org/issue6036 created phihag patch MutableSequence.__iadd__ should return self 05/16/09 CLOSED http://bugs.python.org/issue6037 created amarzal Should collections.Counter check for int? 
05/16/09 CLOSED http://bugs.python.org/issue6038 created hagen cygwin compilers should not check compiler versions 05/16/09 http://bugs.python.org/issue6039 created cdavid bdist_msi does not deal with pre-release version 05/16/09 http://bugs.python.org/issue6040 created cdavid change sdist and register command so they use check 05/16/09 CLOSED http://bugs.python.org/issue6041 created tarek Document and slightly simplify lnotab tracing 05/16/09 http://bugs.python.org/issue6042 created jyasskin patch, needs review HTMLParseError derivation 05/16/09 CLOSED http://bugs.python.org/issue6043 created bayerf Exception message in int() when trying to convert a complex 05/17/09 CLOSED http://bugs.python.org/issue6044 created aletornw Fix dbm interfaces 05/17/09 http://bugs.python.org/issue6045 created georg.brandl test_distutils.py fails on VC6(Windows) 05/17/09 CLOSED http://bugs.python.org/issue6046 created ocean-city patch "install" target in python 3.x makefile should be "fullinstall" 05/17/09 http://bugs.python.org/issue6047 created ronaldoussoren make distutils use the tarinfo command 05/17/09 http://bugs.python.org/issue6048 created tarek str.strip() and " behaviour expected? 05/17/09 CLOSED http://bugs.python.org/issue6049 created sholvar zipfile: Extracting a directory that already exists generates an 05/18/09 http://bugs.python.org/issue6050 created joe.amenta patch smtplib docs should link to email module examples 05/18/09 CLOSED http://bugs.python.org/issue6051 created guettli for-loop doesn't work with -c 05/18/09 CLOSED http://bugs.python.org/issue6052 created exe distutils error on windows 05/18/09 CLOSED http://bugs.python.org/issue6053 created ocean-city patch tarfile normalizes arcname 05/18/09 http://bugs.python.org/issue6054 created mkv References to "pysqlite" in documentation of sqlite3 should be c 05/18/09 CLOSED http://bugs.python.org/issue6055 created MLModel socket.setdefaulttimeout affecting multiprocessing Manager 05/18/09 http://bugs.python.org/issue6056 created ryles sqlite3 error classes should be documented 05/18/09 http://bugs.python.org/issue6057 created MLModel Add cp65001 to encodings/aliases.py 05/19/09 http://bugs.python.org/issue6058 created tzot patch uuid.uuid4 cause segfault in emesene 05/19/09 http://bugs.python.org/issue6059 created acevery PYTHONHOME should be more flexible (and controllable by --libdir 05/19/09 http://bugs.python.org/issue6060 created soundmurderer time.clock(): overflow in programs that run for very long 05/19/09 CLOSED http://bugs.python.org/issue6061 created tom65536 build_ext fails to build in the right directory using the packag 05/19/09 CLOSED http://bugs.python.org/issue6062 created tarek pydoc_data package is not installed 05/19/09 CLOSED http://bugs.python.org/issue6063 created ronaldoussoren patch, 26backport Add "daemon" argument to threading.Thread constructor 05/19/09 http://bugs.python.org/issue6064 created tebeka patch, easy bdist_msi.py failed assert when including extension modules 05/19/09 http://bugs.python.org/issue6065 created tim.golden patch POP_MARK was not in pickle protocol 0 05/20/09 CLOSED http://bugs.python.org/issue6066 created collinwinter patch, easy make error 05/20/09 http://bugs.python.org/issue6067 created gast support read/write c_ulonglong type bitfield structures 05/20/09 http://bugs.python.org/issue6068 created higstar casting error from ctypes array to structure 05/20/09 http://bugs.python.org/issue6069 created higstar Pyhon 2.6 makes .pyc/.pyo bytecode files executable 05/20/09 
http://bugs.python.org/issue6070 created phd no longer possible to hash arrays 05/20/09 http://bugs.python.org/issue6071 created exarkun unittest.TestCase._result is very likely to collide (and break) 05/20/09 CLOSED http://bugs.python.org/issue6072 created exarkun patch threading.Timer and gtk.main are not compatible 05/20/09 http://bugs.python.org/issue6073 created eric .pyc files created readonly if .py file is readonly, python won' 05/20/09 http://bugs.python.org/issue6074 created pdsimanyi Patch for IDLE/OS X to work with Tk-Cocoa 05/20/09 http://bugs.python.org/issue6075 created wordtech patch Missing title for configDialog.py 05/20/09 http://bugs.python.org/issue6076 created wordtech patch Unicode issue with tempfile on Windows 05/21/09 http://bugs.python.org/issue6077 created daniel.ugra freeze.py doesn't work 05/21/09 http://bugs.python.org/issue6078 created mzalokar SyntaxError in xmlrpc.client examples 05/21/09 http://bugs.python.org/issue6079 created thijs Itertools objects are missing "send" 05/21/09 CLOSED http://bugs.python.org/issue6080 created tebeka str.format_from_mapping() 05/21/09 http://bugs.python.org/issue6081 created rhettinger os.path.sameopenfile reports that standard streams are the same 05/22/09 CLOSED http://bugs.python.org/issue6082 reopened ryles Reference counting bug in setrlimit 05/22/09 http://bugs.python.org/issue6083 created billm patch documentation of zip function is error 05/22/09 CLOSED http://bugs.python.org/issue6084 created bones7456 Logging in BaseHTTPServer.BaseHTTPRequestHandler causes lag 05/22/09 http://bugs.python.org/issue6085 created aerodonkey Correct minor typos in doanddont.rst and urllib2.rst howto docum 05/22/09 CLOSED http://bugs.python.org/issue6086 created vshenoy patch distutils.sysconfig.get_python_lib gives surprising result when 05/22/09 http://bugs.python.org/issue6087 created vsajip easy Python3.0.1.1 is not available when system locale is zh_TW.eucTW 05/22/09 http://bugs.python.org/issue6088 created leeon Issues Now Closed (65) ______________________ weakref copy module interaction 456 days http://bugs.python.org/issue2116 pitrou patch os.listdir doc should mention that Unicode decoding can fail 366 days http://bugs.python.org/issue2856 georg.brandl sys.stdin.fileno() gives attribute error in IDLE 354 days http://bugs.python.org/issue3003 kbk Problem with invalidly-encoded command-line arguments (Unix) 351 days http://bugs.python.org/issue3023 benjamin.peterson incorrect comments for PyObject_ReleaseBuffer 315 days http://bugs.python.org/issue3293 pitrou Py_WIN_WIDE_FILENAMES removal 282 days http://bugs.python.org/issue3527 ocean-city patch bsddb memory leak on ubuntu 4 days http://bugs.python.org/issue3541 jcea remove not decodable environment variables 214 days http://bugs.python.org/issue4126 loewis patch 3 tutorial documentation errors 210 days http://bugs.python.org/issue4144 georg.brandl Running Python 2.6 GUI on Windows Vista 202 days http://bugs.python.org/issue4215 georg.brandl Missing make altframeworkinstall for Mac OS X 165 days http://bugs.python.org/issue4554 ronaldoussoren 2.6.1 breaks many applications that embed Python on Windows 163 days http://bugs.python.org/issue4566 chrisyco patch, needs review Setting font from preference dialog in IDLE on OS X broken 97 days http://bugs.python.org/issue5232 kbk patch OS X Installer: add options to specify universal build type and 93 days http://bugs.python.org/issue5269 ronaldoussoren Scanner class in re module undocumented 87 days http://bugs.python.org/issue5337 
rhettinger add a new command called "check" into Distutils 37 days http://bugs.python.org/issue5732 tarek OS X Installer: new make of documentation installs at wrong loca 34 days http://bugs.python.org/issue5769 ronaldoussoren len(reversed( 28 days http://bugs.python.org/issue5786 marketdickinson float('1e500') -> inf, complex('1e500') -> ValueError 26 days http://bugs.python.org/issue5829 marketdickinson patch, easy Better documentation of use of BROWSER environment variable 12 days http://bugs.python.org/issue5935 georg.brandl Problems with dbm documentation 12 days http://bugs.python.org/issue5937 georg.brandl Ambiguity in dbm.open flag documentation 14 days http://bugs.python.org/issue5942 georg.brandl email.message : get_payload args's documentation is confusing 10 days http://bugs.python.org/issue5951 georg.brandl test_distutils fails for Python 3.1b1 on MacOS X 9 days http://bugs.python.org/issue5956 nad WeakSet cmp methods 11 days http://bugs.python.org/issue5964 pitrou patch, needs review Add bug tracker tasks to PEP 101 8 days http://bugs.python.org/issue5980 georg.brandl patch Broken link to "Curses Programming with Python" 6 days http://bugs.python.org/issue5987 georg.brandl Add __bool__ to threading.Event and multiprocessing.Event 4 days http://bugs.python.org/issue5998 benjamin.peterson patch test_urllib2_localnet DigestAuthHandler leaks nonces 7 days http://bugs.python.org/issue6002 collinwinter easy optparse docs say 'default' keyword is deprecated but uses it in 3 days http://bugs.python.org/issue6009 georg.brandl Dict fails to notice addition and deletion of keys during iterat 1 days http://bugs.python.org/issue6017 georg.brandl Fix the output word from "ok" to "OK" when a testcase passes 3 days http://bugs.python.org/issue6018 benjamin.peterson test_distutils leaves a 'foo' file behind in the cwd 5 days http://bugs.python.org/issue6022 r.david.murray Search does not intelligently handle module.function queries on 3 days http://bugs.python.org/issue6023 georg.brandl documentation of xml.dom.minidom.parse signature is wrong 0 days http://bugs.python.org/issue6025 georg.brandl patch io.BufferedWriter C module missing _write_lock 0 days http://bugs.python.org/issue6030 pitrou Fix refleaks in test_urllib2_localnet 3 days http://bugs.python.org/issue6032 collinwinter patch Fix object.__reversed__ doc 0 days http://bugs.python.org/issue6034 georg.brandl test_poplib Bus error with gcc-4.4 on OS X 0 days http://bugs.python.org/issue6035 marketdickinson MutableSequence.__iadd__ should return self 2 days http://bugs.python.org/issue6037 rhettinger Should collections.Counter check for int? 1 days http://bugs.python.org/issue6038 rhettinger change sdist and register command so they use check 0 days http://bugs.python.org/issue6041 tarek HTMLParseError derivation 0 days http://bugs.python.org/issue6043 benjamin.peterson Exception message in int() when trying to convert a complex 0 days http://bugs.python.org/issue6044 marketdickinson test_distutils.py fails on VC6(Windows) 1 days http://bugs.python.org/issue6046 tarek patch str.strip() and " behaviour expected? 
0 days http://bugs.python.org/issue6049 loewis smtplib docs should link to email module examples 2 days http://bugs.python.org/issue6051 georg.brandl for-loop doesn't work with -c 0 days http://bugs.python.org/issue6052 r.david.murray distutils error on windows 0 days http://bugs.python.org/issue6053 loewis patch References to "pysqlite" in documentation of sqlite3 should be c 2 days http://bugs.python.org/issue6055 georg.brandl time.clock(): overflow in programs that run for very long 0 days http://bugs.python.org/issue6061 tom65536 build_ext fails to build in the right directory using the packag 0 days http://bugs.python.org/issue6062 tarek pydoc_data package is not installed 0 days http://bugs.python.org/issue6063 georg.brandl patch, 26backport POP_MARK was not in pickle protocol 0 1 days http://bugs.python.org/issue6066 collinwinter patch, easy unittest.TestCase._result is very likely to collide (and break) 1 days http://bugs.python.org/issue6072 michael.foord patch Itertools objects are missing "send" 0 days http://bugs.python.org/issue6080 rhettinger os.path.sameopenfile reports that standard streams are the same 0 days http://bugs.python.org/issue6082 ryles documentation of zip function is error 0 days http://bugs.python.org/issue6084 georg.brandl Correct minor typos in doanddont.rst and urllib2.rst howto docum 0 days http://bugs.python.org/issue6086 georg.brandl patch http libraries throw errors internally in BitTorrent 1884 days http://bugs.python.org/issue920573 rhettinger Documentation for Descriptors in the main docs 1809 days http://bugs.python.org/issue966625 rhettinger pdb unable to jump to first statement 785 days http://bugs.python.org/issue1689458 jyasskin patch, needs review Failure to build on AIX 5.3 773 days http://bugs.python.org/issue1694442 ajaksu2 syslog syscall support for SysLogLogger 749 days http://bugs.python.org/issue1711603 dandrzejewski patch help() can't find right source file 702 days http://bugs.python.org/issue1738179 pitrou patch, easy Top Issues Most Discussed (10) ______________________________ 11 test_distutils.py fails on VC6(Windows) 1 days closed http://bugs.python.org/issue6046 8 Interpreter crashes when chaining an infinite number of excepti 7 days open http://bugs.python.org/issue6028 8 Dict fails to notice addition and deletion of keys during itera 1 days closed http://bugs.python.org/issue6017 7 test_distutils fails for Python 3.1b1 on MacOS X 9 days closed http://bugs.python.org/issue5956 7 urllib/urllib2: HTTPS over (Squid) Proxy fails 1203 days open http://bugs.python.org/issue1424152 6 Enhanced cPython profiler with high-resolution timer 436 days open http://bugs.python.org/issue2281 5 distutils error on windows 0 days closed http://bugs.python.org/issue6053 5 Fix dbm interfaces 5 days open http://bugs.python.org/issue6045 5 Fix refleaks in test_urllib2_localnet 3 days closed http://bugs.python.org/issue6032 5 Embedding into a shared library fails 177 days open http://bugs.python.org/issue4434 From ziade.tarek at gmail.com Fri May 22 18:27:01 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Fri, 22 May 2009 18:27:01 +0200 Subject: [Python-Dev] PEP 376 : Changing the .egg-info structure In-Reply-To: <94bdd2610905200248k79d15f2ahf5aa036ac1e7a80d@mail.gmail.com> References: <94bdd2610905141521i57727416q21f7fb13b1bdd077@mail.gmail.com> <20090514230740.D9C8A3A4061@sparrow.telecommunity.com> <94bdd2610905160906g7b4b03a1m81a7aa8c99e89968@mail.gmail.com> <20090516165302.C71643A4061@sparrow.telecommunity.com> 
<94bdd2610905190704m5efdeb4dne1d559e9964331bd@mail.gmail.com> <20090519203357.CE9C63A40D7@sparrow.telecommunity.com> <94bdd2610905200248k79d15f2ahf5aa036ac1e7a80d@mail.gmail.com> Message-ID: <94bdd2610905220927h25d58259r365b057ab56c334f@mail.gmail.com> On Wed, May 20, 2009 at 11:48 AM, Tarek Ziad? wrote: > So I guess I'll start this prototype in bitbucket and come back with it for feedback > in Distutils-SIG, for a new PEP 376 round. Ok so FYI, I moved the discussion here: http://mail.python.org/pipermail/distutils-sig/2009-May/011933.html Regards Tarek From jimjjewett at gmail.com Fri May 22 18:46:57 2009 From: jimjjewett at gmail.com (Jim Jewett) Date: Fri, 22 May 2009 12:46:57 -0400 Subject: [Python-Dev] PEP 384: Defining a Stable ABI Message-ID: Martin v. L?wis wrote: > - PyGetSetDef (name, get, set, doc, closure) Is it fully decided that the generally-unused closure parameter will stay until python 4? > The accessor macros to these fields (Py_REFCNT, Py_TYPE, Py_SIZE) > are also available to applications. There have been several experiments in memory management, ranging from not bothering to change the refcount on permanent objects like None, to proxying objects across multiple threads or processes. I also believe (but don't remember for sure) that some of the proposed Unicode (or String?) optimizations changed the memory layout a bit. So far, these have all been complicated (or slow) enough that they didn't get integrated, but if it ever happens ... I don't think it would justify python 4.0 > New Python > versions may introduce new slot ids, but slot ids will never be > recycled. Slots may get deprecated, but continue to be supported > throughout Python 3.x. Weren't there already a few ready for deprecation? Do you really want to commit to them forever? Even if you aren't willing to settle for less than "3.x from now on", it might make sense to at least start with 3.2, rather than 3.0. -jJ From solipsis at pitrou.net Fri May 22 19:00:00 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Fri, 22 May 2009 17:00:00 +0000 (UTC) Subject: [Python-Dev] PEP 384: Defining a Stable ABI References: Message-ID: Jim Jewett gmail.com> writes: > > > The accessor macros to these fields (Py_REFCNT, Py_TYPE, Py_SIZE) > > are also available to applications. > > There have been several experiments in memory management, ranging from > not bothering to change the refcount on permanent objects like None, > to proxying objects across multiple threads or processes. These experiments don't seem to have been very successful, have they? Besides, Py_TYPE is a fundamental property of every PyObject. On the other hand, I think Py_SIZE should be discouraged in favour of the type-specific variants (PyString_GET_SIZE, etc.), since some types have their own way of (ab)using the size field. > I also > believe (but don't remember for sure) that some of the proposed > Unicode (or String?) optimizations changed the memory layout a bit. The one Unicode optimization I know of, in http://bugs.python.org/issue1943, is suspended because of Marc-Andre's opposition. In any case, it doesn't touch the fundamental PyObject layout. Regards Antoine. 
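A small aside to illustrate the size-field point above (an editorial sketch; PyLongObject is one concrete type that reuses ob_size, folding the sign of the number into it):

    /* Py_SIZE() exposes the raw ob_size field, which is not always a
       plain length: for ints it is negative when the value is negative.
       The type-specific accessors hide that detail. */
    #include "Python.h"

    static Py_ssize_t
    int_digit_count(PyObject *op)      /* op assumed to point to a PyLong */
    {
        Py_ssize_t n = Py_SIZE(op);    /* may be negative */
        return n < 0 ? -n : n;
    }
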
From martin at v.loewis.de Fri May 22 21:47:33 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 22 May 2009 21:47:33 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A1463C8.9090200@gmail.com> References: <4A107988.3020202@v.loewis.de> <4A1463C8.9090200@gmail.com> Message-ID: <4A170155.6010402@v.loewis.de> > Something I haven't seen explicitly mentioned as yet (in the PEP or the > python-dev list discussion) are the memory management APIs and the FILE* > APIs which can cause the MSVCRT versioning issues on Windows. > > Those would either need to be excluded from the stable ABI or else > changed to use opaque pointers. Good point. As a separate issue, I would actually like to deprecate, then remove these APIs. I had originally hoped that this would happen for 3.0 already, alas, nobody worked on it. In any case, I have removed them from the ABI now. I haven't thought about the Windows CRT issue yet. I can see that there would be still problems even without that, e.g. when you do setlocale in Python, it might not affect the extension module, etc. How would you propose to deal with that? One approach would to fix the CRT version for Windows, for the lifetime of 3.x. Another approach could be to document the known restrictions, and otherwise declare "use at your own risk". Regards, Martin From dalcinl at gmail.com Sat May 23 02:50:33 2009 From: dalcinl at gmail.com (Lisandro Dalcin) Date: Fri, 22 May 2009 21:50:33 -0300 Subject: [Python-Dev] PEP 384: a request for PyType_Slot Message-ID: Martin, a small request. Any chance you consider defining PyType_Slot like below? typedef struct{ int slot; /* slot id, see below */ void *pdata; /* data pointer */ void (*pfunc)(void); /* function pointer */ } PyType_Slot Or perhaps other way? Just to avoid compilers complaining about the illegal conversion between pointers to data and pointers to functions... It would be really annoying being force to do type-punning using an union in order to get "correct" C code... -- Lisandro Dalc?n --------------- Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) PTLC - G?emes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594 From aahz at pythoncraft.com Sat May 23 18:55:52 2009 From: aahz at pythoncraft.com (Aahz) Date: Sat, 23 May 2009 09:55:52 -0700 Subject: [Python-Dev] FWD: FTP URLs for Python source Message-ID: <20090523165552.GA6818@panix.com> Yes, this is ancient, I've been putting off dealing with it because I couldn't figure out who should handle it. At this point, I think that if anyone does it should be the release team, therefore I'm forwarding to python-dev. Feel free to tell me I made the wrong choice. ;-) ----- Forwarded message from "Douglas W. Goodall" ----- > From: "Douglas W. Goodall" > To: webmaster at python.org > Subject: made too hard... > Date: Mon, 16 Feb 2009 05:57:15 -0800 > > Dear Sir, > > I am not sure why, but you have made it harder than it has to be to > fetch the python source for installation on a unix system such as > OpenBSD. > > I had to use the command line ftp client and it took a lot of time to > discover the real > URL of the download file. > > Here is what ended up working. > > ftp http://www.e you made it this hard on purpose. 
Yes, it is easy if > you > are using a web browser, but if you are on a unix system without X > it is a pain to get it when you don't know how. > > You might want to add the ftp URL to the web page for people like me. > > Respectfully, > > Doug > > --- > Douglas W. Goodall > 425 San Juanico Street > Santa Maria, CA 93455 > (805) 598-9099 > http://www.goodall.com > > I call on each of us to pray for our president. > He is who we have for the next four years, > and we need him to be successful for all of > us. God Bless America, and the President. ----- End forwarded message ----- -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ "A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines." --Ralph Waldo Emerson From martin at v.loewis.de Sat May 23 22:44:53 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Sat, 23 May 2009 22:44:53 +0200 Subject: [Python-Dev] FWD: FTP URLs for Python source In-Reply-To: <20090523165552.GA6818@panix.com> References: <20090523165552.GA6818@panix.com> Message-ID: <4A186045.5070107@v.loewis.de> Aahz wrote: > Yes, this is ancient, I've been putting off dealing with it because I > couldn't figure out who should handle it. At this point, I think that if > anyone does it should be the release team, therefore I'm forwarding to > python-dev. Feel free to tell me I made the wrong choice. ;-) I don't think it needs any action, except perhaps a half-polite response that we don't intend to change anything. a) if you are really sitting on the console of an OpenBSD system with no X installed, use lynx, or any other text browser: scroll down to "Source distribution", hit Enter b) alternatively, and even better: don't build Python from source at all. Instead, use pkg_add to install the Python version that you want, downloadable from ftp.openbsd.org/pub/OpenBSD//packages//python-.tgz c) OTOH, if you had only connected to the OpenBSD system remotely (e.g. through ssh), just use your local web browser, to either * determine the full source download URL of the Python release you want to build, then wget on the target system, or * if your target system doesn't have wget, download it locally, then scp/rcp/ftp it to the target system. We cannot add an FTP URL to the download page, because we don't run an ftp server anymore, and don't plan to. [I don't quite get the "Here is what ended up working" part. What is http://www.e?] Regards, Martin From hasan.diwan at gmail.com Sun May 24 02:48:02 2009 From: hasan.diwan at gmail.com (Hasan Diwan) Date: Sat, 23 May 2009 17:48:02 -0700 Subject: [Python-Dev] FWD: FTP URLs for Python source In-Reply-To: <4A186045.5070107@v.loewis.de> References: <20090523165552.GA6818@panix.com> <4A186045.5070107@v.loewis.de> Message-ID: <2cda2fc90905231748x8db04fbi1a7d0e68d99c718e@mail.gmail.com> > Aahz wrote: >> Yes, this is ancient, I've been putting off dealing with it because I >> couldn't figure out who should handle it. ?At this point, I think that if >> anyone does it should be the release team, therefore I'm forwarding to >> python-dev. ?Feel free to tell me I made the wrong choice. ?;-) Regarding OpenBSD, what's the problem with just using the port -- the 2.6 version seems to work fine. 
-- Sent from my mobile device From aahz at pythoncraft.com Sun May 24 10:54:51 2009 From: aahz at pythoncraft.com (Aahz) Date: Sun, 24 May 2009 01:54:51 -0700 Subject: [Python-Dev] FWD: FTP URLs for Python source In-Reply-To: <4A186045.5070107@v.loewis.de> References: <20090523165552.GA6818@panix.com> <4A186045.5070107@v.loewis.de> Message-ID: <20090524085451.GA19579@panix.com> On Sat, May 23, 2009, "Martin v. L?wis" wrote: > > We cannot add an FTP URL to the download page, because we don't > run an ftp server anymore, and don't plan to. That's the critical bit. At this point, I don't think anything else needs doing. -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ "A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines." --Ralph Waldo Emerson From andymac at bullseye.apana.org.au Sun May 24 12:34:07 2009 From: andymac at bullseye.apana.org.au (Andrew MacIntyre) Date: Sun, 24 May 2009 20:34:07 +1000 Subject: [Python-Dev] FWD: FTP URLs for Python source In-Reply-To: <4A186045.5070107@v.loewis.de> References: <20090523165552.GA6818@panix.com> <4A186045.5070107@v.loewis.de> Message-ID: <4A19229F.8080203@bullseye.andymac.org> Martin v. L?wis wrote: > * if your target system doesn't have wget, download it locally, > then scp/rcp/ftp it to the target system. All of [Free|Net|Open|Dragonfly]BSD have ftp clients that can also retrieve HTTP URLs, though I guess many wouldn't think of that... -- ------------------------------------------------------------------------- Andrew I MacIntyre "These thoughts are mine alone..." E-mail: andymac at bullseye.apana.org.au (pref) | Snail: PO Box 370 andymac at pcug.org.au (alt) | Belconnen ACT 2616 Web: http://www.andymac.org/ | Australia From charles.r.mccreary at gmail.com Sun May 24 15:20:56 2009 From: charles.r.mccreary at gmail.com (Charles McCreary) Date: Sun, 24 May 2009 08:20:56 -0500 Subject: [Python-Dev] Introducing GSOC student James Pruitt Message-ID: <8294397f0905240620q262a5ea7le3722cc5eb401899@mail.gmail.com> I am a mentor for a GSOC 2009 student working on a PSF project. His project abstract is "Handling of subprocess async io issues, testing and reimplementing the commands module in terms of subprocess." He has started a blog, http://subdev.blogspot.com/, in which he is providing general information on his GSOC project. In the next few days, he will start a project on google code so that interested parties can help guide his work. I urge anyone interested in the subprocess module to interact with Mr. Pruitt and provide feedback/suggestions/encouragement. Charles R. McCreary -------------- next part -------------- An HTML attachment was scrubbed... URL: From benjamin at python.org Mon May 25 14:51:01 2009 From: benjamin at python.org (Benjamin Peterson) Date: Mon, 25 May 2009 07:51:01 -0500 Subject: [Python-Dev] python-checkins is down Message-ID: <1afaf6160905250551y1100f29dp228a35e1942744f4@mail.gmail.com> I haven't gotten emails for any of the commits I've done in the last 12 hours or so. 
-- Regards, Benjamin From aahz at pythoncraft.com Mon May 25 15:44:37 2009 From: aahz at pythoncraft.com (Aahz) Date: Mon, 25 May 2009 06:44:37 -0700 Subject: [Python-Dev] python-checkins is down In-Reply-To: <1afaf6160905250551y1100f29dp228a35e1942744f4@mail.gmail.com> References: <1afaf6160905250551y1100f29dp228a35e1942744f4@mail.gmail.com> Message-ID: <20090525134437.GB16887@panix.com> On Mon, May 25, 2009, Benjamin Peterson wrote: > > I haven't gotten emails for any of the commits I've done in the last > 12 hours or so. Forwarded to postmaster at python.org -- if there's a problem with the checkins process itself, that won't help. Have you verified that the commits are landing? (I.e. is svn working properly?) Also, if you could double-check the python-checkins archives to see whether it's just you not getting the messages, that would help. -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ "A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines." --Ralph Waldo Emerson From solipsis at pitrou.net Mon May 25 15:49:50 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Mon, 25 May 2009 13:49:50 +0000 (UTC) Subject: [Python-Dev] python-checkins is down References: <1afaf6160905250551y1100f29dp228a35e1942744f4@mail.gmail.com> <20090525134437.GB16887@panix.com> Message-ID: Aahz pythoncraft.com> writes: > > Forwarded to postmaster python.org -- if there's a problem with the > checkins process itself, that won't help. Have you verified that the > commits are landing? (I.e. is svn working properly?) Yes, it is. > Also, if you > could double-check the python-checkins archives to see whether it's just > you not getting the messages, that would help. The messages aren't in the archives either. cheers Antoine. From mal at egenix.com Mon May 25 19:41:54 2009 From: mal at egenix.com (M.-A. Lemburg) Date: Mon, 25 May 2009 19:41:54 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A107988.3020202@v.loewis.de> References: <4A107988.3020202@v.loewis.de> Message-ID: <4A1AD862.6090100@egenix.com> Martin v. L?wis wrote: > Thomas Wouters reminded me of a long-standing idea; I finally > found the time to write it down. > > Please comment! > ... > Up until this PEP proposal, we had a very simple scheme for the Python C-API: all documented functions and variables with a "Py" prefix were part of the C-API, everything else was not and could change between releases (in particular the private "_Py" prefix APIs). Changing the published APIs was considered a bad thing in the 2.x development process and generally required a good reason to get supported. Changing private functions or ones that were not documented was generally never a big problem. Now, with the PEP, I have a feeling that the Python C-API will in effect be limited to what's in the PEP's idea of a usable ABI and open up the non-inluded public C-APIs to the same rate of change as the private APIs. If that's the case, the PEP should be discussed on the C-API list first, in order to identify a complete list of APIs that is supposed to define the Python C-API. Ideally, all other APIs would then need to be made private. However, I doubt that this is possible before switching to Python 4.0. Then again, I'm not sure whether that's what you're aiming for... An optional cross-version ABI would certainly be a good thing. Limiting the Python C-API would be counterproductive. 
> During the compilation of applications, the preprocessor macro > Py_LIMITED_API must be defined. Doing so will hide all definitions > that are not part of the ABI. So extensions wanting to use the full Python C-API as documented in the C-API docs will still be able to do this, right ? > Type Objects > ------------ > > The structure of type objects is not available to applications; > declaration of "static" type objects is not possible anymore > (for applications using this ABI). Hmm, that's going to create big problems for extensions that want to expose a C-API for their types: Type checks are normally done by pointer comparison using those static type objects. > Functions and function-like Macros > ---------------------------------- > > Function-like macros (in particular, field access macros) remain > available to applications, but get replaced by function calls > (unless their definition only refers to features of the ABI, such > as the various _Check macros) Including Py_INCREF()/Py_DECREF() ? > Excluded Functions > ------------------ > > Functions declared in the following header files are not part > of the ABI: > - cellobject.h > - classobject.h > - code.h > - frameobject.h > - funcobject.h > - genobject.h > - pyarena.h > - pydebug.h > - symtable.h > - token.h > - traceback.h I don't think that's feasable: you basically remove all introspection functions that way. This will need a more fine-grained approach. > Linkage > ------- > > On Windows, applications shall link with python3.dll; You mean: extensions that were compiled with Py_LIMITED_API, right ? > an import > library python3.lib will be available. This DLL will redirect all of > its API functions through /export linker options to the full > interpreter DLL, i.e. python3y.dll. What if you mix extensions that use the full C-API with ones that restrict themselves to the limited version ? Would creating a Python object in a full-API extension and free'ing it in a limited-API extension cause problems ? > Implementation Strategy > ======================= > > This PEP will be implemented in a branch, allowing users to check > whether their modules conform to the ABI. To simplify this testing, an > additional macro Py_LIMITED_API_WITH_TYPES will expose the existing > type object layout, to let users postpone rewriting all types. When > the this branch is merged into the 3.2 code base, this macro will > be removed. Now I'm confused again: this sounds a lot like you do want all extension writers to only use the limited API. [And in another post] >> Something I haven't seen explicitly mentioned as yet (in the PEP or the >> > python-dev list discussion) are the memory management APIs and the FILE* >> > APIs which can cause the MSVCRT versioning issues on Windows. >> > >> > Those would either need to be excluded from the stable ABI or else >> > changed to use opaque pointers. > > Good point. As a separate issue, I would actually like to deprecate, > then remove these APIs. I had originally hoped that this would happen > for 3.0 already, alas, nobody worked on it. > > In any case, I have removed them from the ABI now. How do you expect Python extensions to allocate memory and objects in a platform independent way without those APIs ? And as an aside: Which API families are you referring to ? PyMem_Malloc, PyObject_Malloc, or PyObject_New ? Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 25 2009) >>> Python/Zope Consulting and Support ... 
http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2009-06-29: EuroPython 2009, Birmingham, UK 34 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From ncoghlan at gmail.com Mon May 25 23:04:58 2009 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 26 May 2009 07:04:58 +1000 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A1AD862.6090100@egenix.com> References: <4A107988.3020202@v.loewis.de> <4A1AD862.6090100@egenix.com> Message-ID: <4A1B07FA.6010509@gmail.com> M.-A. Lemburg wrote: > Now, with the PEP, I have a feeling that the Python C-API > will in effect be limited to what's in the PEP's idea of > a usable ABI and open up the non-inluded public C-APIs > to the same rate of change as the private APIs. Not really - before this PEP it was already fairly easy to write an extension that was source-level compatible with multiple versions of Python (depending on exactly what you wanted to do, of course). However, it is essentially impossible to make an extension that is binary level compatible with multiple versions. With the defined stable ABI in place, each extension module author will be able to make a choice: - choose binary compatibility by limiting themselves to the stable ABI and be able to provide a single binary that will still work with later versions of Py3k - stick with source compatibility and continue to provide new binaries for each version of Python > An optional cross-version ABI would certainly be a good thing. > > Limiting the Python C-API would be counterproductive. I don't think anyone would disagree with that. A discussion on C-API sig would certainly be a good idea. >> During the compilation of applications, the preprocessor macro >> Py_LIMITED_API must be defined. Doing so will hide all definitions >> that are not part of the ABI. > > So extensions wanting to use the full Python C-API as documented > in the C-API docs will still be able to do this, right ? Yep - they just wouldn't define the new macro. >> Type Objects >> ------------ >> >> The structure of type objects is not available to applications; >> declaration of "static" type objects is not possible anymore >> (for applications using this ABI). > > Hmm, that's going to create big problems for extensions that > want to expose a C-API for their types: Type checks are normally > done by pointer comparison using those static type objects. They would just have to expose "MyExtensionPrefix_MyType_Check" and "MyExtensionPrefix_MyType_CheckExact" functions the same way that types in the C API do. >> Functions and function-like Macros >> ---------------------------------- >> >> Function-like macros (in particular, field access macros) remain >> available to applications, but get replaced by function calls >> (unless their definition only refers to features of the ABI, such >> as the various _Check macros) > > Including Py_INCREF()/Py_DECREF() ? I believe so - MvL deliberately left the fields that the ref counting relies on as part of the ABI. 
>> Excluded Functions >> ------------------ >> >> Functions declared in the following header files are not part >> of the ABI: >> - cellobject.h >> - classobject.h >> - code.h >> - frameobject.h >> - funcobject.h >> - genobject.h >> - pyarena.h >> - pydebug.h >> - symtable.h >> - token.h >> - traceback.h > > I don't think that's feasable: you basically remove all introspection > functions that way. > > This will need a more fine-grained approach. I don't think it is reasonable to expect the introspection interfaces to remain stable at a binary level across versions. Having "I want deep introspection support from C" and "I want to use a single binary for multiple Python versions" be mutually exclusive choices sounds like a perfectly sensible position to me. Also, keep in mind that even an extension module that restricts itself to Py_LIMITED_API would still be able to call in to the Python equivalents via PyObject_Call and friends (e.g. by importing and using the inspect and traceback modules). > What if you mix extensions that use the full C-API with ones > that restrict themselves to the limited version ? > > Would creating a Python object in a full-API extension and > free'ing it in a limited-API extension cause problems ? Possibly, if you end up mixing C runtimes in the process. Specifically: 1. Python linked with MSVCRT X 2. Full extension module linked with MSVCRT Y 3. Limited extension module linked with MSVCRT Z The PyMem/PyObject APIs in the limited extension module will use the heap in MSVCRT X, since they will be redirected through the Python stable ABI as function calls. However, if the full extension module uses the macro forms and links with the wrong MSVCRT version, then you have the usual opportunities for conflicts between the two C runtimes. This isn't a problem created by defining a stable ABI though - it's the main reason mixing C runtimes is a bad idea. (The two others we have noted so far being IO issues, especially attempting to share FILE* instances and the fact that changing the locale will only affect whichever runtime the extension module linked against). >> Good point. As a separate issue, I would actually like to deprecate, >> then remove these APIs. I had originally hoped that this would happen >> for 3.0 already, alas, nobody worked on it. >> >> In any case, I have removed them from the ABI now. > > How do you expect Python extensions to allocate memory and objects > in a platform independent way without those APIs ? > > And as an aside: Which API families are you referring to ? PyMem_Malloc, > PyObject_Malloc, or PyObject_New ? The ones with a FILE* parameter in the signature. There's no problem with the PyMem/PyObject functions since those will be redirected to consistently use the version of the C runtime that Python was originally linked against (their macro counterparts are obviously off limits for the stable ABI). Cheers, Nick. 
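To make Nick's last point concrete: a module restricted to the stable ABI can still reach the Python-level introspection machinery through the abstract object calls. A rough sketch, with error handling trimmed; whether every call used below lands inside the final ABI is an assumption.

    #define Py_LIMITED_API
    #include "Python.h"

    /* Print the pending C-level exception by delegating to the pure
     * Python traceback module instead of calling PyTraceBack_Print(). */
    static void
    print_pending_exception(void)
    {
        PyObject *type, *value, *tb, *mod, *result;

        if (!PyErr_Occurred())
            return;

        PyErr_Fetch(&type, &value, &tb);
        PyErr_NormalizeException(&type, &value, &tb);

        mod = PyImport_ImportModule("traceback");
        if (mod != NULL) {
            result = PyObject_CallMethod(mod, "print_exception", "OOO",
                                         type  ? type  : Py_None,
                                         value ? value : Py_None,
                                         tb    ? tb    : Py_None);
            Py_XDECREF(result);
            Py_DECREF(mod);
        }
        Py_XDECREF(type);
        Py_XDECREF(value);
        Py_XDECREF(tb);
    }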
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- From aahz at pythoncraft.com Mon May 25 23:54:16 2009 From: aahz at pythoncraft.com (Aahz) Date: Mon, 25 May 2009 14:54:16 -0700 Subject: [Python-Dev] FWD: python-checkins is down Message-ID: <20090525215416.GA19344@panix.com> ----- Forwarded message from Ralf Hildebrandt ----- > Date: Mon, 25 May 2009 21:59:32 +0200 > From: Ralf Hildebrandt > To: Patrick Ben Koetter > Cc: Aahz , postmaster at python.org > Subject: Re: FWD: Re: [Python-Dev] python-checkins is down > > * Patrick Ben Koetter : >> This just hit python-checkins at python.org: >> >> May 25 20:50:33 albatross postfix/local[12976]: A029ED5FF: to=, orig_to=, relay=local, delay=0.17, delays=0.09/0/0/0.08, dsn=2.0.0, status=sent (delivered to command: /usr/local/mailman/mail/mailman post python-checkins) >> >> Looks like the list itself is online and can be reached. >> >> I didn't read the whole thread (deleted part of it already). >> If that isn't the problem, what should I look for then? > > I let all the mails through and set the senders to the "may send > although they're not members" > > -- > Ralf Hildebrandt > Gesch?ftsbereich IT | Abteilung Netzwerk > Charit? - Universit?tsmedizin Berlin > Campus Benjamin Franklin > Hindenburgdamm 30 | D-12200 Berlin > Tel. +49 30 450 570 155 | Fax: +49 30 450 570 962 > Ralf.Hildebrandt at charite.de | http://www.charite.de ----- End forwarded message ----- -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ "A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines." --Ralph Waldo Emerson From google at mrabarnett.plus.com Tue May 26 01:50:58 2009 From: google at mrabarnett.plus.com (MRAB) Date: Tue, 26 May 2009 00:50:58 +0100 Subject: [Python-Dev] Arguments of MatchObject in re module Message-ID: <4A1B2EE2.3030107@mrabarnett.plus.com> I've just noticed an oddity of the re module while looking at the sources. I'll illustrate it below: >>> import re >>> p = re.compile("foo") >>> help(p.match) Help on built-in function match: match(...) match(string[, pos[, endpos]]) --> match object or None. Matches zero or more characters at the beginning of the string >>> p.match(string="foo") Traceback (most recent call last): File "", line 1, in p.match(string="foo") TypeError: Required argument 'pattern' (pos 1) not found >>> The name of the first argument should be "string", yet it's "pattern". Does anyone know if it's anything other than a mistake? Should it be fixed in the next version of the re module, or are we just stuck with it (and should just change the docstring to match)? From martin at v.loewis.de Tue May 26 08:59:51 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 26 May 2009 08:59:51 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A1AD862.6090100@egenix.com> References: <4A107988.3020202@v.loewis.de> <4A1AD862.6090100@egenix.com> Message-ID: <4A1B9367.7090808@v.loewis.de> > Now, with the PEP, I have a feeling that the Python C-API > will in effect be limited to what's in the PEP's idea of > a usable ABI and open up the non-inluded public C-APIs > to the same rate of change as the private APIs. That's certainly not the plan. Instead, the plan is to have a stable ABI. The policy on the API isn't affected, except for restricting changes to the API that would break the ABI. 
>> During the compilation of applications, the preprocessor macro >> Py_LIMITED_API must be defined. Doing so will hide all definitions >> that are not part of the ABI. > > So extensions wanting to use the full Python C-API as documented > in the C-API docs will still be able to do this, right ? Correct. They would link to the version-specific DLL on Windows. >> The structure of type objects is not available to applications; >> declaration of "static" type objects is not possible anymore >> (for applications using this ABI). > > Hmm, that's going to create big problems for extensions that > want to expose a C-API for their types: Type checks are normally > done by pointer comparison using those static type objects. I don't see the problem. During module initialization, you create the type object and store it in a global variable, and then both clients and the module compare against the stored pointer. >> Function-like macros (in particular, field access macros) remain >> available to applications, but get replaced by function calls >> (unless their definition only refers to features of the ABI, such >> as the various _Check macros) > > Including Py_INCREF()/Py_DECREF() ? Yes, although some people are requesting that these become functions. >> Excluded Functions >> ------------------ >> >> Functions declared in the following header files are not part >> of the ABI: >> - cellobject.h >> - classobject.h >> - code.h >> - frameobject.h >> - funcobject.h >> - genobject.h >> - pyarena.h >> - pydebug.h >> - symtable.h >> - token.h >> - traceback.h > > I don't think that's feasable: you basically remove all introspection > functions that way. > > This will need a more fine-grained approach. What specifically is it that you want to do in a module that you couldn't do anymore? >> On Windows, applications shall link with python3.dll; > > You mean: extensions that were compiled with Py_LIMITED_API, right ? Correct, see "Terminology" in the PEP. > >> an import >> library python3.lib will be available. This DLL will redirect all of >> its API functions through /export linker options to the full >> interpreter DLL, i.e. python3y.dll. > > What if you mix extensions that use the full C-API with ones > that restrict themselves to the limited version ? Some link against python3.dll, others against python32.dll (say). > Would creating a Python object in a full-API extension and > free'ing it in a limited-API extension cause problems ? No problem that I can see. >> This PEP will be implemented in a branch, allowing users to check >> whether their modules conform to the ABI. To simplify this testing, an >> additional macro Py_LIMITED_API_WITH_TYPES will expose the existing >> type object layout, to let users postpone rewriting all types. When >> the this branch is merged into the 3.2 code base, this macro will >> be removed. > > Now I'm confused again: this sounds a lot like you do want all extension > writers to only use the limited API. I certainly want to support as many modules as reasonable with the PEP. Whether or not developers then chose to build version-independent binaries is certainly outside the scope of the PEP - it only specifies action items for Python, not for application authors. >>> Something I haven't seen explicitly mentioned as yet (in the PEP or the >>>> python-dev list discussion) are the memory management APIs and the FILE* >>>> APIs which can cause the MSVCRT versioning issues on Windows. 
>>>> >>>> Those would either need to be excluded from the stable ABI or else >>>> changed to use opaque pointers. >> Good point. As a separate issue, I would actually like to deprecate, >> then remove these APIs. I had originally hoped that this would happen >> for 3.0 already, alas, nobody worked on it. >> >> In any case, I have removed them from the ABI now. > > How do you expect Python extensions to allocate memory and objects > in a platform independent way without those APIs ? I have only removed functions from the ABI that have FILE* in their signatures. > And as an aside: Which API families are you referring to ? PyMem_Malloc, > PyObject_Malloc, or PyObject_New ? Neither. PyRun_AnyFileFlags and friends. Regards, Martin From mal at egenix.com Tue May 26 18:28:59 2009 From: mal at egenix.com (M.-A. Lemburg) Date: Tue, 26 May 2009 18:28:59 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A1B07FA.6010509@gmail.com> References: <4A107988.3020202@v.loewis.de> <4A1AD862.6090100@egenix.com> <4A1B07FA.6010509@gmail.com> Message-ID: <4A1C18CB.6040208@egenix.com> Nick Coghlan wrote: > M.-A. Lemburg wrote: >> Now, with the PEP, I have a feeling that the Python C-API >> will in effect be limited to what's in the PEP's idea of >> a usable ABI and open up the non-inluded public C-APIs >> to the same rate of change as the private APIs. > > Not really - before this PEP it was already fairly easy to write an > extension that was source-level compatible with multiple versions of > Python (depending on exactly what you wanted to do, of course). Right and I hope that things stay that way. > However, it is essentially impossible to make an extension that is > binary level compatible with multiple versions. On Windows, yes. On Unix, this often worked, even though it wasn't always safe to do. In practice it's usually better to recompile extensions for every single release. > With the defined stable ABI in place, each extension module author will > be able to make a choice: > - choose binary compatibility by limiting themselves to the stable ABI > and be able to provide a single binary that will still work with later > versions of Py3k > - stick with source compatibility and continue to provide new binaries > for each version of Python Great ! >> An optional cross-version ABI would certainly be a good thing. >> >> Limiting the Python C-API would be counterproductive. > > I don't think anyone would disagree with that. A discussion on C-API sig > would certainly be a good idea. > >>> During the compilation of applications, the preprocessor macro >>> Py_LIMITED_API must be defined. Doing so will hide all definitions >>> that are not part of the ABI. >> So extensions wanting to use the full Python C-API as documented >> in the C-API docs will still be able to do this, right ? > > Yep - they just wouldn't define the new macro. Good ! >>> Type Objects >>> ------------ >>> >>> The structure of type objects is not available to applications; >>> declaration of "static" type objects is not possible anymore >>> (for applications using this ABI). >> Hmm, that's going to create big problems for extensions that >> want to expose a C-API for their types: Type checks are normally >> done by pointer comparison using those static type objects. > > They would just have to expose "MyExtensionPrefix_MyType_Check" and > "MyExtensionPrefix_MyType_CheckExact" functions the same way that types > in the C API do. Hmm, that's a function call per type check and will slow things down a lot, esp. 
when working with APIs that deal a lot with these objects. The typical way to implement these type checks is via a simple pointer comparison (falling back to a function for sub-types). That's cheap and fast. >>> Functions and function-like Macros >>> ---------------------------------- >>> >>> Function-like macros (in particular, field access macros) remain >>> available to applications, but get replaced by function calls >>> (unless their definition only refers to features of the ABI, such >>> as the various _Check macros) >> Including Py_INCREF()/Py_DECREF() ? > > I believe so - MvL deliberately left the fields that the ref counting > relies on as part of the ABI. Hmm, another slow-down. This one has even more impact if you're writing extensions that have to deal with lots of objects. >>> Excluded Functions >>> ------------------ >>> >>> Functions declared in the following header files are not part >>> of the ABI: >>> - cellobject.h >>> - classobject.h >>> - code.h >>> - frameobject.h >>> - funcobject.h >>> - genobject.h >>> - pyarena.h >>> - pydebug.h >>> - symtable.h >>> - token.h >>> - traceback.h >> I don't think that's feasable: you basically remove all introspection >> functions that way. >> >> This will need a more fine-grained approach. > > I don't think it is reasonable to expect the introspection interfaces to > remain stable at a binary level across versions. > > Having "I want deep introspection support from C" and "I want to use a > single binary for multiple Python versions" be mutually exclusive > choices sounds like a perfectly sensible position to me. > > Also, keep in mind that even an extension module that restricts itself > to Py_LIMITED_API would still be able to call in to the Python > equivalents via PyObject_Call and friends (e.g. by importing and using > the inspect and traceback modules). Sure, but they'd also want to print tracebacks or raise fatal errors if necessary. >> What if you mix extensions that use the full C-API with ones >> that restrict themselves to the limited version ? >> >> Would creating a Python object in a full-API extension and >> free'ing it in a limited-API extension cause problems ? > > Possibly, if you end up mixing C runtimes in the process. Specifically: > 1. Python linked with MSVCRT X > 2. Full extension module linked with MSVCRT Y > 3. Limited extension module linked with MSVCRT Z > > The PyMem/PyObject APIs in the limited extension module will use the > heap in MSVCRT X, since they will be redirected through the Python > stable ABI as function calls. However, if the full extension module uses > the macro forms and links with the wrong MSVCRT version, then you have > the usual opportunities for conflicts between the two C runtimes. > > This isn't a problem created by defining a stable ABI though - it's the > main reason mixing C runtimes is a bad idea. (The two others we have > noted so far being IO issues, especially attempting to share FILE* > instances and the fact that changing the locale will only affect > whichever runtime the extension module linked against). Of course, but the stable ABI encourages mixing extensions regardless of what runtime they were compiled with. This is not much of an issue if the C runtime DLL doesn't change between releases, but it becomes a problem when they do e.g. due to an upgrade to a new MSVC++ compiler version or in case the extension was downloaded pre-compiled from pypi or some other site. 
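The "cheap and fast" pointer comparison Marc-Andre refers to at the start of this message typically looks like this in an extension that exports a static type object (names are illustrative):

    /* Classic extension style: the type is a static struct, so its
     * address is a link-time constant that client code compares against. */
    static PyTypeObject MyThing_Type = {
        PyVarObject_HEAD_INIT(NULL, 0)
        "mymodule.MyThing",               /* tp_name */
        /* ... remaining slots ... */
    };

    /* Exact check: a single pointer comparison, no function call. */
    #define MyThing_CheckExact(op)  (Py_TYPE(op) == &MyThing_Type)

    /* Subtype-aware check: fall back to a real call only for subclasses. */
    #define MyThing_Check(op) \
        (MyThing_CheckExact(op) || \
         PyType_IsSubtype(Py_TYPE(op), &MyThing_Type))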
I think the module import API should check for possible incompatibilities here and issue a warning (much like it does now for differences in the Python API version). >>> Good point. As a separate issue, I would actually like to deprecate, >>> then remove these APIs. I had originally hoped that this would happen >>> for 3.0 already, alas, nobody worked on it. >>> >>> In any case, I have removed them from the ABI now. >> How do you expect Python extensions to allocate memory and objects >> in a platform independent way without those APIs ? >> >> And as an aside: Which API families are you referring to ? PyMem_Malloc, >> PyObject_Malloc, or PyObject_New ? > > The ones with a FILE* parameter in the signature. There's no problem > with the PyMem/PyObject functions since those will be redirected to > consistently use the version of the C runtime that Python was originally > linked against (their macro counterparts are obviously off limits for > the stable ABI). Ah, ok. Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 26 2009) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2009-06-29: EuroPython 2009, Birmingham, UK 33 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From mal at egenix.com Tue May 26 18:42:37 2009 From: mal at egenix.com (M.-A. Lemburg) Date: Tue, 26 May 2009 18:42:37 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A1B9367.7090808@v.loewis.de> References: <4A107988.3020202@v.loewis.de> <4A1AD862.6090100@egenix.com> <4A1B9367.7090808@v.loewis.de> Message-ID: <4A1C1BFD.8090700@egenix.com> Martin v. L?wis wrote: >> Now, with the PEP, I have a feeling that the Python C-API >> will in effect be limited to what's in the PEP's idea of >> a usable ABI and open up the non-inluded public C-APIs >> to the same rate of change as the private APIs. > > That's certainly not the plan. Instead, the plan is to have > a stable ABI. The policy on the API isn't affected, except > for restricting changes to the API that would break the ABI. Thanks for clarifying this. >>> During the compilation of applications, the preprocessor macro >>> Py_LIMITED_API must be defined. Doing so will hide all definitions >>> that are not part of the ABI. >> So extensions wanting to use the full Python C-API as documented >> in the C-API docs will still be able to do this, right ? > > Correct. They would link to the version-specific DLL on Windows. Good. >>> The structure of type objects is not available to applications; >>> declaration of "static" type objects is not possible anymore >>> (for applications using this ABI). >> Hmm, that's going to create big problems for extensions that >> want to expose a C-API for their types: Type checks are normally >> done by pointer comparison using those static type objects. > > I don't see the problem. During module initialization, you > create the type object and store it in a global variable, and > then both clients and the module compare against the stored > pointer. Ah, good point ! 
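A sketch of the pattern Martin describes in the exchange just above: the pointer comparison stays, but the static PyTypeObject goes away. How the type object is actually created under the stable ABI is still open at this point in the thread, so the constructor below is a hypothetical placeholder.

    #define Py_LIMITED_API
    #include "Python.h"

    /* Set once during module initialization; also exported to client
     * extensions (e.g. as a module attribute) so they can compare too. */
    static PyObject *MyThing_TypeRef = NULL;

    /* Still one pointer comparison in the common, exact-match case. */
    #define MyThing_CheckExact(op) \
        ((PyObject *)Py_TYPE(op) == MyThing_TypeRef)

    static int
    init_mything_type(PyObject *module)
    {
        /* Hypothetical helper: build the type through whatever factory
         * the stable ABI ends up providing. */
        MyThing_TypeRef = make_mything_type();
        if (MyThing_TypeRef == NULL)
            return -1;

        Py_INCREF(MyThing_TypeRef);            /* keep our own reference */
        return PyModule_AddObject(module, "MyThing", MyThing_TypeRef);
    }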
>>> Function-like macros (in particular, field access macros) remain >>> available to applications, but get replaced by function calls >>> (unless their definition only refers to features of the ABI, such >>> as the various _Check macros) >> Including Py_INCREF()/Py_DECREF() ? > > Yes, although some people are requesting that these become functions. I'd opt against that, simply because it creates a lot of overhead due to the function call and issues with cache locality. >>> Excluded Functions >>> ------------------ >>> >>> Functions declared in the following header files are not part >>> of the ABI: >>> - cellobject.h >>> - classobject.h >>> - code.h >>> - frameobject.h >>> - funcobject.h >>> - genobject.h >>> - pyarena.h >>> - pydebug.h >>> - symtable.h >>> - token.h >>> - traceback.h >> I don't think that's feasable: you basically remove all introspection >> functions that way. >> >> This will need a more fine-grained approach. > > What specifically is it that you want to do in a module that you > couldn't do anymore? See my reply to Nick: some of the functions are needed even if you don't want to do introspection, such as Py_FatalError() or PyTraceBack_Print(). BTW: Given the headline, I take it that the various type checking macros in these header will still be available, right ? >>> On Windows, applications shall link with python3.dll; >> You mean: extensions that were compiled with Py_LIMITED_API, right ? > > Correct, see "Terminology" in the PEP. Good, thanks. >>> an import >>> library python3.lib will be available. This DLL will redirect all of >>> its API functions through /export linker options to the full >>> interpreter DLL, i.e. python3y.dll. >> What if you mix extensions that use the full C-API with ones >> that restrict themselves to the limited version ? > > Some link against python3.dll, others against python32.dll (say). > >> Would creating a Python object in a full-API extension and >> free'ing it in a limited-API extension cause problems ? > > No problem that I can see. Can we be sure that the MSVCRT used by python35.dll stays compatible to the one used by say python32.dll ? What if the CRT memory management changes between MSVCRT versions ? Another aspect to consider: How will this work in the light of having multiple copies of Python installed on a Windows machine ? They implementation section suggests that python3.dll would always redirect to the python3x.dll for which it was installed, ie. if I have Python 3.5 installed, but then need to run some app with Python 3.2, the installed python3.dll would then point back to the python32.dll. Now, if I start a Python 3.5 application which uses a limited API extension, this would try to load python32.dll into the Python 3.5 process. AFAIK, that's not possible due to the naming conflicts. >>> This PEP will be implemented in a branch, allowing users to check >>> whether their modules conform to the ABI. To simplify this testing, an >>> additional macro Py_LIMITED_API_WITH_TYPES will expose the existing >>> type object layout, to let users postpone rewriting all types. When >>> the this branch is merged into the 3.2 code base, this macro will >>> be removed. >> Now I'm confused again: this sounds a lot like you do want all extension >> writers to only use the limited API. > > I certainly want to support as many modules as reasonable with the PEP. 
> Whether or not developers then chose to build version-independent > binaries is certainly outside the scope of the PEP - it only specifies > action items for Python, not for application authors. Thanks for the clarification. >>>> Something I haven't seen explicitly mentioned as yet (in the PEP or the >>>>> python-dev list discussion) are the memory management APIs and the FILE* >>>>> APIs which can cause the MSVCRT versioning issues on Windows. >>>>> >>>>> Those would either need to be excluded from the stable ABI or else >>>>> changed to use opaque pointers. >>> Good point. As a separate issue, I would actually like to deprecate, >>> then remove these APIs. I had originally hoped that this would happen >>> for 3.0 already, alas, nobody worked on it. >>> >>> In any case, I have removed them from the ABI now. >> How do you expect Python extensions to allocate memory and objects >> in a platform independent way without those APIs ? > > I have only removed functions from the ABI that have FILE* in their > signatures. > >> And as an aside: Which API families are you referring to ? PyMem_Malloc, >> PyObject_Malloc, or PyObject_New ? > > Neither. PyRun_AnyFileFlags and friends. Good. Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 26 2009) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2009-06-29: EuroPython 2009, Birmingham, UK 33 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From martin at v.loewis.de Tue May 26 20:31:16 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 26 May 2009 20:31:16 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A1C18CB.6040208@egenix.com> References: <4A107988.3020202@v.loewis.de> <4A1AD862.6090100@egenix.com> <4A1B07FA.6010509@gmail.com> <4A1C18CB.6040208@egenix.com> Message-ID: <4A1C3574.6040306@v.loewis.de> >>>> The structure of type objects is not available to applications; >>>> declaration of "static" type objects is not possible anymore >>>> (for applications using this ABI). >>> Hmm, that's going to create big problems for extensions that >>> want to expose a C-API for their types: Type checks are normally >>> done by pointer comparison using those static type objects. >> They would just have to expose "MyExtensionPrefix_MyType_Check" and >> "MyExtensionPrefix_MyType_CheckExact" functions the same way that types >> in the C API do. > > Hmm, that's a function call per type check and will slow things > down a lot, esp. when working with APIs that deal a lot with > these objects. See my other response. You can continue to provide _Check macros; knowledge of the structure of types is not necessary to perform such checks. > The typical way to implement these type checks is via a simple > pointer comparison (falling back to a function for sub-types). > That's cheap and fast. And will continue to be available to ABI-compliant extensions. >>> Including Py_INCREF()/Py_DECREF() ? >> I believe so - MvL deliberately left the fields that the ref counting >> relies on as part of the ABI. 
> > Hmm, another slow-down. ??? Why is "no change" a slow-down? > This is not much of an issue if the C runtime DLL doesn't change > between releases, but it becomes a problem when they do e.g. > due to an upgrade to a new MSVC++ compiler version or in case > the extension was downloaded pre-compiled from pypi or some > other site. What problem specifically may occur? Regards, Martin From martin at v.loewis.de Tue May 26 20:54:35 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 26 May 2009 20:54:35 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A1C1BFD.8090700@egenix.com> References: <4A107988.3020202@v.loewis.de> <4A1AD862.6090100@egenix.com> <4A1B9367.7090808@v.loewis.de> <4A1C1BFD.8090700@egenix.com> Message-ID: <4A1C3AEB.7010708@v.loewis.de> >>>> Functions declared in the following header files are not part >>>> of the ABI: >>>> - cellobject.h >>>> - classobject.h >>>> - code.h >>>> - frameobject.h >>>> - funcobject.h >>>> - genobject.h >>>> - pyarena.h >>>> - pydebug.h >>>> - symtable.h >>>> - token.h >>>> - traceback.h >>> I don't think that's feasable: you basically remove all introspection >>> functions that way. >>> >>> This will need a more fine-grained approach. >> What specifically is it that you want to do in a module that you >> couldn't do anymore? > > See my reply to Nick: some of the functions are needed even > if you don't want to do introspection, such as Py_FatalError() Ok. I don't know what Py_FatalError is doing in pydebug.h, so I now propose to move it to pyerrors.h. > or PyTraceBack_Print(). Ok; I have removed traceback.h from the list. By the other rules of the PEP, the only function that becomes available then is PyTraceBack_Print. > BTW: Given the headline, I take it that the various type checking > macros in these header will still be available, right ? Which headers? The one on the list above? No; my idea would be to completely hide them as-is. All other type-checking macros will remain available, and will remain being macros. >>> Would creating a Python object in a full-API extension and >>> free'ing it in a limited-API extension cause problems ? >> No problem that I can see. > > Can we be sure that the MSVCRT used by python35.dll stays compatible > to the one used by say python32.dll ? What if the CRT memory > management changes between MSVCRT versions ? It doesn't matter. For Python "things", the extension module will use the pymem.h functions, which get routed through pythonxy.dll to the CRT that Python was build with. If the extension uses regular malloc(), it should also invoke regular free() on the pointer. There is no API where Python calls malloc directly and the extension calls free, or vice versa. > How will this work in the light of having multiple copies of > Python installed on a Windows machine ? Interesting question. One solution could be to use SxS, which would allow multiple concurrent installations of python3.dll, although we would need to make sure it always binds to the "right" one in each context. Another solution could be to keep the various copies of python3.dll in their respective PYTHONHOMEs, and leave it to python.exe or the app to load the right one; any subsequent extension modules should then pick up the one that was already loaded. > They implementation section suggests that python3.dll would always > redirect to the python3x.dll for which it was installed, ie. 
if > I have Python 3.5 installed, but then need to run some app with > Python 3.2, the installed python3.dll would then point back to the > python32.dll. That depends on where they get installed. If they all go into system32, only the most recent one would be available, which is probably not desirable. > Now, if I start a Python 3.5 application which uses a limited > API extension, this would try to load python32.dll into the > Python 3.5 process. AFAIK, that's not possible due to the > naming conflicts. I don't see this problem. As long as we manage to install multiple versions of python3.dll on the system somehow, different processes could certainly load different such DLLs, and the same extension module would always use the right one. Regards, Martin From phillip.sitbon+python-dev at gmail.com Tue May 26 21:48:49 2009 From: phillip.sitbon+python-dev at gmail.com (Phillip Sitbon) Date: Tue, 26 May 2009 12:48:49 -0700 Subject: [Python-Dev] Making the GIL faster & lighter on Windows Message-ID: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> Hi everyone, I'm new to the list but I've been embedding Python and working very closely with the core sources for many years now. I discovered Python a long time ago when I needed to embed a scripting language and found the PHP sources... unreadable ;) Anyway, I'd like to ask something that may have been asked already, so I apologize if this has been covered. Instead of removing the GIL, has anyone thought of making it more lightweight? The current situation for Windows is that the single-thread case is decently fast (via interlocked operations), but it drops to using an event object in the case of contention. (see thread_nt.h) Now, I don't have any specific evidence aside from my experience in Windows multithreaded programming, but event objects are often considered the slowest synchronization mechanism available. So, what are the alternatives? Mutexes or critical sections. Semaphores too, if you want to get fancy, but I digress. Because mutexes have the capability of inter-process locking, which we don't need, critical sections fit the bill as a lightweight locking mechanism. They work in a way similar to how the Python GIL is handled: first, attempt an interlocked operation, and if another thread owns the lock, wait on a kernel object. They are known to be extremely fast. There are some catches with using a critical section instead of the current method: 1. It is recursive, while the current GIL setup is not. Would it break Python to support (or deal with) recursive behavior at the GIL level? Note that we can still disallow recursion and fail because we know if the current thread is the lock owner, but the return from the lock function is usually only checked when the wait parameter is zero (meaning "don't block, just try to acquire"). The biggest problem I see here is how mixing the PyGILState_* API with multiple interpreters will behave: when PyGILState_Ensure() is called while the GIL is held for a thread state under an interpreter other than the main interpreter, it tries to re-lock the GIL. This would normally cause a deadlock, but the best we could do with a critical section is have the call fail and/or increase a recursion counter. If maintaining behavior is absolutely necessary, I guess it would be pretty easy to force a deadlock. Personally, I would prefer a Py_FatalError or something like it. 2. Backwards incompatibility: TryEnterCriticalSection isn't available pre-NT4, so Windows 95 support is broken. 
Microsoft doesn't support or even mention it in the list of supporting OSes for their API functions anymore, so... non-issue? Some of the data structure is available to us, so I bet it would be easy to implement the function manually. 3. ?? - I'm sure there are other issues that deserve a look. I've given this a shot already while doing some concurrency testing with my ISAPI extension (PyISAPIe). First of all, nothing looks broken yet. I'm using my modified python26.dll to run all of my Python code and trying to find anywhere it could possibly break. For multiple concurrent requests against a single multithreaded ISAPI handler process, I see a statistically significant speed increase depending on how much Python code is executed. With more Python code executed (e.g. a Django page), the speedup was about 2x. I haven't tested with varied values for _Py_CheckInterval aside from finding a sweet spot for my specific purposes, but using 100 (the default) would likely make the performance difference more noticeable. A spin mutex also does well, but the results vary a lot more. Just as a disclaimer, my tests were nowhere near scientific, but if anyone needs convincing I can come up with some actual measurements. I think at this point most of you are wondering more about what it would break. Hopefully I haven't wasted anyone's time - I just wanted to share what I see as a possibly substantial improvement to Python's core. let me know if you're interested in a patch to use for your own testing. Cheers, Phillip From solipsis at pitrou.net Tue May 26 21:57:53 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 26 May 2009 19:57:53 +0000 (UTC) Subject: [Python-Dev] Making the GIL faster & lighter on Windows References: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> Message-ID: Hello, > Hopefully I haven't wasted anyone's time - I just wanted to share what > I see as a possibly substantial improvement to Python's core. let me > know if you're interested in a patch to use for your own testing. You should definitely open a bug entry in http://bugs.python.org. There, post your patch, some explanations and preferably a quick way (e.g. a simple script) of reproducing the speedups (without having to install a third-party library or extension, that is). Thanks Antoine. From v+python at g.nevcal.com Tue May 26 22:01:41 2009 From: v+python at g.nevcal.com (Glenn Linderman) Date: Tue, 26 May 2009 13:01:41 -0700 Subject: [Python-Dev] Making the GIL faster & lighter on Windows In-Reply-To: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> References: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> Message-ID: <4A1C4AA5.2090905@g.nevcal.com> On approximately 5/26/2009 12:48 PM, came the following characters from the keyboard of Phillip Sitbon: > Hi everyone, > > I'm new to the list but I've been embedding Python and working very > closely with the core sources for many years now. I discovered Python > a long time ago when I needed to embed a scripting language and found > the PHP sources... unreadable ;) ... > I've given this a shot already while doing some concurrency testing > with my ISAPI extension (PyISAPIe). First of all, nothing looks broken > yet. I'm using my modified python26.dll to run all of my Python code > and trying to find anywhere it could possibly break. For multiple > concurrent requests against a single multithreaded ISAPI handler > process, I see a statistically significant speed increase depending on > how much Python code is executed. 
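For those who haven't read thread_nt.h, the replacement being floated amounts to swapping the event-based NonRecursiveMutex for a Win32 CRITICAL_SECTION. Very roughly, it would look like the following; this is a sketch of the idea rather than the actual patch, and it deliberately ignores the recursion and ownership differences raised later in the thread.

    #include <windows.h>
    #include <stdlib.h>

    typedef struct {
        CRITICAL_SECTION cs;
    } cs_lock;

    static cs_lock *
    cs_lock_alloc(void)
    {
        cs_lock *lock = (cs_lock *)malloc(sizeof(*lock));
        if (lock != NULL)
            InitializeCriticalSection(&lock->cs);  /* spin count 0 */
        return lock;
    }

    /* waitflag != 0: block until acquired.  The uncontended path is an
     * interlocked operation; only contended acquisitions touch a kernel
     * object. */
    static int
    cs_lock_acquire(cs_lock *lock, int waitflag)
    {
        if (waitflag) {
            EnterCriticalSection(&lock->cs);
            return 1;
        }
        /* Non-blocking attempt; TryEnterCriticalSection needs NT4+. */
        return TryEnterCriticalSection(&lock->cs) ? 1 : 0;
    }

    static void
    cs_lock_release(cs_lock *lock)
    {
        /* Unlike the event-based lock, only the owning thread may do this. */
        LeaveCriticalSection(&lock->cs);
    }

    static void
    cs_lock_free(cs_lock *lock)
    {
        DeleteCriticalSection(&lock->cs);
        free(lock);
    }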
With more Python code executed (e.g. > a Django page), the speedup was about 2x. I haven't tested with varied > values for _Py_CheckInterval aside from finding a sweet spot for my > specific purposes, but using 100 (the default) would likely make the > performance difference more noticeable. A spin mutex also does well, > but the results vary a lot more. > > Just as a disclaimer, my tests were nowhere near scientific, but if > anyone needs convincing I can come up with some actual measurements. I > think at this point most of you are wondering more about what it would > break. > > Hopefully I haven't wasted anyone's time - I just wanted to share what > I see as a possibly substantial improvement to Python's core. let me > know if you're interested in a patch to use for your own testing. I wonder if the patch could be structured as a conditional compilation? You know how many different spots are touched, and how many lines per spot. If it could be, then theoretically it could be released and people could do lots of comparative stress testing with different workloads. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking From martin at v.loewis.de Tue May 26 22:07:23 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Tue, 26 May 2009 22:07:23 +0200 Subject: [Python-Dev] Making the GIL faster & lighter on Windows In-Reply-To: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> References: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> Message-ID: <4A1C4BFB.40501@v.loewis.de> > 3. ?? - I'm sure there are other issues that deserve a look. What about fairness? I don't know off-hand whether the GIL is fair, or whether critical sections are fair, but it needs to be considered. Regards, Martin From phillip.sitbon+python-dev at gmail.com Tue May 26 23:00:10 2009 From: phillip.sitbon+python-dev at gmail.com (Phillip Sitbon) Date: Tue, 26 May 2009 14:00:10 -0700 Subject: [Python-Dev] Making the GIL faster & lighter on Windows In-Reply-To: <4A1C4BFB.40501@v.loewis.de> References: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> <4A1C4BFB.40501@v.loewis.de> Message-ID: <536685ea0905261400t1ff81f6bx72ed8017f1c7cb80@mail.gmail.com> > You should definitely open a bug entry in http://bugs.python.org. There, post > your patch, some explanations and preferably a quick way (e.g. a simple script) > of reproducing the speedups (without having to install a third-party library or > extension, that is). I'll get started on that. I'm assuming I should generate a patch from the trunk (2.7)? The file doesn't look different, but I want to make sure I get it from the right place. > I wonder if the patch could be structured as a conditional compilation? You > know how many different spots are touched, and how many lines per spot. > > If it could be, then theoretically it could be released and people could do > lots of comparative stress testing with different workloads. That would be easy to do, because I am just replacing the *NonRecursiveMutex functions. > What about fairness? I don't know off-hand whether the GIL is > fair, or whether critical sections are fair, but it needs to be > considered. If you define fairness in this context as not starving other threads while consuming resources, that is built into the interpreter via sys.setcheckinterval() and also anywhere the GIL is released for I/O. 
What might be interesting is to see if releasing a critical section and immediately re-acquiring it every _Py_CheckInterval bytecode operations behaves in a similar manner (see ceval.c, line 869). My best guess right now is that it will behave as expected when not using the spin-based critical section. AFAIK, the kernel processes waiters in a FIFO manner without regard to priority. Because a guarantee of mutual exclusion is absolutely necessary, it's up to applications to provide fairness. Python does a decent job of this. - Phillip From tlesher at gmail.com Tue May 26 23:03:44 2009 From: tlesher at gmail.com (Tim Lesher) Date: Tue, 26 May 2009 17:03:44 -0400 Subject: [Python-Dev] Making the GIL faster & lighter on Windows In-Reply-To: <4A1C4BFB.40501@v.loewis.de> References: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> <4A1C4BFB.40501@v.loewis.de> Message-ID: <9613db600905261403j1b30f37cs4a94f24912d89960@mail.gmail.com> On Tue, May 26, 2009 at 16:07, "Martin v. L?wis" wrote: >> 3. ?? - I'm sure there are other issues that deserve a look. > > What about fairness? I don't know off-hand whether the GIL is > fair, or whether critical sections are fair, but it needs to be > considered. FWIW, Win32 CriticalSections are guaranteed to be fair, but they don't guarantee a defined order of wakeup among threads of equal priority. -- Tim Lesher From solipsis at pitrou.net Tue May 26 23:09:30 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Tue, 26 May 2009 21:09:30 +0000 (UTC) Subject: [Python-Dev] Making the GIL faster & lighter on Windows References: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> <4A1C4BFB.40501@v.loewis.de> Message-ID: Martin v. L?wis v.loewis.de> writes: > > What about fairness? I don't know off-hand whether the GIL is > fair, According to a past discussion on this list, the current implementation isn't: http://mail.python.org/pipermail/python-dev/2008-March/077814.html (at least on the poster's system) Regards Antoine. From phillip.sitbon+python-dev at gmail.com Tue May 26 23:45:57 2009 From: phillip.sitbon+python-dev at gmail.com (Phillip Sitbon) Date: Tue, 26 May 2009 14:45:57 -0700 Subject: [Python-Dev] Making the GIL faster & lighter on Windows In-Reply-To: References: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> <4A1C4BFB.40501@v.loewis.de> Message-ID: <536685ea0905261445v6748f995i6a22f63f7598af32@mail.gmail.com> > FWIW, Win32 CriticalSections are guaranteed to be fair, but they don't > guarantee a defined order of wakeup among threads of equal priority. Indeed, I should have quoted the MSDN docs: "The threads of a single process can use a critical section object for mutual-exclusion synchronization. There is no guarantee about the order in which threads will obtain ownership of the critical section, however, the system will be fair to all threads." http://msdn.microsoft.com/en-us/library/ms683472(VS.85).aspx I read somewhere else that the FIFO order is present, but obviously we shouldn't to expect that if it's not documented as such. > According to a past discussion on this list, the current implementation isn't: > http://mail.python.org/pipermail/python-dev/2008-March/077814.html > (at least on the poster's system) > I believe he's only talking about Linux. Apples & oranges when it comes to stuff like this, although it still justifies looking into what happens every _Py_CheckInterval on Windows. 
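Schematically, the every-_Py_CheckInterval behaviour referred to above is the eval loop dropping and immediately re-acquiring the GIL. A heavily simplified sketch follows; the names roughly follow the 2.x eval loop, but this is not the actual ceval.c code.

    #include "Python.h"
    #include "pythread.h"

    static PyThread_type_lock interpreter_lock;   /* the GIL */
    static volatile int ticker;
    static int check_interval = 100;              /* sys.setcheckinterval() */

    static void
    maybe_yield_gil(void)
    {
        if (--ticker < 0) {
            ticker = check_interval;
            /* Drop and immediately re-acquire so waiting threads get a
             * chance to run.  Whether a CRITICAL_SECTION hands the lock
             * over fairly at this point is exactly the open question. */
            PyThread_release_lock(interpreter_lock);
            PyThread_acquire_lock(interpreter_lock, 1);
        }
    }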
- Phillip From martin at v.loewis.de Wed May 27 01:24:02 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 27 May 2009 01:24:02 +0200 Subject: [Python-Dev] Making the GIL faster & lighter on Windows In-Reply-To: <536685ea0905261400t1ff81f6bx72ed8017f1c7cb80@mail.gmail.com> References: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> <4A1C4BFB.40501@v.loewis.de> <536685ea0905261400t1ff81f6bx72ed8017f1c7cb80@mail.gmail.com> Message-ID: <4A1C7A12.7020900@v.loewis.de> > If you define fairness in this context as not starving other threads > while consuming resources, that is built into the interpreter via > sys.setcheckinterval() and also anywhere the GIL is released for I/O. > What might be interesting is to see if releasing a critical section > and immediately re-acquiring it every _Py_CheckInterval bytecode > operations behaves in a similar manner (see ceval.c, line 869). My > best guess right now is that it will behave as expected when not using > the spin-based critical section. AFAIK, the kernel processes waiters > in a FIFO manner without regard to priority. Because a guarantee of > mutual exclusion is absolutely necessary, it's up to applications to > provide fairness. Python does a decent job of this. No: fairness in mutex synchronization means that every waiter for the mutex will eventually acquire it; it won't happen that one thread starves waiting for the mutex. This is something that the mutex needs to provide, not the application. Regards, Martin From martin at v.loewis.de Wed May 27 01:36:54 2009 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Wed, 27 May 2009 01:36:54 +0200 Subject: [Python-Dev] Making the GIL faster & lighter on Windows In-Reply-To: <536685ea0905261445v6748f995i6a22f63f7598af32@mail.gmail.com> References: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> <4A1C4BFB.40501@v.loewis.de> <536685ea0905261445v6748f995i6a22f63f7598af32@mail.gmail.com> Message-ID: <4A1C7D16.50009@v.loewis.de> >> According to a past discussion on this list, the current implementation isn't: >> http://mail.python.org/pipermail/python-dev/2008-March/077814.html >> (at least on the poster's system) >> > > I believe he's only talking about Linux. Apples & oranges when it > comes to stuff like this Please trust Antoine that it's relevant: if the current implementation isn't fair on Linux, there is no need for the new implementation to be fair on Windows. Regards, Martin From phillip.sitbon+python-dev at gmail.com Wed May 27 02:42:39 2009 From: phillip.sitbon+python-dev at gmail.com (Phillip Sitbon) Date: Tue, 26 May 2009 17:42:39 -0700 Subject: [Python-Dev] Making the GIL faster & lighter on Windows In-Reply-To: <4A1C7D16.50009@v.loewis.de> References: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> <4A1C4BFB.40501@v.loewis.de> <536685ea0905261445v6748f995i6a22f63f7598af32@mail.gmail.com> <4A1C7D16.50009@v.loewis.de> Message-ID: <536685ea0905261742i47707b9fq4fc4821f6a4d5497@mail.gmail.com> > No: fairness in mutex synchronization means that every waiter for the > mutex will eventually acquire it; it won't happen that one thread > starves waiting for the mutex. This is something that the mutex needs to > provide, not the application. Right, I guess I was thinking of it in terms of needing to release the mutex at some point in order for it to be later acquired. 
> Please trust Antoine that it's relevant: if the current implementation > isn't fair on Linux, there is no need for the new implementation to be > fair on Windows. Fair enough. -- While setting up my patch, I'm noticing something that could be potentially bad for this idea that I overlooked until just now. I'm going to hold off on submitting a ticket unless others suggest it's a better idea to keep this discussion going there. The thread module's lock object uses the same code used to lock and unlock the GIL. By replacing the current locking mechanism with a critical section, it'd be breaking the expected functionality of the lock object, specifically two cases: 1. Blocking recursion: Critical sections don't block on recursion, no way to enforce that 2. Releasing: Currently any thread can release a lock, but only the owner release a critical section Of course blocking recursion is only meaningful with the current behavior of #2, otherwise it's an unrecoverable deadlock. There are a few solutions to this. The first would be to implement only the GIL as a critical section. The problem then is the need to change all of the core code that does not use PyEval_Acquire/ReleaseLock (there is some, right?), which is the best place to use something other than the thread module's locking mechanism on the GIL. This is doable with some effort, but clearly not an option if there is any possibility that extensions are using something other than the PyThreadState_*, PyGILState_* and PyEval_* APIs to manipulate the GIL (are there others?). After any of this, of course, I wonder what kind of crazy things might be expected of the GIL externally that requires its behavior to remain as it is. The second solution would be to use semaphores. I can't say yet if it would be worth it performance-wise so I will refrain from conjecture for the moment. I like the first solution above... I don't know why non-recursion would be necessary for the GIL; clearly it would be a little more involved, but if I can demonstrate the performance gain maybe it's worth my time. - Phillip From kristjan at ccpgames.com Wed May 27 11:23:00 2009 From: kristjan at ccpgames.com (=?iso-8859-1?Q?Kristj=E1n_Valur_J=F3nsson?=) Date: Wed, 27 May 2009 09:23:00 +0000 Subject: [Python-Dev] Making the GIL faster & lighter on Windows In-Reply-To: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> References: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> Message-ID: <930F189C8A437347B80DF2C156F7EC7F057F03EF73@exchis.ccp.ad.local> I've often thought of this. The problem is that the GIL uses the regular python "lock" which has to be non-recursive, since it is used for synchronization operations other than mutual exclusion, e.g. one thread going to sleep, and another waking it up. Now, we could easily create another class of locks, a python "mutex" or a "critical section" even, which is allowed (but not required) to be recursive. On other platforms, this could fall back to being the good old lock. Requiring it to be recursive would mean that we would need implementations for all platforms. Which is possible, I suppose, building on the old python lock... For the GIL, we would then use a python "mutex" or "critical section" whichever you prefer. Note that for the GIL, if you use a CriticalSection object, you should initialize its "spincount" to zero, because the GIL is almost always in contention. That is, if you don't get the GIL right away, you won't for a while. 
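Concretely, the spin-count choice Kristjan describes is made when the critical section is initialized (illustrative only):

    #include <windows.h>

    static CRITICAL_SECTION gil_cs;

    static void
    gil_cs_init(void)
    {
        /* Plain initialization: spin count 0, i.e. go straight to the
         * kernel wait on contention.  For a lock that is nearly always
         * contended, like the GIL, spinning mostly burns CPU. */
        InitializeCriticalSection(&gil_cs);

        /* The explicit equivalent, and the contrasting case of a lock
         * with short hold times where some spinning can avoid the
         * kernel transition entirely:
         *
         *   InitializeCriticalSectionAndSpinCount(&gil_cs, 0);
         *   InitializeCriticalSectionAndSpinCount(&short_cs, 4000);
         */
    }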
I don't know what kernel primitive the Critical Section uses, but if it uses an Event object or something similar, we are in the same soup, so to say, because the CriticalSection's spinlocking feature buys us nothing. K -----Original Message----- From: python-dev-bounces+kristjan=ccpgames.com at python.org [mailto:python-dev-bounces+kristjan=ccpgames.com at python.org] On Behalf Of Phillip Sitbon Sent: 26. ma? 2009 19:49 To: python-dev at python.org Subject: [Python-Dev] Making the GIL faster & lighter on Windows Hi everyone, I'm new to the list but I've been embedding Python and working very closely with the core sources for many years now. I discovered Python a long time ago when I needed to embed a scripting language and found the PHP sources... unreadable ;) Anyway, I'd like to ask something that may have been asked already, so I apologize if this has been covered. Instead of removing the GIL, has anyone thought of making it more lightweight? The current situation for Windows is that the single-thread case is decently fast (via interlocked operations), but it drops to using an event object in the case of contention. (see thread_nt.h) From ncoghlan at gmail.com Wed May 27 13:17:55 2009 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 27 May 2009 21:17:55 +1000 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A1C3574.6040306@v.loewis.de> References: <4A107988.3020202@v.loewis.de> <4A1AD862.6090100@egenix.com> <4A1B07FA.6010509@gmail.com> <4A1C18CB.6040208@egenix.com> <4A1C3574.6040306@v.loewis.de> Message-ID: <4A1D2163.4070709@gmail.com> [PEP] >>>>> Function-like macros (in particular, field access macros) remain >>>>> available to applications, but get replaced by function calls >>>>> (unless their definition only refers to features of the ABI, such >>>>> as the various _Check macros) [MAL] >>>> Including Py_INCREF()/Py_DECREF() ? [Nick] >>> I believe so - MvL deliberately left the fields that the ref counting >>> relies on as part of the ABI. [MAL] >> Hmm, another slow-down. [MvL] > ??? Why is "no change" a slow-down? That was just a miscommunication - I misunderstood the sense in which MAL was using "Including". He was referring to the first part of the paragraph from the PEP (most macros become functions), but I answered assuming he was referring to the part in parentheses (some macros get to stay). So to be perfectly clear: the Py_INCREF/Py_DECREF macros are available as part of the stable ABI because they qualify for the PEP's "definition only refers to features of the ABI" exception. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- From ncoghlan at gmail.com Wed May 27 13:24:02 2009 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 27 May 2009 21:24:02 +1000 Subject: [Python-Dev] Making the GIL faster & lighter on Windows In-Reply-To: <4A1C7A12.7020900@v.loewis.de> References: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> <4A1C4BFB.40501@v.loewis.de> <536685ea0905261400t1ff81f6bx72ed8017f1c7cb80@mail.gmail.com> <4A1C7A12.7020900@v.loewis.de> Message-ID: <4A1D22D2.4050000@gmail.com> Martin v. L?wis wrote: > No: fairness in mutex synchronization means that every waiter for the > mutex will eventually acquire it; it won't happen that one thread > starves waiting for the mutex. This is something that the mutex needs to > provide, not the application. CriticalSections are first come first served on Windows, just like a regular mutex. 
As Phillip already noted, their main limitation is that they don't work cross-process (of course, that's also where they get their extra speed). Since we don't need the cross-process feature and we don't support Win 9x any more, this is certainly an idea worth looking at. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia --------------------------------------------------------------- From mal at egenix.com Wed May 27 14:05:06 2009 From: mal at egenix.com (M.-A. Lemburg) Date: Wed, 27 May 2009 14:05:06 +0200 Subject: [Python-Dev] PEP 384: Defining a Stable ABI In-Reply-To: <4A1D2163.4070709@gmail.com> References: <4A107988.3020202@v.loewis.de> <4A1AD862.6090100@egenix.com> <4A1B07FA.6010509@gmail.com> <4A1C18CB.6040208@egenix.com> <4A1C3574.6040306@v.loewis.de> <4A1D2163.4070709@gmail.com> Message-ID: <4A1D2C72.1090705@egenix.com> Nick Coghlan wrote: > [PEP] >>>>>> Function-like macros (in particular, field access macros) remain >>>>>> available to applications, but get replaced by function calls >>>>>> (unless their definition only refers to features of the ABI, such >>>>>> as the various _Check macros) > [MAL] >>>>> Including Py_INCREF()/Py_DECREF() ? > [Nick] >>>> I believe so - MvL deliberately left the fields that the ref counting >>>> relies on as part of the ABI. > [MAL] >>> Hmm, another slow-down. > [MvL] >> ??? Why is "no change" a slow-down? > > That was just a miscommunication - I misunderstood the sense in which > MAL was using "Including". He was referring to the first part of the > paragraph from the PEP (most macros become functions), but I answered > assuming he was referring to the part in parentheses (some macros get to > stay). > > So to be perfectly clear: the Py_INCREF/Py_DECREF macros are available > as part of the stable ABI because they qualify for the PEP's "definition > only refers to features of the ABI" exception. Sorry for the confusion. The exclusion clause in the PEP should probably be replaced by an explicit list of macros which are made available. It not necessarily obvious that a macro only uses features made available through the ABI without actually digging through the headers. In the case of Py_INCREF()/Py_DECREF() the macros do use private macros which the ABI omits. Cheers, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 27 2009) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2009-06-29: EuroPython 2009, Birmingham, UK 32 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From aahz at pythoncraft.com Wed May 27 14:39:55 2009 From: aahz at pythoncraft.com (Aahz) Date: Wed, 27 May 2009 05:39:55 -0700 Subject: [Python-Dev] Arguments of MatchObject in re module In-Reply-To: <4A1B2EE2.3030107@mrabarnett.plus.com> References: <4A1B2EE2.3030107@mrabarnett.plus.com> Message-ID: <20090527123955.GB2573@panix.com> On Tue, May 26, 2009, MRAB wrote: > > >>> p = re.compile("foo") > >>> help(p.match) > Help on built-in function match: > > match(...) > match(string[, pos[, endpos]]) --> match object or None. 
> Matches zero or more characters at the beginning of the string > > >>> p.match(string="foo") > > Traceback (most recent call last): > File "", line 1, in > p.match(string="foo") > TypeError: Required argument 'pattern' (pos 1) not found > > The name of the first argument should be "string", yet it's "pattern". > Does anyone know if it's anything other than a mistake? Should it be > fixed in the next version of the re module, or are we just stuck with it > (and should just change the docstring to match)? Please file a report on bugs.python.org so this doesn't get lost. Attaching a suggested patch for _sre.c would be most welcome. -- Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/ "In many ways, it's a dull language, borrowing solid old concepts from many other languages & styles: boring syntax, unsurprising semantics, few automatic coercions, etc etc. But that's one of the things I like about it." --Tim Peters on Python, 16 Sep 1993 From curt at hagenlocher.org Wed May 27 14:59:48 2009 From: curt at hagenlocher.org (Curt Hagenlocher) Date: Wed, 27 May 2009 05:59:48 -0700 Subject: [Python-Dev] Making the GIL faster & lighter on Windows In-Reply-To: <4A1D22D2.4050000@gmail.com> References: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> <4A1C4BFB.40501@v.loewis.de> <536685ea0905261400t1ff81f6bx72ed8017f1c7cb80@mail.gmail.com> <4A1C7A12.7020900@v.loewis.de> <4A1D22D2.4050000@gmail.com> Message-ID: On Wed, May 27, 2009 at 4:24 AM, Nick Coghlan wrote: > > CriticalSections are first come first served on Windows, just like a > regular mutex. "Starting with Windows Server 2003 with Service Pack 1 (SP1), threads waiting on a critical section do not acquire the critical section on a first-come, first-serve basis." http://msdn.microsoft.com/en-us/library/ms682530(VS.85).aspx Windows critical sections use events for kernel-level synchronization. The user-mode code basically consists of an interlocked instruction inside the spin loop. When the likelihood of contention is low, a critical section should be a big win because it won't need to switch into the kernel. I suspect that contention will be frequent for the GIL A good description of pre-Vista Windows critical sections can be found here: http://msdn.microsoft.com/en-us/magazine/cc164040.aspx -- Curt Hagenlocher curt at hagenlocher.org From phillip.sitbon+python-dev at gmail.com Thu May 28 00:22:52 2009 From: phillip.sitbon+python-dev at gmail.com (Phillip Sitbon) Date: Wed, 27 May 2009 15:22:52 -0700 Subject: [Python-Dev] Making the GIL faster & lighter on Windows In-Reply-To: <930F189C8A437347B80DF2C156F7EC7F057F03EF73@exchis.ccp.ad.local> References: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> <930F189C8A437347B80DF2C156F7EC7F057F03EF73@exchis.ccp.ad.local> Message-ID: <536685ea0905271522w19b26852pfb67b295673e8358@mail.gmail.com> Heads up to those who were following, I did my best to clearly outline the situation and direction in the tracker. http://bugs.python.org/issue6132 It includes a patch that will break the expected behavior of the thread lock object but make it possible to test GIL performance. > Note that for the GIL, if you use a CriticalSection object, you should initialize its "spincount" to zero, because the GIL is almost always in contention. ?That is, if you don't get the GIL right away, you won't for a while. If I'm not mistaken, calling InitializeCriticalSection rather than InitializeCriticalSectionAndSpinCount (gotta love those long function names) sets the spin count to zero. 
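(For anyone who wants to poke at this from Python instead of rebuilding thread_nt.h, here is a rough ctypes sketch of the calls being discussed. It is only a sketch: the 64-byte buffer for the opaque CRITICAL_SECTION and the 4000 spin count are assumptions of mine, not values taken from any real patch.)

    import ctypes

    kernel32 = ctypes.windll.kernel32

    # CRITICAL_SECTION is opaque to ctypes; 64 bytes is assumed to be large
    # enough for the real RTL_CRITICAL_SECTION on 32- and 64-bit Windows.
    cs = ctypes.create_string_buffer(64)

    # Initialize with an explicit spin count (0 here, matching what a plain
    # InitializeCriticalSection call is said to give you).
    kernel32.InitializeCriticalSectionAndSpinCount(ctypes.byref(cs), 0)

    kernel32.EnterCriticalSection(ctypes.byref(cs))
    try:
        pass  # the work the lock protects would go here
    finally:
        kernel32.LeaveCriticalSection(ctypes.byref(cs))

    # SetCriticalSectionSpinCount returns the previous spin count, which is
    # a cheap way to confirm what the default really was.
    previous = kernel32.SetCriticalSectionSpinCount(ctypes.byref(cs), 4000)

    kernel32.DeleteCriticalSection(ctypes.byref(cs))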
I could tell when the spin count wasn't zero as far as performance is concerned - spinning is too much of a gamble in most contention situations. > I don't know what kernel primitive the Critical Section uses, but if it uses an Event object or something similar, we are in the same soup, so to say, because the CriticalSection's spinlocking feature buys us nothing. Judging from the increase in speed and CPU utilization I've seen, I don't believe this is the case. My guess is that it's something similar to a futex. - Phillip From brian.de.alwis at usask.ca Thu May 28 02:02:57 2009 From: brian.de.alwis at usask.ca (Brian de Alwis) Date: Wed, 27 May 2009 18:02:57 -0600 Subject: [Python-Dev] Survey on DVCS usage and experience Message-ID: <75599A84-8A53-4DA4-A022-1BEF4EEDA943@usask.ca> Hello everybody. I'm Brett's former lab-mate, and am part of a team conducting a survey to understand the perceived benefits and challenges of using a decentralized or distributed version control systems (DVCS) in software development. With Python having recently chosen to switch to Mercurial, I hoped that any developers who've used a DVCS (and who are over 18 years old) might like to participate in our survey and share your experiences. (We followed your extensive discussions on the switch with great interest.) Details on partcipating are below. Thanks for your time! Brian. ---------------------------------------------------------------------- An increasing number of software projects have or are considering switching their code repositories to a decentralized or distributed VCS (DVCS). There are many such DVCS tools, including git, bzr, mercurial, monotone, or bitkeeper. We are conducting a survey to assess the perceived benefits and challenges of using a DVCS. We would ask that any individuals who use or are comfortable using a DVCS for managing the artifacts for a project to please consider completing the survey. The survey has several open-ended questions, and may take up to 20 minutes to complete. The data collected from this study will be used in articles for publication in journals and conference proceedings. The results of this study will provide additional knowledge and guidance for projects considering moving to using a DVCS. This is an anonymous survey. Any personal information divulged in answering a question will be kept strictly confidential. The survey is at: http://www.cs.usask.ca/~bsd178/research/dvcs-survey/ Please feel free to redistribute this to other interested groups. If you would like more detail about the survey, or information not included here, please contact us. Brian de Alwis Department of Computer Science University of Saskatchewan brian.de.alwis at usask.ca This research has the ethical approval of the Research Ethics Office at the University of Saskatchewan. If you have any concerns about your treatment or rights as a research subject, please contact the office at 306-966-2084. ---------------------------------------------------------------------- -- Brian de Alwis | HCI Lab | University of Saskatchewan On bike helmets: "If you think your hair is more important than your brain, you're probably right." (B. J. 
Wawrykow) From kristjan at ccpgames.com Thu May 28 11:00:22 2009 From: kristjan at ccpgames.com (=?iso-8859-1?Q?Kristj=E1n_Valur_J=F3nsson?=) Date: Thu, 28 May 2009 09:00:22 +0000 Subject: [Python-Dev] Making the GIL faster & lighter on Windows In-Reply-To: <536685ea0905271522w19b26852pfb67b295673e8358@mail.gmail.com> References: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> <930F189C8A437347B80DF2C156F7EC7F057F03EF73@exchis.ccp.ad.local> <536685ea0905271522w19b26852pfb67b295673e8358@mail.gmail.com> Message-ID: <930F189C8A437347B80DF2C156F7EC7F057F03F17E@exchis.ccp.ad.local> You are right, a small experiment confirmed that it is set to 0 (see SetCriticalSectionSpinCount()) I had assumed that a small non-zero value might be chosen on multiprocessor machines. Do you think that the problem lies with the use of the "event" object as such? Have you tried using a "semaphore" or "mutex" instead? Or do you think that all of the synchronizations primitives that rely on the WaitForMultipleObjects() api are subject to the same issue? Cheers, Kristj?n -----Original Message----- From: python-dev-bounces+kristjan=ccpgames.com at python.org [mailto:python-dev-bounces+kristjan=ccpgames.com at python.org] On Behalf Of Phillip Sitbon Sent: 27. ma? 2009 22:23 To: python-dev Subject: Re: [Python-Dev] Making the GIL faster & lighter on Windows If I'm not mistaken, calling InitializeCriticalSection rather than InitializeCriticalSectionAndSpinCount (gotta love those long function names) sets the spin count to zero. I could tell when the spin count wasn't zero as far as performance is concerned - spinning is too much of a gamble in most contention situations. > I don't know what kernel primitive the Critical Section uses, but if it uses an Event object or something similar, we are in the same soup, so to say, because the CriticalSection's spinlocking feature buys us nothing. Judging from the increase in speed and CPU utilization I've seen, I don't believe this is the case. My guess is that it's something similar to a futex. From jeremy at alum.mit.edu Thu May 28 15:06:03 2009 From: jeremy at alum.mit.edu (Jeremy Hylton) Date: Thu, 28 May 2009 09:06:03 -0400 Subject: [Python-Dev] question about docstring formatting Message-ID: A question came up at work about docstring formatting. It relates to the description of the summary line in PEP 257. http://www.python.org/dev/peps/pep-0257/ """Multi-line docstrings consist of a summary line just like a one-line docstring, followed by a blank line, followed by a more elaborate description. The summary line may be used by automatic indexing tools; it is important that it fits on one line and is separated from the rest of the docstring by a blank line. The summary line may be on the same line as the opening quotes or on the next line. The entire docstring is indented the same as the quotes at its first line (see example below).""" It says that the summary line may be used by automatic indexing tools, but is there any evidence that such a tool actually exists? Or was there once upon a time? If there are no such tools, do we still think that it is important that it fits on line line? 
Jeremy From glyph at divmod.com Thu May 28 15:45:30 2009 From: glyph at divmod.com (glyph at divmod.com) Date: Thu, 28 May 2009 13:45:30 -0000 Subject: [Python-Dev] question about docstring formatting In-Reply-To: References: Message-ID: <20090528134530.12555.1212950071.divmod.xquotient.11621@weber.divmod.com> On 01:06 pm, jeremy at alum.mit.edu wrote: >It says that the summary line may be used by automatic indexing tools, >but is there any evidence that such a tool actually exists? Or was >there once upon a time? If there are no such tools, do we still think >that it is important that it fits on line line? For what it's worth, https://launchpad.net/pydoctor appears to do this, as you can see from the numerous truncated sentences on . I suspect a more reasonable approach for automatic documentation generators would be to try to identify the first complete sentence, rather than the first line... but, this is at least an accurate description of the status quo for some tools :). From goodger at python.org Thu May 28 15:29:25 2009 From: goodger at python.org (David Goodger) Date: Thu, 28 May 2009 09:29:25 -0400 Subject: [Python-Dev] question about docstring formatting In-Reply-To: References: Message-ID: <4335d2c40905280629xf9138a3qb94026841e20eebd@mail.gmail.com> On Thu, May 28, 2009 at 09:06, Jeremy Hylton wrote: > A question came up at work about docstring formatting. ?It relates to > the description of the summary line in PEP 257. > > http://www.python.org/dev/peps/pep-0257/ > """Multi-line docstrings consist of a summary line just like a > one-line docstring, followed by a blank line, followed by a more > elaborate description. The summary line may be used by automatic > indexing tools; it is important that it fits on one line and is > separated from the rest of the docstring by a blank line. The summary > line may be on the same line as the opening quotes or on the next > line. The entire docstring is indented the same as the quotes at its > first line (see example below).""" > > It says that the summary line may be used by automatic indexing tools, > but is there any evidence that such a tool actually exists? ?Or was > there once upon a time? ?If there are no such tools, do we still think > that it is important that it fits on line line? There are several auto-documentation tools out there, like Sphinx and epydoc, and the stdlib's pydoc. Historically there were other tools, like HappyDoc ad Pythondoc. I'm not up on these or other tools, so I don't know if or how that part of PEP 257 applies. The point of the one-line summary was to allow for tooltips and compact tables of contents. Even if there were no supporting tools, I think it is useful to express the intent of a class/method/function in a single line. The process of distilling the description down can, in itself, be illuminating. To imitate the Zen: if the code can't be described in a short sentence, it may be too complicated. I'm not saying that this should be enforced in any way. It's just a guideline. If a tool needs a short summary and the docstring doens't have a one-liner, I'd expect the tool just to take the first line and add ellipsis ("..."). 
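(To make that concrete, here is a minimal sketch of the fallback being described: take the first line, and append an ellipsis when the docstring does not really have a one-line summary. The helper name is my own invention, not something any existing tool provides.)

    import inspect

    def one_line_summary(obj):
        """Best-effort synopsis in the spirit of PEP 257."""
        doc = inspect.getdoc(obj) or ""
        lines = doc.strip().splitlines()
        if not lines:
            return ""
        summary = lines[0].strip()
        # If the first line is not followed by a blank line, there was no
        # real one-liner, so mark the truncation.
        if len(lines) > 1 and lines[1].strip():
            summary += " ..."
        return summary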
-- David Goodger From phd at phd.pp.ru Thu May 28 15:11:55 2009 From: phd at phd.pp.ru (Oleg Broytmann) Date: Thu, 28 May 2009 17:11:55 +0400 Subject: [Python-Dev] question about docstring formatting In-Reply-To: References: Message-ID: <20090528131155.GC27490@phd.pp.ru> On Thu, May 28, 2009 at 09:06:03AM -0400, Jeremy Hylton wrote: > It says that the summary line may be used by automatic indexing tools, > but is there any evidence that such a tool actually exists? epydoc, for one. Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd at phd.pp.ru Programmers don't die, they just GOSUB without RETURN. From ziade.tarek at gmail.com Thu May 28 16:19:41 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Thu, 28 May 2009 16:19:41 +0200 Subject: [Python-Dev] [buildbot] some build slaves in bad shape Message-ID: <94bdd2610905280719r2d2af617r8b8b173a5337a74a@mail.gmail.com> Hello, I've noticed some problems since this morning with the trunk and 3.x stable buildbots: - x86 XP-4 (trunk and 3.x) is throwing a "no space left on device" error when it compiles the sqlite module in its temp dir - amd64 gentoo 3.x and ia64 Ubuntu 3.x buildbot versions seem to be too old to run; they should be upgraded - ppc Debian unstable trunk keeps on failing to connect to svn.python.org Regards Tarek -- Tarek Ziadé | http://ziade.org From rrr at ronadam.com Thu May 28 18:12:52 2009 From: rrr at ronadam.com (Ron Adam) Date: Thu, 28 May 2009 11:12:52 -0500 Subject: [Python-Dev] question about docstring formatting In-Reply-To: References: Message-ID: <4A1EB804.7030605@ronadam.com> Jeremy Hylton wrote: > A question came up at work about docstring formatting. It relates to > the description of the summary line in PEP 257. > > http://www.python.org/dev/peps/pep-0257/ > """Multi-line docstrings consist of a summary line just like a > one-line docstring, followed by a blank line, followed by a more > elaborate description. The summary line may be used by automatic > indexing tools; it is important that it fits on one line and is > separated from the rest of the docstring by a blank line. The summary > line may be on the same line as the opening quotes or on the next > line. The entire docstring is indented the same as the quotes at its > first line (see example below).""" > > It says that the summary line may be used by automatic indexing tools, > but is there any evidence that such a tool actually exists? Or was > there once upon a time? If there are no such tools, do we still think > that it is important that it fits on one line? > > Jeremy Python's own built-in help utility, pydoc, uses it. At the help prompt in the Python console window, type "modules searchkey" to get a list of modules that contain the searchkey in their one-line summary. Running pydoc with the -g option opens a Tkinter search window that searches the summary lines. Selecting from that list then opens the browser to that item.
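(The same machinery is reachable from the shell, and pydoc.synopsis() is the helper that reads just that summary line off a module's source file. The csv example and its path below are only illustrative, not copied from a real session.)

    $ python -m pydoc -k csv     # keyword search over the one-line summaries
    $ python -m pydoc -g         # the Tkinter search window mentioned above

    >>> import pydoc
    >>> pydoc.synopsis('Lib/csv.py')   # hypothetical path to a module's .py file
    'CSV parsing and writing.'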
Ron From phillip.sitbon+python-dev at gmail.com Thu May 28 18:11:17 2009 From: phillip.sitbon+python-dev at gmail.com (Phillip Sitbon) Date: Thu, 28 May 2009 09:11:17 -0700 Subject: [Python-Dev] Making the GIL faster & lighter on Windows In-Reply-To: <930F189C8A437347B80DF2C156F7EC7F057F03F17E@exchis.ccp.ad.local> References: <536685ea0905261248i13728f58ka435d8d0a826a80d@mail.gmail.com> <930F189C8A437347B80DF2C156F7EC7F057F03EF73@exchis.ccp.ad.local> <536685ea0905271522w19b26852pfb67b295673e8358@mail.gmail.com> <930F189C8A437347B80DF2C156F7EC7F057F03F17E@exchis.ccp.ad.local> Message-ID: <536685ea0905280911r1ba77ba5hb191369872a4c073@mail.gmail.com> The testing patch I submitted to the tracker includes a semaphore as well, and I did take some time to try it out. It seems that it's no better than the event object, either for a single thread or scaled to many threads... so this does appear to indicate that the WaitForXX functions are costly (which is expected) and scale terribly (which is unfortunate). I had always believed event objects to be "slower" but I'm not seeing a difference here compared to semaphores. My guess is that these results could be very different if I were to test on, say, Windows 2000 instead of Vista. - Phillip 2009/5/28 Kristj?n Valur J?nsson : > You are right, a small experiment confirmed that it is set to 0 (see SetCriticalSectionSpinCount()) > I had assumed that a small non-zero value might be chosen on multiprocessor machines. > > Do you think that the problem lies with the use of the "event" object as such? ?Have you tried using a "semaphore" or "mutex" instead? ?Or do you think that all of the synchronizations primitives that rely on the WaitForMultipleObjects() api are subject to the same issue? > > Cheers, > > Kristj?n > > -----Original Message----- > From: python-dev-bounces+kristjan=ccpgames.com at python.org [mailto:python-dev-bounces+kristjan=ccpgames.com at python.org] On Behalf Of Phillip Sitbon > Sent: 27. ma? 2009 22:23 > To: python-dev > Subject: Re: [Python-Dev] Making the GIL faster & lighter on Windows > > > If I'm not mistaken, calling InitializeCriticalSection rather than > InitializeCriticalSectionAndSpinCount (gotta love those long function > names) sets the spin count to zero. I could tell when the spin count > wasn't zero as far as performance is concerned - spinning is too much > of a gamble in most contention situations. > >> I don't know what kernel primitive the Critical Section ?uses, but if it uses an Event object or something similar, we are in the same soup, so to say, because the CriticalSection's spinlocking feature buys us nothing. > > Judging from the increase in speed and CPU utilization I've seen, I > don't believe this is the case. My guess is that it's something > similar to a futex. > > > From rrr at ronadam.com Thu May 28 18:12:52 2009 From: rrr at ronadam.com (Ron Adam) Date: Thu, 28 May 2009 11:12:52 -0500 Subject: [Python-Dev] question about docstring formatting In-Reply-To: References: Message-ID: <4A1EB804.7030605@ronadam.com> Jeremy Hylton wrote: > A question came up at work about docstring formatting. It relates to > the description of the summary line in PEP 257. > > http://www.python.org/dev/peps/pep-0257/ > """Multi-line docstrings consist of a summary line just like a > one-line docstring, followed by a blank line, followed by a more > elaborate description. 
The summary line may be used by automatic > indexing tools; it is important that it fits on one line and is > separated from the rest of the docstring by a blank line. The summary > line may be on the same line as the opening quotes or on the next > line. The entire docstring is indented the same as the quotes at its > first line (see example below).""" > > It says that the summary line may be used by automatic indexing tools, > but is there any evidence that such a tool actually exists? Or was > there once upon a time? If there are no such tools, do we still think > that it is important that it fits on line line? > > Jeremy Python's own built in help utility, pydoc uses it. At the help prompt in the python console window, type "modules searchkey" to get a list of modules that contain the searchkey in thier one line summary. Running pydoc with the -g option opens a tkinter search window, that searches the summery lines. Selecting from that list then opens the browser to that item. Ron From db3l.net at gmail.com Thu May 28 23:12:33 2009 From: db3l.net at gmail.com (David Bolen) Date: Thu, 28 May 2009 17:12:33 -0400 Subject: [Python-Dev] [buildbot] some build slaves in bad shape References: <94bdd2610905280719r2d2af617r8b8b173a5337a74a@mail.gmail.com> Message-ID: Tarek Ziad? writes: > - x86 XP-4 (trunk and 3x) is throwing an "no space left on device" > error when it compiles the sqlite module in its temp dir Ooops, that's mine. Geez - it's a VM, but has a 10GB C: drive, and the actual build slave has its working directory on a separate virtual drive. Wonder what the heck has filled up the system drive. I'm working on it now though. -- David From eric at trueblade.com Fri May 29 00:39:11 2009 From: eric at trueblade.com (Eric Smith) Date: Thu, 28 May 2009 18:39:11 -0400 Subject: [Python-Dev] [Python-checkins] r72995 - in python/branches/py3k: Doc/library/contextlib.rst Doc/whatsnew/3.1.rst Lib/contextlib.py Lib/test/test_contextlib.py Misc/NEWS In-Reply-To: <20090528222003.D1096D569@mail.python.org> References: <20090528222003.D1096D569@mail.python.org> Message-ID: <4A1F128F.50409@trueblade.com> raymond.hettinger wrote: > Author: raymond.hettinger > Date: Fri May 29 00:20:03 2009 > New Revision: 72995 > > Log: > Deprecate contextlib.nested(). The with-statement now provides this functionality directly. > > Modified: > python/branches/py3k/Doc/library/contextlib.rst > python/branches/py3k/Doc/whatsnew/3.1.rst > python/branches/py3k/Lib/contextlib.py > python/branches/py3k/Lib/test/test_contextlib.py > python/branches/py3k/Misc/NEWS Shouldn't the test cases exist as long as contextlib.nested still exists? We want to make sure it works, after all. I think they should be removed only when .nested is itself deleted. Eric. From db3l.net at gmail.com Fri May 29 00:39:02 2009 From: db3l.net at gmail.com (David Bolen) Date: Thu, 28 May 2009 18:39:02 -0400 Subject: [Python-Dev] [buildbot] some build slaves in bad shape References: <94bdd2610905280719r2d2af617r8b8b173a5337a74a@mail.gmail.com> Message-ID: David Bolen writes: > Ooops, that's mine. Geez - it's a VM, but has a 10GB C: drive, and > the actual build slave has its working directory on a separate virtual > drive. Wonder what the heck has filled up the system drive. I'm > working on it now though. Well, looks like it was 5+GB of temporary files of some sort. It's cleaned up now and back online. 
-- David From ziade.tarek at gmail.com Fri May 29 00:49:55 2009 From: ziade.tarek at gmail.com (=?ISO-8859-1?Q?Tarek_Ziad=E9?=) Date: Fri, 29 May 2009 00:49:55 +0200 Subject: [Python-Dev] [buildbot] some build slaves in bad shape In-Reply-To: References: <94bdd2610905280719r2d2af617r8b8b173a5337a74a@mail.gmail.com> Message-ID: <94bdd2610905281549u4e2518b1o53e16b4fd98b6c5b@mail.gmail.com> On Fri, May 29, 2009 at 12:39 AM, David Bolen wrote: > David Bolen writes: > >> Ooops, that's mine. ?Geez - it's a VM, but has a 10GB C: drive, and >> the actual build slave has its working directory on a separate virtual >> drive. ?Wonder what the heck has filled up the system drive. ?I'm >> working on it now though. > > Well, looks like it was 5+GB of temporary files of some sort. ?It's > cleaned up now and back online. Thanks that's great From dave at boostpro.com Fri May 29 03:22:45 2009 From: dave at boostpro.com (David Abrahams) Date: Thu, 28 May 2009 21:22:45 -0400 Subject: [Python-Dev] Possibility of binary configuration mismatch Message-ID: Hi All, I'm not sure there's anything you can do about this, but I thought I should alert the Python devs that it can happen... http://allmydata.org/trac/tahoe/ticket/704#comment:7 describes a situation where my macports-installed python25 had a pyOpenSSL egg installed in it by something other than macports (possibly by easy_install-2.5?) that was not compatible with the Python build. My hunch is that the pyOpenSSL had binaries compiled against a UCS4 Python, but I don't know for sure. Whatever did the installation of the bad egg was almost certainly being executed by the macports python25 because macports is installed in /opt/local, and nothing is likely to have installed it under that prefix by chance. In other words, this egg probably couldn't have been left over from some non-macports python installation. In fact, I haven't had any other version of Python2.5 installed on this machine. Very odd. I wonder if it makes sense to enhance the extension module system to record this kind of information so the problem can be diagnosed by the system? -- Dave Abrahams BoostPro Computing http://www.boostpro.com From ben+python at benfinney.id.au Fri May 29 04:41:04 2009 From: ben+python at benfinney.id.au (Ben Finney) Date: Fri, 29 May 2009 12:41:04 +1000 Subject: [Python-Dev] question about docstring formatting References: <4335d2c40905280629xf9138a3qb94026841e20eebd@mail.gmail.com> Message-ID: <87ljogwrzz.fsf@benfinney.id.au> David Goodger writes: > Even if there were no supporting tools, I think it is useful to > express the intent of a class/method/function in a single line. The > process of distilling the description down can, in itself, be > illuminating. To imitate the Zen: if the code can't be described in a > short sentence, it may be too complicated. Absolutely. If you can't describe what the (function, class, module) does succinctly in a single line, how on earth are you going to choose an appropriate short-but-descriptive name for it? This constraint is well worth keeping, for exactly the reasons David says above. > I'm not saying that this should be enforced in any way. It's just a > guideline. If a tool needs a short summary and the docstring doens't > have a one-liner, I'd expect the tool just to take the first line and > add ellipsis ("..."). Which in itself would be annoying enough to apply social pressure from others to get the synopsis into a single line ? 
so again, I approve :-) -- \ "Men never do evil so completely and cheerfully as when they do | `\ it from religious conviction." --Blaise Pascal (1623-1662), | _o__) Pensées, #894. | Ben Finney From orsenthil at gmail.com Fri May 29 05:35:08 2009 From: orsenthil at gmail.com (Senthil Kumaran) Date: Fri, 29 May 2009 09:05:08 +0530 Subject: [Python-Dev] Survey on DVCS usage and experience Message-ID: <20090529033508.GA4463@ubuntu.ubuntu-domain> On Wed, May 27, 2009 at 06:02:57PM -0600, Brian de Alwis wrote: > With Python having recently chosen to switch to Mercurial, I hoped > that any developers who've used a DVCS (and who are over 18 years > old) might like to participate in our survey and share your Just curious. Why is this age restriction? You might miss out on a few key developers... -- Senthil From ideasman42 at gmail.com Fri May 29 06:07:02 2009 From: ideasman42 at gmail.com (Campbell Barton) Date: Thu, 28 May 2009 21:07:02 -0700 Subject: [Python-Dev] C/Python API Index removed? Message-ID: <7c1ab96d0905282107r58ee393di79a9b7109f86191e@mail.gmail.com> This page used to give an index of the C/Python API functions too: http://docs.python.org/genindex-all.html But a week or so ago I noticed all these functions are now missing (I remember they existed in the 2.6.1 docs). Was this intentional? Quite a while ago (~2.5) the C/API docs had their own index, which personally I prefer: http://docs.python.org/c-api/index.html This page is called an index, but I'm looking for a page like http://docs.python.org/genindex-all.html which includes all C/API function names. Is this the right place to mail such problems? Thanks -- - Campbell From ideasman42 at gmail.com Fri May 29 07:05:51 2009 From: ideasman42 at gmail.com (Campbell Barton) Date: Thu, 28 May 2009 22:05:51 -0700 Subject: [Python-Dev] Warnings when no file exists. Message-ID: <7c1ab96d0905282205o612e03e8vcbfd9183efebcbc@mail.gmail.com> Hi, there has been a problem in blender3d for 6~ years or so that's eluded me; I decided to look into it today. - Whenever a script raises a warning, Python prints out binary garbage in the console. Some users complain that when they run Python games in Blender they get beeps coming from the PC speaker. It turns out that _warnings.c's setup_context() is taking the first value of argv (line 534 in 2.6.2), which in our case is the blender binary; then some part of the binary is printed to the console. Apart from the beeps and not being helpful, this also can mess up the console's state - a bit like "cat /dev/random" might. But the real problem is that warnings expect a file to exist; in Blender we have our own internal texts that don't have a corresponding file on disk, so setting __file__ in the global dict will just point to a location that doesn't exist. It surprises me that warnings do this, since exceptions work as expected, printing useful stack traces from our built-in texts. In case this helps, the scripts are converted into a buffer and run like this... text->compiled = Py_CompileString( buf, text->id.name+2, Py_file_input ); PyEval_EvalCode( text->compiled, globaldict, globaldict ); Does anyone know of a workaround for this? I'm sure there are other cases where you may want to run compiled code that isn't related to a file.
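(One workaround that may be worth a try, sketched here in pure Python rather than the embedding C; the names "MyText" and the sample source are stand-ins for Blender's text blocks, not anything real. As far as I can tell the warnings machinery only uses __file__ for display and for a linecache lookup, so the file does not actually have to exist on disk, and you can even seed linecache by hand so the formatted warning or traceback can still show the offending line.)

    import linecache

    name = "MyText"                      # stand-in for text->id.name+2
    source = 'import warnings\nwarnings.warn("careful!")\n'

    code = compile(source, name, "exec")
    globaldict = {
        "__name__": name,  # a non-"__main__" name sidesteps the sys.argv[0] fallback
        "__file__": name,  # need not point at a real file
    }

    # Optional: register the in-memory source with linecache so tools that
    # format warnings and tracebacks can still find the source line.
    linecache.cache[name] = (len(source), None, source.splitlines(True), name)

    exec(code, globaldict)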
-- - Campbell From ben+python at benfinney.id.au Fri May 29 07:30:45 2009 From: ben+python at benfinney.id.au (Ben Finney) Date: Fri, 29 May 2009 15:30:45 +1000 Subject: [Python-Dev] Survey on DVCS usage and experience References: <20090529033508.GA4463@ubuntu.ubuntu-domain> Message-ID: <87hbz4wk56.fsf@benfinney.id.au> Senthil Kumaran writes: > On Wed, May 27, 2009 at 06:02:57PM -0600, Brian de Alwis wrote: > > > With Python having recently chosen to switch to Mercurial, I hoped > > that any developers who've used a DVCS (and who are over 18 years > > old) might like to participate in our survey and share your > > Just curious. Why is this age restriction? You might miss out few > key developers... I would guess because they need adult consent in order to legally use the survey results as evidence in whatever psychological/sociological study they perform. -- \ ?No matter how far down the wrong road you've gone, turn back.? | `\ ?Turkish proverb | _o__) | Ben Finney From brian.de.alwis at usask.ca Fri May 29 09:13:34 2009 From: brian.de.alwis at usask.ca (Brian de Alwis) Date: Fri, 29 May 2009 01:13:34 -0600 Subject: [Python-Dev] Survey on DVCS usage and experience In-Reply-To: <20090529033508.GA4463@ubuntu.ubuntu-domain> References: <20090529033508.GA4463@ubuntu.ubuntu-domain> Message-ID: <4CE22B25-BC10-40FB-83B3-449E894E8FAB@usask.ca> On 28-May-09, at 9:35 PM, Senthil Kumaran wrote: > On Wed, May 27, 2009 at 06:02:57PM -0600, Brian de Alwis wrote: > >> With Python having recently chosen to switch to Mercurial, I hoped >> that any developers who've used a DVCS (and who are over 18 years >> old) might like to participate in our survey and share your > > Just curious. Why is this age restriction? You might miss out few > key developers... It's a restriction required to obtain approval from our research ethics board -- people under 18 are considered to be minors in Canada and thus require the consent of their guardian to participate. Trying to obtain such permission for an anonymous survey is a bit difficult! Although we could work around this guardian-consent issue in theory, doing so would require jumping through several additional hoops in the ethics process and would take significantly more time. Brian. From p.f.moore at gmail.com Fri May 29 12:57:31 2009 From: p.f.moore at gmail.com (Paul Moore) Date: Fri, 29 May 2009 11:57:31 +0100 Subject: [Python-Dev] Possibility of binary configuration mismatch In-Reply-To: References: Message-ID: <79990c6b0905290357v6bdaab3amd4d4c001fccdb3e4@mail.gmail.com> 2009/5/29 David Abrahams : > http://allmydata.org/trac/tahoe/ticket/704#comment:7 describes a > situation where my macports-installed python25 had a pyOpenSSL egg > installed in it by something other than macports (possibly by > easy_install-2.5?) that was not compatible with the Python build. ?My > hunch is that the pyOpenSSL had binaries compiled against a UCS4 Python, > but I don't know for sure. ?Whatever did the installation of the bad egg > was almost certainly being executed by the macports python25 because > macports is installed in /opt/local, and nothing is likely to have > installed it under that prefix by chance. ?In other words, this egg > probably couldn't have been left over from some non-macports python > installation. ?In fact, I haven't had any other version of Python2.5 > installed on this machine. ?Very odd. > > I wonder if it makes sense to enhance the extension module system to > record this kind of information so the problem can be diagnosed by the > system? 
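(A quick check that narrows this kind of hunch down, as a two-line sketch with nothing macports-specific about it: run it under the interpreter that owns the site-packages directory in question and compare against what the extension was built for.)

    import sys
    from distutils.sysconfig import get_config_var

    # 65535 means a narrow (UCS-2) build, 1114111 a wide (UCS-4) build.
    print sys.maxunicode
    # On Unix builds this reports the configured Py_UNICODE_SIZE (2 or 4).
    print get_config_var('Py_UNICODE_SIZE')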
I have a feeling that this has been discussed before, in the context of easy_install/setuptools' approach to encoding the build details for a binary package in the filename, not covering UCS4 vs UCS2. You may find it useful to search on the distutils-sig archives for further information. Paul. From solipsis at pitrou.net Fri May 29 13:14:28 2009 From: solipsis at pitrou.net (Antoine Pitrou) Date: Fri, 29 May 2009 11:14:28 +0000 (UTC) Subject: [Python-Dev] Survey on DVCS usage and experience References: <20090529033508.GA4463@ubuntu.ubuntu-domain> <4CE22B25-BC10-40FB-83B3-449E894E8FAB@usask.ca> Message-ID: Brian de Alwis usask.ca> writes: > > It's a restriction required to obtain approval from our research > ethics board -- people under 18 are considered to be minors in Canada > and thus require the consent of their guardian to participate. Trying > to obtain such permission for an anonymous survey is a bit difficult! But since your survey is anonymous, you can't be sure all the responders are over 18. Actually, they might even not be human beings! (hint: I'm not) Regards Antoine. From status at bugs.python.org Fri May 29 18:08:04 2009 From: status at bugs.python.org (Python tracker) Date: Fri, 29 May 2009 18:08:04 +0200 (CEST) Subject: [Python-Dev] Summary of Python tracker Issues Message-ID: <20090529160804.42CB178604@psf.upfronthosting.co.za> ACTIVITY SUMMARY (05/22/09 - 05/29/09) Python tracker at http://bugs.python.org/ To view or respond to any of the issues listed below, click on the issue number. Do NOT respond to this message. 2201 open (+36) / 15764 closed (+18) / 17965 total (+54) Open issues with patches: 866 Average duration of open issues: 652 days. Median duration of open issues: 400 days. Open Issues Breakdown open 2175 (+36) pending 26 ( +0) Issues Created Or Reopened (55) _______________________________ improved allocation of PyUnicode objects 05/24/09 http://bugs.python.org/issue1943 reopened pitrou patch str.format raises SystemError 05/22/09 CLOSED http://bugs.python.org/issue6089 created eggy zipfile DeprecationWarning Python 2.6.2 05/22/09 http://bugs.python.org/issue6090 created ivb Curses segfaulting in FreeBSD/amd64 05/23/09 http://bugs.python.org/issue6091 created themoken Changed Shortcuts don't show up in menu 05/23/09 http://bugs.python.org/issue6092 created jamesie Ambiguous locale.strxfrm 05/23/09 CLOSED http://bugs.python.org/issue6093 created tuves Python fails to build with Subversion 1.7 05/23/09 CLOSED http://bugs.python.org/issue6094 created Arfrever patch os.curdir as the default argument for os.listdir 05/23/09 http://bugs.python.org/issue6095 created tarek SimpleXMLRPCServer not suitable for HTTP/1.1 keep-alive 05/24/09 http://bugs.python.org/issue6096 created krisvale patch, patch, easy, needs review Encoded surrogate characters on command line not escaped in sys. 
05/24/09 http://bugs.python.org/issue6097 created baikie patch xml.dom.minidom incorrectly claims DOM Level 3 conformance 05/24/09 http://bugs.python.org/issue6098 created phihag patch HTTP/1.1 with keep-alive support for xmlrpclib.ServerProxy 05/24/09 http://bugs.python.org/issue6099 created krisvale patch, patch, needs review Expanding arrays inside other arrays 05/24/09 http://bugs.python.org/issue6100 created marek_sp SETUP_WITH 05/24/09 CLOSED http://bugs.python.org/issue6101 created benjamin.peterson patch When the package has non-ascii path and .pyc file, we cannot imp 05/25/09 http://bugs.python.org/issue6102 created Suzumizaki Static library (libpythonX.Y.a) installed in incorrect location 05/25/09 http://bugs.python.org/issue6103 created Arfrever patch OSX framework builds fail after r72861 move of _locale into core 05/25/09 CLOSED http://bugs.python.org/issue6104 created nad json.dumps doesn't respect OrderedDict's iteration order 05/25/09 http://bugs.python.org/issue6105 created wangchun read_until 05/25/09 http://bugs.python.org/issue6106 created ps Subprocess.Popen output fails on Windows 05/26/09 CLOSED http://bugs.python.org/issue6107 created ac.james unicode(exception) behaves differently on Py2.6 when len(excepti 05/26/09 http://bugs.python.org/issue6108 created ezio.melotti IDLE rendering issue with oriental characters on OSX 05/26/09 http://bugs.python.org/issue6109 created ronaldoussoren IDLE has two "Preferences..." menu's on OSX 05/26/09 CLOSED http://bugs.python.org/issue6110 created ronaldoussoren Impossible to change preferences in IDLE 05/26/09 CLOSED http://bugs.python.org/issue6111 created ronaldoussoren scheduler.cancel does not raise RuntimeError 05/26/09 CLOSED http://bugs.python.org/issue6112 created fidlej Dupicate instances of classes in list 05/26/09 CLOSED http://bugs.python.org/issue6113 reopened mbaynham distutils build_ext path comparison only based on strings 05/26/09 http://bugs.python.org/issue6114 created sleipnir patch Header and doc related to PyNumber_Divide and PyNumber_InPlaceDi 05/26/09 CLOSED http://bugs.python.org/issue6115 created bhy patch frame.f_locals keeps references to things for too long 05/26/09 http://bugs.python.org/issue6116 created exarkun Fix O(n**2) performance problem in socket._fileobject 05/26/09 http://bugs.python.org/issue6117 created krisvale patch, patch, easy, needs review urllib.parse.quote_plus ignores optional arguments 05/26/09 CLOSED http://bugs.python.org/issue6118 created mgiuca patch Confusing DeprecationWarning 05/26/09 http://bugs.python.org/issue6119 created alejolp zipfile.ZipFile's extractall works inproperly under Windows 05/27/09 CLOSED http://bugs.python.org/issue6120 created aerodonkey help('modules ') causes IndexError. 
05/27/09 CLOSED http://bugs.python.org/issue6121 created July patch, easy OSError: [Errno 10] No child processes 05/27/09 http://bugs.python.org/issue6122 created yonas tarfile: opening an empty tar file fails 05/27/09 http://bugs.python.org/issue6123 created evanj patch Tkinter should support the OS X zoom button 05/27/09 http://bugs.python.org/issue6124 created culler 2to3 mishandles "from module_name import" when module_name inclu 05/27/09 CLOSED http://bugs.python.org/issue6125 created MLModel Python 3 pdb: shows internal code, breakpoints don't work 05/27/09 http://bugs.python.org/issue6126 created ericp Unexpected universal newline behavior (newline duplication) in W 05/27/09 http://bugs.python.org/issue6127 created jaraco Consequences of using Py_TPFLAGS_HAVE_GC are incompletely explai 05/27/09 http://bugs.python.org/issue6128 created exarkun 2to3 does not convert imports of the form 'import sub.mod' to re 05/27/09 CLOSED http://bugs.python.org/issue6129 reopened MLModel There ought to be a way for extension types to associate documen 05/27/09 http://bugs.python.org/issue6130 created exarkun test_modulefinder leaks when run after test_distutils 05/27/09 CLOSED http://bugs.python.org/issue6131 created pitrou patch Implement the GIL with critical sections in Windows 05/27/09 http://bugs.python.org/issue6132 created sitbon patch LOAD_CONST followed by LOAD_ATTR can be optimized to just be a L 05/27/09 http://bugs.python.org/issue6133 created alex patch 2to3 tests fail on Windows due to line endings 05/28/09 CLOSED http://bugs.python.org/issue6134 created abbeyj patch subprocess seems to use local 8-bit encoding and gives no choice 05/28/09 http://bugs.python.org/issue6135 created mark Make logging configuration files easier to use 05/28/09 http://bugs.python.org/issue6136 created gjb1002 Pickle migration: Should pickle map "copy_reg" to "copyreg"? 05/28/09 http://bugs.python.org/issue6137 created mkiever './configure; make install' fails in setup.py step if .pydistuti 05/28/09 http://bugs.python.org/issue6138 created r.david.murray Typo in email.base64mime 05/28/09 http://bugs.python.org/issue6139 created ocean-city patch configure error: shadow.h: present but cannot be compiled 05/29/09 http://bugs.python.org/issue6140 created Sashi missing first argument on subprocess.Popen w/ executable 05/29/09 http://bugs.python.org/issue6141 created lieryan patch Distutils doesn't remove .pyc files 05/29/09 http://bugs.python.org/issue6142 created purpleidea patch Issues Now Closed (53) ______________________ Test issue 635 days http://bugs.python.org/issue1064 dtuser2 patch Return from fork() is pid_t, not int 478 days http://bugs.python.org/issue1983 pitrou patch pkg-config support 280 days http://bugs.python.org/issue3585 pitrou patch, needs review test_fileio fails on OpenBSD 4.4 249 days http://bugs.python.org/issue3877 pitrou patch ignored exceptions in generators (regression?) 
236 days http://bugs.python.org/issue4040 doughellmann smtplib SMTP_SSL._get_socket doesn't return a value 228 days http://bugs.python.org/issue4066 r.david.murray patch Py_Object_HEAD_INIT in Py3k 183 days http://bugs.python.org/issue4385 georg.brandl Issue with RotatingFileHandler logging handler on Windows 147 days http://bugs.python.org/issue4749 rcronk pwd, spwd, grp functions vulnerable to denial of service 143 days http://bugs.python.org/issue4859 loewis patch time.ctime docs refer to "time tuple" for default 116 days http://bugs.python.org/issue5079 georg.brandl IDLE to support reindent.py 114 days http://bugs.python.org/issue5150 rhettinger smtplib is broken in Python3 103 days http://bugs.python.org/issue5259 r.david.murray patch, easy StringIO can duplicate newlines in universal newlines mode 102 days http://bugs.python.org/issue5265 jaraco OS X installer: fix makefile target changed for 3.x 101 days http://bugs.python.org/issue5272 ronaldoussoren sys.exc_info()[1] - different handling from str() and unicode() 102 days http://bugs.python.org/issue5274 georg.brandl OS X Installer: by default install versioned-only links in /usr/ 55 days http://bugs.python.org/issue5653 ronaldoussoren Speed up pickling of dicts in cPickle 53 days http://bugs.python.org/issue5670 pitrou patch, needs review idle pydoc et al removed from 3.1 without versioned replacements 3 days http://bugs.python.org/issue5756 nad add file name to py3k IO objects repr() 38 days http://bugs.python.org/issue5761 pitrou patch pickle/cPickle of recursive tuples create pickles that cPickle c 37 days http://bugs.python.org/issue5794 collinwinter patch, easy, 26backport there is en exception om Create User page 33 days http://bugs.python.org/issue5797 georg.brandl cPickle defect with tuples and different from pickle output 27 days http://bugs.python.org/issue5866 collinwinter classmethod, staticmethod: expose wrapped function 19 days http://bugs.python.org/issue5982 rhettinger patch enhance getargs O& to accept cleanup function 16 days http://bugs.python.org/issue6012 loewis patch test_distutils leaves a 'foo' file behind in the cwd 11 days http://bugs.python.org/issue6022 rpetrov "install" target in python 3.x makefile should be "fullinstall" 6 days http://bugs.python.org/issue6047 benjamin.peterson make distutils use the tarinfo command 11 days http://bugs.python.org/issue6048 tarek zipfile: Extracting a directory that already exists generates an 7 days http://bugs.python.org/issue6050 loewis patch PYTHONHOME should be more flexible (and controllable by --libdir 5 days http://bugs.python.org/issue6060 loewis bdist_msi.py failed assert when including extension modules 5 days http://bugs.python.org/issue6065 loewis patch threading.Timer and gtk.main are not compatible 7 days http://bugs.python.org/issue6073 amaury.forgeotdarc freeze.py doesn't work 1 days http://bugs.python.org/issue6078 georg.brandl SyntaxError in xmlrpc.client examples 1 days http://bugs.python.org/issue6079 georg.brandl str.format raises SystemError 1 days http://bugs.python.org/issue6089 eric.smith Ambiguous locale.strxfrm 0 days http://bugs.python.org/issue6093 loewis Python fails to build with Subversion 1.7 1 days http://bugs.python.org/issue6094 Arfrever patch SETUP_WITH 1 days http://bugs.python.org/issue6101 benjamin.peterson patch OSX framework builds fail after r72861 move of _locale into core 0 days http://bugs.python.org/issue6104 benjamin.peterson Subprocess.Popen output fails on Windows 1 days http://bugs.python.org/issue6107 ac.james IDLE 
has two "Preferences..." menu's on OSX 0 days http://bugs.python.org/issue6110 ronaldoussoren Impossible to change preferences in IDLE 0 days http://bugs.python.org/issue6111 nad scheduler.cancel does not raise RuntimeError 0 days http://bugs.python.org/issue6112 georg.brandl Dupicate instances of classes in list 0 days http://bugs.python.org/issue6113 mbaynham Header and doc related to PyNumber_Divide and PyNumber_InPlaceDi 0 days http://bugs.python.org/issue6115 georg.brandl patch urllib.parse.quote_plus ignores optional arguments 0 days http://bugs.python.org/issue6118 georg.brandl patch zipfile.ZipFile's extractall works inproperly under Windows 0 days http://bugs.python.org/issue6120 ocean-city help('modules ') causes IndexError. 1 days http://bugs.python.org/issue6121 r.david.murray patch, easy 2to3 mishandles "from module_name import" when module_name inclu 0 days http://bugs.python.org/issue6125 r.david.murray 2to3 does not convert imports of the form 'import sub.mod' to re 0 days http://bugs.python.org/issue6129 benjamin.peterson test_modulefinder leaks when run after test_distutils 2 days http://bugs.python.org/issue6131 ocean-city patch 2to3 tests fail on Windows due to line endings 1 days http://bugs.python.org/issue6134 benjamin.peterson patch os.listdir on empty strings. Inconsistent behaviour. 2063 days http://bugs.python.org/issue818059 benjamin.peterson patch, needs review Make fcntl work properly on AMD64 1332 days http://bugs.python.org/issue1309352 pitrou patch Top Issues Most Discussed (10) ______________________________ 13 Dupicate instances of classes in list 0 days closed http://bugs.python.org/issue6113 10 improved allocation of PyUnicode objects 5 days open http://bugs.python.org/issue1943 9 .pyc files created readonly if .py file is readonly, python won 9 days open http://bugs.python.org/issue6074 9 Python 2.6 makes .pyc/.pyo bytecode files executable 9 days open http://bugs.python.org/issue6070 8 LOAD_CONST followed by LOAD_ATTR can be optimized to just be a 2 days open http://bugs.python.org/issue6133 8 test_modulefinder leaks when run after test_distutils 2 days closed http://bugs.python.org/issue6131 8 OSError: [Errno 10] No child processes 2 days open http://bugs.python.org/issue6122 7 Impossible to change preferences in IDLE 0 days closed http://bugs.python.org/issue6111 6 zipfile.ZipFile's extractall works inproperly under Windows 0 days closed http://bugs.python.org/issue6120 6 SETUP_WITH 1 days closed http://bugs.python.org/issue6101 From dinov at microsoft.com Sat May 30 02:08:46 2009 From: dinov at microsoft.com (Dino Viehland) Date: Sat, 30 May 2009 00:08:46 +0000 Subject: [Python-Dev] Indentation oddness... Message-ID: <1A472770E042064698CB5ADC83A12ACD02915C13@TK5EX14MBXC118.redmond.corp.microsoft.com> Consider the code: code = "def Foo():\n\n pass\n\n " This code is malformed in that the final indentation (2 spaces) does not agree with the previous indentation of the pass statement (4 spaces). Or maybe it's just fine if you take the blank lines should be ignored statement from the docs to be true. So let's look at different ways I can consume this code. If I use compile to compile this: compile(code, 'foo', 'single') I get an IndentationError: unindent does not match any outer indentation level But if I put this in a file: f= file('indenttest.py', 'w') f.write(code) f.close() import indenttest It imports just fine. 
If I run it through the tokenize module it also tokenizes just fine: >>> import tokenize >>> from cStringIO import StringIO >>> tokenize.tokenize(StringIO(code).readline) 1,0-1,3: NAME 'def' 1,5-1,8: NAME 'Foo' 1,8-1,9: OP '(' 1,9-1,10: OP ')' 1,10-1,11: OP ':' 1,11-1,12: NEWLINE '\n' 2,0-2,1: NL '\n' 3,0-3,4: INDENT ' ' 3,4-3,8: NAME 'pass' 3,8-3,9: NEWLINE '\n' 4,0-4,1: NL '\n' 5,0-5,0: DEDENT '' 5,0-5,0: ENDMARKER '' And if it fails anywhere it would seem tokenization is where it should fail - especially given that tokenize.py seems to report this error on other occasions. And stranger still if I add a new line then it will even compile fine: compile(code + '\n', 'foo', 'single') Which seems strange because in either case all of the trailing lines are blank lines and as such should basically be ignored according to the documentation. Is there some strange reason why compile rejects what everything else agrees is perfectly valid code? From robert.kern at gmail.com Sat May 30 02:26:22 2009 From: robert.kern at gmail.com (Robert Kern) Date: Fri, 29 May 2009 19:26:22 -0500 Subject: [Python-Dev] Indentation oddness... In-Reply-To: <1A472770E042064698CB5ADC83A12ACD02915C13@TK5EX14MBXC118.redmond.corp.microsoft.com> References: <1A472770E042064698CB5ADC83A12ACD02915C13@TK5EX14MBXC118.redmond.corp.microsoft.com> Message-ID: On 2009-05-29 19:08, Dino Viehland wrote: > Consider the code: > > code = "def Foo():\n\n pass\n\n " > > This code is malformed in that the final indentation (2 spaces) does not agree with the previous indentation of the pass statement (4 spaces). Or maybe it's just fine if you take the blank lines should be ignored statement from the docs to be true.
So let's look at different ways I can consume this code. > > If I use compile to compile this: > > compile(code, 'foo', 'single') > > I get an IndentationError: unindent does not match any outer indentation level > > But if I put this in a file: > > f= file('indenttest.py', 'w') > f.write(code) > f.close() > import indenttest > > It imports just fine. The 'single' mode, which is used for the REPL, is a bit different than 'exec', which is used for modules. This difference lets you insert "blank" lines of whitespace into a function definition without exiting the definition. Ending with a truly empty line does not cause the IndentationError, so the REPL can successfully compile the code, signaling that the user has finished typing the function. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco From dinov at microsoft.com Sat May 30 02:52:33 2009 From: dinov at microsoft.com (Dino Viehland) Date: Sat, 30 May 2009 00:52:33 +0000 Subject: [Python-Dev] Indentation oddness... In-Reply-To: References: <1A472770E042064698CB5ADC83A12ACD02915C13@TK5EX14MBXC118.redmond.corp.microsoft.com> Message-ID: <1A472770E042064698CB5ADC83A12ACD0294ADCA@TK5EX14MBXC118.redmond.corp.microsoft.com> > The 'single' mode, which is used for the REPL, is a bit different than > 'exec', > which is used for modules. This difference lets you insert "blank" > lines of > whitespace into a function definition without exiting the definition. > Ending > with a truly empty line does not cause the IndentationError, so the > REPL can > successfully compile the code, signaling that the user has finished > typing the > function. Sorry, I probably should have mentioned this but it repros w/ compile(..., "exec") as well: >>> code = "def Foo():\n\n pass\n\n " >>> compile(code, 'foo', 'exec') Traceback (most recent call last): File "", line 1, in File "foo", line 5 IndentationError: unindent does not match any outer indentation level It also repros when passing in PyCF_DONT_IMPLY_DEDENT for flags under single and exec. From guido at python.org Sat May 30 04:19:34 2009 From: guido at python.org (Guido van Rossum) Date: Fri, 29 May 2009 19:19:34 -0700 Subject: [Python-Dev] Indentation oddness... In-Reply-To: <1A472770E042064698CB5ADC83A12ACD0294ADCA@TK5EX14MBXC118.redmond.corp.microsoft.com> References: <1A472770E042064698CB5ADC83A12ACD02915C13@TK5EX14MBXC118.redmond.corp.microsoft.com> <1A472770E042064698CB5ADC83A12ACD0294ADCA@TK5EX14MBXC118.redmond.corp.microsoft.com> Message-ID: I usually append some extra newlines before passing a string to compile(). That's the usual work-around. There's probably a subtle bug in the tokenizer when reading from a string -- if you find it, please upload a patch to the tracker! --Guido On Fri, May 29, 2009 at 5:52 PM, Dino Viehland wrote: >> The 'single' mode, which is used for the REPL, is a bit different than >> 'exec', >> which is used for modules. This difference lets you insert "blank" >> lines of >> whitespace into a function definition without exiting the definition. >> Ending >> with a truly empty line does not cause the IndentationError, so the >> REPL can >> successfully compile the code, signaling that the user has finished >> typing the >> function. > > Sorry, I probably should have mentioned this but it repros w/ > compile(..., "exec") as well: > >>>> code = "def ?Foo():\n\n ? ?pass\n\n ?" 
>>>> compile(code, 'foo', 'exec') > Traceback (most recent call last): > ?File "", line 1, in > ?File "foo", line 5 > > IndentationError: unindent does not match any outer indentation level > > It also repros when passing in PyCF_DONT_IMPLY_DEDENT for flags under > single and exec. > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > http://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: http://mail.python.org/mailman/options/python-dev/guido%40python.org > -- --Guido van Rossum (home page: http://www.python.org/~guido/) From dinov at microsoft.com Sat May 30 17:35:44 2009 From: dinov at microsoft.com (Dino Viehland) Date: Sat, 30 May 2009 15:35:44 +0000 Subject: [Python-Dev] Indentation oddness... In-Reply-To: References: <1A472770E042064698CB5ADC83A12ACD02915C13@TK5EX14MBXC118.redmond.corp.microsoft.com> <1A472770E042064698CB5ADC83A12ACD0294ADCA@TK5EX14MBXC118.redmond.corp.microsoft.com> Message-ID: <1A472770E042064698CB5ADC83A12ACD029521FD@TK5EX14MBXC118.redmond.corp.microsoft.com> Unfortunately my problem is the opposite one - trying to emulate what compile does for IronPython rather than just trying to make some code compile. So adding newlines doesn't help me. But this case isn't really that important - it was just a wacky corner case I ran into while trying to get other behavior right. I think I can safely ignore this one especially if it's just a bug. > -----Original Message----- > From: gvanrossum at gmail.com [mailto:gvanrossum at gmail.com] On Behalf Of > Guido van Rossum > Sent: Friday, May 29, 2009 7:20 PM > To: Dino Viehland > Cc: Robert Kern; python-dev at python.org > Subject: Re: [Python-Dev] Indentation oddness... > > I usually append some extra newlines before passing a string to > compile(). That's the usual work-around. There's probably a subtle bug > in the tokenizer when reading from a string -- if you find it, please > upload a patch to the tracker! > > --Guido > > On Fri, May 29, 2009 at 5:52 PM, Dino Viehland > wrote: > >> The 'single' mode, which is used for the REPL, is a bit different > than > >> 'exec', > >> which is used for modules. This difference lets you insert "blank" > >> lines of > >> whitespace into a function definition without exiting the definition. > >> Ending > >> with a truly empty line does not cause the IndentationError, so the > >> REPL can > >> successfully compile the code, signaling that the user has finished > >> typing the > >> function. > > > > Sorry, I probably should have mentioned this but it repros w/ > > compile(..., "exec") as well: > > > >>>> code = "def Foo():\n\n pass\n\n " > >>>> compile(code, 'foo', 'exec') > > Traceback (most recent call last): > > File "", line 1, in > > File "foo", line 5 > > > > IndentationError: unindent does not match any outer indentation level > > > > It also repros when passing in PyCF_DONT_IMPLY_DEDENT for flags under > > single and exec. 
> > _______________________________________________
> > Python-Dev mailing list
> > Python-Dev at python.org
> > http://mail.python.org/mailman/listinfo/python-dev
> > Unsubscribe: http://mail.python.org/mailman/options/python-
> dev/guido%40python.org
> >
> >
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)

From benjamin at python.org Sat May 30 20:04:35 2009 From: benjamin at python.org (Benjamin Peterson) Date: Sat, 30 May 2009 13:04:35 -0500 Subject: [Python-Dev] [RELEASED] Python 3.1 Release Candidate 1 Message-ID: <1afaf6160905301104w203b5a76u5d9909942f91ecb4@mail.gmail.com>

On behalf of the Python development team, I'm happy to announce the first release candidate of Python 3.1.

Python 3.1 focuses on the stabilization and optimization of the features and changes that Python 3.0 introduced. For example, the new I/O system has been rewritten in C for speed. File system APIs that use unicode strings now handle paths with undecodable bytes in them. Other features include an ordered dictionary implementation, a condensed syntax for nested with statements, and support for ttk Tile in Tkinter. For a more extensive list of changes in 3.1, see http://doc.python.org/dev/py3k/whatsnew/3.1.html or Misc/NEWS in the Python distribution.

This is a release candidate, and as such, we do not recommend use in production environments. However, please take this opportunity to test the release with your libraries or applications. This will hopefully discover bugs before the final release and allow you to determine how changes in 3.1 might impact you.

If you find things broken or incorrect, please submit a bug report at

    http://bugs.python.org

For more information and downloadable distributions, see the Python 3.1 website:

    http://www.python.org/download/releases/3.1/

See PEP 375 for release schedule details:

    http://www.python.org/dev/peps/pep-0375/

Enjoy,

-- Benjamin

Benjamin Peterson
benjamin at python.org
Release Manager
(on behalf of the entire python-dev team and 3.1's contributors)

From carmstr3 at illinois.edu Sat May 30 20:02:04 2009 From: carmstr3 at illinois.edu (carmstr3 at illinois.edu) Date: Sat, 30 May 2009 13:02:04 -0500 (CDT) Subject: [Python-Dev] looking for some people to talk with about Python development Message-ID: <20090530130204.BOO92267@expms3.cites.uiuc.edu>

Hello,

My name is Chandler Armstrong and I'm investigating environments of collaboration. I'm a PhD candidate at the University of Illinois, Urbana-Champaign, specialized in internet research and science & technology studies. I'm generally interested in development methods overall, and specifically interested in both artificial language construction and evolution, and collaboration in open-source models.

I would like to talk to some members of the Python development community about what kinds of activities they do within it. If anybody is interested in this, please email me at carmstr3 at illinois.edu. I will send you a document that describes the research and interview in more detail. I'd like to do a voice interview over skype or a phone, but I can accommodate an online chat or even email.

I have some current research on this specific mailing list which is more quantitative in nature. I downloaded the entire mailing list from the archives. Next I looked through all the python-dev summaries and used links provided to referenced threads to indicate that a particular message or thread was meaningful in development.
I characterized the mailing list as threads, describing each instance with about 30 attributes (things like the number of posts, the depth of the tree, a measure of 'branchyness' of the thread, the standard deviation of post counts across posters, the hour/day/month of the thread, etc). Using these attributes I attempted to classify, using logistic regression, the threads that were indicated as meaningful in the python-dev summaries. There are some significant results. If anyone is interested I can send you my results, or even post them here to the list. I'll be presenting my results at the Classification Society Conference in St. Louis in June. The work is unpublished at the moment but I hope to find a journal for it this summer.

I used Python exclusively for all that quantitative work: downloading the mailing list and going through all the summaries, opening the links and matching the referenced message to the correct one in my downloaded database, and cleaning and transforming data. It was a ton of fun. I hope to develop more scripts for other sorts of automated analysis.

At any rate, please contact me if you'd like to contribute to my current tack of investigation. I would ultimately want to interview however many people are willing to talk with me. I need to do about two in the next couple of weeks, and I would get with other volunteers in the weeks after that.

Thanks,
Chandler Armstrong
carmstr3 at illinois.edu

From nnorwitz at gmail.com Sat May 30 20:54:09 2009 From: nnorwitz at gmail.com (Neal Norwitz) Date: Sat, 30 May 2009 11:54:09 -0700 Subject: [Python-Dev] cleanup before 3.1 is released Message-ID:

Has anyone run valgrind/purify and pychecker/pylint on the 3.1 code recently? Both sets of tools should be used before the final release so we can fix any obvious problems.

n

From g.brandl at gmx.net Sat May 30 22:43:19 2009 From: g.brandl at gmx.net (Georg Brandl) Date: Sat, 30 May 2009 22:43:19 +0200 Subject: [Python-Dev] cleanup before 3.1 is released In-Reply-To: References: Message-ID:

Neal Norwitz schrieb:
> Has anyone run valgrind/purify and pychecker/pylint on the 3.1 code
> recently? Both sets of tools should be used before the final release
> so we can fix any obvious problems.

Do pychecker/pylint work on 3.x code?

Gerog

--
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four.
Tabs are right out.

From wojtek.gminick.walczak at gmail.com Sun May 31 01:35:38 2009 From: wojtek.gminick.walczak at gmail.com (Wojciech Walczak) Date: Sun, 31 May 2009 01:35:38 +0200 Subject: [Python-Dev] [Sphinx] GSoC project announcement Message-ID: <2c3c21060905301635h7d2c195bsbd30057156a8b413@mail.gmail.com>

Hi, guys,

just a short introduction of one of this year's GSoC PSF projects: I am implementing support for per-paragraph comments and a user/developer interface for submitting/committing fixes in Sphinx[1].

In case you are interested in adding your 2 cents (or more) by commenting on my application[2] or proposing some enhancements - feel free to do so on sphinx-dev[3]. Or take a look at my blog to keep up to date[4].

[1] - http://sphinx.pocoo.org/
[2] - http://tosh.pl/gminick/gsoc/sphinx/
[3] - http://groups.google.com/group/sphinx-dev
[4] - http://gminick.wordpress.com

Best regards,

--
Wojtek Walczak
http://tosh.pl/gminick/

From greg.ewing at canterbury.ac.nz Sun May 31 03:21:15 2009 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sun, 31 May 2009 13:21:15 +1200 Subject: [Python-Dev] Survey on DVCS usage and experience In-Reply-To: References: <20090529033508.GA4463@ubuntu.ubuntu-domain> <4CE22B25-BC10-40FB-83B3-449E894E8FAB@usask.ca> Message-ID: <4A21DB8B.7070301@canterbury.ac.nz>

Antoine Pitrou wrote:

> you can't be sure all the responders are
> over 18. Actually, they might even not be human beings!
> (hint: I'm not)

Not over 18, or not a human being?

--
Greg

From greg.ewing at canterbury.ac.nz Sun May 31 04:02:41 2009 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sun, 31 May 2009 14:02:41 +1200 Subject: [Python-Dev] Indentation oddness... In-Reply-To: References: <1A472770E042064698CB5ADC83A12ACD02915C13@TK5EX14MBXC118.redmond.corp.microsoft.com> Message-ID: <4A21E541.1020003@canterbury.ac.nz>

Robert Kern wrote:

> The 'single' mode, which is used for the REPL, is a bit different than
> 'exec', which is used for modules. This difference lets you insert
> "blank" lines of whitespace into a function definition without exiting
> the definition.

All that means is that the REPL needs to keep reading lines until it gets a completely blank one. I don't see why the compiler has to treat the source any differently once the REPL has decided how much text to feed it.

--
Greg
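
For reference, a minimal sketch of the behavior discussed in the indentation thread above (assumes 2.6-era CPython; the trailing whitespace in the string, the padded-string work-around, and the file-based import are taken from the messages in the thread rather than independently verified):

# Sketch only: assumes the behavior reported in this thread.
code = "def Foo():\n\n    pass\n\n  "    # the last "line" is two spaces at end-of-file

try:
    compile(code, 'foo', 'exec')         # reported above to raise IndentationError
except IndentationError:
    pass                                 # "unindent does not match any outer indentation level"

# Guido's work-around: pad the string so the whitespace-only line is no longer
# the very last thing the tokenizer sees at end-of-file.
compile(code + "\n\n", 'foo', 'exec')    # compiles cleanly

# Writing the identical text to a file and importing it also succeeds, because
# the file-reading path of the tokenizer handles the trailing line differently.
f = open('indenttest.py', 'w')
f.write(code)
f.close()
import indenttest                        # assumes the current directory is on sys.path
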