From hi at shiz.me  Fri Aug 1 02:59:41 2014
From: hi at shiz.me (Shiz)
Date: Fri, 1 Aug 2014 02:59:41 +0200
Subject: [Python-Dev] Exposing the Android platform existence to Python modules
Message-ID: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me>

Hi folks,

I'm working on porting CPython to the Android platform, and while making decent progress, I'm currently stuck at a higher-level issue than adding #ifdefs for __ANDROID__ to C extension modules.

The idea is, not only do CPython extension modules have some assumptions that don't seem to fit Android's mold; some default Python-written modules do as well. However, whereas CPython extensions can trivially check if we're building for Android by checking the __ANDROID__ compiler macro, Python modules can do no such check, and are left wondering how to figure out if the platform they are currently running on is an Android one. To my knowledge there is no other reliable way to detect whether one is running on Android.

Now, the main question is: what would be the best way to "expose" the indication that Android is being run to Python-level modules? My own thought was to add sys.getlinuxuserland(), or platform.linux_userland(), in a similar vein to sys.getwindowsversion() and platform.linux_distribution(), which could return information about the userland of the running CPython instance, instead of knowing merely the kernel and the distribution.

This way, code could trivially check if it ran on the GNU(+associates) userland, or under a BSD-ish userland, or Android... and adjust its behaviour accordingly.

I would be delighted to hear comments on this proposal, or better yet, alternative solutions. :)

Kind regards,
Shiz

P.S.: I am well aware that Android might as well never be officially supported in CPython. In that case, consider this a thought experiment of how it /would/ be handled. :)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 1495 bytes
Desc: Message signed with OpenPGP using GPGMail
URL:

From v+python at g.nevcal.com  Fri Aug 1 03:54:53 2014
From: v+python at g.nevcal.com (Glenn Linderman)
Date: Thu, 31 Jul 2014 18:54:53 -0700
Subject: [Python-Dev] Exposing the Android platform existence to Python modules
In-Reply-To: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me>
References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me>
Message-ID: <53DAF36D.2050406@g.nevcal.com>

On 7/31/2014 5:59 PM, Shiz wrote:
> Hi folks,
>
> I'm working on porting CPython to the Android platform, and while making decent progress, I'm currently stuck at a higher-level issue than adding #ifdefs for __ANDROID__ to C extension modules.
>
> The idea is, not only do CPython extension modules have some assumptions that don't seem to fit Android's mold; some default Python-written modules do as well. However, whereas CPython extensions can trivially check if we're building for Android by checking the __ANDROID__ compiler macro, Python modules can do no such check, and are left wondering how to figure out if the platform they are currently running on is an Android one. To my knowledge there is no other reliable way to detect whether one is running on Android.
>
> Now, the main question is: what would be the best way to "expose" the indication that Android is being run to Python-level modules? My own thought was to add sys.getlinuxuserland(), or platform.linux_userland(), in a similar vein to sys.getwindowsversion() and platform.linux_distribution(), which could return information about the userland of the running CPython instance, instead of knowing merely the kernel and the distribution.

I've no idea what you mean by "userland" in your suggestions above or below, but doesn't the Android environment qualify as a (multi-versioned) platform independently of its host OS?
Seems I've read about an Android reimplementation for Windows, for example. As long as all the services expected by Android are faithfully produced, the host OS may be irrelevant to an Android application... in which case, I would think/propose/suggest the platform name should change from win32 or linux to Android (and the Android version be reflected in version parts).

> This way, code could trivially check if it ran on the GNU(+associates) userland, or under a BSD-ish userland, or Android... and adjust its behaviour accordingly.
>
> I would be delighted to hear comments on this proposal, or better yet, alternative solutions. :)
>
> Kind regards,
> Shiz
>
> P.S.: I am well aware that Android might as well never be officially supported in CPython. In that case, consider this a thought experiment of how it /would/ be handled. :)

Is your P.S. suggestive that you would not be willing to support your port for use by others? Of course, until it is somewhat complete, it is hard to know how complete and compatible it can be.

From p.f.moore at gmail.com  Fri Aug 1 08:46:21 2014
From: p.f.moore at gmail.com (Paul Moore)
Date: Fri, 1 Aug 2014 07:46:21 +0100
Subject: [Python-Dev] Exposing the Android platform existence to Python modules
In-Reply-To: <53DAF36D.2050406@g.nevcal.com>
References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <53DAF36D.2050406@g.nevcal.com>
Message-ID:

On 1 August 2014 02:54, Glenn Linderman wrote:
> I've no idea what you mean by "userland" in your suggestions above or below, but doesn't the Android environment qualify as a (multi-versioned) platform independently of its host OS? Seems I've read about an Android reimplementation for Windows, for example. As long as all the services expected by Android are faithfully produced, the host OS may be irrelevant to an Android application...
> in which case, I would think/propose/suggest the platform name should change from win32 or linux to Android (and the Android version be reflected in version parts).

Alternatively, if having sys.platform be "linux" makes portability easier because code that does a platform check generally gets the right answer if Android reports as "linux", then why not make sys.linux_distribution report "android"?

To put it briefly, either android is the platform, or android is a specific distribution of the linux platform.

Paul

From hi at shiz.me  Fri Aug 1 14:23:17 2014
From: hi at shiz.me (Shiz)
Date: Fri, 1 Aug 2014 14:23:17 +0200
Subject: [Python-Dev] Exposing the Android platform existence to Python modules
In-Reply-To: <53DAF36D.2050406@g.nevcal.com>
References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <53DAF36D.2050406@g.nevcal.com>
Message-ID:

On 01 Aug 2014, at 03:54, Glenn Linderman wrote:
> I've no idea what you mean by "userland" in your suggestions above or below, but doesn't the Android environment qualify as a (multi-versioned) platform independently of its host OS? Seems I've read about an Android reimplementation for Windows, for example. As long as all the services expected by Android are faithfully produced, the host OS may be irrelevant to an Android application... in which case, I would think/propose/suggest the platform name should change from win32 or linux to Android (and the Android version be reflected in version parts).

That might be a way to look at it. So far I assumed that the Android environment would be largely Linux-based, since the Android NDK (Native Development Kit, the SDK used for creating C/C++-level applications) is used for my patch, which gives a GNU-ish toolchain with a Linux/Unixy environment. I know an implementation exists that claims to run Android on top of an NT kernel, but I honestly have little idea of how it works. Given how a fair amount of things 'already work'
with the platform set to linux, I'm not sure if changing sys.platform would be a good idea... but that's from my NDK perspective.

> Is your P.S. suggestive that you would not be willing to support your port for use by others? Of course, until it is somewhat complete, it is hard to know how complete and compatible it can be.

Oh, no, nothing like that. It's just that I'm not sure, as goes for anything, that it would be accepted into mainline CPython. Better safe than sorry in that aspect: maybe the maintainers don't want to support Android in the first place. :)

Kind regards,
Shiz

From mark at xn--hwg34fba.ws  Fri Aug 1 14:32:48 2014
From: mark at xn--hwg34fba.ws (Shiz)
Date: Fri, 1 Aug 2014 14:32:48 +0200
Subject: [Python-Dev] Exposing the Android platform existence to Python modules
In-Reply-To:
References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <53DAF36D.2050406@g.nevcal.com>
Message-ID: <3DB38541-BA43-43AE-B3BF-D958886944D9@xn--hwg34fba.ws>

> On 1 August 2014 02:54, Glenn Linderman wrote:
>
> Alternatively, if having sys.platform be "linux" makes portability easier because code that does a platform check generally gets the right answer if Android reports as "linux", then why not make sys.linux_distribution report "android"?
>
> To put it briefly, either android is the platform, or android is a specific distribution of the linux platform.
>
> Paul

That might maybe work better. I was assuming a userland perspective because I've honestly been mostly wrestling with Bionic, Android's libc, but putting that into perspective to consider Android as a whole (after all, the SDK and NDK are what make Android for a lot of developers) might be a valid other approach as well.
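A check along the lines Paul suggests might look like this sketch. To be clear, this is hypothetical: it assumes platform.linux_distribution() were taught to report 'android' on Android builds, which no Python release actually does, and the try/except merely guards against the function being unavailable:

```python
import platform
import sys


def is_android():
    """Hypothetical Android check via the distribution name.

    Assumes platform.linux_distribution() were made to report
    'android' on Android builds -- an assumption, not current
    CPython behaviour.
    """
    if not sys.platform.startswith("linux"):
        return False
    try:
        distro = platform.linux_distribution()[0]
    except AttributeError:
        # Guard: linux_distribution() may be missing entirely
        # (it was later deprecated and removed from the stdlib).
        return False
    return distro.lower() == "android"
```

The appeal of this shape is that existing `sys.platform == 'linux'` checks keep working, and only code that cares about the userland needs the finer check.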
Kind regards,
Shiz

From status at bugs.python.org  Fri Aug 1 18:08:08 2014
From: status at bugs.python.org (Python tracker)
Date: Fri, 1 Aug 2014 18:08:08 +0200 (CEST)
Subject: [Python-Dev] Summary of Python tracker Issues
Message-ID: <20140801160808.8B200561A1@psf.upfronthosting.co.za>

ACTIVITY SUMMARY (2014-07-25 - 2014-08-01)
Python tracker at http://bugs.python.org/

To view or respond to any of the issues listed below, click on the issue.
Do NOT respond to this message.

Issue counts and deltas:
  open    4592 ( +1)
  closed 29297 (+49)
  total  33889 (+50)
Open issues with patches: 2163

Issues opened (34)
==================

#11271: concurrent.futures.ProcessPoolExecutor.map() doesn't batch fun
http://bugs.python.org/issue11271 reopened by pitrou
#22063: asyncio: sock_xxx() methods of event loops should check ath so
http://bugs.python.org/issue22063 reopened by haypo
#22069: TextIOWrapper(newline="\n", line_buffering=True) mistakenly tr
http://bugs.python.org/issue22069 opened by akira
#22070: Use the _functools module to speed up functools.total_ordering
http://bugs.python.org/issue22070 opened by ncoghlan
#22071: Remove long-time deprecated attributes from smtpd
http://bugs.python.org/issue22071 opened by zvyn
#22077: Improve the error message for various sequences
http://bugs.python.org/issue22077 opened by Claudiu.Popa
#22079: Ensure in PyType_Ready() that base class of static type is sta
http://bugs.python.org/issue22079 opened by serhiy.storchaka
#22080: Add windows_helper module helper
http://bugs.python.org/issue22080 opened by Claudiu.Popa
#22083: Refactor PyShell's breakpoint related methods
http://bugs.python.org/issue22083 opened by sahutd
#22086: Tab indent no longer works in interpreter
http://bugs.python.org/issue22086 opened by Azendale
#22087:
_UnixDefaultEventLoopPolicy should either create a new loop or
http://bugs.python.org/issue22087 opened by dan.oreilly
#22088: base64 module still ignores non-alphabet characters
http://bugs.python.org/issue22088 opened by Julian
#22090: Decimal and float formatting treat '%' differently for infinit
http://bugs.python.org/issue22090 opened by mark.dickinson
#22091: __debug__ in compile(optimize=1)
http://bugs.python.org/issue22091 opened by arigo
#22092: Executing some tests inside Lib/unittest/test individually thr
http://bugs.python.org/issue22092 opened by vajrasky
#22093: Compiling python on OS X gives warning about compact unwind
http://bugs.python.org/issue22093 opened by vajrasky
#22094: oss_audio_device.write(data) produces short writes
http://bugs.python.org/issue22094 opened by akira
#22095: Use of set_tunnel with default port results in incorrect post
http://bugs.python.org/issue22095 opened by demian.brecht
#22097: Linked list API for ordereddict
http://bugs.python.org/issue22097 opened by pitrou
#22098: Behavior of Structure inconsistent with BigEndianStructure whe
http://bugs.python.org/issue22098 opened by Florian.Dold
#22100: Use $HOSTPYTHON when determining candidate interpreter for $PY
http://bugs.python.org/issue22100 opened by shiz
#22102: Zipfile generates Zipfile error in zip with 0 total number of
http://bugs.python.org/issue22102 opened by Guillaume.Carre
#22103: bdist_wininst does not run install script
http://bugs.python.org/issue22103 opened by mb_
#22104: test_asyncio unstable in refleak mode
http://bugs.python.org/issue22104 opened by pitrou
#22105: Hang during File "Save As"
http://bugs.python.org/issue22105 opened by Joe
#22107: tempfile module misinterprets access denied error on Windows
http://bugs.python.org/issue22107 opened by rupole
#22110: enable extra compilation warnings
http://bugs.python.org/issue22110 opened by neologix
#22112: '_UnixSelectorEventLoop' object has no attribute 'create_task'
http://bugs.python.org/issue22112
opened by pydanny
#22113: memoryview and struct.pack_into
http://bugs.python.org/issue22113 opened by stangelandcl
#22114: You cannot call communicate() safely after receiving an except
http://bugs.python.org/issue22114 opened by amrith
#22115: Add new methods to trace Tkinter variables
http://bugs.python.org/issue22115 opened by serhiy.storchaka
#22116: Weak reference support for C function objects
http://bugs.python.org/issue22116 opened by pitrou
#22117: Rewrite pytime.h to work on nanoseconds
http://bugs.python.org/issue22117 opened by haypo
#22118: urljoin fails with messy relative URLs
http://bugs.python.org/issue22118 opened by Mike.Lissner

Most recent 15 issues with no replies (15)
==========================================

#22116: Weak reference support for C function objects
http://bugs.python.org/issue22116
#22115: Add new methods to trace Tkinter variables
http://bugs.python.org/issue22115
#22107: tempfile module misinterprets access denied error on Windows
http://bugs.python.org/issue22107
#22105: Hang during File "Save As"
http://bugs.python.org/issue22105
#22103: bdist_wininst does not run install script
http://bugs.python.org/issue22103
#22102: Zipfile generates Zipfile error in zip with 0 total number of
http://bugs.python.org/issue22102
#22098: Behavior of Structure inconsistent with BigEndianStructure whe
http://bugs.python.org/issue22098
#22095: Use of set_tunnel with default port results in incorrect post
http://bugs.python.org/issue22095
#22092: Executing some tests inside Lib/unittest/test individually thr
http://bugs.python.org/issue22092
#22088: base64 module still ignores non-alphabet characters
http://bugs.python.org/issue22088
#22086: Tab indent no longer works in interpreter
http://bugs.python.org/issue22086
#22083: Refactor PyShell's breakpoint related methods
http://bugs.python.org/issue22083
#22080: Add windows_helper module helper
http://bugs.python.org/issue22080
#22077: Improve the error message for various sequences
http://bugs.python.org/issue22077
#22071: Remove long-time deprecated attributes from smtpd
http://bugs.python.org/issue22071

Most recent 15 issues waiting for review (15)
=============================================

#22117: Rewrite pytime.h to work on nanoseconds
http://bugs.python.org/issue22117
#22115: Add new methods to trace Tkinter variables
http://bugs.python.org/issue22115
#22110: enable extra compilation warnings
http://bugs.python.org/issue22110
#22104: test_asyncio unstable in refleak mode
http://bugs.python.org/issue22104
#22100: Use $HOSTPYTHON when determining candidate interpreter for $PY
http://bugs.python.org/issue22100
#22097: Linked list API for ordereddict
http://bugs.python.org/issue22097
#22095: Use of set_tunnel with default port results in incorrect post
http://bugs.python.org/issue22095
#22092: Executing some tests inside Lib/unittest/test individually thr
http://bugs.python.org/issue22092
#22087: _UnixDefaultEventLoopPolicy should either create a new loop or
http://bugs.python.org/issue22087
#22083: Refactor PyShell's breakpoint related methods
http://bugs.python.org/issue22083
#22080: Add windows_helper module helper
http://bugs.python.org/issue22080
#22077: Improve the error message for various sequences
http://bugs.python.org/issue22077
#22071: Remove long-time deprecated attributes from smtpd
http://bugs.python.org/issue22071
#22068: tkinter: avoid reference loops with Variables and Fonts
http://bugs.python.org/issue22068
#22065: Update turtledemo menu creation
http://bugs.python.org/issue22065

Top 10 most discussed issues (10)
=================================

#21308: PEP 466: backport ssl changes
http://bugs.python.org/issue21308 13 msgs
#22097: Linked list API for ordereddict
http://bugs.python.org/issue22097 13 msgs
#22114: You cannot call communicate() safely after receiving an except
http://bugs.python.org/issue22114 9 msgs
#9529: Make re match object iterable
http://bugs.python.org/issue9529 8 msgs
#15986: memoryview: expose
'buf' attribute
http://bugs.python.org/issue15986 8 msgs
#20170: Derby #1: Convert 137 sites to Argument Clinic in Modules/posi
http://bugs.python.org/issue20170 8 msgs
#21933: Allow the user to change font sizes with the text pane of turt
http://bugs.python.org/issue21933 8 msgs
#22087: _UnixDefaultEventLoopPolicy should either create a new loop or
http://bugs.python.org/issue22087 8 msgs
#17620: Python interactive console doesn't use sys.stdin for input
http://bugs.python.org/issue17620 7 msgs
#18174: Make regrtest with --huntrleaks check for fd leaks
http://bugs.python.org/issue18174 7 msgs

Issues closed (49)
==================

#11969: Can't launch multiproccessing.Process on methods
http://bugs.python.org/issue11969 closed by pitrou
#11990: redirected output - stdout writes newline as \n in windows
http://bugs.python.org/issue11990 closed by haypo
#15152: test_subprocess failures on awfully slow builtbots
http://bugs.python.org/issue15152 closed by neologix
#15398: intermittence on UnicodeFileTests.test_rename at test_pep277 o
http://bugs.python.org/issue15398 closed by ned.deily
#16005: smtplib.SMTP().sendmail() and rset()
http://bugs.python.org/issue16005 closed by r.david.murray
#16383: Python 3.3 Permission Error with User Library on Windows
http://bugs.python.org/issue16383 closed by zach.ware
#17172: Add turtledemo to IDLE menu
http://bugs.python.org/issue17172 closed by terry.reedy
#17371: Mismatch between Python 3.3 build environment and distutils co
http://bugs.python.org/issue17371 closed by loewis
#17634: Win32: shutil.copy leaks file handles to child processes
http://bugs.python.org/issue17634 closed by haypo
#18395: Make _Py_char2wchar() and _Py_wchar2char() public
http://bugs.python.org/issue18395 closed by haypo
#19612: test_subprocess: sporadic failure of test_communicate_epipe()
http://bugs.python.org/issue19612 closed by haypo
#19875: test_getsockaddrarg occasional failure
http://bugs.python.org/issue19875 closed by neologix
#19923: OSError:
[Errno 512] Unknown error 512 in test_multiprocessing
http://bugs.python.org/issue19923 closed by neologix
#20093: Wrong OSError message from os.rename() when dst is a non-empty
http://bugs.python.org/issue20093 closed by doko
#20466: Example in Doc/extending/embedding.rst fails to compile cleanl
http://bugs.python.org/issue20466 closed by zach.ware
#21580: PhotoImage(data=...) apparently has to be UTF-8 or Base-64 enc
http://bugs.python.org/issue21580 closed by serhiy.storchaka
#21591: "exec(a, b, c)" not the same as "exec a in b, c" in nested fun
http://bugs.python.org/issue21591 closed by djc
#21704: _multiprocessing module builds incorrectly when POSIX semaphor
http://bugs.python.org/issue21704 closed by Arfrever
#21867: Turtle returns TypeError when undobuffer is set to 0 (aka no u
http://bugs.python.org/issue21867 closed by berker.peksag
#21958: Allow python 2.7 to compile with Visual Studio 2013
http://bugs.python.org/issue21958 closed by zach.ware
#21990: saxutils defines an inner class where a normal one would do
http://bugs.python.org/issue21990 closed by rhettinger
#22003: BytesIO copy-on-write
http://bugs.python.org/issue22003 closed by pitrou
#22018: signal.set_wakeup_fd() should accept sockets on Windows
http://bugs.python.org/issue22018 closed by haypo
#22023: PyUnicode_FromFormat is broken on python 2
http://bugs.python.org/issue22023 closed by haypo
#22033: Subclass friendly reprs
http://bugs.python.org/issue22033 closed by serhiy.storchaka
#22041: http POST request with python 3.3 through web proxy
http://bugs.python.org/issue22041 closed by ned.deily
#22044: Premature Py_DECREF while generating a TypeError in call_tzinf
http://bugs.python.org/issue22044 closed by rhettinger
#22054: Add os.get_blocking() and os.set_blocking() functions
http://bugs.python.org/issue22054 closed by haypo
#22058: datetime.datetime() should accept a datetime.date as init para
http://bugs.python.org/issue22058 closed by rhettinger
#22066: subprocess.communicate() does
not receive full output from the
http://bugs.python.org/issue22066 closed by ezio.melotti
#22072: Fix typos in SSL's documentation
http://bugs.python.org/issue22072 closed by python-dev
#22073: Reference links in PEP466 are broken
http://bugs.python.org/issue22073 closed by ned.deily
#22074: Lib/test/make_ssl_certs.py fails with NameError
http://bugs.python.org/issue22074 closed by pitrou
#22075: Lambda, Enumerate and List comprehensions crash
http://bugs.python.org/issue22075 closed by ned.deily
#22076: csv module bad grammar in exception message
http://bugs.python.org/issue22076 closed by berker.peksag
#22078: io.BufferedReader hides ResourceWarnings when garbage collecte
http://bugs.python.org/issue22078 closed by serhiy.storchaka
#22081: Backport repr(socket.socket) from Python 3.5 to Python 2.7
http://bugs.python.org/issue22081 closed by haypo
#22082: Clear interned strings listed in slotdefs
http://bugs.python.org/issue22082 closed by loewis
#22084: Mutating while iterating
http://bugs.python.org/issue22084 closed by ncoghlan
#22085: Drop support of Tk 8.3
http://bugs.python.org/issue22085 closed by serhiy.storchaka
#22089: collections.MutableSet does not provide update method
http://bugs.python.org/issue22089 closed by rhettinger
#22096: Argument Clinic: add ability to specify an existing impl funct
http://bugs.python.org/issue22096 closed by zach.ware
#22099: Two "Save As" Windows
http://bugs.python.org/issue22099 closed by ned.deily
#22101: collections.abc.Set doesn't provide copy() method
http://bugs.python.org/issue22101 closed by rhettinger
#22106: Python 2 docs 'control flow/pass' section contains bad example
http://bugs.python.org/issue22106 closed by rhettinger
#22108: python c api wchar_t*/char* passing contradiction
http://bugs.python.org/issue22108 closed by loewis
#22109: Python failing in markupsafe module when running ansible
http://bugs.python.org/issue22109 closed by r.david.murray
#22111: Improve imaplib testsuite.
http://bugs.python.org/issue22111 closed by pitrou
#1508864: threading.Timer/timeouts break on change of win32 local time
http://bugs.python.org/issue1508864 closed by haypo

From cf.natali at gmail.com  Fri Aug 1 19:49:52 2014
From: cf.natali at gmail.com (Charles-François Natali)
Date: Fri, 1 Aug 2014 18:49:52 +0100
Subject: [Python-Dev] Exposing the Android platform existence to Python modules
In-Reply-To:
References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <53DAF36D.2050406@g.nevcal.com>
Message-ID:

2014-08-01 13:23 GMT+01:00 Shiz:
>
>> Is your P.S. suggestive that you would not be willing to support your port for use by others? Of course, until it is somewhat complete, it is hard to know how complete and compatible it can be.
>
> Oh, no, nothing like that. It's just that I'm not sure, as goes for anything, that it would be accepted into mainline CPython. Better safe than sorry in that aspect: maybe the maintainers don't want to support Android in the first place. :)

Well, Android is so popular that supporting it would definitely be interesting. There are a couple of questions, however (I'm not familiar at all with Android; I don't have a smartphone ;-):

- Do you have an idea of the amount of work/patch size required? Do you have an example of a patch (even if it's a work-in-progress)?
- Is there really a common Android platform? I've heard a lot about fragmentation, so would we have to support several Android flavours (like #ifdef __ANDROID_VENDOR_A__, #elif defined __ANDROID_VENDOR_B__)?
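Absent an official API like the one proposed in this thread, code in the wild today tends to fall back on sniffing Android's environment at runtime. A rough sketch follows; the markers it checks (the ANDROID_ROOT/ANDROID_DATA environment variables and the /system/build.prop file) are conventions observed on devices, an assumption rather than any guaranteed contract:

```python
import os


def looks_like_android():
    """Best-effort Android heuristic.

    The markers below are conventions observed on Android devices,
    not a guaranteed contract; a non-Android system could in
    principle set them, and a future Android could drop them.
    """
    # Android's init conventionally exports these variables...
    env_markers = ("ANDROID_ROOT" in os.environ
                   and "ANDROID_DATA" in os.environ)
    # ...and system images conventionally ship this property file.
    file_marker = os.path.exists("/system/build.prop")
    return env_markers or file_marker
```

The fragility of heuristics like this one is exactly the argument for exposing the information properly from the interpreter instead.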
From hi at shiz.me  Fri Aug 1 20:09:30 2014
From: hi at shiz.me (Shiz)
Date: Fri, 01 Aug 2014 20:09:30 +0200
Subject: [Python-Dev] Exposing the Android platform existence to Python modules
In-Reply-To:
References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <53DAF36D.2050406@g.nevcal.com>
Message-ID: <53DBD7DA.90903@shiz.me>

Charles-François Natali wrote:
> Well, Android is so popular that supporting it would definitely be interesting. There are a couple of questions, however (I'm not familiar at all with Android; I don't have a smartphone ;-):
> - Do you have an idea of the amount of work/patch size required? Do you have an example of a patch (even if it's a work-in-progress)?
> - Is there really a common Android platform? I've heard a lot about fragmentation, so would we have to support several Android flavours (like #ifdef __ANDROID_VENDOR_A__, #elif defined __ANDROID_VENDOR_B__)?

Absolutely! I maintain a public repository of patches against CPython v3.3.3 at [1]. They are divided into three large patches: one fixes some issues I encountered with CPython's build system for cross-compilation, one fixes Android/Bionic's numerous locale issues (locale.h/langinfo.h support in Android is basically a set of stub functions that return NULL), and the last one is a set of 'misc' fixes for things that affect Android, mainly smaller things like missing fields in struct passwd and the like.

With those patches, CPython 3.3.3 will cross-compile to and run on at least my own Android device, a Moto G running Android 4.4.2. What's left is to fix the numerous regression test failures and their causes. I documented some of my findings at [2]. :)

As far as Android fragmentation goes, to my knowledge that mainly refers to fragmentation at two levels: the Android versions numerous devices run tend to differ greatly, and the screen sizes, resolutions and aspect ratios vary greatly.
Obviously the latter is a problem beyond the scope of CPython, but the former could lead to some issues. Luckily, however, the NDK[3], the SDK of choice for C/C++-level applications, is fairly unified and expected to be used for pretty much all Android devices. Essentially there should be only the NDK for CPython to target, with a variety of NDK versions to support depending on which versions of Android CPython chooses to support. So far I've only been testing against NDK r9c, so I'm honestly not all that familiar with the changes different NDK versions bring, but from what I heard, releases up until NDK r10 were mostly toolchain updates and header additions for new Android versions. I'd dare say that the vast, vast majority of Android devices out there are running on the same base, namely AOSP[4] with numerous vendor fixes/drivers/additions, and that custom Android distributions would try not to break NDK compatibility.

Kind regards,
Shiz

[1]: https://github.com/rave-engine/python3-android/tree/master/src
[2]: https://github.com/rave-engine/python3-android/issues/3
[3]: https://developer.android.com/tools/sdk/ndk/index.html
[4]: https://source.android.com/
From agriff at tin.it  Fri Aug 1 23:48:37 2014
From: agriff at tin.it (Andrea Griffini)
Date: Fri, 1 Aug 2014 23:48:37 +0200
Subject: [Python-Dev] sum(...) limitation
Message-ID:

help(sum) tells clearly that it should be used to sum numbers and not strings, and with strings actually fails.

However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.

Is this to be considered a bug?

Andrea

From guido at python.org  Fri Aug 1 23:51:54 2014
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Aug 2014 14:51:54 -0700
Subject: [Python-Dev] sum(...) limitation
In-Reply-To:
References:
Message-ID:

No. We just can't put all possible use cases in the docstring. :-)

On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini wrote:
> help(sum) tells clearly that it should be used to sum numbers and not strings, and with strings actually fails.
>
> However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists.
>
> Is this to be considered a bug?
>
> Andrea
>
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: https://mail.python.org/mailman/options/python-dev/guido%40python.org

--
--Guido van Rossum (python.org/~guido)

From 4kir4.1i at gmail.com  Sat Aug 2 03:53:45 2014
From: 4kir4.1i at gmail.com (Akira Li)
Date: Sat, 02 Aug 2014 05:53:45 +0400
Subject: [Python-Dev] Exposing the Android platform existence to Python modules
References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me>
Message-ID: <87bns37j2u.fsf@gmail.com>

Shiz writes:
> Hi folks,
>
> I'm working on porting CPython to the Android platform, and while making decent progress, I'm currently stuck at a higher-level issue than adding #ifdefs for __ANDROID__ to C extension modules.
>
> The idea is, not only do CPython extension modules have some assumptions that don't seem to fit Android's mold; some default Python-written modules do as well. However, whereas CPython extensions can trivially check if we're building for Android by checking the __ANDROID__ compiler macro, Python modules can do no such check, and are left wondering how to figure out if the platform they are currently running on is an Android one. To my knowledge there is no other reliable way to detect whether one is running on Android.
>
> Now, the main question is: what would be the best way to "expose" the indication that Android is being run to Python-level modules? My own thought was to add sys.getlinuxuserland(), or platform.linux_userland(), in a similar vein to sys.getwindowsversion() and platform.linux_distribution(), which could return information about the userland of the running CPython instance, instead of knowing merely the kernel and the distribution.
> > This way, code could trivially check if it ran on the GNU(+associates) > userland, or under a BSD-ish userland, or Android... and adjust its > behaviour accordingly. > > I would be delighted to hear comments on this proposal, or better yet, > alternative solutions. :) > > Kind regards, > Shiz > > P.S.: I am well aware that Android might as well never be officially > supported in CPython. In that case, consider this a thought experiment > of how it /would/ be handled. :)

Python uses os.name, sys.platform, and various functions from `platform` module to provide version info:

- coarse: os.name is 'posix', 'nt', 'ce', 'java' [1]. It is defined by availability of some builtin modules ('posix', 'nt' in particular) at import time.
- finer: sys.platform may start with freebsd, linux, win, cygwin, darwin (`uname -s`). It is defined at python build time.
- detailed: `platform` module. It provides as much info as possible e.g., platform.uname(), platform.platform(). It may use runtime commands to get it.

If Android is posixy enough (would `posix` module work on Android?) then os.name could be left 'posix'. You could set sys.platform to 'android' (like sys.platform may be 'cygwin' on Windows) if Android is not like *any other* Linux distribution (from the point of view of writing a working Python code on it) i.e., if Android is further from other Linux distributions than freebsd, linux, darwin from each other then it might deserve sys.platform slot.
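The three levels above can be inspected directly. A minimal sketch; the values in the comments are examples only and vary by platform and Python version:

```python
import os
import sys
import platform

# Coarse: decided at interpreter start-up by which builtin OS module
# (posix, nt, ...) was importable.
print(os.name)           # e.g. 'posix' on Linux/OS X, 'nt' on Windows

# Finer: baked in when the interpreter itself was compiled.
print(sys.platform)      # e.g. 'linux', 'darwin', 'win32', 'cygwin'

# Detailed: gathered at runtime, possibly by running external commands.
print(platform.uname())
print(platform.platform())
```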
If sys.platform is left 'linux' (like sys.platform is 'darwin' on iOS) then platform module could be used to detect Android e.g., platform.linux_distribution() though (it might be removed in Python 3.6) it is unpredictable [2] unless you fix it on your python distribution, e.g., here's an output on my machine:

>>> import platform
>>> platform.linux_distribution()
('Ubuntu', '14.04', 'trusty')

For example:

    is_android = (platform.linux_distribution()[0] == 'Android')

You could also define platform.android_version() that can provide Android specific version details as much as you need:

    is_android = bool(platform.android_version().release)

You could provide an alias android_ver (like existing java_ver, libc_ver, mac_ver, win32_ver). See also, "When to use os.name, sys.platform, or platform.system?" [3] Unrelated, TIL [4]: Android is a Linux distribution according to the Linux Foundation [1] https://docs.python.org/3.4/library/os.html#os.name [2] http://bugs.python.org/issue1322 [3] http://stackoverflow.com/questions/4553129/when-to-use-os-name-sys-platform-or-platform-system [4] http://en.wikipedia.org/wiki/Android_(operating_system) btw, does it help adding os.get_shell_executable() [5] function, to avoid hacking subprocess module, so that os.confstr('CS_PATH') or os.defpath on Android could be defined to include /system/bin instead? [5] http://bugs.python.org/issue16353 -- Akira From steve at pearwood.info Sat Aug 2 05:06:34 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 2 Aug 2014 13:06:34 +1000 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: <87bns37j2u.fsf@gmail.com> References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> Message-ID: <20140802030634.GH4525@ando> On Sat, Aug 02, 2014 at 05:53:45AM +0400, Akira Li wrote: > Python uses os.name, sys.platform, and various functions from `platform` > module to provide version info: [...]
> If Android is posixy enough (would `posix` module work on Android?) > then os.name could be left 'posix'. Does anyone know what kivy does when running under Android? -- Steven From guido at python.org Sat Aug 2 05:34:32 2014 From: guido at python.org (Guido van Rossum) Date: Fri, 1 Aug 2014 20:34:32 -0700 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: <20140802030634.GH4525@ando> References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> Message-ID: Or SL4A? (https://github.com/damonkohler/sl4a) On Fri, Aug 1, 2014 at 8:06 PM, Steven D'Aprano wrote: > On Sat, Aug 02, 2014 at 05:53:45AM +0400, Akira Li wrote: > > > Python uses os.name, sys.platform, and various functions from `platform` > > module to provide version info: > [...] > > If Android is posixy enough (would `posix` module work on Android?) > > then os.name could be left 'posix'. > > Does anyone know what kivy does when running under Android? > > > -- > Steven > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > https://mail.python.org/mailman/options/python-dev/guido%40python.org > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From cyberdupo56 at gmail.com Sat Aug 2 07:57:38 2014 From: cyberdupo56 at gmail.com (Allen Li) Date: Fri, 1 Aug 2014 22:57:38 -0700 Subject: [Python-Dev] sum(...) limitation In-Reply-To: References: Message-ID: <20140802055738.GA6053@gensokyo> On Fri, Aug 01, 2014 at 02:51:54PM -0700, Guido van Rossum wrote: > No. We just can't put all possible use cases in the docstring. :-) > > > On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini wrote: > > help(sum) tells clearly that it should be used to sum numbers and not > strings, and with strings actually fails. 
> > However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists. > > Is this to be considered a bug? Can you explain the rationale behind this design decision? It seems terribly inconsistent. Why are only strings explicitly restricted from being sum()ed? sum() should either ban everything except numbers or accept everything that implements addition (duck typing). -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 473 bytes Desc: not available URL: From tjreedy at udel.edu Sat Aug 2 08:35:32 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Sat, 02 Aug 2014 02:35:32 -0400 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <20140802055738.GA6053@gensokyo> References: <20140802055738.GA6053@gensokyo> Message-ID: On 8/2/2014 1:57 AM, Allen Li wrote: > On Fri, Aug 01, 2014 at 02:51:54PM -0700, Guido van Rossum wrote: >> No. We just can't put all possible use cases in the docstring. :-) >> >> >> On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini wrote: >> >> help(sum) tells clearly that it should be used to sum numbers and not >> strings, and with strings actually fails. >> >> However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists. >> >> Is this to be considered a bug? > > Can you explain the rationale behind this design decision? It seems > terribly inconsistent. Why are only strings explicitly restricted from > being sum()ed? sum() should either ban everything except numbers or > accept everything that implements addition (duck typing). O(n**2) behavior, ''.join(strings) alternative. 
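Terry's point, spelled out as a sketch. Note that CPython can sometimes optimise the repeated-+ form in place, so small inputs may not exhibit the quadratic cost; the asymptotic argument still favours join:

```python
import timeit

def concat_plus(chunks):
    # Each + may build a brand-new string, recopying everything
    # accumulated so far: O(N**2) copying in the worst case.
    result = ''
    for chunk in chunks:
        result = result + chunk
    return result

def concat_join(chunks):
    # ''.join measures the total length once and allocates once:
    # O(N) copying.
    return ''.join(chunks)

chunks = ['x'] * 1000
assert concat_plus(chunks) == concat_join(chunks)

# Illustrative only; absolute numbers depend on machine and interpreter.
print(timeit.timeit(lambda: concat_plus(chunks), number=100))
print(timeit.timeit(lambda: concat_join(chunks), number=100))
```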
-- Terry Jan Reedy From phil at riverbankcomputing.com Sat Aug 2 09:53:35 2014 From: phil at riverbankcomputing.com (Phil Thompson) Date: Sat, 02 Aug 2014 08:53:35 +0100 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> Message-ID: On 02/08/2014 4:34 am, Guido van Rossum wrote: > Or SL4A? (https://github.com/damonkohler/sl4a) > > > On Fri, Aug 1, 2014 at 8:06 PM, Steven D'Aprano > wrote: > >> On Sat, Aug 02, 2014 at 05:53:45AM +0400, Akira Li wrote: >> >> > Python uses os.name, sys.platform, and various functions from `platform` >> > module to provide version info: >> [...] >> > If Android is posixy enough (would `posix` module work on Android?) >> > then os.name could be left 'posix'. >> >> Does anyone know what kivy does when running under Android? I don't think either do anything. As the OP said, porting Python to Android is mainly about dealing with a C stdlib that is limited in places. Therefore there might be the odd missing function or attribute in the Python stdlib - just the same as can happen with other platforms. To me the issue is whether, for a particular value of sys.platform, the programmer can expect a particular Python stdlib API. If so then Android needs a different value for sys.platform. On the other hand if the programmer should not expect to make such an assumption, and should instead allow for the absence of certain functions (but which ones?), then the existing value of 'linux' should be fine. Another option I don't think I've seen suggested, given the recommended way of testing for Linux is to use sys.platform.startswith('linux'), is to use a value of 'linux-android'. Phil From steve at pearwood.info Sat Aug 2 09:39:12 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 2 Aug 2014 17:39:12 +1000 Subject: [Python-Dev] sum(...) 
limitation In-Reply-To: <20140802055738.GA6053@gensokyo> References: <20140802055738.GA6053@gensokyo> Message-ID: <20140802073912.GI4525@ando> On Fri, Aug 01, 2014 at 10:57:38PM -0700, Allen Li wrote: > On Fri, Aug 01, 2014 at 02:51:54PM -0700, Guido van Rossum wrote: > > No. We just can't put all possible use cases in the docstring. :-) > > > > > > On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini wrote: > > > > help(sum) tells clearly that it should be used to sum numbers and not > > strings, and with strings actually fails. > > > > However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists. > > > > Is this to be considered a bug? > > Can you explain the rationale behind this design decision? It seems > terribly inconsistent. Why are only strings explicitly restricted from > being sum()ed? sum() should either ban everything except numbers or > accept everything that implements addition (duck typing). Repeated list and str concatenation both have quadratic O(N**2) performance, but people frequently build up strings with + and rarely do the same for lists. String concatenation with + is an attractive nuisance for many people, including some who actually know better but nevertheless do it. Also, for reasons I don't understand, many people dislike or cannot remember to use ''.join. Whatever the reason, repeated string concatenation is common whereas repeated list concatenation is much, much rarer (and repeated tuple concatenation even rarer), so sum(strings) is likely to be a land mine buried in your code while sum(lists) is not. Hence the decision that beginners in particular need to be protected from the mistake of using sum(strings) but bothering to check for sum(lists) is a waste of time. Personally, I wish that sum would raise a warning rather than an exception. As for prohibiting anything except numbers with sum(), that in my opinion would be a bad idea. sum(vectors), sum(numeric_arrays), sum(angles) etc. should all be allowed. 
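The behaviour under discussion, as a quick sketch: sum() concatenates lists but refuses a str start value up front.

```python
# Works: list concatenation via repeated +, starting from [].
nested = [[1, 2, 3], [4], [], [5, 6]]
assert sum(nested, []) == [1, 2, 3, 4, 5, 6]

# Blacklisted: a str start value is rejected before any addition happens
# (Python 3 also rejects bytes and bytearray the same way).
try:
    sum(['a', 'b', 'c'], '')
except TypeError as exc:
    print(exc)

# The recommended spelling for strings:
assert ''.join(['a', 'b', 'c']) == 'abc'
```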
The general sum() built-in should accept any type that allows + (unless explicitly black-listed), while specialist numeric-only sums could go into modules (like math.fsum). -- Steven From jtaylor.debian at googlemail.com Sat Aug 2 12:11:54 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Sat, 02 Aug 2014 12:11:54 +0200 Subject: [Python-Dev] sum(...) limitation In-Reply-To: References: <20140802055738.GA6053@gensokyo> Message-ID: <53DCB96A.8050809@googlemail.com> On 02.08.2014 08:35, Terry Reedy wrote: > On 8/2/2014 1:57 AM, Allen Li wrote: >> On Fri, Aug 01, 2014 at 02:51:54PM -0700, Guido van Rossum wrote: >>> No. We just can't put all possible use cases in the docstring. :-) >>> >>> >>> On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini wrote: >>> >>> help(sum) tells clearly that it should be used to sum numbers >>> and not >>> strings, and with strings actually fails. >>> >>> However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists. >>> >>> Is this to be considered a bug? >> >> Can you explain the rationale behind this design decision? It seems >> terribly inconsistent. Why are only strings explicitly restricted from >> being sum()ed? sum() should either ban everything except numbers or >> accept everything that implements addition (duck typing). > > O(n**2) behavior, ''.join(strings) alternative. > > hm could this be a pure python case that would profit from temporary elision [0]? lists could declare the tp_can_elide slot and call list.extend on the temporary during its tp_add slot instead of creating a new temporary. extend/realloc can avoid the copy if there is free memory available after the block. [0] https://mail.python.org/pipermail/python-dev/2014-June/134826.html From stefan_ml at behnel.de Sat Aug 2 12:56:53 2014 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 02 Aug 2014 12:56:53 +0200 Subject: [Python-Dev] sum(...) 
limitation In-Reply-To: <53DCB96A.8050809@googlemail.com> References: <20140802055738.GA6053@gensokyo> <53DCB96A.8050809@googlemail.com> Message-ID: Julian Taylor schrieb am 02.08.2014 um 12:11: > On 02.08.2014 08:35, Terry Reedy wrote: >> On 8/2/2014 1:57 AM, Allen Li wrote: >>> On Fri, Aug 01, 2014 at 02:51:54PM -0700, Guido van Rossum wrote: >>>> No. We just can't put all possible use cases in the docstring. :-) >>>> >>>> >>>> On Fri, Aug 1, 2014 at 2:48 PM, Andrea Griffini wrote: >>>> >>>> help(sum) tells clearly that it should be used to sum numbers >>>> and not >>>> strings, and with strings actually fails. >>>> >>>> However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists. >>>> >>>> Is this to be considered a bug? >>> >>> Can you explain the rationale behind this design decision? It seems >>> terribly inconsistent. Why are only strings explicitly restricted from >>> being sum()ed? sum() should either ban everything except numbers or >>> accept everything that implements addition (duck typing). >> >> O(n**2) behavior, ''.join(strings) alternative. > > lists could declare the tp_can_elide slot and call list.extend on the > temporary during its tp_add slot instead of creating a new temporary. > extend/realloc can avoid the copy if there is free memory available > after the block. Yes, i.e. only sometimes. Better not rely on it in your code. Stefan From alexander.belopolsky at gmail.com Sat Aug 2 16:52:07 2014 From: alexander.belopolsky at gmail.com (Alexander Belopolsky) Date: Sat, 2 Aug 2014 10:52:07 -0400 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <20140802073912.GI4525@ando> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> Message-ID: On Sat, Aug 2, 2014 at 3:39 AM, Steven D'Aprano wrote: > String concatenation with + is an attractive > nuisance for many people, including some who actually know better but > nevertheless do it. 
Also, for reasons I don't understand, many people > dislike or cannot remember to use ''.join. > Since sum() already treats strings as a special case, why can't it simply call (an equivalent of) ''.join itself instead of telling the user to do it? It does not matter why "many people dislike or cannot remember to use ''.join" - if this is a fact - it should be considered by language implementors. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefan_ml at behnel.de Sat Aug 2 17:06:10 2014 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 02 Aug 2014 17:06:10 +0200 Subject: [Python-Dev] sum(...) limitation In-Reply-To: References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> Message-ID: Alexander Belopolsky schrieb am 02.08.2014 um 16:52: > On Sat, Aug 2, 2014 at 3:39 AM, Steven D'Aprano wrote: > >> String concatenation with + is an attractive >> nuisance for many people, including some who actually know better but >> nevertheless do it. Also, for reasons I don't understand, many people >> dislike or cannot remember to use ''.join. > > Since sum() already treats strings as a special case, why can't it simply > call (an equivalent of) ''.join itself instead of telling the user to do > it? It does not matter why "many people dislike or cannot remember to use > ''.join" - if this is a fact - it should be considered by language > implementors. I don't think sum(strings) is beautiful enough to merit special cased support. Special cased rejection sounds like a much better way to ask people "think again - what's a sum of strings anyway?". Stefan From steve at pearwood.info Sat Aug 2 17:27:56 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Sun, 3 Aug 2014 01:27:56 +1000 Subject: [Python-Dev] sum(...) 
limitation In-Reply-To: References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> Message-ID: <20140802152756.GJ4525@ando> On Sat, Aug 02, 2014 at 10:52:07AM -0400, Alexander Belopolsky wrote: > On Sat, Aug 2, 2014 at 3:39 AM, Steven D'Aprano wrote: > > > String concatenation with + is an attractive > > nuisance for many people, including some who actually know better but > > nevertheless do it. Also, for reasons I don't understand, many people > > dislike or cannot remember to use ''.join. > > > > Since sum() already treats strings as a special case, why can't it simply > call (an equivalent of) ''.join itself instead of telling the user to do > it? It does not matter why "many people dislike or cannot remember to use > ''.join" - if this is a fact - it should be considered by language > implementors. It could, of course, but there is virtue in keeping sum simple, rather than special-casing who knows how many different types. If sum() tries to handle strings, should it do the same for lists? bytearrays? array.array? tuple? Where do we stop? Ultimately it comes down to personal taste. Some people are going to wish sum() tried harder to do the clever thing with more types, some people are going to wish it was simpler and didn't try to be clever at all. Another argument against excessive cleverness is that it ties sum() to one particular idiom or implementation. Today, the idiomatic and efficient way to concatenate a lot of strings is with ''.join, but tomorrow there might be a new str.concat() method. Who knows? sum() shouldn't have to care about these details, since they are secondary to sum()'s purpose, which is to add numbers. Anything else is a bonus (or perhaps a nuisance). So, I would argue that when faced with something that is not a number, there are two reasonable approaches for sum() to take: - refuse to handle the type at all; or - fall back on simple-minded repeated addition. 
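The second approach is what makes sum() work for user-defined types today: anything whose class implements addition is summable. A toy sketch (the Vector class here is hypothetical, purely for illustration):

```python
class Vector:
    """Minimal 2-D vector that supports +; illustration only."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):
        if isinstance(other, Vector):
            return Vector(self.x + other.x, self.y + other.y)
        return NotImplemented

    def __radd__(self, other):
        # sum() starts from 0 by default, so 0 + Vector lands here.
        if other == 0:
            return self
        return NotImplemented

    def __eq__(self, other):
        return (isinstance(other, Vector)
                and (self.x, self.y) == (other.x, other.y))

vectors = [Vector(1, 2), Vector(3, 4), Vector(5, 6)]
assert sum(vectors) == Vector(9, 12)
```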
By the way, I think this whole argument would have been easily side-stepped if + was only used for addition, and & used for concatenation. Then there would be no question about what sum() should do for lists and tuples and strings: raise TypeError. -- Steven From hi at shiz.me Sat Aug 2 14:00:04 2014 From: hi at shiz.me (Shiz) Date: Sat, 02 Aug 2014 14:00:04 +0200 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: <87bns37j2u.fsf@gmail.com> References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> Message-ID: <53DCD2C4.201@shiz.me> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 Akira Li wrote: > Python uses os.name, sys.platform, and various functions from > `platform` module to provide version info: > > - coarse: os.name is 'posix', 'nt', 'ce', 'java' [1]. It is defined > by availability of some builtin modules ('posix', 'nt' in particular) > at import time. > > - finer: sys.platform may start with freebsd, linux, win, cygwin, > darwin (`uname -s`). It is defined at python build time. > > - detailed: `platform` module. It provides as much info as possible > e.g., platform.uname(), platform.platform(). It may use runtime > commands to get it. > > If Android is posixy enough (would `posix` module work on Android?) > then os.name could be left 'posix'. > > You could set sys.platform to 'android' (like sys.platform may be > 'cygwin' on Windows) if Android is not like *any other* Linux > distribution (from the point of view of writing a working Python code > on it) i.e., if Android is further from other Linux distribution > than freebsd, linux, darwin from each other then it might deserve > sys.platform slot. 
> > If sys.platform is left 'linux' (like sys.platform is 'darwin' on > iOS) then platform module could be used to detect Android e.g., > platform.linux_distribution() though (it might be removed in Python > 3.6) it is unpredictable [2] unless you fix it on your python > distribution, e.g., here's an output on my machine: > >>>> import platform platform.linux_distribution() > ('Ubuntu', '14.04', 'trusty') > > For example: > > is_android = (platform.linux_distribution()[0] == 'Android') > > You could also define platform.android_version() that can provide > Android specific version details as much as you need: > > is_android = bool(platform.android_version().release) > > You could provide an alias android_ver (like existing java_ver, > libc_ver, mac_ver, win32_ver). > > See also, "When to use os.name, sys.platform, or platform.system?" > [3] > > Unrelated, TIL [4]: > > Android is a Linux distribution according to the Linux Foundation > > [1] https://docs.python.org/3.4/library/os.html#os.name [2] > http://bugs.python.org/issue1322 [3] > http://stackoverflow.com/questions/4553129/when-to-use-os-name-sys-platform-or-platform-system > > [4] http://en.wikipedia.org/wiki/Android_(operating_system) > > > btw, does it help adding os.get_shell_executable() [5] function, to > avoid hacking subprocess module, so that os.confstr('CS_PATH') or > os.defpath on Android could be defined to include /system/bin > instead? > > [5] http://bugs.python.org/issue16353 Thanks for the detailed information! I would consider Android at least POSIX-y enough for os.name to be considered 'posix'. It doesn't implement a few POSIX-mandated things like POSIX semaphores, but aside from that I would largely consider it 'compatible enough'. I guess what is left is deciding whether to add a platform slot for Android, or to stuff the detection in platform.linux_distribution(). 
I feel like it would be a bit hacky for standard modules to rely on a platform.linux_distribution() return value though, it seems mostly useful for display purposes. Phil Thompson's idea of setting sys.platform to 'linux-android' also occurred to me. Under the premise that we can get users to use sys.platform.startswith('linux'), this seems like the best solution in my eyes: it both allows for existing code to continue the assumption that they are running on a Linux platform, which I believe to be correct in a lot of places, and Python modules to use a solid value to check if they need to behave differently when running on Android. On a sidenote, Kivy and SL4A/Py4A do not address this, no. From what I've seen from their patches they are mostly there to get Python compiling and running in the first place, not necessarily about fixing every compatibility issue. :) As for the os.get_shell_executable(), that seems like a good solution for the issue that occurs in the subprocess module indeed. I'd personally prefer it to manual checking within the module. 
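Under that proposal, module-level detection could look something like the following sketch. Both branches are assumptions rather than current CPython behaviour: 'linux-android' is the suggested sys.platform value, and an 'Android' distribution name presumes a patched platform module like the one in the patchset discussed here.

```python
import sys
import platform

def is_android():
    # Hypothetical check; neither value is produced by stock CPython.
    if not sys.platform.startswith('linux'):
        return False
    if sys.platform == 'linux-android':  # proposed sys.platform value
        return True
    try:
        # Patched-platform fallback; linux_distribution() may not even
        # exist on newer Pythons, hence the broad guard.
        return platform.linux_distribution()[0] == 'Android'
    except Exception:
        return False

# Existing code using sys.platform.startswith('linux') keeps working,
# since the proposed value still starts with 'linux'.
print(is_android())
```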
Kind regards, Shiz From python at mrabarnett.plus.com Sat Aug 2 17:50:32 2014 From: python at mrabarnett.plus.com (MRAB) Date: Sat, 02 Aug 2014 16:50:32 +0100 Subject: [Python-Dev] sum(...)
limitation In-Reply-To: <20140802152756.GJ4525@ando> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802152756.GJ4525@ando> Message-ID: <53DD08C8.8070502@mrabarnett.plus.com> On 2014-08-02 16:27, Steven D'Aprano wrote: > On Sat, Aug 02, 2014 at 10:52:07AM -0400, Alexander Belopolsky wrote: >> On Sat, Aug 2, 2014 at 3:39 AM, Steven D'Aprano wrote: >> >> > String concatenation with + is an attractive >> > nuisance for many people, including some who actually know better but >> > nevertheless do it. Also, for reasons I don't understand, many people >> > dislike or cannot remember to use ''.join. >> > >> >> Since sum() already treats strings as a special case, why can't it simply >> call (an equivalent of) ''.join itself instead of telling the user to do >> it? It does not matter why "many people dislike or cannot remember to use >> ''.join" - if this is a fact - it should be considered by language >> implementors. > > It could, of course, but there is virtue in keeping sum simple, > rather than special-casing who knows how many different types. If sum() > tries to handle strings, should it do the same for lists? bytearrays? > array.array? tuple? Where do we stop? > We could leave any special-casing to the classes themselves:

    def sum(iterable, start=0):
        # Default to None: plain types without a __sum__ hook fall
        # through to ordinary repeated addition.
        sum_func = getattr(type(start), '__sum__', None)
        if sum_func is None:
            result = start
            for item in iterable:
                result = result + item
        else:
            result = sum_func(start, iterable)
        return result

> Ultimately it comes down to personal taste. Some people are going to > wish sum() tried harder to do the clever thing with more types, some > people are going to wish it was simpler and didn't try to be clever at > all. > > Another argument against excessive cleverness is that it ties sum() to > one particular idiom or implementation. Today, the idiomatic and > efficient way to concatenate a lot of strings is with ''.join, but > tomorrow there might be a new str.concat() method. Who knows?
sum() > shouldn't have to care about these details, since they are secondary to > sum()'s purpose, which is to add numbers. Anything else is a > bonus (or perhaps a nuisance). > > So, I would argue that when faced with something that is not a number, > there are two reasonable approaches for sum() to take: > > - refuse to handle the type at all; or > - fall back on simple-minded repeated addition. > > > By the way, I think this whole argument would have been easily > side-stepped if + was only used for addition, and & used for > concatenation. Then there would be no question about what sum() should > do for lists and tuples and strings: raise TypeError. > From alexander.belopolsky at gmail.com Sat Aug 2 20:15:34 2014 From: alexander.belopolsky at gmail.com (Alexander Belopolsky) Date: Sat, 2 Aug 2014 14:15:34 -0400 Subject: [Python-Dev] sum(...) limitation In-Reply-To: References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> Message-ID: On Sat, Aug 2, 2014 at 11:06 AM, Stefan Behnel wrote: > I don't think sum(strings) is beautiful enough sum(strings) is more beautiful than ''.join(strings) in my view, but unfortunately it does not work even for lists because the initial value defaults to 0. sum(strings, '') and ''.join(strings) are equally ugly and non-obvious because they require an empty string. Empty containers are an advanced concept and it is unfortunate that a simple job of concatenating a list of (non-empty!) strings exposes the user to it. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From guido at python.org Sat Aug 2 20:36:29 2014 From: guido at python.org (Guido van Rossum) Date: Sat, 2 Aug 2014 11:36:29 -0700 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> Message-ID: On Sat, Aug 2, 2014 at 12:53 AM, Phil Thompson wrote: > To me the issue is whether, for a particular value of sys.platform, the > programmer can expect a particular Python stdlib API. If so then Android > needs a different value for sys.platform. > sys.platform is for a broad indication of the OS kernel. It can be used to distinguish Windows, Mac and Linux (and BSD, Solaris etc.). Since Android is Linux it should have the same sys.platform as other Linux systems ('linux2'). If you want to know whether a specific syscall is there, check for the presence of the method in the os module. The platform module is suitable for additional vendor-specific info about the platform, and I'd hope that there's something there that indicates Android. Again, what values does the platform module return on SL4A or Kivy, which have already ported Python to Android? In particular, I'd expect platform.linux_distribution() to return a clue that it's Android. There should also be clues in /etc/lsb-release (assuming Android supports it :-). -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From hi at shiz.me Sat Aug 2 21:14:30 2014 From: hi at shiz.me (Shiz) Date: Sat, 02 Aug 2014 21:14:30 +0200 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> Message-ID: <53DD3896.4050708@shiz.me> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 Guido van Rossum wrote: > sys.platform is for a broad indication of the OS kernel. 
It can be > used to distinguish Windows, Mac and Linux (and BSD, Solaris etc.). > Since Android is Linux it should have the same sys.platform as other > Linux systems ('linux2'). If you want to know whether a specific > syscall is there, check for the presence of the method in the os > module. > > The platform module is suitable for additional vendor-specific info > about the platform, and I'd hope that there's something there that > indicates Android. Again, what values does the platform module return > on SL4A or Kivy, which have already ported Python to Android? In > particular, I'd expect platform.linux_distribution() to return a > clue that it's Android. There should also be clues in > /etc/lsb-release (assuming Android supports it :-). > > -- --Guido van Rossum (python.org/~guido ) To the best of my knowledge, Kivy and Py4A/SL4A don't modify that code at all, so it just returns 'linux2'. In addition, they don't modify platform.py either, so platform.linux_distribution() returns empty values. My patchset[1] currently contains patches that both set sys.platform to 'linux-android' and modify platform.linux_distribution() to parse and return a proper value for Android systems:

>>> import sys, platform
>>> sys.platform
'linux-android'
>>> platform.linux_distribution()
('Android', '4.4.2', 'Blur_Version.174.44.9.falcon_umts.EURetail.en.EU')

The sys.platform thing was mainly done out of curiosity about its possibility after Phil brought it up. My main issue with leaving Android detection to checking platform.linux_distribution() is that it feels like a bit of a wonky thing for core Python modules to rely on to change behaviour where needed on Android (as well as introducing a dependency cycle between subprocess and platform right now). I'd also like to note that I wouldn't agree with following too many of Kivy/Py4A/SL4A's design decisions on this, as they seem mostly absent.
From what I've read, their patches mostly seem geared towards getting Python to run on Android, not necessarily integrating it well or fixing all inconsistencies. This also leads to things like subprocess.Popen() indeed breaking with shell=True[2]. Kind regards, Shiz [1]: https://github.com/rave-engine/python3-android/tree/master/src [2]: http://grokbase.com/t/gg/python-for-android/1343rm7q1w/py4a-subprocess-popen-oserror-errno-8-exec-format-error From dw+python-dev at hmmz.org Sat Aug 2 22:35:13 2014 From: dw+python-dev at hmmz.org (David Wilson) Date: Sat, 2 Aug 2014 20:35:13 +0000 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <20140802073912.GI4525@ando> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> Message-ID: <20140802203513.GA10447@k2> On Sat, Aug 02, 2014 at 05:39:12PM +1000, Steven D'Aprano wrote: > Repeated list and str concatenation both have quadratic O(N**2) > performance, but people frequently build up strings with + and rarely > do the same for lists. String concatenation with + is an attractive > nuisance for many people, including some who actually know better but > nevertheless do it. Also, for reasons I don't understand, many people > dislike or cannot remember to use ''.join. join() isn't preferable in cases where it damages readability while simultaneously providing zero or negative performance benefit, such as when concatenating a few short strings, e.g. while adding a prefix to a filename. Although it's true that join() is automatically the safer option, and especially when dealing with user supplied data, the net harm caused by teaching rote and ceremony seems far less desirable compared to fixing a trivial slowdown in a script, if that slowdown ever became apparent. Another (twisted) interpretation is that since the quadratic behaviour is a CPython implementation detail, and there are alternatives where __add__ is constant time, encouraging users to code against implementation details becomes undesirable. In our twisty world, __add__ becomes *preferable* since the resulting programs more closely resemble pseudo-code.
$ cat t.py
a = 'this '
b = 'is a string'
c = 'as we can tell'

def x():
    return a + b + c

def y():
    return ''.join([a, b, c])

$ python -m timeit -s 'import t' 't.x()'
1000000 loops, best of 3: 0.477 usec per loop
$ python -m timeit -s 'import t' 't.y()'
1000000 loops, best of 3: 0.695 usec per loop

David From phil at riverbankcomputing.com Sat Aug 2 22:38:37 2014 From: phil at riverbankcomputing.com (Phil Thompson) Date: Sat, 02 Aug 2014 21:38:37 +0100 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> Message-ID: <7174d653f6f56d769059264f0fa0499c@www.riverbankcomputing.com> On 02/08/2014 7:36 pm, Guido van Rossum wrote: > On Sat, Aug 2, 2014 at 12:53 AM, Phil Thompson > > wrote: > >> To me the issue is whether, for a particular value of sys.platform, >> the >> programmer can expect a particular Python stdlib API. If so then >> Android >> needs a different value for sys.platform. >> > > sys.platform is for a broad indication of the OS kernel. It can be used > to > distinguish Windows, Mac and Linux (and BSD, Solaris etc.). Since > Android > is Linux it should have the same sys.platform as other Linux systems > ('linux2'). If you want to know whether a specific syscall is there, > check > for the presence of the method in the os module. It's not just the os module - other modules contain code that would be affected, but there are plenty of other parts of the Python stdlib that aren't implemented on every platform. Using the approach you prefer then all that's needed is to update the documentation to say that certain things are not implemented on Android.
Phil From guido at python.org Sat Aug 2 22:40:38 2014 From: guido at python.org (Guido van Rossum) Date: Sat, 2 Aug 2014 13:40:38 -0700 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: <7174d653f6f56d769059264f0fa0499c@www.riverbankcomputing.com> References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> <7174d653f6f56d769059264f0fa0499c@www.riverbankcomputing.com> Message-ID: Right. On Saturday, August 2, 2014, Phil Thompson wrote: > On 02/08/2014 7:36 pm, Guido van Rossum wrote: > >> On Sat, Aug 2, 2014 at 12:53 AM, Phil Thompson < >> phil at riverbankcomputing.com> >> wrote: >> >> To me the issue is whether, for a particular value of sys.platform, the >>> programmer can expect a particular Python stdlib API. If so then Android >>> needs a different value for sys.platform. >>> >>> >> sys.platform is for a broad indication of the OS kernel. It can be used to >> distinguish Windows, Mac and Linux (and BSD, Solaris etc.). Since Android >> is Linux it should have the same sys.platform as other Linux systems >> ('linux2'). If you want to know whether a specific syscall is there, check >> for the presence of the method in the os module. >> > > It's not just the os module - other modules contain code that would be > affected, but there are plenty of other parts of the Python stdlib that > aren't implemented on every platform. Using the approach you prefer then > all that's needed is to update the documentation to say that certain things > are not implemented on Android. > > Phil > -- --Guido van Rossum (on iPad) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From guido at python.org Sat Aug 2 22:35:01 2014 From: guido at python.org (Guido van Rossum) Date: Sat, 2 Aug 2014 13:35:01 -0700 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: <53DD3896.4050708@shiz.me> References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> <53DD3896.4050708@shiz.me> Message-ID: On Sat, Aug 2, 2014 at 12:14 PM, Shiz wrote: > Guido van Rossum wrote: > > sys.platform is for a broad indication of the OS kernel. It can be > > used to distinguish Windows, Mac and Linux (and BSD, Solaris etc.). > > Since Android is Linux it should have the same sys.platform as other > > Linux systems ('linux2'). If you want to know whether a specific > > syscall is there, check for the presence of the method in the os > > module. > > > > The platform module is suitable for additional vendor-specific info > > about the platform, and I'd hope that there's something there that > > indicates Android. Again, what values does the platform module return > > on SL4A or Kivy, which have already ported Python to Android? In > > particular, I'd expect platform.linux_distribution() to return a > > clue that it's Android. There should also be clues in > > /etc/lsb-release (assuming Android supports it :-). > > > > -- --Guido van Rossum (python.org/~guido ) > > To the best of my knowledge, Kivy and Py4A/SL4A don't modify that code > at all, so it just returns 'linux2'. In addition, they don't modify > platform.py either, so platform.linux_distribution() returns empty values. > OK, so personally I'd leave sys.platform but improve on platform.linux_distribution(). 
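For context, a sketch of the kind of check an improved linux_distribution() would enable on the consuming side. This is hedged: on an unpatched Android build the function returns empty strings, and it was later deprecated and removed from the stdlib (in Python 3.8), so the call is guarded:

```python
import platform

# Sketch only: on unpatched builds this returns empty strings, and
# linux_distribution() itself no longer exists on recent Pythons,
# hence the AttributeError guard.
try:
    dist_name, dist_version, dist_id = platform.linux_distribution()
except AttributeError:
    dist_name, dist_version, dist_id = "", "", ""

on_android = (dist_name == "Android")
```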
> My patchset[1] currently contains patches that both set sys.platform to > 'linux-android' and modify platform.linux_distribution() to parse and > return a proper value for Android systems: > > >>> import sys, platform > >>> sys.platform > 'linux-android' > >>> platform.linux_distribution() > ('Android', '4.4.2', 'Blur_Version.174.44.9.falcon_umts.EURetail.en.EU') > > The sys.platform thing was mainly done out of curiosity about its > possibility after Phil brought it up. Can you give a few examples of where you'd need to differentiate Android from other Linux platforms in otherwise portable code, and where testing for the presence or absence of the specific function that you'd like to call isn't possible? I know I pretty much never test for the difference between OSX and other UNIX variants (including Linux) -- the only platform distinction that regularly comes up in my own code is Windows vs. the rest. And even there, often the right thing to test for is something more specific like os.sep. > My main issue with leaving > Android detection to checking platform.linux_distribution() is that it > feels like a bit of a wonky thing for core Python modules to rely on to > change behaviour where needed on Android (as well as introducing a > dependency cycle between subprocess and platform right now). > What's the specific change in stdlib behavior that you're proposing for Android? > I'd also like to note that I wouldn't agree with following too many of > Kivy/Py4A/SL4A's design decisions on this, as they seem mostly absent. > From what I've read, their patches mostly seem geared towards getting > Python to run on Android, not necessarily integrating it well or fixing > all inconsistencies. This also leads to things like subprocess.Popen() > indeed breaking with shell=True[2]. > I'm all for fixing subprocess.Popen(), though I'm not sure what the best way is to determine this particular choice (why is it in the first place that /bin/sh doesn't work?).
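The feature-detection style suggested above ("check for the presence of the method in the os module") can be sketched like this; the specific names are only illustrative examples, not a fixed recommendation:

```python
import os

# Test for the call you need rather than sniffing the platform name.
if hasattr(os, "fork"):
    spawn_child = os.fork        # POSIX path
else:
    spawn_child = None           # e.g. fall back to the subprocess module

# The same idea works for any optional syscall wrapper:
has_statvfs = hasattr(os, "statvfs")
```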
However, since it's a stdlib module you could easily rely on a private API to detect Android, so this doesn't really force the sys.platform issue. (Or you could propose a fix that will work for Kivy and SL4A as well, e.g. checking for some system file that is documented as unique to Android.) > > Kind regards, > Shiz > > [1]: https://github.com/rave-engine/python3-android/tree/master/src > [2]: > > http://grokbase.com/t/gg/python-for-android/1343rm7q1w/py4a-subprocess-popen-oserror-errno-8-exec-format-error -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From dw+python-dev at hmmz.org Sun Aug 3 00:13:39 2014 From: dw+python-dev at hmmz.org (David Wilson) Date: Sat, 2 Aug 2014 22:13:39 +0000 Subject: [Python-Dev] [Python-checkins] cpython: Issue #22003: When initialized from a bytes object, io.BytesIO() now In-Reply-To: References: <3hNDzH5WHWz7Ljk@mail.python.org> Message-ID: <20140802221339.GA12662@k2> Thanks for spotting. There is a new patch in http://bugs.python.org/issue22125 to fix the warnings. David From hi at shiz.me Sun Aug 3 00:49:00 2014 From: hi at shiz.me (Shiz) Date: Sun, 03 Aug 2014 00:49:00 +0200 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> <53DD3896.4050708@shiz.me> Message-ID: <53DD6ADC.9060600@shiz.me> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 Guido van Rossum wrote: > Can you give a few examples of where you'd need to differentiate > Android from other Linux platforms in otherwise portable code, and > where testing for the presence or absence of the specific function > that you'd like to call isn't possible? I know I pretty much never > test for the difference between OSX and other UNIX variants > (including Linux) -- the only platform distinction that regularly > comes up in my own code is Windows vs.
the rest. And even there, > often the right thing to test for is something more specific like > os.sep. > > What's the specific change in stdlib behavior that you're proposing > for Android? The most obvious change would be to subprocess.Popen(). The reason a generic approach there won't work is also the reason I expect more changes might be needed: the Android file system doesn't abide by any POSIX file system standards. Its shell isn't located at /bin/sh, but at /system/bin/sh. The only directories it provides that are POSIX-standard are /dev and /etc, to my knowledge. You could check to see if /system/bin/sh exists and use that first, but that would break the preferred shell on POSIX systems that happen to have /system for some reason or another. In short: the preferred shell on POSIX systems is /bin/sh, but on Android it's /system/bin/sh. Simple existence checking might break the preferred shell on either. For more specific stdlib examples I'd have to check the test suite again. I can see the point of a sys.platform change not necessarily being needed, but it would be nice for user code too to have a sort-of trivial way to figure out if it's running on Android. While core CPython might in general care far less, for user applications it's a bigger deal since they have to draw GUIs and use system services in a way that *is* usually very different on Android. Again, platform.linux_distribution() seems more for display purposes than for applications to check their core logic against. In addition, apparently platform.linux_distribution() is getting deprecated in 3.5 and removed in 3.6[1]. I agree that the above issue should in fact be solved by the earlier-linked-to os.get_preferred_shell() approach, however. > However, since it's a stdlib module you could easily rely on a > private API to detect Android, so this doesn't really force the > sys.platform issue. (Or you could propose a fix that will work for > Kivy and SL4A as well, e.g.
checking for some system file that is > documented as unique to Android.) After checking most of the entire Android file system, I'm not sure if such a file exists. Sure, a lot of the Android file system hierarchy isn't really used anywhere else, but I'm not sure a check to see if e.g. /system exists is really enough to conclude Python is running on Android on its own. The thing that gets closest (which is the thing my platform.py patch checks for) is several Android-specific environment variables being defined (ANDROID_ROOT, ANDROID_DATA, ANDROID_PROPERTY_WORKSPACE...). Wouldn't it be better to put this in the standard Python library and expose it somehow, though? It *is* fragile code; it seems better if applications could 'just rely' on Python to figure it out, since it's not a trivial check. Kind regards, Shiz [1]: http://bugs.python.org/issue1322#msg207427 From greg.ewing at canterbury.ac.nz Sun Aug 3 02:27:40 2014 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sun, 03 Aug 2014 12:27:40 +1200 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: <53DD6ADC.9060600@shiz.me> References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> <53DD3896.4050708@shiz.me> <53DD6ADC.9060600@shiz.me> Message-ID: <53DD81FC.9030408@canterbury.ac.nz> Shiz wrote: > I'm not sure a check to see if e.g. > /system exists is really enough to conclude Python is running on Android > on its own. Since MacOSX has /System and typically a case-insensitive file system, it certainly wouldn't. :-) -- Greg From guido at python.org Sun Aug 3 06:41:42 2014 From: guido at python.org (Guido van Rossum) Date: Sat, 2 Aug 2014 21:41:42 -0700 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: <53DD6ADC.9060600@shiz.me> References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> <53DD3896.4050708@shiz.me> <53DD6ADC.9060600@shiz.me> Message-ID: Well, it really does look like checking for the presence of those ANDROID_* environment variables is the best way to recognize the Android platform. Anyone can do that without waiting for a ruling on whether Android is Linux or not (which would be necessary because the docs for sys.platform are quite clear about its value on Linux systems).
Googling terms like "is Android Linux" suggests that there is considerable controversy about the issue, so I suggest you don't wait. :-) On Sat, Aug 2, 2014 at 3:49 PM, Shiz wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA512 > > Guido van Rossum wrote: > > Can you give a few examples of where you'd need to differentiate > > Android from other Linux platforms in otherwise portable code, and > > where testing for the presence or absence of the specific function > > that you'd like to call isn't possible? I know I pretty much never > > test for the difference between OSX and other UNIX variants > > (including Linux) -- the only platform distinction that regularly > > comes up in my own code is Windows vs. the rest. And even there, > > often the right thing to test for is something more specific like > > os.sep. > > > What's the specific change in stdlib behavior that you're proposing > > for Android? > > The most obvious change would be to subprocess.Popen(). The reason a > generic approach there won't work is also the reason I expect more > changes might be needed: the Android file system doesn't abide by any > POSIX file system standards. Its shell isn't located at /bin/sh, but at > /system/bin/sh. The only directories it provides that are POSIX-standard > are /dev and /etc, to my knowledge. You could check to see if > /system/bin/sh exists and use that first, but that would break the > preferred shell on POSIX systems that happen to have /system for some > reason or another. In short: the preferred shell on POSIX systems is > /bin/sh, but on Android it's /system/bin/sh. Simple existence checking > might break the preferred shell on either. For more specific stdlib > examples I'd have to check the test suite again. > > I can see the point of a sys.platform change not necessarily being > needed, but it would nice for user code too to have a sort-of trivial > way to figure out if it's running on Android. 
While core CPython might > in general care far less, for user applications it's a bigger deal since > they have to draw GUIs and use system services in a way that *is* > usually very different on Android. Again, platform.linux_distribution() > seems more for display purposes than for applications to check their > core logic against. > In addition, apparently platform.linux_distribution() is getting > deprecated in 3.5 and removed in 3.6[1]. > > I agree that above issue should in fact be solved by the earlier-linked > to os.get_preferred_shell() approach, however. > > > However, since it's a stdlib module you could easily rely on a > > private API to detect Android, so this doesn't really force the > > sys.platform issue. (Or you could propose a fix that will work for > > Kivi and SL4A as well, e.g. checking for some system file that is > > documented as unique to Android.) > > After checking most of the entire Android file system, I'm not sure if > such a file exists. Sure, a lot of the Android file system hierarchy > isn't really used anywhere else, but I'm not sure a check to see if e.g. > /system exists is really enough to conclude Python is running on Android > on its own. The thing that gets closest (which is the thing my > platform.py patch checks for) is several Android-specific environment > variables being defined (ANDROID_ROOT, ANDROID_DATA, > ANDROID_PROPERTY_WORKSPACE...). Wouldn't it be better to put this in the > standard Python library and expose it somehow, though? It *is* fragile > code, it seems better if applications could 'just rely' on Python to > figure it out, since it's not a trivial check. 
> > Kind regards, > Shiz > > [1]: http://bugs.python.org/issue1322#msg207427 -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed...
URL: From hi at shiz.me Sun Aug 3 07:18:01 2014 From: hi at shiz.me (Shiz) Date: Sun, 03 Aug 2014 07:18:01 +0200 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> <53DD3896.4050708@shiz.me> <53DD6ADC.9060600@shiz.me> Message-ID: <53DDC609.2080409@shiz.me> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 Guido van Rossum wrote: > Well, it really does look like checking for the presence of those > ANDROID_* environment variables it the best way to recognize the > Android platform. Anyone can do that without waiting for a ruling on > whether Android is Linux or not (which would be necessary because the > docs for sys.platform are quite clear about its value on Linux > systems). Googling terms like "is Android Linux" suggests that there > is considerable controversy about the issue, so I suggest you don't > wait. :-) Right, which brings us back to the original point I was trying to make: any chance we could move logic like that into a sys.getandroidversion() or platform.android_version() so user code (and standard library code alike) doesn't have to perform those relatively nasty checks themselves? It seems like a fair thing to do if CPython would support Android as an official target. 
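A minimal sketch of what such a helper could check internally, assuming only the ANDROID_* environment variables mentioned earlier in the thread (the function name itself is hypothetical, not an existing API):

```python
import os

def is_android():
    """Hypothetical helper: detect Android via environment variables
    the platform is known to define."""
    markers = ("ANDROID_ROOT", "ANDROID_DATA", "ANDROID_PROPERTY_WORKSPACE")
    return any(name in os.environ for name in markers)
```

A stdlib version would presumably cache the result, and might cross-check more than the environment, since these variables can be unset or spoofed.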
Kind regards, Shiz From 4kir4.1i at gmail.com Sun Aug 3 12:45:30 2014 From: 4kir4.1i at gmail.com (Akira Li) Date: Sun, 03 Aug 2014 14:45:30 +0400 Subject: [Python-Dev] Exposing the Android platform existence to Python modules References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> <53DD3896.4050708@shiz.me> <53DD6ADC.9060600@shiz.me>
Message-ID: <87k36petrp.fsf@gmail.com> Shiz writes: > The most obvious change would be to subprocess.Popen(). The reason a > generic approach there won't work is also the reason I expect more > changes might be needed: the Android file system doesn't abide by any > POSIX file system standards. Its shell isn't located at /bin/sh, but at > /system/bin/sh. The only directories it provides that are POSIX-standard > are /dev and /etc, to my knowledge. You could check to see if > /system/bin/sh exists and use that first, but that would break the > preferred shell on POSIX systems that happen to have /system for some > reason or another. In short: the preferred shell on POSIX systems is > /bin/sh, but on Android it's /system/bin/sh. Simple existence checking > might break the preferred shell on either. For more specific stdlib > examples I'd have to check the test suite again. FYI, /bin/sh is not POSIX, see http://bugs.python.org/issue16353#msg224514 -- Akira From 4kir4.1i at gmail.com Sun Aug 3 13:31:06 2014 From: 4kir4.1i at gmail.com (Akira Li) Date: Sun, 03 Aug 2014 15:31:06 +0400 Subject: [Python-Dev] Exposing the Android platform existence to Python modules References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> <53DD3896.4050708@shiz.me> <53DD6ADC.9060600@shiz.me> Message-ID: <87fvhdernp.fsf@gmail.com> Guido van Rossum writes: > Well, it really does look like checking for the presence of those ANDROID_* > environment variables it the best way to recognize the Android platform. > Anyone can do that without waiting for a ruling on whether Android is Linux > or not (which would be necessary because the docs for sys.platform are > quite clear about its value on Linux systems). Googling terms like "is > Android Linux" suggests that there is considerable controversy about the > issue, so I suggest you don't wait. :-) I don't see sysconfig mentioned in the discussion (maybe for a reason). 
It might provide build-time information e.g., built_for_android = 'android' in sysconfig.get_config_var('MULTIARCH') assuming the complete value is something like 'arm-linux-android'. It says that the python binary is built for android (the current platform may or may not be Android). -- Akira From guido at python.org Sun Aug 3 17:58:11 2014 From: guido at python.org (Guido van Rossum) Date: Sun, 3 Aug 2014 08:58:11 -0700 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: <53DDC609.2080409@shiz.me> References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> <53DD3896.4050708@shiz.me> <53DD6ADC.9060600@shiz.me> <53DDC609.2080409@shiz.me> Message-ID: But *are* we going to support Android officially? What's the point? Do you have a plan for getting Python apps to first-class status in the App Store (um, Google Play)? Regardless, I recommend that you add a new method to the platform module (careful people can test for the presence of the new method before calling it) and leave poor sys.platform alone. On Sat, Aug 2, 2014 at 10:18 PM, Shiz wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA512 > > Guido van Rossum wrote: > > Well, it really does look like checking for the presence of those > > ANDROID_* environment variables it the best way to recognize the > > Android platform. Anyone can do that without waiting for a ruling on > > whether Android is Linux or not (which would be necessary because the > > docs for sys.platform are quite clear about its value on Linux > > systems). Googling terms like "is Android Linux" suggests that there > > is considerable controversy about the issue, so I suggest you don't > > wait. 
:-) > > Right, which brings us back to the original point I was trying to make: > any chance we could move logic like that into a sys.getandroidversion() > or platform.android_version() so user code (and standard library code > alike) doesn't have to perform those relatively nasty checks themselves? > It seems like a fair thing to do if CPython would support Android as an > official target. > > Kind regards, > Shiz > -----BEGIN PGP SIGNATURE----- > -----END PGP SIGNATURE----- > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From hi at shiz.me Sun Aug 3 18:00:28 2014 From: hi at shiz.me (Shiz) Date: Sun, 03 Aug 2014 18:00:28 +0200 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: <87k36petrp.fsf@gmail.com> References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> <53DD3896.4050708@shiz.me> <53DD6ADC.9060600@shiz.me> <87k36petrp.fsf@gmail.com> Message-ID: <53DE5C9C.2070004@shiz.me> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 Akira Li wrote: > FYI, /bin/sh is not POSIX, see > http://bugs.python.org/issue16353#msg224514 Ah right, my apologies. Android doesn't seem to have getconf(1) either, but sh /is/ on $PATH. Anyway, even if it weren't, os.defpath could be tweaked on Android. > I don't see sysconfig mentioned in the discussion (maybe for a > reason). It might provide build-time information e.g., > > built_for_android = 'android' in > sysconfig.get_config_var('MULTIARCH') > > assuming the complete value is something like 'arm-linux-android'. > It says that the python binary is built for android (the current > platform may or may not be Android). MULTIARCH is empty in my sysconfig (http://txt.shiz.me/MjBmOTQ4). You could possibly match HOST_GNU_TYPE against 'androideabi', even though it still seems a bit fragile. Please ignore MACHDEP/PLATDIR, those are set as a result of me fiddling with sys.platform.
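[Taken together, the suggestions in this thread amount to two separate checks, sketched below. The config-var names come from the messages above; the ANDROID_* environment-variable names are assumptions about Android's process environment, not anything CPython guarantees.]

```python
import os
import sysconfig

def built_for_android():
    # Build-time property: was this interpreter compiled for Android?
    # HOST_GNU_TYPE / MULTIARCH may be empty or absent, as Shiz observed,
    # so treat a missing value as "no".
    for var in ('HOST_GNU_TYPE', 'MULTIARCH'):
        value = sysconfig.get_config_var(var) or ''
        if 'android' in value:
            return True
    return False

def running_on_android():
    # Run-time property: Android sets ANDROID_ROOT and ANDROID_DATA for
    # every process (the ANDROID_* variables Guido mentions above).
    return 'ANDROID_ROOT' in os.environ and 'ANDROID_DATA' in os.environ
```

Note the two can disagree: a cross-built binary copied to a desktop machine is "built for" but not "running on" Android.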
Kind regards, Shiz -----BEGIN PGP SIGNATURE----- -----END PGP SIGNATURE----- From hi at shiz.me Sun Aug 3 18:04:50 2014 From: hi at shiz.me (Shiz) Date: Sun, 03 Aug 2014 18:04:50 +0200 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> <53DD3896.4050708@shiz.me> <53DD6ADC.9060600@shiz.me>
<53DDC609.2080409@shiz.me> Message-ID: <53DE5DA2.6000604@shiz.me> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 Guido van Rossum wrote: > But *are* we going to support Android officially? What's the point? > Do you have a plan for getting Python apps to first-class status in > the App Store (um, Google Play)? > > Regardless, I recommend that you add a new method to the platform > module (careful people can test for the presence of the new method > before calling it) and leave poor sys.platform alone. Well, that is the idea, at least empowering people to write proper Android apps in Python. The first step of that would be making CPython run on Android, the second step would be adding libraries that allow Python users to interface with the Android API. As I said, even if the CPython maintainers are not willing to support Android in the end, I'd at least like my patchset to be done according to CPython development guidelines/principles as close as possible. Adding android_version() to the platform module it is, then. hasattr(platform, 'android_version') is probably an easy enough check for Python users. 
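[That feature-detection idiom can be wrapped up so callers degrade gracefully on interpreters without the method. platform.android_version() is the hypothetical addition discussed here; it does not exist in any released CPython.]

```python
import platform

def android_version_or_none():
    # Returns the Android version info if this interpreter's platform
    # module provides the (proposed, hypothetical) android_version(),
    # and None everywhere else -- the hasattr() check Shiz suggests.
    if hasattr(platform, 'android_version'):
        return platform.android_version()
    return None
```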
Kind regards, Shiz -----BEGIN PGP SIGNATURE----- -----END PGP SIGNATURE----- From phil at riverbankcomputing.com Sun Aug 3 19:16:53 2014 From: phil at riverbankcomputing.com (Phil Thompson) Date: Sun, 03 Aug 2014 18:16:53 +0100 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando>
<53DD3896.4050708@shiz.me> <53DD6ADC.9060600@shiz.me> <53DDC609.2080409@shiz.me> Message-ID: <3fb678cfd0717fd634dd7448109bd932@www.riverbankcomputing.com> On 03/08/2014 4:58 pm, Guido van Rossum wrote: > But *are* we going to support Android officially? What's the point? Do > you > have a plan for getting Python apps to first-class status in the App > Store > (um, Google Play)? I do... http://pyqt.sourceforge.net/Docs/pyqtdeploy/introduction.html Phil From guido at python.org Sun Aug 3 20:17:03 2014 From: guido at python.org (Guido van Rossum) Date: Sun, 3 Aug 2014 11:17:03 -0700 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: <3fb678cfd0717fd634dd7448109bd932@www.riverbankcomputing.com> References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> <53DD3896.4050708@shiz.me> <53DD6ADC.9060600@shiz.me> <53DDC609.2080409@shiz.me> <3fb678cfd0717fd634dd7448109bd932@www.riverbankcomputing.com> Message-ID: On Sun, Aug 3, 2014 at 10:16 AM, Phil Thompson wrote: > On 03/08/2014 4:58 pm, Guido van Rossum wrote: > >> But *are* we going to support Android officially? What's the point? Do you >> have a plan for getting Python apps to first-class status in the App Store >> (um, Google Play)? >> > > I do... > > http://pyqt.sourceforge.net/Docs/pyqtdeploy/introduction.html > > Phil > Oooh, that's pretty cool! -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ncoghlan at gmail.com Mon Aug 4 02:01:14 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 4 Aug 2014 10:01:14 +1000 Subject: [Python-Dev] Exposing the Android platform existence to Python modules In-Reply-To: <3fb678cfd0717fd634dd7448109bd932@www.riverbankcomputing.com> References: <8ECB8F3D-E512-43F7-913F-3E5EC2154D27@shiz.me> <87bns37j2u.fsf@gmail.com> <20140802030634.GH4525@ando> <53DD3896.4050708@shiz.me> <53DD6ADC.9060600@shiz.me> <53DDC609.2080409@shiz.me> <3fb678cfd0717fd634dd7448109bd932@www.riverbankcomputing.com> Message-ID: On 4 Aug 2014 03:18, "Phil Thompson" wrote: > > On 03/08/2014 4:58 pm, Guido van Rossum wrote: >> >> But *are* we going to support Android officially? What's the point? Do you >> have a plan for getting Python apps to first-class status in the App Store >> (um, Google Play)? > > > I do... > > http://pyqt.sourceforge.net/Docs/pyqtdeploy/introduction.html Nice! I've only been skimming this thread, but +1 for Android mostly reading as Linux, but with an extra method in the platform module that gives more details. For those interested in mobile app development, Russell Keith-Magee also announced the release of "toga" [1] here at PyCon AU. That's a Python specific GUI library that maps directly to native widgets (rather than using theming as Kivy does). I mention it as one of the things Russell is specifically looking for is more participation from folks that know the Android side of things :) [1] http://pybee.org/toga/ Cheers, Nick. > > Phil > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From larry at hastings.org Mon Aug 4 09:12:47 2014 From: larry at hastings.org (Larry Hastings) Date: Mon, 04 Aug 2014 17:12:47 +1000 Subject: [Python-Dev] Surely "nullable" is a reasonable name? Message-ID: <53DF326F.9030908@hastings.org> Argument Clinic "converters" specify how to convert an individual argument to the function you're defining. Although a converter could theoretically represent any sort of conversion, most of the time they directly represent types like "int" or "double" or "str". Because there's such variety in argument parsing, the converters are customizable with parameters. Many of these are common enough that Argument Clinic suggests some standard names. Examples: "zeroes=True" for strings and buffers means "permit internal \0 characters", and "bitwise=True" for unsigned integers means "copy the bits over, even if there's overflow/underflow, and even if the original is negative". A third example is "nullable=True", which means "also accept None for this parameter". This was originally intended for use with strings (compare the "s" and "z" format units for PyArg_ParseTuple), however it looks like we'll have a use for "nullable ints" in the ongoing Argument Clinic conversion work. Several people have said they found the name "nullable" surprising, suggesting I use another name like "allow_none" or "noneable". I, in turn, find their surprise surprising; "nullable" is a term long associated with exactly this concept. It's used in C# and SQL, and the term even has its own Wikipedia page: http://en.wikipedia.org/wiki/Nullable_type Most amusingly, Vala *used* to have an annotation called "(allow-none)", but they've broken it out into two annotations, "(nullable)" and "(optional)". http://blogs.gnome.org/desrt/2014/05/27/allow-none-is-dead-long-live-nullable/ Before you say "the term 'nullable' will confuse end users", let me remind you: this is not user-facing. 
This is a parameter for an Argument Clinic converter, and will only ever be seen by CPython core developers. A group which I hope is not so easily confused. It's my contention that "nullable" is the correct name. But I've been asked to bring up the topic for discussion, to see if a consensus forms around this or around some other name. Let the bike-shedding begin, //arry/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From me+python at ixokai.io Mon Aug 4 09:35:39 2014 From: me+python at ixokai.io (Stephen Hansen) Date: Mon, 4 Aug 2014 00:35:39 -0700 Subject: [Python-Dev] Surely "nullable" is a reasonable name? In-Reply-To: <53DF326F.9030908@hastings.org> References: <53DF326F.9030908@hastings.org> Message-ID: On Mon, Aug 4, 2014 at 12:12 AM, Larry Hastings wrote: > > Several people have said they found the name "nullable" surprising, > suggesting I use another name like "allow_none" or "noneable". I, in turn, > find their surprise surprising; "nullable" is a term long associated with > exactly this concept. It's used in C# and SQL, and the term even has its > own Wikipedia page: > The thing is, "null" in these languages is not the same thing. If you look to the various database wrappers there's a lot of controversy about just how to map the SQL NULL to Python: simply mapping it to Python's None becomes strange because the semantics of a SQL NULL or NULL pointer and Python None don't exactly match. Not all that long ago someone was making an argument on this list to add a SQLNULL type object to better map SQL NULL semantics (regards to sorting, as I recall -- but it's been a while) Python has None. Its definition and understanding in a Python context is clear. Why introduce some other concept? In Python it's very common you pass None instead of another argument. > Before you say "the term 'nullable' will confuse end users", let me remind > you: this is not user-facing.
This is a parameter for an Argument Clinic > converter, and will only ever be seen by CPython core developers. A group > which I hope is not so easily confused > Yet, my lurking observation of argument clinic is it is all about clearly defining the C-side of how things are done in Python API's. It may not confuse 'end users', but it may confuse possible contributors, and simply add a lack of clarity to the situation. Passing None in place of another argument is a very Pythonic thing to do; why confuse that by using other words which imply other semantics? None is a Python thing with clear semantics in Python; allow_none quite accurately describes the Pythonic thing described here, while 'nullable' expects for domain knowledge beyond Python and makes assumptions of semantics. /re-lurk --S -------------- next part -------------- An HTML attachment was scrubbed... URL: From v+python at g.nevcal.com Mon Aug 4 09:46:25 2014 From: v+python at g.nevcal.com (Glenn Linderman) Date: Mon, 04 Aug 2014 00:46:25 -0700 Subject: [Python-Dev] Surely "nullable" is a reasonable name? In-Reply-To: References: <53DF326F.9030908@hastings.org> Message-ID: <53DF3A51.90506@g.nevcal.com> On 8/4/2014 12:35 AM, Stephen Hansen wrote: > On Mon, Aug 4, 2014 at 12:12 AM, Larry Hastings > wrote: > > > Several people have said they found the name "nullable" > surprising, suggesting I use another name like "allow_none" or > "noneable". I, in turn, find their surprise surprising; > "nullable" is a term long associated with exactly this concept. > It's used in C# and SQL, and the term even has its own Wikipedia page: > > > The thing is, "null" in these languages are not the same thing. If you > look to the various database wrappers there's a lot of controversy > about just how to map the SQL NULL to Python: simply mapping it to > Python's None becomes strange because the semantics of a SQL NULL or > NULL pointer and Python None don't exactly match. 
Not all that long > ago someone was making an argument on this list to add a SQLNULL type > object to better map SQL NULL semantics (regards to sorting, as I > recall -- but its been awhile) > > Python has None. Its definition and understanding in a Python context > is clear. Why introduce some other concept? In Python its very common > you pass None instead of an other argument. > > Before you say "the term 'nullable' will confuse end users", let > me remind you: this is not user-facing. This is a parameter for > an Argument Clinic converter, and will only ever be seen by > CPython core developers. A group which I hope is not so easily > confused > > > Yet, my lurking observation of argument clinic is it is all about > clearly defining the C-side of how things are done in Python API's. It > may not confuse 'end users', but it may confuse possible contributors, > and simply add a lack of clarity to the situation. > > Passing None in place of another argument is a very Pythonic thing to > do; why confuse that by using other words which imply other semantics? > None is a Python thing with clear semantics in Python; allow_none > quite accurately describes the Pythonic thing described here, while > 'nullable' expects for domain knowledge beyond Python and makes > assumptions of semantics. > > /re-lurk > > --S Thanks, Stephen. +1 to all you wrote. There remains, of course, one potential justification for using "nullable", that you didn't make 100% clear. Because "argument clinic is it is all about clearly defining the C-side of how things are done in Python API's." and that is that C uses NULL (but it is only a convention, not a language feature) for missing reference parameters on occasion. But I think it is much more clear that if C NULL gets mapped to Python None, and we are talking about Python parameters, then a NULLable C parameter should map to an "allow_none" Python parameter. 
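[For readers following along, the naming question can be made concrete with a toy converter. This is purely illustrative -- it is not Argument Clinic's real machinery -- but it shows the semantics being named: the flag, whatever it is called, decides whether None is passed through or rejected.]

```python
def convert_int(value, allow_none=False):
    # allow_none=True here plays the role of Clinic's proposed
    # nullable=True: None is accepted and passed through unchanged.
    if value is None:
        if allow_none:
            return None
        raise TypeError("an integer is required, not None")
    return int(value)
```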
The concepts of C NULL, C# NULL, SQL NULL, and Python None are all slightly different; even the brilliant people on python-dev could better spend their energies on new features and bug fixes rather than being slowed by the need to remember yet another unclear and inconsistent terminology issue, of which there are already too many. Glenn -------------- next part -------------- An HTML attachment was scrubbed... URL: From phd at phdru.name Mon Aug 4 09:39:36 2014 From: phd at phdru.name (Oleg Broytman) Date: Mon, 4 Aug 2014 09:39:36 +0200 Subject: [Python-Dev] Surely "nullable" is a reasonable name? In-Reply-To: <53DF326F.9030908@hastings.org> References: <53DF326F.9030908@hastings.org> Message-ID: <20140804073936.GA9332@phdru.name> Hi! On Mon, Aug 04, 2014 at 05:12:47PM +1000, Larry Hastings wrote: > "nullable=True", which means "also accept None > for this parameter". This was originally intended for use with > strings (compare the "s" and "z" format units for PyArg_ParseTuple), > however it looks like we'll have a use for "nullable ints" in the > ongoing Argument Clinic conversion work. > > Several people have said they found the name "nullable" surprising, > suggesting I use another name like "allow_none" or "noneable". I, > in turn, find their surprise surprising; "nullable" is a term long > associated with exactly this concept. It's used in C# and SQL, and > the term even has its own Wikipedia page: > > http://en.wikipedia.org/wiki/Nullable_type In my very humble opinion, "nullable" is ok, but "allow_none" is better. Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From ncoghlan at gmail.com Mon Aug 4 14:22:17 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 4 Aug 2014 22:22:17 +1000 Subject: [Python-Dev] Surely "nullable" is a reasonable name?
In-Reply-To: <20140804073936.GA9332@phdru.name> References: <53DF326F.9030908@hastings.org> <20140804073936.GA9332@phdru.name> Message-ID: On 4 Aug 2014 18:16, "Oleg Broytman" wrote: > > Hi! > > On Mon, Aug 04, 2014 at 05:12:47PM +1000, Larry Hastings < larry at hastings.org> wrote: > > "nullable=True", which means "also accept None > > for this parameter". This was originally intended for use with > > strings (compare the "s" and "z" format units for PyArg_ParseTuple), > > however it looks like we'll have a use for "nullable ints" in the > > ongoing Argument Clinic conversion work. > > > > Several people have said they found the name "nullable" surprising, > > suggesting I use another name like "allow_none" or "noneable". I, > > in turn, find their surprise surprising; "nullable" is a term long > > associated with exactly this concept. It's used in C# and SQL, and > > the term even has its own Wikipedia page: > > > > http://en.wikipedia.org/wiki/Nullable_type > > In my very humble opinion, "nullable" is ok, but "allow_none" is > better. Yup, this is where I stand as well. The main concern I have with nullable is that we *are* writing C code when dealing with Argument Clinic, and "nullable" may make me think of a C NULL rather than Python's None. Cheers, Nick. > > Oleg. > -- > Oleg Broytman http://phdru.name/ phd at phdru.name > Programmers don't die, they just GOSUB without RETURN. > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From antoine at python.org Mon Aug 4 15:06:36 2014 From: antoine at python.org (Antoine Pitrou) Date: Mon, 04 Aug 2014 09:06:36 -0400 Subject: [Python-Dev] Surely "nullable" is a reasonable name? 
In-Reply-To: References: <53DF326F.9030908@hastings.org> Message-ID: Le 04/08/2014 03:35, Stephen Hansen a écrit : > > Before you say "the term 'nullable' will confuse end users", let me > remind you: this is not user-facing. This is a parameter for an > Argument Clinic converter, and will only ever be seen by CPython > core developers. A group which I hope is not so easily confused > > > Yet, my lurking observation of argument clinic is it is all about > clearly defining the C-side of how things are done in Python API's. It > may not confuse 'end users', but it may confuse possible contributors, > and simply add a lack of clarity to the situation. That's a rather good point, and I agree with Stephen here. Even core contributors can deserve clarity and the occasional non-confusing notation :-) Regards Antoine. From njs at pobox.com Mon Aug 4 12:19:38 2014 From: njs at pobox.com (Nathaniel Smith) Date: Mon, 4 Aug 2014 11:19:38 +0100 Subject: [Python-Dev] Surely "nullable" is a reasonable name? In-Reply-To: <53DF326F.9030908@hastings.org> References: <53DF326F.9030908@hastings.org> Message-ID: I admit I spent the first half of the email scratching my head and trying to figure out what NULL had to do with argument clinic specs. (Maybe it would mean that if the argument is "not given" in some appropriate way then we set the corresponding C variable to NULL?) Finding out you were talking about None came as a surprising twist. -n On 4 Aug 2014 08:13, "Larry Hastings" wrote:
Examples: "zeroes=True" for > strings and buffers means "permit internal \0 characters", and > "bitwise=True" for unsigned integers means "copy the bits over, even if > there's overflow/underflow, and even if the original is negative". > > A third example is "nullable=True", which means "also accept None for this > parameter". This was originally intended for use with strings (compare the > "s" and "z" format units for PyArg_ParseTuple), however it looks like we'll > have a use for "nullable ints" in the ongoing Argument Clinic conversion > work. > > Several people have said they found the name "nullable" surprising, > suggesting I use another name like "allow_none" or "noneable". I, in turn, > find their surprise surprising; "nullable" is a term long associated with > exactly this concept. It's used in C# and SQL, and the term even has its > own Wikipedia page: > > http://en.wikipedia.org/wiki/Nullable_type > > Most amusingly, Vala *used* to have an annotation called "(allow-none)", > but they've broken it out into two annotations, "(nullable)" and > "(optional)". > > > http://blogs.gnome.org/desrt/2014/05/27/allow-none-is-dead-long-live-nullable/ > > > Before you say "the term 'nullable' will confuse end users", let me remind > you: this is not user-facing. This is a parameter for an Argument Clinic > converter, and will only ever be seen by CPython core developers. A group > which I hope is not so easily confused. > > It's my contention that "nullable" is the correct name. But I've been > asked to bring up the topic for discussion, to see if a consensus forms > around this or around some other name. > > Let the bike-shedding begin, > > > */arry* > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > https://mail.python.org/mailman/options/python-dev/njs%40pobox.com > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From chris.barker at noaa.gov Mon Aug 4 18:25:12 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Mon, 4 Aug 2014 09:25:12 -0700 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <20140802203513.GA10447@k2> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> Message-ID: On Sat, Aug 2, 2014 at 1:35 PM, David Wilson wrote: > > Repeated list and str concatenation both have quadratic O(N**2) > > performance, but people frequently build up strings with + > join() isn't preferable in cases where it damages readability while > simultaneously providing zero or negative performance benefit, such as > when concatenating a few short strings, e.g. while adding a prefix to a > filename. > Good point -- I was trying to make the point about .join() vs + for strings in an intro python class last year, and made the mistake of having the students test the performance. You need to concatenate a LOT of strings to see any difference at all -- I know that O() of algorithms is unavoidable, but between efficient python optimizations and an apparently good memory allocator, it's really a practical non-issue. > Although it's true that join() is automatically the safer option, and > especially when dealing with user supplied data, the net harm caused by > teaching rote and ceremony seems far less desirable compared to fixing a > trivial slowdown in a script, if that slowdown ever became apparent. > and it rarely would. Blocking sum( some_strings) because it _might_ have poor performance seems awfully pedantic. As a long-time numpy user, I think sum(a_long_list_of_numbers) has pathetically bad performance, but I wouldn't block it! -Chris -- Christopher Barker, Ph.D.
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From larry at hastings.org Mon Aug 4 18:56:36 2014 From: larry at hastings.org (Larry Hastings) Date: Tue, 05 Aug 2014 02:56:36 +1000 Subject: [Python-Dev] Surely "nullable" is a reasonable name? In-Reply-To: <53DF3A51.90506@g.nevcal.com> References: <53DF326F.9030908@hastings.org> <53DF3A51.90506@g.nevcal.com> Message-ID: <53DFBB44.7070501@hastings.org> On 08/04/2014 05:46 PM, Glenn Linderman wrote: > There remains, of course, one potential justification for using > "nullable", that you didn't make 100% clear. Because "argument clinic > is it is all about clearly defining the C-side of how things are done > in Python API's." and that is that C uses NULL (but it is only a > convention, not a language feature) for missing reference parameters > on occasion. But I think it is much more clear that if C NULL gets > mapped to Python None, and we are talking about Python parameters, > then a NULLable C parameter should map to an "allow_none" Python > parameter. Argument Clinic defines *both* sides of how things are done in builtins, both C and Python. So it's a bit messier than that. Currently the "nullable" flag is only applicable to certain converters which output pointer types in C, so if it gets a None for that argument it does provide a NULL as the C equivalent. But in the "nullable int" patch obviously I can't do that. Instead you get a structure containing either an int or a flag specifying "you got a None", currently named "is_null". So I don't think your proposed additional justification helps. Of course, in my opinion I don't need this additional justification. Python's "None" is its null object. 
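[A rough Python analogue of the C structure described above -- the field name "is_null" is the one Larry mentions; the rest is illustrative, not Argument Clinic's actual output.]

```python
from collections import namedtuple

# The converter hands the implementation either a real int, or a flag
# saying "the caller passed None" -- the "nullable int" shape.
NullableInt = namedtuple('NullableInt', ['value', 'is_null'])

def as_nullable_int(obj):
    if obj is None:
        return NullableInt(0, True)
    return NullableInt(int(obj), False)
```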
And we already have the concept of "nullable types" in computer science, for exactly, *exactly!*, this concept. As the Zen says, "special cases aren't special enough to break the rules". Just because Python is silly enough to name its null object "None" doesn't mean we have to warp all our other names around it. //arry/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From ethan at stoneleaf.us Mon Aug 4 18:57:03 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Mon, 04 Aug 2014 09:57:03 -0700 Subject: [Python-Dev] Surely "nullable" is a reasonable name? In-Reply-To: <53DF326F.9030908@hastings.org> References: <53DF326F.9030908@hastings.org> Message-ID: <53DFBB5F.3020001@stoneleaf.us> On 08/04/2014 12:12 AM, Larry Hastings wrote: > > It's my contention that "nullable" is the correct name. But I've been asked to bring up the topic for discussion, to > see if a consensus forms around this or around some other name. > > Let the bike-shedding begin, I think the original name is okay, but 'allow_none' is definitely clearer. -- ~Ethan~ From alexander.belopolsky at gmail.com Mon Aug 4 19:36:39 2014 From: alexander.belopolsky at gmail.com (Alexander Belopolsky) Date: Mon, 4 Aug 2014 13:36:39 -0400 Subject: [Python-Dev] Surely "nullable" is a reasonable name? In-Reply-To: <53DFBB5F.3020001@stoneleaf.us> References: <53DF326F.9030908@hastings.org> <53DFBB5F.3020001@stoneleaf.us> Message-ID: On Mon, Aug 4, 2014 at 12:57 PM, Ethan Furman wrote: > 'allow_none' is definitely clearer. I disagree. Unlike "nullable", "allow_none" does not tell me what happens on the C side when I pass in None. If the receiving type is PyObject*, either NULL or Py_None is a valid choice. -------------- next part -------------- An HTML attachment was scrubbed... URL: From antoine at python.org Mon Aug 4 19:53:19 2014 From: antoine at python.org (Antoine Pitrou) Date: Mon, 04 Aug 2014 13:53:19 -0400 Subject: [Python-Dev] Surely "nullable" is a reasonable name? 
In-Reply-To: References: <53DF326F.9030908@hastings.org> <53DFBB5F.3020001@stoneleaf.us> Message-ID: Le 04/08/2014 13:36, Alexander Belopolsky a écrit : > > On Mon, Aug 4, 2014 at 12:57 PM, Ethan Furman > wrote: > > 'allow_none' is definitely clearer. > > > I disagree. Unlike "nullable", "allow_none" does not tell me what > happens on the C side when I pass in None. If the receiving type is > PyObject*, either NULL or Py_None is a valid choice. But here the receiving type can be an int. Regards Antoine. From alexander.belopolsky at gmail.com Mon Aug 4 20:04:05 2014 From: alexander.belopolsky at gmail.com (Alexander Belopolsky) Date: Mon, 4 Aug 2014 14:04:05 -0400 Subject: [Python-Dev] Surely "nullable" is a reasonable name? In-Reply-To: References: <53DF326F.9030908@hastings.org> <53DFBB5F.3020001@stoneleaf.us> Message-ID: On Mon, Aug 4, 2014 at 1:53 PM, Antoine Pitrou wrote: > I disagree. Unlike "nullable", "allow_none" does not tell me what >> happens on the C side when I pass in None. If the receiving type is >> PyObject*, either NULL or Py_None is a valid choice. >> > > But here the receiving type can be an int. We cannot "allow None" when the receiving type is C int. In this case, we need a way to implement "nullable int" type in C. We can use int * or a pair of int and _Bool or anything else. Whatever the implementation, the concept that is implemented is "nullable int." The advantage of using the term "nullable" is that it is language and implementation neutral.
limitation In-Reply-To: References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> Message-ID: <20140804181013.GO4525@ando> On Mon, Aug 04, 2014 at 09:25:12AM -0700, Chris Barker wrote: > Good point -- I was trying to make the point about .join() vs + for strings > in an intro python class last year, and made the mistake of having the > students test the performance. > > You need to concatenate a LOT of strings to see any difference at all -- I > know that O() of algorithms is unavoidable, but between efficient python > optimizations and a an apparently good memory allocator, it's really a > practical non-issue. If only that were the case, but it isn't. Here's a cautionary tale for how using string concatenation can blow up in your face: Chris Withers asks for help debugging HTTP slowness: https://mail.python.org/pipermail/python-dev/2009-August/091125.html and publishes some times: https://mail.python.org/pipermail/python-dev/2009-September/091581.html (notice that Python was SIX HUNDRED times slower than wget or IE) and Simon Cross identified the problem: https://mail.python.org/pipermail/python-dev/2009-September/091582.html leading Guido to describe the offending code as an embarrassment. It shouldn't be hard to demonstrate the difference between repeated string concatenation and join, all you need do is defeat sum()'s prohibition against strings. 
Run this bit of code, and you'll see a significant difference in performance, even with CPython's optimized concatenation: # --- cut --- class Faker: def __add__(self, other): return other x = Faker() strings = list("Hello World!") assert ''.join(strings) == sum(strings, x) from timeit import Timer setup = "from __main__ import x, strings" t1 = Timer("''.join(strings)", setup) t2 = Timer("sum(strings, x)", setup) print (min(t1.repeat())) print (min(t2.repeat())) # --- cut --- On my computer, using Python 2.7, I find the version using sum is nearly 4.5 times slower, and with 3.3 about 4.2 times slower. That's with a mere twelve substrings, hardly "a lot". I tried running it on IronPython with a slightly larger list of substrings, but I got sick of waiting for it to finish. If you want to argue that microbenchmarks aren't important, well, I might agree with you in general, but in the specific case of string concatenation there's that pesky factor of 600 slowdown in real world code to argue with. > Blocking sum( some_strings) because it _might_ have poor performance seems > awfully pedantic. The rationale for explicitly prohibiting strings while merely implicitly discouraging other non-numeric types is that beginners, who are least likely to understand why their code occasionally and unpredictably becomes catastrophically slow, are far more likely to sum strings than sum tuples or lists. (I don't entirely agree with this rationale, I'd prefer a warning rather than an exception.) -- Steven From larry at hastings.org Mon Aug 4 20:18:44 2014 From: larry at hastings.org (Larry Hastings) Date: Tue, 05 Aug 2014 04:18:44 +1000 Subject: [Python-Dev] Surely "nullable" is a reasonable name? 
In-Reply-To: References: <53DF326F.9030908@hastings.org> <53DFBB5F.3020001@stoneleaf.us> Message-ID: <53DFCE84.30204@hastings.org> On 08/05/2014 03:53 AM, Antoine Pitrou wrote: > Le 04/08/2014 13:36, Alexander Belopolsky a écrit : >> If the receiving type is PyObject*, either NULL or Py_None is a valid >> choice. > But here the receiving type can be an int. Just to be precise: in the case where the receiving type *would* have been an int, and "nullable=True", the receiving type is actually a structure containing an int and a "you got a None" flag. I can't stick a magic value in the int and say "that represents you getting a None" because any integer value may be valid. Also, I'm pretty sure there are places in builtin argument parsing that accept either NULL or Py_None, and I *think* maybe in one or two of them they actually mean different things. What fun! For small values of "fun", //arry/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From antoine at python.org Mon Aug 4 20:37:54 2014 From: antoine at python.org (Antoine Pitrou) Date: Mon, 04 Aug 2014 14:37:54 -0400 Subject: [Python-Dev] Surely "nullable" is a reasonable name? In-Reply-To: <53DFCE84.30204@hastings.org> References: <53DF326F.9030908@hastings.org> <53DFBB5F.3020001@stoneleaf.us> <53DFCE84.30204@hastings.org> Message-ID: Le 04/08/2014 14:18, Larry Hastings a écrit : > > On 08/05/2014 03:53 AM, Antoine Pitrou wrote: >> Le 04/08/2014 13:36, Alexander Belopolsky a écrit : >>> If the receiving type is PyObject*, either NULL or Py_None is a valid >>> choice. >> But here the receiving type can be an int. > > Just to be precise: in the case where the receiving type *would* have > been an int, and "nullable=True", the receiving type is actually a > structure containing an int and a "you got a None" flag. I can't stick a > magic value in the int and say "that represents you getting a None" > because any integer value may be valid.
> > Also, I'm pretty sure there are places in builtin argument parsing that > accept either NULL or Py_None, and I *think* maybe in one or two of them > they actually mean different things. What fun! > > > For small values of "fun", Is -909 too large a value to be fun? Regards Antoine. From stefan_ml at behnel.de Mon Aug 4 21:14:49 2014 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 04 Aug 2014 21:14:49 +0200 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <20140804181013.GO4525@ando> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> Message-ID: Steven D'Aprano schrieb am 04.08.2014 um 20:10: > On Mon, Aug 04, 2014 at 09:25:12AM -0700, Chris Barker wrote: > >> Good point -- I was trying to make the point about .join() vs + for strings >> in an intro python class last year, and made the mistake of having the >> students test the performance. >> >> You need to concatenate a LOT of strings to see any difference at all -- I >> know that O() of algorithms is unavoidable, but between efficient python >> optimizations and a an apparently good memory allocator, it's really a >> practical non-issue. > > If only that were the case, but it isn't. Here's a cautionary tale for > how using string concatenation can blow up in your face: > > Chris Withers asks for help debugging HTTP slowness: > https://mail.python.org/pipermail/python-dev/2009-August/091125.html > > and publishes some times: > https://mail.python.org/pipermail/python-dev/2009-September/091581.html > > (notice that Python was SIX HUNDRED times slower than wget or IE) > > and Simon Cross identified the problem: > https://mail.python.org/pipermail/python-dev/2009-September/091582.html > > leading Guido to describe the offending code as an embarrassment. Thanks for digging up that story. >> Blocking sum( some_strings) because it _might_ have poor performance seems >> awfully pedantic. 
> > The rationale for explicitly prohibiting strings while merely implicitly > discouraging other non-numeric types is that beginners, who are least > likely to understand why their code occasionally and unpredictably > becomes catastrophically slow, are far more likely to sum strings than > sum tuples or lists. Well, the obvious difference between strings and lists (not tuples) is that strings are immutable, so it would seem more obvious at first sight to concatenate strings than to do the same thing with lists, which can easily be extended (they are clearly designed for that). This rationale may not apply as much to beginners as to more experienced programmers, but it should still explain why this is so often discussed in the context of string concatenation and pretty much never for lists. As for tuples, their most common use case is to represent a fixed length sequence of semantically different values. That renders their concatenation a sufficiently uncommon use case to make no-one ask loudly for "large scale" sum(tuples) support. Basically, extending lists is an obvious thing, but getting multiple strings joined without using "+"-concatenating them isn't. Stefan From jimjjewett at gmail.com Mon Aug 4 22:22:27 2014 From: jimjjewett at gmail.com (Jim J. Jewett) Date: Mon, 04 Aug 2014 13:22:27 -0700 (PDT) Subject: [Python-Dev] sum(...) limitation In-Reply-To: <53DCB96A.8050809@googlemail.com> Message-ID: <53dfeb83.c36fe00a.1596.65c9@mx.google.com> Sat Aug 2 12:11:54 CEST 2014, Julian Taylor wrote (in https://mail.python.org/pipermail/python-dev/2014-August/135623.html ): > Andrea Griffini wrote: >> However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists. > hm could this be a pure python case that would profit from temporary > elision [ https://mail.python.org/pipermail/python-dev/2014-June/134826.html ]? > lists could declare the tp_can_elide slot and call list.extend on the > temporary during its tp_add slot instead of creating a new temporary.
> extend/realloc can avoid the copy if there is free memory available > after the block. Yes, with all the same problems. When dealing with a complex object, how can you be sure that __add__ won't need access to the original values during the entire computation? It works with matrix addition, but not with matrix multiplication. Depending on the details of the implementation, it could even fail for a sort of sliding-neighbor addition similar to the original justification. Of course, then those tricky implementations should not define an _eliding_add_, but maybe the builtin objects still should? After all, a plain old list is OK to re-use. Unless the first evaluation to create it ends up evaluating an item that has side effects... In the end, it looks like a lot of machinery (and extra checks that may slow down the normal small-object case) for something that won't be used all that often. Though it is really tempting to consider a compilation mode that assumes objects and builtins will be "normal", and lets you replace the entire above expression with compile-time [1, 2, 3, 4, 5, 6]. Would writing objects to that stricter standard and encouraging its use (and maybe offering a few AST transforms to auto-generate the out-parameters?) work as well for those who do need the speed? -jJ -- If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ From taleinat at gmail.com Tue Aug 5 12:08:05 2014 From: taleinat at gmail.com (Tal Einat) Date: Tue, 5 Aug 2014 13:08:05 +0300 Subject: [Python-Dev] Surely "nullable" is a reasonable name? In-Reply-To: <53DF326F.9030908@hastings.org> References: <53DF326F.9030908@hastings.org> Message-ID: On Mon, Aug 4, 2014 at 10:12 AM, Larry Hastings wrote: > > It's my contention that "nullable" is the correct name. But I've been asked > to bring up the topic for discussion, to see if a consensus forms around > this or around some other name.
> > Let the bike-shedding begin, > > > /arry +1 for some form of "allow None" rather than "nullable". - Tal Einat From martin at v.loewis.de Tue Aug 5 17:13:12 2014 From: martin at v.loewis.de ("Martin v. Löwis") Date: Tue, 05 Aug 2014 17:13:12 +0200 Subject: [Python-Dev] Surely "nullable" is a reasonable name? In-Reply-To: <53DF326F.9030908@hastings.org> References: <53DF326F.9030908@hastings.org> Message-ID: <53E0F488.1090105@v.loewis.de> Am 04.08.14 09:12, schrieb Larry Hastings: > It's my contention that "nullable" is the correct name. But I've been > asked to bring up the topic for discussion, to see if a consensus forms > around this or around some other name. I have personally no problems with calling a type "nullable" even in Python, and, as a type *adjective* this seems to be the right choice (i.e. I wouldn't say "noneable int" or "allow_none int"; the former is no established or intuitive term, the latter is not an adjective). As a type *flag*, flexibility in naming is greater. zeroes=True formally creates a subtype (of string), and it doesn't hurt that it is not an adjective. "allow_zeroes" might be more descriptive. bitwise=True doesn't really create a subtype of int. For the feature in question, I find both "allow_none" and "nullable" acceptable; "noneable" is not.
Regards, Martin From ischwabacher at wisc.edu Thu Aug 7 00:36:37 2014 From: ischwabacher at wisc.edu (Isaac Schwabacher) Date: Wed, 06 Aug 2014 17:36:37 -0500 Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039) In-Reply-To: <7620eb9316499.53e2acf5@wiscmail.wisc.edu> References: <7610bb2511765.53e2a8af@wiscmail.wisc.edu> <7720c7201089e.53e2a8eb@wiscmail.wisc.edu> <75109da9179bf.53e2a92b@wiscmail.wisc.edu> <761082131730e.53e2a967@wiscmail.wisc.edu> <7780ac0016cba.53e2a9a3@wiscmail.wisc.edu> <763086e915f3d.53e2a9e0@wiscmail.wisc.edu> <7600e07a110a6.53e2aa1c@wiscmail.wisc.edu> <7660d0b6127e1.53e2aa58@wiscmail.wisc.edu> <7660a12a17d35.53e2aad3@wiscmail.wisc.edu> <7510b3091081e.53e2ab10@wiscmail.wisc.edu> <7740962212b5e.53e2ab4c@wiscmail.wisc.edu> <76f0afab10193.53e2ab88@wiscmail.wisc.edu> <7690af93164a2.53e2abc5@wiscmail.wisc.edu> <7690800915d00.53e2ac01@wiscmail.wisc.edu> <7780ff531168d.53e2ac3d@wiscmail.wisc.edu> <7620b637135c3.53e2ac7a@wiscmail.wisc.edu> <7620eb9316499.53e2acf5@wiscmail.wisc.edu> Message-ID: <7740eea410ec7.53e267a5@wiscmail.wisc.edu> pathlib.Path currently strips trailing slashes from pathnames, but this behavior contradicts POSIX (http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12), which specifies that the resolution of the pathname of a symbolic link to a directory in the context of a function that operates on symbolic links shall depend on whether the pathname has a trailing slash: > 4.12 Pathname Resolution > ======================== > > [...] > > A pathname that contains at least one non-<slash> character and that ends with one or more trailing <slash> characters shall not be resolved successfully unless the last pathname component before the trailing <slash> characters names an existing directory or a directory entry that is to be created for a directory immediately after the pathname is resolved.
Interfaces using pathname resolution may specify additional constraints[1] when a pathname that does not name an existing directory contains at least one non-<slash> character and contains one or more trailing <slash> characters. > > If a symbolic link is encountered during pathname resolution, the behavior shall depend on whether the pathname component is at the end of the pathname and on the function being performed. If all of the following are true, then pathname resolution is complete: > > 1. This is the last pathname component of the pathname. > 2. The pathname has no trailing <slash> characters. > 3. The function is required to act on the symbolic link itself, or certain arguments direct that the function act on the symbolic link itself. > > In all other cases, the system shall prefix the remaining pathname, if any, with the contents of the symbolic link. [...] The following sentence appeared in an earlier version of POSIX (http://pubs.opengroup.org/onlinepubs/009604499/basedefs/xbd_chap04.html#tag_04_11) but has since been removed: > A pathname that contains at least one non-slash character and that ends with one or more trailing slashes shall be resolved as if a single dot character ( '.' ) were appended to the pathname. Is this important enough to preserve trailing slashes?
- Isaac Schwabacher From antoine at python.org Thu Aug 7 02:11:36 2014 From: antoine at python.org (Antoine Pitrou) Date: Wed, 06 Aug 2014 20:11:36 -0400 Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039) In-Reply-To: <7740eea410ec7.53e267a5@wiscmail.wisc.edu> References: <7610bb2511765.53e2a8af@wiscmail.wisc.edu> <7720c7201089e.53e2a8eb@wiscmail.wisc.edu> <75109da9179bf.53e2a92b@wiscmail.wisc.edu> <761082131730e.53e2a967@wiscmail.wisc.edu> <7780ac0016cba.53e2a9a3@wiscmail.wisc.edu> <763086e915f3d.53e2a9e0@wiscmail.wisc.edu> <7600e07a110a6.53e2aa1c@wiscmail.wisc.edu> <7660d0b6127e1.53e2aa58@wiscmail.wisc.edu> <7660a12a17d35.53e2aad3@wiscmail.wisc.edu> <7510b3091081e.53e2ab10@wiscmail.wisc.edu> <7740962212b5e.53e2ab4c@wiscmail.wisc.edu> <76f0afab10193.53e2ab88@wiscmail.wisc.edu> <7690af93164a2.53e2abc5@wiscmail.wisc.edu> <7690800915d00.53e2ac01@wiscmail.wisc.edu> <7780ff531168d.53e2ac3d@wiscmail.wisc.edu> <7620b637135c3.53e2ac7a@wiscmail.wisc.edu> <7620eb9316499.53e2acf5@wiscmail.wisc.edu> <7740eea410ec7.53e267a5@wiscmail.wisc.edu> Message-ID: Le 06/08/2014 18:36, Isaac Schwabacher a écrit : >> >> If a symbolic link is encountered during pathname resolution, the >> behavior shall depend on whether the pathname component is at the >> end of the pathname and on the function being performed. If all of >> the following are true, then pathname resolution is complete: >> >> 1. This is the last pathname component of the pathname. 2. The >> pathname has no trailing <slash> characters. 3. The function is required to >> act on the symbolic link itself, or certain arguments direct that >> the function act on the symbolic link itself. >> >> In all other cases, the system shall prefix the remaining pathname, >> if any, with the contents of the symbolic link. [...]
So the only case where this would make a difference is when calling a "function acting on the symbolic link itself" (such as lstat() or unlink()) on a path with a trailing slash: >>> os.lstat('foo') os.stat_result(st_mode=41471, st_ino=1981954, st_dev=2050, st_nlink=1, st_uid=1000, st_gid=1000, st_size=4, st_atime=1407370025, st_mtime=1407370025, st_ctime=1407370025) >>> os.lstat('foo/') os.stat_result(st_mode=17407, st_ino=917505, st_dev=2050, st_nlink=7, st_uid=0, st_gid=0, st_size=4096, st_atime=1407367916, st_mtime=1407369857, st_ctime=1407369857) >>> pathlib.Path('foo').lstat() os.stat_result(st_mode=41471, st_ino=1981954, st_dev=2050, st_nlink=1, st_uid=1000, st_gid=1000, st_size=4, st_atime=1407370037, st_mtime=1407370025, st_ctime=1407370025) >>> pathlib.Path('foo/').lstat() os.stat_result(st_mode=41471, st_ino=1981954, st_dev=2050, st_nlink=1, st_uid=1000, st_gid=1000, st_size=4, st_atime=1407370037, st_mtime=1407370025, st_ctime=1407370025) But you can also call resolve() explicitly if you want to act on the link target rather than the link itself: >>> pathlib.Path('foo/').resolve().lstat() os.stat_result(st_mode=17407, st_ino=917505, st_dev=2050, st_nlink=7, st_uid=0, st_gid=0, st_size=4096, st_atime=1407367916, st_mtime=1407369857, st_ctime=1407369857) Am I overlooking other cases? Regards Antoine. 
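(For reference, the stripping behaviour Antoine demonstrates above is visible with pure paths alone, no filesystem required; the following is an illustrative sketch, not code from the thread:)

```python
import posixpath
from pathlib import PurePosixPath

# pathlib drops the trailing slash while parsing the path string...
p = PurePosixPath("foo/")
print(str(p))                     # -> foo
print(p == PurePosixPath("foo"))  # -> True

# ...whereas the os.path layer keeps it visible as an empty last component:
head, tail = posixpath.split("foo/")
print((head, tail))               # -> ('foo', '')
print(posixpath.join(head, tail)) # -> foo/
```

This is why pathlib.Path('foo/').lstat() and pathlib.Path('foo').lstat() above give identical results: by the time the system call is made, the two paths are the same object.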
From alexander.belopolsky at gmail.com Thu Aug 7 02:50:14 2014 From: alexander.belopolsky at gmail.com (Alexander Belopolsky) Date: Wed, 6 Aug 2014 20:50:14 -0400 Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039) In-Reply-To: References: <7610bb2511765.53e2a8af@wiscmail.wisc.edu> <7720c7201089e.53e2a8eb@wiscmail.wisc.edu> <75109da9179bf.53e2a92b@wiscmail.wisc.edu> <761082131730e.53e2a967@wiscmail.wisc.edu> <7780ac0016cba.53e2a9a3@wiscmail.wisc.edu> <763086e915f3d.53e2a9e0@wiscmail.wisc.edu> <7600e07a110a6.53e2aa1c@wiscmail.wisc.edu> <7660d0b6127e1.53e2aa58@wiscmail.wisc.edu> <7660a12a17d35.53e2aad3@wiscmail.wisc.edu> <7510b3091081e.53e2ab10@wiscmail.wisc.edu> <7740962212b5e.53e2ab4c@wiscmail.wisc.edu> <76f0afab10193.53e2ab88@wiscmail.wisc.edu> <7690af93164a2.53e2abc5@wiscmail.wisc.edu> <7690800915d00.53e2ac01@wiscmail.wisc.edu> <7780ff531168d.53e2ac3d@wiscmail.wisc.edu> <7620b637135c3.53e2ac7a@wiscmail.wisc.edu> <7620eb9316499.53e2acf5@wiscmail.wisc.edu> <7740eea410ec7.53e267a5@wiscmail.wisc.edu> Message-ID: On Wed, Aug 6, 2014 at 8:11 PM, Antoine Pitrou wrote: > Am I overlooking other cases? There are many interfaces where trailing slash is significant. For example, rsync uses trailing slash on the target directory to avoid creating an additional directory level at the destination. Loosing it when passing path strings through pathlib.Path() may be a source of bugs. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From antoine at python.org Thu Aug 7 03:55:14 2014 From: antoine at python.org (Antoine Pitrou) Date: Wed, 06 Aug 2014 21:55:14 -0400 Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039) In-Reply-To: References: <7610bb2511765.53e2a8af@wiscmail.wisc.edu> <7720c7201089e.53e2a8eb@wiscmail.wisc.edu> <75109da9179bf.53e2a92b@wiscmail.wisc.edu> <761082131730e.53e2a967@wiscmail.wisc.edu> <7780ac0016cba.53e2a9a3@wiscmail.wisc.edu> <763086e915f3d.53e2a9e0@wiscmail.wisc.edu> <7600e07a110a6.53e2aa1c@wiscmail.wisc.edu> <7660d0b6127e1.53e2aa58@wiscmail.wisc.edu> <7660a12a17d35.53e2aad3@wiscmail.wisc.edu> <7510b3091081e.53e2ab10@wiscmail.wisc.edu> <7740962212b5e.53e2ab4c@wiscmail.wisc.edu> <76f0afab10193.53e2ab88@wiscmail.wisc.edu> <7690af93164a2.53e2abc5@wiscmail.wisc.edu> <7690800915d00.53e2ac01@wiscmail.wisc.edu> <7780ff531168d.53e2ac3d@wiscmail.wisc.edu> <7620b637135c3.53e2ac7a@wiscmail.wisc.edu> <7620eb9316499.53e2acf5@wiscmail.wisc.edu> <7740eea410ec7.53e267a5@wiscmail.wisc.edu> Message-ID: Le 06/08/2014 20:50, Alexander Belopolsky a écrit : > On Wed, Aug 6, 2014 at 8:11 PM, Antoine Pitrou > wrote: > > Am I overlooking other cases? > > There are many interfaces where trailing slash is significant. For > example, rsync uses trailing slash on the target directory to avoid > creating an additional directory level at the destination. Loosing it > when passing path strings through pathlib.Path() may be a source of bugs. pathlib is generally concerned with filesystem operations written in Python, not arbitrary third-party tools. Also it is probably easy to append the trailing slash in your command-line invocation, if so desired. Regards Antoine.
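(If code does need rsync-style semantics, the trailing slash can be tracked alongside the Path round-trip as Antoine suggests; a minimal sketch, with a helper name made up for illustration:)

```python
from pathlib import PurePosixPath

def with_trailing_slash_preserved(raw: str) -> str:
    """Normalize raw through PurePosixPath, then re-append the trailing
    slash if the input had one (hypothetical helper, not part of pathlib)."""
    cleaned = str(PurePosixPath(raw))
    if raw.endswith("/") and not cleaned.endswith("/"):
        cleaned += "/"
    return cleaned

print(with_trailing_slash_preserved("src/pkg/"))  # -> src/pkg/
print(with_trailing_slash_preserved("src/pkg"))   # -> src/pkg
```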
From ben+python at benfinney.id.au Thu Aug 7 04:12:30 2014 From: ben+python at benfinney.id.au (Ben Finney) Date: Thu, 07 Aug 2014 12:12:30 +1000 Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039) References: <7610bb2511765.53e2a8af@wiscmail.wisc.edu> <7720c7201089e.53e2a8eb@wiscmail.wisc.edu> <75109da9179bf.53e2a92b@wiscmail.wisc.edu> <761082131730e.53e2a967@wiscmail.wisc.edu> <7780ac0016cba.53e2a9a3@wiscmail.wisc.edu> <763086e915f3d.53e2a9e0@wiscmail.wisc.edu> <7600e07a110a6.53e2aa1c@wiscmail.wisc.edu> <7660d0b6127e1.53e2aa58@wiscmail.wisc.edu> <7660a12a17d35.53e2aad3@wiscmail.wisc.edu> <7510b3091081e.53e2ab10@wiscmail.wisc.edu> <7740962212b5e.53e2ab4c@wiscmail.wisc.edu> <76f0afab10193.53e2ab88@wiscmail.wisc.edu> <7690af93164a2.53e2abc5@wiscmail.wisc.edu> <7690800915d00.53e2ac01@wiscmail.wisc.edu> <7780ff531168d.53e2ac3d@wiscmail.wisc.edu> <7620b637135c3.53e2ac7a@wiscmail.wisc.edu> <7620eb9316499.53e2acf5@wiscmail.wisc.edu> <7740eea410ec7.53e267a5@wiscmail.wisc.edu> Message-ID: <85lhr19hf5.fsf@benfinney.id.au> Antoine Pitrou writes: > Le 06/08/2014 20:50, Alexander Belopolsky a écrit : > > There are many interfaces where trailing slash is significant. [...] > > Loosing it when passing path strings through pathlib.Path() may be a > > source of bugs. > > pathlib is generally concerned with filesystem operations written in > Python, not arbitrary third-party tools. The operating system shell is more than an "arbitrary third-party tool", though; it preserves paths, and handles invoking commands. You seem to be saying that "pathlib" is not intended to be helpful for constructing a shell command. Will its documentation warn that is so? > Also it is probably easy to append the trailing slash in your > command-line invocation, if so desired. The trouble is that one can desire it, and construct a path knowing that the presence or absence of a trailing slash has semantic significance; and then have it unaccountably altered by the pathlib.Path code.
This is worse than preserving the semantic value. -- \ "But Marge, what if we chose the wrong religion? Each week we | `\ just make God madder and madder." --Homer, _The Simpsons_ | _o__) | Ben Finney From antoine at python.org Thu Aug 7 04:30:52 2014 From: antoine at python.org (Antoine Pitrou) Date: Wed, 06 Aug 2014 22:30:52 -0400 Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039) In-Reply-To: <85lhr19hf5.fsf@benfinney.id.au> References: <7610bb2511765.53e2a8af@wiscmail.wisc.edu> <7720c7201089e.53e2a8eb@wiscmail.wisc.edu> <75109da9179bf.53e2a92b@wiscmail.wisc.edu> <761082131730e.53e2a967@wiscmail.wisc.edu> <7780ac0016cba.53e2a9a3@wiscmail.wisc.edu> <763086e915f3d.53e2a9e0@wiscmail.wisc.edu> <7600e07a110a6.53e2aa1c@wiscmail.wisc.edu> <7660d0b6127e1.53e2aa58@wiscmail.wisc.edu> <7660a12a17d35.53e2aad3@wiscmail.wisc.edu> <7510b3091081e.53e2ab10@wiscmail.wisc.edu> <7740962212b5e.53e2ab4c@wiscmail.wisc.edu> <76f0afab10193.53e2ab88@wiscmail.wisc.edu> <7690af93164a2.53e2abc5@wiscmail.wisc.edu> <7690800915d00.53e2ac01@wiscmail.wisc.edu> <7780ff531168d.53e2ac3d@wiscmail.wisc.edu> <7620b637135c3.53e2ac7a@wiscmail.wisc.edu> <7620eb9316499.53e2acf5@wiscmail.wisc.edu> <7740eea410ec7.53e267a5@wiscmail.wisc.edu> <85lhr19hf5.fsf@benfinney.id.au> Message-ID: Le 06/08/2014 22:12, Ben Finney a écrit : > You seem to be saying that "pathlib" is not intended to be helpful for > constructing a shell command. pathlib lets you do operations on paths. It also gives you a string representation of the path that's expected to designate that path when talking to operating system APIs. It doesn't give you the possibility to store other semantic variations ("whether a new directory level must be created"); that's up to you to add those. (similarly, it doesn't have separate classes to represent "a file", "a directory", "a non-existing file", etc.) Regards Antoine.
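(The "that's up to you to add those" approach can be made concrete: carry the directory-intended bit next to the parsed path in your own small wrapper. The class below is purely illustrative; nothing like it exists in pathlib.)

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

@dataclass(frozen=True)
class DirHint:
    """Pair a parsed path with the 'trailing slash seen' semantic bit."""
    path: PurePosixPath
    must_be_dir: bool

    @classmethod
    def parse(cls, raw: str) -> "DirHint":
        # Remember whether the caller wrote a trailing slash before
        # pathlib normalizes it away.
        return cls(PurePosixPath(raw), raw.endswith("/"))

    def as_argument(self) -> str:
        # Re-emit the slash when handing the path to slash-sensitive
        # tools (e.g. an rsync command line).
        s = str(self.path)
        return s + "/" if self.must_be_dir and not s.endswith("/") else s

hint = DirHint.parse("build/out/")
print(hint.must_be_dir)    # -> True
print(hint.as_argument())  # -> build/out/
```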
From bcannon at gmail.com Thu Aug 7 16:04:04 2014 From: bcannon at gmail.com (Brett Cannon) Date: Thu, 07 Aug 2014 14:04:04 +0000 Subject: [Python-Dev] [Python-checkins] Daily reference leaks (09f56fdcacf1): sum=21004 References: Message-ID: test_codecs is not happy. Looking at the subject lines of commit emails from the past day I don't see any obvious cause. On Thu Aug 07 2014 at 4:35:05 AM wrote: > results for 09f56fdcacf1 on branch "default" > -------------------------------------------- > > test_codecs leaked [5825, 5825, 5825] references, sum=17475 > test_codecs leaked [1172, 1174, 1174] memory blocks, sum=3520 > test_collections leaked [0, 2, 0] references, sum=2 > test_functools leaked [0, 0, 3] memory blocks, sum=3 > test_site leaked [0, 2, 0] references, sum=2 > test_site leaked [0, 2, 0] memory blocks, sum=2 > > > Command line was: ['./python', '-m', 'test.regrtest', '-uall', '-R', > '3:3:/home/antoine/cpython/refleaks/reflogdA4OO6', '-x'] > _______________________________________________ > Python-checkins mailing list > Python-checkins at python.org > https://mail.python.org/mailman/listinfo/python-checkins > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From guido at python.org Thu Aug 7 17:05:46 2014 From: guido at python.org (Guido van Rossum) Date: Thu, 7 Aug 2014 08:05:46 -0700 Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039) In-Reply-To: References: <7610bb2511765.53e2a8af@wiscmail.wisc.edu> <7720c7201089e.53e2a8eb@wiscmail.wisc.edu> <75109da9179bf.53e2a92b@wiscmail.wisc.edu> <761082131730e.53e2a967@wiscmail.wisc.edu> <7780ac0016cba.53e2a9a3@wiscmail.wisc.edu> <763086e915f3d.53e2a9e0@wiscmail.wisc.edu> <7600e07a110a6.53e2aa1c@wiscmail.wisc.edu> <7660d0b6127e1.53e2aa58@wiscmail.wisc.edu> <7660a12a17d35.53e2aad3@wiscmail.wisc.edu> <7510b3091081e.53e2ab10@wiscmail.wisc.edu> <7740962212b5e.53e2ab4c@wiscmail.wisc.edu> <76f0afab10193.53e2ab88@wiscmail.wisc.edu> <7690af93164a2.53e2abc5@wiscmail.wisc.edu> <7690800915d00.53e2ac01@wiscmail.wisc.edu> <7780ff531168d.53e2ac3d@wiscmail.wisc.edu> <7620b637135c3.53e2ac7a@wiscmail.wisc.edu> <7620eb9316499.53e2acf5@wiscmail.wisc.edu> <7740eea410ec7.53e267a5@wiscmail.wisc.edu> <85lhr19hf5.fsf@benfinney.id.au> Message-ID: Hm. I personally consider a trailing slash significant. It feels semantically different (and in some cases it is) so I don't think it should be normalized. The behavior of os.path.split() here feels right. On Wed, Aug 6, 2014 at 7:30 PM, Antoine Pitrou wrote: > > Le 06/08/2014 22:12, Ben Finney a écrit : > > You seem to be saying that "pathlib" is not intended to be helpful for >> constructing a shell command. >> > > pathlib lets you do operations on paths. It also gives you a string > representation of the path that's expected to designate that path when > talking to operating system APIs. It doesn't give you the possibility to > store other semantic variations ("whether a new directory level must be > created"); that's up to you to add those. > > (similarly, it doesn't have separate classes to represent "a file", "a > directory", "a non-existing file", etc.) > > Regards > > Antoine.
> > > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ > guido%40python.org > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From zachary.ware+pydev at gmail.com Thu Aug 7 19:16:03 2014 From: zachary.ware+pydev at gmail.com (Zachary Ware) Date: Thu, 7 Aug 2014 12:16:03 -0500 Subject: [Python-Dev] [Python-checkins] Daily reference leaks (09f56fdcacf1): sum=21004 In-Reply-To: References: Message-ID: On Thu, Aug 7, 2014 at 9:04 AM, Brett Cannon wrote: > test_codecs is not happy. Looking at the subject lines of commit emails from > the past day I don't see any obvious cause. Looks like this was caused by the change I made to regrtest in [1] to fix refleak testing in test_asyncio [2]. I'm looking into it, but haven't found any kind of reason for it yet. -- Zach [1] http://hg.python.org/cpython/rev/7bc53cf8b2df [2] http://bugs.python.org/issue22104 From zachary.ware+pydev at gmail.com Thu Aug 7 22:51:24 2014 From: zachary.ware+pydev at gmail.com (Zachary Ware) Date: Thu, 7 Aug 2014 15:51:24 -0500 Subject: [Python-Dev] [Python-checkins] Daily reference leaks (09f56fdcacf1): sum=21004 In-Reply-To: References: Message-ID: On Thu, Aug 7, 2014 at 12:16 PM, Zachary Ware wrote: > On Thu, Aug 7, 2014 at 9:04 AM, Brett Cannon wrote: >> test_codecs is not happy. Looking at the subject lines of commit emails from >> the past day I don't see any obvious cause. > > Looks like this was caused by the change I made to regrtest in [1] to > fix refleak testing in test_asyncio [2]. I'm looking into it, but > haven't found any kind of reason for it yet. I've created http://bugs.python.org/issue22166 to keep track of this and report my findings thus far. 
-- Zach From chris.barker at noaa.gov Fri Aug 8 00:06:18 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 7 Aug 2014 15:06:18 -0700 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <20140804181013.GO4525@ando> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> Message-ID: On Mon, Aug 4, 2014 at 11:10 AM, Steven D'Aprano wrote: > On Mon, Aug 04, 2014 at 09:25:12AM -0700, Chris Barker wrote: > > > Good point -- I was trying to make the point about .join() vs + for > strings > > in an intro python class last year, and made the mistake of having the > > students test the performance. > > > > You need to concatenate a LOT of strings to see any difference at all > > If only that were the case, but it isn't. Here's a cautionary tale for > how using string concatenation can blow up in your face: > > Chris Withers asks for help debugging HTTP slowness: > https://mail.python.org/pipermail/python-dev/2009-August/091125.html Thanks for that -- interesting story. Note that that was not using sum() in that case though, which is really the issue at hand. It shouldn't be hard to demonstrate the difference between repeated > string concatenation and join, all you need do is defeat sum()'s > prohibition against strings. Run this bit of code, and you'll see a > significant difference in performance, even with CPython's optimized > concatenation: > well, that does look compelling, but what it shows is that sum(a_list_of_strings) is slow compared to ''.join(a_list_of_strings). That doesn't surprise me a bit -- this is really similar to why: a_numpy_array.sum() is going to be a lot faster than: sum(a_numpy_array) and why I'll tell everyone that is working with lots of numbers to use numpy. ndarray.sum knows what data type it's dealing with, and can do the loop in C. similarly with ''.join() (though not as optimized.
But I'm not sure we're seeing the big O difference here at all -- but rather the extra calls through each element in the list's __add__ method. In the case where you already HAVE a big list of strings, then yes, ''.join is the clear winner. But I think the case we're often talking about, and I've tested with students, is when you are building up a long string on the fly out of little strings. In that case, you need to profile the full "append to list, then call join()", not just the join() call:

# continued adding of strings ( O(n^2)? )
In [6]: def add_strings(l):
   ...:     s = ''
   ...:     for i in l:
   ...:         s += i
   ...:     return s

Using append and then join ( O(n)? )
In [14]: def join_strings(list_of_strings):
   ....:     l = []
   ....:     for i in list_of_strings:
   ....:         l.append(i)
   ....:     return ''.join(l)

In [23]: timeit add_strings(strings)
1000000 loops, best of 3: 831 ns per loop

In [24]: timeit join_strings(strings)
100000 loops, best of 3: 1.87 µs per loop

## hmm -- concatenating is faster for a small list of tiny strings....

In [31]: strings = list('Hello World') * 1000
strings *= 1000

In [26]: timeit add_strings(strings)
1000 loops, best of 3: 932 µs per loop

In [27]: timeit join_strings(strings)
1000 loops, best of 3: 967 µs per loop

## now about the same.

In [31]: strings = list('Hello World') * 10000

In [29]: timeit add_strings(strings)
100 loops, best of 3: 9.44 ms per loop

In [30]: timeit join_strings(strings)
100 loops, best of 3: 10.1 ms per loop

still about the same?

In [31]: strings = list('Hello World') * 1000000

In [32]: timeit add_strings(strings)
1 loops, best of 3: 1.27 s per loop

In [33]: timeit join_strings(strings)
1 loops, best of 3: 1.05 s per loop

there we go -- slight advantage to joining..... So this is why we've said that the common wisdom about string concatenating isn't really a practical issue. But if you already have the strings all in a list, then yes, join() is a major win over sum(). In fact, I tried the above with sum() -- and it was really, really slow.
So slow I didn't have the patience to wait for it. Here is a smaller example:

In [22]: strings = list('Hello World') * 10000

In [23]: timeit add_strings(strings)
100 loops, best of 3: 9.61 ms per loop

In [24]: timeit sum( strings, Faker() )
1 loops, best of 3: 246 ms per loop

So why is sum() so darn slow with strings compared to a simple loop with += ? (and if I try it with a list 10 times as long it takes "forever") Perhaps the http issue cited was before some nifty optimizations in current CPython? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From ethan at stoneleaf.us Fri Aug 8 01:01:49 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Thu, 07 Aug 2014 16:01:49 -0700 Subject: [Python-Dev] sum(...) limitation In-Reply-To: References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> Message-ID: <53E4055D.2040305@stoneleaf.us> On 08/07/2014 03:06 PM, Chris Barker wrote: [snip timings, etc.] I don't remember where, but I believe that cPython has an optimization built in for repeated string concatenation, which is probably why you aren't seeing big differences between the + and the sum(). A little testing shows how to defeat that optimization:

blah = ''
for string in ['booyah'] * 100000:
    blah = string + blah

Note the reversed order of the addition.
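[Editor's note: the two orderings Ethan describes can be timed side by side; this is only a sketch, with arbitrary list sizes, and the timings are machine-dependent:]

```python
import timeit

def forward(strings):
    # blah = blah + s: CPython can often resize the target string in
    # place when it holds the only reference, keeping this roughly linear.
    blah = ''
    for s in strings:
        blah = blah + s
    return blah

def backward(strings):
    # blah = s + blah: the accumulated string is on the right-hand side,
    # so it must be copied in full on every step -- O(n**2) overall.
    blah = ''
    for s in strings:
        blah = s + blah
    return blah

strings = ['booyah'] * 10000  # arbitrary size for demonstration
t_fwd = timeit.timeit(lambda: forward(strings), number=1)
t_bwd = timeit.timeit(lambda: backward(strings), number=1)
print(t_fwd, t_bwd)
```

On a typical CPython build the reversed form is dramatically slower, which is the whole point of the trick.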
--> timeit.Timer("for string in ['booya'] * 100000: blah = blah + string", "blah = ''").repeat(3, 1) [0.021117210388183594, 0.013692855834960938, 0.00768280029296875] --> timeit.Timer("for string in ['booya'] * 100000: blah = string + blah", "blah = ''").repeat(3, 1) [15.301048994064331, 15.343288898468018, 15.268463850021362] -- ~Ethan~ From ethan at stoneleaf.us Fri Aug 8 01:05:50 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Thu, 07 Aug 2014 16:05:50 -0700 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <53E4055D.2040305@stoneleaf.us> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> Message-ID: <53E4064E.1080600@stoneleaf.us> On 08/07/2014 04:01 PM, Ethan Furman wrote: > On 08/07/2014 03:06 PM, Chris Barker wrote: > > the + and the sum(). Yeah, that 'sum' should be 'join' :/ -- ~Ethan~ From ethan at stoneleaf.us Fri Aug 8 01:08:14 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Thu, 07 Aug 2014 16:08:14 -0700 Subject: [Python-Dev] sum(...) 
limitation In-Reply-To: <53E4055D.2040305@stoneleaf.us> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> Message-ID: <53E406DE.3080509@stoneleaf.us> On 08/07/2014 04:01 PM, Ethan Furman wrote: > On 08/07/2014 03:06 PM, Chris Barker wrote: > > --> timeit.Timer("for string in ['booya'] * 100000: blah = blah + string", "blah = ''").repeat(3, 1) > [0.021117210388183594, 0.013692855834960938, 0.00768280029296875] > > --> timeit.Timer("for string in ['booya'] * 100000: blah = string + blah", "blah = ''").repeat(3, 1) > [15.301048994064331, 15.343288898468018, 15.268463850021362] Oh, and the join() timings: --> timeit.Timer("blah = ''.join(['booya'] * 100000)", "blah = ''").repeat(3, 1) [0.0014629364013671875, 0.0014190673828125, 0.0011930465698242188] So, + is three orders of magnitude slower than join. -- ~Ethan~ From larry at hastings.org Fri Aug 8 06:41:13 2014 From: larry at hastings.org (Larry Hastings) Date: Thu, 07 Aug 2014 21:41:13 -0700 Subject: [Python-Dev] Surely "nullable" is a reasonable name? In-Reply-To: <53E0F488.1090105@v.loewis.de> References: <53DF326F.9030908@hastings.org> <53E0F488.1090105@v.loewis.de> Message-ID: <53E454E9.3030100@hastings.org> On 08/05/2014 08:13 AM, "Martin v. Löwis" wrote: > For the feature in question, > I find both "allow_none" and "nullable" acceptable; "noneable" is not. Well! It's rare that the core dev community is so consistent in its opinion. I still think "nullable" is totally appropriate, but I'll change it to "allow_none". //arry/ -------------- next part -------------- An HTML attachment was scrubbed...
URL: From p.f.moore at gmail.com Fri Aug 8 14:27:28 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Fri, 8 Aug 2014 13:27:28 +0100 Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039) In-Reply-To: References: <7610bb2511765.53e2a8af@wiscmail.wisc.edu> <7720c7201089e.53e2a8eb@wiscmail.wisc.edu> <75109da9179bf.53e2a92b@wiscmail.wisc.edu> <761082131730e.53e2a967@wiscmail.wisc.edu> <7780ac0016cba.53e2a9a3@wiscmail.wisc.edu> <763086e915f3d.53e2a9e0@wiscmail.wisc.edu> <7600e07a110a6.53e2aa1c@wiscmail.wisc.edu> <7660d0b6127e1.53e2aa58@wiscmail.wisc.edu> <7660a12a17d35.53e2aad3@wiscmail.wisc.edu> <7510b3091081e.53e2ab10@wiscmail.wisc.edu> <7740962212b5e.53e2ab4c@wiscmail.wisc.edu> <76f0afab10193.53e2ab88@wiscmail.wisc.edu> <7690af93164a2.53e2abc5@wiscmail.wisc.edu> <7690800915d00.53e2ac01@wiscmail.wisc.edu> <7780ff531168d.53e2ac3d@wiscmail.wisc.edu> <7620b637135c3.53e2ac7a@wiscmail.wisc.edu> <7620eb9316499.53e2acf5@wiscmail.wisc.edu> <7740eea410ec7.53e267a5@wiscmail.wisc.edu> Message-ID: On 7 August 2014 02:55, Antoine Pitrou wrote: > pathlib is generally concerned with filesystem operations written in Python, > not arbitrary third-party tools. Also it is probably easy to append the > trailing slash in your command-line invocation, if so desired. I had a use case where I wanted to allow a config file to contain "path: foo" to create a file called foo, and "path: foo/" to create a directory. It was a shortcut for specifying an explicit "directory: true" parameter as well. The fact that pathlib stripped the slash made coding this mildly tricky (especially as I wanted to cater for Windows users writing "foo\\"...) It's not a showstopper, but I agree that semantically, being able to distinguish whether an input had a trailing slash is sometimes useful. 
Paul From alexander.belopolsky at gmail.com Fri Aug 8 15:39:43 2014 From: alexander.belopolsky at gmail.com (Alexander Belopolsky) Date: Fri, 8 Aug 2014 09:39:43 -0400 Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039) In-Reply-To: References: <7610bb2511765.53e2a8af@wiscmail.wisc.edu> <7720c7201089e.53e2a8eb@wiscmail.wisc.edu> <75109da9179bf.53e2a92b@wiscmail.wisc.edu> <761082131730e.53e2a967@wiscmail.wisc.edu> <7780ac0016cba.53e2a9a3@wiscmail.wisc.edu> <763086e915f3d.53e2a9e0@wiscmail.wisc.edu> <7600e07a110a6.53e2aa1c@wiscmail.wisc.edu> <7660d0b6127e1.53e2aa58@wiscmail.wisc.edu> <7660a12a17d35.53e2aad3@wiscmail.wisc.edu> <7510b3091081e.53e2ab10@wiscmail.wisc.edu> <7740962212b5e.53e2ab4c@wiscmail.wisc.edu> <76f0afab10193.53e2ab88@wiscmail.wisc.edu> <7690af93164a2.53e2abc5@wiscmail.wisc.edu> <7690800915d00.53e2ac01@wiscmail.wisc.edu> <7780ff531168d.53e2ac3d@wiscmail.wisc.edu> <7620b637135c3.53e2ac7a@wiscmail.wisc.edu> <7620eb9316499.53e2acf5@wiscmail.wisc.edu> <7740eea410ec7.53e267a5@wiscmail.wisc.edu> Message-ID: On Fri, Aug 8, 2014 at 8:27 AM, Paul Moore wrote: > I had a use case where I wanted to allow a config file to contain > "path: foo" to create a file called foo, and "path: foo/" to create a > directory. It was a shortcut for specifying an explicit "directory: > true" parameter as well. > Here is my use case: I have a database application that can save a table in a variety of formats based on the supplied file name. For example, save('t.csv', t) saves in CSV text format while save('t', t) saves in the default binary format. In addition, it supports "splayed" format where a table is saved in multiple files across a directory - one file per column. The native database save function chooses this format when the file name ends with a slash: save('t/', t). I would like to write a save() function in Python that works like this, but takes pathlib.Path instances instead of str; in the current version, however, I cannot supply 't/' as a Path instance.
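[Editor's note: Alexander's last point is easy to reproduce; a minimal sketch, using PurePosixPath so the behavior is the same on any OS:]

```python
from pathlib import PurePosixPath

# pathlib normalizes away a trailing slash, so the save('t/') vs
# save('t') distinction described above cannot survive a round-trip
# through a Path object.
p = PurePosixPath('t/')
print(p)  # prints: t -- the slash is gone
assert str(p) == 't'
assert PurePosixPath('t/') == PurePosixPath('t')
```

This normalization is exactly what issue #21039 is about: once the string becomes a Path, there is no way to ask whether the original input ended with a slash.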
-------------- next part -------------- An HTML attachment was scrubbed... URL: From status at bugs.python.org Fri Aug 8 18:08:08 2014 From: status at bugs.python.org (Python tracker) Date: Fri, 8 Aug 2014 18:08:08 +0200 (CEST) Subject: [Python-Dev] Summary of Python tracker Issues Message-ID: <20140808160808.21A4C5622F@psf.upfronthosting.co.za> ACTIVITY SUMMARY (2014-08-01 - 2014-08-08) Python tracker at http://bugs.python.org/ To view or respond to any of the issues listed below, click on the issue. Do NOT respond to this message. Issues counts and deltas: open 4602 (+10) closed 29340 (+43) total 33942 (+53) Open issues with patches: 2177 Issues opened (39) ================== #21039: pathlib strips trailing slash http://bugs.python.org/issue21039 reopened by pitrou #21591: "exec(a, b, c)" not the same as "exec a in b, c" in nested fun http://bugs.python.org/issue21591 reopened by Arfrever #22121: IDLE should start with HOME as the initial working directory http://bugs.python.org/issue22121 opened by mark #22123: Provide a direct function for types.SimpleNamespace() http://bugs.python.org/issue22123 opened by mark #22125: Cure signedness warnings introduced by #22003 http://bugs.python.org/issue22125 opened by dw #22126: mc68881 fpcr inline asm breaks clang -flto build http://bugs.python.org/issue22126 opened by ivank #22128: patch: steer people away from codecs.open http://bugs.python.org/issue22128 opened by Frank.van.Dijk #22131: uuid.bytes optimization http://bugs.python.org/issue22131 opened by kevinlondon #22133: IDLE: Set correct WM_CLASS on X11 http://bugs.python.org/issue22133 opened by sahutd #22135: allow to break into pdb with Ctrl-C for all the commands that http://bugs.python.org/issue22135 opened by xdegaye #22137: Test imaplib API on all methods specified in RFC 3501 http://bugs.python.org/issue22137 opened by zvyn #22138: patch.object doesn't restore function defaults http://bugs.python.org/issue22138 opened by chepner #22139: python windows 2.7.8 
64-bit wrong binary version http://bugs.python.org/issue22139 opened by Andreas.Richter #22140: "python-config --includes" returns a wrong path (double prefix http://bugs.python.org/issue22140 opened by Michael.Dussere #22141: rlcompleter.Completer matches too much http://bugs.python.org/issue22141 opened by donlorenzo #22143: rlcompleter.Completer has duplicate matches http://bugs.python.org/issue22143 opened by donlorenzo #22144: ellipsis needs better display in lexer documentation http://bugs.python.org/issue22144 opened by François-René.Rideau #22145: <> in parser spec but not lexer spec http://bugs.python.org/issue22145 opened by François-René.Rideau #22147: PosixPath() constructor should not accept strings with embedde http://bugs.python.org/issue22147 opened by ischwabacher #22148: frozen.c should #include instead of "importlib.h http://bugs.python.org/issue22148 opened by jbeck #22149: the frame of a suspended generator should not have a local tra http://bugs.python.org/issue22149 opened by xdegaye #22150: deprecated-removed directive is broken in Sphinx 1.2.2 http://bugs.python.org/issue22150 opened by berker.peksag #22153: There is no standard TestCase.runTest implementation http://bugs.python.org/issue22153 opened by vadmium #22154: ZipFile.open context manager support http://bugs.python.org/issue22154 opened by Ralph.Broenink #22155: Out of date code example for tkinter's createfilehandler http://bugs.python.org/issue22155 opened by vadmium #22156: Fix compiler warnings http://bugs.python.org/issue22156 opened by haypo #22157: FAIL: test_with_pip (test.test_venv.EnsurePipTest) http://bugs.python.org/issue22157 opened by snehal #22158: RFC 6531 (SMTPUTF8) support in smtpd.PureProxy http://bugs.python.org/issue22158 opened by zvyn #22159: smtpd.PureProxy and smtpd.DebuggingServer do not work with dec http://bugs.python.org/issue22159 opened by zvyn #22160: Windows installers need to be updated following OpenSSL securi http://bugs.python.org/issue22160
opened by alex #22161: Remove unsupported code from ctypes http://bugs.python.org/issue22161 opened by serhiy.storchaka #22163: max_wbits set incorrectly to -zlib.MAX_WBITS in tarfile, shoul http://bugs.python.org/issue22163 opened by edulix #22164: cell object cleared too early? http://bugs.python.org/issue22164 opened by pitrou #22165: Empty response from http.server when directory listing contain http://bugs.python.org/issue22165 opened by jleedev #22166: test_codecs "leaking" references http://bugs.python.org/issue22166 opened by zach.ware #22167: iglob() has misleading documentation (does indeed store names http://bugs.python.org/issue22167 opened by roysmith #22168: Turtle Graphics RawTurtle problem http://bugs.python.org/issue22168 opened by Kent.D..Lee #22171: stack smash when using ctypes/libffi to access union http://bugs.python.org/issue22171 opened by wes.kerfoot #22173: Update lib2to3.tests and test_lib2to3 to use test discovery http://bugs.python.org/issue22173 opened by zach.ware Most recent 15 issues with no replies (15) ========================================== #22173: Update lib2to3.tests and test_lib2to3 to use test discovery http://bugs.python.org/issue22173 #22171: stack smash when using ctypes/libffi to access union http://bugs.python.org/issue22171 #22166: test_codecs "leaking" references http://bugs.python.org/issue22166 #22164: cell object cleared too early? 
http://bugs.python.org/issue22164 #22163: max_wbits set incorrectly to -zlib.MAX_WBITS in tarfile, shoul http://bugs.python.org/issue22163 #22161: Remove unsupported code from ctypes http://bugs.python.org/issue22161 #22159: smtpd.PureProxy and smtpd.DebuggingServer do not work with dec http://bugs.python.org/issue22159 #22158: RFC 6531 (SMTPUTF8) support in smtpd.PureProxy http://bugs.python.org/issue22158 #22155: Out of date code example for tkinter's createfilehandler http://bugs.python.org/issue22155 #22153: There is no standard TestCase.runTest implementation http://bugs.python.org/issue22153 #22149: the frame of a suspended generator should not have a local tra http://bugs.python.org/issue22149 #22143: rlcompleter.Completer has duplicate matches http://bugs.python.org/issue22143 #22140: "python-config --includes" returns a wrong path (double prefix http://bugs.python.org/issue22140 #22135: allow to break into pdb with Ctrl-C for all the commands that http://bugs.python.org/issue22135 #22115: Add new methods to trace Tkinter variables http://bugs.python.org/issue22115 Most recent 15 issues waiting for review (15) ============================================= #22173: Update lib2to3.tests and test_lib2to3 to use test discovery http://bugs.python.org/issue22173 #22165: Empty response from http.server when directory listing contain http://bugs.python.org/issue22165 #22163: max_wbits set incorrectly to -zlib.MAX_WBITS in tarfile, shoul http://bugs.python.org/issue22163 #22161: Remove unsupported code from ctypes http://bugs.python.org/issue22161 #22159: smtpd.PureProxy and smtpd.DebuggingServer do not work with dec http://bugs.python.org/issue22159 #22158: RFC 6531 (SMTPUTF8) support in smtpd.PureProxy http://bugs.python.org/issue22158 #22156: Fix compiler warnings http://bugs.python.org/issue22156 #22150: deprecated-removed directive is broken in Sphinx 1.2.2 http://bugs.python.org/issue22150 #22149: the frame of a suspended generator should not have a local tra 
http://bugs.python.org/issue22149 #22148: frozen.c should #include instead of "importlib.h http://bugs.python.org/issue22148 #22143: rlcompleter.Completer has duplicate matches http://bugs.python.org/issue22143 #22141: rlcompleter.Completer matches too much http://bugs.python.org/issue22141 #22138: patch.object doesn't restore function defaults http://bugs.python.org/issue22138 #22137: Test imaplib API on all methods specified in RFC 3501 http://bugs.python.org/issue22137 #22133: IDLE: Set correct WM_CLASS on X11 http://bugs.python.org/issue22133 Top 10 most discussed issues (10) ================================= #19838: test.test_pathlib.PosixPathTest.test_touch_common fails on Fre http://bugs.python.org/issue19838 23 msgs #21448: Email Parser use 100% CPU http://bugs.python.org/issue21448 14 msgs #22123: Provide a direct function for types.SimpleNamespace() http://bugs.python.org/issue22123 11 msgs #21965: Add support for Memory BIO to _ssl http://bugs.python.org/issue21965 10 msgs #14910: argparse: disable abbreviation http://bugs.python.org/issue14910 9 msgs #21308: PEP 466: backport ssl changes http://bugs.python.org/issue21308 9 msgs #22046: ZipFile.read() should mention that it might throw NotImplement http://bugs.python.org/issue22046 9 msgs #21091: EmailMessage.is_attachment should be a method http://bugs.python.org/issue21091 8 msgs #22118: urljoin fails with messy relative URLs http://bugs.python.org/issue22118 8 msgs #22160: Windows installers need to be updated following OpenSSL securi http://bugs.python.org/issue22160 8 msgs Issues closed (43) ================== #5411: Add xz support to shutil http://bugs.python.org/issue5411 closed by serhiy.storchaka #11763: assertEqual memory issues with large text inputs http://bugs.python.org/issue11763 closed by ezio.melotti #13540: Document the Action API in argparse http://bugs.python.org/issue13540 closed by jason.coombs #15114: Deprecate strict mode of HTMLParser http://bugs.python.org/issue15114 closed by 
ezio.melotti #15826: Increased test coverage of test_glob.py http://bugs.python.org/issue15826 closed by ezio.melotti #15974: Optional compact and colored output for regrest http://bugs.python.org/issue15974 closed by pitrou #17665: convert test_wsgiref to idiomatic unittest code http://bugs.python.org/issue17665 closed by ezio.melotti #18034: Last two entries in the programming FAQ are out of date (impor http://bugs.python.org/issue18034 closed by ezio.melotti #18142: Tests fail on Mageia Linux Cauldron x86-64 with some configure http://bugs.python.org/issue18142 closed by ned.deily #18588: timeit examples should be consistent http://bugs.python.org/issue18588 closed by ezio.melotti #19055: Regular expressions: * does not match as many repetitions as p http://bugs.python.org/issue19055 closed by ezio.melotti #20056: Got deprecation warning when running test_shutil.py on Windows http://bugs.python.org/issue20056 closed by serhiy.storchaka #20170: Derby #1: Convert 137 sites to Argument Clinic in Modules/posi http://bugs.python.org/issue20170 closed by larry #20402: List comprehensions should be noted in for loop documentation http://bugs.python.org/issue20402 closed by rhettinger #20977: pyflakes: undefined "ctype" in 2 except blocks in the email mo http://bugs.python.org/issue20977 closed by ezio.melotti #21047: html.parser.HTMLParser: convert_charrefs should become True by http://bugs.python.org/issue21047 closed by berker.peksag #21539: pathlib's Path.mkdir() should allow for "mkdir -p" functionali http://bugs.python.org/issue21539 closed by barry #21972: Bugs in the lexer and parser documentation http://bugs.python.org/issue21972 closed by loewis #21975: Using pickled/unpickled sqlite3.Row results in segfault rather http://bugs.python.org/issue21975 closed by serhiy.storchaka #22077: Improve the error message for various sequences http://bugs.python.org/issue22077 closed by terry.reedy #22092: Executing some tests inside Lib/unittest/test individually thr 
http://bugs.python.org/issue22092 closed by ezio.melotti #22097: Linked list API for ordereddict http://bugs.python.org/issue22097 closed by rhettinger #22104: test_asyncio unstable in refleak mode http://bugs.python.org/issue22104 closed by python-dev #22105: Idle: Hang during File "Save As" http://bugs.python.org/issue22105 closed by terry.reedy #22110: enable extra compilation warnings http://bugs.python.org/issue22110 closed by neologix #22114: You cannot call communicate() safely after receiving an except http://bugs.python.org/issue22114 closed by amrith #22116: Weak reference support for C function objects http://bugs.python.org/issue22116 closed by pitrou #22119: Some input chars (i.e. '++') break re.match http://bugs.python.org/issue22119 closed by ezio.melotti #22120: Return converter code generated by Argument Clinic has a warni http://bugs.python.org/issue22120 closed by larry #22122: turtle module examples should all begin "from turtle import *" http://bugs.python.org/issue22122 closed by mark #22124: Rotating items of list to left http://bugs.python.org/issue22124 closed by zach.ware #22127: performance regression in socket getsockaddrarg() http://bugs.python.org/issue22127 closed by loewis #22129: Please add an equivalent to QString::simplified() to Python st http://bugs.python.org/issue22129 closed by serhiy.storchaka #22130: Logging fileConfig behavior does not match documentation http://bugs.python.org/issue22130 closed by python-dev #22132: Cannot copy the same directory structure to the same destinati http://bugs.python.org/issue22132 closed by eric.araujo #22134: string formatting float rounding errors http://bugs.python.org/issue22134 closed by ned.deily #22136: Fix _tkinter compiler warnings on MSVC http://bugs.python.org/issue22136 closed by python-dev #22142: PEP 465 operators not described in lexical_analysis http://bugs.python.org/issue22142 closed by python-dev #22146: Error message for __build_class__ contains typo 
http://bugs.python.org/issue22146 closed by python-dev #22162: Activating a venv - Dash doesn't understand source http://bugs.python.org/issue22162 closed by vinay.sajip #22169: sys.tracebacklimit = 0 does not work as documented in 3.x http://bugs.python.org/issue22169 closed by ned.deily #22170: Typo in iterator doc http://bugs.python.org/issue22170 closed by ezio.melotti #22172: Local files shadow system modules, even from system modules http://bugs.python.org/issue22172 closed by ncoghlan From chris.barker at noaa.gov Fri Aug 8 17:23:51 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 8 Aug 2014 08:23:51 -0700 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <53E4055D.2040305@stoneleaf.us> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> Message-ID: On Thu, Aug 7, 2014 at 4:01 PM, Ethan Furman wrote: > I don't remember where, but I believe that cPython has an optimization > built in for repeated string concatenation, which is probably why you > aren't seeing big differences between the + and the sum(). > Indeed -- clearly so. A little testing shows how to defeat that optimization: blah = '' > for string in ['booyah'] * 100000: > blah = string + blah > > Note the reversed order of the addition. > thanks -- cool trick. Oh, and the join() timings: > --> timeit.Timer("blah = ''.join(['booya'] * 100000)", "blah = > ''").repeat(3, 1) > [0.0014629364013671875, 0.0014190673828125, 0.0011930465698242188] > So, + is three orders of magnitude slower than join. Only one if you use the optimized form of +, and not even that if you need to build up the list first, which is the common use-case.
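[Editor's note: as background to the sum() timings earlier in the thread, sum() rejects a str start value outright, and the Faker object Chris timed is not defined in the quoted messages. What follows is a plausible reconstruction, not the original code:]

```python
# sum() explicitly refuses a str start value, steering users to str.join:
try:
    sum(['Hello', ' ', 'World'], '')
except TypeError:
    rejected = True  # raised with a message pointing at ''.join(seq)
else:
    rejected = False
assert rejected

# The sanctioned (and fast) spelling:
assert ''.join(['Hello', ' ', 'World']) == 'Hello World'

# A guess at the thread's Faker trick: a start value whose __add__
# simply yields the first real element, sneaking strings past sum().
class Faker:
    def __add__(self, other):
        return other

assert sum(['Hello', ' ', 'World'], Faker()) == 'Hello World'
```

With the Faker start value, sum() falls back to plain left-to-right `+`, which is precisely the quadratic pattern the prohibition exists to prevent.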
So my final question is this: repeated string concatenation is not the "recommended" way to do this -- but nevertheless, cPython has an optimization that makes it fast and efficient, to the point that there is no practical performance reason to prefer appending to a list and calling join() afterward. So why not apply a similar optimization to sum() for strings? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From ethan at stoneleaf.us Fri Aug 8 20:09:45 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Fri, 08 Aug 2014 11:09:45 -0700 Subject: [Python-Dev] sum(...) limitation In-Reply-To: References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> Message-ID: <53E51269.5030209@stoneleaf.us> On 08/08/2014 08:23 AM, Chris Barker wrote: > > So my final question is this: > > repeated string concatenation is not the "recommended" way to do this -- but nevertheless, cPython has an optimization > that makes it fast and efficient, to the point that there is no practical performance reason to prefer appending to a > list and calling join() afterward. > > So why not apply a similar optimization to sum() for strings? That I cannot answer -- I find the current situation with sum highly irritating. -- ~Ethan~ From raymond.hettinger at gmail.com Sat Aug 9 02:34:34 2014 From: raymond.hettinger at gmail.com (Raymond Hettinger) Date: Fri, 8 Aug 2014 17:34:34 -0700 Subject: [Python-Dev] sum(...)
limitation In-Reply-To: <53E51269.5030209@stoneleaf.us> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> Message-ID: <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> On Aug 8, 2014, at 11:09 AM, Ethan Furman wrote: >> So why not apply a similar optimization to sum() for strings? > > That I cannot answer -- I find the current situation with sum highly irritating. > It is only irritating if you are misusing sum(). The str.__add__ optimization was put in because it was common for people to accidentally incur the performance penalty. With sum(), we don't seem to have that problem (I don't see people using it to add lists except just to show that could be done). Raymond -------------- next part -------------- An HTML attachment was scrubbed... URL: From ethan at stoneleaf.us Sat Aug 9 02:56:24 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Fri, 08 Aug 2014 17:56:24 -0700 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> Message-ID: <53E571B8.7030103@stoneleaf.us> On 08/08/2014 05:34 PM, Raymond Hettinger wrote: > > On Aug 8, 2014, at 11:09 AM, Ethan Furman > wrote: > >>> So why not apply a similar optimization to sum() for strings? >> >> That I cannot answer -- I find the current situation with sum highly irritating. >> > > It is only irritating if you are misusing sum(). Actually, I have an advanced degree in irritability -- perhaps you've noticed in the past? I don't use sum at all, or at least very rarely, and it still irritates me. 
It feels like I'm being told I'm too dumb to figure out when I can safely use sum and when I can't. -- ~Ethan~ From alexander.belopolsky at gmail.com Sat Aug 9 04:20:37 2014 From: alexander.belopolsky at gmail.com (Alexander Belopolsky) Date: Fri, 8 Aug 2014 22:20:37 -0400 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <53E571B8.7030103@stoneleaf.us> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> Message-ID: On Fri, Aug 8, 2014 at 8:56 PM, Ethan Furman wrote: > I don't use sum at all, or at least very rarely, and it still irritates me. You are not alone. When I see sum([a, b, c]), I think it is a + b + c, but in Python it is 0 + a + b + c. If we had a "join" operator for strings that is different from + - then sure, I would not try to use sum to join strings, but we don't. I have always thought that sum(x) is just a shorthand for reduce(operator.add, x), but again it is not so in Python. While "sum should only be used for numbers," it turns out it is not a good choice for floats - use math.fsum. While "strings are blocked because sum is slow," numpy arrays with millions of elements are not. And try to explain to someone that sum(x) is bad on a numpy array, but abs(x) is fine. Why have builtin sum at all if its use comes with so many caveats? -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Sat Aug 9 07:08:45 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 9 Aug 2014 15:08:45 +1000 Subject: [Python-Dev] sum(...)
limitation In-Reply-To: References: <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> Message-ID: <20140809050845.GZ4525@ando> On Fri, Aug 08, 2014 at 10:20:37PM -0400, Alexander Belopolsky wrote: > On Fri, Aug 8, 2014 at 8:56 PM, Ethan Furman wrote: > > > I don't use sum at all, or at least very rarely, and it still irritates me. > > > You are not alone. When I see sum([a, b, c]), I think it is a + b + c, but > in Python it is 0 + a + b + c. If we had a "join" operator for strings > that is different from + - then sure, I would not try to use sum to join > strings, but we don't. I've long believed that + is the wrong operator for concatenating strings, and that & makes a much better operator. We wouldn't be having these interminable arguments about using sum() to concatenate strings (and lists, and tuples) if the & operator was used for concatenation and + was only used for numeric addition. > I have always thought that sum(x) is just a > shorthand for reduce(operator.add, x), but again it is not so in Python. The signature of reduce is:

reduce(...)
    reduce(function, sequence[, initial]) -> value

so sum() is (at least conceptually) a shorthand for reduce:

def sum(values, initial=0):
    return reduce(operator.add, values, initial)

but that's an implementation detail, not a language promise, and sum() is free to differ from that simple version. Indeed, even the public interface is different, since sum() prohibits using a string as the initial value and only promises to work with numbers. The fact that it happens to work with lists and tuples is somewhat of an accident of implementation. > While "sum should only be used for numbers," it turns out it is not a good choice for floats - use math.fsum. Correct.
And if you (generic you, not you personally) do not understand why simple-minded addition of floats is troublesome, then you're going to have a world of trouble. Anyone who is disturbed by the question of "should I use sum or math.fsum?" probably shouldn't be writing serious floating point code at all. Floating point computations are hard, and there is simply no escaping this fact.

> While "strings are blocked because
> sum is slow," numpy arrays with millions of elements are not.

That's not a good example. Strings are potentially O(N**2), which means not just "slow" but *agonisingly* slow, as in taking a week -- no exaggeration -- to concat a million strings. If it takes a microsecond to concat two strings, then 1e6**2 such concatenations could take over eleven days. Slowness of such magnitude might as well be "the process has locked up". In comparison, summing a numpy array with a million entries is not really slow in that sense. The time taken is proportional to the number of entries, and differs from summing a list only by a constant factor.

Besides, in the case of strings it is quite simple to decide "is the initial value a string?", whereas with lists or numpy arrays it's quite hard to decide "is the list or array so huge that the user will consider this too slow?". What counts as "too slow" depends on the machine it is running on, what other processes are running, and the user's mood, and leads to the silly result that summing an array of N items succeeds but N+1 items doesn't. So in the case of strings, it is easy to make a blanket prohibition, but in the case of lists or arrays, there is no reasonable place to draw the line.

> And try to
> explain to someone that sum(x) is bad on a numpy array, but abs(x) is fine.

I think that's because sum() has to box up each and every element in the array into an object, which is wasteful, while abs() can delegate to a specialist array.__abs__ method.
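The quadratic behaviour is easy to see in miniature (a sketch, not from the thread):

```python
# Worst case for repeated +: every iteration copies the accumulated
# string again, so the total work grows as 1 + 2 + ... + N character
# copies, i.e. O(N**2). (CPython can sometimes resize the left operand
# in place when it holds the only reference, which hides this cost in
# simple loops like the one below.)
parts = [str(i) for i in range(10000)]
s = ""
for p in parts:
    s = s + p

# str.join computes the final length up front and copies each piece
# exactly once -- O(N) no matter how many pieces there are:
assert s == "".join(parts)
```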
Although that's not something beginners should be expected to understand, no serious Python programmer should be confused by this. As a programmer, we should expect to have some understanding of our tools, how they work, their limitations, and when to use a different tool. That's why numpy has its own version of sum which is designed to work specifically on numpy arrays. Use a specialist tool for a specialist job:

    py> with Stopwatch():
    ...     sum(carray)  # carray is a numpy array of 75000000 floats.
    ...
    112500000.0
    time taken: 52.659770 seconds
    py> with Stopwatch():
    ...     numpy.sum(carray)
    ...
    112500000.0
    time taken: 0.161263 seconds

> Why have builtin sum at all if its use comes with so many caveats?

Because sum() is a perfectly reasonable general purpose tool for adding up small amounts of numbers where high floating point precision is not required. It has been included as a built-in because Python comes with "batteries included", and a basic function for adding up a few numbers is an obvious, simple battery. But serious programmers should be comfortable with the idea that you use the right tool for the right job.

If you visit a hardware store, you will find that even something as simple as the hammer exists in many specialist varieties. There are tack hammers, claw hammers, framing hammers, lump hammers, rubber and wooden mallets, "brass" non-sparking hammers, carpet hammers, brick hammers, ball-peen and cross-peen hammers, and even more specialist versions like geologist's hammers. Bashing an object with something hard is remarkably complicated, and there are literally dozens of types and sizes of "the hammer". Why should it be a surprise that there are a handful of different ways to sum items?

-- Steven

From greg.ewing at canterbury.ac.nz  Sat Aug  9 07:36:11 2014
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 09 Aug 2014 17:36:11 +1200
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <20140809050845.GZ4525@ando>
References: <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <20140809050845.GZ4525@ando>
Message-ID: <53E5B34B.1020302@canterbury.ac.nz>

Steven D'Aprano wrote:
> I've long believed that + is the wrong operator for concatenating
> strings, and that & makes a much better operator.

Do you have a reason for preferring '&' in particular, or do you just want something different from '+'? Personally I can't see why "bitwise and" on strings should be a better metaphor for concatenation than "addition". :-)

-- Greg

From antoine at python.org  Sat Aug  9 07:39:16 2014
From: antoine at python.org (Antoine Pitrou)
Date: Sat, 09 Aug 2014 01:39:16 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <20140809050845.GZ4525@ando>
References: <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <20140809050845.GZ4525@ando>
Message-ID: 

Le 09/08/2014 01:08, Steven D'Aprano a écrit :
> On Fri, Aug 08, 2014 at 10:20:37PM -0400, Alexander Belopolsky wrote:
>> On Fri, Aug 8, 2014 at 8:56 PM, Ethan Furman wrote:
>>
>>> I don't use sum at all, or at least very rarely, and it still irritates me.
>>
>> You are not alone. When I see sum([a, b, c]), I think it is a + b + c, but
>> in Python it is 0 + a + b + c. If we had a "join" operator for strings
>> that is different from + - then sure, I would not try to use sum to join
>> strings, but we don't.
>
> I've long believed that + is the wrong operator for concatenating
> strings, and that & makes a much better operator. We wouldn't be having
> these interminable arguments about using sum() to concatenate strings
> (and lists, and tuples) if the & operator was used for concatenation and
> + was only used for numeric addition.

Come on. These arguments are interminable because many people (including you) love feeding interminable arguments. No need to blame Python for that.

And for that matter, this interminable discussion should probably have taken place on python-ideas or even python-list.

Regards

Antoine.

From stephen at xemacs.org  Sat Aug  9 09:08:41 2014
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 09 Aug 2014 16:08:41 +0900
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: 
References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us>
Message-ID: <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp>

Alexander Belopolsky writes:

> Why have builtin sum at all if its use comes with so many caveats?

Because we already have it. If the caveats had been known when it was introduced, maybe it wouldn't have been. The question is whether you can convince python-dev that it's worth changing the definition of sum(). IMO that's going to be very hard to do. All the suggestions I've seen so far are (IMHO, YMMV) just as ugly as the present situation.

From p.f.moore at gmail.com  Sat Aug  9 10:36:31 2014
From: p.f.moore at gmail.com (Paul Moore)
Date: Sat, 9 Aug 2014 09:36:31 +0100
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <20140809050845.GZ4525@ando>
References: <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <20140809050845.GZ4525@ando>
Message-ID: 

On 9 August 2014 06:08, Steven D'Aprano wrote:
> py> with Stopwatch():
> ...     sum(carray)  # carray is a numpy array of 75000000 floats.
> ...
> 112500000.0
> time taken: 52.659770 seconds
> py> with Stopwatch():
> ...     numpy.sum(carray)
> ...
> 112500000.0
> time taken: 0.161263 seconds
>
>> Why have builtin sum at all if its use comes with so many caveats?
>
> Because sum() is a perfectly reasonable general purpose tool for adding
> up small amounts of numbers where high floating point precision is not
> required. It has been included as a built-in because Python comes with
> "batteries included", and a basic function for adding up a few numbers
> is an obvious, simple battery. But serious programmers should be
> comfortable with the idea that you use the right tool for the right job.

Changing the subject a little, but the Stopwatch function you used up there is "an obvious, simple battery" for timing a chunk of code at the interactive prompt. I'm amazed there's nothing like it in the timeit module...

Paul

From benhoyt at gmail.com  Sat Aug  9 18:43:01 2014
From: benhoyt at gmail.com (Ben Hoyt)
Date: Sat, 9 Aug 2014 12:43:01 -0400
Subject: [Python-Dev] os.walk() is going to be *fast* with scandir
Message-ID: 

Just thought I'd share some of my excitement about how fast the all-C version [1] of os.scandir() is turning out to be.

Below are the results of my scandir / walk benchmark run with three different versions. I'm using an SSD, which seems to make it especially faster than listdir / walk. Note that benchmark results can vary a lot, depending on operating system, file system, hard drive type, and the OS's caching state.
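(The benchmark.py script itself isn't reproduced here; a rough stand-in for the os.walk side of such a measurement, with the caching caveat in mind, might look like this:)

```python
import os
import timeit

def time_walk(path, repeat=3):
    # Walk the whole tree, touching every filename, and keep the best
    # of several runs. The first run also warms the OS cache, which is
    # one reason cold-cache and warm-cache results differ so much.
    stmt = lambda: sum(len(names) for _, _, names in os.walk(path))
    return min(timeit.repeat(stmt, number=1, repeat=repeat))
```

For example, `print("os.walk took %.3fs" % time_walk("."))` times a walk of the current directory.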
Anyway, os.walk() can be FIFTY times as fast using os.scandir().

# Old ctypes implementation of scandir in scandir.py:
C:\work\scandir>\work\python\cpython\python benchmark.py -r
Using slower ctypes version of scandir
os.walk took 1.144s, scandir.walk took 0.060s -- 19.2x as fast

# Existing "half C" implementation of scandir in _scandir.c:
C:\work\scandir>\Python34-x86\python.exe benchmark.py -r
Using fast C version of scandir
os.walk took 1.160s, scandir.walk took 0.042s -- 27.6x as fast

# New "all C" os.scandir implementation in posixmodule.c:
C:\work\scandir>\work\python\cpython\python benchmark.py -r
Using Python 3.5's builtin os.scandir()
os.walk took 1.141s, scandir.walk took 0.022s -- 53.0x as fast

[1] Work in progress implementation as part of Python 3.5's posixmodule.c available here: https://github.com/benhoyt/scandir/blob/master/posixmodule.c

-Ben

From alexander.belopolsky at gmail.com  Sat Aug  9 20:02:58 2014
From: alexander.belopolsky at gmail.com (Alexander Belopolsky)
Date: Sat, 9 Aug 2014 14:02:58 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: 

On Sat, Aug 9, 2014 at 3:08 AM, Stephen J. Turnbull wrote:

> All the suggestions
> I've seen so far are (IMHO, YMMV) just as ugly as the present
> situation.

What is ugly about allowing strings? CPython certainly has a way to make sum(x, '') at least as efficient as y='';for in in x; y+= x is now.

What is ugly about making sum([a, b, ..]) be equivalent to a + b + .. so that non-empty lists of arbitrary types can be "summed"?

What is ugly about harmonizing sum(x) and reduce(operator.add, x) behaviors?
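One shape the harmonised behaviour could take (a sketch; sum2 is a hypothetical name, and what an empty iterable with no start value should do is precisely the open question):

```python
import operator
from functools import reduce

def sum2(iterable, start=None):
    # sum2([a, b, c]) == a + b + c for any types supporting +,
    # with no implicit 0 start value.
    it = iter(iterable)
    if start is None:
        try:
            start = next(it)
        except StopIteration:
            raise TypeError("sum2() of an empty iterable with no start value")
    return reduce(operator.add, it, start)

assert sum2([1, 2, 3]) == 6
assert sum2(["a", "b", "c"]) == "abc"   # works for any + -supporting type
assert sum2([[1], [2]], []) == [1, 2]
```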
From alexander.belopolsky at gmail.com  Sat Aug  9 20:04:00 2014
From: alexander.belopolsky at gmail.com (Alexander Belopolsky)
Date: Sat, 9 Aug 2014 14:04:00 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: 
References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: 

On Sat, Aug 9, 2014 at 2:02 PM, Alexander Belopolsky <alexander.belopolsky at gmail.com> wrote:

> y='';for in in x; y+= x

Should have been

    y = ''
    for i in x:
        y += i

From gokoproject at gmail.com  Sat Aug  9 20:44:10 2014
From: gokoproject at gmail.com (John Yeuk Hon Wong)
Date: Sat, 09 Aug 2014 14:44:10 -0400
Subject: [Python-Dev] class Foo(object) vs class Foo: should be clearly explained in python 2 and 3 doc
Message-ID: <53E66BFA.6070001@gmail.com>

Hi.

Referring to my discussion on [1] and then on #python this afternoon. A little background would help people to understand where this was coming from.

1. I write Python 2 code and have done zero Python-3 specific code.

2. I have always been using class Foo(object) so I do not know the new style is no longer required in Python 3. I feel "stupid" and "wrong" by thinking (object) is still a convention in Python 3.

3. Many Python 2 tutorials do not use object as the base class, whether for historical reasons or lack of information/education, and this can cause confusion for newcomers searching for answers when they consult the official documentation.

While Python 3 code no longer requires object be the base class for the new-style class definition, I believe (object) is still required if one has to write 2-3 compatible code. But this was not explained or warned about anywhere in the Python 2 and Python 3 docs, AFAIK. (If I am wrong, please correct me.)

I propose the following:

* It is desirable to state boldly to users that (object) is no longer needed in Python-3 **only** code, and to warn users to revert to the (object) style if the code needs to be 2 and 3 compatible.

* In addition, the Python 2 doc [2] should be fixed by introducing the new-style classes. This problem was noted a long long time ago according to [4].

* I would like to see warnings from suggested action item 1 on [2] and [3], for the Python 2 and 3 documentation.

Possible objection(s):

* We are pushing toward Python 3; some years later we won't need to maintain both Python 2 and 3 code, and many people, especially newcomers, will probably have no need to maintain Python 2 and 3 compatible code. My answer to that is we need to be careful with marketing. First, it is a little embarrassing to assume and then find out the assumption is not entirely accurate. Secondly, Python 2 will not go away any time soon, and most tutorials available on the Internet today are still written for Python 2. Furthermore, this CAN be a "gotcha" for new developers knowing only Python 3 writing Python 2 & 3 compatible code.

* Books can do a better job. I haven't actually reviewed/read any Python 3 books, knowing most of my code should work without bothering with Python 3-2 incompatibility yet. So I don't have an accurate answer, but from a very very quick glance over a popular Python 3 book (I am not sure if naming it is ethical or not, so I am going to grey it out here), the book just writes class Foo: and doesn't note the difference between 2 and 3 with classes. It is not wrong, since the book is about programming in Python 3, NOT writing 2 and 3, but this is where the communication breaks down. Docs and books don't give all the answers needed.

P.S. Sorry if I should have asked on #python-dev first or made a ticket, but I've decided to send to the mailing list before making a bug ticket. First time! Thanks.

Best,
Yeuk Hon

[1]: https://news.ycombinator.com/item?id=8154471

[2]: https://docs.python.org/2/tutorial/classes.html
https://docs.python.org/3/tutorial/classes.html

[3]: https://docs.python.org/3/tutorial/classes.html

[4]: https://www.python.org/doc/newstyle/

From alexander.belopolsky at gmail.com  Sat Aug  9 21:20:42 2014
From: alexander.belopolsky at gmail.com (Alexander Belopolsky)
Date: Sat, 9 Aug 2014 15:20:42 -0400
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: <20140809050845.GZ4525@ando>
References: <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <20140809050845.GZ4525@ando>
Message-ID: 

On Sat, Aug 9, 2014 at 1:08 AM, Steven D'Aprano wrote:

> We wouldn't be having
> these interminable arguments about using sum() to concatenate strings
> (and lists, and tuples) if the & operator was used for concatenation and
> + was only used for numeric addition.

But we would probably have a similar discussion about all(). :-)

Use of + is consistent with the use of * for repetition. What would you use for repetition if you use & instead? Compare, for example

    s + ' ' * (n - len(s))

and

    s & ' ' * (n - len(s))

Which one is clearer?

It is sum() that needs to be fixed, not +. Not having sum([a, b]) equivalent to a + b for any a, b pair is hard to justify.
From tjreedy at udel.edu  Sat Aug  9 22:46:56 2014
From: tjreedy at udel.edu (Terry Reedy)
Date: Sat, 09 Aug 2014 16:46:56 -0400
Subject: [Python-Dev] class Foo(object) vs class Foo: should be clearly explained in python 2 and 3 doc
In-Reply-To: <53E66BFA.6070001@gmail.com>
References: <53E66BFA.6070001@gmail.com>
Message-ID: 

On 8/9/2014 2:44 PM, John Yeuk Hon Wong wrote:
> Hi.
>
> Referring to my discussion on [1] and then on #python this afternoon.
>
> A little background would help people to understand where this was
> coming from.
>
> 1. I write Python 2 code and have done zero Python-3 specific code.
> 2. I have always been using class Foo(object) so I do not know the new
> style is no longer required in Python 3. I feel "stupid" and "wrong" by
> thinking (object) is still a convention in Python 3.

If someone else tried to make you feel that way, they are Code of Conduct violators who should be ignored. If you are beating yourself on the head, stop.

> 3. Many Python 2 tutorials do not use object as the base class whether
> for historical reasons, or lack of information/education,

Probably both. Either way, the result is a disservice to readers.

> and can cause confusion for newcomers searching for answers
> when they consult the official documentation.

I and some other people STRONGLY recommend that newcomers start with Python 3 and Python 3 docs and completely ignore Python 2 unless they cannot.

> While Python 3 code no longer requires object be the base class for the
> new-style class definition, I believe (object) is still required if one
> has to write 2-3 compatible code. But this was not explained or warned
> about anywhere in the Python 2 and Python 3 docs, AFAIK. (If I am wrong,
> please correct me.)
>
> I propose the following:
>
> * It is desirable to state boldly to users that (object) is no longer
> needed in Python-3 **only** code and warn users to revert to (object)
> style if the code needs to be 2 and 3 compatible.

I think 'boldly' and 'warn' are a bit overstated.

> * In addition, Python 2 doc [2] should be fixed by introducing the
> new-style classes.

Definitely. The 2.x tutorial starts with class x: and continues that way half way through the chapter. I think it should start with class x(object): and at the end of the first half, briefly mention that class x in 2.x gets something slightly different which beginners can mostly ignore, while class x: in 3.x == class x(object): and that the latter works the same for both.

The 3.x tutorial, in the same place, could *briefly* mention that class x: == class x(object): and that the latter is usually only used in code that also runs on 2.x or has been converted without removing the extra code. The 3.x tutorial should *not* mention old style classes.

> This problem was noted a long long time ago according to [4].

The opening statement "Unfortunately, new-style classes have not yet been integrated into Python's standard documentation." is perhaps a decade out of date. That page should not have been included in the new site design without being modified.

> [1]: https://news.ycombinator.com/item?id=8154471
>
> [2]: https://docs.python.org/2/tutorial/classes.html
> https://docs.python.org/3/tutorial/classes.html
>
> [3]: https://docs.python.org/3/tutorial/classes.html
>
> [4]: https://www.python.org/doc/newstyle/

--
Terry Jan Reedy

From jeanpierreda at gmail.com  Sat Aug  9 23:07:58 2014
From: jeanpierreda at gmail.com (Devin Jeanpierre)
Date: Sat, 9 Aug 2014 14:07:58 -0700
Subject: [Python-Dev] sum(...)
  limitation
In-Reply-To: 
References: <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <20140809050845.GZ4525@ando>
Message-ID: 

On Sat, Aug 9, 2014 at 12:20 PM, Alexander Belopolsky wrote:
> On Sat, Aug 9, 2014 at 1:08 AM, Steven D'Aprano wrote:
>>
>> We wouldn't be having
>> these interminable arguments about using sum() to concatenate strings
>> (and lists, and tuples) if the & operator was used for concatenation and
>> + was only used for numeric addition.
>
> But we would probably have a similar discussion about all(). :-)
>
> Use of + is consistent with the use of * for repetition. What would you
> use for repetition if you use & instead?

If the only goal is to not be tempted to use sum() for string concatenation, how about using *? This is more consistent with mathematics terminology, where a * b is not necessarily the same as b * a (unlike +, which is commutative). As an example, consider matrix multiplication.

Then, to answer your question, repetition would have been s ** n. (In fact, this is the notation for concatenation and repetition used in formal language theory.)

(If we really super wanted to add this to Python, obviously we'd use the @ and @@ operators. But it's a bit late for that.)

-- Devin

From steve at pearwood.info  Sun Aug 10 02:44:52 2014
From: steve at pearwood.info (Steven D'Aprano)
Date: Sun, 10 Aug 2014 10:44:52 +1000
Subject: [Python-Dev] class Foo(object) vs class Foo: should be clearly explained in python 2 and 3 doc
In-Reply-To: <53E66BFA.6070001@gmail.com>
References: <53E66BFA.6070001@gmail.com>
Message-ID: <20140810004452.GB4525@ando>

On Sat, Aug 09, 2014 at 02:44:10PM -0400, John Yeuk Hon Wong wrote:
> Hi.
>
> Referring to my discussion on [1] and then on #python this afternoon.
>
> A little background would help people to understand where this was
> coming from.
>
> 1. I write Python 2 code and have done zero Python-3 specific code.
> 2. I have always been using class Foo(object) so I do not know the new
> style is no longer required in Python 3. I feel "stupid" and "wrong" by
> thinking (object) is still a convention in Python 3.

But object is still a convention in Python 3. It is certainly required when writing code that will behave the same in version 2 and 3, and it's optional in 3-only code, but certainly not frowned upon or discouraged. There's nothing wrong with explicitly inheriting from object in Python 3, and with the Zen of Python "Explicit is better than implicit" I would argue that *leaving it out* should be very slightly discouraged.

    class Spam:          # okay, but a bit lazy
    class Spam(object):  # better

Perhaps PEP 8 should make a recommendation, but if so, I think it should be a very weak one. In Python 3, it really doesn't matter which you write. My own personal practice is to explicitly inherit from object when the class is "important" or more than half a dozen lines, and leave it out if the class is a stub or tiny.

> 3. Many Python 2 tutorials do not use object as the base class whether
> for historical reasons, or lack of information/education, and can cause
> confusion for newcomers searching for answers when they consult the
> official documentation.

We can't do anything about third party tutorials :-(

> While Python 3 code no longer requires object be the base class for the
> new-style class definition, I believe (object) is still required if one
> has to write 2-3 compatible code. But this was not explained or warned
> about anywhere in the Python 2 and Python 3 docs, AFAIK. (If I am wrong,
> please correct me.)

It's not *always* required, only if you use features which require new-style classes, e.g. super, or properties.

> I propose the following:
>
> * It is desirable to state boldly to users that (object) is no longer
> needed in Python-3 **only** code

I'm against that. Stating this boldly will be understood by some readers to mean that object should not be used, and I'm strongly against that. I believe explicitly inheriting from object should be mildly preferred, not strongly discouraged.

> and warn users to revert to (object)
> style if the code needs to be 2 and 3 compatible.

I don't think that should be necessary, but have no objections to it being mentioned. I think it should be obvious: if you need new-style behaviour in Python 2, then obviously you have to inherit from object otherwise you have a classic class. That requirement doesn't go away just because your code will sometimes run under Python 3.

Looking at your comment here:

> [1]: https://news.ycombinator.com/item?id=8154471

there is a reply from zeckalpha, who says:

    "Actually, leaving out `object` is the preferred convention for
    Python 3, as they are semantically equivalent."

How does (s)he justify this claim?

    "Explicit is better than implicit."

which is not logical. If you leave out `object`, that's implicit, not explicit.

-- Steven

From rosuav at gmail.com  Sun Aug 10 03:01:17 2014
From: rosuav at gmail.com (Chris Angelico)
Date: Sun, 10 Aug 2014 11:01:17 +1000
Subject: [Python-Dev] class Foo(object) vs class Foo: should be clearly explained in python 2 and 3 doc
In-Reply-To: <20140810004452.GB4525@ando>
References: <53E66BFA.6070001@gmail.com> <20140810004452.GB4525@ando>
Message-ID: 

On Sun, Aug 10, 2014 at 10:44 AM, Steven D'Aprano wrote:
> Looking at your comment here:
>
>> [1]: https://news.ycombinator.com/item?id=8154471
>
> there is a reply from zeckalpha, who says:
>
> "Actually, leaving out `object` is the preferred convention for
> Python 3, as they are semantically equivalent."
>
> How does (s)he justify this claim?
>
> "Explicit is better than implicit."
>
> which is not logical. If you leave out `object`, that's implicit, not
> explicit.

The justification is illogical. However, I personally believe boilerplate should be omitted where possible; that's why we have a whole lot of things that "just work". Why does Python not have explicit boolification for if/while checks? REXX does (if you try to use anything else, you get a run-time error "Logical value not 0 or 1"), and that's more explicit - Python could require you to write "if bool(x)" for the case where you actually want the truthiness magic, to distinguish from "if x is not None" etc. But that's unnecessary boilerplate. Python could have required explicit nonlocal declarations for all names used in closures, but that's unhelpful too. Python strives to eliminate that kind of thing.

So, my view would be: Py3-only tutorials can and probably should omit it, for the same reason that we don't advise piles of __future__ directives. You can always add stuff later for coping with Py2+Py3 execution; chances are any non-trivial code will have much bigger issues than accidentally making an old-style class.

ChrisA

From antoine at python.org  Sun Aug 10 05:20:27 2014
From: antoine at python.org (Antoine Pitrou)
Date: Sat, 09 Aug 2014 23:20:27 -0400
Subject: [Python-Dev] os.walk() is going to be *fast* with scandir
In-Reply-To: 
References: 
Message-ID: 

Le 09/08/2014 12:43, Ben Hoyt a écrit :
> Just thought I'd share some of my excitement about how fast the all-C
> version [1] of os.scandir() is turning out to be.
>
> Below are the results of my scandir / walk benchmark run with three
> different versions. I'm using an SSD, which seems to make it
> especially faster than listdir / walk. Note that benchmark results can
> vary a lot, depending on operating system, file system, hard drive
> type, and the OS's caching state.
>
> Anyway, os.walk() can be FIFTY times as fast using os.scandir().

Very nice results, thank you :-)

Regards

Antoine.
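The source of the speedup being discussed: a listdir()-based walk has to stat() every entry to tell files from directories, while os.scandir() yields DirEntry objects whose is_dir() can usually answer from data the directory read already returned. A simplified sketch of a scandir-based walk (the real os.walk also handles topdown, onerror and symlink loops):

```python
import os

def walk_sketch(top):
    # One system call reads the whole directory; is_dir() usually
    # needs no extra stat() on the DirEntry it returns.
    dirs, files = [], []
    for entry in os.scandir(top):
        if entry.is_dir(follow_symlinks=False):
            dirs.append(entry.name)
        else:
            files.append(entry.name)
    yield top, dirs, files
    for name in dirs:
        yield from walk_sketch(os.path.join(top, name))
```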
From ncoghlan at gmail.com  Sun Aug 10 05:57:36 2014
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sun, 10 Aug 2014 13:57:36 +1000
Subject: [Python-Dev] os.walk() is going to be *fast* with scandir
In-Reply-To: 
References: 
Message-ID: 

On 10 August 2014 13:20, Antoine Pitrou wrote:
> Le 09/08/2014 12:43, Ben Hoyt a écrit :
>
>> Just thought I'd share some of my excitement about how fast the all-C
>> version [1] of os.scandir() is turning out to be.
>>
>> Below are the results of my scandir / walk benchmark run with three
>> different versions. I'm using an SSD, which seems to make it
>> especially faster than listdir / walk. Note that benchmark results can
>> vary a lot, depending on operating system, file system, hard drive
>> type, and the OS's caching state.
>>
>> Anyway, os.walk() can be FIFTY times as fast using os.scandir().
>
> Very nice results, thank you :-)

Indeed!

This may actually motivate me to start working on a redesign of walkdir at some point, with scandir and DirEntry objects as the basis. My original approach was just too slow to be useful in practice (at least when working with trees on the scale of a full Fedora or RHEL build hosted on an NFS share).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia

From robertc at robertcollins.net  Sun Aug 10 07:40:47 2014
From: robertc at robertcollins.net (Robert Collins)
Date: Sun, 10 Aug 2014 17:40:47 +1200
Subject: [Python-Dev] os.walk() is going to be *fast* with scandir
In-Reply-To: 
References: 
Message-ID: 

A small tip from my bzr days - cd into the directory before scanning it - especially if you'll end up statting more than a fraction of the files, or are recursing - otherwise the VFS does a traversal for each path you directly stat / recurse into. This can become a dominating factor in some workloads (I shaved several hundred milliseconds off of bzr stat on kernel trees doing this).

-Rob

On 10 August 2014 15:57, Nick Coghlan wrote:
> On 10 August 2014 13:20, Antoine Pitrou wrote:
>> Le 09/08/2014 12:43, Ben Hoyt a écrit :
>>
>>> Just thought I'd share some of my excitement about how fast the all-C
>>> version [1] of os.scandir() is turning out to be.
>>>
>>> Below are the results of my scandir / walk benchmark run with three
>>> different versions. I'm using an SSD, which seems to make it
>>> especially faster than listdir / walk. Note that benchmark results can
>>> vary a lot, depending on operating system, file system, hard drive
>>> type, and the OS's caching state.
>>>
>>> Anyway, os.walk() can be FIFTY times as fast using os.scandir().
>>
>> Very nice results, thank you :-)
>
> Indeed!
>
> This may actually motivate me to start working on a redesign of
> walkdir at some point, with scandir and DirEntry objects as the basis.
> My original approach was just too slow to be useful in practice (at
> least when working with trees on the scale of a full Fedora or RHEL
> build hosted on an NFS share).
>
> Cheers,
> Nick.
>
> --
> Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia

--
Robert Collins
Distinguished Technologist
HP Converged Cloud

From larry at hastings.org  Sun Aug 10 08:11:41 2014
From: larry at hastings.org (Larry Hastings)
Date: Sat, 09 Aug 2014 23:11:41 -0700
Subject: [Python-Dev] os.walk() is going to be *fast* with scandir
In-Reply-To: 
References: 
Message-ID: <53E70D1D.3040306@hastings.org>

On 08/09/2014 10:40 PM, Robert Collins wrote:
> A small tip from my bzr days - cd into the directory before scanning it

I doubt that's permissible for a library function like os.scandir().

//arry/

From stephen at xemacs.org  Sun Aug 10 10:24:32 2014
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sun, 10 Aug 2014 17:24:32 +0900
Subject: [Python-Dev] sum(...) limitation
In-Reply-To: 
References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp>

Alexander Belopolsky writes:
> On Sat, Aug 9, 2014 at 3:08 AM, Stephen J. Turnbull wrote:
>
> > All the suggestions
> > I've seen so far are (IMHO, YMMV) just as ugly as the present
> > situation.
>
> What is ugly about allowing strings? CPython certainly has a way to
> make sum(x, '')

sum(it, '') itself is ugly. As I say, YMMV, but in general last I heard arguments that are usually constants drawn from a small set of constants are considered un-Pythonic; a separate function to express that case is preferred. I like the separate function style.

And that's the current situation, except that in the case of strings it turns out to be useful to allow for "sums" that have "glue" at the joints, so it's spelled as a string method rather than a builtin: eg, ", ".join(paramlist).

Actually ... if I were a fan of the "".join() idiom, I'd seriously propose 0.sum(numeric_iterable) as the RightThang(tm). Then we could deprecate "".join(string_iterable) in favor of "".sum(string_iterable) (with the same efficient semantics).

From stephen at xemacs.org  Sun Aug 10 11:13:51 2014
From: stephen at xemacs.org (Stephen J.
Turnbull) Date: Sun, 10 Aug 2014 18:13:51 +0900 Subject: [Python-Dev] class Foo(object) vs class Foo: should be clearly explained in python 2 and 3 doc In-Reply-To: References: <53E66BFA.6070001@gmail.com> <20140810004452.GB4525@ando> Message-ID: <8738d4k8q8.fsf@uwakimon.sk.tsukuba.ac.jp> Chris Angelico writes: > The justification is illogical. However, I personally believe > boilerplate should be omitted where possible; But it mostly can't be omitted. I wrote 22 classes (all trivial) yesterday for a Python 3 program. Not one derived directly from object. That's a bit unusual, but in the three longish scripts I have to hand, not one had more than 30% "new" classes derived from object. As a matter of personal style, I don't use optional positional arguments (with a few "traditional" exceptions); if I omit one most of the time, when I need it I use a keyword. That's not an argument, it's just an observation that's consistent with support for using an explicit parent class of object "most of the time". > that's why we have a whole lot of things that "just work". Why does > Python not have explicit boolification for if/while checks? Because it does have explicit boolification (signaled by the control structure syntax itself). No? I don't think this is less explicit than REXX, because it doesn't happen elsewhere (10 + False == 10 -- not True, and even bool(10) + False != True). > So, my view would be: Py3-only tutorials can and probably should omit > it, But this doesn't make things simpler. It means that there are two syntaxes to define some classes, and you want to make one of them TOOWTDI for classes derived directly from object, and the other TOOWTDI for non-trivial subclasses. I'll grant that in some sense it's no more complex, either, of course. Note that taken to extremes, your argument could be construed as "we should define defaults for all arguments and omit them where possible". 
Of course for typing in quick programs, and for trivial classes, omitting the derivation from object is a useful convenience. But I don't think it's something that should be encouraged in tutorials. Steve From arigo at tunes.org Sun Aug 10 12:28:25 2014 From: arigo at tunes.org (Armin Rigo) Date: Sun, 10 Aug 2014 12:28:25 +0200 Subject: [Python-Dev] os.walk() is going to be *fast* with scandir In-Reply-To: <53E70D1D.3040306@hastings.org> References: <53E70D1D.3040306@hastings.org> Message-ID: Hi Larry, On 10 August 2014 08:11, Larry Hastings wrote: >> A small tip from my bzr days - cd into the directory before scanning it > > I doubt that's permissible for a library function like os.scandir(). Indeed, chdir() is notably not compatible with multithreading. There would be a non-portable but clean way to do that: the functions openat() and fstatat(). They only exist on relatively modern Linuxes, though. A bientôt, Armin. From rdmurray at bitdance.com Sun Aug 10 15:55:40 2014 From: rdmurray at bitdance.com (R. David Murray) Date: Sun, 10 Aug 2014 09:55:40 -0400 Subject: [Python-Dev] os.walk() is going to be *fast* with scandir In-Reply-To: References: Message-ID: <20140810135542.02715250DF8@webabinitio.net> On Sun, 10 Aug 2014 13:57:36 +1000, Nick Coghlan wrote: > On 10 August 2014 13:20, Antoine Pitrou wrote: > > Le 09/08/2014 12:43, Ben Hoyt a écrit : > > > >> Just thought I'd share some of my excitement about how fast the all-C > >> version [1] of os.scandir() is turning out to be. > >> > >> Below are the results of my scandir / walk benchmark run with three > >> different versions. I'm using an SSD, which seems to make it > >> especially faster than listdir / walk. Note that benchmark results can > >> vary a lot, depending on operating system, file system, hard drive > >> type, and the OS's caching state. > >> > >> Anyway, os.walk() can be FIFTY times as fast using os.scandir(). > > > > > > Very nice results, thank you :-) > > Indeed! 
> > This may actually motivate me to start working on a redesign of > walkdir at some point, with scandir and DirEntry objects as the basis. > My original approach was just too slow to be useful in practice (at > least when working with trees on the scale of a full Fedora or RHEL > build hosted on an NFS share). There is another potentially good place in the stdlib to apply scandir: iglob. See issue 22167. --David From barry at python.org Sun Aug 10 16:39:10 2014 From: barry at python.org (Barry Warsaw) Date: Sun, 10 Aug 2014 10:39:10 -0400 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <20140810103910.2c8b9079@anarchist.localdomain> On Aug 10, 2014, at 05:24 PM, Stephen J. Turnbull wrote: >Actually ... if I were a fan of the "".join() idiom, I'd seriously >propose 0.sum(numeric_iterable) as the RightThang{tm]. Then we could >deprecate "".join(string_iterable) in favor of "".sum(string_iterable) >(with the same efficient semantics). Ever since ''.join was added, there has been vague talk about adding a join() built-in. If the semantics and argument syntax can be worked out, I'd still be in favor of that. Probably deserves a PEP and a millithread community bikeshed paintdown. 
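For concreteness, here is the status quo the sum() thread keeps circling: sum() refuses strings outright (pointing you at join()), and join() additionally handles the "glue at the joints" case that sum() has no analogue for.

```python
# Current CPython behaviour: sum() rejects str with a TypeError whose
# message recommends ''.join(); join() also supports separator "glue".
caught = False
try:
    sum(["abc", "def"], "")
except TypeError:
    caught = True  # sum() can't sum strings

glued = ", ".join(["abc", "def"])  # "glue" between the pieces
total = sum([1, 2, 3])             # numbers: start defaults to 0
```
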
-Barry From alexander.belopolsky at gmail.com Sun Aug 10 17:51:51 2014 From: alexander.belopolsky at gmail.com (Alexander Belopolsky) Date: Sun, 10 Aug 2014 11:51:51 -0400 Subject: [Python-Dev] class Foo(object) vs class Foo: should be clearly explained in python 2 and 3 doc In-Reply-To: <20140810004452.GB4525@ando> References: <53E66BFA.6070001@gmail.com> <20140810004452.GB4525@ando> Message-ID: On Sat, Aug 9, 2014 at 8:44 PM, Steven D'Aprano wrote: > It is certainly required when writing code that will behave the same in > version 2 and 3 > This is not true. An alternative is to put __metaclass__ = type at the top of your module to make all classes in your module new-style in python2. -------------- next part -------------- An HTML attachment was scrubbed... URL: From barry at python.org Sun Aug 10 18:26:39 2014 From: barry at python.org (Barry Warsaw) Date: Sun, 10 Aug 2014 12:26:39 -0400 Subject: [Python-Dev] class Foo(object) vs class Foo: should be clearly explained in python 2 and 3 doc In-Reply-To: References: <53E66BFA.6070001@gmail.com> <20140810004452.GB4525@ando> Message-ID: <20140810122639.364756bf@anarchist.localdomain> On Aug 10, 2014, at 11:51 AM, Alexander Belopolsky wrote: >This is not true. An alternative is to put > >__metaclass__ = type > >at the top of your module to make all classes in your module new-style in >python2. I like this much better, and it's what I do in my own bilingual code. It makes it much easier to remove the unnecessary cruft when you drop the Python 2 support. 
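A minimal illustration of the two class spellings. Note this runs under Python 3, where they are equivalent; the difference only matters under Python 2, where a bare class statement creates an old-style class unless `__metaclass__ = type` is in effect at module level.

```python
# Under Python 3 both spellings produce the same new-style class.
# Under Python 2 you would put
#     __metaclass__ = type
# at the top of the module so that the bare "class Foo:" form is
# new-style there as well.
class Foo:
    pass

class Bar(object):
    pass
```
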
-Barry From steve at pearwood.info Sun Aug 10 19:21:46 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Mon, 11 Aug 2014 03:21:46 +1000 Subject: [Python-Dev] class Foo(object) vs class Foo: should be clearly explained in python 2 and 3 doc In-Reply-To: References: <53E66BFA.6070001@gmail.com> <20140810004452.GB4525@ando> Message-ID: <20140810172146.GE4525@ando> On Sun, Aug 10, 2014 at 11:51:51AM -0400, Alexander Belopolsky wrote: > On Sat, Aug 9, 2014 at 8:44 PM, Steven D'Aprano wrote: > > > It is certainly required when writing code that will behave the same in > > version 2 and 3 > > > > This is not true. An alternative is to put > > __metaclass__ = type > > at the top of your module to make all classes in your module new-style in > python2. So it is. I forgot about that, thank you for the correction. -- Steven From v+python at g.nevcal.com Sun Aug 10 22:12:26 2014 From: v+python at g.nevcal.com (Glenn Linderman) Date: Sun, 10 Aug 2014 13:12:26 -0700 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <53E7D22A.3010802@g.nevcal.com> On 8/10/2014 1:24 AM, Stephen J. Turnbull wrote: > Actually ... if I were a fan of the "".join() idiom, I'd seriously > propose 0.sum(numeric_iterable) as the RightThang{tm]. Then we could > deprecate "".join(string_iterable) in favor of "".sum(string_iterable) > (with the same efficient semantics). Actually, there is no need to wait for 0.sum() to propose "".sum... but it is only a spelling change, so no real benefit. 
Thinking about this more, maybe it should be a class function, so that it wouldn't require an instance: str.sum( iterable_containing_strings ) [ or str.join( iterable_containing_strings ) ] -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdmurray at bitdance.com Sun Aug 10 22:27:25 2014 From: rdmurray at bitdance.com (R. David Murray) Date: Sun, 10 Aug 2014 16:27:25 -0400 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <53E7D22A.3010802@g.nevcal.com> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7D22A.3010802@g.nevcal.com> Message-ID: <20140810202725.D55C7250D4E@webabinitio.net> On Sun, 10 Aug 2014 13:12:26 -0700, Glenn Linderman wrote: > On 8/10/2014 1:24 AM, Stephen J. Turnbull wrote: > > Actually ... if I were a fan of the "".join() idiom, I'd seriously > > propose 0.sum(numeric_iterable) as the RightThang{tm]. Then we could > > deprecate "".join(string_iterable) in favor of "".sum(string_iterable) > > (with the same efficient semantics). > Actually, there is no need to wait for 0.sum() to propose "".sum... but > it is only a spelling change, so no real benefit. > > Thinking about this more, maybe it should be a class function, so that > it wouldn't require an instance: > > str.sum( iterable_containing_strings ) > > [ or str.join( iterable_containing_strings ) ] That's how it used to be spelled in python2. --David From rdmurray at bitdance.com Sun Aug 10 22:29:38 2014 From: rdmurray at bitdance.com (R. David Murray) Date: Sun, 10 Aug 2014 16:29:38 -0400 Subject: [Python-Dev] sum(...) 
limitation In-Reply-To: <53E7D22A.3010802@g.nevcal.com> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7D22A.3010802@g.nevcal.com> Message-ID: <20140810202939.78B0C250D67@webabinitio.net> On Sun, 10 Aug 2014 13:12:26 -0700, Glenn Linderman wrote: > On 8/10/2014 1:24 AM, Stephen J. Turnbull wrote: > > Actually ... if I were a fan of the "".join() idiom, I'd seriously > > propose 0.sum(numeric_iterable) as the RightThang{tm]. Then we could > > deprecate "".join(string_iterable) in favor of "".sum(string_iterable) > > (with the same efficient semantics). > Actually, there is no need to wait for 0.sum() to propose "".sum... but > it is only a spelling change, so no real benefit. > > Thinking about this more, maybe it should be a class function, so that > it wouldn't require an instance: > > str.sum( iterable_containing_strings ) > > [ or str.join( iterable_containing_strings ) ] Sorry, I mean 'string.join' is how it used to be spelled. Making it a class method is indeed slightly different. --David From stephen at xemacs.org Mon Aug 11 01:57:36 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Mon, 11 Aug 2014 08:57:36 +0900 Subject: [Python-Dev] sum(...) 
limitation In-Reply-To: <53E7CBA4.40105@g.nevcal.com> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> Message-ID: <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> Glenn Linderman writes: > On 8/10/2014 1:24 AM, Stephen J. Turnbull wrote: > > Actually ... if I were a fan of the "".join() idiom, I'd seriously > > propose 0.sum(numeric_iterable) as the RightThang{tm]. Then we could > > deprecate "".join(string_iterable) in favor of "".sum(string_iterable) > > (with the same efficient semantics). > Actually, there is no need to wait for 0.sum() to propose "".sum... but > it is only a spelling change, so no real benefit. IMO it's worse than merely a spelling change, because (1) "join" is a more evocative term for concatenating strings than "sum" and (2) I don't know of any other sums that allow "glue". I'm overall -1 on trying to change the current situation (except for adding a join() builtin or str.join class method). We could probably fix everything in a static-typed language (because that would allow picking an initial object of the appropriate type), but without that we need to pick a default of some particular type, and 0 makes the most sense. I can understand the desire of people who want to use the same syntax for summing an iterable of numbers and for concatenating an iterable of strings, but to me they're really not even formally the same in practical use. I'm very sympathetic to Steven's explanation that "we wouldn't be having this discussion if we used a different operator for string concatenation". 
Although that's not the whole story: in practice even numerical sums get split into multiple functions because floating point addition isn't associative, and so needs careful treatment to preserve accuracy. At that point I'm strongly +1 on abandoning attempts to "rationalize" summation. I'm not sure how I'd feel about raising an exception if you try to sum any iterable containing misbehaved types like float. But not only would that be a Python 4 effort due to backward incompatibility, but it sorta contradicts the main argument of proponents ("any type implementing __add__ should be sum()-able"). From uwe.schmitt at id.ethz.ch Mon Aug 11 11:10:53 2014 From: uwe.schmitt at id.ethz.ch (Schmitt Uwe (ID SIS)) Date: Mon, 11 Aug 2014 09:10:53 +0000 Subject: [Python-Dev] python2.7 infinite recursion when loading pickled object Message-ID: Dear all, I discovered a problem using cPickle.loads from CPython 2.7.6. The last line in the following code raises an infinite recursion class T(object): def __init__(self): self.item = list() def __getattr__(self, name): return getattr(self.item, name) import cPickle t = T() l = cPickle.dumps(t) cPickle.loads(l) loads triggers T.__getattr__ using "getattr(inst, "__setstate__", None)" for looking up a "__setstate__" method, which is not implemented for T. As the item attribute is missing at this time, the infinite recursion starts. The infinite recursion disappears if I attach a default implementation for __setstate__ to T: def __setstate__(self, dd): self.__dict__ = dd This could be fixed by using "hasattr" in pickle before trying to call "getattr". Is this a bug or did I miss something ? 
Kind Regards, Uwe From tjreedy at udel.edu Mon Aug 11 13:28:44 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Mon, 11 Aug 2014 07:28:44 -0400 Subject: [Python-Dev] python2.7 infinite recursion when loading pickled object In-Reply-To: References: Message-ID: On 8/11/2014 5:10 AM, Schmitt Uwe (ID SIS) wrote: Python usage questions should be directed to python-list, for instance. > I discovered a problem using cPickle.loads from CPython 2.7.6. The problem is your code having infinite recursion. You only discovered it with pickle. > The last line in the following code raises an infinite recursion > > class T(object): > > def __init__(self): > self.item = list() > > def __getattr__(self, name): > return getattr(self.item, name) This is a (common) bug in your program. __getattr__ should call self.__dict__[name] to avoid the recursion. -- Terry Jan Reedy From __peter__ at web.de Mon Aug 11 13:40:13 2014 From: __peter__ at web.de (Peter Otten) Date: Mon, 11 Aug 2014 13:40:13 +0200 Subject: [Python-Dev] python2.7 infinite recursion when loading pickled object References: Message-ID: Terry Reedy wrote: > On 8/11/2014 5:10 AM, Schmitt Uwe (ID SIS) wrote: > > Python usage questions should be directed to python-list, for instance. > >> I discovered a problem using cPickle.loads from CPython 2.7.6. > > The problem is your code having infinite recursion. You only discovered > it with pickle. > > >> The last line in the following code raises an infinite recursion >> >> class T(object): >> >> def __init__(self): >> self.item = list() >> >> def __getattr__(self, name): >> return getattr(self.item, name) > > This is a (common) bug in your program. __getattr__ should call > self.__dict__[name] to avoid the recursion. Read again. The OP tries to delegate attribute lookup to an (existing) attribute. IMO the root cause of the problem is that pickle looks up __dunder__ methods in the instance rather than the class. 
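A defensive version of the delegation pattern the thread is picking apart (a Python 3 sketch, not the OP's Python 2 code): look the attribute up via `__dict__` so a missing `item` raises AttributeError instead of recursing, which also lets pickle's `getattr(inst, '__setstate__', None)` probe fail cleanly.

```python
import pickle

class T:
    def __init__(self):
        self.item = []

    def __getattr__(self, name):
        # During unpickling __init__ is not called, so "item" may not
        # exist yet; go through __dict__ to avoid re-entering
        # __getattr__, and raise AttributeError (never RuntimeError).
        try:
            item = self.__dict__['item']
        except KeyError:
            raise AttributeError(name) from None
        return getattr(item, name)

t = T()
t.append(1)                          # delegated to the underlying list
t2 = pickle.loads(pickle.dumps(t))   # round-trips without recursion
```
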
From rosuav at gmail.com Mon Aug 11 13:43:00 2014 From: rosuav at gmail.com (Chris Angelico) Date: Mon, 11 Aug 2014 21:43:00 +1000 Subject: [Python-Dev] python2.7 infinite recursion when loading pickled object In-Reply-To: References: Message-ID: On Mon, Aug 11, 2014 at 9:40 PM, Peter Otten <__peter__ at web.de> wrote: > Read again. The OP tries to delegate attribute lookup to an (existing) > attribute. > > IMO the root cause of the problem is that pickle looks up __dunder__ methods > in the instance rather than the class. The recursion comes from the attempted lookup of self.item, when __init__ hasn't been called. ChrisA From rdmurray at bitdance.com Mon Aug 11 14:10:30 2014 From: rdmurray at bitdance.com (R. David Murray) Date: Mon, 11 Aug 2014 08:10:30 -0400 Subject: [Python-Dev] python2.7 infinite recursion when loading pickled object In-Reply-To: References: Message-ID: <20140811121031.4BF05250DC4@webabinitio.net> On Mon, 11 Aug 2014 21:43:00 +1000, Chris Angelico wrote: > On Mon, Aug 11, 2014 at 9:40 PM, Peter Otten <__peter__ at web.de> wrote: > > Read again. The OP tries to delegate attribute lookup to an (existing) > > attribute. > > > > IMO the root cause of the problem is that pickle looks up __dunder__ methods > > in the instance rather than the class. > > The recursion comes from the attempted lookup of self.item, when > __init__ hasn't been called. Indeed, and this is what the OP missed. With a class like this, it is necessary to *make* it pickleable, since the pickle protocol doesn't call __init__. --David From __peter__ at web.de Mon Aug 11 14:25:01 2014 From: __peter__ at web.de (Peter Otten) Date: Mon, 11 Aug 2014 14:25:01 +0200 Subject: [Python-Dev] python2.7 infinite recursion when loading pickled object References: Message-ID: Chris Angelico wrote: > On Mon, Aug 11, 2014 at 9:40 PM, Peter Otten <__peter__ at web.de> wrote: >> Read again. The OP tries to delegate attribute lookup to an (existing) >> attribute. 
>> >> IMO the root cause of the problem is that pickle looks up __dunder__ >> methods in the instance rather than the class. > > The recursion comes from the attempted lookup of self.item, when > __init__ hasn't been called. You are right. Sorry for the confusion. From benhoyt at gmail.com Mon Aug 11 14:26:47 2014 From: benhoyt at gmail.com (Ben Hoyt) Date: Mon, 11 Aug 2014 08:26:47 -0400 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: It seems to me this is something of a pointless discussion -- I highly doubt the current situation is going to change, and it works very well. Even if not perfect, sum() is for numbers, sep.join() for strings. However, I will add one comment: I'm overall -1 on trying to change the current situation (except for > adding a join() builtin or str.join class method). Did you know there actually is a str.join "class method"? I've never actually seen it used this way, but for people who just can't stand sep.join(seq), you can always call str.join(sep, seq) -- works in Python 2 and 3: >>> str.join('.', ['abc', 'def', 'ghi']) 'abc.def.ghi' This works as a side effect of the fact that you can call methods as cls.method(instance, args). -Ben -------------- next part -------------- An HTML attachment was scrubbed... 
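The trick Ben describes works because an instance method looked up on the class is just a callable that expects the instance as its first argument, so `cls.method(instance, *args)` and `instance.method(*args)` are interchangeable:

```python
parts = ["abc", "def", "ghi"]

# Calling the method through the class, instance passed explicitly:
joined_unbound = str.join(".", parts)
# The usual bound-method spelling:
joined_bound = ".".join(parts)

# The same holds for any instance method:
upper = str.upper("abc")
```
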
URL: From 4kir4.1i at gmail.com Mon Aug 11 15:01:31 2014 From: 4kir4.1i at gmail.com (Akira Li) Date: Mon, 11 Aug 2014 17:01:31 +0400 Subject: [Python-Dev] python2.7 infinite recursion when loading pickled object References: Message-ID: <8761hz9o44.fsf@gmail.com> "Schmitt Uwe (ID SIS)" writes: > I discovered a problem using cPickle.loads from CPython 2.7.6. > > The last line in the following code raises an infinite recursion > > class T(object): > > def __init__(self): > self.item = list() > > def __getattr__(self, name): > return getattr(self.item, name) > > import cPickle > > t = T() > > l = cPickle.dumps(t) > cPickle.loads(l) ... > Is this a bug or did I miss something ? The issue is that your __getattr__ raises RuntimeError (due to infinite recursion) for non-existing attributes instead of AttributeError. To fix it, you could use object.__getattribute__: class C: def __init__(self): self.item = [] def __getattr__(self, name): return getattr(object.__getattribute__(self, 'item'), name) There were issues in the past due to {get,has}attr silencing non-AttributeError exceptions; therefore it is good that pickle breaks when it gets RuntimeError instead of AttributeError. -- Akira From 4kir4.1i at gmail.com Mon Aug 11 17:26:29 2014 From: 4kir4.1i at gmail.com (Akira Li) Date: Mon, 11 Aug 2014 19:26:29 +0400 Subject: [Python-Dev] os.walk() is going to be *fast* with scandir References: <53E70D1D.3040306@hastings.org> Message-ID: <87zjfb82u2.fsf@gmail.com> Armin Rigo writes: > On 10 August 2014 08:11, Larry Hastings wrote: >>> A small tip from my bzr days - cd into the directory before scanning it >> >> I doubt that's permissible for a library function like os.scandir(). > > Indeed, chdir() is notably not compatible with multithreading. There > would be a non-portable but clean way to do that: the functions > openat() and fstatat(). They only exist on relatively modern Linuxes, > though. There is os.fwalk() that could be both safer and faster than os.walk(). 
It yields rootdir fd that can be used by functions that support dir_fd parameter, see os.supports_dir_fd set. They use *at() functions under the hood. os.fwalk() could be implemented in terms of os.scandir() if the latter would support fd parameter like os.listdir() does (be in os.supports_fd set (note: it is different from os.supports_dir_fd)). Victor Stinner suggested [1] to allow scandir(fd) but I don't see it being mentioned in the pep 471 [2]: it neither supports nor rejects the idea. [1] https://mail.python.org/pipermail/python-dev/2014-July/135283.html [2] http://legacy.python.org/dev/peps/pep-0471/ -- Akira From benhoyt at gmail.com Mon Aug 11 17:51:26 2014 From: benhoyt at gmail.com (Ben Hoyt) Date: Mon, 11 Aug 2014 11:51:26 -0400 Subject: [Python-Dev] os.walk() is going to be *fast* with scandir In-Reply-To: <87zjfb82u2.fsf@gmail.com> References: <53E70D1D.3040306@hastings.org> <87zjfb82u2.fsf@gmail.com> Message-ID: > Victor Stinner suggested [1] to allow scandir(fd) but I don't see it > being mentioned in the pep 471 [2]: it neither supports nor rejects the > idea. > > [1] https://mail.python.org/pipermail/python-dev/2014-July/135283.html > [2] http://legacy.python.org/dev/peps/pep-0471/ Yes, listdir() supports fd, and I think scandir() probably will too to parallel that, if not for v1.0 then soon after. Victor and I want to focus on getting the PEP 471 (string path only) version working first. -Ben From chris.barker at noaa.gov Mon Aug 11 17:07:39 2014 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Mon, 11 Aug 2014 08:07:39 -0700 Subject: [Python-Dev] sum(...) 
limitation In-Reply-To: <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <-2448384566377912251@unknownmsgid> > I'm very sympathetic to Steven's explanation that "we > wouldn't be having this discussion if we used a different operator for > string concatenation". Sure -- but just imagine the conversations we could be having instead: what does bitwise and of a string mean? A bytes object? I could see it as a character-wise and, for instance ;-) My confusion is still this: Repeated summation of strings has been optimized in cpython even though it's not the recommended way to solve that problem. So why not special-case optimize sum() for strings? We already special-case strings to raise an exception. It seems pretty pedantic to say: we could make this work well, but we'd rather chide you for not knowing the "proper" way to do it. Practicality beats purity? -Chris > Although that's not the whole story: in > practice even numerical sums get split into multiple functions because > floating point addition isn't associative, and so needs careful > treatment to preserve accuracy. At that point I'm strongly +1 on > abandoning attempts to "rationalize" summation. > > I'm not sure how I'd feel about raising an exception if you try to sum > any iterable containing misbehaved types like float. But not only 
> > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/chris.barker%40noaa.gov From jtaylor.debian at googlemail.com Mon Aug 11 19:20:42 2014 From: jtaylor.debian at googlemail.com (Julian Taylor) Date: Mon, 11 Aug 2014 19:20:42 +0200 Subject: [Python-Dev] sum(...) limitation - temporary elision take 2 In-Reply-To: <53dfeb83.c36fe00a.1596.65c9@mx.google.com> References: <53dfeb83.c36fe00a.1596.65c9@mx.google.com> Message-ID: <53E8FB6A.7020203@googlemail.com> On 04.08.2014 22:22, Jim J. Jewett wrote: > > > > Sat Aug 2 12:11:54 CEST 2014, Julian Taylor wrote (in > https://mail.python.org/pipermail/python-dev/2014-August/135623.html ) wrote: > > >> Andrea Griffini wrote: > >>> However sum([[1,2,3],[4],[],[5,6]], []) concatenates the lists. > >> hm could this be a pure python case that would profit from temporary >> elision [ https://mail.python.org/pipermail/python-dev/2014-June/134826.html ]? > >> lists could declare the tp_can_elide slot and call list.extend on the >> temporary during its tp_add slot instead of creating a new temporary. >> extend/realloc can avoid the copy if there is free memory available >> after the block. > > Yes, with all the same problems. > > When dealing with a complex object, how can you be sure that __add__ > won't need access to the original values during the entire computation? > It works with matrix addition, but not with matrix multiplication. > Depending on the details of the implementation, it could even fail for > a sort of sliding-neighbor addition similar to the original justification. The c-extension object knows what its add slot does. An object that cannot elide would simply always return 0 indicating to python to not call the inplace variant. E.g. 
the numpy __matmul__ operator would never tell python that it can work inplace, but __add__ would (if the arguments allow it). Though we may have found a way to do it without the direct help of Python, but it involves reading and storing the current instruction of the frame object to figure out if it is called directly from the interpreter. unfinished patch to numpy, see the can_elide_temp function: https://github.com/numpy/numpy/pull/4322.diff Probably not the best way as this is hardly intended Python C-API but assuming there is no overlooked issue with this approach it could be a good workaround for known good Python versions. From matsjoyce at gmail.com Mon Aug 11 19:42:19 2014 From: matsjoyce at gmail.com (matsjoyce) Date: Mon, 11 Aug 2014 17:42:19 +0000 (UTC) Subject: [Python-Dev] Reviving restricted mode? References: <200902231657.52201.victor.stinner@haypocalc.com> Message-ID: Yup, I read that post. However, those specific issues do not exist in my module, as there is a module whitelist, and a method whitelist. Builtins are now proxied, and all types going into functions are checked for modification. There may be some holes in my approach, but I can't find them. From breamoreboy at yahoo.co.uk Mon Aug 11 19:55:07 2014 From: breamoreboy at yahoo.co.uk (Mark Lawrence) Date: Mon, 11 Aug 2014 18:55:07 +0100 Subject: [Python-Dev] Reviving restricted mode? In-Reply-To: References: <200902231657.52201.victor.stinner@haypocalc.com> Message-ID: On 11/08/2014 18:42, matsjoyce wrote: > Yup, I read that post. However, those specific issues do not exist in my > module, as there is a module whitelist, and a method whitelist. Builtins are > now proxied, and all types going into functions are checked for > modification. There may be some holes in my approach, but I can't find them. > Any chance of giving us some context, or do I have to retrieve my crystal ball from the menders? 
-- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence From skip at pobox.com Mon Aug 11 21:00:32 2014 From: skip at pobox.com (Skip Montanaro) Date: Mon, 11 Aug 2014 14:00:32 -0500 Subject: [Python-Dev] Reviving restricted mode? In-Reply-To: References: <200902231657.52201.victor.stinner@haypocalc.com> Message-ID: On Mon, Aug 11, 2014 at 12:42 PM, matsjoyce wrote: > There may be some holes in my approach, but I can't find them. There's the rub. Given time, I suspect someone will discover a hole or two. Skip From tjreedy at udel.edu Mon Aug 11 22:29:03 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Mon, 11 Aug 2014 16:29:03 -0400 Subject: [Python-Dev] sum(...) limitation In-Reply-To: References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 8/11/2014 8:26 AM, Ben Hoyt wrote: > It seems to me this is something of a pointless discussion -- I highly > doubt the current situation is going to change, and it works very well. > Even if not perfect, sum() is for numbers, sep.join() for strings. > However, I will add one comment: > > I'm overall -1 on trying to change the current situation (except for > adding a join() builtin or str.join class method). > > > Did you know there actually is a str.join "class method"? A 'method' is a function accessed as an attribute of a class. An 'instance method' is a method whose first parameter is an instance of the class. str.join is an instance method. A 'class method', wrapped as such with classmethod(), usually by decorating it with @classmethod, would take the class as a parameter. 
> I've never > actually seen it used this way, but for people who just can't stand > sep.join(seq), you can always call str.join(sep, seq) -- works in Python > 2 and 3: > > >>> str.join('.', ['abc', 'def', 'ghi']) > 'abc.def.ghi' One could even put 'join = str.join' at the top of a file. All this is true of *every* instance method. For instance >>> int.__add__(1, 2) == 1 .__add__(2) == 1 + 2 True However, your point stands: people who cannot stand the abbreviation *could* use the full form that is being abbreviated. In ancient Python, when strings did not have methods, the current string methods were functions in the string module. The functions were removed in 3.0. Their continued use in 2.x code is bad for 3.x compatibility, so I would not encourage it. >>> help(string.join) # 2.7.8 Help on function join in module string: join(words, sep=' ') join(list [,sep]) -> string Return a string composed of the words in list, with intervening occurrences of sep. The default separator is a single space. 'List' is obsolete. Since sometime before 2.7, 'words' meant an iterable of strings. >>> def digits(): for i in range(10): yield str(i) >>> string.join(digits(), '') '0123456789' Of the string functions, I believe the conversion of join (and its synonym 'joinfields') to a method has been the most contentious.
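For readers following along, the equivalence Terry describes is easy to check directly (Python 3 shown; nothing here is version-specific):

```python
# The "unbound method" spelling discussed above, next to the usual one:
# str.join(sep, iterable) is simply the explicit form of sep.join(iterable).
pieces = ['abc', 'def', 'ghi']

usual = '.'.join(pieces)           # idiomatic spelling
explicit = str.join('.', pieces)   # same call, written out in full

# The 'join = str.join' trick mentioned above:
join = str.join
aliased = join('.', pieces)

print(usual, explicit, aliased)    # -> abc.def.ghi abc.def.ghi abc.def.ghi
```

All three spellings call the same function; only the lookup differs.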
-- Terry Jan Reedy From ischwabacher at wisc.edu Mon Aug 11 20:36:48 2014 From: ischwabacher at wisc.edu (Isaac Schwabacher) Date: Mon, 11 Aug 2014 13:36:48 -0500 Subject: [Python-Dev] pathlib handling of trailing slash (Issue #21039) In-Reply-To: <7300cd9c96075.53e90d35@wiscmail.wisc.edu> References: <7450d74797c00.53e8fda2@wiscmail.wisc.edu> <7720ab5690aa6.53e90218@wiscmail.wisc.edu> <7300df5891ad6.53e90291@wiscmail.wisc.edu> <7740dd049250d.53e902d0@wiscmail.wisc.edu> <76d0b5d095b29.53e9030c@wiscmail.wisc.edu> <76d0faa295c73.53e90349@wiscmail.wisc.edu> <7300d3ad9158d.53e90385@wiscmail.wisc.edu> <76d0891e9113b.53e903c1@wiscmail.wisc.edu> <7450aa7d96719.53e903fe@wiscmail.wisc.edu> <76e09658905d2.53e9043a@wiscmail.wisc.edu> <73d0e8ec96390.53e90477@wiscmail.wisc.edu> <7720a0b797a73.53e904b3@wiscmail.wisc.edu> <7450e7a1961cc.53e904ef@wiscmail.wisc.edu> <76d0f2bf97b45.53e9052c@wiscmail.wisc.edu> <7300c69992ecb.53e90568@wiscmail.wisc.edu> <730095779477f.53e905a5@wiscmail.wisc.edu> <777088e89092d.53e905e1@wiscmail.wisc.edu> <76d0c1c4943f4.53e9061e@wiscmail.wisc.edu> <76d0c1d195abe.53e9065a@wiscmail.wisc.edu> <7720a64797f49.53e908ef@wiscmail.wisc.edu> <73d0cfb59558a.53e90969@wiscmail.wisc.edu> <7730a669919cc.53e909a6@wiscmail.wisc.edu> <76f0fdbb943ad.53e90b11@wiscmail.wisc.edu> <7570f69e9331b.53e90b4e@wiscmail.wisc.edu> <7770c7ed96a70.53e90b8a@wiscmail.wisc.edu> <7720a69b96e77.53e90bc6@wiscmail.wisc.edu> <76f0e0c690b6c.53e90c03@wiscmail.wisc.edu> <7690824d9444a.53e90c3f@wiscmail.wisc.edu> <76d0b5b291548.53e90c7e@wiscmail.wisc.edu> <7300cd9c96075.53e90d35@wiscmail.wisc.edu> Message-ID: <76f0ccfd96094.53e8c6f0@wiscmail.wisc.edu> I see this as a parallel to the question of `pathlib.PurePath.resolve()`, about which `pathlib` is (rightly!) very opinionated. Just as `foo/../bar` shouldn't resolve to `bar`, `foo/` shouldn't be truncated to `foo`. 
And if `PurePath` doesn't do this, `Path` shouldn't either, because the difference between a `Path` and a `PurePath` is the availability of filesystem operations, not the identities of the objects involved. On another level, I think that this is a simple decision: `PosixPath` claims right there in the name to implement POSIX behavior, and POSIX specifies that `foo` and `foo/` refer (in some cases) to different directory entries. Therefore, `foo` and `foo/` can't be the same path. Moreover, `PosixPath` implements several methods that have the same name as syscalls that POSIX specifies to depend on whether their path arguments end in trailing slashes. (Even `stat` [http://pubs.opengroup.org/onlinepubs/9699919799/functions/stat.html], which explicitly follows symbolic links regardless of the presence of a trailing slash, fails with ENOTDIR if given "path/to/existing/file/".) It feels pathological for `pathlib.PosixPath` to be so almost-compliant. -ijs From victor.stinner at gmail.com Mon Aug 11 23:42:41 2014 From: victor.stinner at gmail.com (Victor Stinner) Date: Mon, 11 Aug 2014 23:42:41 +0200 Subject: [Python-Dev] Reviving restricted mode? In-Reply-To: References: <200902231657.52201.victor.stinner@haypocalc.com> Message-ID: 2014-08-11 19:42 GMT+02:00 matsjoyce : > Yup, I read that post. However, those specific issues do not exist in my > module, as there is a module whitelist, and a method whitelist. Builtins are > now proxied, and all types going in to functions are checked for > modification. There may be some holes in my approach, but I can't find them. I took a look at your code, and it looks like almost everything is blocked. Right now, I'm not sure that your sandbox is useful. For example, for a simple IRC bot, it would help to have access to some modules like math, time or random. The problem is to provide a way to allow these modules and ensure that the policy doesn't introduce a new hole. Allowing more functions increases the risk of new holes.
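To make the whitelist idea concrete, here is a toy sketch (the names `ALLOWED_MODULES`, `restricted_import` and `run_sandboxed` are invented for illustration). It is emphatically *not* a secure sandbox -- the whole point of this thread is how easily such schemes break:

```python
# Toy sketch of a module whitelist. Illustrative only; NOT a secure sandbox.
ALLOWED_MODULES = {'math', 'time', 'random'}

def restricted_import(name, globals=None, locals=None, fromlist=(), level=0):
    # Refuse anything whose top-level package is not whitelisted.
    if name.split('.')[0] not in ALLOWED_MODULES:
        raise ImportError('module %r is not whitelisted' % name)
    return __import__(name, globals, locals, fromlist, level)

def run_sandboxed(source):
    # Expose only a tiny set of builtins to the executed code.
    env = {'__builtins__': {'__import__': restricted_import, 'print': print}}
    exec(source, env)
    return env
```

Even this trivial version shows the trade-off Victor describes: each name added to the whitelist (or to the builtins dict) widens the attack surface.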
Even if your sandbox is strong, CPython contains a lot of code written in C (50% of CPython is written in C), and the C code usually takes shortcuts which ignore your sandbox. CPython source code is huge (+210k of C lines just for the core). Bugs are common, and your sandbox is vulnerable to all these bugs. See for example the Lib/test/crashers/ directory of CPython. For my pysandbox project, I wrote some proxies, and many vulnerabilities were found in these proxies. They can be explained by the nature of Python: you can introspect everything, modify everything, etc. It's very hard to design such a proxy in Python. Implementing such a proxy in C helps a little bit. The rule is always the same: your sandbox is as strong as its weakest function. A very minor bug is enough to break the whole sandbox. See the history of pysandbox for examples of such bugs (called "vulnerabilities" in the case of a sandbox). Victor From cyberdupo56 at gmail.com Tue Aug 12 01:08:00 2014 From: cyberdupo56 at gmail.com (Allen Li) Date: Mon, 11 Aug 2014 16:08:00 -0700 Subject: [Python-Dev] Multiline with statement line continuation Message-ID: <20140811230800.GA12210@gensokyo> This is a problem I sometimes run into when working with a lot of files simultaneously, where I need three or more `with` statements: with open('foo') as foo: with open('bar') as bar: with open('baz') as baz: pass Thankfully, support for multiple items was added in 3.1: with open('foo') as foo, open('bar') as bar, open('baz') as baz: pass However, this begs the need for a multiline form, especially when working with three or more items: with open('foo') as foo, \ open('bar') as bar, \ open('baz') as baz, \ open('spam') as spam, \ open('eggs') as eggs: pass Currently, this works with explicit line continuation, but as all style guides favor implicit line continuation over explicit, it would be nice if you could do the following: with (open('foo') as foo, open('bar') as bar, open('baz') as baz, open('spam') as spam, open('eggs')
as eggs): pass Currently, this is a syntax error, since the language specification for `with` is with_stmt ::= "with" with_item ("," with_item)* ":" suite with_item ::= expression ["as" target] as opposed to something like with_stmt ::= "with" with_expr ":" suite with_expr ::= with_item ("," with_item)* | '(' with_item ("," with_item)* ')' This is really just a style issue, furthermore a style issue that requires a change to the language grammar (probably, someone who knows for sure please confirm), so at first I thought it wasn't worth mentioning, but I'd like to hear what everyone else thinks. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 473 bytes Desc: not available URL: From ben+python at benfinney.id.au Tue Aug 12 01:27:57 2014 From: ben+python at benfinney.id.au (Ben Finney) Date: Tue, 12 Aug 2014 09:27:57 +1000 Subject: [Python-Dev] Multiline 'with' statement line continuation References: <20140811230800.GA12210@gensokyo> Message-ID: <85ha1i8v42.fsf@benfinney.id.au> Allen Li writes: > Currently, this works with explicit line continuation, but as all > style guides favor implicit line continuation over explicit, it would > be nice if you could do the following: > > with (open('foo') as foo, > open('bar') as bar, > open('baz') as baz, > open('spam') as spam, > open('eggs') as eggs): > pass > > Currently, this is a syntax error Even if it weren't a syntax error, the syntax would be ambiguous. How will you discern the meaning of:: with ( foo, bar, baz): pass Is that three separate context managers? Or is it one tuple with three items? I am definitely sympathetic to the desire for a good solution to multi-line 'with' statements, but I also don't want to see a special case to make it even more difficult to understand when a tuple literal is being specified in code.
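For completeness: since 3.3 the stdlib already offers one way to spread many context managers over separate lines without new syntax or backslashes -- contextlib.ExitStack. A sketch, with io.StringIO objects standing in for the open() calls above:

```python
import contextlib
import io

# io.StringIO objects stand in for open('foo'), open('bar'), ... here;
# any context managers work the same way.
sources = [io.StringIO('foo'), io.StringIO('bar'), io.StringIO('baz')]

with contextlib.ExitStack() as stack:
    streams = [stack.enter_context(s) for s in sources]
    contents = [s.read() for s in streams]

# On leaving the block, every stream is closed, in reverse order of entry.
print(contents)  # -> ['foo', 'bar', 'baz']
```

This sidesteps the tuple ambiguity entirely, at the cost of trading syntax for an API call.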
I admit I don't have a good answer to satisfy both those simultaneously. -- \ "We have met the enemy and he is us." --Walt Kelly, _Pogo_ | `\ 1971-04-22 | _o__) | Ben Finney From ncoghlan at gmail.com Tue Aug 12 02:19:06 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 12 Aug 2014 10:19:06 +1000 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <-2448384566377912251@unknownmsgid> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> <-2448384566377912251@unknownmsgid> Message-ID: On 12 Aug 2014 03:03, "Chris Barker - NOAA Federal" wrote: > > My confusion is still this: > > Repeated summation of strings has been optimized in cpython even > though it's not the recommended way to solve that problem. The quadratic behaviour of repeated str summation is a subtle, silent error. It *is* controversial that CPython silently optimises some cases of it away, since it can cause problems when porting affected code to other interpreters that don't use refcounting and thus have a harder time implementing such a trick. It's considered worth the cost, since it dramatically improves the performance of common naive code in a way that doesn't alter the semantics. > So why not special case optimize sum() for strings? We already > special-case strings to raise an exception. > > It seems pretty pedantic to say: we could make this work well, but we'd > rather chide you for not knowing the "proper" way to do it. Yes, that's exactly what this is - a nudge towards the right way to concatenate strings without incurring quadratic behaviour.
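The distinction at issue can be made concrete: repeated `+=` re-copies the accumulated prefix at each step (quadratic in total, absent CPython's refcount trick), while `str.join` computes the final size up front (linear). The two produce identical results:

```python
# Naive repeated concatenation vs the recommended str.join idiom.
def concat_naive(parts):
    result = ''
    for p in parts:
        result += p  # may copy everything accumulated so far -> O(n**2) total
    return result

def concat_join(parts):
    return ''.join(parts)  # total size computed up front -> O(n) total

parts = [str(i) for i in range(1000)]
assert concat_naive(parts) == concat_join(parts)
```

Timing the two functions on growing inputs makes the asymptotic difference easy to demonstrate on interpreters without the in-place optimisation.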
We *want* people to learn that distinction, not sweep it under the rug. That's the other reason the implicit optimisation is controversial - it hides an important difference in algorithmic complexity from users. > Practicality beats purity? Teaching users the difference between linear time operations and quadratic ones isn't about purity, it's about passing along a fundamental principle of algorithm scalability. We do it specifically for strings because they *do* have an optimised algorithm available that we can point users towards, and concatenating multiple strings is common. Other containers don't tend to be concatenated like that in the first place, so there's no such check pushing other iterables towards itertools.chain. Regards, Nick. > > -Chris > > > > > > Although that's not the whole story: in > > practice even numerical sums get split into multiple functions because > > floating point addition isn't associative, and so needs careful > > treatment to preserve accuracy. At that point I'm strongly +1 on > > abandoning attempts to "rationalize" summation. > > > > I'm not sure how I'd feel about raising an exception if you try to sum > > any iterable containing misbehaved types like float. But not only > > would that be a Python 4 effort due to backward incompatibility, but > > it sorta contradicts the main argument of proponents ("any type > > implementing __add__ should be sum()-able"). > > > > _______________________________________________ > > Python-Dev mailing list > > Python-Dev at python.org > > https://mail.python.org/mailman/listinfo/python-dev > > Unsubscribe: https://mail.python.org/mailman/options/python-dev/chris.barker%40noaa.gov > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ncoghlan at gmail.com Tue Aug 12 02:28:14 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 12 Aug 2014 10:28:14 +1000 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: <20140811230800.GA12210@gensokyo> References: <20140811230800.GA12210@gensokyo> Message-ID: On 12 Aug 2014 09:09, "Allen Li" wrote: > > This is a problem I sometimes run into when working with a lot of files > simultaneously, where I need three or more `with` statements: > > with open('foo') as foo: > with open('bar') as bar: > with open('baz') as baz: > pass > > Thankfully, support for multiple items was added in 3.1: > > with open('foo') as foo, open('bar') as bar, open('baz') as baz: > pass > > However, this begs the need for a multiline form, especially when > working with three or more items: > > with open('foo') as foo, \ > open('bar') as bar, \ > open('baz') as baz, \ > open('spam') as spam \ > open('eggs') as eggs: > pass I generally see this kind of construct as a sign that refactoring is needed. For example, contextlib.ExitStack offers a number of ways to manage multiple context managers dynamically rather than statically. Regards, Nick. -------------- next part -------------- An HTML attachment was scrubbed... URL: From benhoyt at gmail.com Tue Aug 12 02:29:51 2014 From: benhoyt at gmail.com (Ben Hoyt) Date: Mon, 11 Aug 2014 20:29:51 -0400 Subject: [Python-Dev] Multiline 'with' statement line continuation In-Reply-To: <85ha1i8v42.fsf@benfinney.id.au> References: <20140811230800.GA12210@gensokyo> <85ha1i8v42.fsf@benfinney.id.au> Message-ID: > Even if it weren't a syntax error, the syntax would be ambiguous. How > will you discern the meaning of:: > > with ( > foo, > bar, > baz): > pass > > Is that three separate context managers? Or is it one tuple with three > items? Is it meaningful to use "with" with a tuple, though? Because a tuple isn't a context manager with __enter__ and __exit__ methods. For example: >>> with (1,2,3): pass ... 
Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: __exit__ So -- although I'm not arguing for it here -- you'd be turning erroneous code (a runtime AttributeError) into valid syntax. -Ben From alexander.belopolsky at gmail.com Tue Aug 12 02:50:28 2014 From: alexander.belopolsky at gmail.com (Alexander Belopolsky) Date: Mon, 11 Aug 2014 20:50:28 -0400 Subject: [Python-Dev] sum(...) limitation In-Reply-To: References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> <-2448384566377912251@unknownmsgid> Message-ID: On Mon, Aug 11, 2014 at 8:19 PM, Nick Coghlan wrote: > Teaching users the difference between linear time operations and quadratic > ones isn't about purity, it's about passing along a fundamental principle > of algorithm scalability. I would understand if this were done in reduce(operator.add, ..), which indeed spells out the choice of an algorithm, but why should sum() be O(N) for numbers and O(N**2) for containers? Would a Python implementation that, for example, optimizes away 0's in sum(list_of_numbers) be non-compliant with some fundamental principle? -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Tue Aug 12 03:21:15 2014 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Mon, 11 Aug 2014 18:21:15 -0700 Subject: [Python-Dev] sum(...)
limitation In-Reply-To: References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> <-2448384566377912251@unknownmsgid> Message-ID: <2076096455819154683@unknownmsgid> Sorry for the bike shedding here, but: The quadratic behaviour of repeated str summation is a subtle, silent error. OK, fair enough. I suppose it would be hard and ugly to catch those instances and raise an exception pointing users to "".join. *is* controversial that CPython silently optimises some cases of it away, since it can cause problems when porting affected code to other interpreters that don't use refcounting and thus have a harder time implementing such a trick. Is there anything in the language spec that says string concatenation is O(n^2)? Or for that matter any of the performance characteristics of built-in types? Those strike me as implementation details that SHOULD be particular to the implementation. Should we cripple the performance of some operation in CPython so that it won't work better than Jython? That seems an odd choice. Then how dare PyPy make scalar computation faster? People might switch to CPython and not know they should have been using numpy all along... It's considered worth the cost, since it dramatically improves the performance of common naive code in a way that doesn't alter the semantics. Seems the same argument could be made for sum(list_of_strings). > It seems pretty pedantic to say: we could make this work well, but we'd
Yes, that's exactly what this is - a nudge towards the right way to concatenate strings without incurring quadratic behaviour. But if it were optimized, it wouldn't incur quadratic behavior. We *want* people to learn that distinction, not sweep it under the rug. But sum() is not inherently quadratic -- that's a limitation of the implementation. I agree that disallowing it is a good idea given that behavior, but if it were optimized, there would be no reason to steer people away. "".join _could_ be naively written with the same poor performance -- why should users need to understand why one was optimized and one was not? That's the other reason the implicit optimisation is controversial - it hides an important difference in algorithmic complexity from users. It doesn't hide it -- it eliminates it. I suppose it's good for folks to understand the implications of string immutability for when they write their own algorithms, but this wouldn't be considered a good argument for a poorly performing sort() for instance. > Practicality beats purity? Teaching users the difference between linear time operations and quadratic ones isn't about purity, it's about passing along a fundamental principle of algorithm scalability. That is a very important lesson to learn, sure, but Python is not only a teaching language. People will need to learn those lessons at some point; this one feature makes little difference. We do it specifically for strings because they *do* have an optimised algorithm available that we can point users towards, and concatenating multiple strings is common. Sure, but I think all that does is teach people about a CPython-specific implementation -- and I doubt naive users get any closer to understanding algorithmic complexity -- all they learn is you should use string.join(). Oh well, not really that big a deal. -Chris -------------- next part -------------- An HTML attachment was scrubbed...
URL: From ben+python at benfinney.id.au Tue Aug 12 05:41:20 2014 From: ben+python at benfinney.id.au (Ben Finney) Date: Tue, 12 Aug 2014 13:41:20 +1000 Subject: [Python-Dev] Multiline 'with' statement line continuation References: <20140811230800.GA12210@gensokyo> <85ha1i8v42.fsf@benfinney.id.au> Message-ID: <85d2c68jdr.fsf@benfinney.id.au> Ben Hoyt writes: > So -- although I'm not arguing for it here -- you'd be turning erroneous code > (a runtime AttributeError) into valid syntax. Exactly what I'd want to avoid, especially because it *looks* like a tuple. There are IMO too many pieces of code that look confusingly similar to tuples but actually mean something else. -- \ "I have an answering machine in my car. It says, 'I'm home now. | `\ But leave a message and I'll call when I'm out.'" --Steven Wright | _o__) | Ben Finney From stephen at xemacs.org Tue Aug 12 05:50:21 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Tue, 12 Aug 2014 12:50:21 +0900 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <2076096455819154683@unknownmsgid> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> <-2448384566377912251@unknownmsgid> <2076096455819154683@unknownmsgid> Message-ID: <87ppg6icxu.fsf@uwakimon.sk.tsukuba.ac.jp> Chris Barker - NOAA Federal writes: > Is there anything in the language spec that says string concatenation is > O(n^2)? Or for that matter any of the performance characteristics of built-in > types? Those strike me as implementation details that SHOULD be particular to > the implementation. Container concatenation isn't quadratic in Python at all.
The naive implementation of sum() as a loop repeatedly calling __add__ is quadratic for them. Strings (and immutable containers in general) are particularly horrible, as they don't have __iadd__. You could argue that sum() being a function of an iterable isn't just a calling convention for a loop encapsulated in a function, but rather a completely different kind of function that doesn't imply anything about the implementation, and therefore that it should dispatch on type(it). But explicitly dispatching on type(x) is yucky (what if somebody wants to sum a different type not currently recognized by the sum() builtin?) so, obviously, we should define a standard __sum__ dunder! IMO we'd also want a homogeneous_iterable ABC, and a concrete homogeneous_iterable_of_TYPE for each sum()-able TYPE to help users catch bugs injecting the wrong type into an iterable_of_TYPE. But this still sucks. Why? Because obviously we'd want the attractive nuisance of "if you have __add__, there's a default definition of __sum__" (AIUI, this is what bothers Alexander most about the current situation, at least of the things he's mentioned, I can really sympathize with his dislike). And new Pythonistas and lazy programmers who only intend to use sum() on "small enough" iterables will use the default, and their programs will appear to hang on somewhat larger iterables, or a realtime requirement will go unsatisfied when least expected, or .... If we *don't* have that property for sum(), ugh! Yuck! Same old same old! (IMHO, YMMV of course) It's possible that Python could provide some kind of feature that would allow an optimized sum function for every type that has __add__, but I think this will take a lot of thinking. *Somebody* will do it (I don't think anybody is +1 on restricting sum() to a subset of types with __add__). I just think we should wait until that somebody appears. > Should we cripple the performance of some operation in CPython so that it > won't work better than Jython?
Nobody is crippling operations. We're prohibiting use of a *name* for an operation that is associated (strongly so, in my mind) with an inefficient algorithm in favor of the *same operation* by a different name (which has no existing implementation, and therefore Python implementers are responsible for implementing it efficiently). Note: the "inefficient" algorithm isn't inefficient for integers, and it isn't inefficient for numbers in general (although it's inaccurate for some classes of numbers). > Seems the same argument [that Python language doesn't prohibit > optimizations in particular implementations just because they > aren't made in others] could be made for sum(list_of_strings). It could. But then we have to consider special-casing every builtin type that provides __add__, and we impose an unobvious burden on user types that provide __add__. > > It seems pretty pedantic to say: we could make this work well, > > but we'd rather chide you for not knowing the "proper" way to do > > it. Nobody disagrees. But backward compatibility gets in the way. > But sum() is not inherently quadratic -- that's a limitation of the > implementation. But the faulty implementation is the canonical implementation, the only one that can be defined directly in terms of __add__, and it is efficient for non-container types.[1] > "".join _could_ be naively written with the same poor performance > -- why should users need to understand why one was optimized and > one was not? Good question. They shouldn't -- thus the prohibition on sum()ing strings. > That is a very import a lesson to learn, sure, but python is not > only a teaching language. People will need to learn those lessons > at some point, this one feature makes little difference. No, it makes a big difference. If you can do something, then it's OK to do it, is something Python tries to implement. 
If sum() works for everything with an __add__, given current Python language features some people are going to end up with very inefficient code and it will bite some of them (and not necessarily the authors!) at some time. If it doesn't work for every type with __add__, why not? You'll end up playing whack-a-mole with type prohibitions. Ugh. > Sure, but I think all that does is teach people about a cpython specific > implementation -- and I doubt naive users get any closer to understanding > algorithmic complexity -- all they learn is you should use string.join(). > > Oh well, not really that big a deal. Not to Python. Maybe not to you. But I've learned a lot about Pythonic ways of doing things trying to channel the folks who implemented this restriction. (I don't claim to have gotten it right! Just that it's been fun and educational. :-) Steve Footnotes: [1] This isn't quite true. One can imagine a "token" or "symbol" type that is str without __len__, but does have __add__. But that seems silly enough to not be a problem in practice. From ethan at stoneleaf.us Tue Aug 12 06:02:17 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Mon, 11 Aug 2014 21:02:17 -0700 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <87ppg6icxu.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> <-2448384566377912251@unknownmsgid> <2076096455819154683@unknownmsgid> <87ppg6icxu.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <53E991C9.7020404@stoneleaf.us> On 08/11/2014 08:50 PM, Stephen J. 
Turnbull wrote: > Chris Barker - NOAA Federal writes: > >> It seems pretty pedantic to say: we could make this work well, >> but we'd rather chide you for not knowing the "proper" way to do >> it. > > Nobody disagrees. But backward compatibility gets in the way. Something that currently doesn't work, starts to. How is that a backward compatibility problem? -- ~Ethan~ From Nikolaus at rath.org Tue Aug 12 06:39:11 2014 From: Nikolaus at rath.org (Nikolaus Rath) Date: Mon, 11 Aug 2014 21:39:11 -0700 Subject: [Python-Dev] Commit-ready patches in need of review Message-ID: <53E99A6F.3020304@rath.org> Hello, The following commit-ready patches have been waiting for review since May and earlier. It'd be great if someone could find the time to take a look. I'll be happy to incorporate feedback as necessary: * http://bugs.python.org/issue1738 (filecmp.dircmp does exact match only) * http://bugs.python.org/issue15955 (gzip, bz2, lzma: add option to limit output size) * http://bugs.python.org/issue20177 (Derby #8: Convert 28 sites to Argument Clinic across 2 files) I only wrote the patch for one file because I'd like to have feedback before tackling the second. However, the patches are independent, so unless there are other problems this is ready for commit. Best, Nikolaus -- GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F "Time flies like an arrow, fruit flies like a Banana." From stephen at xemacs.org Tue Aug 12 08:07:29 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Tue, 12 Aug 2014 15:07:29 +0900 Subject: [Python-Dev] sum(...)
limitation In-Reply-To: <53E991C9.7020404@stoneleaf.us> References: <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> <-2448384566377912251@unknownmsgid> <2076096455819154683@unknownmsgid> <87ppg6icxu.fsf@uwakimon.sk.tsukuba.ac.jp> <53E991C9.7020404@stoneleaf.us> Message-ID: <87lhqui6la.fsf@uwakimon.sk.tsukuba.ac.jp> Ethan Furman writes: > On 08/11/2014 08:50 PM, Stephen J. Turnbull wrote: > > Chris Barker - NOAA Federal writes: > > > >> It seems pretty pedantic to say: we could make this work well, > >> but we'd rather chide you for not knowing the "proper" way to do > >> it. > > > > Nobody disagrees. But backward compatibility gets in the way. > > Something that currently doesn't work, starts to. How is that a > backward compatibility problem? I'm referring to removing the unnecessary information that there's a better way to do it, and simply raising an error (as in Python 3.2, say) which is all a RealProgrammer[tm] should ever need! That would be a regression and backward incompatible. From ncoghlan at gmail.com Tue Aug 12 09:30:22 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 12 Aug 2014 17:30:22 +1000 Subject: [Python-Dev] sum(...) 
limitation In-Reply-To: <2076096455819154683@unknownmsgid> References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> <-2448384566377912251@unknownmsgid> <2076096455819154683@unknownmsgid> Message-ID: On 12 Aug 2014 11:21, "Chris Barker - NOAA Federal" wrote: > > Sorry for the bike shedding here, but: > >> The quadratic behaviour of repeated str summation is a subtle, silent error. > > OK, fair enough. I suppose it would be hard and ugly to catch those instances and raise an exception pointing users to "".join. >> >> *is* controversial that CPython silently optimises some cases of it away, since it can cause problems when porting affected code to other interpreters that don't use refcounting and thus have a harder time implementing such a trick. > > Is there anything in the language spec that says string concatenation is O(n^2)? Or for that matter any of the performs characteristics of build in types? Those striker as implementation details that SHOULD be particular to the implementation. If you implement strings so they have multiple data segments internally (as is the case for StringIO these days), yes, you can avoid quadratic time concatenation behaviour. Doing so makes it harder to meet other complexity expectations (like O(1) access to arbitrary code points), and isn't going to happen in CPython regardless due to C API backwards compatibility constraints. For the explicit loop with repeated concatenation, we can't say "this is slow, don't do it". 
People do it anyway, so we've opted for the "fine, make it as fast as we can" option as being preferable to an obscure and relatively hard to debug performance problem. For sum(), we have the option of being more direct and just telling people Python's answer to the string concatenation problem (i.e. str.join). That is decidedly *not* the series of operations described in sum's documentation as "Sums start and the items of an iterable from left to right and returns the total." Regards, Nick. -------------- next part -------------- An HTML attachment was scrubbed... URL: From arigo at tunes.org Tue Aug 12 10:02:00 2014 From: arigo at tunes.org (Armin Rigo) Date: Tue, 12 Aug 2014 10:02:00 +0200 Subject: [Python-Dev] sum(...) limitation In-Reply-To: References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> <-2448384566377912251@unknownmsgid> <2076096455819154683@unknownmsgid> Message-ID: Hi all, The core of the matter is that if we repeatedly __add__ strings from a long list, we get O(n**2) behavior. From one point of view, the reason is that the additions proceed in left-to-right order. Indeed, sum() could proceed in a more balanced tree-like order: from [x0, x1, x2, x3, ...], reduce the list to [x0+x1, x2+x3, ...]; then repeat until there is only one item in the final list. This order ensures that sum(list_of_strings) is at worst O(n log n). It might be in practice close enough to linear to not matter. It also considerably improves the precision of sum(list_of_floats) (though not reaching the same precision levels as math.fsum()). Just a thought, Armin.
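Armin's balanced reduction is easy to sketch in pure Python. The following is only an illustration of the idea (nothing like it exists in CPython's sum()); each element participates in roughly log2(n) additions, so summing n equal-length strings costs O(n log n) rather than O(n**2):

```python
def tree_sum(items, start=0):
    """Reduce [x0, x1, x2, x3, ...] to [x0+x1, x2+x3, ...] repeatedly
    until one value remains, instead of adding strictly left to right."""
    items = list(items)
    if not items:
        return start
    while len(items) > 1:
        # Add adjacent pairs; an odd trailing element is carried over as-is.
        items = [items[i] + items[i + 1] if i + 1 < len(items) else items[i]
                 for i in range(0, len(items), 2)]
    return items[0]
```

The same reshuffling is what helps float accuracy: pairwise (cascade) summation keeps the partial sums closer in magnitude than a running left-to-right total does, though as Armin notes it still falls short of math.fsum().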
From jeanpierreda at gmail.com Tue Aug 12 12:43:07 2014 From: jeanpierreda at gmail.com (Devin Jeanpierre) Date: Tue, 12 Aug 2014 03:43:07 -0700 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: <20140811230800.GA12210@gensokyo> References: <20140811230800.GA12210@gensokyo> Message-ID: I think this thread is probably Python-Ideas territory... On Mon, Aug 11, 2014 at 4:08 PM, Allen Li wrote: > Currently, this works with explicit line continuation, but as all style > guides favor implicit line continuation over explicit, it would be nice > if you could do the following: > > with (open('foo') as foo, > open('bar') as bar, > open('baz') as baz, > open('spam') as spam, > open('eggs') as eggs): > pass The parentheses seem unnecessary/redundant/weird. Why not allow newlines in-between "with" and the terminating ":"? with open('foo') as foo, open('bar') as bar, open('baz') as baz: pass -- Devin From steve at pearwood.info Tue Aug 12 14:15:41 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Tue, 12 Aug 2014 22:15:41 +1000 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: References: <20140811230800.GA12210@gensokyo> Message-ID: <20140812121541.GG4525@ando> On Tue, Aug 12, 2014 at 10:28:14AM +1000, Nick Coghlan wrote: > On 12 Aug 2014 09:09, "Allen Li" wrote: > > > > This is a problem I sometimes run into when working with a lot of files > > simultaneously, where I need three or more `with` statements: > > > > with open('foo') as foo: > > with open('bar') as bar: > > with open('baz') as baz: > > pass > > > > Thankfully, support for multiple items was added in 3.1: > > > > with open('foo') as foo, open('bar') as bar, open('baz') as baz: > > pass > > > > However, this begs the need for a multiline form, especially when > > working with three or more items: > > > > with open('foo') as foo, \ > > open('bar') as bar, \ > > open('baz') as baz, \ > > open('spam') as spam \ > > open('eggs') as eggs: > > pass > > I 
generally see this kind of construct as a sign that refactoring is > needed. For example, contextlib.ExitStack offers a number of ways to manage > multiple context managers dynamically rather than statically. I don't think that ExitStack is the right solution for when you have a small number of context managers known at edit-time. The extra effort of writing your code, and reading it, in a dynamic manner is not justified. Compare the natural way of writing this: with open("spam") as spam, open("eggs", "w") as eggs, frobulate("cheese") as cheese: # do stuff with spam, eggs, cheese versus the dynamic way: with ExitStack() as stack: spam, eggs = [stack.enter_context(open(fname, mode)) for fname, mode in zip(("spam", "eggs"), ("r", "w"))] cheese = stack.enter_context(frobulate("cheese")) # do stuff with spam, eggs, cheese I prefer the first, even with the long line. -- Steven From graffatcolmingov at gmail.com Tue Aug 12 15:04:35 2014 From: graffatcolmingov at gmail.com (Ian Cordasco) Date: Tue, 12 Aug 2014 08:04:35 -0500 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: <20140812121541.GG4525@ando> References: <20140811230800.GA12210@gensokyo> <20140812121541.GG4525@ando> Message-ID: On Tue, Aug 12, 2014 at 7:15 AM, Steven D'Aprano wrote: > On Tue, Aug 12, 2014 at 10:28:14AM +1000, Nick Coghlan wrote: >> On 12 Aug 2014 09:09, "Allen Li" wrote: >> > >> > This is a problem I sometimes run into when working with a lot of files >> > simultaneously, where I need three or more `with` statements: >> > >> > with open('foo') as foo: >> > with open('bar') as bar: >> > with open('baz') as baz: >> > pass >> > >> > Thankfully, support for multiple items was added in 3.1: >> > >> > with open('foo') as foo, open('bar') as bar, open('baz') as baz: >> > pass >> > >> > However, this begs the need for a multiline form, especially when >> > working with three or more items: >> > >> > with open('foo') as foo, \ >> > open('bar') as bar, \ >> > open('baz') as baz,
\ >> > open('spam') as spam, \ >> > open('eggs') as eggs: >> > pass >> >> I generally see this kind of construct as a sign that refactoring is >> needed. For example, contextlib.ExitStack offers a number of ways to manage >> multiple context managers dynamically rather than statically. > > I don't think that ExitStack is the right solution for when you have a > small number of context managers known at edit-time. The extra effort of > writing your code, and reading it, in a dynamic manner is not justified. > Compare the natural way of writing this: > > with open("spam") as spam, open("eggs", "w") as eggs, frobulate("cheese") as cheese: > # do stuff with spam, eggs, cheese > > versus the dynamic way: > > with ExitStack() as stack: > spam, eggs = [stack.enter_context(open(fname, mode)) for fname, mode in > zip(("spam", "eggs"), ("r", "w"))] > cheese = stack.enter_context(frobulate("cheese")) > # do stuff with spam, eggs, cheese > > I prefer the first, even with the long line. I agree with Steven for *small* numbers of context managers. Once they become too long though, either refactoring is severely needed or the user should use ExitStack. To quote Ben Hoyt: > Is it meaningful to use "with" with a tuple, though? Because a tuple > isn't a context manager with __enter__ and __exit__ methods. For > example: > > >>> with (1,2,3): pass > ... > Traceback (most recent call last): > File "", line 1, in > AttributeError: __exit__ > > So -- although I'm not arguing for it here -- you'd be turning invalid code > (a runtime AttributeError) into valid syntax. I think by introducing parentheses we are going to risk seriously confusing users who may then try to write an assignment like a = (open('spam') as spam, open('eggs') as eggs) Because it looks like a tuple but isn't and I think the extra complexity this would add to the language would not be worth the benefit.
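As quoted, the ExitStack snippet has a couple of slips (the mode belongs inside open(), and a closing parenthesis is missing). A corrected, runnable sketch of the same dynamic pattern -- with throwaway temp files standing in for "spam" and "eggs" -- looks like this:

```python
import os
import tempfile
from contextlib import ExitStack

# Set up placeholder files for the demo.
tmpdir = tempfile.mkdtemp()
spam_path = os.path.join(tmpdir, "spam")
eggs_path = os.path.join(tmpdir, "eggs")
with open(spam_path, "w") as f:
    f.write("spam data")

with ExitStack() as stack:
    spam, eggs = (stack.enter_context(open(path, mode))
                  for (path, mode) in ((spam_path, "r"), (eggs_path, "w")))
    eggs.write(spam.read().upper())

# ExitStack closes every file it entered, even if a later open() had failed.
assert spam.closed and eggs.closed
```

The win of ExitStack is exactly that unwinding guarantee when the number of managers is only known at run time; for two or three statically known files it is, as Steven says, more ceremony than the one-line form.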
If we simply look at Ruby for what happens when you have an overloaded syntax that means two different things, you can see why I'm against modifying this syntax. In Ruby, parentheses for method calls are optional and curly braces (i.e., {}) are used for blocks and hash literals. With a method on a class that takes a parameter and a block, you get some confusing errors; take for example: class Spam def eggs(ham) puts ham yield if block_present? end end s = Spam.new s.eggs {monty: 'python'} SyntaxError: ... But s.eggs({monty: 'python'}) will print out the hash. The interpreter isn't intelligent enough to know if you're attempting to pass a hash as a parameter or a block to be executed. This may seem like a stretch to apply to Python, but the concept of muddling the meaning of something already very well defined seems like a bad idea. From guido at python.org Tue Aug 12 17:12:45 2014 From: guido at python.org (Guido van Rossum) Date: Tue, 12 Aug 2014 08:12:45 -0700 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: References: <20140811230800.GA12210@gensokyo> Message-ID: On Tue, Aug 12, 2014 at 3:43 AM, Devin Jeanpierre wrote: > I think this thread is probably Python-Ideas territory... > > On Mon, Aug 11, 2014 at 4:08 PM, Allen Li wrote: > > Currently, this works with explicit line continuation, but as all style > > guides favor implicit line continuation over explicit, it would be nice > > if you could do the following: > > > > with (open('foo') as foo, > > open('bar') as bar, > > open('baz') as baz, > > open('spam') as spam, > > open('eggs') as eggs): > > pass > > The parentheses seem unnecessary/redundant/weird. Why not allow > newlines in-between "with" and the terminating ":"? > > with open('foo') as foo, > open('bar') as bar, > open('baz') as baz: > pass > That way lies Coffeescript. Too much guessing. -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed...
URL: From arigo at tunes.org Tue Aug 12 18:57:39 2014 From: arigo at tunes.org (Armin Rigo) Date: Tue, 12 Aug 2014 18:57:39 +0200 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: <20140811230800.GA12210@gensokyo> References: <20140811230800.GA12210@gensokyo> Message-ID: Hi, On 12 August 2014 01:08, Allen Li wrote: > with (open('foo') as foo, > open('bar') as bar, > open('baz') as baz, > open('spam') as spam, > open('eggs') as eggs): > pass +1. It's exactly the same grammar extension as for "from import" statements, for the same reason. Armin From g.brandl at gmx.net Tue Aug 12 20:52:44 2014 From: g.brandl at gmx.net (Georg Brandl) Date: Tue, 12 Aug 2014 20:52:44 +0200 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: References: <20140811230800.GA12210@gensokyo> Message-ID: On 08/12/2014 06:57 PM, Armin Rigo wrote: > Hi, > > On 12 August 2014 01:08, Allen Li wrote: >> with (open('foo') as foo, >> open('bar') as bar, >> open('baz') as baz, >> open('spam') as spam, >> open('eggs') as eggs): >> pass > > +1. It's exactly the same grammar extension as for "from import" > statements, for the same reason. Not the same: in import statements it unambiguously replaces a list of (optionally as-renamed) identifiers. Here, it would replace an arbitrary expression, which I think would mean that we couldn't differentiate between e.g. with (expr).meth(): # a line break in "expr" # would make the parens useful and with (expr1, expr2): cheers, Georg From chris.barker at noaa.gov Tue Aug 12 21:11:35 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Tue, 12 Aug 2014 12:11:35 -0700 Subject: [Python-Dev] sum(...) 
limitation In-Reply-To: <87lhqui6la.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> <-2448384566377912251@unknownmsgid> <2076096455819154683@unknownmsgid> <87ppg6icxu.fsf@uwakimon.sk.tsukuba.ac.jp> <53E991C9.7020404@stoneleaf.us> <87lhqui6la.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Mon, Aug 11, 2014 at 11:07 PM, Stephen J. Turnbull wrote: > I'm referring to removing the unnecessary information that there's a > better way to do it, and simply raising an error (as in Python 3.2, > say) which is all a RealProgrammer[tm] should ever need! > I can't imagine anyone is suggesting that -- disallow it, but don't tell anyone why? The only thing that is remotely on the table here is: 1) remove the special case for strings -- buyer beware -- but consistent and less "ugly" 2) add a special case for strings that is fast and efficient -- may be as simple as calling "".join() under the hood -- no more code than the exception check. And I doubt anyone really is pushing for anything but (2) Stephen Turnbull wrote: > IMO we'd also want a homogeneous_iterable ABC Actually, I've thought for years that that would open the door to a lot of optimizations -- but that's a much broader question than sum(). I even brought it up probably over ten years ago -- but no one was the least bit interested -- nor are they now -- I know this was a rhetorical suggestion to make the point about what not to do.... Because obviously we'd want the > attractive nuisance of "if you have __add__, there's a default > definition of __sum__" now I'm confused -- isn't that exactly what we have now?
It's possible that Python could provide some kind of feature that > would allow an optimized sum function for every type that has __add__, > but I think this will take a lot of thinking. does it need to be every type? As it is, the common ones work fine already except for strings -- so if we add an optimized string sum() then we're done. *Somebody* will do it > (I don't think anybody is +1 on restricting sum() to a subset of types > with __add__). uhm, that's exactly what we have now -- you can use sum() with anything that has an __add__, except strings. And by that logic, if we thought there were other inefficient use cases, we'd restrict those too. But users can always define their own classes that have a __sum__ and are really inefficient -- so unless sum() becomes just for a certain subset of built-in types -- does anyone want that? Then we are back to the current situation: sum() can be used for any type that has an __add__ defined. But naive users are likely to try it with strings, and that's bad, so we want to prevent that, and have a special case check for strings. What I fail to see is why it's better to raise an exception and point users to a better way, than to simply provide an optimization so that it's a moot issue. The only justification offered here is that it will teach people that summing strings (and some other objects?) is order(N^2) and a bad idea. But: a) Python's primary purpose is practical, not pedagogical (not that it isn't great for that) b) I doubt any naive users learn anything other than "I can't use sum() for strings, I should use "".join()". Will they make the leap to "I shouldn't use string concatenation in a loop, either"? Oh, wait, you can use string concatenation in a loop -- that's been optimized. So will they learn: "some types of objects have poor performance with repeated concatenation and shouldn't be used with sum(). So if I write such a class, and want to sum them up, I'll need to write an optimized version of that code"?
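The two alternatives being debated are both observable today: sum() refuses a str start value with a pointer to str.join, while an explicit loop is silently accepted (and relies on the CPython in-place optimisation discussed earlier). A quick sketch (exact exception wording may vary by version):

```python
words = ["neo", "trinity", "morpheus"]

# The recommended linear-time spelling:
joined = "".join(words)

# sum() special-cases a str start value and raises instead of proceeding:
message = ""
try:
    sum(words, "")
except TypeError as exc:
    message = str(exc)

# The explicit loop is allowed; each += may copy everything accumulated
# so far, which is O(n**2) without the refcount-based optimisation:
concatenated = ""
for word in words:
    concatenated += word
```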
I submit that no naive user is going to get any closer to a proper understanding of algorithmic Order behavior from this small hint. Which leaves no reason to prefer an Exception to an optimization. One other point: perhaps this will lead a naive user into thinking -- "sum() raises an exception if I try to use it inefficiently, so it must be OK to use for anything that doesn't raise an exception" -- that would be a bad lesson to mis-learn.... -Chris PS: Armin Rigo wrote: > It also improves a > lot the precision of sum(list_of_floats) (though not reaching the same > precision levels of math.fsum()). while we are at it, having the default sum() for floats be fsum() would be nice -- I'd rather the default was better accuracy at lower performance. Folks that really care about performance could call math.fastsum(), or really, use numpy... This does turn sum() into a function that does type-based dispatch, but isn't Python full of those already? Do something special for the types you know about, call the generic dunder method for the rest. -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL:
limitation In-Reply-To: References: <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> <-2448384566377912251@unknownmsgid> <2076096455819154683@unknownmsgid> <87ppg6icxu.fsf@uwakimon.sk.tsukuba.ac.jp> <53E991C9.7020404@stoneleaf.us> <87lhqui6la.fsf@uwakimon.sk.tsukuba.ac.jp>, Message-ID: An HTML attachment was scrubbed... URL: From jeanpierreda at gmail.com Wed Aug 13 02:41:32 2014 From: jeanpierreda at gmail.com (Devin Jeanpierre) Date: Tue, 12 Aug 2014 17:41:32 -0700 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: References: <20140811230800.GA12210@gensokyo> Message-ID: On Tue, Aug 12, 2014 at 8:12 AM, Guido van Rossum wrote: > On Tue, Aug 12, 2014 at 3:43 AM, Devin Jeanpierre > wrote: >> The parentheses seem unnecessary/redundant/weird. Why not allow >> newlines in-between "with" and the terminating ":"? >> >> with open('foo') as foo, >> open('bar') as bar, >> open('baz') as baz: >> pass > > > That way lies Coffeescript. Too much guessing. There's no syntactic ambiguity, so what guessing are you talking about? What *really* requires guessing is figuring out where in Python's syntax parentheses are allowed vs not allowed ;). For example, "from foo import (bar, baz)" is legal, but "import (bar, baz)" is not. Sometimes it feels like Python is slowly and organically evolving into a parenthesis-delimited language. -- Devin From Nikolaus at rath.org Wed Aug 13 04:48:34 2014 From: Nikolaus at rath.org (Nikolaus Rath) Date: Tue, 12 Aug 2014 19:48:34 -0700 Subject: [Python-Dev] sum(...)
limitation In-Reply-To: (Chris Barker's message of "Tue, 12 Aug 2014 12:11:35 -0700") References: <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> <-2448384566377912251@unknownmsgid> <2076096455819154683@unknownmsgid> <87ppg6icxu.fsf@uwakimon.sk.tsukuba.ac.jp> <53E991C9.7020404@stoneleaf.us> <87lhqui6la.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87a9796r5p.fsf@vostro.rath.org> Chris Barker writes: > What I fail to see is why it's better to raise an exception and point users > to a better way, than to simply provide an optimization so that it's a moot > issue. > > The only justification offered here is that it will teach people that summing > strings (and some other objects?) is order(N^2) and a bad idea. But: > > a) Python's primary purpose is practical, not pedagogical (not that it > isn't great for that) > > b) I doubt any naive users learn anything other than "I can't use sum() for > strings, I should use "".join()". Will they make the leap to "I shouldn't > use string concatenation in a loop, either"? Oh, wait, you can use string > concatenation in a loop -- that's been optimized. So will they learn: "some > types of objects have poor performance with repeated concatenation and > shouldn't be used with sum(). So if I write such a class, and want to sum > them up, I'll need to write an optimized version of that code"?
> One other point: perhaps this will lead a naive user into thinking -- > "sum() raises an exception if I try to use it inefficiently, so it must be > OK to use for anything that doesn't raise an exception" -- that would be a > bad lesson to mis-learn.... AOL to that. Best, -Nikolaus -- GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F "Time flies like an arrow, fruit flies like a Banana." From steve at pearwood.info Wed Aug 13 05:38:55 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Wed, 13 Aug 2014 13:38:55 +1000 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: References: <20140811230800.GA12210@gensokyo> <20140812121541.GG4525@ando> Message-ID: <20140813033855.GH4525@ando> On Tue, Aug 12, 2014 at 08:04:35AM -0500, Ian Cordasco wrote: > I think by introducing parentheses we are going to risk seriously > confusing users who may then try to write an assignment like > > a = (open('spam') as spam, open('eggs') as eggs) Seriously? If they try it, they will get a syntax error. Now, admittedly Python's syntax error messages tend to be terse and cryptic, but it's still enough to show that you can't do that. py> a = (open('spam') as spam, open('eggs') as eggs) File "", line 1 a = (open('spam') as spam, open('eggs') as eggs) ^ SyntaxError: invalid syntax I don't see this as a problem. There's no limit to the things that people *might* do if they don't understand Python semantics: for module in sys, math, os, import module (and yes, I once tried this as a beginner) but they try it once, realise it doesn't work, and never do it again. > Because it looks like a tuple but isn't and I think the extra > complexity this would add to the language would not be worth the > benefit. Do we have a problem with people thinking that, since tuples are normally interchangeable with lists, they can write this?
from module import [fe, fi, fo, fum, spam, eggs, cheese] and then being "seriously confused" by the syntax error they receive? Or writing this? from (module import fe, fi, fo, fum, spam, eggs, cheese) It's not sufficient that people might try it, see it fail, and move on. Your claim is that it will cause serious confusion. I just don't see that happening. > If we simply look at Ruby for what happens when you have an > overloaded syntax that means two different things, you can see why I'm > against modifying this syntax. That ship has sailed in Python, oh, 20+ years ago. Parens are used for grouping, for tuples[1], for function calls, for parameter lists, class base-classes, generator expressions and line continuations. I cannot think of any examples where these multiple uses for parens have caused meaningful confusion, and I don't think this one will either. [1] Technically not, since it's the comma, not the ( ), which makes a tuple, but a lot of people don't know that and treat it as if the parens were compulsory. -- Steven From stephen at xemacs.org Wed Aug 13 08:21:42 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 13 Aug 2014 15:21:42 +0900 Subject: [Python-Dev] sum(...) limitation In-Reply-To: References: <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> <-2448384566377912251@unknownmsgid> <2076096455819154683@unknownmsgid> <87ppg6icxu.fsf@uwakimon.sk.tsukuba.ac.jp> <53E991C9.7020404@stoneleaf.us> <87lhqui6la.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87egwkj4eh.fsf@uwakimon.sk.tsukuba.ac.jp> Redirecting to python-ideas, so trimming less than I might. Chris Barker writes: > On Mon, Aug 11, 2014 at 11:07 PM, Stephen J.
Turnbull > wrote: > > > I'm referring to removing the unnecessary information that there's a > > better way to do it, and simply raising an error (as in Python 3.2, > > say) which is all a RealProgrammer[tm] should ever need! > > > > I can't imagine anyone is suggesting that -- disallow it, but don't tell > anyone why? As I said, it's a regression. That's exactly the behavior in Python 3.2. > The only thing that is remotely on the table here is: > > 1) remove the special case for strings -- buyer beware -- but consistent > and less "ugly" It's only consistent if you believe that Python has strict rules for use of various operators. It doesn't, except as far as they are constrained by precedence. For example, I have an application where I add bytestrings bytewise modulo N <= 256, and concatenate them. In fact I use function call syntax, but the obvious operator syntax is '+' for the bytewise addition, and '*' for the concatenation. It's not in the Zen, but I believe in the maxim "If it's worth doing, it's worth doing well." So for me, 1) is out anyway. > 2) add a special case for strings that is fast and efficient -- may be as > simple as calling "".join() under the hood --no more code than the > exception check. Sure, but what about all the other immutable containers with __add__ methods? What about mappings with key-wise __add__ methods whose values might be immutable but have __add__ methods? Where do you stop with the special-casing? I consider this far more complex and ugly than the simple "sum() is for numbers" rule (and even that is way too complex considering accuracy of summing floats). > And I doubt anyone really is pushing for anything but (2) I know that, but I think it's the wrong solution to the problem (which is genuine IMO). The right solution is something generic, possibly a __sum__ method. The question is whether that leads to too much work to be worth it (eg, "homogeneous_iterable"). 
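Stephen's hypothetical __sum__ hook (his thought experiment, not an actual CPython protocol) could be prototyped in pure Python along these lines: the start value's type gets a chance to take over the whole reduction, and everything else falls back to left-to-right addition:

```python
def my_sum(iterable, start=0):
    """sum() lookalike with an opt-in __sum__ override on the start type."""
    hook = getattr(type(start), "__sum__", None)
    if hook is not None:
        # The type claims to know a better way to combine many items at once.
        return hook(start, iterable)
    total = start
    for item in iterable:
        total = total + item
    return total

class JoiningStr(str):
    """A str subclass whose __sum__ delegates to linear-time str.join."""
    def __sum__(self, iterable):
        return self + "".join(iterable)
```

With this kind of dispatch, my_sum(strings, JoiningStr('')) is linear without any string special-casing inside the summation loop itself -- which is the generality Stephen is after, at the cost of one more dunder to specify.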
> > Because obviously we'd want the attractive nuisance of "if you > > have __add__, there's a default definition of __sum__" > > now I'm confused -- isn't that exactly what we have now? Yes and my feeling (backed up by arguments that I admit may persuade nobody but myself) is that what we have now kinda sucks[tm]. It seemed like a good idea when I first saw it, but then, my apps don't scale to where the pain starts in my own usage. > > It's possible that Python could provide some kind of feature that > > would allow an optimized sum function for every type that has > > __add__, but I think this will take a lot of thinking. > > does it need to be every type? As it is the common ones work fine already > except for strings -- so if we add an optimized string sum() then we're > done. I didn't say provide an optimized sum(), I said provide a feature enabling people who want to optimize sum() to do so. So yes, it needs to be every type (the optional __sum__ method is a proof of concept, modulo it actually being implementable ;-). > > *Somebody* will do it (I don't think anybody is +1 on restricting > > sum() to a subset of types with __add__). > > uhm, that's exactly what we have now Exactly. Who's arguing that the sum() we have now is a ticket to Paradise? I'm just saying that there's probably somebody out there negative enough on the current situation to come up with an answer that I think is general enough (and I suspect that python-dev consensus is that demanding, too). > sum() can be used for any type that has an __add__ defined. I'd like to see that be mutable types with __iadd__. > What I fail to see is why it's better to raise an exception and > point users to a better way, than to simply provide an optimization > so that it's a mute issue. Because inefficient sum() is an attractive nuisance, easy to overlook, and likely to bite users other than the author. 
> The only justification offered here is that will teach people that summing > strings (and some other objects?) Summing tuples works (with appropriate start=tuple()). Haven't benchmarked, but I bet that's O(N^2). > is order(N^2) and a bad idea. But: > > a) Python's primary purpose is practical, not pedagogical (not that it > isn't great for that) My argument is that in practical use sum() is a bad idea, period, until you bone up on the types and applications where it *does* work. N.B. It doesn't even work properly for numbers (inaccurate for floats). > b) I doubt any naive users learn anything other than "I can't use sum() for > strings, I should use "".join()". For people who think that special-casing strings is a good idea, I think this is about as much benefit as you can expect. Why go farther?<0.5 wink/> > I submit that no naive user is going to get any closer to a proper > understanding of algorithmic Order behavior from this small hint. Which > leaves no reason to prefer an Exception to an optimization. TOOWTDI. str.join is in pretty much every code base by now, and tutorials and FAQs recommending its use and severely deprecating sum for strings are legion. > One other point: perhaps this will lead a naive user into thinking -- > "sum() raises an exception if I try to use it inefficiently, so it must be > OK to use for anything that doesn't raise an exception" -- that would be a > bad lesson to mis-learn.... That assumes they know about the start argument. I think most naive users will just try to sum a bunch of tuples, and get the "can't add 0, tuple" Exception and write a loop. I suspect that many of the users who get the "use str.join" warning along with the Exception are unaware of the start argument, too. They expect sum(iter_of_str) to magically add the strings. Ie, when in 3.2 they got the uninformative "can't add 0, str" message, they did not immediately go "d'oh" and insert ", start=''" in the call to sum, they wrote a loop.
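Both behaviours described above are easy to check (the exact exception wording varies across 3.x versions, so only its gist is assumed here):

```python
# With the default start=0, sum() fails on the first addition, 0 + (1,):
error = ""
try:
    sum([(1,), (2,)])
except TypeError as exc:
    error = str(exc)

# With an explicit empty-tuple start it works -- but every addition
# builds a fresh tuple, so the total cost grows quadratically with the
# number of elements, just as Stephen suspects:
combined = sum([(1,), (2,), (3,)], ())
```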
> while we are at it, having the default sum() for floats be fsum() > would be nice How do you propose to implement that, given math.fsum is perfectly happy to sum integers? You can't just check one or a few leading elements for floatiness. I think you have to dispatch on type(start), but then sum(iter_of_floats) DTWT. So I would suggest changing the signature to sum(it, start=0.0). This would probably be acceptable to most users with iterables of ints, but does imply some performance hit. > This does turn sum() into a function that does type-based dispatch, > but isn't python full of those already? do something special for > the types you know about, call the generic dunder method for the > rest. AFAIK Python is moving in the opposite direction: if there's a common need for dispatching to type-specific implementations of a method, define a standard (not "generic") dunder for the purpose, and have the builtin (or operator, or whatever) look up (not "call") the appropriate instance in the usual way, then call it. If there's a useful generic implementation, define an ABC to inherit from that provides that generic implementation. From ncoghlan at gmail.com Wed Aug 13 10:34:58 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 13 Aug 2014 18:34:58 +1000 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: <20140812121541.GG4525@ando> References: <20140811230800.GA12210@gensokyo> <20140812121541.GG4525@ando> Message-ID: On 12 August 2014 22:15, Steven D'Aprano wrote: > Compare the natural way of writing this: > > with open("spam") as spam, open("eggs", "w") as eggs, frobulate("cheese") as cheese: > # do stuff with spam, eggs, cheese > > versus the dynamic way: > > with ExitStack() as stack: > spam, eggs = [stack.enter_context(open(fname), mode) for fname, mode in > zip(("spam", "eggs"), ("r", "w")] > cheese = stack.enter_context(frobulate("cheese")) > # do stuff with spam, eggs, cheese You wouldn't necessarily switch at three. 
At only three, you have lots of options, including multiple nested with statements: with open("spam") as spam: with open("eggs", "w") as eggs: with frobulate("cheese") as cheese: # do stuff with spam, eggs, cheese The "multiple context managers in one with statement" form is there *solely* to save indentation levels, and overuse can often be a sign that you may have a custom context manager trying to get out: @contextlib.contextmanager def dish(spam_file, egg_file, topping): with open(spam_file) as spam, open(egg_file, 'w') as eggs, frobulate(topping) as cheese: yield spam, eggs, cheese with dish("spam", "eggs", "cheese") as (spam, eggs, cheese): # do stuff with spam, eggs & cheese ExitStack is mostly useful as a tool for writing flexible custom context managers, and for dealing with context managers in cases where lexical scoping doesn't necessarily work, rather than being something you'd regularly use for inline code. "Why do I have so many contexts open at once in this function?" is a question developers should ask themselves in the same way it's worth asking "why do I have so many local variables in this function?" Regards, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ijmorlan at uwaterloo.ca Wed Aug 13 15:11:15 2014 From: ijmorlan at uwaterloo.ca (Isaac Morland) Date: Wed, 13 Aug 2014 09:11:15 -0400 (EDT) Subject: [Python-Dev] Reviving restricted mode? In-Reply-To: References: <200902231657.52201.victor.stinner@haypocalc.com> Message-ID: On Mon, 11 Aug 2014, Skip Montanaro wrote: > On Mon, Aug 11, 2014 at 12:42 PM, matsjoyce wrote: >> There may be some holes in my approach, but I can't find them. > > There's the rub. Given time, I suspect someone will discover a hole or two. Schneier's Law: Any person can invent a security system so clever that she or he can't think of how to break it. While I would not claim a Python sandbox is utterly impossible, I'm suspicious that the whole "consenting adults" approach in Python is incompatible with a sandbox.
The whole idea of a sandbox is to absolutely prevent people from doing things even if they really want to and know what they are doing. Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist From 4kir4.1i at gmail.com Wed Aug 13 17:47:18 2014 From: 4kir4.1i at gmail.com (Akira Li) Date: Wed, 13 Aug 2014 19:47:18 +0400 Subject: [Python-Dev] Multiline with statement line continuation References: <20140811230800.GA12210@gensokyo> <20140812121541.GG4525@ando> Message-ID: <87sil08k8p.fsf@gmail.com> Nick Coghlan writes: > On 12 August 2014 22:15, Steven D'Aprano wrote: >> Compare the natural way of writing this: >> >> with open("spam") as spam, open("eggs", "w") as eggs, frobulate("cheese") as cheese: >> # do stuff with spam, eggs, cheese >> >> versus the dynamic way: >> >> with ExitStack() as stack: >> spam, eggs = [stack.enter_context(open(fname), mode) for fname, mode in >> zip(("spam", "eggs"), ("r", "w")] >> cheese = stack.enter_context(frobulate("cheese")) >> # do stuff with spam, eggs, cheese > > You wouldn't necessarily switch at three. 
At only three, you have lots > of options, including multiple nested with statements: > > with open("spam") as spam: > with open("eggs", "w") as eggs: > with frobulate("cheese") as cheese: > # do stuff with spam, eggs, cheese > > The "multiple context managers in one with statement" form is there > *solely* to save indentation levels, and overuse can often be a sign > that you may have a custom context manager trying to get out: > > @contextlib.contextmanager > def dish(spam_file, egg_file, topping): > with open(spam_file), open(egg_file, 'w'), frobulate(topping): > yield > > with dish("spam", "eggs", "cheese") as spam, eggs, cheese: > # do stuff with spam, eggs & cheese > > ExitStack is mostly useful as a tool for writing flexible custom > context managers, and for dealing with context managers in cases where > lexical scoping doesn't necessarily work, rather than being something > you'd regularly use for inline code. > > "Why do I have so many contexts open at once in this function?" is a > question developers should ask themselves in the same way its worth > asking "why do I have so many local variables in this function?" A multiline with-statement can be useful even with *two* context managers. Two is not many. Saving indentation levels alone is a worthy goal. It can affect readability and the perceived complexity of the code. Here's how I'd like the code to look: with (open('input filename') as input_file, open('output filename', 'w') as output_file): # code with list comprehensions to transform input file into output file Even one additional unnecessary indentation level may force one to split list comprehensions into several lines (less readable) and/or use shorter names (less readable). Or it may force one to move the inline code into a separate named function prematurely, solely to preserve the indentation level (also may be less readable) i.e., with ... as input_file: with ... as output_file: ... #XXX indentation level is lost for no reason with ...
as infile, ... as outfile: #XXX shorter names ... with ... as input_file: with ... as output_file: transform(input_file, output_file) #XXX unnecessary function And (nested() can be implemented using ExitStack): with nested(open(..), open(..)) as (input_file, output_file): ... #XXX less readable Here's an example where nested() won't help: def get_integers(filename): with (open(filename, 'rb', 0) as file, mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as mmapped_file): for match in re.finditer(br'\d+', mmapped_file): yield int(match.group()) Here's another: with (open('log'+'some expression that generates filename', 'a') as logfile, redirect_stdout(logfile)): ... -- Akira From matsjoyce at gmail.com Wed Aug 13 18:19:14 2014 From: matsjoyce at gmail.com (matsjoyce) Date: Wed, 13 Aug 2014 16:19:14 +0000 (UTC) Subject: [Python-Dev] Reviving restricted mode? References: <200902231657.52201.victor.stinner@haypocalc.com> Message-ID: Unless you remove all the things labelled "keep away from children". I wrote this sandbox to allow python to be used as a "mods"/"add-ons" language for a game I'm writing, hence the perhaps too strict nature. About the crashers: as this is for games, its "fine" for the game to crash, as long as the sandbox is not broken while crashing. time and math can probably be allowed, but random imports a lot of undesirable modules. My sandbox doesn't use proxies, due to the introspection and complexity that it involves. Instead it completely isolates the sandboxed globals, and checks all arguments and globals for irregularities before passing control to non- sandboxed functions. From rosuav at gmail.com Wed Aug 13 18:26:29 2014 From: rosuav at gmail.com (Chris Angelico) Date: Thu, 14 Aug 2014 02:26:29 +1000 Subject: [Python-Dev] Reviving restricted mode? 
In-Reply-To: References: <200902231657.52201.victor.stinner@haypocalc.com> Message-ID: On Wed, Aug 13, 2014 at 11:11 PM, Isaac Morland wrote: > While I would not claim a Python sandbox is utterly impossible, I'm > suspicious that the whole "consenting adults" approach in Python is > incompatible with a sandbox. The whole idea of a sandbox is to absolutely > prevent people from doing things even if they really want to and know what > they are doing. It's certainly not *fundamentally* impossible to sandbox Python. However, the question becomes one of how much effort you're going to go to and how much you're going to restrict the code. I think I remember reading about something that's like ast.literal_eval, but allows name references; with that, plus some tiny features of assignment, you could make a fairly straight-forward evaluator that lets you work comfortably with numbers, strings, lists, dicts, etc. That could be pretty useful - but it wouldn't so much be "Python in a sandbox" as "an expression evaluator that uses a severely restricted set of Python syntax". If you start with all of Python and then start cutting out the dangerous bits, you're doomed to miss something, and your sandbox is broken. If you start with nothing and then start adding functionality, you're looking at a gigantic job before it becomes anything that you could call an applications language. So while it's theoretically possible (I think - certainly I can't say for sure that it's impossible), it's fairly impractical. I've had my own try at it, and failed quite badly (fortunately noisily and at a sufficiently early stage of development to shift). ChrisA From matsjoyce at gmail.com Wed Aug 13 18:17:13 2014 From: matsjoyce at gmail.com (matsjoyce) Date: Wed, 13 Aug 2014 17:17:13 +0100 Subject: [Python-Dev] Reviving restricted mode? In-Reply-To: References: <200902231657.52201.victor.stinner@haypocalc.com> Message-ID: Unless you remove all the things labelled "keep away from children". 
I wrote this sandbox to allow python to be used as a "mods"/"add-ons" language for a game I'm writing, hence the perhaps too strict nature. About the crashers: as this is for games, its "fine" for the game to crash, as long as the sandbox is not broken while crashing. time and math can probably be allowed, but random imports a lot of undesirable modules. My sandbox doesn't use proxies, due to the introspection and complexity that it involves. Instead it completely isolates the sandboxed globals, and checks all arguments and globals for irregularities before passing control to non-sandboxed functions. On 13 August 2014 14:11, Isaac Morland wrote: > On Mon, 11 Aug 2014, Skip Montanaro wrote: > > On Mon, Aug 11, 2014 at 12:42 PM, matsjoyce wrote: >> >>> There maybe some holes in my approach, but I can't find them. >>> >> >> There's the rub. Given time, I suspect someone will discover a hole or >> two. >> > > Schneier's Law: > > Any person can invent a security system so clever that she or he > can't > think of how to break it. > > While I would not claim a Python sandbox is utterly impossible, I'm > suspicious that the whole "consenting adults" approach in Python is > incompatible with a sandbox. The whole idea of a sandbox is to absolutely > prevent people from doing things even if they really want to and know what > they are doing. > > Isaac Morland CSCF Web Guru > DC 2554C, x36650 WWW Software Specialist > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ronaldoussoren at mac.com Wed Aug 13 16:32:13 2014 From: ronaldoussoren at mac.com (Ronald Oussoren) Date: Wed, 13 Aug 2014 16:32:13 +0200 Subject: [Python-Dev] sum(...) 
limitation In-Reply-To: References: <20140802055738.GA6053@gensokyo> <20140802073912.GI4525@ando> <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> <-2448384566377912251@unknownmsgid> <2076096455819154683@unknownmsgid> Message-ID: On 12 Aug 2014, at 10:02, Armin Rigo wrote: > Hi all, > > The core of the matter is that if we repeatedly __add__ strings from a > long list, we get O(n**2) behavior. From one point of view, the > reason is that the additions proceed in left-to-right order. Indeed, > sum() could proceed in a more balanced tree-like order: from [x0, x1, > x2, x3, ...], reduce the list to [x0+x1, x2+x3, ...]; then repeat > until there is only one item in the final list. This order ensures > that sum(list_of_strings) is at worst O(n log n). It might be in > practice close enough to linear to not matter. It also improves a > lot the precision of sum(list_of_floats) (though not reaching the same > precision levels of math.fsum()). I wonder why nobody has mentioned previous year's discussion of the same issue yet: http://marc.info/?l=python-ideas&m=137359619831497&w=2 Maybe someone can write a PEP about this that can be pointed to when the question is discussed again next summer ;-) Ronald From steve at pearwood.info Wed Aug 13 18:58:39 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Thu, 14 Aug 2014 02:58:39 +1000 Subject: [Python-Dev] Reviving restricted mode?
In-Reply-To: References: <200902231657.52201.victor.stinner@haypocalc.com> Message-ID: <20140813165839.GJ4525@ando> On Thu, Aug 14, 2014 at 02:26:29AM +1000, Chris Angelico wrote: > On Wed, Aug 13, 2014 at 11:11 PM, Isaac Morland wrote: > > While I would not claim a Python sandbox is utterly impossible, I'm > > suspicious that the whole "consenting adults" approach in Python is > > incompatible with a sandbox. The whole idea of a sandbox is to absolutely > > prevent people from doing things even if they really want to and know what > > they are doing. The point of a sandbox is that I, the consenting adult writing the application in the first place, may want to allow *untrusted others* to call Python code without giving them control of the entire application. The consenting adults rule applies to me, the application writer, not them, the end-users, even if they happen to be writing Python code. If they want unrestricted access to the Python interpreter, they can run their code on their own machine, not mine. > It's certainly not *fundamentally* impossible to sandbox Python. > However, the question becomes one of how much effort you're going to > go to and how much you're going to restrict the code. I believe that PyPy has an effective sandbox, but to what degree of effectiveness I don't know. http://pypy.readthedocs.org/en/latest/sandbox.html I've had rogue Javascript crash my browser or make my entire computer effectively unusable often enough that I am skeptical about claims that Javascript in the browser is effectively sandboxed, so I'm doubly cautious about Python. -- Steven From rosuav at gmail.com Wed Aug 13 19:06:01 2014 From: rosuav at gmail.com (Chris Angelico) Date: Thu, 14 Aug 2014 03:06:01 +1000 Subject: [Python-Dev] Reviving restricted mode? 
In-Reply-To: <20140813165839.GJ4525@ando> References: <200902231657.52201.victor.stinner@haypocalc.com> <20140813165839.GJ4525@ando> Message-ID: On Thu, Aug 14, 2014 at 2:58 AM, Steven D'Aprano wrote: >> It's certainly not *fundamentally* impossible to sandbox Python. >> However, the question becomes one of how much effort you're going to >> go to and how much you're going to restrict the code. > > I believe that PyPy has an effective sandbox, but to what degree of > effectiveness I don't know. """ A potential attacker can have arbitrary code run in the subprocess, but cannot actually do any input/output not controlled by the outer process. Additional barriers are put to limit the amount of RAM and CPU time used. Note that this is very different from sandboxing at the Python language level, i.e. placing restrictions on what kind of Python code the attacker is allowed to run (why? read about pysandbox). """ That's quite useful, but isn't the same thing as a Python-in-Python sandbox (or even what I was doing, Python-in-C++). 
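[The "expression evaluator that uses a severely restricted set of Python syntax" mentioned earlier in this thread might look something like the sketch below. The function name and the operator whitelist are purely illustrative; nothing like this exists in the stdlib, and it is a whitelist-from-nothing design rather than a full sandbox.]

```python
import ast
import operator

# Whitelisted binary operators; everything else is rejected.
_BINOPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def restricted_eval(expr, names):
    """Evaluate a tiny subset of Python: literals, names, arithmetic,
    lists and tuples.  Names resolve only via the supplied mapping."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant):       # numbers, strings, None
            return node.value
        if isinstance(node, ast.Name):           # plain variable lookup
            return names[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
            return _BINOPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, (ast.List, ast.Tuple)):
            elts = [ev(e) for e in node.elts]
            return elts if isinstance(node, ast.List) else tuple(elts)
        # Calls, attribute access, subscripts, lambdas, etc. are rejected
        # outright -- start from nothing, not from everything-minus-a-list.
        raise ValueError("disallowed syntax: %s" % type(node).__name__)
    return ev(ast.parse(expr, mode="eval"))

print(restricted_eval("a + 2 * b", {"a": 1, "b": 3}))  # 7
```

Starting from an empty whitelist like this sidesteps the "cut out the dangerous bits and hope you found them all" failure mode, at the cost of being an expression evaluator rather than Python.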
ChrisA From yoavglazner at gmail.com Wed Aug 13 19:08:51 2014 From: yoavglazner at gmail.com (yoav glazner) Date: Wed, 13 Aug 2014 20:08:51 +0300 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: <87sil08k8p.fsf@gmail.com> References: <20140811230800.GA12210@gensokyo> <20140812121541.GG4525@ando> <87sil08k8p.fsf@gmail.com> Message-ID: On Aug 13, 2014 7:04 PM, "Akira Li" <4kir4.1i at gmail.com> wrote: > > Nick Coghlan writes: > > > On 12 August 2014 22:15, Steven D'Aprano wrote: > >> Compare the natural way of writing this: > >> > >> with open("spam") as spam, open("eggs", "w") as eggs, frobulate("cheese") as cheese: > >> # do stuff with spam, eggs, cheese > >> > >> versus the dynamic way: > >> > >> with ExitStack() as stack: > >> spam, eggs = [stack.enter_context(open(fname), mode) for fname, mode in > >> zip(("spam", "eggs"), ("r", "w")] > >> cheese = stack.enter_context(frobulate("cheese")) > >> # do stuff with spam, eggs, cheese > > > > You wouldn't necessarily switch at three. 
At only three, you have lots > > of options, including multiple nested with statements: > > > > with open("spam") as spam: > > with open("eggs", "w") as eggs: > > with frobulate("cheese") as cheese: > > # do stuff with spam, eggs, cheese > > > > The "multiple context managers in one with statement" form is there > > *solely* to save indentation levels, and overuse can often be a sign > > that you may have a custom context manager trying to get out: > > > > @contextlib.contextmanager > > def dish(spam_file, egg_file, topping): > > with open(spam_file), open(egg_file, 'w'), frobulate(topping): > > yield > > > > with dish("spam", "eggs", "cheese") as spam, eggs, cheese: > > # do stuff with spam, eggs & cheese > > > > ExitStack is mostly useful as a tool for writing flexible custom > > context managers, and for dealing with context managers in cases where > > lexical scoping doesn't necessarily work, rather than being something > > you'd regularly use for inline code. > > > > "Why do I have so many contexts open at once in this function?" is a > > question developers should ask themselves in the same way its worth > > asking "why do I have so many local variables in this function?" > > Multiline with-statement can be useful even with *two* context > managers. Two is not many. > > Saving indentations levels along is a worthy goal. It can affect > readability and the perceived complexity of the code. > > Here's how I'd like the code to look like: > > with (open('input filename') as input_file, > open('output filename', 'w') as output_file): > # code with list comprehensions to transform input file into output file > > Even one additional unnecessary indentation level may force to split > list comprehensions into several lines (less readable) and/or use > shorter names (less readable). Or it may force to move the inline code > into a separate named function prematurely, solely to preserve the > indentation level (also may be less readable) i.e., > > with ... 
as input_file: > with ... as output_file: > ... #XXX indentation level is lost for no reason > > with ... as infile, ... as outfile: #XXX shorter names > ... > > with ... as input_file: > with ... as output_file: > transform(input_file, output_file) #XXX unnecessary function > > And (nested() can be implemented using ExitStack): > > with nested(open(..), > open(..)) as (input_file, output_file): > ... #XXX less readable > > Here's an example where nested() won't help: > > def get_integers(filename): > with (open(filename, 'rb', 0) as file, > mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as mmapped_file): > for match in re.finditer(br'\d+', mmapped_file): > yield int(match.group()) > > Here's another: > > with (open('log'+'some expression that generates filename', 'a') as logfile, > redirect_stdout(logfile)): > ... > Just a thought, would it bit wierd that: with (a as b, c as d): "works" with (a, c): "boom" with(a as b, c): ? > > -- > Akira > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/yoavglazner%40gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From ijmorlan at uwaterloo.ca Wed Aug 13 19:11:23 2014 From: ijmorlan at uwaterloo.ca (Isaac Morland) Date: Wed, 13 Aug 2014 13:11:23 -0400 (EDT) Subject: [Python-Dev] Reviving restricted mode? In-Reply-To: <20140813165839.GJ4525@ando> References: <200902231657.52201.victor.stinner@haypocalc.com> <20140813165839.GJ4525@ando> Message-ID: On Thu, 14 Aug 2014, Steven D'Aprano wrote: > On Thu, Aug 14, 2014 at 02:26:29AM +1000, Chris Angelico wrote: >> On Wed, Aug 13, 2014 at 11:11 PM, Isaac Morland wrote: >>> While I would not claim a Python sandbox is utterly impossible, I'm >>> suspicious that the whole "consenting adults" approach in Python is >>> incompatible with a sandbox. 
The whole idea of a sandbox is to absolutely >>> prevent people from doing things even if they really want to and know what >>> they are doing. > > The point of a sandbox is that I, the consenting adult writing the > application in the first place, may want to allow *untrusted others* to > call Python code without giving them control of the entire application. > The consenting adults rule applies to me, the application writer, not > them, the end-users, even if they happen to be writing Python code. If > they want unrestricted access to the Python interpreter, they can run > their code on their own machine, not mine. Yes, absolutely, and I didn't mean to contradict what you are saying. What I am suggesting is that the basic design of Python isn't a good starting point for imposing mandatory restrictions on what code can do. By contrast, take something like Safe Haskell. I'm not absolutely certain that it really is safe as promised, but it's starting from a very different language in which the compiler performs extremely sophisticated type checking and simply won't compile programs that don't work within the type system. This isn't a knock on Python (which I love using, by the way), just being realistic about what the existing language is likely to be able to support. Having said that, I'll be very interested if somebody does come up with a restricted mode Python that is widely accepted as being secure - that would be a real achievement. Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist From steve at pearwood.info Wed Aug 13 19:32:26 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Thu, 14 Aug 2014 03:32:26 +1000 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: References: <20140811230800.GA12210@gensokyo> <20140812121541.GG4525@ando> <87sil08k8p.fsf@gmail.com> Message-ID: <20140813173225.GL4525@ando> On Wed, Aug 13, 2014 at 08:08:51PM +0300, yoav glazner wrote: [...] 
> Just a thought, would it bit wierd that: > with (a as b, c as d): "works" > with (a, c): "boom" > with(a as b, c): ? If this proposal is accepted, there is no need for the "boom". The syntax should allow: # Without parens, limited to a single line. with a [as name], b [as name], c [as name], ...: block # With parens, not limited to a single line. with (a [as name], b [as name], c [as name], ... ): block where the "as name" part is always optional. In both these cases, whether there are parens or not, it will be interpreted as a series of context managers and never as a single tuple. Note two things: (1) this means that even in the unlikely event that tuples become context managers in the future, you won't be able to use a tuple literal: with (1, 2, 3): # won't work as expected t = (1, 2, 3) with t: # will work as expected But I cannot imagine any circumstances where tuples will become context managers. (2) Also note that *this is already the case*, since tuples are made by the commas, not the parentheses. E.g. this succeeds: # Not a tuple, actually two context managers. with open("/tmp/foo"), open("/tmp/bar", "w"): pass -- Steven From tjreedy at udel.edu Wed Aug 13 20:11:07 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Wed, 13 Aug 2014 14:11:07 -0400 Subject: [Python-Dev] Reviving restricted mode? In-Reply-To: References: <200902231657.52201.victor.stinner@haypocalc.com> Message-ID: On 8/13/2014 12:19 PM, matsjoyce wrote: > Unless you remove all the things labelled "keep away from children". I wrote > this sandbox to allow python to be used as a "mods"/"add-ons" language for a > game I'm writing, hence the perhaps too strict nature. > > About the crashers: as this is for games, its "fine" for the game to crash, > as long as the sandbox is not broken while crashing. > > time and math can probably be allowed, but random imports a lot of > undesirable modules. > > My sandbox doesn't use proxies, due to the introspection and complexity that > it involves. 
Instead it completely isolates the sandboxed globals, and checks > all arguments and globals for irregularities before passing control to non-sandboxed functions. pydev is mainly for discussion of maintaining current versions and development of the next, and for discussion of PEPs which might apply to the one after next. This discussion should be on python-list or perhaps python-ideas if there is a semi-concrete proposal for a future Python. -- Terry Jan Reedy From victor.stinner at gmail.com Wed Aug 13 23:25:43 2014 From: victor.stinner at gmail.com (Victor Stinner) Date: Wed, 13 Aug 2014 23:25:43 +0200 Subject: [Python-Dev] Reviving restricted mode? In-Reply-To: <20140813165839.GJ4525@ando> References: <200902231657.52201.victor.stinner@haypocalc.com> <20140813165839.GJ4525@ando> Message-ID: Hi, I heard that the PyPy sandbox cannot be used out of the box. You have to write a policy to allow syscalls. The complexity is moved to this policy which is very hard to write, especially if you only use whitelists. Correct me if I'm wrong. To be honest, I have never taken a look at this sandbox. Victor On Wednesday 13 August 2014, Steven D'Aprano wrote: > On Thu, Aug 14, 2014 at 02:26:29AM +1000, Chris Angelico wrote: > > On Wed, Aug 13, 2014 at 11:11 PM, Isaac Morland > wrote: > > > While I would not claim a Python sandbox is utterly impossible, I'm > > > suspicious that the whole "consenting adults" approach in Python is > > > incompatible with a sandbox. The whole idea of a sandbox is to > absolutely > > > prevent people from doing things even if they really want to and know > what > > > they are doing. > > The point of a sandbox is that I, the consenting adult writing the > application in the first place, may want to allow *untrusted others* to > call Python code without giving them control of the entire application. > The consenting adults rule applies to me, the application writer, not > them, the end-users, even if they happen to be writing Python code.
If > they want unrestricted access to the Python interpreter, they can run > their code on their own machine, not mine. > > > > It's certainly not *fundamentally* impossible to sandbox Python. > > However, the question becomes one of how much effort you're going to > > go to and how much you're going to restrict the code. > > I believe that PyPy has an effective sandbox, but to what degree of > effectiveness I don't know. > > http://pypy.readthedocs.org/en/latest/sandbox.html > > I've had rogue Javascript crash my browser or make my entire computer > effectively unusable often enough that I am skeptical about claims that > Javascript in the browser is effectively sandboxed, so I'm doubly > cautious about Python. > > > -- > Steven > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > https://mail.python.org/mailman/options/python-dev/victor.stinner%40gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Thu Aug 14 02:10:34 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 13 Aug 2014 17:10:34 -0700 Subject: [Python-Dev] sum(...) limitation In-Reply-To: <87egwkj4eh.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20140802203513.GA10447@k2> <20140804181013.GO4525@ando> <53E4055D.2040305@stoneleaf.us> <53E51269.5030209@stoneleaf.us> <7414D373-F598-4805-9DE8-F9779D08FEE8@gmail.com> <53E571B8.7030103@stoneleaf.us> <87fvh6jg1y.fsf@uwakimon.sk.tsukuba.ac.jp> <874mxkkb0f.fsf@uwakimon.sk.tsukuba.ac.jp> <53E7CBA4.40105@g.nevcal.com> <87wqafj3tb.fsf@uwakimon.sk.tsukuba.ac.jp> <-2448384566377912251@unknownmsgid> <2076096455819154683@unknownmsgid> <87ppg6icxu.fsf@uwakimon.sk.tsukuba.ac.jp> <53E991C9.7020404@stoneleaf.us> <87lhqui6la.fsf@uwakimon.sk.tsukuba.ac.jp> <87egwkj4eh.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Tue, Aug 12, 2014 at 11:21 PM, Stephen J. 
Turnbull wrote: > Redirecting to python-ideas, so trimming less than I might. reasonable enough -- you are introducing some more significant ideas for changes. I've said all I have to say about this -- I don't seem to see anything encouraging form core devs, so I guess that's it. Thanks for the fun bike-shedding... -Chris > Chris Barker writes: > > On Mon, Aug 11, 2014 at 11:07 PM, Stephen J. Turnbull < > stephen at xemacs.org> > > wrote: > > > > > I'm referring to removing the unnecessary information that there's a > > > better way to do it, and simply raising an error (as in Python 3.2, > > > say) which is all a RealProgrammer[tm] should ever need! > > > > > > > I can't imagine anyone is suggesting that -- disallow it, but don't tell > > anyone why? > > As I said, it's a regression. That's exactly the behavior in Python 3.2. > > > The only thing that is remotely on the table here is: > > > > 1) remove the special case for strings -- buyer beware -- but consistent > > and less "ugly" > > It's only consistent if you believe that Python has strict rules for > use of various operators. It doesn't, except as far as they are > constrained by precedence. For example, I have an application where I > add bytestrings bytewise modulo N <= 256, and concatenate them. In > fact I use function call syntax, but the obvious operator syntax is > '+' for the bytewise addition, and '*' for the concatenation. > > It's not in the Zen, but I believe in the maxim "If it's worth doing, > it's worth doing well." So for me, 1) is out anyway. > > > 2) add a special case for strings that is fast and efficient -- may be > as > > simple as calling "".join() under the hood --no more code than the > > exception check. > > Sure, but what about all the other immutable containers with __add__ > methods? What about mappings with key-wise __add__ methods whose > values might be immutable but have __add__ methods? Where do you stop > with the special-casing? 
I consider this far more complex and ugly > than the simple "sum() is for numbers" rule (and even that is way too > complex considering accuracy of summing floats). > > > And I doubt anyone really is pushing for anything but (2) > > I know that, but I think it's the wrong solution to the problem (which > is genuine IMO). The right solution is something generic, possibly a > __sum__ method. The question is whether that leads to too much work > to be worth it (eg, "homogeneous_iterable"). > > > > Because obviously we'd want the attractive nuisance of "if you > > > have __add__, there's a default definition of __sum__" > > > > now I'm confused -- isn't that exactly what we have now? > > Yes and my feeling (backed up by arguments that I admit may persuade > nobody but myself) is that what we have now kinda sucks[tm]. It > seemed like a good idea when I first saw it, but then, my apps don't > scale to where the pain starts in my own usage. > > > > It's possible that Python could provide some kind of feature that > > > would allow an optimized sum function for every type that has > > > __add__, but I think this will take a lot of thinking. > > > > does it need to be every type? As it is the common ones work fine > already > > except for strings -- so if we add an optimized string sum() then we're > > done. > > I didn't say provide an optimized sum(), I said provide a feature > enabling people who want to optimize sum() to do so. So yes, it needs > to be every type (the optional __sum__ method is a proof of concept, > modulo it actually being implementable ;-). > > > > *Somebody* will do it (I don't think anybody is +1 on restricting > > > sum() to a subset of types with __add__). > > > > uhm, that's exactly what we have now > > Exactly. Who's arguing that the sum() we have now is a ticket to > Paradise? 
I'm just saying that there's probably somebody out there > negative enough on the current situation to come up with an answer > that I think is general enough (and I suspect that python-dev > consensus is that demanding, too). > > > sum() can be used for any type that has an __add__ defined. > > I'd like to see that be mutable types with __iadd__. > > > What I fail to see is why it's better to raise an exception and > > point users to a better way, than to simply provide an optimization > > so that it's a mute issue. > > Because inefficient sum() is an attractive nuisance, easy to overlook, > and likely to bite users other than the author. > > > The only justification offered here is that will teach people that > summing > > strings (and some other objects?) > > Summing tuples works (with appropriate start=tuple()). Haven't > benchmarked, but I bet that's O(N^2). > > > is order(N^2) and a bad idea. But: > > > > a) Python's primary purpose is practical, not pedagogical (not that it > > isn't great for that) > > My argument is that in practical use sum() is a bad idea, period, > until you book up on the types and applications where it *does* work. > N.B. It doesn't even work properly for numbers (inaccurate for floats). > > > b) I doubt any naive users learn anything other than "I can't use sum() > for > > strings, I should use "".join()". > > For people who think that special-casing strings is a good idea, I > think this is about as much benefit as you can expect. Why go > farther?<0.5 wink/> > > > I submit that no naive user is going to get any closer to a proper > > understanding of algorithmic Order behavior from this small hint. Which > > leaves no reason to prefer an Exception to an optimization. > > TOOWTDI. str.join is in pretty much every code base by now, and > tutorials and FAQs recommending its user and severely deprecating sum > for strings are legion. 
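[As an aside, two numeric claims made in this exchange, that sum() "doesn't even work properly for numbers (inaccurate for floats)" and that "math.fsum is perfectly happy to sum integers", are both easy to verify:]

```python
import math

values = [0.1] * 10

# Left-to-right binary addition accumulates rounding error:
assert sum(values) != 1.0          # 0.9999999999999999 with IEEE-754 doubles

# math.fsum tracks exact partial sums and rounds only once:
assert math.fsum(values) == 1.0

# And fsum happily consumes integers -- it just always returns a float,
# which is why per-element type dispatch for sum() is not straightforward:
assert math.fsum([1, 2, 3]) == 6.0
assert isinstance(math.fsum([1, 2, 3]), float)
```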
> > > One other point: perhaps this will lead a naive user into thinking -- > > "sum() raises an exception if I try to use it inefficiently, so it must > be > > OK to use for anything that doesn't raise an exception" -- that would > be a > > bad lesson to mis-learn.... > > That assumes they know about the start argument. I think most naive > users will just try to sum a bunch of tuples, and get the "can't add > 0, tuple" Exception and write a loop. I suspect that many of the > users who get the "use str.join" warning along with the Exception are > unaware of the start argument, too. They expect sum(iter_of_str) to > magically add the strings. Ie, when in 3.2 they got the > uninformative "can't add 0, str" message, they did not immediately go > "d'oh" and insert ", start=''" in the call to sum, they wrote a loop. > > > while we are at it, having the default sum() for floats be fsum() > > would be nice > > How do you propose to implement that, given math.fsum is perfectly > happy to sum integers? You can't just check one or a few leading > elements for floatiness. I think you have to dispatch on type(start), > but then sum(iter_of_floats) DTWT. So I would suggest changing the > signature to sum(it, start=0.0). This would probably be acceptable to > most users with iterables of ints, but does imply some performance hit. > > > This does turn sum() into a function that does type-based dispatch, > > but isn't python full of those already? do something special for > > the types you know about, call the generic dunder method for the > > rest. > > AFAIK Python is moving in the opposite direction: if there's a common > need for dispatching to type-specific implementations of a method, > define a standard (not "generic") dunder for the purpose, and have the > builtin (or operator, or whatever) look up (not "call") the > appropriate instance in the usual way, then call it. 
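The dunder-based dispatch described above can be illustrated with a toy implementation; note that ``__sum__`` and ``generic_sum`` are purely speculative names from this discussion, not an existing Python protocol:

```python
def generic_sum(iterable, start=0):
    # Toy version of a dunder-dispatching sum(): look up a
    # type-specific __sum__ on the first element's type, and fall
    # back to plain repeated addition when none is defined.
    it = iter(iterable)
    try:
        first = next(it)
    except StopIteration:
        return start
    specialised = getattr(type(first), "__sum__", None)
    if specialised is not None:
        return specialised(first, it, start)
    total = start + first
    for item in it:
        total = total + item
    return total

assert generic_sum([1, 2, 3]) == 6
assert generic_sum([]) == 0
```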
If there's a > useful generic implementation, define an ABC to inherit from that > provides that generic implementation. > > -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From storchaka at gmail.com Thu Aug 14 07:46:50 2014 From: storchaka at gmail.com (Serhiy Storchaka) Date: Thu, 14 Aug 2014 08:46:50 +0300 Subject: [Python-Dev] Documenting enum types Message-ID: Should new enum types added recently to collect module constants be documented at all? For example AddressFamily is absent in socket.__all__ [1]. [1] http://bugs.python.org/issue20689 From ncoghlan at gmail.com Thu Aug 14 09:48:58 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 14 Aug 2014 17:48:58 +1000 Subject: [Python-Dev] Reviving restricted mode? In-Reply-To: References: <200902231657.52201.victor.stinner@haypocalc.com> <20140813165839.GJ4525@ando> Message-ID: On 14 August 2014 07:25, Victor Stinner wrote: > Hi, > > I heard that PyPy sandbox cannot be used out of the box. You have to write a > policy to allow syscalls. The complexity is moved to this policy which is > very hard to write, especially if you only use whitelists. > > Correct me if I'm wrong. To be honest, I never take a look at this sandbox. By default, the PyPy sandbox requires all system access to be proxied through the host application (which is running in a separate process). Similarly, using "sandbox" on Fedora (et al) will get you a default deny OS level sandbox, where you have to provide selective access to things outside the box. The effective decision taken when rexec and Bastion were removed from the standard library was "sandboxing is hard enough for operating systems to get right, we're not going to try to tackle the even harder problem of an in-process sandbox". 
"Deny all" sandboxes are relatively easy, but also relatively useless. It's "allow these activities, but no others" that's difficult, since any kind of access can often be leveraged into greater access than was intended. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From victor.stinner at gmail.com Thu Aug 14 11:25:06 2014 From: victor.stinner at gmail.com (Victor Stinner) Date: Thu, 14 Aug 2014 11:25:06 +0200 Subject: [Python-Dev] Documenting enum types In-Reply-To: References: Message-ID: Hi, IMO we should not document enum types because Python implementations other than CPython may want to implement them differently (ex: not all Python implementations have an enum module currently). By experience, exposing too many things in the public API becomes a problem later when you want to modify the code. Victor On 14 August 2014 at 07:47, "Serhiy Storchaka" wrote: > Should new enum types added recently to collect module constants be > documented at all? For example AddressFamily is absent in socket.__all__ > [1]. > > [1] http://bugs.python.org/issue20689 > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ > victor.stinner%40gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL:
By experience, exposing too > many things in the public API becomes a problem later when you want to > modify the code. Implementations claiming conformance with Python 3.4 will have to have an enum module - there just aren't any of those other than CPython at this point (I expect PyPy3 will catch up before too long, since the changes between 3.2 and 3.4 shouldn't be too dramatic from an implementation perspective). In this particular case, though, I think the relevant question is "Why are they enums?" and the answer is "for the better representations". I'm not clear on the use case for exposing and documenting the enum types themselves (although I don't have any real objection either). Regards, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From guido at python.org Thu Aug 14 17:42:00 2014 From: guido at python.org (Guido van Rossum) Date: Thu, 14 Aug 2014 08:42:00 -0700 Subject: [Python-Dev] Documenting enum types In-Reply-To: References: Message-ID: The enemy must be documented and exported, since users will encounter them. On Aug 14, 2014 4:54 AM, "Nick Coghlan" wrote: > On 14 August 2014 19:25, Victor Stinner wrote: > > Hi, > > > > IMO we should not document enum types because Python implementations > other > > than CPython may want to implement them differently (ex: not all Python > > implementations have an enum module currently). By experience, exposing > too > > many things in the public API becomes a problem later when you want to > > modify the code. > > Implementations claiming conformance with Python 3.4 will have to have > an enum module - there just aren't any of those other than CPython at > this point (I expect PyPy3 will catch up before too long, since the > changes between 3.2 and 3.4 shouldn't be too dramatic from an > implementation perspective). > > In this particular case, though, I think the relevant question is "Why > are they enums?" and the answer is "for the better representations". 
> I'm not clear on the use case for exposing and documenting the enum > types themselves (although I don't have any real objection either). > > Regards, > Nick. > > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > https://mail.python.org/mailman/options/python-dev/guido%40python.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benhoyt at gmail.com Thu Aug 14 17:51:59 2014 From: benhoyt at gmail.com (Ben Hoyt) Date: Thu, 14 Aug 2014 11:51:59 -0400 Subject: [Python-Dev] Documenting enum types In-Reply-To: References: Message-ID: > The enemy must be documented and exported, since users will encounter them. enum == enemy? Is that you, Raymond? ;-) -Ben From ethan at stoneleaf.us Thu Aug 14 18:14:38 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Thu, 14 Aug 2014 09:14:38 -0700 Subject: [Python-Dev] Documenting enum types In-Reply-To: References: Message-ID: <53ECE06E.2080808@stoneleaf.us> On 08/14/2014 08:51 AM, Ben Hoyt wrote: >> The enemy must be documented and exported, since users will encounter them. > > enum == enemy? Is that you, Raymond? ;-) ROFL! Thanks, I needed that! :D -- ~Ethan~ From breamoreboy at yahoo.co.uk Thu Aug 14 19:24:45 2014 From: breamoreboy at yahoo.co.uk (Mark Lawrence) Date: Thu, 14 Aug 2014 18:24:45 +0100 Subject: [Python-Dev] Documenting enum types In-Reply-To: <53ECE06E.2080808@stoneleaf.us> References: <53ECE06E.2080808@stoneleaf.us> Message-ID: On 14/08/2014 17:14, Ethan Furman wrote: > On 08/14/2014 08:51 AM, Ben Hoyt wrote: The BDFL actually wrote:- >>> The enemy must be documented and exported, since users will encounter >>> them. QOTW. >> >> enum == enemy? Is that you, Raymond? ;-) > > ROFL! Thanks, I needed that! 
> > :D > > -- > ~Ethan~ I'll be seeing the PSF in court, on the grounds that I've just bust a gut laughing :) -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence From ncoghlan at gmail.com Fri Aug 15 07:50:25 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 15 Aug 2014 15:50:25 +1000 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray Message-ID: I just posted an updated version of PEP 467 after recently finishing the updates to the Python 3.4+ binary sequence docs to decouple them from the str docs. Key points in the proposal:

* deprecate passing integers to bytes() and bytearray()
* add bytes.zeros() and bytearray.zeros() as a replacement
* add bytes.byte() and bytearray.byte() as counterparts to ord() for binary data
* add bytes.iterbytes(), bytearray.iterbytes() and memoryview.iterbytes()

As far as I am aware, that last item poses the only open question, with the alternative being to add an "iterbytes" builtin with a definition along the lines of the following:

    def iterbytes(data):
        try:
            getiter = type(data).__iterbytes__
        except AttributeError:
            iter = map(bytes.byte, data)
        else:
            iter = getiter(data)
        return iter

Regards, Nick.

PEP URL: http://www.python.org/dev/peps/pep-0467/

Full PEP text:
=============================

PEP: 467
Title: Minor API improvements for bytes and bytearray
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-03-30
Python-Version: 3.5
Post-History: 2014-03-30 2014-08-15

Abstract
========

During the initial development of the Python 3 language specification, the core ``bytes`` type for arbitrary binary data started as the mutable type that is now referred to as ``bytearray``. Other aspects of operating in the binary domain in Python have also evolved over the course of the Python 3 series.
This PEP proposes a number of small adjustments to the APIs of the ``bytes`` and ``bytearray`` types to make it easier to operate entirely in the binary domain. Background ========== To simplify the task of writing the Python 3 documentation, the ``bytes`` and ``bytearray`` types were documented primarily in terms of the way they differed from the Unicode based Python 3 ``str`` type. Even when I `heavily revised the sequence documentation `__ in 2012, I retained that simplifying shortcut. However, it turns out that this approach to the documentation of these types had a problem: it doesn't adequately introduce users to their hybrid nature, where they can be manipulated *either* as a "sequence of integers" type, *or* as ``str``-like types that assume ASCII compatible data. That oversight has now been corrected, with the binary sequence types now being documented entirely independently of the ``str`` documentation in `Python 3.4+ `__ The confusion isn't just a documentation issue, however, as there are also some lingering design quirks from an earlier pre-release design where there was *no* separate ``bytearray`` type, and instead the core ``bytes`` type was mutable (with no immutable counterpart). Finally, additional experience with using the existing Python 3 binary sequence types in real world applications has suggested it would be beneficial to make it easier to convert integers to length 1 bytes objects. Proposals ========= As a "consistency improvement" proposal, this PEP is actually about a few smaller micro-proposals, each aimed at improving the usability of the binary data model in Python 3. 
Proposals are motivated by one of two main factors:

* removing remnants of the original design of ``bytes`` as a mutable type
* allowing users to easily convert integer values to a length 1 ``bytes`` object

Alternate Constructors
----------------------

The ``bytes`` and ``bytearray`` constructors currently accept an integer argument, but interpret it to mean a zero-filled object of the given length. This is a legacy of the original design of ``bytes`` as a mutable type, rather than a particularly intuitive behaviour for users. It has become especially confusing now that some other ``bytes`` interfaces treat integers and the corresponding length 1 bytes instances as equivalent input. Compare::

    >>> b"\x03" in bytes([1, 2, 3])
    True
    >>> 3 in bytes([1, 2, 3])
    True

    >>> bytes(b"\x03")
    b'\x03'
    >>> bytes(3)
    b'\x00\x00\x00'

This PEP proposes that the current handling of integers in the bytes and bytearray constructors be deprecated in Python 3.5 and targeted for removal in Python 3.7, being replaced by two more explicit alternate constructors provided as class methods. The initial python-ideas thread [ideas-thread1]_ that spawned this PEP was specifically aimed at deprecating this constructor behaviour.

Firstly, a ``byte`` constructor is proposed that converts integers in the range 0 to 255 (inclusive) to a ``bytes`` object::

    >>> bytes.byte(3)
    b'\x03'
    >>> bytearray.byte(3)
    bytearray(b'\x03')
    >>> bytes.byte(512)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: bytes must be in range(0, 256)

One specific use case for this alternate constructor is to easily convert the result of indexing operations on ``bytes`` and other binary sequences from an integer to a ``bytes`` object. The documentation for this API should note that its counterpart for the reverse conversion is ``ord()``.
The ``ord()`` documentation will also be updated to note that while ``chr()`` is the counterpart for ``str`` input, ``bytes.byte`` and ``bytearray.byte`` are the counterparts for binary input.

Secondly, a ``zeros`` constructor is proposed that serves as a direct replacement for the current constructor behaviour, rather than having to use sequence repetition to achieve the same effect in a less intuitive way::

    >>> bytes.zeros(3)
    b'\x00\x00\x00'
    >>> bytearray.zeros(3)
    bytearray(b'\x00\x00\x00')

The chosen name here is taken from the corresponding initialisation function in NumPy (although, as these are sequence types rather than N-dimensional matrices, the constructors take a length as input rather than a shape tuple).

While ``bytes.byte`` and ``bytearray.zeros`` are expected to be the more useful duo amongst the new constructors, ``bytes.zeros`` and ``bytearray.byte`` are provided in order to maintain API consistency between the two types.

Iteration
---------

While iteration over ``bytes`` objects and other binary sequences produces integers, it is sometimes desirable to iterate over length 1 bytes objects instead.
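Pending any of these additions, both proposed behaviours can be approximated with the constructors that already exist; the helper names below are hypothetical and not part of the PEP:

```python
def byte(n):
    # Rough stand-in for the proposed bytes.byte(): one integer in
    # range(256) becomes a length-1 bytes object.
    if not 0 <= n <= 255:
        raise ValueError("bytes must be in range(0, 256)")
    return bytes([n])

def iterbytes(data):
    # Rough stand-in for the proposed iterbytes(): yield length-1
    # bytes objects instead of integers, for any container of
    # integers in the range 0 to 255 inclusive.
    return map(byte, data)

data = b"abc"
assert byte(data[0]) == b"a"                        # indexing gives an int
assert list(iterbytes(data)) == [b"a", b"b", b"c"]
```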
To handle this situation more obviously (and more efficiently) than would be the case with the ``map(bytes.byte, data)`` construct enabled by the above constructor changes, this PEP proposes the addition of a new ``iterbytes`` method to ``bytes``, ``bytearray`` and ``memoryview``::

    for x in data.iterbytes():
        # x is a length 1 ``bytes`` object, rather than an integer

Third party types and arbitrary containers of integers that lack the new method can still be handled by combining ``map`` with the new ``bytes.byte()`` alternate constructor proposed above::

    for x in map(bytes.byte, data):
        # x is a length 1 ``bytes`` object, rather than an integer
        # This works with *any* container of integers in the range
        # 0 to 255 inclusive

Open questions
^^^^^^^^^^^^^^

* The fallback case above suggests that this could perhaps be better handled as an ``iterbytes(data)`` *builtin*, that used ``data.__iterbytes__()`` if defined, but otherwise fell back to ``map(bytes.byte, data)``::

      for x in iterbytes(data):
          # x is a length 1 ``bytes`` object, rather than an integer
          # This works with *any* container of integers in the range
          # 0 to 255 inclusive

References
==========

.. [ideas-thread1] https://mail.python.org/pipermail/python-ideas/2014-March/027295.html
.. [empty-buffer-issue] http://bugs.python.org/issue20895
.. [GvR-initial-feedback] https://mail.python.org/pipermail/python-ideas/2014-March/027376.html

Copyright
=========

This document has been placed in the public domain.

-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From status at bugs.python.org Fri Aug 15 18:07:43 2014 From: status at bugs.python.org (Python tracker) Date: Fri, 15 Aug 2014 18:07:43 +0200 (CEST) Subject: [Python-Dev] Summary of Python tracker Issues Message-ID: <20140815160743.5F87D56440@psf.upfronthosting.co.za> ACTIVITY SUMMARY (2014-08-08 - 2014-08-15) Python tracker at http://bugs.python.org/ To view or respond to any of the issues listed below, click on the issue.
Do NOT respond to this message. Issues counts and deltas: open 4602 ( +0) closed 29371 (+31) total 33973 (+31) Open issues with patches: 2175 Issues opened (23) ================== #21166: Bus error in pybuilddir.txt 'python -m sysconfigure --generate http://bugs.python.org/issue21166 reopened by ned.deily #22176: update internal libffi copy to 3.1, introducing AArch64 and PO http://bugs.python.org/issue22176 opened by doko #22177: Incorrect version reported after downgrade http://bugs.python.org/issue22177 opened by jpe5605 #22179: Focus stays on Search Dialog when text found in editor http://bugs.python.org/issue22179 opened by BreamoreBoy #22181: os.urandom() should use Linux 3.17 getrandom() syscall http://bugs.python.org/issue22181 opened by haypo #22182: distutils.file_util.move_file unpacks wrongly an exception http://bugs.python.org/issue22182 opened by Claudiu.Popa #22185: Occasional RuntimeError from Condition.notify http://bugs.python.org/issue22185 opened by dougz #22186: Typos in .py files http://bugs.python.org/issue22186 opened by iwontbecreative #22187: commands.mkarg() buggy in East Asian locales http://bugs.python.org/issue22187 opened by jwilk #22188: test_gdb fails on invalid gdbinit http://bugs.python.org/issue22188 opened by lekensteyn #22189: collections.UserString missing some str methods http://bugs.python.org/issue22189 opened by ncoghlan #22191: warnings.__all__ incomplete http://bugs.python.org/issue22191 opened by pitrou #22192: dict_values objects are hashable http://bugs.python.org/issue22192 opened by roippi #22193: Add _PySys_GetSizeOf() http://bugs.python.org/issue22193 opened by serhiy.storchaka #22194: access to cdecimal / libmpdec API http://bugs.python.org/issue22194 opened by pitrou #22195: Make it easy to replace print() calls with logging calls http://bugs.python.org/issue22195 opened by pitrou #22196: namedtuple documentation could/should mention the new Enum typ http://bugs.python.org/issue22196 opened by lelit #22197: 
Allow better verbosity / output control in test cases http://bugs.python.org/issue22197 opened by pitrou #22198: Odd floor-division corner case http://bugs.python.org/issue22198 opened by mark.dickinson #22199: 2.7 sysconfig._get_makefile_filename should be sysconfig.get_m http://bugs.python.org/issue22199 opened by jamercee #22200: Remove distutils checks for Python version http://bugs.python.org/issue22200 opened by takluyver #22201: python -mzipfile fails to unzip files with folders created by http://bugs.python.org/issue22201 opened by Antony.Lee #22203: inspect.getargspec() returns wrong spec for builtins http://bugs.python.org/issue22203 opened by suor Most recent 15 issues with no replies (15) ========================================== #22201: python -mzipfile fails to unzip files with folders created by http://bugs.python.org/issue22201 #22200: Remove distutils checks for Python version http://bugs.python.org/issue22200 #22197: Allow better verbosity / output control in test cases http://bugs.python.org/issue22197 #22196: namedtuple documentation could/should mention the new Enum typ http://bugs.python.org/issue22196 #22194: access to cdecimal / libmpdec API http://bugs.python.org/issue22194 #22189: collections.UserString missing some str methods http://bugs.python.org/issue22189 #22188: test_gdb fails on invalid gdbinit http://bugs.python.org/issue22188 #22181: os.urandom() should use Linux 3.17 getrandom() syscall http://bugs.python.org/issue22181 #22179: Focus stays on Search Dialog when text found in editor http://bugs.python.org/issue22179 #22173: Update lib2to3.tests and test_lib2to3 to use test discovery http://bugs.python.org/issue22173 #22164: cell object cleared too early? 
http://bugs.python.org/issue22164 #22163: max_wbits set incorrectly to -zlib.MAX_WBITS in tarfile, shoul http://bugs.python.org/issue22163 #22159: smtpd.PureProxy and smtpd.DebuggingServer do not work with dec http://bugs.python.org/issue22159 #22158: RFC 6531 (SMTPUTF8) support in smtpd.PureProxy http://bugs.python.org/issue22158 #22153: There is no standard TestCase.runTest implementation http://bugs.python.org/issue22153 Most recent 15 issues waiting for review (15) ============================================= #22200: Remove distutils checks for Python version http://bugs.python.org/issue22200 #22199: 2.7 sysconfig._get_makefile_filename should be sysconfig.get_m http://bugs.python.org/issue22199 #22193: Add _PySys_GetSizeOf() http://bugs.python.org/issue22193 #22186: Typos in .py files http://bugs.python.org/issue22186 #22185: Occasional RuntimeError from Condition.notify http://bugs.python.org/issue22185 #22182: distutils.file_util.move_file unpacks wrongly an exception http://bugs.python.org/issue22182 #22173: Update lib2to3.tests and test_lib2to3 to use test discovery http://bugs.python.org/issue22173 #22166: test_codecs "leaking" references http://bugs.python.org/issue22166 #22165: Empty response from http.server when directory listing contain http://bugs.python.org/issue22165 #22163: max_wbits set incorrectly to -zlib.MAX_WBITS in tarfile, shoul http://bugs.python.org/issue22163 #22159: smtpd.PureProxy and smtpd.DebuggingServer do not work with dec http://bugs.python.org/issue22159 #22158: RFC 6531 (SMTPUTF8) support in smtpd.PureProxy http://bugs.python.org/issue22158 #22156: Fix compiler warnings http://bugs.python.org/issue22156 #22150: deprecated-removed directive is broken in Sphinx 1.2.2 http://bugs.python.org/issue22150 #22149: the frame of a suspended generator should not have a local tra http://bugs.python.org/issue22149 Top 10 most discussed issues (10) ================================= #19494: urllib2.HTTPBasicAuthHandler (or 
urllib.request.HTTPBasicAuthH http://bugs.python.org/issue19494 15 msgs #15381: Optimize BytesIO to do less reallocations when written, simil http://bugs.python.org/issue15381 10 msgs #22193: Add _PySys_GetSizeOf() http://bugs.python.org/issue22193 7 msgs #22118: urljoin fails with messy relative URLs http://bugs.python.org/issue22118 6 msgs #12954: Multiprocessing logging under Windows http://bugs.python.org/issue12954 5 msgs #18844: allow weights in random.choice http://bugs.python.org/issue18844 5 msgs #21448: Email Parser use 100% CPU http://bugs.python.org/issue21448 5 msgs #22177: Incorrect version reported after downgrade http://bugs.python.org/issue22177 5 msgs #22191: warnings.__all__ incomplete http://bugs.python.org/issue22191 5 msgs #22198: Odd floor-division corner case http://bugs.python.org/issue22198 5 msgs Issues closed (28) ================== #14105: Breakpoints in debug lost if line is inserted; IDLE http://bugs.python.org/issue14105 closed by terry.reedy #16773: int() half-accepts UserString http://bugs.python.org/issue16773 closed by serhiy.storchaka #17923: test glob with trailing slash fail on AIX 6.1 http://bugs.python.org/issue17923 closed by serhiy.storchaka #18004: test_list.test_overflow crashes Win64 http://bugs.python.org/issue18004 closed by serhiy.storchaka #19743: test_gdb failures http://bugs.python.org/issue19743 closed by pitrou #20101: Determine correct behavior for time functions on Windows http://bugs.python.org/issue20101 closed by haypo #20729: mailbox.Mailbox does odd hasattr() check http://bugs.python.org/issue20729 closed by serhiy.storchaka #20746: test_pdb fails in refleak mode http://bugs.python.org/issue20746 closed by pitrou #21121: -Werror=declaration-after-statement is added even for extensio http://bugs.python.org/issue21121 closed by python-dev #21412: Solaris/Oracle Studio: Fatal Python error: PyThreadState_Get w http://bugs.python.org/issue21412 closed by ned.deily #21445: Some asserts in test_filecmp have the 
wrong messages http://bugs.python.org/issue21445 closed by berker.peksag #21725: RFC 6531 (SMTPUTF8) support in smtpd http://bugs.python.org/issue21725 closed by r.david.murray #21777: Separate out documentation of binary sequence methods http://bugs.python.org/issue21777 closed by ncoghlan #22060: Clean up ctypes.test, use unittest test discovery http://bugs.python.org/issue22060 closed by python-dev #22065: Update turtledemo menu creation http://bugs.python.org/issue22065 closed by terry.reedy #22112: '_UnixSelectorEventLoop' object has no attribute 'create_task' http://bugs.python.org/issue22112 closed by haypo #22139: python windows 2.7.8 64-bit did not install http://bugs.python.org/issue22139 closed by loewis #22145: <> in parser spec but not lexer spec http://bugs.python.org/issue22145 closed by rhettinger #22161: Remove unsupported code from ctypes http://bugs.python.org/issue22161 closed by serhiy.storchaka #22174: property doc fixes http://bugs.python.org/issue22174 closed by rhettinger #22175: improve test_faulthandler readability with dedent http://bugs.python.org/issue22175 closed by python-dev #22178: _winreg.QueryInfoKey Last Modified Time Value Incorrect or Exp http://bugs.python.org/issue22178 closed by python-dev #22180: operator.setitem example no longer works in Python 3 due to la http://bugs.python.org/issue22180 closed by rhettinger #22183: datetime.timezone methods require datetime object http://bugs.python.org/issue22183 closed by belopolsky #22184: lrucache should reject maxsize as a function http://bugs.python.org/issue22184 closed by rhettinger #22190: Integrate tracemalloc into regrtest refleak hunting http://bugs.python.org/issue22190 closed by ncoghlan #22202: Function Bug? 
http://bugs.python.org/issue22202 closed by steven.daprano #22204: spam http://bugs.python.org/issue22204 closed by ezio.melotti From guido at python.org Fri Aug 15 19:48:58 2014 From: guido at python.org (Guido van Rossum) Date: Fri, 15 Aug 2014 10:48:58 -0700 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: Message-ID: This feels chatty. I'd like the PEP to call out the specific proposals and put the more verbose motivation later. It took me a long time to realize that you don't want to deprecate bytes([1, 2, 3]), but only bytes(3). Also your mention of bytes.byte() as the counterpart to ord() confused me -- I think it's more similar to chr(). I don't like iterbytes as a builtin, let's keep it as a method on affected types. On Thu, Aug 14, 2014 at 10:50 PM, Nick Coghlan wrote: > I just posted an updated version of PEP 467 after recently finishing > the updates to the Python 3.4+ binary sequence docs to decouple them > from the str docs. > > Key points in the proposal: > > * deprecate passing integers to bytes() and bytearray() > * add bytes.zeros() and bytearray.zeros() as a replacement > * add bytes.byte() and bytearray.byte() as counterparts to ord() for > binary data > * add bytes.iterbytes(), bytearray.iterbytes() and memoryview.iterbytes() > > As far as I am aware, that last item poses the only open question, > with the alternative being to add an "iterbytes" builtin with a > definition along the lines of the following: > > def iterbytes(data): > try: > getiter = type(data).__iterbytes__ > except AttributeError: > iter = map(bytes.byte, data) > else: > iter = getiter(data) > return iter > > Regards, > Nick. 
> > PEP URL: http://www.python.org/dev/peps/pep-0467/ > > Full PEP text: > ============================= > PEP: 467 > Title: Minor API improvements for bytes and bytearray > Version: $Revision$ > Last-Modified: $Date$ > Author: Nick Coghlan > Status: Draft > Type: Standards Track > Content-Type: text/x-rst > Created: 2014-03-30 > Python-Version: 3.5 > Post-History: 2014-03-30 2014-08-15 > > > Abstract > ======== > > During the initial development of the Python 3 language specification, the > core ``bytes`` type for arbitrary binary data started as the mutable type > that is now referred to as ``bytearray``. Other aspects of operating in > the binary domain in Python have also evolved over the course of the Python > 3 series. > > This PEP proposes a number of small adjustments to the APIs of the > ``bytes`` > and ``bytearray`` types to make it easier to operate entirely in the binary > domain. > > > Background > ========== > > To simplify the task of writing the Python 3 documentation, the ``bytes`` > and ``bytearray`` types were documented primarily in terms of the way they > differed from the Unicode based Python 3 ``str`` type. Even when I > `heavily revised the sequence documentation > `__ in 2012, I retained > that > simplifying shortcut. > > However, it turns out that this approach to the documentation of these > types > had a problem: it doesn't adequately introduce users to their hybrid > nature, > where they can be manipulated *either* as a "sequence of integers" type, > *or* as ``str``-like types that assume ASCII compatible data. 
> > That oversight has now been corrected, with the binary sequence types now > being documented entirely independently of the ``str`` documentation in > `Python 3.4+ < > https://docs.python.org/3/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview > >`__ > > The confusion isn't just a documentation issue, however, as there are also > some lingering design quirks from an earlier pre-release design where there > was *no* separate ``bytearray`` type, and instead the core ``bytes`` type > was mutable (with no immutable counterpart). > > Finally, additional experience with using the existing Python 3 binary > sequence types in real world applications has suggested it would be > beneficial to make it easier to convert integers to length 1 bytes objects. > > > Proposals > ========= > > As a "consistency improvement" proposal, this PEP is actually about a few > smaller micro-proposals, each aimed at improving the usability of the > binary > data model in Python 3. Proposals are motivated by one of two main factors: > > * removing remnants of the original design of ``bytes`` as a mutable type > * allowing users to easily convert integer values to a length 1 ``bytes`` > object > > > Alternate Constructors > ---------------------- > > The ``bytes`` and ``bytearray`` constructors currently accept an integer > argument, but interpret it to mean a zero-filled object of the given > length. > This is a legacy of the original design of ``bytes`` as a mutable type, > rather than a particularly intuitive behaviour for users. It has become > especially confusing now that some other ``bytes`` interfaces treat > integers > and the corresponding length 1 bytes instances as equivalent input. 
> Compare::
>
> >>> b"\x03" in bytes([1, 2, 3])
> True
> >>> 3 in bytes([1, 2, 3])
> True
>
> >>> bytes(b"\x03")
> b'\x03'
> >>> bytes(3)
> b'\x00\x00\x00'
>
> This PEP proposes that the current handling of integers in the bytes and
> bytearray constructors be deprecated in Python 3.5 and targeted for
> removal in Python 3.7, being replaced by two more explicit alternate
> constructors provided as class methods. The initial python-ideas thread
> [ideas-thread1]_ that spawned this PEP was specifically aimed at
> deprecating this constructor behaviour.
>
> Firstly, a ``byte`` constructor is proposed that converts integers
> in the range 0 to 255 (inclusive) to a ``bytes`` object::
>
> >>> bytes.byte(3)
> b'\x03'
> >>> bytearray.byte(3)
> bytearray(b'\x03')
> >>> bytes.byte(512)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> ValueError: bytes must be in range(0, 256)
>
> One specific use case for this alternate constructor is to easily convert
> the result of indexing operations on ``bytes`` and other binary sequences
> from an integer to a ``bytes`` object. The documentation for this API
> should note that its counterpart for the reverse conversion is ``ord()``.
> The ``ord()`` documentation will also be updated to note that while
> ``chr()`` is the counterpart for ``str`` input, ``bytes.byte`` and
> ``bytearray.byte`` are the counterparts for binary input.
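[As a side note for readers following along: ``bytes.byte`` does not exist yet, but the proposed behaviour can be sketched today with a small helper. The ``byte`` function below is a hypothetical stand-in, not part of the stdlib:]

```python
def byte(value):
    # bytes([value]) already enforces the proposed semantics:
    # ValueError for integers outside 0-255, TypeError for non-integers.
    return bytes([value])

data = b"abc"
assert byte(data[0]) == b"a"   # indexing yields an int; byte() converts back
assert byte(3) == b"\x03"
try:
    byte(512)
except ValueError:
    pass                       # out-of-range values are rejected
```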
>
> Secondly, a ``zeros`` constructor is proposed that serves as a direct
> replacement for the current constructor behaviour, rather than having
> to use sequence repetition to achieve the same effect in a less
> intuitive way::
>
> >>> bytes.zeros(3)
> b'\x00\x00\x00'
> >>> bytearray.zeros(3)
> bytearray(b'\x00\x00\x00')
>
> The chosen name here is taken from the corresponding initialisation
> function in NumPy (although, as these are sequence types rather than
> N-dimensional matrices, the constructors take a length as input rather
> than a shape tuple).
>
> While ``bytes.byte`` and ``bytearray.zeros`` are expected to be the more
> useful duo amongst the new constructors, ``bytes.zeros`` and
> ``bytearray.byte`` are provided in order to maintain API consistency
> between the two types.
>
>
> Iteration
> ---------
>
> While iteration over ``bytes`` objects and other binary sequences
> produces integers, it is sometimes desirable to iterate over length 1
> bytes objects instead.
>
> To handle this situation more obviously (and more efficiently) than
> would be the case with the ``map(bytes.byte, data)`` construct enabled
> by the above constructor changes, this PEP proposes the addition of a
> new ``iterbytes`` method to ``bytes``, ``bytearray`` and
> ``memoryview``::
>
> for x in data.iterbytes():
> # x is a length 1 ``bytes`` object, rather than an integer
>
> Third party types and arbitrary containers of integers that lack the
> new method can still be handled by combining ``map`` with the new
> ``bytes.byte()`` alternate constructor proposed above::
>
> for x in map(bytes.byte, data):
> # x is a length 1 ``bytes`` object, rather than an integer
> # This works with *any* container of integers in the range
> # 0 to 255 inclusive
>
>
> Open questions
> ^^^^^^^^^^^^^^
>
> * The fallback case above suggests that this could perhaps be better
> handled as an ``iterbytes(data)`` *builtin*, that used
> ``data.__iterbytes__()`` if defined, but otherwise fell back to
> ``map(bytes.byte, data)``::
>
> for x in iterbytes(data):
> # x is a length 1 ``bytes`` object, rather than an integer
> # This works with *any* container of integers in the range
> # 0 to 255 inclusive
>
>
> References
> ==========
>
> .. [ideas-thread1]
> https://mail.python.org/pipermail/python-ideas/2014-March/027295.html
> .. [empty-buffer-issue] http://bugs.python.org/issue20895
> .. [GvR-initial-feedback]
> https://mail.python.org/pipermail/python-ideas/2014-March/027376.html
>
>
> Copyright
> =========
>
> This document has been placed in the public domain.
>
> --
> Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/guido%40python.org
> --
--Guido van Rossum (python.org/~guido)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From storchaka at gmail.com Fri Aug 15 21:54:22 2014
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Fri, 15 Aug 2014 22:54:22 +0300
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray
In-Reply-To: References: Message-ID: 
15.08.14 08:50, Nick Coghlan wrote:
> * add bytes.zeros() and bytearray.zeros() as a replacement

b'\0' * n and bytearray(b'\0') * n look like good replacements to me.
No need to learn a new method. And they work right now.

> * add bytes.iterbytes(), bytearray.iterbytes() and memoryview.iterbytes()

What are the use cases for this? I suppose the main use case may be
writing code compatible with 2.7 and 3.x. But in that case you need a
wrapper anyway (because these types in 2.7 do not have the iterbytes()
method). And how large would the advantage of this method be over
``map(bytes.byte, data)``?
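[For concreteness, the two spellings being compared in this subthread can be sketched today. ``iterbytes`` below is a plain generator standing in for the proposed method, not a real API:]

```python
def iterbytes(data):
    # Stand-in for the proposed data.iterbytes(): slicing a bytes
    # object yields length-1 bytes objects, with no per-item checks.
    return (data[i:i + 1] for i in range(len(data)))

assert list(iterbytes(b"abc")) == [b"a", b"b", b"c"]

# The map()-style alternative re-validates each item via the constructor
# (bytes([i]) is today's spelling of the proposed bytes.byte(i)):
assert [bytes([i]) for i in b"abc"] == [b"a", b"b", b"c"]
```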
From victor.stinner at gmail.com Fri Aug 15 21:59:40 2014
From: victor.stinner at gmail.com (Victor Stinner)
Date: Fri, 15 Aug 2014 21:59:40 +0200
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray
In-Reply-To: References: Message-ID: 
2014-08-15 21:54 GMT+02:00 Serhiy Storchaka :
> 15.08.14 08:50, Nick Coghlan wrote:
>> * add bytes.zeros() and bytearray.zeros() as a replacement
>
> b'\0' * n and bytearray(b'\0') * n look like good replacements to me.
> No need to learn a new method. And they work right now.

FYI there is a pending patch for bytearray(int) to use calloc() instead
of malloc(). It's faster for buffers larger than 1 MB:
http://bugs.python.org/issue21644

I'm not sure that the optimization is really useful.

Victor

From victor.stinner at gmail.com Fri Aug 15 21:55:46 2014
From: victor.stinner at gmail.com (Victor Stinner)
Date: Fri, 15 Aug 2014 21:55:46 +0200
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray
In-Reply-To: References: Message-ID: 
2014-08-15 7:50 GMT+02:00 Nick Coghlan :
> As far as I am aware, that last item poses the only open question,
> with the alternative being to add an "iterbytes" builtin (...)

Do you have examples of use cases for a builtin function? I only found
5 usages of the bytes((byte,)) constructor in the standard library:

$ grep -E 'bytes\(\([^)]+, *\)\)' $(find -name "*.py")
./Lib/quopri.py: c = bytes((c,))
./Lib/quopri.py: c = bytes((c,))
./Lib/base64.py: b32tab = [bytes((i,)) for i in _b32alphabet]
./Lib/base64.py: _a85chars = [bytes((i,)) for i in range(33, 118)]
./Lib/base64.py: _b85chars = [bytes((i,)) for i in _b85alphabet]

bytes.iterbytes() could be used in 4 of the 5 cases. Adding a new
builtin for a single line in the whole standard library doesn't look
right.
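[The ``bytes((i,))`` construct grepped for above is today's idiom for converting an integer to a length-1 bytes object, as a quick illustration:]

```python
# Current stdlib spelling (seen in Lib/base64.py and Lib/quopri.py):
c = bytes((65,))
assert c == b"A"

# e.g. building a small translation table the way base64.py does;
# iterating bytes yields integers, each wrapped back into bytes:
tab = [bytes((i,)) for i in b"ABC"]
assert tab == [b"A", b"B", b"C"]
```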
Victor

From ethan at stoneleaf.us Fri Aug 15 23:03:40 2014
From: ethan at stoneleaf.us (Ethan Furman)
Date: Fri, 15 Aug 2014 14:03:40 -0700
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <20140813033855.GH4525@ando>
References: <20140811230800.GA12210@gensokyo> <20140812121541.GG4525@ando> <20140813033855.GH4525@ando>
Message-ID: <53EE75AC.3040106@stoneleaf.us>

On 08/12/2014 08:38 PM, Steven D'Aprano wrote:
>
> [1] Technically not, since it's the comma, not the ( ), which makes a
> tuple, but a lot of people don't know that and treat it as if the
> parens were compulsory.

It might as well be: if there is a non-tuple way to interpret the
comma, that interpretation takes precedence, and then the parens /are/
required to disambiguate and get the tuple you wanted.

--
~Ethan~

From ethan at stoneleaf.us Fri Aug 15 23:08:42 2014
From: ethan at stoneleaf.us (Ethan Furman)
Date: Fri, 15 Aug 2014 14:08:42 -0700
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <20140813173225.GL4525@ando>
References: <20140811230800.GA12210@gensokyo> <20140812121541.GG4525@ando> <87sil08k8p.fsf@gmail.com> <20140813173225.GL4525@ando>
Message-ID: <53EE76DA.1060908@stoneleaf.us>

On 08/13/2014 10:32 AM, Steven D'Aprano wrote:
>
> (2) Also note that *this is already the case*, since tuples are made by
> the commas, not the parentheses. E.g. this succeeds:
>
> # Not a tuple, actually two context managers.
> with open("/tmp/foo"), open("/tmp/bar", "w"):
> pass

Thanks for proving my point! A comma, and yet we did *not* get a tuple from it.
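[The grammar point being debated here is easy to check interactively: commas build tuples in expression context, while the commas in a ``with`` statement are part of the statement grammar and never build a tuple:]

```python
t = 1, 2, 3            # expression context: the commas make the tuple
assert isinstance(t, tuple)

single = 1,            # even a lone trailing comma does
assert isinstance(single, tuple)

empty = ()             # the empty tuple is the one case that needs parens
assert isinstance(empty, tuple)

# By contrast, in ``with cm1, cm2:`` the comma separates two context
# managers in the statement grammar; no tuple object is ever created.
```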
-- ~Ethan~ From g.brandl at gmx.net Fri Aug 15 23:34:32 2014 From: g.brandl at gmx.net (Georg Brandl) Date: Fri, 15 Aug 2014 23:34:32 +0200 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: <53EE76DA.1060908@stoneleaf.us> References: <20140811230800.GA12210@gensokyo> <20140812121541.GG4525@ando> <87sil08k8p.fsf@gmail.com> <20140813173225.GL4525@ando> <53EE76DA.1060908@stoneleaf.us> Message-ID: On 08/15/2014 11:08 PM, Ethan Furman wrote: > On 08/13/2014 10:32 AM, Steven D'Aprano wrote: >> >> (2) Also note that *this is already the case*, since tuples are made by >> the commas, not the parentheses. E.g. this succeeds: >> >> # Not a tuple, actually two context managers. >> with open("/tmp/foo"), open("/tmp/bar", "w"): >> pass > > Thanks for proving my point! A comma, and yet we did *not* get a tuple from it. Clearly the rule is that the comma makes the tuple, except when it doesn't :) Georg From steve at pearwood.info Sat Aug 16 05:08:48 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 16 Aug 2014 13:08:48 +1000 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: <53EE76DA.1060908@stoneleaf.us> References: <20140811230800.GA12210@gensokyo> <20140812121541.GG4525@ando> <87sil08k8p.fsf@gmail.com> <20140813173225.GL4525@ando> <53EE76DA.1060908@stoneleaf.us> Message-ID: <20140816030847.GD4525@ando> On Fri, Aug 15, 2014 at 02:08:42PM -0700, Ethan Furman wrote: > On 08/13/2014 10:32 AM, Steven D'Aprano wrote: > > > >(2) Also note that *this is already the case*, since tuples are made by > >the commas, not the parentheses. E.g. this succeeds: > > > ># Not a tuple, actually two context managers. > >with open("/tmp/foo"), open("/tmp/bar", "w"): > > pass > > Thanks for proving my point! A comma, and yet we did *not* get a tuple > from it. Um, sorry, I don't quite get you. Are you agreeing or disagreeing with me? 
I spent half of yesterday reading the static typing thread over on
Python-ideas and it's possible my brain has melted down *wink* but I'm
confused by your response.

Normally when people say "Thanks for proving my point", the implication
is that the person being thanked (in this case me) has inadvertently
undercut their own argument. I don't think I have.

I'm suggesting that the argument *against* the proposal:

"Multi-line with statements should not be allowed, because:

with (spam, eggs, cheese): ...

is syntactically a tuple"

is a poor argument (that is, I'm disagreeing with it), since *single*
line parens-free with statements are already syntactically a tuple:

with spam, eggs, cheese: # Commas make a tuple, not parens.
...

I think the OP's suggestion is a sound one, and while I take Nick's
point that bulky with-statements *may* be a sign that some re-factoring
is needed, many things are signs that re-factoring is needed, and I
don't think this particular one warrants rejecting what is otherwise an
obvious and clear way of using multiple context managers.

--
Steven

From ethan at stoneleaf.us Sat Aug 16 05:29:09 2014
From: ethan at stoneleaf.us (Ethan Furman)
Date: Fri, 15 Aug 2014 20:29:09 -0700
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <20140816030847.GD4525@ando>
References: <20140811230800.GA12210@gensokyo> <20140812121541.GG4525@ando> <87sil08k8p.fsf@gmail.com> <20140813173225.GL4525@ando> <53EE76DA.1060908@stoneleaf.us> <20140816030847.GD4525@ando>
Message-ID: <53EED005.5020407@stoneleaf.us>

On 08/15/2014 08:08 PM, Steven D'Aprano wrote:
> On Fri, Aug 15, 2014 at 02:08:42PM -0700, Ethan Furman wrote:
>> On 08/13/2014 10:32 AM, Steven D'Aprano wrote:
>>>
>>> (2) Also note that *this is already the case*, since tuples are made by
>>> the commas, not the parentheses. E.g. this succeeds:
>>>
>>> # Not a tuple, actually two context managers.
>>> with open("/tmp/foo"), open("/tmp/bar", "w"):
>>> pass
>>
>> Thanks for proving my point! A comma, and yet we did *not* get a tuple
>> from it.
>
> Um, sorry, I don't quite get you. Are you agreeing or disagreeing with
> me? I spent half of yesterday reading the static typing thread over on
> Python-ideas and it's possible my brain has melted down *wink* but I'm
> confused by your response.

My point is that commas don't always make a tuple, and your example
above is a case in point: we have a comma separating two context
managers, but we do not have a tuple, and your comment even says so.

> is a poor argument (that is, I'm disagreeing with it), since *single*
> line parens-free with statements are already syntactically a tuple:
>
> with spam, eggs, cheese: # Commas make a tuple, not parens.

This point I do not understand -- commas /can/ create a tuple, but
don't /necessarily/ create a tuple. So, semantically: no tuple.
Syntactically: I don't think there's a tuple there this way either. I
suppose one of us should look it up in the lexer. ;)

--
~Ethan~

From ncoghlan at gmail.com Sat Aug 16 07:17:35 2014
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 16 Aug 2014 15:17:35 +1000
Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray
In-Reply-To: References: Message-ID: 

On 16 August 2014 03:48, Guido van Rossum wrote:
> This feels chatty. I'd like the PEP to call out the specific proposals and
> put the more verbose motivation later.

I realised that some of that history was actually completely irrelevant
now, so I culled a fair bit of it entirely.

> It took me a long time to realize
> that you don't want to deprecate bytes([1, 2, 3]), but only bytes(3).

I've split out the four subproposals into their own sections, so
hopefully this is clearer now.

> Also
> your mention of bytes.byte() as the counterpart to ord() confused me -- I
> think it's more similar to chr().
This was just a case of me using the wrong word - I meant "inverse" rather than "counterpart". > I don't like iterbytes as a builtin, let's > keep it as a method on affected types. Done. I also added an explanation of the benefits it offers over the more generic "map(bytes.byte, data)", as well as more precise semantics for how it will work with memoryview objects. New draft is live at http://www.python.org/dev/peps/pep-0467/, as well as being included inline below. Regards, Nick. =================================== PEP: 467 Title: Minor API improvements for bytes and bytearray Version: $Revision$ Last-Modified: $Date$ Author: Nick Coghlan Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 2014-03-30 Python-Version: 3.5 Post-History: 2014-03-30 2014-08-15 2014-08-16 Abstract ======== During the initial development of the Python 3 language specification, the core ``bytes`` type for arbitrary binary data started as the mutable type that is now referred to as ``bytearray``. Other aspects of operating in the binary domain in Python have also evolved over the course of the Python 3 series. 
This PEP proposes four small adjustments to the APIs of the ``bytes``, ``bytearray`` and ``memoryview`` types to make it easier to operate entirely in the binary domain: * Deprecate passing single integer values to ``bytes`` and ``bytearray`` * Add ``bytes.zeros`` and ``bytearray.zeros`` alternative constructors * Add ``bytes.byte`` and ``bytearray.byte`` alternative constructors * Add ``bytes.iterbytes``, ``bytearray.iterbytes`` and ``memoryview.iterbytes`` alternative iterators Proposals ========= Deprecation of current "zero-initialised sequence" behaviour ------------------------------------------------------------ Currently, the ``bytes`` and ``bytearray`` constructors accept an integer argument and interpret it as meaning to create a zero-initialised sequence of the given size:: >>> bytes(3) b'\x00\x00\x00' >>> bytearray(3) bytearray(b'\x00\x00\x00') This PEP proposes to deprecate that behaviour in Python 3.5, and remove it entirely in Python 3.6. No other changes are proposed to the existing constructors. Addition of explicit "zero-initialised sequence" constructors ------------------------------------------------------------- To replace the deprecated behaviour, this PEP proposes the addition of an explicit ``zeros`` alternative constructor as a class method on both ``bytes`` and ``bytearray``:: >>> bytes.zeros(3) b'\x00\x00\x00' >>> bytearray.zeros(3) bytearray(b'\x00\x00\x00') It will behave just as the current constructors behave when passed a single integer. 
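[Until ``bytes.zeros`` exists, the constructor behaviour slated for deprecation and its sequence-repetition equivalent can be compared directly:]

```python
n = 3
assert bytes(n) == b"\x00" * n                 # behaviour to be deprecated
assert bytearray(n) == bytearray(b"\x00") * n  # same for bytearray
# bytes.zeros(n) / bytearray.zeros(n) would be the explicit replacements.
```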
The specific choice of ``zeros`` as the alternative constructor name is
taken from the corresponding initialisation function in NumPy (although,
as these are 1-dimensional sequence types rather than N-dimensional
matrices, the constructors take a length as input rather than a shape
tuple).


Addition of explicit "single byte" constructors
-----------------------------------------------

As binary counterparts to the text ``chr`` function, this PEP proposes
the addition of an explicit ``byte`` alternative constructor as a class
method on both ``bytes`` and ``bytearray``::

>>> bytes.byte(3)
b'\x03'
>>> bytearray.byte(3)
bytearray(b'\x03')

These methods will only accept integers in the range 0 to 255
(inclusive)::

>>> bytes.byte(512)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: bytes must be in range(0, 256)

>>> bytes.byte(1.0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'float' object cannot be interpreted as an integer

The documentation of the ``ord`` builtin will be updated to explicitly
note that ``bytes.byte`` is the inverse operation for binary data, while
``chr`` is the inverse operation for text data.

Behaviourally, ``bytes.byte(x)`` will be equivalent to the current
``bytes([x])`` (and similarly for ``bytearray``). The new spelling is
expected to be easier to discover and easier to read (especially when
used in conjunction with indexing operations on binary sequence types).
As a separate method, the new spelling will also work better with higher
order functions like ``map``.
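[The stated equivalence can be exercised with the current spelling, ``bytes([x])``, including the ``ord`` round trip and the higher-order use with ``map``:]

```python
x = 200
b = bytes([x])          # current spelling of the proposed bytes.byte(x)
assert ord(b) == x      # ord() inverts the conversion for binary data
assert chr(x) == "\xc8" # chr() is the text-side inverse, returning a str

# Higher-order use, as anticipated for bytes.byte:
assert list(map(lambda i: bytes([i]), b"hi")) == [b"h", b"i"]
```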
Addition of optimised iterator methods that produce ``bytes`` objects --------------------------------------------------------------------- This PEP proposes that ``bytes``, ``bytearray`` and ``memoryview`` gain an optimised ``iterbytes`` method that produces length 1 ``bytes`` objects rather than integers:: for x in data.iterbytes(): # x is a length 1 ``bytes`` object, rather than an integer The method can be used with arbitrary buffer exporting objects by wrapping them in a ``memoryview`` instance first:: for x in memoryview(data).iterbytes(): # x is a length 1 ``bytes`` object, rather than an integer For ``memoryview``, the semantics of ``iterbytes()`` are defined such that:: memview.tobytes() == b''.join(memview.iterbytes()) This allows the raw bytes of the memory view to be iterated over without needing to make a copy, regardless of the defined shape and format. The main advantage this method offers over the ``map(bytes.byte, data)`` approach is that it is guaranteed *not* to fail midstream with a ``ValueError`` or ``TypeError``. By contrast, when using the ``map`` based approach, the type and value of the individual items in the iterable are only checked as they are retrieved and passed through the ``bytes.byte`` constructor. Design discussion ================= Why not rely on sequence repetition to create zero-initialised sequences? ------------------------------------------------------------------------- Zero-initialised sequences can be created via sequence repetition:: >>> b'\x00' * 3 b'\x00\x00\x00' >>> bytearray(b'\x00') * 3 bytearray(b'\x00\x00\x00') However, this was also the case when the ``bytearray`` type was originally designed, and the decision was made to add explicit support for it in the type constructor. The immutable ``bytes`` type then inherited that feature when it was introduced in PEP 3137. 
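[The ``memoryview.iterbytes`` semantics given above (``memview.tobytes() == b''.join(memview.iterbytes())``) can be emulated today without copying, via ``memoryview.cast``; this is an illustrative sketch under those stated semantics, not the proposed implementation:]

```python
import array

def iterbytes(view):
    # Flatten to a 1-D unsigned-byte view (no copy), then yield
    # length-1 bytes objects over the raw memory, whatever the
    # original shape and format were.
    flat = view.cast("B")
    return (flat[i:i + 1].tobytes() for i in range(flat.nbytes))

data = array.array("H", [258, 3])   # each item occupies two raw bytes
mv = memoryview(data)
assert b"".join(iterbytes(mv)) == mv.tobytes()
```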
This PEP isn't revisiting that original design decision, just changing the spelling as users sometimes find the current behaviour of the binary sequence constructors surprising. In particular, there's a reasonable case to be made that ``bytes(x)`` (where ``x`` is an integer) should behave like the ``bytes.byte(x)`` proposal in this PEP. Providing both behaviours as separate class methods avoids that ambiguity. References ========== .. [1] Initial March 2014 discussion thread on python-ideas (https://mail.python.org/pipermail/python-ideas/2014-March/027295.html) .. [2] Guido's initial feedback in that thread (https://mail.python.org/pipermail/python-ideas/2014-March/027376.html) .. [3] Issue proposing moving zero-initialised sequences to a dedicated API (http://bugs.python.org/issue20895) .. [4] Issue proposing to use calloc() for zero-initialised binary sequences (http://bugs.python.org/issue21644) .. [5] August 2014 discussion thread on python-dev (https://mail.python.org/pipermail/python-ideas/2014-March/027295.html) -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From steve at pearwood.info Sat Aug 16 07:41:47 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 16 Aug 2014 15:41:47 +1000 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: <53EED005.5020407@stoneleaf.us> References: <20140811230800.GA12210@gensokyo> <20140812121541.GG4525@ando> <87sil08k8p.fsf@gmail.com> <20140813173225.GL4525@ando> <53EE76DA.1060908@stoneleaf.us> <20140816030847.GD4525@ando> <53EED005.5020407@stoneleaf.us> Message-ID: <20140816054147.GG4525@ando> On Fri, Aug 15, 2014 at 08:29:09PM -0700, Ethan Furman wrote: > On 08/15/2014 08:08 PM, Steven D'Aprano wrote: [...] > >is a poor argument (that is, I'm disagreeing with it), since *single* > >line parens-free with statements are already syntactically a tuple: > > > > with spam, eggs, cheese: # Commas make a tuple, not parens. 
>
> This point I do not understand -- commas /can/ create a tuple, but don't
> /necessarily/ create a tuple. So, semantically: no tuple.

Right! I think we are in agreement. It's not that with statements
actually generate a tuple, but that they *look* like they include a
tuple. That's what I meant by "syntactically a tuple", sorry if that was
confusing. I didn't mean to suggest that Python necessarily builds a
tuple of context managers.

If people were going to be prone to mistake

with (a, b, c): ...

as including a tuple, they would have already mistaken:

with a, b, c: ...

the same way. But they haven't.

--
Steven

From ben+python at benfinney.id.au Sat Aug 16 09:25:33 2014
From: ben+python at benfinney.id.au (Ben Finney)
Date: Sat, 16 Aug 2014 17:25:33 +1000
Subject: [Python-Dev] Multiline with statement line continuation
References: <20140811230800.GA12210@gensokyo> <20140812121541.GG4525@ando> <87sil08k8p.fsf@gmail.com> <20140813173225.GL4525@ando> <53EE76DA.1060908@stoneleaf.us> <20140816030847.GD4525@ando> <53EED005.5020407@stoneleaf.us> <20140816054147.GG4525@ando>
Message-ID: <85bnrk6glu.fsf@benfinney.id.au>

Steven D'Aprano writes:

> If people were going to be prone to mistake
>
> with (a, b, c): ...
>
> as including a tuple

-- because the parens are a strong signal "this is an expression to be
evaluated, resulting in a single value to use in the statement".

> they would have already mistaken:
>
> with a, b, c: ...
>
> the same way. But they haven't.

Right. The presence or absence of parens makes a big semantic difference.

--
\ "The process by which banks create money is so simple that the |
`\ mind is repelled."
--John Kenneth Galbraith, _Money: Whence It |
_o__) Came, Where It Went_, 1975 |
Ben Finney

From jeanpierreda at gmail.com Sat Aug 16 10:04:13 2014
From: jeanpierreda at gmail.com (Devin Jeanpierre)
Date: Sat, 16 Aug 2014 01:04:13 -0700
Subject: [Python-Dev] Multiline with statement line continuation
In-Reply-To: <85bnrk6glu.fsf@benfinney.id.au>
References: <20140811230800.GA12210@gensokyo> <20140812121541.GG4525@ando> <87sil08k8p.fsf@gmail.com> <20140813173225.GL4525@ando> <53EE76DA.1060908@stoneleaf.us> <20140816030847.GD4525@ando> <53EED005.5020407@stoneleaf.us> <20140816054147.GG4525@ando> <85bnrk6glu.fsf@benfinney.id.au>
Message-ID: 

On Sat, Aug 16, 2014 at 12:25 AM, Ben Finney wrote:
> Steven D'Aprano writes:
>
>> If people were going to be prone to mistake
>>
>> with (a, b, c): ...
>>
>> as including a tuple
>
> -- because the parens are a strong signal "this is an expression to be
> evaluated, resulting in a single value to use in the statement".
>
>> they would have already mistaken:
>>
>> with a, b, c: ...
>>
>> the same way. But they haven't.
>
> Right. The presence or absence of parens makes a big semantic difference.

At least historically so, since "except a, b:" and "except (a, b):"
used to be different things (only the latter constructs a tuple in
2.x). OTOH, consider "from .. import (..., ..., ...)". Pretty sure at
this point parens can be used for non-expressions quite reasonably --
although I'd still prefer just allowing newlines without requiring
extra syntax.

-- Devin

From senthil at uthcode.com Sat Aug 16 12:40:02 2014
From: senthil at uthcode.com (Senthil Kumaran)
Date: Sat, 16 Aug 2014 16:10:02 +0530
Subject: [Python-Dev] [Python-checkins] cpython (2.7): Fix Issue #8797: Raise HTTPError on failed Basic Authentication immediately.
In-Reply-To: <3hZwDy6FF8z7Ljj@mail.python.org> References: <3hZwDy6FF8z7Ljj@mail.python.org> Message-ID: I added some extra coverage for basic auth in the tests and I notice that in buildbots, some of them are throwing "error: [Errno 32] Broken pipe" error. I am looking into this and will fix this. Thanks, Senthil On Sat, Aug 16, 2014 at 2:19 PM, senthil.kumaran wrote: > http://hg.python.org/cpython/rev/e0510a3bdf8f > changeset: 92111:e0510a3bdf8f > branch: 2.7 > parent: 92097:6d41f139709b > user: Senthil Kumaran > date: Sat Aug 16 14:16:14 2014 +0530 > summary: > Fix Issue #8797: Raise HTTPError on failed Basic Authentication > immediately. Initial patch by Sam Bull. > > files: > Lib/test/test_urllib2_localnet.py | 86 ++++++++++++++++++- > Lib/urllib2.py | 19 +--- > Misc/NEWS | 3 + > 3 files changed, 90 insertions(+), 18 deletions(-) > > > diff --git a/Lib/test/test_urllib2_localnet.py > b/Lib/test/test_urllib2_localnet.py > --- a/Lib/test/test_urllib2_localnet.py > +++ b/Lib/test/test_urllib2_localnet.py > @@ -1,6 +1,8 @@ > +import base64 > import urlparse > import urllib2 > import BaseHTTPServer > +import SimpleHTTPServer > import unittest > import hashlib > > @@ -66,6 +68,48 @@ > > # Authentication infrastructure > > + > +class BasicAuthHandler(SimpleHTTPServer.SimpleHTTPRequestHandler): > + """Handler for performing Basic Authentication.""" > + # Server side values > + USER = "testUser" > + PASSWD = "testPass" > + REALM = "Test" > + USER_PASSWD = "%s:%s" % (USER, PASSWD) > + ENCODED_AUTH = base64.b64encode(USER_PASSWD) > + > + def __init__(self, *args, **kwargs): > + SimpleHTTPServer.SimpleHTTPRequestHandler.__init__(self, *args, > + **kwargs) > + > + def log_message(self, format, *args): > + # Supress the HTTP Console log output > + pass > + > + def do_HEAD(self): > + self.send_response(200) > + self.send_header("Content-type", "text/html") > + self.end_headers() > + > + def do_AUTHHEAD(self): > + self.send_response(401) > + 
self.send_header("WWW-Authenticate", "Basic realm=\"%s\"" % > self.REALM) > + self.send_header("Content-type", "text/html") > + self.end_headers() > + > + def do_GET(self): > + if self.headers.getheader("Authorization") == None: > + self.do_AUTHHEAD() > + self.wfile.write("No Auth Header Received") > + elif self.headers.getheader( > + "Authorization") == "Basic " + self.ENCODED_AUTH: > + SimpleHTTPServer.SimpleHTTPRequestHandler.do_GET(self) > + else: > + self.do_AUTHHEAD() > + self.wfile.write(self.headers.getheader("Authorization")) > + self.wfile.write("Not Authenticated") > + > + > class DigestAuthHandler: > """Handler for performing digest authentication.""" > > @@ -228,6 +272,45 @@ > test_support.threading_cleanup(*self._threads) > > > +class BasicAuthTests(BaseTestCase): > + USER = "testUser" > + PASSWD = "testPass" > + INCORRECT_PASSWD = "Incorrect" > + REALM = "Test" > + > + def setUp(self): > + super(BasicAuthTests, self).setUp() > + # With Basic Authentication > + def http_server_with_basic_auth_handler(*args, **kwargs): > + return BasicAuthHandler(*args, **kwargs) > + self.server = > LoopbackHttpServerThread(http_server_with_basic_auth_handler) > + self.server_url = 'http://127.0.0.1:%s' % self.server.port > + self.server.start() > + self.server.ready.wait() > + > + def tearDown(self): > + self.server.stop() > + super(BasicAuthTests, self).tearDown() > + > + def test_basic_auth_success(self): > + ah = urllib2.HTTPBasicAuthHandler() > + ah.add_password(self.REALM, self.server_url, self.USER, > self.PASSWD) > + urllib2.install_opener(urllib2.build_opener(ah)) > + try: > + self.assertTrue(urllib2.urlopen(self.server_url)) > + except urllib2.HTTPError: > + self.fail("Basic Auth Failed for url: %s" % self.server_url) > + except Exception as e: > + raise e > + > + def test_basic_auth_httperror(self): > + ah = urllib2.HTTPBasicAuthHandler() > + ah.add_password(self.REALM, self.server_url, self.USER, > + self.INCORRECT_PASSWD) > + 
urllib2.install_opener(urllib2.build_opener(ah)) > + self.assertRaises(urllib2.HTTPError, urllib2.urlopen, > self.server_url) > + > + > class ProxyAuthTests(BaseTestCase): > URL = "http://localhost" > > @@ -240,6 +323,7 @@ > self.digest_auth_handler = DigestAuthHandler() > self.digest_auth_handler.set_users({self.USER: self.PASSWD}) > self.digest_auth_handler.set_realm(self.REALM) > + # With Digest Authentication > def create_fake_proxy_handler(*args, **kwargs): > return FakeProxyHandler(self.digest_auth_handler, *args, > **kwargs) > > @@ -544,7 +628,7 @@ > # the next line. > #test_support.requires("network") > > - test_support.run_unittest(ProxyAuthTests, TestUrlopen) > + test_support.run_unittest(BasicAuthTests, ProxyAuthTests, TestUrlopen) > > if __name__ == "__main__": > test_main() > diff --git a/Lib/urllib2.py b/Lib/urllib2.py > --- a/Lib/urllib2.py > +++ b/Lib/urllib2.py > @@ -843,10 +843,7 @@ > password_mgr = HTTPPasswordMgr() > self.passwd = password_mgr > self.add_password = self.passwd.add_password > - self.retried = 0 > > - def reset_retry_count(self): > - self.retried = 0 > > def http_error_auth_reqed(self, authreq, host, req, headers): > # host may be an authority (without userinfo) or a URL with an > @@ -854,13 +851,6 @@ > # XXX could be multiple headers > authreq = headers.get(authreq, None) > > - if self.retried > 5: > - # retry sending the username:password 5 times before failing. 
> - raise HTTPError(req.get_full_url(), 401, "basic auth failed", > - headers, None) > - else: > - self.retried += 1 > - > if authreq: > mo = AbstractBasicAuthHandler.rx.search(authreq) > if mo: > @@ -869,17 +859,14 @@ > warnings.warn("Basic Auth Realm was unquoted", > UserWarning, 2) > if scheme.lower() == 'basic': > - response = self.retry_http_basic_auth(host, req, > realm) > - if response and response.code != 401: > - self.retried = 0 > - return response > + return self.retry_http_basic_auth(host, req, realm) > > def retry_http_basic_auth(self, host, req, realm): > user, pw = self.passwd.find_user_password(realm, host) > if pw is not None: > raw = "%s:%s" % (user, pw) > auth = 'Basic %s' % base64.b64encode(raw).strip() > - if req.headers.get(self.auth_header, None) == auth: > + if req.get_header(self.auth_header, None) == auth: > return None > req.add_unredirected_header(self.auth_header, auth) > return self.parent.open(req, timeout=req.timeout) > @@ -895,7 +882,6 @@ > url = req.get_full_url() > response = self.http_error_auth_reqed('www-authenticate', > url, req, headers) > - self.reset_retry_count() > return response > > > @@ -911,7 +897,6 @@ > authority = req.get_host() > response = self.http_error_auth_reqed('proxy-authenticate', > authority, req, headers) > - self.reset_retry_count() > return response > > > diff --git a/Misc/NEWS b/Misc/NEWS > --- a/Misc/NEWS > +++ b/Misc/NEWS > @@ -19,6 +19,9 @@ > Library > ------- > > +- Issue #8797: Raise HTTPError on failed Basic Authentication immediately. > + Initial patch by Sam Bull. > + > - Issue #21448: Changed FeedParser feed() to avoid O(N**2) behavior when > parsing long line. Original patch by Raymond Hettinger. 
> > > -- > Repository URL: http://hg.python.org/cpython > > _______________________________________________ > Python-checkins mailing list > Python-checkins at python.org > https://mail.python.org/mailman/listinfo/python-checkins > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at pearwood.info Sat Aug 16 13:16:52 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 16 Aug 2014 21:16:52 +1000 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: <85bnrk6glu.fsf@benfinney.id.au> References: <20140812121541.GG4525@ando> <87sil08k8p.fsf@gmail.com> <20140813173225.GL4525@ando> <53EE76DA.1060908@stoneleaf.us> <20140816030847.GD4525@ando> <53EED005.5020407@stoneleaf.us> <20140816054147.GG4525@ando> <85bnrk6glu.fsf@benfinney.id.au> Message-ID: <20140816111652.GI4525@ando> On Sat, Aug 16, 2014 at 05:25:33PM +1000, Ben Finney wrote: [...] > > they would have already mistaken: > > > > with a, b, c: ... > > > > the same way. But they haven't. > > Right. The presence or absence of parens make a big semantic difference.

from silly.mistakes.programmers.make import (
    hands, up, anyone, who, thinks, this, is_, a, tuple)

def function(how, about, this, one):
    ...

But quite frankly, even if there is some person somewhere who gets confused and tries to write:

context_managers = (open("a"), open("b", "w"), open("c", "w"))

with context_managers as things:
    text = things[0].read()
    things[1].write(text)
    things[2].write(text.upper())

I simply don't care. They will try it, discover that tuples are not context managers, fix their code, and move on. (I've made sillier mistakes, and became a better programmer from it.) We cannot paralyse ourselves out of fear that somebody, somewhere, will make a silly mistake. You can try that "with tuple" code right now, and you will get a nice runtime exception.
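Steven's point can be checked directly. Here is a minimal, self-contained sketch of the failure, using in-memory streams rather than real files; note that the exact exception type has varied across Python 3 releases (older versions raise AttributeError for the missing __enter__, newer ones raise TypeError), so both are hedged for below:

```python
import io

# A tuple of perfectly good context managers is still just a tuple,
# and tuple defines neither __enter__ nor __exit__.
streams = (io.StringIO("text"), io.StringIO())

try:
    with streams as things:
        pass
except (AttributeError, TypeError) as err:
    # The with statement fails before any I/O happens.
    caught = err

print(type(caught).__name__)
```

The message itself differs by version, which is exactly the "not the most descriptive" point made below.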
I admit that the error message is not the most descriptive I've ever seen, but I've seen worse, and any half-decent programmer can do what they do for any other unexpected exception: read the Fine Manual, or ask for help, or otherwise debug the problem. Why should this specific exception be treated as so harmful that we have to forgo a useful piece of functionality to avoid it? Some designs are bug-magnets, like the infamous "except A,B" syntax, which fails silently, doing the wrong thing. Unless someone has a convincing rationale for how and why this multi-line with will likewise be a bug-magnet, I don't think that some vague similarity between it and tuples is justification for rejecting the proposal. -- Steven From marko at pacujo.net Sat Aug 16 14:47:06 2014 From: marko at pacujo.net (Marko Rauhamaa) Date: Sat, 16 Aug 2014 15:47:06 +0300 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: <20140816111652.GI4525@ando> (Steven D'Aprano's message of "Sat, 16 Aug 2014 21:16:52 +1000") References: <20140812121541.GG4525@ando> <87sil08k8p.fsf@gmail.com> <20140813173225.GL4525@ando> <53EE76DA.1060908@stoneleaf.us> <20140816030847.GD4525@ando> <53EED005.5020407@stoneleaf.us> <20140816054147.GG4525@ando> <85bnrk6glu.fsf@benfinney.id.au> <20140816111652.GI4525@ando> Message-ID: <8738cwa9f9.fsf@elektro.pacujo.net> Steven D'Aprano : > I simply don't care. They will try it, discover that tuples are not > context managers, fix their code, and move on. *Could* tuples (and lists and sequences) be context managers? *Should* tuples (and lists and sequences) be context managers? > I don't think that some vague similarity between it and tuples is > justification for rejecting the proposal. You might be able to have it both ways. You could have:

with (open(name) for name in os.listdir("config")) as files:
    ...
Marko From rosuav at gmail.com Sat Aug 16 23:42:25 2014 From: rosuav at gmail.com (Chris Angelico) Date: Sun, 17 Aug 2014 07:42:25 +1000 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: <8738cwa9f9.fsf@elektro.pacujo.net> References: <20140812121541.GG4525@ando> <87sil08k8p.fsf@gmail.com> <20140813173225.GL4525@ando> <53EE76DA.1060908@stoneleaf.us> <20140816030847.GD4525@ando> <53EED005.5020407@stoneleaf.us> <20140816054147.GG4525@ando> <85bnrk6glu.fsf@benfinney.id.au> <20140816111652.GI4525@ando> <8738cwa9f9.fsf@elektro.pacujo.net> Message-ID: On Sat, Aug 16, 2014 at 10:47 PM, Marko Rauhamaa wrote: > > You might be able to have it bothways. You could have: > > with (open(name) for name in os.listdir("config")) as files: But that's not a tuple, it's a generator. Should generators be context managers? Is anyone seriously suggesting this? I don't think so. Is this a solution looking for a problem? ChrisA From ncoghlan at gmail.com Sun Aug 17 03:10:00 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 17 Aug 2014 11:10:00 +1000 Subject: [Python-Dev] Multiline with statement line continuation In-Reply-To: References: <20140812121541.GG4525@ando> <87sil08k8p.fsf@gmail.com> <20140813173225.GL4525@ando> <53EE76DA.1060908@stoneleaf.us> <20140816030847.GD4525@ando> <53EED005.5020407@stoneleaf.us> <20140816054147.GG4525@ando> <85bnrk6glu.fsf@benfinney.id.au> <20140816111652.GI4525@ando> <8738cwa9f9.fsf@elektro.pacujo.net> Message-ID: On 17 August 2014 07:42, Chris Angelico wrote: > On Sat, Aug 16, 2014 at 10:47 PM, Marko Rauhamaa wrote: >> >> You might be able to have it bothways. You could have: >> >> with (open(name) for name in os.listdir("config")) as files: > > But that's not a tuple, it's a generator. Should generators be context > managers? Is anyone seriously suggesting this? I don't think so. Is > this solutions looking for problems? Yes.
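For context, the standard library did grow an answer to "manage an arbitrary number of context managers" without making tuples or generators context managers themselves: contextlib.ExitStack, added in Python 3.3. A minimal sketch of how Marko's example could be handled with it (in-memory streams stand in for the open() calls, so the sketch is self-contained):

```python
import contextlib
import io

# Stand-ins for the open() calls in the example above; any context
# managers would work here.
sources = ["alpha", "beta", "gamma"]

with contextlib.ExitStack() as stack:
    # enter_context() enters each manager and registers its exit;
    # everything is unwound in reverse order when the block ends,
    # even if entering one of the later managers fails.
    files = [stack.enter_context(io.StringIO(text)) for text in sources]
    contents = [f.read() for f in files]

print(contents)  # ['alpha', 'beta', 'gamma']
```

This keeps the "many managers" case explicit while leaving the semantics of tuples and generators alone.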
We have a whole programming language to play with; when "X is hard to read" becomes a problem, it may be time to reach for a better tool. If the context manager line is getting unwieldy, it's often a sign it's time to factor it out to a dedicated helper, or break it up into multiple with statements :) Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ncoghlan at gmail.com Sun Aug 17 03:28:48 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 17 Aug 2014 11:28:48 +1000 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? Message-ID: I've seen a few people on python-ideas express the assumption that there will be another Py3k style compatibility break for Python 4.0. I've also had people express the concern that "you broke compatibility in a major way once, how do we know you won't do it again?". Both of those contrast strongly with Guido's stated position that he never wants to go through a transition like the 2->3 one again. Barry wrote PEP 404 to make it completely explicit that python-dev had no plans to create a Python 2.8 release. Would it be worth writing a similarly explicit "not an option" PEP explaining that the regular deprecation and removal process (roughly documented in PEP 387) is the *only* deprecation and removal process? It could also point to the fact that we now have PEP 411 (provisional APIs) to help reduce our chances of being locked indefinitely into design decisions we aren't happy with. If folks (most significantly, Guido) are amenable to the idea, it shouldn't take long to put such a PEP together, and I think it could help reduce some of the confusion around the expectations for Python 4.0 and the evolution of 3.x in general. Regards, Nick.
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From steve at pearwood.info Sun Aug 17 04:39:02 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Sun, 17 Aug 2014 12:39:02 +1000 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? In-Reply-To: References: Message-ID: <20140817023902.GM4525@ando> On Sun, Aug 17, 2014 at 11:28:48AM +1000, Nick Coghlan wrote: > I've seen a few people on python-ideas express the assumption that > there will be another Py3k style compatibility break for Python 4.0. I used to refer to Python 4000 as the hypothetical compatibility break version. Now I refer to Python 5000. > I've also had people express the concern that "you broke compatibility > in a major way once, how do we know you won't do it again?". Even languages with ISO standards behind them and release schedules measured in decades make backward-incompatible changes. For example, I see that Fortran 95 (despite being classified as a minor revision) deleted at least six language features. To expect Python to never break compatibility again is asking too much. But I think it is fair to promise that Python won't make *so many* backwards incompatible changes all at once again, and has no concrete plans to make backwards incompatible changes to syntax in the foreseeable future. (That is, not before Python 5000 :-) [...] > If folks (most signficantly, Guido) are amenable to the idea, it > shouldn't take long to put such a PEP together, and I think it could > help reduce some of the confusions around the expectations for Python > 4.0 and the evolution of 3.x in general. I think it's a good idea, so long as there's no implied or explicit promise that the Python language is now set in stone never to change.
-- Steven From guido at python.org Sun Aug 17 04:43:39 2014 From: guido at python.org (Guido van Rossum) Date: Sat, 16 Aug 2014 19:43:39 -0700 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? In-Reply-To: References: Message-ID: On Sat, Aug 16, 2014 at 6:28 PM, Nick Coghlan wrote: > I've seen a few people on python-ideas express the assumption that > there will be another Py3k style compatibility break for Python 4.0. > There used to be only joking references to 4.0 or py4k -- how things have changed! I've seen nothing that a gentle correction on the list couldn't fix though. > I've also had people express the concern that "you broke compatibility > in a major way once, how do we know you won't do it again?". > Well, they won't, really. You can't predict the future. But really, that's a pretty poor way to say "please don't do it again." I'm not sure why, but I hate when someone starts a suggestion or a question with "why doesn't Python ..." and I have to fight the urge to reply in a flippant way without answering the real question. (And just now I did it again.) I suppose this phrasing may actually be meant as a form of politeness, but to me it often sounds passive-aggressive, pretend-polite. (Could it be a matter of cultural difference? The internet is full of broken English, my own often included.) > Both of those contrast strongly with Guido's stated position that he > never wants to go through a transition like the 2->3 one again. > Right. What's more, when I say that, I don't mean that you should wait until I retire -- I think it's genuinely a bad idea. I also don't expect that it'll be necessary -- in fact, I am counting on tools (e.g. static analysis!) to improve to the point where there won't be a reason for such a transition. (Don't understand this to mean that we should never deprecate things. Deprecations will happen, they are necessary for the evolution of any programming language. 
But they won't ever hurt in the way that Python 3 hurt.) > Barry wrote PEP 404 to make it completely explicit that python-dev had > no plans to create a Python 2.8 release. Would it be worth writing a > similarly explicit "not an option" PEP explaining that the regular > deprecation and removal process (roughly documented in PEP 387) is the > *only* deprecation and removal process? It could also point to the > fact that we now have PEP 411 (provisional APIs) to help reduce our > chances of being locked indefinitely into design decisions we aren't > happy with. > > If folks (most significantly, Guido) are amenable to the idea, it > shouldn't take long to put such a PEP together, and I think it could > help reduce some of the confusions around the expectations for Python > 4.0 and the evolution of 3.x in general. > But what should it say? It's easy to say there won't be a 2.8 because we already have 3.0 (and 3.1, and 3.2, and ...). But can we really say there won't be a 4.0? Never? Why not? Who is to say that at some point some folks won't be going off on their own to design a whole new language and name it Python 4, following Larry Wall's Perl 6 example? I think it makes sense to occasionally remind the more eager contributors that we want the future to come gently (that's not to say in our sleep :-). But I'm not sure a PEP is the best form for such a reminder. Even the Pope has a Twitter account. :-) -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From lukasz at langa.pl Sun Aug 17 04:46:45 2014 From: lukasz at langa.pl (=?utf-8?Q?=C5=81ukasz_Langa?=) Date: Sat, 16 Aug 2014 19:46:45 -0700 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? 
In-Reply-To: References: Message-ID: <0C857DB9-7616-4317-BF6B-AF44B4219DEA@langa.pl> On Aug 16, 2014, at 6:28 PM, Nick Coghlan wrote: > I've seen a few people on python-ideas express the assumption that > there will be another Py3k style compatibility break for Python 4.0. Whenever I mention Python 4 or PEP 4000, it's always a joke. However, saying upfront that we will never break compatibility is a bold statement. Technically even introducing new syntax breaks compatibility. Not to mention fixing long-lasting bugs. So you'd need to split hairs just defining what we mean by a "major compatibility break". Worse, if we ever did a change that we feel is within the bounds of the contract, you'd have someone pointing at that PEP saying that they feel we broke the contract. Splitting hairs again. PEP 404 was necessary for some people/organizations to move on. I fail to see how PEP 4000 (or rather PEP 4004? ;-)) would be useful in that context. -- Best regards, Łukasz Langa WWW: http://lukasz.langa.pl/ Twitter: @llanga IRC: ambv on #python-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From lukasz at langa.pl Sun Aug 17 04:49:18 2014 From: lukasz at langa.pl (=?utf-8?Q?=C5=81ukasz_Langa?=) Date: Sat, 16 Aug 2014 19:49:18 -0700 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? In-Reply-To: References: Message-ID: <2412C837-9DD9-4F59-99A9-899524B3D2A1@langa.pl> On Aug 16, 2014, at 7:43 PM, Guido van Rossum wrote: > But can we really say there won't be a 4.0? Never? Why not? Who is to say that at some point some folks won't be going off on their own to design a whole new language and name it Python 4, following Larry Wall's Perl 6 example? If they ever do, please make them not follow the Perl 6 example! -- Best regards, Łukasz Langa WWW: http://lukasz.langa.pl/ Twitter: @llanga IRC: ambv on #python-dev -------------- next part -------------- An HTML attachment was scrubbed...
URL: From ncoghlan at gmail.com Sun Aug 17 05:48:41 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 17 Aug 2014 13:48:41 +1000 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? In-Reply-To: References: Message-ID: On 17 August 2014 12:43, Guido van Rossum wrote: > On Sat, Aug 16, 2014 at 6:28 PM, Nick Coghlan wrote: >> I've also had people express the concern that "you broke compatibility >> in a major way once, how do we know you won't do it again?". > > > Well, they won't, really. You can't predict the future. But really, that's a > pretty poor way to say "please don't do it again." > > I'm not sure why, but I hate when someone starts a suggestion or a question > with "why doesn't Python ..." and I have to fight the urge to reply in a > flippant way without answering the real question. (And just now I did it > again.) > > I suppose this phrasing may actually be meant as a form of politeness, but > to me it often sounds passive-aggressive, pretend-polite. (Could it be a > matter of cultural difference? The internet is full of broken English, my > own often included.) I don't mind it if the typical answers are accepted as valid: * "because it has these downsides, and those are considered to outweigh the benefits" * "because it's difficult, and it never bothered anyone enough for them to put in the work to do something about it" Those aren't always obvious, especially to folks that don't have a lot of experience with long lived software projects (I had only just started high school when Python was first released!), so I don't mind explaining them when I have time. >> Both of those contrast strongly with Guido's stated position that he >> never wants to go through a transition like the 2->3 one again. > > Right. What's more, when I say that, I don't mean that you should wait until > I retire -- I think it's genuinely a bad idea. 
Absolutely agreed - I think the Unicode change was worthwhile (even with the impact proving to be higher than expected), but there isn't any such fundamental change to the data model lurking for Python 3. > I also don't expect that it'll be necessary -- in fact, I am counting on > tools (e.g. static analysis!) to improve to the point where there won't be a > reason for such a transition. The fact that things like Hylang and MacroPy can already run on the CPython VM also shows that other features (like import hooks and the AST compiler) have evolved to the point where the Python data model and runtime semantics can be more effectively decoupled from syntactic details. > (Don't understand this to mean that we should never deprecate things. > Deprecations will happen, they are necessary for the evolution of any > programming language. But they won't ever hurt in the way that Python 3 > hurt.) Right. I think Python 2 has been stable for so long that I sometimes wonder if folks forget (or never knew?) we used to deprecate things within the Python 2 series as well, such that code that ran on Python 2.x wasn't necessarily guaranteed to run on Python 2.(x+2). "Never deprecate anything" is a recipe for unbounded growth in complexity. Benjamin has made a decent start on documenting that normal deprecation process in PEP 387, so I'd also suggest refining that a bit and getting it to "Accepted" as part of any explicit "Python 4.x won't be as disruptive as 3.x" clarification. >> no plans to create a Python 2.8 release. Would it be worth writing a >> similarly explicit "not an option" PEP explaining that the regular >> deprecation and removal process (roughly documented in PEP 387) is the >> *only* deprecation and removal process? It could also point to the >> fact that we now have PEP 411 (provisional APIs) to help reduce our >> chances of being locked indefinitely into design decisions we aren't >> happy with. 
>> >> If folks (most significantly, Guido) are amenable to the idea, it >> >> shouldn't take long to put such a PEP together, and I think it could >> help reduce some of the confusions around the expectations for Python >> 4.0 and the evolution of 3.x in general. > > But what should it say? The specific things I was thinking we could point out were:

- PEP 387, documenting the normal deprecation process that existed even in Python 2
- highlighting the increased preference for "documented deprecation only" in cases where maintaining something isn't actively causing problems, there are just better alternatives now available
- PEP 411, the (still relatively new) provisional API concept
- PEP 405, adding pyvenv as a standard part of Python
- PEP 453, better integrating PyPI into the recommended way of working with the language

Those all help change the way the language evolves, as they reduce the pressure to rush things into the standard library before they're ready, while at the same time giving us a chance to publish "not quite ready to be locked down" features for very broad feedback. I'd also point out that the "variable encodings" to "Unicode" transition for text handling is an industry wide issue, one which even operating systems are still struggling with in some cases. POSIX-only software that only needs to run on modern platforms can assume UTF-8, while modern Windows and Java only software can largely assume UTF-16-LE, but anyone trying to integrate with both is going to have a far more interesting time of things (as we've discovered the hard way). That transition is the core thing that sometimes makes migrating from Python 2 to Python 3 non-trivial - even the changes to dict are relatively simple to address by comparison. > It's easy to say there won't be a 2.8 because we > already have 3.0 (and 3.1, and 3.2, and ...). But can we really say there > won't be a 4.0? Never? Why not?
I'm assuming there *will* be a 4.0 - I'd just like to see it be "the release after Python 3.9", rather than being spectacularly different from the preceding 3.x releases. That's similar to the way that the Linux kernel shifted to the 3.x series not because of any particular milestone, but just due to the sheer weight of accumulated changes relative to the early 2.x releases. > Who is to say that at some point some folks > won't be going off on their own to design a whole new language and name it > Python 4, following Larry Wall's Perl 6 example? Based on the examples of both Python 3 and Perl 6, I'd personally strongly advocate for such a project to be a new language with a different name, even if it was created and maintained by python-dev :) > I think it makes sense to occasionally remind the more eager contributors > that we want the future to come gently (that's not to say in our sleep :-). > But I'm not sure a PEP is the best form for such a reminder. Even the Pope > has a Twitter account. :-) Yeah, I'm not sure a PEP is the right way either. However, it seemed to get the point across for both PEP 404 ("no Python 2.8") and PEP 394 ("POSIX platforms: don't make /usr/bin/python refer to Python 3, you break things when you do that"), so I figured I'd at least raise the suggestion on this topic as well. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From guido at python.org Sun Aug 17 07:08:37 2014 From: guido at python.org (Guido van Rossum) Date: Sat, 16 Aug 2014 22:08:37 -0700 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? In-Reply-To: References: Message-ID: I think this would be a great topic for a blog post. Once you've written it I can even bless it by Tweeting about it. :-) PS. Why isn't PEP 387 accepted yet? 
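As background for the PEP 387 references in this thread: the process it documents boils down to emitting DeprecationWarning for at least one release before anything is removed. A minimal sketch of that pattern (the helper names here are made up purely for illustration):

```python
import warnings

def new_helper():
    return "result"

def old_helper():
    # Deprecated alias retained for a release or two: warn first,
    # remove later, per the normal deprecation process.
    warnings.warn("old_helper() is deprecated; use new_helper()",
                  DeprecationWarning, stacklevel=2)
    return new_helper()

# Capture the warning so the sketch can demonstrate it was emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = old_helper()

print(value, caught[0].category.__name__)  # result DeprecationWarning
```

Callers keep working for the deprecation period, while test suites running with warnings enabled see the notice immediately.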
On Sat, Aug 16, 2014 at 8:48 PM, Nick Coghlan wrote: [...] -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Sun Aug 17 07:34:16 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 17 Aug 2014 15:34:16 +1000 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? In-Reply-To: References: Message-ID: On 17 August 2014 15:08, Guido van Rossum wrote: > I think this would be a great topic for a blog post.
Once you've written it > I can even bless it by Tweeting about it. :-) Sounds like a plan - I'll try to put together something coherent this week :) > PS. Why isn't PEP 387 accepted yet? Not sure - it mostly looks correct to me. I suspect it just fell off the radar since it's a "describe what we're already doing anyway" kind of document. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From raymond.hettinger at gmail.com Sun Aug 17 10:13:39 2014 From: raymond.hettinger at gmail.com (Raymond Hettinger) Date: Sun, 17 Aug 2014 01:13:39 -0700 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: Message-ID: On Aug 14, 2014, at 10:50 PM, Nick Coghlan wrote: > Key points in the proposal: > > * deprecate passing integers to bytes() and bytearray() I'm opposed to removing this part of the API. It has proven useful and the alternative isn't very nice. Declaring the size of fixed-length arrays is not a new concept and is widely adopted in other languages. One principal use case for the bytearray is creating and manipulating binary data. Initializing to zero is a common operation and should remain part of the core API (consider why we now have list.copy() even though copying with a slice remains possible and efficient). I and my clients have taken advantage of this feature and it reads nicely. The proposed deprecation would break our code and not actually make anything better. Another thought is that the core devs should be very reluctant to deprecate anything we don't have to while the 2 to 3 transition is still in progress. Every new deprecation of APIs that existed in Python 2.7 just adds another obstacle to converting code. Individually, the differences are trivial. Collectively, they present a good reason to never migrate code to Python 3. Raymond -------------- next part -------------- An HTML attachment was scrubbed...
URL: From ncoghlan at gmail.com Sun Aug 17 10:28:17 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 17 Aug 2014 18:28:17 +1000 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? In-Reply-To: References: Message-ID: On 17 August 2014 15:34, Nick Coghlan wrote: > On 17 August 2014 15:08, Guido van Rossum wrote: >> I think this would be a great topic for a blog post. Once you've written it >> I can even bless it by Tweeting about it. :-) > > Sounds like a plan - I'll try to put together something coherent this week :) OK, make that "this afternoon": http://www.curiousefficiency.org/posts/2014/08/python-4000.html :) Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ncoghlan at gmail.com Sun Aug 17 10:41:05 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 17 Aug 2014 18:41:05 +1000 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: Message-ID: On 17 August 2014 18:13, Raymond Hettinger wrote: > > On Aug 14, 2014, at 10:50 PM, Nick Coghlan wrote: > > Key points in the proposal: > > * deprecate passing integers to bytes() and bytearray() > > > I'm opposed to removing this part of the API. It has proven useful > and the alternative isn't very nice. Declaring the size of fixed length > arrays is not a new concept and is widely adopted in other languages. > One principal use case for the bytearray is creating and manipulating > binary data. Initializing to zero is common operation and should remain > part of the core API (consider why we now have list.copy() even though > copying with a slice remains possible and efficient). That's why the PEP proposes adding a "zeros" method, based on the name of the corresponding NumPy construct. The status quo has some very ugly failure modes when an integer is passed unexpectedly, and tries to create a large buffer, rather than throwing a type error. 
> I and my clients have taken advantage of this feature and it reads nicely.

If I see "bytearray(10)" there is nothing there that suggests "this creates an array of length 10 and initialises it to zero" to me. I'd be more inclined to guess it would be equivalent to "bytearray([10])". "bytearray.zeros(10)", on the other hand, is relatively clear, independently of user expectations.

> The proposed deprecation would break our code and not actually make
> anything better.
>
> Another thought is that the core devs should be very reluctant to deprecate
> anything we don't have to while the 2 to 3 transition is still in progress.
> Every new deprecation of APIs that existed in Python 2.7 just adds another
> obstacle to converting code. Individually, the differences are trivial.
> Collectively, they present a good reason to never migrate code to Python 3.

This is actually one of the inconsistencies between the Python 2 and 3 binary APIs:

Python 2.7.5 (default, Jun 25 2014, 10:19:55)
[GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> bytes(10)
'10'
>>> bytearray(10)
bytearray(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')

Users wanting well-behaved binary sequences in Python 2.7 would be well advised to use the "future" module to get a full backport of the actual Python 3 bytes type, rather than the approximation that is the 8-bit str in Python 2. And once they do that, they'll be able to track the evolution of the Python 3 binary sequence behaviour without any further trouble. That said, I don't really mind how long the deprecation cycle is. I'd be fine with fully supporting both in 3.5 (2015), deprecating the main constructor in favour of the explicit zeros() method in 3.6 (2017) and dropping the legacy behaviour in 3.7 (2018) Regards, Nick. 
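The Python 3 side of the inconsistency Nick describes can be checked directly — with an integer argument, both constructors zero-fill, which is exactly the surprising behaviour the deprecation debate centres on (a minimal sketch, runnable on any Python 3):

```python
# In Python 3, passing an integer n to bytes() or bytearray() creates
# a zero-filled object of length n -- unlike Python 2, where bytes is
# just str and bytes(10) gives the two-character string '10'.
buf = bytes(10)
arr = bytearray(10)

assert buf == b'\x00' * 10
assert arr == bytearray(b'\x00' * 10)

# The failure mode mentioned above: an integer passed where a sequence
# of ints was intended silently allocates a buffer instead of raising.
assert bytes(10) != bytes([10])   # b'\x00' * 10 vs. b'\n'
```
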
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From senthil at uthcode.com Sun Aug 17 11:37:44 2014 From: senthil at uthcode.com (Senthil Kumaran) Date: Sun, 17 Aug 2014 15:07:44 +0530 Subject: [Python-Dev] [Python-checkins] cpython (merge 3.4 -> default): Issue #22165: Fixed test_undecodable_filename on non-UTF-8 locales. In-Reply-To: <3hbXwq1vw8z7LjQ@mail.python.org> References: <3hbXwq1vw8z7LjQ@mail.python.org> Message-ID: This change is okay and not harmful. But I think it might still not fix the encoding issue that we encountered on Mac.

[localhost cpython]$ hg log -l 1
changeset:   92128:7cdc941d5180
tag:         tip
parent:      92126:3153a400b739
parent:      92127:a894b629bbea
user:        Serhiy Storchaka
date:        Sun Aug 17 12:21:06 2014 +0300
description: Issue #22165: Fixed test_undecodable_filename on non-UTF-8 locales.

[localhost cpython]$ ./python.exe -m test.regrtest test_httpservers
[1/1] test_httpservers
test test_httpservers failed -- Traceback (most recent call last):
  File "/Users/skumaran/python/cpython/Lib/test/test_httpservers.py", line 283, in test_undecodable_filename
    .encode(enc, 'surrogateescape'), body)
AssertionError: b'href="%40test_5809_tmp%ED%B3%A7w%ED%B3%B0.txt"' not found in b'\n\n\n\nDirectory listing for tmpj54lc8m1/\n\n\n

[garbled HTML body of the directory-listing response omitted: "Directory listing for tmpj54lc8m1/" plus the file links]
\n\n\n' 1 test failed: test_httpservers

The underlying problem seems to be a difference in how os.listdir(), which uses the C API, and os.fsdecode() represent the decoded chars. Ref: http://bugs.python.org/issue22165#msg225428

On Sun, Aug 17, 2014 at 2:52 PM, serhiy.storchaka < python-checkins at python.org> wrote:

> http://hg.python.org/cpython/rev/7cdc941d5180
> changeset: 92128:7cdc941d5180
> parent: 92126:3153a400b739
> parent: 92127:a894b629bbea
> user: Serhiy Storchaka
> date: Sun Aug 17 12:21:06 2014 +0300
> summary:
> Issue #22165: Fixed test_undecodable_filename on non-UTF-8 locales.
>
> files:
> Lib/test/test_httpservers.py | 5 +++--
> 1 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/Lib/test/test_httpservers.py b/Lib/test/test_httpservers.py
> --- a/Lib/test/test_httpservers.py
> +++ b/Lib/test/test_httpservers.py
> @@ -272,6 +272,7 @@
>      @unittest.skipUnless(support.TESTFN_UNDECODABLE,
>                           'need support.TESTFN_UNDECODABLE')
>      def test_undecodable_filename(self):
> +        enc = sys.getfilesystemencoding()
>          filename = os.fsdecode(support.TESTFN_UNDECODABLE) + '.txt'
>          with open(os.path.join(self.tempdir, filename), 'wb') as f:
>              f.write(support.TESTFN_UNDECODABLE)
> @@ -279,9 +280,9 @@
>          body = self.check_status_and_reason(response, 200)
>          quotedname = urllib.parse.quote(filename, errors='surrogatepass')
>          self.assertIn(('href="%s"' % quotedname)
> -                      .encode('utf-8', 'surrogateescape'), body)
> +                      .encode(enc, 'surrogateescape'), body)
>          self.assertIn(('>%s<' % html.escape(filename))
> -                      .encode('utf-8', 'surrogateescape'), body)
> +                      .encode(enc, 'surrogateescape'), body)
>          response = self.request(self.tempdir_name + '/' + quotedname)
>          self.check_status_and_reason(response, 200,
>                                       data=support.TESTFN_UNDECODABLE)
>
> --
> Repository URL: http://hg.python.org/cpython
>
> _______________________________________________
> Python-checkins mailing list
> Python-checkins at python.org
> https://mail.python.org/mailman/listinfo/python-checkins
>
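The surrogateescape mechanism that os.fsdecode() and the patched test rely on can be seen in isolation — undecodable bytes are smuggled into str as lone surrogate code points and restored exactly on re-encoding (a minimal sketch, independent of the test above):

```python
# os.fsdecode() uses the filesystem encoding with the 'surrogateescape'
# error handler: bytes that cannot be decoded become lone surrogates in
# the U+DC80..U+DCFF range, and encoding with the same handler restores
# the original bytes exactly.
raw = b'abc\xff'                       # \xff is not valid UTF-8
text = raw.decode('utf-8', 'surrogateescape')

assert text == 'abc\udcff'             # undecodable byte -> lone surrogate
assert text.encode('utf-8', 'surrogateescape') == raw   # lossless round-trip
```
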
-------------- next part -------------- An HTML attachment was scrubbed... URL: From francismb at email.de Sun Aug 17 11:50:36 2014 From: francismb at email.de (francis) Date: Sun, 17 Aug 2014 11:50:36 +0200 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? In-Reply-To: References: Message-ID: <53F07AEC.8040908@email.de> On 08/17/2014 03:28 AM, Nick Coghlan wrote: > I've seen a few people on python-ideas express the assumption that > there will be another Py3k style compatibility break for Python 4.0. > > I've also had people express the concern that "you broke compatibility > in a major way once, how do we know you won't do it again?". > Why not just allow those changes that can be automatically changed by a tool/script applied on the code (a la go, 2to3, 3.Ato3.B, ...)? From barry at python.org Sun Aug 17 15:29:19 2014 From: barry at python.org (Barry Warsaw) Date: Sun, 17 Aug 2014 09:29:19 -0400 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? In-Reply-To: References: Message-ID: <20140817092919.1f41e88a@limelight.wooz.org> On Aug 16, 2014, at 07:43 PM, Guido van Rossum wrote: >(Don't understand this to mean that we should never deprecate things. >Deprecations will happen, they are necessary for the evolution of any >programming language. But they won't ever hurt in the way that Python 3 >hurt.) It would be useful to explore what causes the most pain in the 2->3 transition? IMHO, it's not the deprecations or changes such as print -> print(). It's the bytes/str split - a fundamental change to core and common data types. The question then is whether you foresee any similar looming pervasive change? [*] -Barry [*] I was going to add a joke about mandatory static type checking, but sometimes jokes are blown up into apocalyptic prophesy around here. 
;) From storchaka at gmail.com Sun Aug 17 16:47:57 2014 From: storchaka at gmail.com (Serhiy Storchaka) Date: Sun, 17 Aug 2014 17:47:57 +0300 Subject: [Python-Dev] "embedded NUL character" exceptions Message-ID: Currently, most functions that accept a string argument which is then passed to a C function as a NUL-terminated string reject strings with an embedded NUL character and raise TypeError. ValueError looks more appropriate here, because the argument type is correct (str); only its value is wrong. But this is a backward-incompatible change.

I think that we should get rid of this legacy inconsistency sooner or later. Why not fix it right now? I have opened an issue on the tracker [1], but this issue requires broader discussion.

[1] http://bugs.python.org/issue22215

From guido at python.org Sun Aug 17 17:13:52 2014 From: guido at python.org (Guido van Rossum) Date: Sun, 17 Aug 2014 08:13:52 -0700 Subject: [Python-Dev] "embedded NUL character" exceptions In-Reply-To: References: Message-ID: Sounds good to me. On Sun, Aug 17, 2014 at 7:47 AM, Serhiy Storchaka wrote:

> Currently, most functions that accept a string argument which is then passed
> to a C function as a NUL-terminated string reject strings with an embedded
> NUL character and raise TypeError. ValueError looks more appropriate here,
> because the argument type is correct (str); only its value is wrong. But this
> is a backward-incompatible change.
>
> I think that we should get rid of this legacy inconsistency sooner or
> later. Why not fix it right now? I have opened an issue on the tracker [1],
> but this issue requires broader discussion. 
> > [1] http://bugs.python.org/issue22215 > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ > guido%40python.org > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From raymond.hettinger at gmail.com Sun Aug 17 19:07:09 2014 From: raymond.hettinger at gmail.com (Raymond Hettinger) Date: Sun, 17 Aug 2014 10:07:09 -0700 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: Message-ID: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> On Aug 17, 2014, at 1:41 AM, Nick Coghlan wrote: > If I see "bytearray(10)" there is nothing there that suggests "this > creates an array of length 10 and initialises it to zero" to me. I'd > be more inclined to guess it would be equivalent to "bytearray([10])". > > "bytearray.zeros(10)", on the other hand, is relatively clear, > independently of user expectations. Zeros would have been great but that should have been done originally. The time to get API design right is at inception. Now, you're just breaking code and invalidating any published examples. >> >> Another thought is that the core devs should be very reluctant to deprecate >> anything we don't have to while the 2 to 3 transition is still in progress. >> Every new deprecation of APIs that existed in Python 2.7 just adds another >> obstacle to converting code. Individually, the differences are trivial. >> Collectively, they present a good reason to never migrate code to Python 3. > > This is actually one of the inconsistencies between the Python 2 and 3 > binary APIs: However, bytearray(n) is the same in both Python 2 and Python 3. Changing it in Python 3 increases the gulf between the two. 
The further we let Python 3 diverge from Python 2, the less likely that people will convert their code and the harder you make it to write code that runs under both. FWIW, I've been teaching Python full time for three years. I cover the use of bytearray(n) in my classes and not a single person out of 3000+ engineers has had a problem with it. I seriously question the PEP's assertion that there is a real problem to be solved (i.e. that people are baffled by bytearray(bufsiz)) and that the problem is sufficiently painful to warrant the headaches that go along with API changes. The other proposal to add bytearray.byte(3) should probably be named bytearray.from_byte(3) for clarity. That said, I question whether there is actually a use case for this. I have never seen code that has a need to create a byte array of length one from a single integer. For the most part, the API will be easiest to learn if it matches what we do for lists and for array.array. Sorry Nick, but I think you're making the API worse instead of better. This API isn't perfect but it isn't flat-out broken either. There is some unfortunate asymmetry between bytes() and bytearray() in Python 2, but that ship has sailed. The current API for Python 3 is pretty good (though there is still a tension between wanting to be like lists and like strings both at the same time). Raymond P.S. The most important problem in the Python world now is getting Python 2 users to adopt Python 3. The core devs need to develop a strong distaste for anything that makes that problem harder. -------------- next part -------------- An HTML attachment was scrubbed... 
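The preallocation pattern Raymond alludes to — a zero-filled buffer whose fields are written at fixed offsets — can be sketched like this (the field layout below is invented for illustration, loosely modelled on an IPv4 header, not a real protocol definition):

```python
import struct

# A zero-filled, fixed-size buffer, then fields poked in at known
# offsets -- the style of code the bytearray(n) constructor supports.
header = bytearray(20)                 # 20 zero bytes, IPv4-header sized
header[0] = (4 << 4) | 5               # version=4, IHL=5 in the first byte
struct.pack_into('!H', header, 2, 20)  # big-endian total-length field at offset 2

assert header[0] == 0x45
assert header[2:4] == b'\x00\x14'      # 20 as a big-endian 16-bit value
assert header[4:] == b'\x00' * 16      # the rest is still zeroed
```
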
URL: From donald at stufft.io Sun Aug 17 19:16:31 2014 From: donald at stufft.io (Donald Stufft) Date: Sun, 17 Aug 2014 13:16:31 -0400 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> Message-ID: <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> > On Aug 17, 2014, at 1:07 PM, Raymond Hettinger wrote: > > > On Aug 17, 2014, at 1:41 AM, Nick Coghlan > wrote: > >> If I see "bytearray(10)" there is nothing there that suggests "this >> creates an array of length 10 and initialises it to zero" to me. I'd >> be more inclined to guess it would be equivalent to "bytearray([10])". >> >> "bytearray.zeros(10)", on the other hand, is relatively clear, >> independently of user expectations. > > Zeros would have been great but that should have been done originally. > The time to get API design right is at inception. > Now, you're just breaking code and invalidating any published examples. > >>> >>> Another thought is that the core devs should be very reluctant to deprecate >>> anything we don't have to while the 2 to 3 transition is still in progress. >>> Every new deprecation of APIs that existed in Python 2.7 just adds another >>> obstacle to converting code. Individually, the differences are trivial. >>> Collectively, they present a good reason to never migrate code to Python 3. >> >> This is actually one of the inconsistencies between the Python 2 and 3 >> binary APIs: > > However, bytearray(n) is the same in both Python 2 and Python 3. > Changing it in Python 3 increases the gulf between the two. > > The further we let Python 3 diverge from Python 2, the less likely that > people will convert their code and the harder you make it to write code > that runs under both. > > FWIW, I've been teaching Python full time for three years. 
I cover the
> use of bytearray(n) in my classes and not a single person out of 3000+
> engineers has had a problem with it. I seriously question the PEP's
> assertion that there is a real problem to be solved (i.e. that people
> are baffled by bytearray(bufsiz)) and that the problem is sufficiently
> painful to warrant the headaches that go along with API changes.
>
> The other proposal to add bytearray.byte(3) should probably be named
> bytearray.from_byte(3) for clarity. That said, I question whether there is
> actually a use case for this. I have never seen code that has a
> need to create a byte array of length one from a single integer.
> For the most part, the API will be easiest to learn if it matches what
> we do for lists and for array.array.
>
> Sorry Nick, but I think you're making the API worse instead of better.
> This API isn't perfect but it isn't flat-out broken either. There is some
> unfortunate asymmetry between bytes() and bytearray() in Python 2,
> but that ship has sailed. The current API for Python 3 is pretty good
> (though there is still a tension between wanting to be like lists and like
> strings both at the same time).
>
> Raymond
>
> P.S. The most important problem in the Python world now is getting
> Python 2 users to adopt Python 3. The core devs need to develop
> a strong distaste for anything that makes that problem harder.

For the record I've had all of the problems that Nick states and I'm +1 on this change.

--- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ethan at stoneleaf.us Sun Aug 17 20:33:52 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Sun, 17 Aug 2014 11:33:52 -0700 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> Message-ID: <53F0F590.7040806@stoneleaf.us> On 08/17/2014 10:16 AM, Donald Stufft wrote: > > For the record I've had all of the problems that Nick states and I'm > +1 on this change. I've had many of the problems Nick states and I'm also +1. -- ~Ethan~ From graffatcolmingov at gmail.com Sun Aug 17 20:40:34 2014 From: graffatcolmingov at gmail.com (Ian Cordasco) Date: Sun, 17 Aug 2014 13:40:34 -0500 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> Message-ID: On Aug 17, 2014 12:17 PM, "Donald Stufft" wrote: >> On Aug 17, 2014, at 1:07 PM, Raymond Hettinger wrote: >> >> >> On Aug 17, 2014, at 1:41 AM, Nick Coghlan wrote: >> >>> If I see "bytearray(10)" there is nothing there that suggests "this >>> creates an array of length 10 and initialises it to zero" to me. I'd >>> be more inclined to guess it would be equivalent to "bytearray([10])". >>> >>> "bytearray.zeros(10)", on the other hand, is relatively clear, >>> independently of user expectations. >> >> >> Zeros would have been great but that should have been done originally. >> The time to get API design right is at inception. >> Now, you're just breaking code and invalidating any published examples. >> >>>> >>>> Another thought is that the core devs should be very reluctant to deprecate >>>> anything we don't have to while the 2 to 3 transition is still in progress. >>>> Every new deprecation of APIs that existed in Python 2.7 just adds another >>>> obstacle to converting code. 
Individually, the differences are trivial. >>>> Collectively, they present a good reason to never migrate code to Python 3. >>> >>> >>> This is actually one of the inconsistencies between the Python 2 and 3 >>> binary APIs: >> >> >> However, bytearray(n) is the same in both Python 2 and Python 3. >> Changing it in Python 3 increases the gulf between the two. >> >> The further we let Python 3 diverge from Python 2, the less likely that >> people will convert their code and the harder you make it to write code >> that runs under both. >> >> FWIW, I've been teaching Python full time for three years. I cover the >> use of bytearray(n) in my classes and not a single person out of 3000+ >> engineers have had a problem with it. I seriously question the PEP's >> assertion that there is a real problem to be solved (i.e. that people >> are baffled by bytearray(bufsiz)) and that the problem is sufficiently >> painful to warrant the headaches that go along with API changes. >> >> The other proposal to add bytearray.byte(3) should probably be named >> bytearray.from_byte(3) for clarity. That said, I question whether there is >> actually a use case for this. I have never seen seen code that has a >> need to create a byte array of length one from a single integer. >> For the most part, the API will be easiest to learn if it matches what >> we do for lists and for array.array. >> >> Sorry Nick, but I think you're making the API worse instead of better. >> This API isn't perfect but it isn't flat-out broken either. There is some >> unfortunate asymmetry between bytes() and bytearray() in Python 2, >> but that ship has sailed. The current API for Python 3 is pretty good >> (though there is still a tension between wanting to be like lists and like >> strings both at the same time). >> >> >> Raymond >> >> >> P.S. The most important problem in the Python world now is getting >> Python 2 users to adopt Python 3. 
The core devs need to develop >> a strong distaste for anything that makes that problem harder. >> > > For the record I've had all of the problems that Nick states and I'm > +1 on this change. I've run into these problems as well, but I'm swayed by Raymond's argument regarding bytearray's constructor. I wouldn't be averse to adding zeroes (for some parity between bytes and bytearray) to that but I'm not sure deprecating the behaviour of bytearray's constructor is necessary. (Whilst on my phone I only replied to Donald, so I'm forwarding this to the list.) From raymond.hettinger at gmail.com Sun Aug 17 23:19:17 2014 From: raymond.hettinger at gmail.com (Raymond Hettinger) Date: Sun, 17 Aug 2014 14:19:17 -0700 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <53F0F590.7040806@stoneleaf.us> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <53F0F590.7040806@stoneleaf.us> Message-ID: <37DD82D8-CC83-4D03-A854-3229BAAF8C1D@gmail.com> On Aug 17, 2014, at 11:33 AM, Ethan Furman wrote:

> I've had many of the problems Nick states and I'm also +1.

There are two code snippets below which were taken from the standard library. Are you saying that:

1) you don't understand the code (as the pep suggests)
2) you are willing to break that code and everything like it
3) and it would be more elegantly expressed as:

    charmap = bytearray.zeros(256)
and
    mapping = bytearray.zeros(256)

At work, I have network engineers creating IPv4 headers and other structures with bytearrays initialized to zeros. Do you really want to break all their code? Nowhere else in Python do we create buffers that way. Code like "msg, who = s.recvfrom(256)" is the norm.

Also, it is unclear if you're saying that you have an actual use case for this part of the proposal?

    ba = bytearray.byte(65)

And then the code would be better, clearer, and faster than the currently working form? 
    ba = bytearray([65])

Does there really need to be a special case for constructing a single byte? To me, that is akin to proposing "list.from_int(65)" as an important special case to replace "[65]". If you must muck with the ever-changing bytes() API, then please leave the bytearray() API alone. I think we should show some respect for code that is currently working and is cleanly expressible in both Python 2 and Python 3. We aren't winning users with API churn.

FWIW, I'm guessing that the differing viewpoints in the thread stem mainly from the proponents' experiences with bytes() rather than from experience with bytearray(), which doesn't seem to have any usage problems in the wild. I've never seen a developer say they didn't understand what "buf = bytearray(1024)" means. That is not an actual problem that needs solving (or breaking). What may be an actual problem is code like "char = bytes(1024)" though I'm unclear what a user might have actually been trying to do with code like that.

Raymond

----------- excerpts from Lib/sre_compile.py ---------------

    charmap = bytearray(256)
    for op, av in charset:
        while True:
            try:
                if op is LITERAL:
                    charmap[fixup(av)] = 1
                elif op is RANGE:
                    for i in range(fixup(av[0]), fixup(av[1])+1):
                        charmap[i] = 1
                elif op is NEGATE:
                    out.append((op, av))
                else:
                    tail.append((op, av))
    ...
    charmap = bytes(charmap) # should be hashable
    comps = {}
    mapping = bytearray(256)
    block = 0
    data = bytearray()
    for i in range(0, 65536, 256):
        chunk = charmap[i: i + 256]
        if chunk in comps:
            mapping[i // 256] = comps[chunk]
        else:
            mapping[i // 256] = comps[chunk] = block
            block += 1
            data += chunk
    data = _mk_bitmap(data)
    data[0:0] = [block] + _bytes_to_codes(mapping)
    out.append((BIGCHARSET, data))
    out += tail
    return out

-------------- next part -------------- An HTML attachment was scrubbed... 
URL: From barry at python.org Sun Aug 17 23:41:10 2014 From: barry at python.org (Barry Warsaw) Date: Sun, 17 Aug 2014 17:41:10 -0400 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> Message-ID: <20140817174110.5ddd3d90@limelight.wooz.org> I think the biggest API "problem" is that default iteration returns integers instead of bytes. That's a real pain.

I'm not sure .iterbytes() is the best name for spelling iteration over bytes instead of integers though. Given that we can't change __iter__(), I personally would perhaps prefer a simple .bytes property over which if you iterated you would receive bytes, e.g.

>>> data = bytes([1, 2, 3])
>>> for i in data:
...     print(i)
...
1
2
3
>>> for b in data.bytes:
...     print(b)
...
b'\x01'
b'\x02'
b'\x03'

There are no backward compatibility issues with this of course. As for the single-int-ctor forms, they're inconvenient and arguably "wrong", but I think we can live with it. OTOH, I don't see any harm in adding the .zeros() alternative constructor. I'd probably want to spell the .byte() alternative constructor .from_int() but I also don't think the status quo (or .byte()) is that much of a usability problem.

The API churn problem comes about when you start wanting to deprecate the single-int-ctor form. *If* that part gets adopted, it should have a really long deprecation cycle, IMO. 
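The iteration mismatch under discussion can be bridged today with a small helper; iterbytes here is a hypothetical stand-in for the method name the PEP proposes, not an existing API:

```python
def iterbytes(data):
    """Yield length-1 bytes objects instead of integers.

    Slicing a bytes/bytearray always returns an object of the same
    type, so data[i:i+1] gives b'\\x01' where data[i] gives 1.
    """
    return (data[i:i + 1] for i in range(len(data)))

data = bytes([1, 2, 3])
assert list(data) == [1, 2, 3]                       # default iteration: ints
assert list(iterbytes(data)) == [b'\x01', b'\x02', b'\x03']
```
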
Cheers, -Barry From donald at stufft.io Sun Aug 17 23:55:45 2014 From: donald at stufft.io (Donald Stufft) Date: Sun, 17 Aug 2014 17:55:45 -0400 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <37DD82D8-CC83-4D03-A854-3229BAAF8C1D@gmail.com> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <53F0F590.7040806@stoneleaf.us> <37DD82D8-CC83-4D03-A854-3229BAAF8C1D@gmail.com> Message-ID: <0022B2C7-6D1B-4F23-93F6-BB0AE42146BC@stufft.io> > On Aug 17, 2014, at 5:19 PM, Raymond Hettinger wrote: > > > On Aug 17, 2014, at 11:33 AM, Ethan Furman > wrote: > >> I've had many of the problems Nick states and I'm also +1. > > There are two code snippets below which were taken from the standard library. > Are you saying that: > 1) you don't understand the code (as the pep suggests) > 2) you are willing to break that code and everything like it > 3) and it would be more elegantly expressed as: > charmap = bytearray.zeros(256) > and > mapping = bytearray.zeros(256) > > At work, I have network engineers creating IPv4 headers and other structures > with bytearrays initialized to zeros. Do you really want to break all their code? > No where else in Python do we create buffers that way. Code like > "msg, who = s.recvfrom(256)" is the norm. > > Also, it is unclear if you're saying that you have an actual use case for this > part of the proposal? > > ba = bytearray.byte(65) > > And than the code would be better, clearer, and faster than the currently working form? > > ba = bytearray([65]) > > Does there really need to be a special case for constructing a single byte? > To me, that is akin to proposing "list.from_int(65)" as an important special > case to replace "[65]". > > If you must muck with the ever changing bytes() API, then please > leave the bytearray() API alone. 
I think we should show some respect > for code that is currently working and is cleanly expressible in both > Python 2 and Python 3. We aren't winning users with API churn. > > FWIW, I guessing that the differing view points in the thread stem > mainly from the proponents experiences with bytes() rather than > from experience with bytearray() which doesn't seem to have any > usage problems in the wild. I've never seen a developer say they > didn't understand what "buf = bytearray(1024)" means. That is > not an actual problem that needs solving (or breaking). > > What may be an actual problem is code like "char = bytes(1024)" > though I'm unclear what a user might have actually been trying > to do with code like that. I think this is probably correct. I generally don?t think that bytes(1024) makes much sense at all, especially not as a default constructor. Most likely it exists to be similar to bytearray(). I don't have a specific problem with bytearray(1024), though I do think it's more elegantly and clearly described as bytearray.zeros(1024), but not by much. I find bytes.byte()/bytearray to be needed as long as there isn't a simple way to iterate over a bytes or bytearray in a way that yields bytes or bytearrays instead of integers. To be honest I can't think of a time when I'd actually *want* to iterate over a bytes/bytearray as integers. Although I realize there is unlikely to be a reasonable method to change that now. If iterbytes is added I'm not sure where i'd personally use either bytes.byte() or bytearray.byte(). In general though I think that overloading a single constructor method to do something conceptually different based on the type of the parameter leads to these kind of confusing scenarios and that having differently named constructors for the different concepts is far clearer. So given all that, I am: * +10000 for some method of iterating over both types as bytes instead of integers. 
* +1 on adding .zeros to both types as an alternative and preferred method of creating a zero filled instance and deprecating the original method[1]. * -0 on adding .byte to both types as an alternative method of creating a single byte instance. * -1 On changing the meaning of bytearray(1024). * +/-0 on changing the meaning of bytes(1024), I think that bytes(1024) is likely to *not* be what someone wants and that what they really want is bytes([N]). I also think that the number one reason for someone to be doing bytes(N) is because they were attempting to iterate over a bytes or bytearray object and they got an integer. I also think that it's bad that this changes from 2.x to 3.x and I wish it hadn't. However I can't decide if it's worth reverting this at this time or not. [1] By deprecating I mean, raise a deprecation warning, or something but my thoughts on actually removing the other methods are listed explicitly. --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus at unterwaditzer.net Sun Aug 17 23:55:47 2014 From: markus at unterwaditzer.net (Markus Unterwaditzer) Date: Sun, 17 Aug 2014 23:55:47 +0200 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <20140817174110.5ddd3d90@limelight.wooz.org> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> Message-ID: <20140817215547.GA9919@chromebot.unti> On Sun, Aug 17, 2014 at 05:41:10PM -0400, Barry Warsaw wrote: > I think the biggest API "problem" is that default iteration returns integers > instead of bytes. That's a real pain. I agree, this behavior required some helper functions while porting Werkzeug to Python 3 AFAIK. > > I'm not sure .iterbytes() is the best name for spelling iteration over bytes > instead of integers though. 
Given that we can't change __iter__(), I > personally would perhaps prefer a simple .bytes property over which if you > iterated you would receive bytes, e.g. I'd rather go for a .bytes() method, to match the .values() and .keys() methods on dictionaries. -- Markus From antoine at python.org Mon Aug 18 00:33:01 2014 From: antoine at python.org (Antoine Pitrou) Date: Sun, 17 Aug 2014 18:33:01 -0400 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> Message-ID: Le 17/08/2014 13:07, Raymond Hettinger a écrit : > > FWIW, I've been teaching Python full time for three years. I cover the > use of bytearray(n) in my classes and not a single person out of 3000+ > engineers have had a problem with it. This is less about bytearray() than bytes(), IMO. bytearray() is sufficiently specialized that only experienced people will encounter it. And while preallocating a bytearray of a certain size makes sense, it's completely pointless for a bytes object. Regards Antoine. From ncoghlan at gmail.com Mon Aug 18 00:48:08 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 18 Aug 2014 08:48:08 +1000 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <20140817215547.GA9919@chromebot.unti> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> <20140817215547.GA9919@chromebot.unti> Message-ID: On 18 Aug 2014 08:04, "Markus Unterwaditzer" wrote: > > On Sun, Aug 17, 2014 at 05:41:10PM -0400, Barry Warsaw wrote: > > I think the biggest API "problem" is that default iteration returns integers > > instead of bytes. That's a real pain. > > I agree, this behavior required some helper functions while porting Werkzeug to > Python 3 AFAIK.
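[The kind of porting helper being alluded to can be sketched in a few lines — an illustrative reconstruction, not Werkzeug's actual code:]

```python
def iter_bytes(data):
    """Yield length-1 slices so iteration produces bytes, not integers."""
    return (data[i:i + 1] for i in range(len(data)))

# Default iteration over a bytes object yields integers on Python 3...
assert list(b"abc") == [97, 98, 99]

# ...while length-1 slicing preserves the container type, for bytes and
# bytearray alike.
assert list(iter_bytes(b"abc")) == [b"a", b"b", b"c"]
assert list(iter_bytes(bytearray(b"abc"))) == [b"a", b"b", b"c"]
```

[Length-1 slicing works uniformly across the binary sequence types, which is why it is the usual porting workaround.]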
> > > > > I'm not sure .iterbytes() is the best name for spelling iteration over bytes > > instead of integers though. Given that we can't change __iter__(), I > > personally would perhaps prefer a simple .bytes property over which if you > > iterated you would receive bytes, e.g. > > I'd rather be for a .bytes() method, to match the .values(), and .keys() > methods on dictionaries. Calling it bytes is too confusing: for x in bytes(data): ... for x in bytes(data).bytes() When referring to bytes, which bytes do you mean, the builtin or the method? iterbytes() isn't especially attractive as a method name, but it's far more explicit about its purpose. Cheers, Nick. > > -- Markus > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From barry at python.org Mon Aug 18 00:52:36 2014 From: barry at python.org (Barry Warsaw) Date: Sun, 17 Aug 2014 18:52:36 -0400 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> <20140817215547.GA9919@chromebot.unti> Message-ID: <20140817185236.77228385@limelight.wooz.org> On Aug 18, 2014, at 08:48 AM, Nick Coghlan wrote: >Calling it bytes is too confusing: > > for x in bytes(data): > ... > > for x in bytes(data).bytes() > >When referring to bytes, which bytes do you mean, the builtin or the method? > >iterbytes() isn't especially attractive as a method name, but it's far more >explicit about its purpose. I don't know. How often do you really instantiate the bytes object there in the for loop? -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: not available URL: From ncoghlan at gmail.com Mon Aug 18 01:08:09 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 18 Aug 2014 09:08:09 +1000 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> Message-ID: On 18 Aug 2014 03:07, "Raymond Hettinger" wrote: > > > On Aug 17, 2014, at 1:41 AM, Nick Coghlan wrote: > >> If I see "bytearray(10)" there is nothing there that suggests "this >> creates an array of length 10 and initialises it to zero" to me. I'd >> be more inclined to guess it would be equivalent to "bytearray([10])". >> >> "bytearray.zeros(10)", on the other hand, is relatively clear, >> independently of user expectations. > > > Zeros would have been great but that should have been done originally. > The time to get API design right is at inception. > Now, you're just breaking code and invalidating any published examples. I'm fine with postponing the deprecation elements indefinitely (or just deprecating bytes(int) and leaving bytearray(int) alone). > >>> >>> Another thought is that the core devs should be very reluctant to deprecate >>> anything we don't have to while the 2 to 3 transition is still in progress. >>> Every new deprecation of APIs that existed in Python 2.7 just adds another >>> obstacle to converting code. Individually, the differences are trivial. >>> Collectively, they present a good reason to never migrate code to Python 3. >> >> >> This is actually one of the inconsistencies between the Python 2 and 3 >> binary APIs: > > > However, bytearray(n) is the same in both Python 2 and Python 3. > Changing it in Python 3 increases the gulf between the two. 
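[To pin down the constructor semantics being debated — Python 3 shown; on Python 2, where bytes is an alias for str, bytes(6) instead produces the text string '6':]

```python
# Python 3: an integer argument means "zero-initialised sequence of that length"
assert bytes(6) == b"\x00\x00\x00\x00\x00\x00"
assert bytearray(6) == bytearray(b"\x00\x00\x00\x00\x00\x00")

# ...which is easy to confuse with "a single byte with that value":
assert bytes([6]) == b"\x06"
```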
> > The further we let Python 3 diverge from Python 2, the less likely that > people will convert their code and the harder you make it to write code > that runs under both. > > FWIW, I've been teaching Python full time for three years. I cover the > use of bytearray(n) in my classes and not a single person out of 3000+ > engineers have had a problem with it. I seriously question the PEP's > assertion that there is a real problem to be solved (i.e. that people > are baffled by bytearray(bufsiz)) and that the problem is sufficiently > painful to warrant the headaches that go along with API changes. Yes, I'd expect engineers and networking folks to be fine with it. It isn't how this mode of the constructor *works* that worries me, it's how it *fails* (i.e. silently producing unexpected data rather than a type error). Purely deprecating the bytes case and leaving bytearray alone would likely address my concerns. > > The other proposal to add bytearray.byte(3) should probably be named > bytearray.from_byte(3) for clarity. That said, I question whether there is > actually a use case for this. I have never seen code that has a > need to create a byte array of length one from a single integer. > For the most part, the API will be easiest to learn if it matches what > we do for lists and for array.array. This part of the proposal came from a few things: * many of the bytes and bytearray methods only accept bytes-like objects, but iteration and indexing produce integers * to mitigate the impact of the above, some (but not all) bytes and bytearray methods now accept integers in addition to bytes-like objects * ord() in Python 3 is only documented as accepting length 1 strings, but also accepts length 1 bytes-like objects Adding bytes.byte() makes it practical to document the binary half of ord's behaviour, and eliminates any temptation to expand the "also accepts integers" behaviour out to more types.
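[The asymmetry listed above is easy to demonstrate — a minimal Python 3 illustration:]

```python
data = b"Mon"

# Indexing (and iteration) produce integers...
first = data[0]
assert first == 77

# ...but ord() also accepts length-1 bytes, the under-documented
# "binary half" of its behaviour:
assert ord(b"M") == 77

# Getting from an integer back to a length-1 bytes object currently needs
# the list-wrapping idiom that the proposed bytes.byte() would replace:
assert bytes([first]) == b"M"
```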
bytes.byte() thus becomes the binary equivalent of chr(), just as Python 2 had both chr() and unichr(). I don't recall ever needing chr() in a real program either, but I still consider it an important part of clearly articulating the data model. > Sorry Nick, but I think you're making the API worse instead of better. > This API isn't perfect but it isn't flat-out broken either. There is some > unfortunate asymmetry between bytes() and bytearray() in Python 2, > but that ship has sailed. The current API for Python 3 is pretty good > (though there is still a tension between wanting to be like lists and like > strings both at the same time). Yes. It didn't help that the docs previously expected readers to infer the behaviour of the binary sequence methods from the string documentation - while the new docs could still use some refinement, I've at least addressed that part of the problem. Cheers, Nick. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Mon Aug 18 01:12:39 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 18 Aug 2014 09:12:39 +1000 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <20140817185236.77228385@limelight.wooz.org> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> <20140817215547.GA9919@chromebot.unti> <20140817185236.77228385@limelight.wooz.org> Message-ID: On 18 Aug 2014 08:55, "Barry Warsaw" wrote: > > On Aug 18, 2014, at 08:48 AM, Nick Coghlan wrote: > > >Calling it bytes is too confusing: > > > > for x in bytes(data): > > ... > > > > for x in bytes(data).bytes() > > > >When referring to bytes, which bytes do you mean, the builtin or the method? > > > >iterbytes() isn't especially attractive as a method name, but it's far more > >explicit about its purpose. > > I don't know. 
How often do you really instantiate the bytes object there in > the for loop? I'm talking more generally - do you *really* want to be explaining that "bytes" behaves like a tuple of integers, while "bytes.bytes" behaves like a tuple of bytes? Namespaces are great and all, but using the same name for two different concepts is still inherently confusing. Cheers, Nick. > > -Barry From antoine at python.org Mon Aug 18 01:23:00 2014 From: antoine at python.org (Antoine Pitrou) Date: Sun, 17 Aug 2014 19:23:00 -0400 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: Message-ID: Le 16/08/2014 01:17, Nick Coghlan a écrit : > > * Deprecate passing single integer values to ``bytes`` and ``bytearray`` I'm neutral. Ideally we wouldn't have made that mistake at the beginning. > * Add ``bytes.zeros`` and ``bytearray.zeros`` alternative constructors > * Add ``bytes.byte`` and ``bytearray.byte`` alternative constructors > * Add ``bytes.iterbytes``, ``bytearray.iterbytes`` and > ``memoryview.iterbytes`` alternative iterators +0.5. "iterbytes" isn't really great as a name. Regards Antoine. From raymond.hettinger at gmail.com Mon Aug 18 01:41:38 2014 From: raymond.hettinger at gmail.com (Raymond Hettinger) Date: Sun, 17 Aug 2014 16:41:38 -0700 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> Message-ID: <2DD0161B-0234-4900-AE4F-A5E1D273C16A@gmail.com> On Aug 17, 2014, at 4:08 PM, Nick Coghlan wrote: > Purely deprecating the bytes case and leaving bytearray alone would likely address my concerns.
That is good progress. Thanks :-) Would a warning for the bytes case suffice, or do you need an actual deprecation? > bytes.byte() thus becomes the binary equivalent of chr(), just as Python 2 had both chr() and unichr(). > > I don't recall ever needing chr() in a real program either, but I still consider it an important part of clearly articulating the data model. > > "I don't recall having ever needed this" greatly weakens the premise that this is needed :-) The APIs have been around since 2.6 and AFAICT there have been zero demonstrated need for a special case for a single byte. We already have a perfectly good spelling: NUL = bytes([0]) The Zen tells us we really don't need a second way to do it (actually a third since you can also write b'\x00') and it suggests that this special case isn't special enough. I encourage restraint against adding an unneeded class method that has no parallel elsewhere. Right now, the learning curve is mitigated because bytes is very str-like and because bytearray is list-like (i.e. the method names have been used elsewhere and likely already learned before encountering bytes() or bytearray()). Putting in new, rarely used funky method adds to the learning burden. If you do press forward with adding it (and I don't see why), then as an alternate constructor, the name should be from_int() or some such to avoid ambiguity and to make clear that it is a class method. > iterbytes() isn't especially attractive as a method name, but it's far more > explicit about its purpose. I concur. In this case, explicitness matters. Raymond -------------- next part -------------- An HTML attachment was scrubbed...
URL: From ncoghlan at gmail.com Mon Aug 18 01:51:40 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 18 Aug 2014 09:51:40 +1000 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <2DD0161B-0234-4900-AE4F-A5E1D273C16A@gmail.com> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <2DD0161B-0234-4900-AE4F-A5E1D273C16A@gmail.com> Message-ID: On 18 Aug 2014 09:41, "Raymond Hettinger" wrote: > > > I encourage restraint against adding an unneeded class method that has no parallel > elsewhere. Right now, the learning curve is mitigated because bytes is very str-like > and because bytearray is list-like (i.e. the method names have been used elsewhere > and likely already learned before encountering bytes() or bytearray()). Putting in new, > rarely used funky method adds to the learning burden. > > If you do press forward with adding it (and I don't see why), then as an alternate > constructor, the name should be from_int() or some such to avoid ambiguity > and to make clear that it is a class method. If I remember the sequence of events correctly, I thought of map(bytes.byte, data) first, and then Guido suggested a dedicated iterbytes() method later. The step I hadn't taken (until now) was realising that the new memoryview(data).iterbytes() capability actually combines with the existing (bytes([b]) for b in data) to make the original bytes.byte idea unnecessary. Cheers, Nick. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From barry at python.org Mon Aug 18 01:55:02 2014 From: barry at python.org (Barry Warsaw) Date: Sun, 17 Aug 2014 19:55:02 -0400 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> <20140817215547.GA9919@chromebot.unti> <20140817185236.77228385@limelight.wooz.org> Message-ID: <20140817195502.0e9acee3@limelight.wooz.org> On Aug 18, 2014, at 09:12 AM, Nick Coghlan wrote: >I'm talking more generally - do you *really* want to be explaining that >"bytes" behaves like a tuple of integers, while "bytes.bytes" behaves like >a tuple of bytes? I would explain it differently though, using concrete examples. data = bytes(...) for i in data: # iterate over data as integers for i in data.bytes: # iterate over data as bytes But whatever. I just wish there was something better than iterbytes. -Barry -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: not available URL: From ncoghlan at gmail.com Mon Aug 18 02:08:24 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 18 Aug 2014 10:08:24 +1000 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <20140817195502.0e9acee3@limelight.wooz.org> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> <20140817215547.GA9919@chromebot.unti> <20140817185236.77228385@limelight.wooz.org> <20140817195502.0e9acee3@limelight.wooz.org> Message-ID: On 18 Aug 2014 09:57, "Barry Warsaw" wrote: > > On Aug 18, 2014, at 09:12 AM, Nick Coghlan wrote: > > >I'm talking more generally - do you *really* want to be explaining that > >"bytes" behaves like a tuple of integers, while "bytes.bytes" behaves like > >a tuple of bytes? > > I would explain it differently though, using concrete examples. > > data = bytes(...) > for i in data: # iterate over data as integers > for i in data.bytes: # iterate over data as bytes > > But whatever. I just wish there was something better than iterbytes. There's actually another aspect to your idea, independent of the naming: exposing a view rather than just an iterator. I'm going to have to look at the implications for memoryview, but it may be a good way to go (and would align with the iterator -> view changes in dict). Cheers, Nick. > > -Barry > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From barry at python.org Mon Aug 18 02:22:07 2014 From: barry at python.org (Barry Warsaw) Date: Sun, 17 Aug 2014 20:22:07 -0400 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> <20140817215547.GA9919@chromebot.unti> <20140817185236.77228385@limelight.wooz.org> <20140817195502.0e9acee3@limelight.wooz.org> Message-ID: <20140817202207.092a665d@limelight.wooz.org> On Aug 18, 2014, at 10:08 AM, Nick Coghlan wrote: >There's actually another aspect to your idea, independent of the naming: >exposing a view rather than just an iterator. I'm going to have to look at >the implications for memoryview, but it may be a good way to go (and would >align with the iterator -> view changes in dict). Yep! Maybe that will inspire a better spelling. :) Cheers, -Barry -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: not available URL: From guido at python.org Mon Aug 18 02:45:32 2014 From: guido at python.org (Guido van Rossum) Date: Sun, 17 Aug 2014 17:45:32 -0700 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <20140817202207.092a665d@limelight.wooz.org> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> <20140817215547.GA9919@chromebot.unti> <20140817185236.77228385@limelight.wooz.org> <20140817195502.0e9acee3@limelight.wooz.org> <20140817202207.092a665d@limelight.wooz.org> Message-ID: On Sun, Aug 17, 2014 at 5:22 PM, Barry Warsaw wrote: > On Aug 18, 2014, at 10:08 AM, Nick Coghlan wrote: > > >There's actually another aspect to your idea, independent of the naming: > >exposing a view rather than just an iterator. 
I'm going to have to look at > >the implications for memoryview, but it may be a good way to go (and would > >align with the iterator -> view changes in dict). > > Yep! Maybe that will inspire a better spelling. :) > +1. It's just as much about b[i] as it is about "for c in b", so a view sounds right. (The view would have to be mutable for bytearrays and for writable memoryviews.) On the rest, it's sounding more and more as if we will just need to live with both bytes(1000) and bytearray(1000). A warning sounds worse than a deprecation to me. bytes.zeros(n) sounds fine to me; I value similar interfaces for bytes and bytearray pretty highly. I'm lukewarm on bytes.byte(c); but bytes([c]) does bother me because a size one list is (or at least feels) more expensive to allocate than a size one bytes object. So, okay. -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Mon Aug 18 03:02:18 2014 From: guido at python.org (Guido van Rossum) Date: Sun, 17 Aug 2014 18:02:18 -0700 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? In-Reply-To: <20140817092919.1f41e88a@limelight.wooz.org> References: <20140817092919.1f41e88a@limelight.wooz.org> Message-ID: On Sun, Aug 17, 2014 at 6:29 AM, Barry Warsaw wrote: > On Aug 16, 2014, at 07:43 PM, Guido van Rossum wrote: > > >(Don't understand this to mean that we should never deprecate things. > >Deprecations will happen, they are necessary for the evolution of any > >programming language. But they won't ever hurt in the way that Python 3 > >hurt.) > > It would be useful to explore what causes the most pain in the 2->3 > transition? IMHO, it's not the deprecations or changes such as print -> > print(). It's the bytes/str split - a fundamental change to core and > common > data types. The question then is whether you foresee any similar looming > pervasive change? 
[*] > I'm unsure about what's the single biggest pain moving to Python 3. In the past I would have said that it's for sure the bytes/str split (which is both the biggest pain and the biggest payoff). But if I look carefully into the soul of teams that are still on 2.7 (I know a few... :-), I think the real reason is that Python 3 changes so many different things, you have to actually understand your code to port it (unlike with minor version transitions, where the changes usually spike in one specific area, and you can leave the rest to normal attrition and periodic maintenance). -Barry > > [*] I was going to add a joke about mandatory static type checking, but > sometimes jokes are blown up into apocalyptic prophecy around here. ;) > Heh. :-) -- --Guido van Rossum (python.org/~guido) From donald at stufft.io Mon Aug 18 03:14:46 2014 From: donald at stufft.io (Donald Stufft) Date: Sun, 17 Aug 2014 21:14:46 -0400 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? In-Reply-To: References: <20140817092919.1f41e88a@limelight.wooz.org> Message-ID: <1408324486.2238358.153743005.7C19896E@webmail.messagingengine.com> On Sun, Aug 17, 2014, at 09:02 PM, Guido van Rossum wrote: > On Sun, Aug 17, 2014 at 6:29 AM, Barry Warsaw wrote: >> On Aug 16, 2014, at 07:43 PM, Guido van Rossum wrote: >> >> >(Don't understand this to mean that we should never deprecate things. >> >Deprecations will happen, they are necessary for the evolution of any >> >programming language. But they won't ever hurt in the way that Python 3 >> >hurt.) >> >> It would be useful to explore what causes the most pain in the 2->3 >> transition? IMHO, it's not the deprecations or changes such as print -> >> print(). It's the bytes/str split - a fundamental change to core and common >> data types.
The question then is whether you foresee any similar looming >> pervasive change? [*] > I'm unsure about what's the single biggest pain moving to Python 3. In the past I would have said that it's for sure the bytes/str split (which is both the biggest pain and the biggest payoff). > > But if I look carefully into the soul of teams that are still on 2.7 (I know a few... :-), I think the real reason is that Python 3 changes so many different things, you have to actually understand your code to port it (unlike with minor version transitions, where the changes usually spike in one specific area, and you can leave the rest to normal attrition and periodic maintenance). > In my experience bytes/str is the single biggest change that causes the most problems. Most of the other changes can be mechanically transformed and/or papered over using helpers like six. The bytes/str change is the main one that requires understanding code and where it requires a serious untangling of things in code bases where str/bytes are freely used interchangeably. Oftentimes this requires making a decision about what *should* be bytes or str as well, which requires having some deep knowledge about the APIs in question too. From antoine at python.org Mon Aug 18 03:39:31 2014 From: antoine at python.org (Antoine Pitrou) Date: Sun, 17 Aug 2014 21:39:31 -0400 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <2DD0161B-0234-4900-AE4F-A5E1D273C16A@gmail.com> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <2DD0161B-0234-4900-AE4F-A5E1D273C16A@gmail.com> Message-ID: Le 17/08/2014 19:41, Raymond Hettinger a écrit : > > The APIs have been around since 2.6 and AFAICT there have been zero > demonstrated > need for a special case for a single byte. We already have a perfectly > good spelling: > NUL = bytes([0]) That is actually a very cumbersome spelling. Why should I first create a one-element list in order to create a one-byte bytes object?
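[The non-constant case in question — turning an integer computed at runtime into a one-byte bytes object — currently requires exactly that one-element-list idiom. A small illustration; the checksum function is hypothetical, not from any project mentioned in the thread:]

```python
def checksum_byte(payload):
    """Return the modulo-256 sum of payload as a single byte."""
    return bytes([sum(payload) % 256])  # the bytes([n]) idiom under discussion

assert checksum_byte(b"\x01\x02\x03") == b"\x06"
assert checksum_byte(b"\xff\x01") == b"\x00"

# A literal like b'\x00' covers the constant case, but not a computed value:
n = 0x41
assert bytes([n]) == b"A"
```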
> The Zen tells us we really don't need a second way to do it (actually a > third since you > can also write b'\x00') and it suggests that this special case isn't > special enough. b'\x00' is obviously the right way to do it in this case, but we're concerned about the non-constant case. The reason to instantiate bytes from a non-constant integer comes from the unfortunate indexing and iteration behaviour of bytes objects. Regards Antoine. From ethan at stoneleaf.us Mon Aug 18 03:44:23 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Sun, 17 Aug 2014 18:44:23 -0700 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <37DD82D8-CC83-4D03-A854-3229BAAF8C1D@gmail.com> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <53F0F590.7040806@stoneleaf.us> <37DD82D8-CC83-4D03-A854-3229BAAF8C1D@gmail.com> Message-ID: <53F15A77.3010403@stoneleaf.us> On 08/17/2014 02:19 PM, Raymond Hettinger wrote: > On Aug 17, 2014, at 11:33 AM, Ethan Furman wrote: > >> I've had many of the problems Nick states and I'm also +1. > > There are two code snippets below which were taken from the standard library. [...] My issues are with 'bytes', not 'bytearray'. 'bytearray(10)' actually makes sense. I certainly have no problem with bytearray and bytes not being exactly the same. My primary issues with bytes are not being able to do b'abc'[2] == b'c', and not being able to do x = b'abc'[2]; y = bytes(x); assert y == b'c'. And because of the backwards compatibility issues I would deprecate the current functionality, because we have a new 'better' way, but not remove it. I pretty much agree exactly with what Donald Stufft said about it.
-- ~Ethan~ From antoine at python.org Mon Aug 18 03:40:50 2014 From: antoine at python.org (Antoine Pitrou) Date: Sun, 17 Aug 2014 21:40:50 -0400 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> <20140817215547.GA9919@chromebot.unti> <20140817185236.77228385@limelight.wooz.org> <20140817195502.0e9acee3@limelight.wooz.org> Message-ID: Le 17/08/2014 20:08, Nick Coghlan a écrit : > > On 18 Aug 2014 09:57, "Barry Warsaw" > > wrote: > > > > On Aug 18, 2014, at 09:12 AM, Nick Coghlan wrote: > > > > >I'm talking more generally - do you *really* want to be explaining that > > >"bytes" behaves like a tuple of integers, while "bytes.bytes" > behaves like > > >a tuple of bytes? > > > > I would explain it differently though, using concrete examples. > > > > data = bytes(...) > > for i in data: # iterate over data as integers > > for i in data.bytes: # iterate over data as bytes > > > > But whatever. I just wish there was something better than iterbytes. > > There's actually another aspect to your idea, independent of the naming: > exposing a view rather than just an iterator. So that view would actually be the bytes object done right? Funny :-) Will it have lazy slicing? Regards Antoine.
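[A toy sketch of what such a byte-oriented view might look like — purely illustrative, assuming nothing about the design the thread might converge on (no lazy slicing, no mutation support):]

```python
class ByteView:
    """Wrap a bytes-like object so indexing and iteration yield bytes."""

    def __init__(self, data):
        self._data = data

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return ByteView(self._data[index])
        if index < 0:
            index += len(self._data)
        if not 0 <= index < len(self._data):
            raise IndexError("ByteView index out of range")
        return self._data[index:index + 1]  # length-1 slice -> bytes, not int

    def __iter__(self):
        return (self[i] for i in range(len(self)))


view = ByteView(b"abc")
assert view[0] == b"a" and view[-1] == b"c"
assert list(view) == [b"a", b"b", b"c"]
assert list(view[1:]) == [b"b", b"c"]
```

[A real design would also need to address Guido's point about mutability for bytearray and writable memoryview objects.]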
From donald at stufft.io Mon Aug 18 03:48:21 2014 From: donald at stufft.io (Donald Stufft) Date: Sun, 17 Aug 2014 21:48:21 -0400 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> <20140817215547.GA9919@chromebot.unti> <20140817185236.77228385@limelight.wooz.org> <20140817195502.0e9acee3@limelight.wooz.org> Message-ID: <1408326501.2246487.153751541.07C44893@webmail.messagingengine.com> from __future__ import bytesdoneright? :D -- Donald Stufft donald at stufft.io On Sun, Aug 17, 2014, at 09:40 PM, Antoine Pitrou wrote: > Le 17/08/2014 20:08, Nick Coghlan a écrit : > > > > On 18 Aug 2014 09:57, "Barry Warsaw" > > > wrote: > > > > > > On Aug 18, 2014, at 09:12 AM, Nick Coghlan wrote: > > > > > > >I'm talking more generally - do you *really* want to be explaining that > > > >"bytes" behaves like a tuple of integers, while "bytes.bytes" > > behaves like > > > >a tuple of bytes? > > > > > > I would explain it differently though, using concrete examples. > > > > > > data = bytes(...) > > > for i in data: # iterate over data as integers > > > for i in data.bytes: # iterate over data as bytes > > > > > > But whatever. I just wish there was something better than iterbytes. > > > > There's actually another aspect to your idea, independent of the naming: > > exposing a view rather than just an iterator. > > So that view would actually be the bytes object done right? Funny :-) > Will it have lazy slicing? > > Regards > > Antoine.
From ethan at stoneleaf.us Mon Aug 18 03:52:10 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Sun, 17 Aug 2014 18:52:10 -0700 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> Message-ID: <53F15C4A.5090006@stoneleaf.us> On 08/17/2014 04:08 PM, Nick Coghlan wrote: > > I'm fine with postponing the deprecation elements indefinitely (or just deprecating bytes(int) and leaving > bytearray(int) alone). +1 on both pieces. -- ~Ethan~ From graffatcolmingov at gmail.com Mon Aug 18 04:02:52 2014 From: graffatcolmingov at gmail.com (Ian Cordasco) Date: Sun, 17 Aug 2014 21:02:52 -0500 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <53F15C4A.5090006@stoneleaf.us> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <53F15C4A.5090006@stoneleaf.us> Message-ID: On Sun, Aug 17, 2014 at 8:52 PM, Ethan Furman wrote: > On 08/17/2014 04:08 PM, Nick Coghlan wrote: >> >> >> I'm fine with postponing the deprecation elements indefinitely (or just >> deprecating bytes(int) and leaving >> bytearray(int) alone). > > > +1 on both pieces. Perhaps postpone the deprecation to Python 4000 ;) From alex.gaynor at gmail.com Mon Aug 18 04:14:01 2014 From: alex.gaynor at gmail.com (Alex Gaynor) Date: Mon, 18 Aug 2014 02:14:01 +0000 (UTC) Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> Message-ID: Donald Stufft stufft.io> writes: > > > > For the record I've had all of the problems that Nick states and I'm > +1 on this change.
> > > --- > Donald Stufft > PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA > I've hit basically every problem everyone here has stated, and in no uncertain terms am I completely opposed to deprecating anything. The Python 2 to 3 migration is already hard enough, and already proceeding far too slowly for many of our tastes. Making that migration even more complex would drive me to the point of giving up. Alex From chrism at plope.com Mon Aug 18 04:51:26 2014 From: chrism at plope.com (Chris McDonough) Date: Sun, 17 Aug 2014 22:51:26 -0400 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> <20140817215547.GA9919@chromebot.unti> <20140817185236.77228385@limelight.wooz.org> <20140817195502.0e9acee3@limelight.wooz.org> Message-ID: <53F16A2E.1080806@plope.com> On 08/17/2014 09:40 PM, Antoine Pitrou wrote: > On 17/08/2014 20:08, Nick Coghlan wrote: >> >> On 18 Aug 2014 09:57, "Barry Warsaw" > > wrote: >> > >> > On Aug 18, 2014, at 09:12 AM, Nick Coghlan wrote: >> > >> > >I'm talking more generally - do you *really* want to be explaining >> that >> > >"bytes" behaves like a tuple of integers, while "bytes.bytes" >> behaves like >> > >a tuple of bytes? >> > >> > I would explain it differently though, using concrete examples. >> > >> > data = bytes(...) >> > for i in data: # iterate over data as integers >> > for i in data.bytes: # iterate over data as bytes >> > >> > But whatever. I just wish there was something better than iterbytes. >> >> There's actually another aspect to your idea, independent of the naming: >> exposing a view rather than just an iterator. > > So that view would actually be the bytes object done right? Funny :-) > Will it have lazy slicing? bytes.sorry()?
;-) - C From jeanpierreda at gmail.com Mon Aug 18 05:50:40 2014 From: jeanpierreda at gmail.com (Devin Jeanpierre) Date: Sun, 17 Aug 2014 20:50:40 -0700 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> Message-ID: On Sun, Aug 17, 2014 at 7:14 PM, Alex Gaynor wrote: > I've hit basically every problem everyone here has stated, and in no uncertain > terms am I completely opposed to deprecating anything. The Python 2 to 3 > migration is already hard enough, and already proceeding far too slowly for > many of our tastes. Making that migration even more complex would drive me to > the point of giving up. Could you elaborate what problems you are thinking this will cause for you? It seems to me that avoiding a bug-prone API is not particularly complex, and moving it back to its 2.x semantics or making it not work entirely, rather than making it work differently, would make porting applications easier. If, during porting to 3.x, you find a deprecation warning for bytes(n), then rather than being annoying code churny extra changes, this is actually a bug that's been identified. So it's helpful even during the deprecation period. -- Devin From ncoghlan at gmail.com Mon Aug 18 08:45:27 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 18 Aug 2014 16:45:27 +1000 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? In-Reply-To: <1408324486.2238358.153743005.7C19896E@webmail.messagingengine.com> References: <20140817092919.1f41e88a@limelight.wooz.org> <1408324486.2238358.153743005.7C19896E@webmail.messagingengine.com> Message-ID: On 18 August 2014 11:14, Donald Stufft wrote: > On Sun, Aug 17, 2014, at 09:02 PM, Guido van Rossum wrote: >> I'm unsure about what's the single biggest pain moving to Python 3. 
In the past I would have said that it's for sure the bytes/str split (which is both the biggest pain and the biggest payoff). >> >> But if I look carefully into the soul of teams that are still on 2.7 (I know a few... :-), I think the real reason is that Python 3 changes so many different things, you have to actually understand your code to port it (unlike with minor version transitions, where the changes usually spike in one specific area, and you can leave the rest to normal attrition and periodic maintenance). >> > > In my experience bytes/str is the single biggest change that causes the most problems. Most of the other changes can be mechanically transformed and/or papered over using helpers like six. The bytes/str change is the main one that requires understanding code and where it requires a serious untangling of things in code bases where str/bytes are freely used interchangeably. Often times this requires making a decision about what *should* be bytes or str as well, which requires having some deep knowledge about the APIs in question too. It's certainly the one that has caused the most churn in CPython and the standard library - the ripples still haven't entirely settled on that front :) I think Guido's right that there's also a "death of a thousand cuts" aspect for large existing code bases, though, especially those that are lacking comprehensive test suites. By definition, existing large Python 2 applications are OK with the restrictions imposed by Python 2, and we're deliberately not forcing the issue by halting Python 2 maintenance. That's where Steve Dower's idea of being able to progressively declare a code base "Python 3 compatible" on a file-by-file basis and have some means of programmatically enforcing that is interesting - it opens the door to "opportunistic and incremental" porting, where modules are progressively updated to run on both, until an application reaches a point where it can switch to Python 3 and leave Python 2 behind.
Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From barry at python.org Mon Aug 18 15:50:00 2014 From: barry at python.org (Barry Warsaw) Date: Mon, 18 Aug 2014 09:50:00 -0400 Subject: [Python-Dev] PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <2DD0161B-0234-4900-AE4F-A5E1D273C16A@gmail.com> Message-ID: <20140818095000.684ecc7b@limelight.wooz.org> On Aug 17, 2014, at 09:39 PM, Antoine Pitrou wrote: >> need for a special case for a single byte. We already have a perfectly >> good spelling: >> NUL = bytes([0]) > >That is actually a very cumbersome spelling. Why should I first create a >one-element list in order to create a one-byte bytes object? I feel the same way every time I have to write `set(['foo'])`. -Barry From barry at python.org Mon Aug 18 16:12:59 2014 From: barry at python.org (Barry Warsaw) Date: Mon, 18 Aug 2014 10:12:59 -0400 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? In-Reply-To: References: <20140817092919.1f41e88a@limelight.wooz.org> Message-ID: <20140818101259.7a18ce2f@limelight.wooz.org> On Aug 17, 2014, at 06:02 PM, Guido van Rossum wrote: >I'm unsure about what's the single biggest pain moving to Python 3. In the >past I would have said that it's for sure the bytes/str split (which is both >the biggest pain and the biggest payoff). > >But if I look carefully into the soul of teams that are still on 2.7 (I >know a few... :-), I think the real reason is that Python 3 changes so many >different things, you have to actually understand your code to port it >(unlike with minor version transitions, where the changes usually spike in >one specific area, and you can leave the rest to normal attrition and >periodic maintenance). The latter is a good point, and sometimes it's a huge challenge to understand the code being ported.
A good test suite (and dare I say, doctests :) helps a lot with this. I've ported a ton of stuff, and failed at a few. I think all the little changes are mostly tractable, and we've assembled a pretty good stack of documents to help[*]. Sometimes a seemingly easy and mechanical port will produce odd failures, where more domain expertise needs to be brought to bear to get just the right bilingual invocation. But if the underlying code does not itself have a clear bytes/str distinction, then you're doomed. One of my failures was a Python binding for a large C++ project that deeply conflated data and text. Another was a pure Python library that essentially did the same. In both cases, I ended up in a situation where some core types could be neither str nor bytes without some part of the test suite failing miserably. Those are the types of projects that largely get left unported since it's much harder to justify the costs vs. benefits. Cheers, -Barry [*] https://wiki.python.org/moin/PortingToPy3k/BilingualQuickRef From barry at python.org Mon Aug 18 16:17:23 2014 From: barry at python.org (Barry Warsaw) Date: Mon, 18 Aug 2014 10:17:23 -0400 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <1408326501.2246487.153751541.07C44893@webmail.messagingengine.com> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> <20140817215547.GA9919@chromebot.unti> <20140817185236.77228385@limelight.wooz.org> <20140817195502.0e9acee3@limelight.wooz.org> <1408326501.2246487.153751541.07C44893@webmail.messagingengine.com> Message-ID: <20140818101723.39457c84@limelight.wooz.org> On Aug 17, 2014, at 09:48 PM, Donald Stufft wrote: >from __future__ import bytesdoneright?
:D Synonymous to: bytes = bytesdoneright or maybe from bytesdoneright import bytes :) -Barry From andreas.r.maier at gmx.de Mon Aug 18 13:34:32 2014 From: andreas.r.maier at gmx.de (Andreas Maier) Date: Mon, 18 Aug 2014 13:34:32 +0200 Subject: [Python-Dev] Review needed for patch for issue #12067 Message-ID: <53F1E4C8.5030801@gmx.de> Hello, a patch for issue #12067 (targeting Py 3.5) has been available for a few weeks and is ready for review. From my perspective, it is ready for commit. Could the community please review the patch? https://bugs.python.org/issue12067 Thanks, Andy From dickinsm at gmail.com Mon Aug 18 19:22:26 2014 From: dickinsm at gmail.com (Mark Dickinson) Date: Mon, 18 Aug 2014 18:22:26 +0100 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? In-Reply-To: <20140817023902.GM4525@ando> References: <20140817023902.GM4525@ando> Message-ID: [Moderately off-topic] On Sun, Aug 17, 2014 at 3:39 AM, Steven D'Aprano wrote: > I used to refer to Python 4000 as the hypothetical compatibility break > version. Now I refer to Python 5000. > I personally think it should be Python 5000000, or Py5M. When we come to create the mercurial branch, that should of course, following tradition, be called p5ym. -- Mark From antoine at python.org Mon Aug 18 19:49:06 2014 From: antoine at python.org (Antoine Pitrou) Date: Mon, 18 Aug 2014 13:49:06 -0400 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? In-Reply-To: References: <20140817023902.GM4525@ando> Message-ID: On 18/08/2014 13:22, Mark Dickinson wrote: > [Moderately off-topic] > > On Sun, Aug 17, 2014 at 3:39 AM, Steven D'Aprano > wrote: > > I used to refer to Python 4000 as the hypothetical compatibility break > version. Now I refer to Python 5000. > > > I personally think it should be Python 5000000, or Py5M.
When we come > to create the mercurial branch, that should of course, following > tradition, be called p5ym. I would suggest "NaV", for "not-a-version". It would compare greater than all other version numbers (in the spirit of Numpy's "not-a-time", slightly tweaked). Regards Antoine. From chris.barker at noaa.gov Mon Aug 18 18:04:06 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Mon, 18 Aug 2014 09:04:06 -0700 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: <20140817174110.5ddd3d90@limelight.wooz.org> References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> Message-ID: On Sun, Aug 17, 2014 at 2:41 PM, Barry Warsaw wrote: > I think the biggest API "problem" is that default iteration returns > integers > instead of bytes. That's a real pain. > what is really needed for this NOT to be a pain is a byte scalar. numpy has a scalar type for every type it supports -- this is a GOOD THING (tm): In [53]: a = np.array((3,4,5), dtype=np.uint8) In [54]: a Out[54]: array([3, 4, 5], dtype=uint8) In [55]: a[1] Out[55]: 4 In [56]: type(a[1]) Out[56]: numpy.uint8 In [57]: a[1].shape Out[57]: () The lack of a character type is a major source of "type errors" in python (the whole list of strings vs a single string problem -- both return a sequence when you index into them or iterate over them) Anyway, the character ship has long since sailed, but maybe a byte scalar would be a good idea? And FWIW, I think the proposal does make for a better, cleaner API. Whether that's worth the deprecation is not clear to me, though as someone who's been on the verge of making the leap to 3.* for ages, this isn't going to make any difference. -Chris -- Christopher Barker, Ph.D.
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From tjreedy at udel.edu Mon Aug 18 22:06:06 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Mon, 18 Aug 2014 16:06:06 -0400 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> Message-ID: On 8/18/2014 12:04 PM, Chris Barker wrote: > On Sun, Aug 17, 2014 at 2:41 PM, Barry Warsaw > wrote: > > I think the biggest API "problem" is that default iteration returns > integers > instead of bytes. That's a real pain. > > > what is really needed for this NOT to be a pain is a byte scalar. The byte scalar is an int in range(256). Bytes is an array of such. > numpy has a scalar type for every type it supports -- this is a GOOD > THING (tm): > > In [53]: a = np.array((3,4,5), dtype=np.uint8) > > In [54]: a > Out[54]: array([3, 4, 5], dtype=uint8) > > In [55]: a[1] > Out[55]: 4 > > In [56]: type(a[1]) > Out[56]: numpy.uint8 > > In [57]: a[1].shape > Out[57]: () > > > The lack of a character type is a major source of "type errors" in > python (the whole list of strings vs a single string problem -- both > return a sequence when you index into them or iterate over them) This is exactly what iterbytes would do -- yields bytes of size 1. > Anyway, the character ship has long since sailed, but maybe a byte > scalar would be a good idea? > > And FWIW, I think the proposal does make for a better, cleaner API.
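A minimal sketch of the behaviour Terry describes -- yielding bytes of size 1 -- works for both bytes and bytearray (the name `iterbytes` is the one proposed in the thread, not an existing API):

```python
def iterbytes(data):
    # Yield length-1 slices instead of ints; slicing a bytes object
    # returns bytes, and slicing a bytearray returns a bytearray.
    for i in range(len(data)):
        yield data[i:i + 1]

assert list(iterbytes(b"hi")) == [b"h", b"i"]
# bytearray slices are bytearrays, but they compare equal to bytes:
assert list(iterbytes(bytearray(b"hi"))) == [b"h", b"i"]
```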
-- Terry Jan Reedy From tjreedy at udel.edu Mon Aug 18 22:12:22 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Mon, 18 Aug 2014 16:12:22 -0400 Subject: [Python-Dev] https:bugs.python.org -- Untrusted Connection (Firefox) Message-ID: Firefox does not want to connect to https:bugs.python.org. Plain bugs.python.org works fine. Has the certificate expired? -- Terry Jan Reedy From phd at phdru.name Mon Aug 18 22:19:48 2014 From: phd at phdru.name (Oleg Broytman) Date: Mon, 18 Aug 2014 22:19:48 +0200 Subject: [Python-Dev] https:bugs.python.org -- Untrusted Connection (Firefox) In-Reply-To: References: Message-ID: <20140818201948.GB1782@phdru.name> On Mon, Aug 18, 2014 at 04:12:22PM -0400, Terry Reedy wrote: > Firefox does not want to connect to https:bugs.python.org. Works for me (FF 31). Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From benjamin at python.org Mon Aug 18 22:22:01 2014 From: benjamin at python.org (Benjamin Peterson) Date: Mon, 18 Aug 2014 13:22:01 -0700 Subject: [Python-Dev] https:bugs.python.org -- Untrusted Connection (Firefox) In-Reply-To: References: Message-ID: <1408393321.2083664.154095037.3A4EB862@webmail.messagingengine.com> It uses a CACert certificate, which your system probably doesn't trust. On Mon, Aug 18, 2014, at 13:12, Terry Reedy wrote: > Firefox does not want to connect to https:bugs.python.org. Plain > bugs.python.org works fine. Has the certificate expired? 
> > -- > Terry Jan Reedy From graffatcolmingov at gmail.com Mon Aug 18 22:26:48 2014 From: graffatcolmingov at gmail.com (Ian Cordasco) Date: Mon, 18 Aug 2014 15:26:48 -0500 Subject: [Python-Dev] https:bugs.python.org -- Untrusted Connection (Firefox) In-Reply-To: <1408393321.2083664.154095037.3A4EB862@webmail.messagingengine.com> References: <1408393321.2083664.154095037.3A4EB862@webmail.messagingengine.com> Message-ID: On Mon, Aug 18, 2014 at 3:22 PM, Benjamin Peterson wrote: > It uses a CACert certificate, which your system probably doesn't trust. > > On Mon, Aug 18, 2014, at 13:12, Terry Reedy wrote: >> Firefox does not want to connect to https:bugs.python.org. Plain >> bugs.python.org works fine. Has the certificate expired? >> >> -- >> Terry Jan Reedy Benjamin that looks accurate. I see the same thing as Terry (on Firefox 31) and the reason is: bugs.python.org uses an invalid security certificate. The certificate is not trusted because no issuer chain was provided.
(Error code: sec_error_unknown_issuer) From phd at phdru.name Mon Aug 18 22:30:43 2014 From: phd at phdru.name (Oleg Broytman) Date: Mon, 18 Aug 2014 22:30:43 +0200 Subject: [Python-Dev] https:bugs.python.org -- Untrusted Connection (Firefox) In-Reply-To: References: <1408393321.2083664.154095037.3A4EB862@webmail.messagingengine.com> Message-ID: <20140818203043.GC1782@phdru.name> On Mon, Aug 18, 2014 at 03:26:48PM -0500, Ian Cordasco wrote: > On Mon, Aug 18, 2014 at 3:22 PM, Benjamin Peterson wrote: > > It uses a CACert certificate, which your system probably doesn't trust. > > > > On Mon, Aug 18, 2014, at 13:12, Terry Reedy wrote: > >> Firefox does not want to connect to https:bugs.python.org. Plain > >> bugs.python.org works fine. Has the certificate expired? > > Benjamin that looks accurate. I see the same thing as Terry (on > Firefox 31) and the reason is: > > bugs.python.org uses an invalid security certificate. The certificate > is not trusted because no issuer chain was provided. (Error code: > sec_error_unknown_issuer) Aha, I see now -- the signing certificate is CAcert, which I've installed manually. Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From robertc at robertcollins.net Tue Aug 19 02:07:59 2014 From: robertc at robertcollins.net (Robert Collins) Date: Tue, 19 Aug 2014 12:07:59 +1200 Subject: [Python-Dev] os.walk() is going to be *fast* with scandir In-Reply-To: <53E70D1D.3040306@hastings.org> References: <53E70D1D.3040306@hastings.org> Message-ID: Indeed - my suggestion is applicable to people using the library -Rob On 10 Aug 2014 18:21, "Larry Hastings" wrote: > On 08/09/2014 10:40 PM, Robert Collins wrote: > > A small tip from my bzr days - cd into the directory before scanning it > > > I doubt that's permissible for a library function like os.scandir(). 
> > > */arry* From chris.barker at noaa.gov Mon Aug 18 22:37:32 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Mon, 18 Aug 2014 13:37:32 -0700 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> Message-ID: On Mon, Aug 18, 2014 at 1:06 PM, Terry Reedy wrote: > The byte scalar is an int in range(256). Bytes is an array of such. > then why the complaint about iterating over bytes producing ints? Yes, a byte would be pretty much the same as an int, but it would have restrictions - useful ones. numpy has a scalar type for every type it supports -- this is a GOOD >> THING (tm): >> In [56]: type(a[1]) >> Out[56]: numpy.uint8 >> >> In [57]: a[1].shape >> Out[57]: () >> >> >> The lack of a character type is a major source of "type errors" in >> python (the whole list of strings vs a single string problem -- both >> return a sequence when you index into them or iterate over them) >> > > This is exactly what iterbytes would do -- yields bytes of size 1. as I understand it, it would yield a bytes object of length one -- that is a sequence that _happens_ to only have one item in it -- not the same thing. Note above. In numpy, when you index out of a 1-d array you get a scalar -- with shape == () -- not a 1-d array of length 1. And this is useful, as it provides a clear termination point when you drill down through multiple dimensions. I often wish I could do that with nested lists with strings at the bottom.
[1,2,3] is a sequence of numbers "this" is a sequence of characters -- oops, no it's not, it's a sequence of sequences of sequences of ... I think it would be cleaner if bytes was a sequence of a scalar byte object. This is a bigger deal for numpy, what with its n-dimensional arrays and many reducing operations, but the same principles apply. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From storchaka at gmail.com Tue Aug 19 10:37:00 2014 From: storchaka at gmail.com (Serhiy Storchaka) Date: Tue, 19 Aug 2014 11:37:00 +0300 Subject: [Python-Dev] Bytes path support Message-ID: Builtin open(), the io classes, os and os.path functions, and some other functions in the stdlib support bytes paths as well as str paths. But many functions don't. There are requests about adding this support ([1], [2]) in some modules. It is easy (just call os.fsdecode() on the argument) but I'm not sure it is worth doing. Pathlib doesn't support bytes paths, and that looks intentional. What is the general policy about support of bytes paths in the stdlib?
[1] http://bugs.python.org/issue19997 [2] http://bugs.python.org/issue20797 From ncoghlan at gmail.com Tue Aug 19 14:25:48 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 19 Aug 2014 22:25:48 +1000 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> <20140817215547.GA9919@chromebot.unti> <20140817185236.77228385@limelight.wooz.org> <20140817195502.0e9acee3@limelight.wooz.org> <20140817202207.092a665d@limelight.wooz.org> Message-ID: On 18 August 2014 10:45, Guido van Rossum wrote: > On Sun, Aug 17, 2014 at 5:22 PM, Barry Warsaw wrote: >> >> On Aug 18, 2014, at 10:08 AM, Nick Coghlan wrote: >> >> >There's actually another aspect to your idea, independent of the naming: >> >exposing a view rather than just an iterator. I'm going to have to look >> > at >> >the implications for memoryview, but it may be a good way to go (and >> > would >> >align with the iterator -> view changes in dict). >> >> Yep! Maybe that will inspire a better spelling. :) > > > +1. It's just as much about b[i] as it is about "for c in b", so a view > sounds right. (The view would have to be mutable for bytearrays and for > writable memoryviews.) > > On the rest, it's sounding more and more as if we will just need to live > with both bytes(1000) and bytearray(1000). A warning sounds worse than a > deprecation to me. I'm fine with keeping bytearray(1000), since that works the same way in both Python 2 & 3, and doesn't seem likely to be invoked inadvertently. I'd still like to deprecate "bytes(1000)", since that does different things in Python 2 & 3, while "b'\x00' * 1000" does the same thing in both. 
$ python -c 'print("{!r}\n{!r}".format(bytes(10), b"\x00" * 10))' '10' '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' $ python3 -c 'print("{!r}\n{!r}".format(bytes(10), b"\x00" * 10))' b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' Hitting the deprecation warning in single-source code would seem to be a strong hint that you have a bug in one version or the other rather than being intended behaviour. > bytes.zeros(n) sounds fine to me; I value similar interfaces for bytes and > bytearray pretty highly. With "bytearray(1000)" sticking around indefinitely, I'm less concerned about adding a "zeros" constructor. > I'm lukewarm on bytes.byte(c); but bytes([c]) does bother me because a size > one list is (or at least feels) more expensive to allocate than a size one > bytes object. So, okay. So, here's an interesting thing I hadn't previously registered: we actually already have a fairly capable "bytesview" option, and have done since Stefan implemented "memoryview.cast" in 3.3. The trick lies in the 'c' format character for the struct module, which is parsed as a length 1 bytes object rather than as an integer: >>> data = bytearray(b"Hello world") >>> bytesview = memoryview(data).cast('c') >>> list(bytesview) [b'H', b'e', b'l', b'l', b'o', b' ', b'w', b'o', b'r', b'l', b'd'] >>> b''.join(bytesview) b'Hello world' >>> bytesview[0:5] = memoryview(b"olleH").cast('c') >>> list(bytesview) [b'o', b'l', b'l', b'e', b'H', b' ', b'w', b'o', b'r', b'l', b'd'] >>> b''.join(bytesview) b'olleH world' For the read-only case, it covers everything (iteration, indexing, slicing), for the writable view case, it doesn't cover changing the shape of the target array, and it doesn't cover assigning arbitrary buffer objects (you need to wrap them in a similar cast for memoryview to allow the assignment). 
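Nick's cast trick generalises to any buffer exporter with a tiny wrapper (the name `bytesview` is hypothetical, used only for this sketch):

```python
def bytesview(buf):
    # A view over any 1-byte-per-item buffer exporter that yields
    # length-1 bytes objects on indexing and iteration, built on
    # memoryview.cast('c') as described above.
    return memoryview(buf).cast('c')

view = bytesview(bytearray(b"Hello"))
assert view[0] == b"H"                      # indexing gives length-1 bytes
assert list(view) == [b"H", b"e", b"l", b"l", b"o"]
view[0] = b"J"                              # writable when the source is writable
assert bytes(view) == b"Jello"
```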
It's hardly the most *intuitive* spelling though - I was one of the reviewers for Stefan's memoryview rewrite back in 3.3, and I only made the connection today when looking to see how a view object like the one we were discussing elsewhere in the thread might be implemented as a facade over arbitrary memory buffers, rather than being specific to bytes and bytearray. If we went down the "bytesview" path, then a single new facade would cover not only the 3 builtins (bytes, bytearray, memoryview) but also any *other* buffer exporting type. If we so chose (at some point in the future, not as part of this PEP), such a type could allow additional bytes operations (like "count", "startswith" or "index") to be applied to arbitrary regions of memory without making a copy. We can't add those other operations to memoryview, since they don't make sense for an n-dimensional array. Regards, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From guido at python.org Tue Aug 19 18:46:24 2014 From: guido at python.org (Guido van Rossum) Date: Tue, 19 Aug 2014 09:46:24 -0700 Subject: [Python-Dev] Fwd: PEP 467: Minor API improvements for bytes & bytearray In-Reply-To: References: <20ABA271-203D-403E-9015-CC839B0CE02C@gmail.com> <295B5648-6A87-4E45-B632-36C09F2AEF04@stufft.io> <20140817174110.5ddd3d90@limelight.wooz.org> <20140817215547.GA9919@chromebot.unti> <20140817185236.77228385@limelight.wooz.org> <20140817195502.0e9acee3@limelight.wooz.org> <20140817202207.092a665d@limelight.wooz.org> Message-ID: On Tue, Aug 19, 2014 at 5:25 AM, Nick Coghlan wrote: > On 18 August 2014 10:45, Guido van Rossum wrote: > > On Sun, Aug 17, 2014 at 5:22 PM, Barry Warsaw wrote: > >> > >> On Aug 18, 2014, at 10:08 AM, Nick Coghlan wrote: > >> > >> >There's actually another aspect to your idea, independent of the > naming: > >> >exposing a view rather than just an iterator. 
I'm going to have to look > >> > at > >> >the implications for memoryview, but it may be a good way to go (and > >> > would > >> >align with the iterator -> view changes in dict). > >> > >> Yep! Maybe that will inspire a better spelling. :) > > > > > > +1. It's just as much about b[i] as it is about "for c in b", so a view > > sounds right. (The view would have to be mutable for bytearrays and for > > writable memoryviews.) > > > > On the rest, it's sounding more and more as if we will just need to live > > with both bytes(1000) and bytearray(1000). A warning sounds worse than a > > deprecation to me. > > I'm fine with keeping bytearray(1000), since that works the same way > in both Python 2 & 3, and doesn't seem likely to be invoked > inadvertently. > > I'd still like to deprecate "bytes(1000)", since that does different > things in Python 2 & 3, while "b'\x00' * 1000" does the same thing in > both. > I think any argument based on what "bytes" does in Python 2 is pretty weak, since Python 2's bytes is just an alias for str, so it has tons of behavior that differ -- why single this out? In Python 3, I really like bytes and bytearray to be as similar as possible, and that includes the constructor. > $ python -c 'print("{!r}\n{!r}".format(bytes(10), b"\x00" * 10))' > '10' > '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' > $ python3 -c 'print("{!r}\n{!r}".format(bytes(10), b"\x00" * 10))' > b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' > b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' > > Hitting the deprecation warning in single-source code would seem to be > a strong hint that you have a bug in one version or the other rather > than being intended behaviour. > > > bytes.zeros(n) sounds fine to me; I value similar interfaces for bytes > and > > bytearray pretty highly. > > With "bytearray(1000)" sticking around indefinitely, I'm less > concerned about adding a "zeros" constructor. > That's fine. 
> > I'm lukewarm on bytes.byte(c); but bytes([c]) does bother me because a > size > > one list is (or at least feels) more expensive to allocate than a size > one > > bytes object. So, okay. > > So, here's an interesting thing I hadn't previously registered: we > actually already have a fairly capable "bytesview" option, and have > done since Stefan implemented "memoryview.cast" in 3.3. The trick lies > in the 'c' format character for the struct module, which is parsed as > a length 1 bytes object rather than as an integer: > > >>> data = bytearray(b"Hello world") > >>> bytesview = memoryview(data).cast('c') > >>> list(bytesview) > [b'H', b'e', b'l', b'l', b'o', b' ', b'w', b'o', b'r', b'l', b'd'] > >>> b''.join(bytesview) > b'Hello world' > >>> bytesview[0:5] = memoryview(b"olleH").cast('c') > >>> list(bytesview) > [b'o', b'l', b'l', b'e', b'H', b' ', b'w', b'o', b'r', b'l', b'd'] > >>> b''.join(bytesview) > b'olleH world' > > For the read-only case, it covers everything (iteration, indexing, > slicing), for the writable view case, it doesn't cover changing the > shape of the target array, and it doesn't cover assigning arbitrary > buffer objects (you need to wrap them in a similar cast for memoryview > to allow the assignment). > > It's hardly the most *intuitive* spelling though - I was one of the > reviewers for Stefan's memoryview rewrite back in 3.3, and I only made > the connection today when looking to see how a view object like the > one we were discussing elsewhere in the thread might be implemented as > a facade over arbitrary memory buffers, rather than being specific to > bytes and bytearray. > Maybe the 'future' package can offer an iterbytes or bytesview implemented this way? > If we went down the "bytesview" path, then a single new facade would > cover not only the 3 builtins (bytes, bytearray, memoryview) but also > any *other* buffer exporting type. 
If we so chose (at some point in > the future, not as part of this PEP), such a type could allow > additional bytes operations (like "count", "startswith" or "index") to > be applied to arbitrary regions of memory without making a copy. Why call out "without making a copy" for operations that naturally don't have to copy anything? > We > can't add those other operations to memoryview, since they don't make > sense for an n-dimensional array. > I'm sorry for your efforts, but I'm getting more and more lukewarm about the entire PEP. -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From guido at python.org Tue Aug 19 19:02:32 2014 From: guido at python.org (Guido van Rossum) Date: Tue, 19 Aug 2014 10:02:32 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: References: Message-ID: The official policy is that we want them to go away, but reality so far has not budged. We will continue to hold our breath though. :-) On Tue, Aug 19, 2014 at 1:37 AM, Serhiy Storchaka wrote: > Builting open(), io classes, os and os.path functions and some other > functions in the stdlib support bytes paths as well as str paths. But many > functions doesn't. There are requests about adding this support ([1], [2]) > in some modules. It is easy (just call os.fsdecode() on argument) but I'm > not sure it is worth to do. Pathlib doesn't support bytes path and it looks > intentional. What is general policy about support of bytes path in the > stdlib? > > [1] http://bugs.python.org/issue19997 > [2] http://bugs.python.org/issue20797 > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ > guido%40python.org > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From benhoyt at gmail.com Tue Aug 19 19:31:54 2014 From: benhoyt at gmail.com (Ben Hoyt) Date: Tue, 19 Aug 2014 13:31:54 -0400 Subject: [Python-Dev] Bytes path support In-Reply-To: References: Message-ID: > The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-) Does that mean that new APIs should explicitly not support bytes? I'm thinking of os.scandir() (PEP 471), which I'm implementing at the moment. I was originally going to make it support bytes so it was compatible with listdir, but maybe that's a bad idea. Bytes paths are essentially broken on Windows. -Ben > On Tue, Aug 19, 2014 at 1:37 AM, Serhiy Storchaka wrote: >> >> Builtin open(), io classes, os and os.path functions and some other functions in the stdlib support bytes paths as well as str paths. But many functions don't. There are requests about adding this support ([1], [2]) in some modules. It is easy (just call os.fsdecode() on the argument) but I'm not sure it is worth doing. Pathlib doesn't support bytes paths, and that looks intentional. What is the general policy about support for bytes paths in the stdlib? >> >> [1] http://bugs.python.org/issue19997 >> [2] http://bugs.python.org/issue20797 From storchaka at gmail.com Tue Aug 19 19:34:03 2014 From: storchaka at gmail.com (Serhiy Storchaka) Date: Tue, 19 Aug 2014 20:34:03 +0300 Subject: [Python-Dev] Bytes path support In-Reply-To: References: Message-ID: 19.08.14 20:02, Guido van Rossum wrote: > The official policy is that we want them to go away, but reality so far > has not budged. We will continue to hold our breath though. :-) Does this mean that we should reject all proposals about adding bytes path support to existing functions (in particular issue19997 (imghdr) and issue20797 (zipfile))?
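The os.fsdecode() approach Serhiy mentions can be sketched as a thin wrapper (the function name below is illustrative, not a stdlib API); the surrogateescape error handler is what makes this lossless even for undecodable bytes:

```python
import os

def fsdecode_arg(path):
    """Illustrative sketch of the wrapper Serhiy describes: a str-only
    API can accept bytes paths by decoding with the filesystem encoding.
    os.fsdecode() leaves str untouched and uses surrogateescape for
    bytes, so undecodable bytes survive a round trip through str."""
    if isinstance(path, bytes):
        path = os.fsdecode(path)
    return path

assert fsdecode_arg("spam") == "spam"
assert fsdecode_arg(b"spam") == "spam"

# surrogateescape lets arbitrary bytes round-trip through str losslessly:
raw = b"caf\xe9"                                # not valid UTF-8
name = raw.decode("utf-8", "surrogateescape")   # undecodable byte -> '\udce9'
assert name == "caf\udce9"
assert name.encode("utf-8", "surrogateescape") == raw

if os.name == "posix":
    # os.fsdecode()/os.fsencode() give the same guarantee for paths on POSIX
    assert os.fsencode(os.fsdecode(raw)) == raw
```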
From benjamin at python.org Tue Aug 19 19:40:29 2014 From: benjamin at python.org (Benjamin Peterson) Date: Tue, 19 Aug 2014 10:40:29 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: References: Message-ID: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> On Tue, Aug 19, 2014, at 10:31, Ben Hoyt wrote: > > The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-) > > Does that mean that new APIs should explicitly not support bytes? I'm > thinking of os.scandir() (PEP 471), which I'm implementing at the > moment. I was originally going to make it support bytes so it was > compatible with listdir, but maybe that's a bad idea. Bytes paths are > essentially broken on Windows. Bytes paths are "essential" on Unix, though, so I don't think we should create new low-level APIs that don't support bytes. From benhoyt at gmail.com Tue Aug 19 19:43:07 2014 From: benhoyt at gmail.com (Ben Hoyt) Date: Tue, 19 Aug 2014 13:43:07 -0400 Subject: [Python-Dev] Bytes path support In-Reply-To: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> Message-ID: >> > The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-) >> >> Does that mean that new APIs should explicitly not support bytes? I'm >> thinking of os.scandir() (PEP 471), which I'm implementing at the >> moment. I was originally going to make it support bytes so it was >> compatible with listdir, but maybe that's a bad idea. Bytes paths are >> essentially broken on Windows. > > Bytes paths are "essential" on Unix, though, so I don't think we should > create new low-level APIs that don't support bytes. Fair enough. 
I don't quite understand, though -- why is the "official policy" to kill something that's "essential" on *nix? -Ben From tseaver at palladion.com Tue Aug 19 19:56:16 2014 From: tseaver at palladion.com (Tres Seaver) Date: Tue, 19 Aug 2014 13:56:16 -0400 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 08/19/2014 01:43 PM, Ben Hoyt wrote: >>>> The official policy is that we want them [support for bytes >>>> paths in stdlib functions] to go away, but reality so far has >>>> not budged. We will continue to hold our breath though. :-) >>> >>> Does that mean that new APIs should explicitly not support bytes? >>> I'm thinking of os.scandir() (PEP 471), which I'm implementing at >>> the moment. I was originally going to make it support bytes so it >>> was compatible with listdir, but maybe that's a bad idea. Bytes >>> paths are essentially broken on Windows. >> >> Bytes paths are "essential" on Unix, though, so I don't think we >> should create new low-level APIs that don't support bytes. > > Fair enough. I don't quite understand, though -- why is the "official > policy" to kill something that's "essential" on *nix? ISTM that the policy is based on a fantasy that "it looks like text to me in my use cases, so therefore it must be text for everyone." Tres. 
- -- =================================================================== Tres Seaver +1 540-429-0999 tseaver at palladion.com Palladion Software "Excellence by Design" http://palladion.com From benjamin at python.org Tue Aug 19 20:00:35 2014 From: benjamin at python.org (Benjamin Peterson) Date: Tue, 19 Aug 2014 11:00:35 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> Message-ID: <1408471235.2401664.154475973.533FA5D4@webmail.messagingengine.com> On Tue, Aug 19, 2014, at 10:43, Ben Hoyt wrote: > >> > The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-) > >> > >> Does that mean that new APIs should explicitly not support bytes? I'm > >> thinking of os.scandir() (PEP 471), which I'm implementing at the > >> moment. I was originally going to make it support bytes so it was > >> compatible with listdir, but maybe that's a bad idea. Bytes paths are > >> essentially broken on Windows. > > > > Bytes paths are "essential" on Unix, though, so I don't think we should > > create new low-level APIs that don't support bytes. > > Fair enough. I don't quite understand, though -- why is the "official > policy" to kill something that's "essential" on *nix? Well, notice the official policy is desperately *wanting* them to go away with the implication that we grudgingly bow to reality.
:) From antoine at python.org Tue Aug 19 20:06:29 2014 From: antoine at python.org (Antoine Pitrou) Date: Tue, 19 Aug 2014 14:06:29 -0400 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> Message-ID: On 19/08/2014 13:43, Ben Hoyt wrote: >>>> The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-) >>> >>> Does that mean that new APIs should explicitly not support bytes? I'm >>> thinking of os.scandir() (PEP 471), which I'm implementing at the >>> moment. I was originally going to make it support bytes so it was >>> compatible with listdir, but maybe that's a bad idea. Bytes paths are >>> essentially broken on Windows. >> >> Bytes paths are "essential" on Unix, though, so I don't think we should >> create new low-level APIs that don't support bytes. > > Fair enough. I don't quite understand, though -- why is the "official > policy" to kill something that's "essential" on *nix? PEP 383 should actually work on Unix quite well, AFAIR. Regards Antoine. From marko at pacujo.net Tue Aug 19 20:16:40 2014 From: marko at pacujo.net (Marko Rauhamaa) Date: Tue, 19 Aug 2014 21:16:40 +0300 Subject: [Python-Dev] Bytes path support In-Reply-To: (Tres Seaver's message of "Tue, 19 Aug 2014 13:56:16 -0400") References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> Message-ID: <87wqa45oqf.fsf@elektro.pacujo.net> Tres Seaver : > On 08/19/2014 01:43 PM, Ben Hoyt wrote: >> Fair enough. I don't quite understand, though -- why is the "official >> policy" to kill something that's "essential" on *nix? > > ISTM that the policy is based on a fantasy that "it looks like text to > me in my use cases, so therefore it must be text for everyone."
What I like about Python is that it allows me to write native linux code without having to make portability compromises that plague, say, Java. I have select.epoll(). I have os.fork(). I have socket.TCP_CORK. The "textualization" of Python3 seems part of a conscious effort to make Python more Java-esque. Marko From stephen at xemacs.org Tue Aug 19 20:44:14 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 20 Aug 2014 03:44:14 +0900 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> Message-ID: <87zjf0e2v5.fsf@uwakimon.sk.tsukuba.ac.jp> Ben Hoyt writes: > Fair enough. I don't quite understand, though -- why is the "official > policy" to kill something that's "essential" on *nix? They're not essential on *nix. Unix paths at the OS level are "just bytes" (even on Mac, although the most common Mac filesystem does enforce UTF-8 Unicode NFD). This use case is now perfectly well served by codecs. However, there are a lot of applications that involve reading a file name from a directory, and passing it verbatim to another OS function. This case can be handled now using the surrogateescape error handler, but when these APIs were introduced we didn't even have a reliable way to roundtrip filenames because a Unix filename doesn't need to be a string of characters from *any* character set. And there's the undeniable convenience of treating file names as opaque objects in those applications. Regards, From greg.ewing at canterbury.ac.nz Wed Aug 20 00:01:11 2014 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 20 Aug 2014 10:01:11 +1200 Subject: [Python-Dev] Bytes path support In-Reply-To: References: Message-ID: <53F3C927.4040807@canterbury.ac.nz> Ben Hoyt wrote: > Does that mean that new APIs should explicitly not support bytes? > ... Bytes paths are essentially broken on Windows. But on Unix, paths are essentially bytes. 
What's the official policy for dealing with that? -- Greg From greg.ewing at canterbury.ac.nz Wed Aug 20 00:09:24 2014 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Wed, 20 Aug 2014 10:09:24 +1200 Subject: [Python-Dev] Bytes path support In-Reply-To: <87zjf0e2v5.fsf@uwakimon.sk.tsukuba.ac.jp> References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87zjf0e2v5.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <53F3CB14.2010101@canterbury.ac.nz> Stephen J. Turnbull wrote: > This case can be handled now using the surrogateescape > error handler, So maybe the way to make bytes paths go away is to always use surrogateescape for paths on unix? -- Greg From guido at python.org Wed Aug 20 01:44:05 2014 From: guido at python.org (Guido van Rossum) Date: Tue, 19 Aug 2014 16:44:05 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: <53F3CB14.2010101@canterbury.ac.nz> References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87zjf0e2v5.fsf@uwakimon.sk.tsukuba.ac.jp> <53F3CB14.2010101@canterbury.ac.nz> Message-ID: I'm sorry my moment of levity was taken so seriously. With my serious hat on, I would like to claim that *conceptually* filenames are most definitely text. Due to various historical accidents the UNIX system calls often encoded text as arguments, and we sometimes need to control that encoding. Hence the occasional need for bytes arguments. But most of the time you don't have to think about that, and forcing users to worry about it is mostly as counter-productive as forcing to think about the encoding of every text file. -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From stephen at xemacs.org Wed Aug 20 07:01:10 2014 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Wed, 20 Aug 2014 14:01:10 +0900 Subject: [Python-Dev] Bytes path support In-Reply-To: <53F3CB14.2010101@canterbury.ac.nz> References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87zjf0e2v5.fsf@uwakimon.sk.tsukuba.ac.jp> <53F3CB14.2010101@canterbury.ac.nz> Message-ID: <87wqa3eovd.fsf@uwakimon.sk.tsukuba.ac.jp> Greg Ewing writes: > Stephen J. Turnbull wrote: > > > This case can be handled now using the surrogateescape > > error handler, > > So maybe the way to make bytes paths go away is to always > use surrogateescape for paths on unix? Backward compatibility rules that out, I think. I certainly would recommend that for new code, but even for new code there are many users who vehemently object to using Unicode as an intermediate representation of things they think of as binary blobs. Not worth the hassle to even seriously propose removing those APIs IMO. From guido at python.org Wed Aug 20 07:06:03 2014 From: guido at python.org (Guido van Rossum) Date: Tue, 19 Aug 2014 22:06:03 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: <87wqa3eovd.fsf@uwakimon.sk.tsukuba.ac.jp> References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87zjf0e2v5.fsf@uwakimon.sk.tsukuba.ac.jp> <53F3CB14.2010101@canterbury.ac.nz> <87wqa3eovd.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Tuesday, August 19, 2014, Stephen J. Turnbull wrote: > Greg Ewing writes: > > Stephen J. Turnbull wrote: > > > > > This case can be handled now using the surrogateescape > > > error handler, > > > > So maybe the way to make bytes paths go away is to always > > use surrogateescape for paths on unix? > > Backward compatibility rules that out, I think. I certainly would > recommend that for new code, but even for new code there are many > users who vehemently object to using Unicode as an intermediate > representation of things they think of as binary blobs. Not worth the > hassle to even seriously propose removing those APIs IMO. 
But maybe we don't have to add new ones? --Guido -- --Guido van Rossum (on iPad) -------------- next part -------------- An HTML attachment was scrubbed... URL: From marko at pacujo.net Wed Aug 20 07:52:19 2014 From: marko at pacujo.net (Marko Rauhamaa) Date: Wed, 20 Aug 2014 08:52:19 +0300 Subject: [Python-Dev] Bytes path support In-Reply-To: (Guido van Rossum's message of "Tue, 19 Aug 2014 16:44:05 -0700") References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87zjf0e2v5.fsf@uwakimon.sk.tsukuba.ac.jp> <53F3CB14.2010101@canterbury.ac.nz> Message-ID: <87y4uj4sj0.fsf@elektro.pacujo.net> Guido van Rossum : > With my serious hat on, I would like to claim that *conceptually* > filenames are most definitely text. Due to various historical > accidents the UNIX system calls often encoded text as arguments, and > we sometimes need to control that encoding. Due to historical accidents, text (in the Python sense) is not a first-class data type in Unix. Text, machine language, XML, Python etc are interpretations of bytes. Bytes are the first-class data type recognized by the kernel. That reality cannot be wished away. > Hence the occasional need for bytes arguments. But most of the time > you don't have to think about that, and forcing users to worry about > it is mostly as counter-productive as forcing to think about the > encoding of every text file. The users of Python programs can often be given higher-level facades. Unix programmers, though, shouldn't be shielded from bytes. Marko From stephen at xemacs.org Wed Aug 20 08:38:01 2014 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Wed, 20 Aug 2014 15:38:01 +0900 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87zjf0e2v5.fsf@uwakimon.sk.tsukuba.ac.jp> <53F3CB14.2010101@canterbury.ac.nz> <87wqa3eovd.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87sikrekdy.fsf@uwakimon.sk.tsukuba.ac.jp> Guido van Rossum writes: > On Tuesday, August 19, 2014, Stephen J. Turnbull wrote: > > Greg Ewing writes: > > > So maybe the way to make bytes paths go away is to always > > > use surrogateescape for paths on unix? > > > > Backward compatibility rules that out, I think. I certainly would > > recommend that for new code, but even for new code there are many > > users who vehemently object to using Unicode as an intermediate > > representation of things they think of as binary blobs. Not worth the > > hassle to even seriously propose removing those APIs IMO. > > But maybe we don't have to add new ones? IMO, we should avoid it. There may be some use cases. Sergiy mentions two bug reports. http://bugs.python.org/issue19997 imghdr.what doesn't accept bytes paths http://bugs.python.org/issue20797 zipfile.extractall should accept bytes path as parameter I'm very unsympathetic to these. In both cases the bytes are coming from outside of module in question. Why are they in bytes? That question should scare you, because from the point of view of end users there are no good answers: they all mean that the end user is going to end up with uninterpretable bytes in their directories, for the convenience of the programmer. In the case of issue20797, I'd be a *little* sympathetic if the RFE were for the *members* argument. zipfiles evidently have no way to specify the encodings of the name(s) of their members (and the zipfile module doesn't have APIs for it!), so the programmer is kind of stuck, especially if the requirement is that the extraction require no user intervention. But again, this is rarely what the user wants. 
I would be sympathetic to an internal, bytes-based, "kids these stunts are performed by trained professionals do NOT try this at home" API, with a sane user-oriented str-based API for ordinary use for this module. I suppose it might be useful for such a multi-type API to be polymorphic, but it would have to be a "if there are bytes anywhere, everything must be bytes and return values will be bytes" and similarly for str kind of polymorphism. No mixing bytes and strings, period. From stephen at xemacs.org Wed Aug 20 08:43:32 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Wed, 20 Aug 2014 15:43:32 +0900 Subject: [Python-Dev] Bytes path support In-Reply-To: <87y4uj4sj0.fsf@elektro.pacujo.net> References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87zjf0e2v5.fsf@uwakimon.sk.tsukuba.ac.jp> <53F3CB14.2010101@canterbury.ac.nz> <87y4uj4sj0.fsf@elektro.pacujo.net> Message-ID: <87r40bek4r.fsf@uwakimon.sk.tsukuba.ac.jp> Marko Rauhamaa writes: > Unix programmers, though, shouldn't be shielded from bytes. Nobody's trying to do that. But Python users should be shielded from Unix programmers. From ben+python at benfinney.id.au Wed Aug 20 08:53:26 2014 From: ben+python at benfinney.id.au (Ben Finney) Date: Wed, 20 Aug 2014 16:53:26 +1000 Subject: [Python-Dev] Bytes path support References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87zjf0e2v5.fsf@uwakimon.sk.tsukuba.ac.jp> <53F3CB14.2010101@canterbury.ac.nz> <87y4uj4sj0.fsf@elektro.pacujo.net> <87r40bek4r.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <8561hn3b4p.fsf@benfinney.id.au> "Stephen J. Turnbull" writes: > Marko Rauhamaa writes: > > Unix programmers, though, shouldn't be shielded from bytes. > > Nobody's trying to do that. But Python users should be shielded from > Unix programmers. +1 QotW -- \ ?Intellectual property is to the 21st century what the slave | `\ trade was to the 16th.? 
?David Mertz | _o__) | Ben Finney From p.f.moore at gmail.com Wed Aug 20 13:00:38 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Wed, 20 Aug 2014 12:00:38 +0100 Subject: [Python-Dev] Bytes path support In-Reply-To: <8561hn3b4p.fsf@benfinney.id.au> References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87zjf0e2v5.fsf@uwakimon.sk.tsukuba.ac.jp> <53F3CB14.2010101@canterbury.ac.nz> <87y4uj4sj0.fsf@elektro.pacujo.net> <87r40bek4r.fsf@uwakimon.sk.tsukuba.ac.jp> <8561hn3b4p.fsf@benfinney.id.au> Message-ID: On 20 August 2014 07:53, Ben Finney wrote: > "Stephen J. Turnbull" writes: > >> Marko Rauhamaa writes: >> > Unix programmers, though, shouldn't be shielded from bytes. >> >> Nobody's trying to do that. But Python users should be shielded from >> Unix programmers. > > +1 QotW That quote is actually almost a "hidden extra Zen of Python" IMO :-) Both parts of it. Paul From ncoghlan at gmail.com Wed Aug 20 13:08:16 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 20 Aug 2014 21:08:16 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: <87wqa45oqf.fsf@elektro.pacujo.net> References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87wqa45oqf.fsf@elektro.pacujo.net> Message-ID: On 20 Aug 2014 04:18, "Marko Rauhamaa" wrote: > > Tres Seaver : > > > On 08/19/2014 01:43 PM, Ben Hoyt wrote: > >> Fair enough. I don't quite understand, though -- why is the "official > >> policy" to kill something that's "essential" on *nix? > > > > ISTM that the policy is based on a fantasy that "it looks like text to > > me in my use cases, so therefore it must be text for everyone." > > What I like about Python is that it allows me to write native linux code > without having to make portability compromises that plague, say, Java. I > have select.epoll(). I have os.fork(). I have socket.TCP_CORK. The > "textualization" of Python3 seems part of a conscious effort to make > Python more Java-esque. 
It's not just the JVM that says text and binary APIs should be separate - it's every widely used operating system services layer except POSIX. The POSIX way works well *if* everyone reliably encodes things as UTF-8 or always uses encoding detection, but its failure mode is unfortunately silent data corruption. That said, there's a lot of Python software that is POSIX specific, where bytes paths would be the least of the barriers to porting to Windows or Jython. I'm personally +1 on consistently allowing binary paths in lower level APIs, but disallowing them in higher level explicitly cross platform abstractions like pathlib. Regards, Nick. > > > Marko From antoine at python.org Wed Aug 20 15:01:40 2014 From: antoine at python.org (Antoine Pitrou) Date: Wed, 20 Aug 2014 09:01:40 -0400 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87wqa45oqf.fsf@elektro.pacujo.net> Message-ID: On 20/08/2014 07:08, Nick Coghlan wrote: > > It's not just the JVM that says text and binary APIs should be separate > - it's every widely used operating system services layer except POSIX. > The POSIX way works well *if* everyone reliably encodes things as UTF-8 > or always uses encoding detection, but its failure mode is unfortunately > silent data corruption. > > That said, there's a lot of Python software that is POSIX specific, > where bytes paths would be the least of the barriers to porting to > Windows or Jython. 
I'm personally +1 on consistently allowing binary > paths in lower level APIs, but disallowing them in higher level > explicitly cross platform abstractions like pathlib. I fully agree with Nick's position here. To elaborate specifically about pathlib, it doesn't handle bytes paths but allows you to generate them if desired: https://docs.python.org/3/library/pathlib.html#operators Adding full bytes support to pathlib would have added a lot of complication and fragility in the implementation *and* in the API (is it allowed to combine str and bytes paths? should they have separate classes?), for arguably little benefit. I think if you want low-level features (such as unconverted bytes paths under POSIX), it is reasonable to point you to low-level APIs. Regards Antoine. From brett at python.org Wed Aug 20 16:04:20 2014 From: brett at python.org (Brett Cannon) Date: Wed, 20 Aug 2014 14:04:20 +0000 Subject: [Python-Dev] Bytes path support References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87wqa45oqf.fsf@elektro.pacujo.net> Message-ID: On Wed Aug 20 2014 at 9:02:25 AM Antoine Pitrou wrote: > On 20/08/2014 07:08, Nick Coghlan wrote: > > > > It's not just the JVM that says text and binary APIs should be separate > > - it's every widely used operating system services layer except POSIX. > > The POSIX way works well *if* everyone reliably encodes things as UTF-8 > > or always uses encoding detection, but its failure mode is unfortunately > > silent data corruption. > > > > That said, there's a lot of Python software that is POSIX specific, > > where bytes paths would be the least of the barriers to porting to > > Windows or Jython. I'm personally +1 on consistently allowing binary > > paths in lower level APIs, but disallowing them in higher level > > explicitly cross platform abstractions like pathlib. > > I fully agree with Nick's position here. 
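The pathlib operators section Antoine links to is exactly this escape hatch; a quick check of what it provides:

```python
import os
from pathlib import PurePosixPath

# pathlib itself stays str-based, but a path object can be converted to
# bytes explicitly when raw bytes are really needed; bytes(p) is
# documented as equivalent to os.fsencode() on the path's string form.
p = PurePosixPath("/etc") / "hosts"
assert str(p) == "/etc/hosts"
assert bytes(p) == b"/etc/hosts"
assert bytes(p) == os.fsencode(str(p))
```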
> > To elaborate specifically about pathlib, it doesn't handle bytes > paths but allows you to generate them if desired: > https://docs.python.org/3/library/pathlib.html#operators > > Adding full bytes support to pathlib would have added a lot of > complication and fragility in the implementation *and* in the API (is it > allowed to combine str and bytes paths? should they have separate > classes?), for arguably little benefit. > > I think if you want low-level features (such as unconverted bytes paths > under POSIX), it is reasonable to point you to low-level APIs. > +1 from me as well. Allowing the low-level stuff to work on bytes but keeping high-level actually high-level keeps with our consenting adults policy as well as making things possible, but not at the detriment of the common case. From tjreedy at udel.edu Wed Aug 20 20:41:26 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Wed, 20 Aug 2014 14:41:26 -0400 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87wqa45oqf.fsf@elektro.pacujo.net> Message-ID: On 8/20/2014 9:01 AM, Antoine Pitrou wrote: > On 20/08/2014 07:08, Nick Coghlan wrote: >> >> It's not just the JVM that says text and binary APIs should be separate >> - it's every widely used operating system services layer except POSIX. >> The POSIX way works well *if* everyone reliably encodes things as UTF-8 >> or always uses encoding detection, but its failure mode is unfortunately >> silent data corruption. >> >> That said, there's a lot of Python software that is POSIX specific, >> where bytes paths would be the least of the barriers to porting to >> Windows or Jython. I'm personally +1 on consistently allowing binary >> paths in lower level APIs, but disallowing them in higher level >> explicitly cross platform abstractions like pathlib. > > I fully agree with Nick's position here. 
> > To elaborate specifically about pathlib, it doesn't handle bytes paths > but allows you to generate them if desired: > https://docs.python.org/3/library/pathlib.html#operators > > Adding full bytes support to pathlib would have added a lot of > complication and fragility in the implementation *and* in the API (is it > allowed to combine str and bytes paths? should they have separate > classes?), for arguably little benefit. I am glad you did not recreate the madness of pre 3.0 Python in that regard. > I think if you want low-level features (such as unconverted bytes paths > under POSIX), it is reasonable to point you to low-level APIs. Do our docs somewhere explain the idea that file names are conceptually *names*, not arbitrary bytes; explain the concept of low-level versus high-level APIs, and point to the two types of APIs in Python? -- Terry Jan Reedy From greg.ewing at canterbury.ac.nz Thu Aug 21 00:18:11 2014 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Thu, 21 Aug 2014 10:18:11 +1200 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87wqa45oqf.fsf@elektro.pacujo.net> Message-ID: <53F51EA3.8050708@canterbury.ac.nz> Antoine Pitrou wrote: > I think if you want low-level features (such as unconverted bytes paths > under POSIX), it is reasonable to point you to low-level APIs. The problem with scandir() in particular is that there is currently *no* low-level API exposed that gives the same functionality. If scandir() is not to support bytes paths, I'd suggest exposing the opendir() and readdir() system calls with bytes path support. 
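For contrast with the high-level pathlib policy, the low-level convention that listdir already follows is "bytes in, bytes out": the result type mirrors the argument type.

```python
import os
import tempfile

d = tempfile.mkdtemp()
open(os.path.join(d, "hello.txt"), "w").close()

# str argument -> str results; bytes argument -> bytes results.
assert os.listdir(d) == ["hello.txt"]
assert os.listdir(os.fsencode(d)) == [b"hello.txt"]
```

(For what it's worth, the scandir() that eventually landed in the stdlib followed this same listdir convention.)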
-- Greg From ncoghlan at gmail.com Thu Aug 21 00:31:52 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 21 Aug 2014 08:31:52 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: <53F51EA3.8050708@canterbury.ac.nz> References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87wqa45oqf.fsf@elektro.pacujo.net> <53F51EA3.8050708@canterbury.ac.nz> Message-ID: On 21 Aug 2014 08:19, "Greg Ewing" wrote: > > Antoine Pitrou wrote: >> >> I think if you want low-level features (such as unconverted bytes paths under POSIX), it is reasonable to point you to low-level APIs. > > > The problem with scandir() in particular is that there is > currently *no* low-level API exposed that gives the same > functionality. > > If scandir() is not to support bytes paths, I'd suggest > exposing the opendir() and readdir() system calls with > bytes path support. scandir is low level (the entire os module is low level). In fact, aside from pathlib, I'd consider pretty much every API we have that deals with paths to be low level - that's a large part of the reason we needed pathlib! Cheers, Nick. > > -- > Greg > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Thu Aug 21 01:04:34 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 20 Aug 2014 16:04:34 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87wqa45oqf.fsf@elektro.pacujo.net> Message-ID: > > but disallowing them in higher level >> > explicitly cross platform abstractions like pathlib. 
>> > I think the trick here is that posix-using folks claim that filenames are just bytes, and indeed they can be passed around with a char*, so they seem to be. But you can't possibly do anything other than pass them around if you REALLY think they are just bytes. So really, people treat them as "bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and maybe a couple others)-is-ascii-compatible" If you assume that, then you could write a pathlib that would work. And in practice, I expect a lot of designed-only-for-POSIX code works that way. But of course, this gets ugly if you go to a platform where filenames are not "bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and maybe a couple others)-is-ascii-compatible", like Windows. I'm not sure if it's worth having a pathlib, etc. that uses this assumption -- but it could help us all write code that actually works with this screwy lack of specification. Antoine Pitrou wrote: > To elaborate specifically about pathlib, it doesn't handle bytes paths > but allows you to generate them if desired: > https://docs.python.org/3/library/pathlib.html#operators but that uses os.fsencode: Encode filename to the filesystem encoding As I understand it, the whole problem with some posix systems is that there is NO filesystem encoding -- i.e. you can't know for sure what encoding a filename is in. So you need to be able to pass the bytes through as they are. (At least as I read Armin Ronacher's blog) -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed...
URL: From ncoghlan at gmail.com Thu Aug 21 01:26:51 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 21 Aug 2014 09:26:51 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87wqa45oqf.fsf@elektro.pacujo.net> Message-ID: On 21 Aug 2014 09:06, "Chris Barker" wrote: > > As I understand it, the whole problem with some posix systems is that there is NO filesystem encoding -- i.e. you can't know for sure what encoding a filename is in. So you need to be able to pass the bytes through as they are. > > (At least as I read Armin Ronacher's blog) Armin lets his astonishment at the idea we'd expect Linux vendors to fix their broken OS get the better of him at times - he thinks the responsibility lies entirely with us to work around its quirks and limitations :) The "surrogateescape" codec is our main answer to the unreliability of the POSIX encoding model - fsdecode will squirrel away arbitrary bytes in the private use area, and then fsencode will restore them again later. That works for the simple round tripping case, but we currently lack good default tools for "cleaning" strings that may contain surrogates (or even scanning a string to see if surrogates are present). One idea I had along those lines is a surrogatereplace error handler ( http://bugs.python.org/issue22016) that emitted an ASCII question mark for each smuggled byte, rather than propagating the encoding problem. Cheers, Nick. > > -Chris > > > -- > > Christopher Barker, Ph.D. 
> Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ethan at stoneleaf.us Thu Aug 21 01:33:27 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Wed, 20 Aug 2014 16:33:27 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87wqa45oqf.fsf@elektro.pacujo.net> <53F51EA3.8050708@canterbury.ac.nz> Message-ID: <53F53047.7020104@stoneleaf.us> On 08/20/2014 03:31 PM, Nick Coghlan wrote: > > On 21 Aug 2014 08:19, "Greg Ewing" > wrote: >> >> Antoine Pitrou wrote: >>> >>> I think if you want low-level features (such as unconverted bytes paths under POSIX), it is reasonable to point you to low-level APIs. >> >> >> The problem with scandir() in particular is that there is >> currently *no* low-level API exposed that gives the same >> functionality. >> >> If scandir() is not to support bytes paths, I'd suggest >> exposing the opendir() and readdir() system calls with >> bytes path support. > > scandir is low level (the entire os module is low level). In fact, aside from pathlib, I'd consider pretty much every > API we have that deals with paths to be low level - that's a large part of the reason we needed pathlib! If scandir is low-level, and the low-level APIs are the ones that should support bytes paths, then scandir should support bytes paths. Is that what you meant to say?
-- ~Ethan~ From ncoghlan at gmail.com Thu Aug 21 02:15:15 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 21 Aug 2014 10:15:15 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: <53F53047.7020104@stoneleaf.us> References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87wqa45oqf.fsf@elektro.pacujo.net> <53F51EA3.8050708@canterbury.ac.nz> <53F53047.7020104@stoneleaf.us> Message-ID: On 21 August 2014 09:33, Ethan Furman wrote: > On 08/20/2014 03:31 PM, Nick Coghlan wrote: >> On 21 Aug 2014 08:19, "Greg Ewing" > > wrote: >>> >>> >>> Antoine Pitrou wrote: >>>> >>>> >>>> I think if you want low-level features (such as unconverted bytes paths >>>> under POSIX), it is reasonable to point you to low-level APIs. >>> >>> >>> >>> The problem with scandir() in particular is that there is >>> currently *no* low-level API exposed that gives the same >>> functionality. >>> >>> If scandir() is not to support bytes paths, I'd suggest >>> exposing the opendir() and readdir() system calls with >>> bytes path support. >> >> >> scandir is low level (the entire os module is low level). In fact, aside >> from pathlib, I'd consider pretty much every >> API we have that deals with paths to be low level - that's a large part of >> the reason we needed pathlib! > > > If scandir is low-level, and the low-level API's are the ones that should > support bytes paths, then scandir should support bytes paths. > > Is that what you meant to say? Yes. The discussions around PEP 471 *deferred* discussions of bytes and file descriptor support to their own RFEs (not needing a PEP), they didn't decide definitively not to support them. So Serhiy's thread is entirely pertinent to that question. Note that adding bytes support still *should not* hold up the initial PEP 471 implementation - it should be done as a follow on RFE. Cheers, Nick. 
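The str-in/str-out, bytes-in/bytes-out convention that os.listdir() already follows -- and that a bytes-capable scandir() would be expected to mirror -- can be sketched in a few lines (illustrative only, not part of the PEP 471 implementation):

```python
import os
import tempfile

# os.listdir() mirrors the argument type in its results: a str path
# yields str names, a bytes path yields bytes names.  A bytes-capable
# scandir() would be expected to follow the same convention.
d = tempfile.mkdtemp()
open(os.path.join(d, "example.txt"), "w").close()

str_names = os.listdir(d)                 # list of str
bytes_names = os.listdir(os.fsencode(d))  # list of bytes

assert str_names == ["example.txt"]
assert bytes_names == [b"example.txt"]
```

The two result lists are related by os.fsdecode()/os.fsencode(), which is what makes deferring bytes support to a follow-on RFE workable: nothing about the str-based API has to change.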
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ethan at stoneleaf.us Thu Aug 21 02:25:24 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Wed, 20 Aug 2014 17:25:24 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87wqa45oqf.fsf@elektro.pacujo.net> <53F51EA3.8050708@canterbury.ac.nz> <53F53047.7020104@stoneleaf.us> Message-ID: <53F53C74.7010204@stoneleaf.us> On 08/20/2014 05:15 PM, Nick Coghlan wrote: > On 21 August 2014 09:33, Ethan Furman wrote: >> On 08/20/2014 03:31 PM, Nick Coghlan wrote: >>> >>> scandir is low level (the entire os module is low level). In fact, aside >>> from pathlib, I'd consider pretty much every >>> API we have that deals with paths to be low level - that's a large part of >>> the reason we needed pathlib! >> >> If scandir is low-level, and the low-level API's are the ones that should >> support bytes paths, then scandir should support bytes paths. >> >> Is that what you meant to say? > > Yes. The discussions around PEP 471 *deferred* discussions of bytes > and file descriptor support to their own RFEs (not needing a PEP), > they didn't decide definitively not to support them. So Serhiy's > thread is entirely pertinent to that question. Thanks for clearing that up. I hate feeling confused. ;) > Note that adding bytes support still *should not* hold up the initial > PEP 471 implementation - it should be done as a follow on RFE. Agreed. -- ~Ethan~ From joseph.martinot-lagarde at m4x.org Thu Aug 21 02:27:10 2014 From: joseph.martinot-lagarde at m4x.org (Joseph Martinot-Lagarde) Date: Thu, 21 Aug 2014 02:27:10 +0200 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? 
In-Reply-To: References: <20140817092919.1f41e88a@limelight.wooz.org> Message-ID: <53F53CDE.80601@m4x.org> Le 18/08/2014 03:02, Guido van Rossum a écrit : > On Sun, Aug 17, 2014 at 6:29 AM, Barry Warsaw > wrote: > > On Aug 16, 2014, at 07:43 PM, Guido van Rossum wrote: > > >(Don't understand this to mean that we should never deprecate things. > >Deprecations will happen, they are necessary for the evolution of any > >programming language. But they won't ever hurt in the way that > Python 3 > >hurt.) > > It would be useful to explore what causes the most pain in the 2->3 > transition? IMHO, it's not the deprecations or changes such as print -> > print(). It's the bytes/str split - a fundamental change to core > and common > data types. The question then is whether you foresee any similar > looming > pervasive change? [*] > > > I'm unsure about what's the single biggest pain moving to Python 3. In > the past I would have said that it's for sure the bytes/str split (which > is both the biggest pain and the biggest payoff). The pain was even bigger because in addition to the change in underlying types, the names of the types were not compatible between the python versions. I often try to write compatible code between python2 and 3, and I can't use "str" because it doesn't have the same meaning in both versions, I cannot use "unicode" because it disappeared in python3, and I can't use "byte" because it doesn't exist in python2. Add __str__ and __unicode__ to the mix and then you get the real pain. Actually "str" is still useful in the cases where a library is byte-only in python2 and unicode-only in python3 (hello, locale.setlocale()).
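The renaming pain described above is what bridge libraries such as six paper over with aliases. A minimal sketch of the idiom (the names text_type/binary_type match six's; everything else here is illustrative):

```python
import sys

# The aliasing trick bridge libraries use so that one pair of names
# means "text" and "bytes" on both Python 2 and Python 3.
if sys.version_info[0] >= 3:
    text_type = str        # unicode text
    binary_type = bytes    # raw bytes
else:  # Python 2
    text_type = unicode    # noqa: F821 -- only defined on Python 2
    binary_type = str      # note: 2.6+ also aliases bytes to str

assert isinstance("caf\u00e9", text_type)
assert isinstance(b"cafe", binary_type)
```

With these two names in place, isinstance checks and __str__/__unicode__ shims can be written once instead of per-version.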
Joseph From benhoyt at gmail.com Thu Aug 21 03:22:12 2014 From: benhoyt at gmail.com (Ben Hoyt) Date: Wed, 20 Aug 2014 21:22:12 -0400 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87wqa45oqf.fsf@elektro.pacujo.net> <53F51EA3.8050708@canterbury.ac.nz> <53F53047.7020104@stoneleaf.us> Message-ID: >> If scandir is low-level, and the low-level API's are the ones that should >> support bytes paths, then scandir should support bytes paths. >> >> Is that what you meant to say? > > Yes. The discussions around PEP 471 *deferred* discussions of bytes > and file descriptor support to their own RFEs (not needing a PEP), > they didn't decide definitively not to support them. So Serhiy's > thread is entirely pertinent to that question. > > Note that adding bytes support still *should not* hold up the initial > PEP 471 implementation - it should be done as a follow on RFE. I agree with this (that scandir is low level and should support bytes). As it happens, I'm implementing bytes support as well -- what with the path_t support in posixmodule.c and the listdir implementation to go on, it's not really any harder. So I think we'll have it right off the bat. BTW, the Windows implementation of PEP 471 is basically done, and the POSIX implementation is written but not working yet. And then there's tests and docs. -Ben From stephen at xemacs.org Thu Aug 21 04:16:27 2014 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Thu, 21 Aug 2014 11:16:27 +0900 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87wqa45oqf.fsf@elektro.pacujo.net> Message-ID: <87k362egec.fsf@uwakimon.sk.tsukuba.ac.jp> Nick Coghlan writes: > One idea I had along those lines is a surrogatereplace error handler ( > http://bugs.python.org/issue22016) that emitted an ASCII question mark for > each smuggled byte, rather than propagating the encoding problem. Please, don't. "Smuggled bytes" are not independent events. They tend to be correlated *within* file names, and this handler would generate names whose human semantics get lost (and there *are* human semantics, otherwise the name would be str(some_counter)). They tend to be correlated across file names, and this handler will generate multiple files with the same munged name (and again, the differentiating human semantics get lost). If you don't know the semantics of the intended file names, you can't generate good replacement names. This has to be an application-level function, and often requires user intervention to get good names. If you want to provide helper functions that applications can use to clean names explicitly, that might be OK. From cs at zip.com.au Thu Aug 21 06:52:19 2014 From: cs at zip.com.au (Cameron Simpson) Date: Thu, 21 Aug 2014 14:52:19 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: References: Message-ID: <20140821045219.GA81021@cskk.homeip.net> On 20Aug2014 16:04, Chris Barker - NOAA Federal wrote: >> but disallowing them in higher level >>> > explicitly cross platform abstractions like pathlib. >> >I think the trick here is that posix-using folks claim that filenames are >just bytes, and indeed they can be passed around with a char*, so they seem >to be. > >but you can't possible do anything other than pass them around if you >REALLY think they are just bytes. 
> >So really, people treat them as >"bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and >maybe a couple others)-is-ascii-compatible" As someone who fought long and hard in the surrogate-escape listdir() wars, and was won over once the scheme was thoroughly explained to me, I take issue with these assertions: they are bogus or misleading. Firstly, POSIX filenames _are_ just byte strings. The only forbidden character is the NUL byte, which terminates a C string, and the only special character is the slash, which separates pathname components. Second, a bare low level program cannot do _much_ more than pass them around. It certainly can do things like compute their basename, or other path related operations. The "bytes in some arbitrary encoding where at least the slash character (and maybe a couple others) is ascii compatible" notion is completely bogus. There's only one special byte, the slash (code 47). There's no OS-level need that it or anything else be ASCII compatible. I think characterisations such as the one quoted are actively misleading. The way you get UTF-8 (or some other encoding, fortunately getting less and less common) is by convention: you decide in your environment to work in some encoding (say utf-8) via the locale variables, and all your user-facing text gets used in UTF-8 encoding form when turned into bytes for the filename calls because your text<->bytes methods say to do so. I think we'd all agree it is nice to have a system where filenames are all Unicode, but since POSIX/UNIX predates it by decades it is a bit late to ignore the reality for such systems. I certainly think the Windows-side Babel of code pages and multiple code systems is far, far worse. (Disclaimer: not a Windows programmer, just based on hearing them complain.)
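The point about byte 47 being the only structurally special byte can be shown by manipulating a path as raw bytes; basename_bytes() below is a hypothetical helper, not a stdlib function:

```python
# On POSIX, a file name is an opaque byte string; only b'/' (byte 47)
# has structural meaning, so path operations like basename need no
# decoding at all.
def basename_bytes(path):
    # Split on the single special byte; every other byte is opaque.
    return path.rsplit(b"/", 1)[-1]

# Latin-1 bytes that are not valid UTF-8, yet a legal POSIX file name:
raw = b"/home/user/caf\xe9/r\xe9sum\xe9.txt"
assert basename_bytes(raw) == b"r\xe9sum\xe9.txt"
```

Nothing in this computation required knowing (or assuming) an encoding -- which is precisely the sense in which "filenames are just bytes" holds at the OS level.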
I'm +1000 on systems where the filesystem enforces Unicode (eg Plan 9 or Mac OSX, which forces a specific UTF-8 encoding in the bytes POSIX APIs - the underlying filesystems reject invalid byte sequences). [...] > Antoine Pitrou wrote: >> To elaborate specifically about pathlib, it doesn't handle bytes paths >> but allows you to generate them if desired: >> https://docs.python.org/3/library/pathlib.html#operators > >but that uses >os.fsencode: Encode filename to the filesystem encoding > >As I understand it, the whole problem with some posix systems is that there >is NO filesystem encoding -- i.e. you can't know for sure what encoding a >filename is in. So you need to be able to pass the bytes through as they >are. Yes and no. I made that argument too. There's no _external_ "filesystem encoding" in the sense of something recorded in the filesystem that anyone can inspect. But there is the expressed locale settings, available at runtime to any program that cares to pay attention. It is a workable situation. Oh, and I reject Nick's characterisation of POSIX as "broken". It's perfectly internally consistent. It just doesn't match what he wants. (Indeed, what I want, and I'm a long time UNIX fanboy.) Cheers, Cameron Simpson God is real, unless declared integer. - Johan Montald, johan at ingres.com From tjreedy at udel.edu Thu Aug 21 09:32:15 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Thu, 21 Aug 2014 03:32:15 -0400 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again? In-Reply-To: <53F53CDE.80601@m4x.org> References: <20140817092919.1f41e88a@limelight.wooz.org> <53F53CDE.80601@m4x.org> Message-ID: On 8/20/2014 8:27 PM, Joseph Martinot-Lagarde wrote: > The pain was even bigger because in addition to the change in underlying > types, the names of the types were not compatible between the python > versions. 
I often try to write compatible code between python2 and 3, > and I can't use "str" because it has not the same meaning in both > versions, I can not use "unicode" because it disappeared in python3, Any bridge library should have the equivalent of if 'py3': unicode = str > I can't use "byte" because it doesn't exist in python2. 2.7 (and 2.6?) already has if 'py2': bytes = str and I presume bridge libraries targeting versions before that was added include it also. -- Terry Jan Reedy From phd at phdru.name Thu Aug 21 09:45:03 2014 From: phd at phdru.name (Oleg Broytman) Date: Thu, 21 Aug 2014 09:45:03 +0200 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140821045219.GA81021@cskk.homeip.net> References: <20140821045219.GA81021@cskk.homeip.net> Message-ID: <20140821074503.GA5179@phdru.name> Hi! On Thu, Aug 21, 2014 at 02:52:19PM +1000, Cameron Simpson wrote: > Oh, and I reject Nick's characterisation of POSIX as "broken". It's > perfectly internally consistent. It just doesn't match what he > wants. (Indeed, what I want, and I'm a long time UNIX fanboy.) > > Cheers, > Cameron Simpson +1 from another Unix fanboy. Like an old wine, Unix becomes better with years! ;-) Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From ncoghlan at gmail.com Thu Aug 21 14:26:56 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 21 Aug 2014 22:26:56 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: <87k362egec.fsf@uwakimon.sk.tsukuba.ac.jp> References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <87wqa45oqf.fsf@elektro.pacujo.net> <87k362egec.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 21 August 2014 12:16, Stephen J. Turnbull wrote: > Nick Coghlan writes: > > > One idea I had along those lines is a surrogatereplace error handler ( > > http://bugs.python.org/issue22016) that emitted an ASCII question mark for > > each smuggled byte, rather than propagating the encoding problem.
> > Please, don't. > > "Smuggled bytes" are not independent events. They tend to be > correlated *within* file names, and this handler would generate names > whose human semantics get lost (and there *are* human semantics, > otherwise the name would be str(some_counter)). They tend to be > correlated across file names, and this handler will generate multiple > files with the same munged name (and again, the differentiating human > semantics get lost). > > If you don't know the semantics of the intended file names, you can't > generate good replacement names. This has to be an application-level > function, and often requires user intervention to get good names. > > If you want to provide helper functions that applications can use to > clean names explicitly, that might be OK. Yeah, I was thinking in the context of reproducing sys.stdout's behaviour in Python 2, but that reproduces the bytes faithfully, so 'surrogateescape' already offers exactly the behaviour we want (sys.stdout will have surrogateescape enabled by default in 3.5). I'll keep pondering the question of possible helper functions in the "string" module. Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From martin at v.loewis.de Thu Aug 21 14:40:48 2014 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Thu, 21 Aug 2014 14:40:48 +0200 Subject: [Python-Dev] PEP 4000 to explicitly declare we won't be doing a Py3k style compatibility break again?
In-Reply-To: References: <20140817092919.1f41e88a@limelight.wooz.org> <1408324486.2238358.153743005.7C19896E@webmail.messagingengine.com> Message-ID: <53F5E8D0.2040206@v.loewis.de> Am 18.08.14 08:45, schrieb Nick Coghlan: > It's certainly the one that has caused the most churn in CPython and > the standard library - the ripples still haven't entirely settled on > that front :) For people porting their libraries and applications, the challenge is often even bigger: they need to learn a new programming concept. For many developers, it is a novel idea that character strings are not just bytes. A similar split is in the number types (integers vs. floats), but most developers have learned the distinction when they learned programming. That a text file is not a file that contains text (but bytes interpreted as text) is surprising. In addition, you also have to learn a lot of facts (what is the ASCII encoding, what is the iso-8859-1 encoding, what is UTF-8 (and how does it differ from Unicode)). When you have all that understood, you *then* run into the design choices to be made for your software. > > I think Guido's right that there's also a "death of a thousand cuts" > aspect for large existing code bases, though, especially those that > are lacking comprehensive test suites. I think the second big challenge is "my dependencies are not ported to Python 3". There is little you can do about it, short of porting the dependencies yourself (fortunately, Python and most of its libraries are free software). 
Regards, Martin From martin at v.loewis.de Thu Aug 21 14:54:36 2014 From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Thu, 21 Aug 2014 14:54:36 +0200 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> Message-ID: <53F5EC0C.9020803@v.loewis.de> Am 19.08.14 19:43, schrieb Ben Hoyt: >>>> The official policy is that we want them [support for bytes paths in stdlib functions] to go away, but reality so far has not budged. We will continue to hold our breath though. :-) >>> >>> Does that mean that new APIs should explicitly not support bytes? I'm >>> thinking of os.scandir() (PEP 471), which I'm implementing at the >>> moment. I was originally going to make it support bytes so it was >>> compatible with listdir, but maybe that's a bad idea. Bytes paths are >>> essentially broken on Windows. >> >> Bytes paths are "essential" on Unix, though, so I don't think we should >> create new low-level APIs that don't support bytes. > > Fair enough. I don't quite understand, though -- why is the "official > policy" to kill something that's "essential" on *nix? I think the people defending the "Unix file names are just bytes" side often miss an important detail: displaying file names to the user, and allowing the user to enter file names. A script that just needs to traverse a directory tree and look at files by certain criteria can easily do so without worrying about a text interpretation of the file names. When it comes to user interaction, it becomes apparent that, even on Unix, file names are not just bytes. If you do "ls -l" in your shell, the "system" (not just the kernel - but ultimately the terminal program, which might be the console driver, or an X11 application) will interpret the file names as having an encoding, and render them with a font. So for Python, the question is: which of the use cases (processing all files, vs.
showing them to the user) should be better supported? Python 3 took the latter as an answer, under the assumption that this is the more common case. Regards, Martin From ncoghlan at gmail.com Thu Aug 21 14:55:33 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 21 Aug 2014 22:55:33 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140821045219.GA81021@cskk.homeip.net> References: <20140821045219.GA81021@cskk.homeip.net> Message-ID: On 21 August 2014 14:52, Cameron Simpson wrote: > > Oh, and I reject Nick's characterisation of POSIX as "broken". It's > perfectly internally consistent. It just doesn't match what he wants. > (Indeed, what I want, and I'm a long time UNIX fanboy.) The part that is broken is the idea that locale encodings are a viable solution to conveying the appropriate encoding to use to talk to the operating system. We've tried trusting them with Python 3, and they're reliably wrong in certain situations. systemd is apparently better than upstart at setting them correctly (e.g. for cron jobs), but even it can't defend against an erroneous (or deliberate!) "LANG=C", or ssh environment forwarding pushing a client's locale to the server. It's worth looking through some of Armin Ronacher's complaints about Python 3 being broken on Linux, and seeing how many of them boil down to "trusting the locale is wrong, Python 3 should just assume UTF-8 on every POSIX system, the same way it does on Mac OS X". (I suspect ShiftJIS, ISO-2022, et al users might object to that approach, but it's at least a more viable choice now than it was back in 2008) I still think we made the right call at least *trying* the idea of trusting the locale encoding (since that's the officially supported way of getting this information from the OS), and in many, many situations it works fine. 
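The locale-derived values under discussion can be inspected directly. A small sketch (the printed values vary with platform and environment -- under a stray "LANG=C" many systems degrade to ASCII, which is exactly the failure mode described above):

```python
import codecs
import locale
import sys

# At startup on POSIX, CPython derives the filesystem encoding from the
# environmental locale -- the very value the discussion above says can
# be wrong in cron jobs or ssh sessions with forwarded locales.
fs_encoding = sys.getfilesystemencoding()
preferred = locale.getpreferredencoding(False)
print(fs_encoding, preferred)  # values vary by platform/environment

# Whatever they are, both names resolve to real codecs:
codecs.lookup(fs_encoding)
codecs.lookup(preferred)
```

(Later releases did move in the direction sketched here: PEP 538 coerces the C locale to C.UTF-8, and PEP 540 added an opt-in UTF-8 mode.)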
But I suspect we may eventually need to resolve the technical issues currently preventing us from deciding to ignore the environmental locale during interpreter startup and try something different (such as always assuming UTF-8, or trying to force C.UTF-8 if we detect the C locale, or looking for the systemd config files and using those to set the OS encoding, rather than the environmental locale). Regards, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From antoine at python.org Thu Aug 21 15:20:27 2014 From: antoine at python.org (Antoine Pitrou) Date: Thu, 21 Aug 2014 09:20:27 -0400 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140821045219.GA81021@cskk.homeip.net> References: <20140821045219.GA81021@cskk.homeip.net> Message-ID: Le 21/08/2014 00:52, Cameron Simpson a écrit : > > The "bytes in some arbitrary encoding where at least the slash character > (and > maybe a couple others) is ascii compatible" notion is completely bogus. > There's only one special byte, the slash (code 47). There's no OS-level > need that it or anything else be ASCII compatible. Of course there is. Try to split a UTF-16-encoded file path on the byte 47 and you'll get a lot of garbage. So, yes, POSIX implicitly mandates an ASCII-compatible encoding for file paths. Regards Antoine. From marko at pacujo.net Thu Aug 21 15:58:03 2014 From: marko at pacujo.net (Marko Rauhamaa) Date: Thu, 21 Aug 2014 16:58:03 +0300 Subject: [Python-Dev] Bytes path support In-Reply-To: <53F5EC0C.9020803@v.loewis.de> ("Martin v. =?utf-8?Q?L=C3=B6w?= =?utf-8?Q?is=22's?= message of "Thu, 21 Aug 2014 14:54:36 +0200") References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <53F5EC0C.9020803@v.loewis.de> Message-ID: <87tx56hrmc.fsf@elektro.pacujo.net> "Martin v. Löwis" : > I think the people defending the "Unix file names are just bytes" side > often miss an important detail: displaying file names to the user, and
The user interface is a real issue and needs to be addressed. It is separate from the OS interface, though. > A script that just needs to traverse a directory tree and look at > files by certain criteria can easily do so with not worrying about a > text interpretation of the file names. A single system often has file names that have been encoded with different schemes. Only today, I have had to deal with the JIS character table () -- you will notice that it doesn't have a backslash character. A coworker uses ISO-8859-1. I use UTF-8. UTF-8, of course, will refuse to deal with some byte sequences. My point is that the poor programmer cannot ignore the possibility of "funny" character sets. If Python tried to protect the programmer from that possibility, the result might be even more intractable: how to act on a file with an non-UTF-8 filename if you are unable to express it as a text string? Marko From ncoghlan at gmail.com Thu Aug 21 16:12:50 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 22 Aug 2014 00:12:50 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: <87tx56hrmc.fsf@elektro.pacujo.net> References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <53F5EC0C.9020803@v.loewis.de> <87tx56hrmc.fsf@elektro.pacujo.net> Message-ID: On 21 August 2014 23:58, Marko Rauhamaa wrote: > > My point is that the poor programmer cannot ignore the possibility of > "funny" character sets. If Python tried to protect the programmer from > that possibility, the result might be even more intractable: how to act > on a file with an non-UTF-8 filename if you are unable to express it as > a text string? That's what the "surrogateescape" codec is for - we use it by default on most OS interfaces, and it's implicit in the use of "os.fsencode" and "os.fsdecode". Starting with Python 3, it's also enabled on sys.stdout by default, so that "print(os.listdir(dirname))" will pass the original raw bytes through to the terminal the same way Python 2 does. 
The docs could use additional details as to which interfaces do and don't have surrogateescape enabled by default, but for the time being, the description of the codec error handler just links out to the original definition in PEP 383. It may also be useful to have some tools for detecting and cleaning strings containing surrogate escaped data, but there hasn't been a concrete proposal along those lines as yet. Personally, I'm currently waiting to see if the Fedora or OpenStack folks indicate a need for such tools before proposing any additions. Regards, Nick. > > Marko > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ncoghlan at gmail.com Thu Aug 21 16:13:37 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 22 Aug 2014 00:13:37 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <53F5EC0C.9020803@v.loewis.de> <87tx56hrmc.fsf@elektro.pacujo.net> Message-ID: On 22 August 2014 00:12, Nick Coghlan wrote: > On 21 August 2014 23:58, Marko Rauhamaa wrote: >> >> My point is that the poor programmer cannot ignore the possibility of >> "funny" character sets. If Python tried to protect the programmer from >> that possibility, the result might be even more intractable: how to act >> on a file with an non-UTF-8 filename if you are unable to express it as >> a text string? > > That's what the "surrogateescape" codec is for Oops, that should say "codec error handler" (I got it right later in the post). Cheers, Nick.
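No such detection or cleaning tools exist in the stdlib, as noted above; as a thought experiment, they might look like this (both helpers and their names are invented here, and the "?" replacement is exactly the lossy step Stephen cautions against, so it belongs at the application level):

```python
# surrogateescape maps undecodable byte 0x80+n to U+DC80+n, so smuggled
# bytes are exactly the characters in that surrogate range.
def has_escaped_bytes(s):
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in s)

def replace_escaped_bytes(s, replacement="?"):
    # Lossy: collapses every smuggled byte to the same replacement.
    return "".join(replacement if 0xDC80 <= ord(ch) <= 0xDCFF else ch
                   for ch in s)

name = b"caf\xe9".decode("utf-8", "surrogateescape")
assert has_escaped_bytes(name)
assert replace_escaped_bytes(name) == "caf?"
```

A detector like has_escaped_bytes() is the uncontroversial half; what to do once smuggled bytes are found is the part that needs application knowledge.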
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From arigo at tunes.org Thu Aug 21 16:41:23 2014 From: arigo at tunes.org (Armin Rigo) Date: Thu, 21 Aug 2014 16:41:23 +0200 Subject: [Python-Dev] https:bugs.python.org -- Untrusted Connection (Firefox) In-Reply-To: <20140818203043.GC1782@phdru.name> References: <1408393321.2083664.154095037.3A4EB862@webmail.messagingengine.com> <20140818203043.GC1782@phdru.name> Message-ID: Hi, On 18 August 2014 22:30, Oleg Broytman wrote: > Aha, I see now -- the signing certificate is CAcert, which I've > installed manually. I don't suppose anyone is particularly annoyed by this fact? I know for sure two classes of people that will never click "Ignore". The first one is people that, for lack of a less negative term, I'll call "security freaks". The second is "serious business people" to which the shiny new look of python.org appeals; they are likely to heed the warning "Legitimate banks, stores, etc. will never ask you to do this" and would regard an official hint to ignore it as highly unprofessional. (The bug tracker of PyPy used to have the same problem. We fixed the situation recently, but previously, we used to argue that we didn't have a lot of connections with either class of people...) A bient?t, Armin. From ncoghlan at gmail.com Thu Aug 21 17:44:37 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 22 Aug 2014 01:44:37 +1000 Subject: [Python-Dev] https:bugs.python.org -- Untrusted Connection (Firefox) In-Reply-To: References: <1408393321.2083664.154095037.3A4EB862@webmail.messagingengine.com> <20140818203043.GC1782@phdru.name> Message-ID: On 22 August 2014 00:41, Armin Rigo wrote: > Hi, > > On 18 August 2014 22:30, Oleg Broytman wrote: >> Aha, I see now -- the signing certificate is CAcert, which I've >> installed manually. > > I don't suppose anyone is particularly annoyed by this fact? I know > for sure two classes of people that will never click "Ignore". 
The > first one is people that, for lack of a less negative term, I'll call > "security freaks". The second is "serious business people" to which > the shiny new look of python.org appeals; they are likely to heed the > warning "Legitimate banks, stores, etc. will never ask you to do this" > and would regard an official hint to ignore it as highly > unprofessional. I've now raised this issue with the infrastructure team. The current hosting arrangements for bugs.python.org were put in place when the PSF didn't have any on-call system administrators of its own, but now that we do, it may be time to migrate that service to a location where we can switch to a more appropriate SSL certificate. Anyone interested in following the discussion further may wish to join infrastructure at python.org Regards, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From martin at v.loewis.de Thu Aug 21 18:29:55 2014 From: martin at v.loewis.de ("Martin v. Löwis") Date: Thu, 21 Aug 2014 18:29:55 +0200 Subject: [Python-Dev] https:bugs.python.org -- Untrusted Connection (Firefox) In-Reply-To: References: <1408393321.2083664.154095037.3A4EB862@webmail.messagingengine.com> <20140818203043.GC1782@phdru.name> Message-ID: <53F61E83.6060605@v.loewis.de> Am 21.08.14 17:44, schrieb Nick Coghlan: > I've now raised this issue with the infrastructure team. The current > hosting arrangements for bugs.python.org were put in place when the > PSF didn't have any on-call system administrators of its own, but now > that we do, it may be time to migrate that service to a location where > we can switch to a more appropriate SSL certificate. Just to relay Noah's response: it's actually not the hosting that prevents installation of a proper certificate, it's the limitation that the certificate we could deploy would include "python.org" as a server name, which is considered risky regardless of where the service is hosted.
There are solutions to that as well, of course. Regards, Martin From ryan at ryanhiebert.com Thu Aug 21 18:48:11 2014 From: ryan at ryanhiebert.com (Ryan Hiebert) Date: Thu, 21 Aug 2014 11:48:11 -0500 Subject: [Python-Dev] https:bugs.python.org -- Untrusted Connection (Firefox) In-Reply-To: <53F61E83.6060605@v.loewis.de> References: <1408393321.2083664.154095037.3A4EB862@webmail.messagingengine.com> <20140818203043.GC1782@phdru.name> <53F61E83.6060605@v.loewis.de> Message-ID: > On Aug 21, 2014, at 11:29 AM, Martin v. Löwis wrote: > > Am 21.08.14 17:44, schrieb Nick Coghlan: >> I've now raised this issue with the infrastructure team. The current >> hosting arrangements for bugs.python.org were put in place when the >> PSF didn't have any on-call system administrators of its own, but now >> that we do, it may be time to migrate that service to a location where >> we can switch to a more appropriate SSL certificate. > > Just to relay Noah's response: it's actually not the hosting that > prevents installation of a proper certificate, it's the limitation > that the certificate we could deploy would include "python.org" as > a server name, which is considered risky regardless of where the > service is hosted. That sounds like a limitation I've seen with StartSSL. Perhaps there's a certificate authority that would be willing to sponsor a certificate for Python without this annoying limitation? From stephen at xemacs.org Thu Aug 21 19:27:21 2014 From: stephen at xemacs.org (Stephen J.
Turnbull) Date: Fri, 22 Aug 2014 02:27:21 +0900 Subject: [Python-Dev] Bytes path support In-Reply-To: <87tx56hrmc.fsf@elektro.pacujo.net> References: <1408470029.2395886.154468645.430090BA@webmail.messagingengine.com> <53F5EC0C.9020803@v.loewis.de> <87tx56hrmc.fsf@elektro.pacujo.net> Message-ID: <87vbplda86.fsf@uwakimon.sk.tsukuba.ac.jp> Marko Rauhamaa writes: > My point is that the poor programmer cannot ignore the possibility of > "funny" character sets. *Poor* programmers do it all the time. That's why Python codecs raise when they encounter bytes they can't handle. > If Python tried to protect the programmer from that possibility, I don't understand your point. The existing interfaces aren't going anywhere, and they're enough to do anything you need to do. Although there are a few radicals (like me in a past life :-) who might like to see them go away in favor of opt-in to binary encoding via surrogateescape error handling, nobody in their right mind supports us. The question here is not about going backward, it's about whether to add new bytes APIs, and which ones. From benjamin at python.org Thu Aug 21 20:45:06 2014 From: benjamin at python.org (Benjamin Peterson) Date: Thu, 21 Aug 2014 11:45:06 -0700 Subject: [Python-Dev] https:bugs.python.org -- Untrusted Connection (Firefox) In-Reply-To: References: <1408393321.2083664.154095037.3A4EB862@webmail.messagingengine.com> <20140818203043.GC1782@phdru.name> <53F61E83.6060605@v.loewis.de> Message-ID: <1408646706.3182921.155332569.3F0E8F5E@webmail.messagingengine.com> On Thu, Aug 21, 2014, at 09:48, Ryan Hiebert wrote: > > > On Aug 21, 2014, at 11:29 AM, Martin v. Löwis wrote: > > > > Am 21.08.14 17:44, schrieb Nick Coghlan: > >> I've now raised this issue with the infrastructure team.
The current > >> hosting arrangements for bugs.python.org were put in place when the > >> PSF didn't have any on-call system administrators of its own, but now > >> that we do, it may be time to migrate that service to a location where > >> we can switch to a more appropriate SSL certificate. > > > > Just to relay Noah's response: it's actually not the hosting that > > prevents installation of a proper certificate, it's the limitation > > that the certificate we could deploy would include "python.org" as > > a server name, which is considered risky regardless of where the > > service is hosted. There are solutions to that as well, of course. > > That sounds like a limitation I've seen with StartSSL. Perhaps there's a > certificate authority that would be willing to sponsor a certificate for > Python without this annoying limitation? Perhaps some board members could comment, but I hope the PSF could just pay a few hundred a year for a proper certificate. From tjreedy at udel.edu Thu Aug 21 21:59:20 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Thu, 21 Aug 2014 15:59:20 -0400 Subject: [Python-Dev] https:bugs.python.org -- Untrusted Connection (Firefox) In-Reply-To: References: <1408393321.2083664.154095037.3A4EB862@webmail.messagingengine.com> <20140818203043.GC1782@phdru.name> Message-ID: On 8/21/2014 10:41 AM, Armin Rigo wrote: > Hi, > > On 18 August 2014 22:30, Oleg Broytman wrote: >> Aha, I see now -- the signing certificate is CAcert, which I've >> installed manually. > > I don't suppose anyone is particularly annoyed by this fact? I noticed the issue, and started this thread, because someone posted an https://bugs.python.org link. I ordinarily just go to bugs.python.org and get the http connection. I have https-anywhere installed, but it must notice the dodgy certificate and silently not switch. So I never knew before that there was an https connection available, and never thought to try it.
Given that we are shipping both login credentials and files over the connection, making https routine, with a proper certificate, might be a good idea. -- Terry Jan Reedy From cs at zip.com.au Fri Aug 22 00:27:21 2014 From: cs at zip.com.au (Cameron Simpson) Date: Fri, 22 Aug 2014 08:27:21 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: References: Message-ID: <20140821222721.GA13888@cskk.homeip.net> On 21Aug2014 09:20, Antoine Pitrou wrote: >Le 21/08/2014 00:52, Cameron Simpson a écrit : >>The "bytes in some arbitrary encoding where at least the slash character >>(and >>maybe a couple others) is ascii compatible" notion is completely bogus. >>There's only one special byte, the slash (code 47). There's no OS-level >>need that it or anything else be ASCII compatible. > >Of course there is. Try to split a UTF-16-encoded file path on the >byte 47 and you'll get a lot of garbage. So, yes, POSIX implicitly >mandates an ASCII-compatible encoding for file paths. [Rolls eyes.] Looking at the UTF-16 encoding, it looks like it also embeds NUL bytes for various codes below 32768. How are they handled? As remarked, codes 0 (NUL) and 47 (ASCII slash code) _are_ special to UNIX filename bytes strings. If you imagine you can embed bare UTF-16 freely even excluding code 47, I think one of us is missing something. That's not "ASCII compatible". That's "not all byte codes can be freely used without thought", and any multibyte coding will have to consider such things when embedding itself in another coding scheme. Cheers, Cameron Simpson Microsoft: Committed to putting the "backward" into "backward compatibility."
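[Antoine's point about splitting a UTF-16 path on byte 47, and Cameron's about the embedded NUL bytes, are both easy to demonstrate; the path below is an arbitrary example:]

```python
path = "ab/cd"
utf16 = path.encode("utf-16-le")

# Every other byte is a NUL, including one directly after the slash:
assert utf16 == b"a\x00b\x00/\x00c\x00d\x00"

# Splitting the raw bytes on b"/" strands that NUL at the front of the
# second component, which then has an odd byte count and is no longer
# valid UTF-16:
head, tail = utf16.split(b"/")
assert tail == b"\x00c\x00d\x00"
try:
    tail.decode("utf-16-le")
except UnicodeDecodeError:
    pass  # expected: truncated/misaligned UTF-16
else:
    raise AssertionError("expected invalid UTF-16")

# Worse, byte 47 can occur inside an unrelated character:
assert b"/" in "\u2f00".encode("utf-16-le")  # U+2F00 KANGXI RADICAL ONE
```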
From chris.barker at noaa.gov Fri Aug 22 00:30:20 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Thu, 21 Aug 2014 15:30:20 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140821045219.GA81021@cskk.homeip.net> References: <20140821045219.GA81021@cskk.homeip.net> Message-ID: On Wed, Aug 20, 2014 at 9:52 PM, Cameron Simpson wrote: > On 20Aug2014 16:04, Chris Barker - NOAA Federal > wrote: > >> > So really, people treat them as >>> >> "bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and >> maybe a couple others)-is-ascii-compatible" >> > > As someone who fought long and hard in the surrogate-escape listdir() > wars, and was won over once the scheme was thoroughly explained to me, I > take issue with these assertions: they are bogus or misleading. > > Firstly, POSIX filenames _are_ just byte strings. The only forbidden > character is the NUL byte, which terminates a C string, and the only > special character is the slash, which separates pathname components. > so they are "just byte strings", oh, except that you can't have a null, and the "slash" had better be code 47 (and vice versa). How is that different than "bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-is-ascii-compatible"? (sorry about the "maybe a couple others", I was too lazy to do my research and be sure). But my point is that python users want to be able to work with paths, and paths on posix are not strictly strings with a clearly defined encoding, but they are also not quite "just arbitrary bytes". So it would be nice if we could have a pathlib that would work with these odd beasts. I've lost track a bit as to whether the surrogate-escape solution allows this to all work now. If it does, then great, sorry for the noise. Second, a bare low level program cannot do _much_ more than pass them > around. It certainly can do things like compute their basename, or other > path related operations.
> only if you assume that pesky slash == 47 thing -- it's not much, but it's not raw bytes either. The "bytes in some arbitrary encoding where at least the slash character > (and > maybe a couple others) is ascii compatible" notion is completely bogus. > There's only one special byte, the slash (code 47). There's no OS-level > need that it or anything else be ASCII compatible. I think > characterizations such as the one quoted are actively misleading. > code 47 == "slash" is ascii compatible -- where else did the 47 value come from? > I think we'd all agree it is nice to have a system where filenames are all > Unicode, but since POSIX/UNIX predates it by decades it is a bit late to > ignore the reality for such systems. well, the community could have gone to "if you want anything other than ascii, make it utf-8" -- but alas, we're all a bunch of independent thinkers. But none of this is relevant -- systems in the wild do what they do -- clearly we all want Python to work with them as best it can. > There's no _external_ "filesystem encoding" in the sense of something > recorded in the filesystem that anyone can inspect. But there is the > expressed locale settings, available at runtime to any program that cares > to pay attention. It is a workable situation. > I haven't run into it, but it seems the folks that have don't think relying on the locale setting is the least bit workable. If it were, we wouldn't be having this discussion -- use the locale setting to decide how to decode filenames -- done. Oh, and I reject Nick's characterisation of POSIX as "broken". It's > perfectly internally consistent. It just doesn't match what he wants. > (Indeed, what I want, and I'm a long time UNIX fanboy.) > bug or feature? you decide. Internal consistency is a good start, but it punts the whole encoding issue to the client software, without giving it the tools to do it right. I call that "really hard to work with" if not broken. -Chris -- Christopher Barker, Ph.D.
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.f.moore at gmail.com Fri Aug 22 00:42:06 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Thu, 21 Aug 2014 23:42:06 +0100 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140821222721.GA13888@cskk.homeip.net> References: <20140821222721.GA13888@cskk.homeip.net> Message-ID: On 21 August 2014 23:27, Cameron Simpson wrote: > That's not "ASCII compatible". That's "not all byte codes can be freely used > without thought", and any multibyte coding will have to consider such things > when embedding itself in another coding scheme. I wonder how badly a Unix system would break if you specified UTF16 as the system encoding...? Paul From antoine at python.org Fri Aug 22 00:54:47 2014 From: antoine at python.org (Antoine Pitrou) Date: Thu, 21 Aug 2014 18:54:47 -0400 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140821222721.GA13888@cskk.homeip.net> References: <20140821222721.GA13888@cskk.homeip.net> Message-ID: Le 21/08/2014 18:27, Cameron Simpson a écrit : > As > remarked, codes 0 (NUL) and 47 (ASCII slash code) _are_ special to UNIX > filename bytes strings. So you admit that POSIX mandates that file paths are expressed in an ASCII-compatible encoding after all? Good. I've nothing to add to your rant. Antoine. From ijmorlan at uwaterloo.ca Fri Aug 22 01:06:55 2014 From: ijmorlan at uwaterloo.ca (Isaac Morland) Date: Thu, 21 Aug 2014 19:06:55 -0400 (EDT) Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821045219.GA81021@cskk.homeip.net> Message-ID: On Thu, 21 Aug 2014, Chris Barker wrote: > so they are "just byte strings", oh, except that you can't have a null, and > the "slash" had better be code 47 (and vice versa).
How is that different > than "bytes-in-some-arbitrary-encoding-where-at-least > the-slash-character-is-ascii-compatible"? Actually, slash doesn't need to be code 47. But no matter what code 47 means outside of the context of a filename, it is the path arc separator byte (not character). In fact, this isn't even entirely academic. On a Mac OS X machine, go into Finder and try to create a directory called ":". You'll get an error saying 'The name ":" can't be used.'. Now create a directory called "/". No problem, raising the question of what is going on at the filesystem level? Answer: $ ls -al total 0 drwxr-xr-x 3 ijmorlan staff 102 21 Aug 18:57 ./ drwxr-xr-x+ 80 ijmorlan staff 2720 21 Aug 18:57 ../ drwxr-xr-x 2 ijmorlan staff 68 21 Aug 18:57 :/ And of course in shell one would remove the directory with this: rm -rf : not: rm -rf / So in effect the file system path arc encoding on Mac OS X is UTF-8 *except* that : is outlawed and / is encoded as \x3A rather than the usual \x2F. Of course, the path arc separator byte (not character) remains \x2F as always. Just for fun, there are contexts in which one can give a full path at the GUI level, where : is used as the path separator. This is for historical reasons and presumably is the reason for the above-noted behaviour. I think the real tension here is between the POSIX level where filenames are byte strings (except for \x00, which is reserved for string termination) where \x2F has special interpretation, and absolutely every application ever written, in every language, which wants filenames to be character strings.
Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist From ncoghlan at gmail.com Fri Aug 22 01:25:05 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 22 Aug 2014 09:25:05 +1000 Subject: [Python-Dev] https:bugs.python.org -- Untrusted Connection (Firefox) In-Reply-To: <1408646706.3182921.155332569.3F0E8F5E@webmail.messagingengine.com> References: <1408393321.2083664.154095037.3A4EB862@webmail.messagingengine.com> <20140818203043.GC1782@phdru.name> <53F61E83.6060605@v.loewis.de> <1408646706.3182921.155332569.3F0E8F5E@webmail.messagingengine.com> Message-ID: On 22 Aug 2014 04:45, "Benjamin Peterson" wrote: > > Perhaps some board members could comment, but I hope the PSF could just > pay a few hundred a year for a proper certificate. That's exactly what we're doing - MAL reminded me we reached the same conclusion last time this came up, we'll just track it better this time to make sure it doesn't slip through the cracks again. (And yes, switching to forced HTTPS once this is addressed would also be a good idea - we'll add it to the list) Regards, Nick. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Fri Aug 22 01:38:55 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 22 Aug 2014 09:38:55 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821045219.GA81021@cskk.homeip.net> Message-ID: On 22 Aug 2014 09:24, "Isaac Morland" wrote: > I think the real tension here is between the POSIX level where filenames are byte strings (except for \x00, which is reserved for string termination) where \x2F has special interpretation, and absolutely every application ever written, in every language, which wants filenames to be character strings. That's one of the best summaries of the situation I've ever seen :) Most languages (including Python 2) throw up their hands and say this is the developer's problem to deal with. 
Python 3 says it's *our* problem to deal with on behalf of our developers. The "surrogateescape" error handler allows recalcitrant bytes to be dealt with relatively gracefully in most situations. We don't quite cover *everything* yet (hence the complaints from some of the folks that are experts at dealing with Python 2 Unicode handling on POSIX systems), but the remaining problems are a lot more tractable than the "teach every native English speaker everywhere how to handle Unicode properly" problem. Regards, Nick. -------------- next part -------------- An HTML attachment was scrubbed... URL: From v+python at g.nevcal.com Fri Aug 22 02:00:02 2014 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 21 Aug 2014 17:00:02 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821222721.GA13888@cskk.homeip.net> Message-ID: <53F68802.7080906@g.nevcal.com> On 8/21/2014 3:42 PM, Paul Moore wrote: > I wonder how badly a Unix system would break if you specified UTF16 as > the system encoding...? > Paul Does Unix even support UTF-16 as an encoding? I suppose, these days, it probably does, for reading contents of files created on Windows, etc. (Unicode was just gaining traction when I last used Unix in a significant manner; yes, my web host runs Linux, and I know enough to do what can be done there... but haven't experimented with encodings other than ASCII & UTF-8 on the web host, and don't intend to). If it allows configuration of UTF-16 or UTF-32 as system encodings, I would consider that a bug, though, as too much of Unix predates Unicode, and would be likely to fail. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From v+python at g.nevcal.com Fri Aug 22 01:56:59 2014 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 21 Aug 2014 16:56:59 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821222721.GA13888@cskk.homeip.net> Message-ID: <53F6874B.6090103@g.nevcal.com> On 8/21/2014 3:54 PM, Antoine Pitrou wrote: > > Le 21/08/2014 18:27, Cameron Simpson a écrit : >> As >> remarked, codes 0 (NUL) and 47 (ASCII slash code) _are_ special to UNIX >> filename bytes strings. > > So you admit that POSIX mandates that file paths are expressed in an > ASCII-compatible encoding after all? Good. I've nothing to add to your > rant. > > Antoine. 0 and 47 are certainly originally derived from ASCII. However, there could be lots of encodings that are not ASCII compatible (but in practice, probably very few, since most encodings _are_ ASCII compatible) that could fit those constraints. So while as a technical matter, Cameron is correct that Unix only treats 0 & 47 as special, and that is insufficient to declare that encodings must be ASCII compatible, as a practical matter, since most encodings are ASCII compatible anyway, it would be hard to find very many that could be used successfully with Unix file names that are not ASCII compatible, that could comply with the 0 & 47 requirements. -------------- next part -------------- An HTML attachment was scrubbed... URL: From phd at phdru.name Fri Aug 22 03:09:33 2014 From: phd at phdru.name (Oleg Broytman) Date: Fri, 22 Aug 2014 03:09:33 +0200 Subject: [Python-Dev] Bytes path support In-Reply-To: <53F68802.7080906@g.nevcal.com> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> Message-ID: <20140822010933.GA8105@phdru.name> On Thu, Aug 21, 2014 at 05:00:02PM -0700, Glenn Linderman wrote: > On 8/21/2014 3:42 PM, Paul Moore wrote: > >I wonder how badly a Unix system would break if you specified UTF16 as > >the system encoding...?
> > Does Unix even support UTF-16 as an encoding? As an encoding of file's content? Certainly yes. As a locale encoding? Definitely no. Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From chris.barker at noaa.gov Fri Aug 22 02:30:14 2014 From: chris.barker at noaa.gov (Chris Barker - NOAA Federal) Date: Thu, 21 Aug 2014 17:30:14 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: <53F68802.7080906@g.nevcal.com> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> Message-ID: <5124983344373446869@unknownmsgid> > Does Unix even support UTF-16 as an encoding? I suppose, these days, it probably does, for reading contents of files created on Windows, etc. I don't think Unix supports any encodings at all for the _contents_ of files -- that's up to applications. Of course the command line text processing tools need to know -- I'm guessing those are never going to work w/UTF-16! "System encoding" is a nice idea, but pretty much worthless. Only helpful for files created and processed on the same system -- not rare for that not to be the case. This brings up the other key problem. If file names are (almost) arbitrary bytes, how do you write one to/read one from a text file with a particular encoding? ( or for that matter display it on a terminal) And people still want to say posix isn't broken in this regard? Sigh. 
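[One answer to Chris's display question, sketched with a made-up filename: decode with "surrogateescape" so the raw bytes stay recoverable, and re-encode with "backslashreplace" when the name must be shown somewhere that cannot hold them:]

```python
raw = b"report\xfffinal.txt"  # contains a byte that is not valid UTF-8

# Decode as the OS interfaces do: the bad byte becomes U+DCFF.
name = raw.decode("utf-8", "surrogateescape")

# The original bytes remain recoverable for further os.* calls...
assert name.encode("utf-8", "surrogateescape") == raw

# ...while a printable ASCII rendering can be produced for logs or
# terminals that would otherwise choke on the lone surrogate:
shown = name.encode("ascii", "backslashreplace").decode("ascii")
assert shown == "report\\udcfffinal.txt"
```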
-Chris From tjreedy at udel.edu Fri Aug 22 04:32:24 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Thu, 21 Aug 2014 22:32:24 -0400 Subject: [Python-Dev] https:bugs.python.org -- Untrusted Connection (Firefox) In-Reply-To: References: <1408393321.2083664.154095037.3A4EB862@webmail.messagingengine.com> <20140818203043.GC1782@phdru.name> <53F61E83.6060605@v.loewis.de> <1408646706.3182921.155332569.3F0E8F5E@webmail.messagingengine.com> Message-ID: On 8/21/2014 7:25 PM, Nick Coghlan wrote: > > On 22 Aug 2014 04:45, "Benjamin Peterson" > wrote: > > > > Perhaps some board members could comment, but I hope the PSF could just > > pay a few hundred a year for a proper certificate. > > That's exactly what we're doing - MAL reminded me we reached the same > conclusion last time this came up, we'll just track it better this time > to make sure it doesn't slip through the cracks again. > > (And yes, switching to forced HTTPS once this is addressed would also be > a good idea - we'll add it to the list) I just switched from a 'low variety' short password of the sort almost crackable with brute force (today, though not several years ago) to a higher variety longer password. People with admin privileges on the tracker might be reminded to recheck. What was adequate 10 years ago is not so now. -- Terry Jan Reedy From phd at phdru.name Fri Aug 22 04:42:29 2014 From: phd at phdru.name (Oleg Broytman) Date: Fri, 22 Aug 2014 04:42:29 +0200 Subject: [Python-Dev] Bytes path support In-Reply-To: <5124983344373446869@unknownmsgid> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> Message-ID: <20140822024229.GA8192@phdru.name> On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal wrote: > This brings up the other key problem. If file names are (almost) > arbitrary bytes, how do you write one to/read one from a text file > with a particular encoding? 
( or for that matter display it on a > terminal) There is no such thing as an encoding of text files. So we just write those bytes to the file or output them to the terminal. I often do that. My filesystems are full of files with names and content in at least 3 different encodings - koi8-r, utf-8 and cp1251. So I open a terminal with koi8 or utf-8 locale and fonts and some files always look weird. But however weird they are it's possible to work with them. The bigger problem is line feeds. A filename with linefeeds can be put to a text file, but cannot be read back. So one has to transform such names. Usually s/\\/\\\\/g and s/\n/\\n/g is enough. (-: > And people still want to say posix isn't broken in this regard? Not at all! And broken or not broken it's what I (for many different reasons) prefer to use for my desktops, servers, notebooks, routers and smartphones, so if Python would stand in my way I'd rather switch to different tools. Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From stephen at xemacs.org Fri Aug 22 07:11:08 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 22 Aug 2014 14:11:08 +0900 Subject: [Python-Dev] Bytes path support In-Reply-To: <5124983344373446869@unknownmsgid> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> Message-ID: <87ppftcdn7.fsf@uwakimon.sk.tsukuba.ac.jp> Chris Barker - NOAA Federal writes: > This brings up the other key problem. If file names are (almost) > arbitrary bytes, how do you write one to/read one from a text file > with a particular encoding? ( or for that matter display it on a > terminal) "Very carefully." But this is strictly from need. *Nobody* (with the exception of the crackers who like to name their programs things like "\u0007") *wants* to do this.
Real people want to name their files in some human language they understand, and spell it in the usual way, and encode those characters as bytes in the usual way. Decoding those characters in the usual way and getting nonsense is the exceptional case, and it must be the application's or user's problem to decide what to do. They know where they got the file from and usually have some idea of what its name should look like. Python doesn't, so Python cannot solve it for them. For that reason, I believe that Python's "normal"/high-level approach to file handling should treat file names as (human-oriented) text. Of course Python should be able to handle bytes straight from the disk, but most programmers shouldn't have to. > And people still want to say posix isn't broken in this regard? Deal with it, bro'. From marko at pacujo.net Fri Aug 22 07:24:42 2014 From: marko at pacujo.net (Marko Rauhamaa) Date: Fri, 22 Aug 2014 08:24:42 +0300 Subject: [Python-Dev] Bytes path support In-Reply-To: (Nick Coghlan's message of "Fri, 22 Aug 2014 09:38:55 +1000") References: <20140821045219.GA81021@cskk.homeip.net> Message-ID: <87mwaxayg5.fsf@elektro.pacujo.net> Nick Coghlan : > Python 3 says it's *our* problem to deal with on behalf of our > developers. Flik: I was just trying to help. Mr. Soil: Then help us; *don't* help us. Marko From steve at pearwood.info Fri Aug 22 17:19:14 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 23 Aug 2014 01:19:14 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140822024229.GA8192@phdru.name> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> Message-ID: <20140822151911.GS25957@ando> On Fri, Aug 22, 2014 at 04:42:29AM +0200, Oleg Broytman wrote: > On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal wrote: > > This brings up the other key problem. 
If file names are (almost) > > arbitrary bytes, how do you write one to/read one from a text file > > with a particular encoding? ( or for that matter display it on a > > terminal) > > There is no such thing as an encoding of text files. I don't understand this comment. It seems to me that *text* files have to have an encoding, otherwise you can't interpret the contents as text. Files, of course, only contain bytes, but to be treated as text you need some way of transforming byte N to char C (or multiple bytes to C), which is an encoding. Perhaps you just mean that encodings are not recorded in the text file itself? To answer Chris' question, you typically cannot include arbitrary bytes in text files, and displaying them to the user is likewise problematic. The usual solution is to support some form of escaping, like \t #x0A; or %0D, to give a few examples. -- Steven From martin at v.loewis.de Fri Aug 22 17:25:16 2014 From: martin at v.loewis.de ("Martin v. Löwis") Date: Fri, 22 Aug 2014 17:25:16 +0200 Subject: [Python-Dev] Bytes path support In-Reply-To: <53F6874B.6090103@g.nevcal.com> References: <20140821222721.GA13888@cskk.homeip.net> <53F6874B.6090103@g.nevcal.com> Message-ID: <53F760DC.2060906@v.loewis.de> Am 22.08.14 01:56, schrieb Glenn Linderman: > 0 and 47 are certainly originally derived from ASCII. However, there > could be lots of encodings that are not ASCII compatible (but in > practice, probably very few, since most encodings _are_ ASCII > compatible) that could fit those constraints. > > So while as a technical matter, Cameron is correct that Unix only treats > 0 & 47 as special, and that is insufficient to declare that encodings > must be ASCII compatible, as a practical matter, since most encodings > are ASCII compatible anyway, it would be hard to find very many that > could be used successfully with Unix file names that are not ASCII > compatible, that could comply with the 0 & 47 requirements.
More importantly, existing encodings that are distinctively *not* ASCII compatible (e.g. the EBCDIC ones) do not put the slash into 47 (instead, it is at 97 in EBCDIC; 47 is the BEL control character). There are boundary cases, of course. VISCII is "mostly ASCII compatible", putting graphic characters into some of the control characters, but using those that aren't used in ASCII, anyway. And then there is the YUSCII family of encodings, which definitely is not ASCII compatible, as it does not contain Latin characters, but still puts the / into 47 (and also keeps the ASCII digits and special characters in their positions). There is also SI 960, which has the slash, the ASCII uppercase letters, digits and special characters, but replaces the lower-case characters with Hebrew. So yes, Unix doesn't mandate ASCII-compatible encodings; but it still mandates ASCII-inspired encodings. I wonder how you would run "gcc", though, on an SI 960 system; you'd have to type ???. Regards, Martin From phd at phdru.name Fri Aug 22 17:51:04 2014 From: phd at phdru.name (Oleg Broytman) Date: Fri, 22 Aug 2014 17:51:04 +0200 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140822151911.GS25957@ando> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> Message-ID: <20140822155104.GA28425@phdru.name> Hi! On Sat, Aug 23, 2014 at 01:19:14AM +1000, Steven D'Aprano wrote: > On Fri, Aug 22, 2014 at 04:42:29AM +0200, Oleg Broytman wrote: > > On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal wrote: > > > This brings up the other key problem. If file names are (almost) > > > arbitrary bytes, how do you write one to/read one from a text file > > > with a particular encoding? ( or for that matter display it on a > > > terminal) > > > > There is no such thing as an encoding of text files. > > I don't understand this comment.
It seems to me that *text* files have > to have an encoding, otherwise you can't interpret the contents as text. What encoding does have a text file (an HTML, to be precise) with text in utf-8, ads in cp1251 (ad blocks were included from different files) and comments in koi8-r? Well, I must admit the HTML was rather an exception, but having a text file with some strange characters (binary strings, or paragraphs in different encodings) is not that exceptional. > Files, of course, only contain bytes, but to be treated as bytes you > need some way of transforming byte N to char C (or multiple bytes to C), > which is an encoding. But you don't need to treat the entire file in one encoding. Strange characters are clearly visible so you can interpret them differently. I am very much trained to distinguish koi8, cp1251 and utf-8 texts; I cannot translate them mentally but I can recognize them. > Perhaps you just mean that encodings are not recorded in the text file > itself? Yes, that too. > To answer Chris' question, you typically cannot include arbitrary > bytes in text files, and displaying them to the user is likewise > problematic As a person who view utf-8 files in koi8 fonts (and vice versa) every day I'd argue. (-: Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From status at bugs.python.org Fri Aug 22 18:08:12 2014 From: status at bugs.python.org (Python tracker) Date: Fri, 22 Aug 2014 18:08:12 +0200 (CEST) Subject: [Python-Dev] Summary of Python tracker Issues Message-ID: <20140822160812.1F6F45642D@psf.upfronthosting.co.za> ACTIVITY SUMMARY (2014-08-15 - 2014-08-22) Python tracker at http://bugs.python.org/ To view or respond to any of the issues listed below, click on the issue. Do NOT respond to this message. 
Issues counts and deltas: open 4621 (+19) closed 29399 (+28) total 34020 (+47) Open issues with patches: 2179 Issues opened (41) ================== #22207: Test for integer overflow on Py_ssize_t: explicitly cast to si http://bugs.python.org/issue22207 opened by haypo #22208: tarfile can't add in memory files (reopened) http://bugs.python.org/issue22208 opened by markgrandi #22209: Idle: add better access to extension information http://bugs.python.org/issue22209 opened by terry.reedy #22210: pdb-run-restarting-a-pdb-session http://bugs.python.org/issue22210 opened by zhengxiexie #22211: Remove VMS specific code in expat.h & xmlrole.h http://bugs.python.org/issue22211 opened by John.Malmberg #22212: zipfile.py fails if zlib.so module fails to build. http://bugs.python.org/issue22212 opened by John.Malmberg #22213: pyvenv style virtual environments unusable in an embedded syst http://bugs.python.org/issue22213 opened by grahamd #22214: Tkinter: Don't stringify callbacks arguments http://bugs.python.org/issue22214 opened by serhiy.storchaka #22215: "embedded NUL character" exceptions http://bugs.python.org/issue22215 opened by serhiy.storchaka #22216: smtplip STARTTLS fails at second attampt due to unsufficiant q http://bugs.python.org/issue22216 opened by zvyn #22217: Reprs for zipfile classes http://bugs.python.org/issue22217 opened by serhiy.storchaka #22218: Fix more compiler warnings "comparison between signed and unsi http://bugs.python.org/issue22218 opened by haypo #22219: python -mzipfile fails to add empty folders to created zip http://bugs.python.org/issue22219 opened by Antony.Lee #22220: Ttk extensions test failure http://bugs.python.org/issue22220 opened by serhiy.storchaka #22221: ast.literal_eval confused by coding declarations http://bugs.python.org/issue22221 opened by jorgenschaefer #22222: dtoa.c: remove custom memory allocator http://bugs.python.org/issue22222 opened by haypo #22223: argparse not including '--' arguments in previous optional REM 
http://bugs.python.org/issue22223 opened by Jurko.Gospodnetić #22225: Add SQLite support to http.cookiejar http://bugs.python.org/issue22225 opened by demian.brecht #22226: Refactor dict result handling in Tkinter http://bugs.python.org/issue22226 opened by serhiy.storchaka #22227: Simplify tarfile iterator http://bugs.python.org/issue22227 opened by serhiy.storchaka #22228: Adapt bash readline operate-and-get-next function http://bugs.python.org/issue22228 opened by lelit #22229: wsgiref doesn't appear to ever set REMOTE_HOST in the environ http://bugs.python.org/issue22229 opened by alex #22231: httplib: unicode url will cause an ascii codec error when comb http://bugs.python.org/issue22231 opened by Bob.Chen #22232: str.splitlines splitting on none-\r\n characters http://bugs.python.org/issue22232 opened by scharron #22233: http.client splits headers on none-\r\n characters http://bugs.python.org/issue22233 opened by scharron #22234: urllib.parse.urlparse accepts any falsy value as an url http://bugs.python.org/issue22234 opened by Ztane #22235: httplib: TypeError with file() object in ssl.py http://bugs.python.org/issue22235 opened by erob #22236: Do not use _default_root in Tkinter tests http://bugs.python.org/issue22236 opened by serhiy.storchaka #22237: sorted() docs should state that the sort is stable http://bugs.python.org/issue22237 opened by Wilfred.Hughes #22239: asyncio: nested event loop http://bugs.python.org/issue22239 opened by djarb #22240: argparse support for "python -m module" in help http://bugs.python.org/issue22240 opened by tebeka #22241: strftime/strptime round trip fails even for UTC datetime objec http://bugs.python.org/issue22241 opened by akira #22242: Doc fix in the Import section in language reference.
http://bugs.python.org/issue22242 opened by jon.poler #22243: Documentation on try statement incorrectly implies target of e http://bugs.python.org/issue22243 opened by mwilliamson #22244: load_verify_locations fails to handle unicode paths on Python http://bugs.python.org/issue22244 opened by alex #22246: add strptime(s, '%s') http://bugs.python.org/issue22246 opened by akira #22247: More incomplete module.__all__ lists http://bugs.python.org/issue22247 opened by vadmium #22248: urllib.request.urlopen raises exception when 30X-redirect url http://bugs.python.org/issue22248 opened by tomasgroth #22249: Possibly incorrect example is given for socket.getaddrinfo() http://bugs.python.org/issue22249 opened by Alexander.Patrakov #22250: unittest lowercase methods http://bugs.python.org/issue22250 opened by simonzack #22251: Various markup errors in documentation http://bugs.python.org/issue22251 opened by berker.peksag Most recent 15 issues with no replies (15) ========================================== #22251: Various markup errors in documentation http://bugs.python.org/issue22251 #22250: unittest lowercase methods http://bugs.python.org/issue22250 #22249: Possibly incorrect example is given for socket.getaddrinfo() http://bugs.python.org/issue22249 #22246: add strptime(s, '%s') http://bugs.python.org/issue22246 #22244: load_verify_locations fails to handle unicode paths on Python http://bugs.python.org/issue22244 #22242: Doc fix in the Import section in language reference. 
http://bugs.python.org/issue22242 #22239: asyncio: nested event loop http://bugs.python.org/issue22239 #22234: urllib.parse.urlparse accepts any falsy value as an url http://bugs.python.org/issue22234 #22231: httplib: unicode url will cause an ascii codec error when comb http://bugs.python.org/issue22231 #22229: wsgiref doesn't appear to ever set REMOTE_HOST in the environ http://bugs.python.org/issue22229 #22227: Simplify tarfile iterator http://bugs.python.org/issue22227 #22225: Add SQLite support to http.cookiejar http://bugs.python.org/issue22225 #22216: smtplip STARTTLS fails at second attampt due to unsufficiant q http://bugs.python.org/issue22216 #22212: zipfile.py fails if zlib.so module fails to build. http://bugs.python.org/issue22212 #22211: Remove VMS specific code in expat.h & xmlrole.h http://bugs.python.org/issue22211 Most recent 15 issues waiting for review (15) ============================================= #22251: Various markup errors in documentation http://bugs.python.org/issue22251 #22246: add strptime(s, '%s') http://bugs.python.org/issue22246 #22242: Doc fix in the Import section in language reference. 
http://bugs.python.org/issue22242 #22240: argparse support for "python -m module" in help http://bugs.python.org/issue22240 #22236: Do not use _default_root in Tkinter tests http://bugs.python.org/issue22236 #22228: Adapt bash readline operate-and-get-next function http://bugs.python.org/issue22228 #22227: Simplify tarfile iterator http://bugs.python.org/issue22227 #22226: Refactor dict result handling in Tkinter http://bugs.python.org/issue22226 #22222: dtoa.c: remove custom memory allocator http://bugs.python.org/issue22222 #22219: python -mzipfile fails to add empty folders to created zip http://bugs.python.org/issue22219 #22218: Fix more compiler warnings "comparison between signed and unsi http://bugs.python.org/issue22218 #22217: Reprs for zipfile classes http://bugs.python.org/issue22217 #22216: smtplip STARTTLS fails at second attampt due to unsufficiant q http://bugs.python.org/issue22216 #22215: "embedded NUL character" exceptions http://bugs.python.org/issue22215 #22214: Tkinter: Don't stringify callbacks arguments http://bugs.python.org/issue22214 Top 10 most discussed issues (10) ================================= #17535: IDLE: Add an option to show line numbers along the left side o http://bugs.python.org/issue17535 9 msgs #22208: tarfile can't add in memory files (reopened) http://bugs.python.org/issue22208 8 msgs #2527: Pass a namespace to timeit http://bugs.python.org/issue2527 6 msgs #22195: Make it easy to replace print() calls with logging calls http://bugs.python.org/issue22195 6 msgs #22241: strftime/strptime round trip fails even for UTC datetime objec http://bugs.python.org/issue22241 6 msgs #22194: access to cdecimal / libmpdec API http://bugs.python.org/issue22194 5 msgs #22198: Odd floor-division corner case http://bugs.python.org/issue22198 5 msgs #22218: Fix more compiler warnings "comparison between signed and unsi http://bugs.python.org/issue22218 5 msgs #20152: Derby #15: Convert 50 sites to Argument Clinic across 9 files 
http://bugs.python.org/issue20152 4 msgs #20184: Derby #16: Convert 50 sites to Argument Clinic across 9 files http://bugs.python.org/issue20184 4 msgs Issues closed (27) ================== #7283: test_site failure when .local/lib/pythonX.Y/site-packages hasn http://bugs.python.org/issue7283 closed by ned.deily #15696: Correct __sizeof__ support for mmap http://bugs.python.org/issue15696 closed by serhiy.storchaka #16599: unittest: Access test result from tearDown http://bugs.python.org/issue16599 closed by Claudiu.Popa #19628: maxlevels -1 on compileall for unlimited recursion http://bugs.python.org/issue19628 closed by python-dev #19714: Add tests for importlib.machinery.WindowsRegistryFinder http://bugs.python.org/issue19714 closed by brett.cannon #19997: imghdr.what doesn't accept bytes paths http://bugs.python.org/issue19997 closed by serhiy.storchaka #20797: zipfile.extractall should accept bytes path as parameter http://bugs.python.org/issue20797 closed by serhiy.storchaka #21308: PEP 466: backport ssl changes http://bugs.python.org/issue21308 closed by python-dev #21389: The repr of BoundMethod objects sometimes incorrectly identifi http://bugs.python.org/issue21389 closed by python-dev #21549: Add the members parameter for TarFile.list() http://bugs.python.org/issue21549 closed by serhiy.storchaka #22016: Add a new 'surrogatereplace' output only error handler http://bugs.python.org/issue22016 closed by ncoghlan #22068: tkinter: avoid reference loops with Variables and Fonts http://bugs.python.org/issue22068 closed by serhiy.storchaka #22118: urljoin fails with messy relative URLs http://bugs.python.org/issue22118 closed by pitrou #22150: deprecated-removed directive is broken in Sphinx 1.2.2 http://bugs.python.org/issue22150 closed by berker.peksag #22156: Fix compiler warnings "comparison between signed and unsigned http://bugs.python.org/issue22156 closed by haypo #22157: _ctypes on ppc64: libffi/src/powerpc/linux64.o: ABI version 1 
http://bugs.python.org/issue22157 closed by doko #22165: Empty response from http.server when directory listing contain http://bugs.python.org/issue22165 closed by serhiy.storchaka #22188: test_gdb fails on invalid gdbinit http://bugs.python.org/issue22188 closed by python-dev #22191: warnings.__all__ incomplete http://bugs.python.org/issue22191 closed by brett.cannon #22200: Remove distutils checks for Python version http://bugs.python.org/issue22200 closed by python-dev #22201: python -mzipfile fails to unzip files with folders created by http://bugs.python.org/issue22201 closed by serhiy.storchaka #22205: debugmallocstats test is cpython only http://bugs.python.org/issue22205 closed by python-dev #22206: PyThread_create_key(): fix comparison between signed and unsig http://bugs.python.org/issue22206 closed by haypo #22224: docs.python.org is prone to political blocking in Russia http://bugs.python.org/issue22224 closed by georg.brandl #22230: 'python -mzipfile -c' does not zip empty directories http://bugs.python.org/issue22230 closed by serhiy.storchaka #22238: fractions.gcd results in infinite loop when nan or inf given a http://bugs.python.org/issue22238 closed by mark.dickinson #22245: test_urllib2_localnet prints out error messages http://bugs.python.org/issue22245 closed by orsenthil From v+python at g.nevcal.com Fri Aug 22 18:37:13 2014 From: v+python at g.nevcal.com (Glenn Linderman) Date: Fri, 22 Aug 2014 09:37:13 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140822155104.GA28425@phdru.name> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> Message-ID: <53F771B9.1060203@g.nevcal.com> On 8/22/2014 8:51 AM, Oleg Broytman wrote: > What encoding does have a text file (an HTML, to be precise) with > text in utf-8, ads in cp1251 (ad blocks were included from different > 
files) and comments in koi8-r? > Well, I must admit the HTML was rather an exception, but having a > text file with some strange characters (binary strings, or paragraphs > in different encodings) is not that exceptional. That's not a text file. That's a binary file containing (hopefully delimited, and documented) sections of encoded text in different encodings. If it is named .html and served by the server as UTF-8, then the server is misconfigured, or the file is incorrectly populated. From phd at phdru.name Fri Aug 22 18:52:22 2014 From: phd at phdru.name (Oleg Broytman) Date: Fri, 22 Aug 2014 18:52:22 +0200 Subject: [Python-Dev] Bytes path support In-Reply-To: <53F771B9.1060203@g.nevcal.com> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> Message-ID: <20140822165222.GA2290@phdru.name> On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman wrote: > On 8/22/2014 8:51 AM, Oleg Broytman wrote: > > What encoding does have a text file (an HTML, to be precise) with > >text in utf-8, ads in cp1251 (ad blocks were included from different > >files) and comments in koi8-r? > > Well, I must admit the HTML was rather an exception, but having a > >text file with some strange characters (binary strings, or paragraphs > >in different encodings) is not that exceptional. > That's not a text file. That's a binary file containing (hopefully > delimited, and documented) sections of encoded text in different > encodings. Allow me to disagree. For me, this is a text file which I can (and do) view with a pager, edit with a text editor, list on a console, search with grep and so on. If it is not a text file by strict Python3 standards then these standards are too strict for me.
Either I find a simple workaround in Python3 to work with such texts or find a different tool. I cannot avoid such files because my reality is much more complex than strict text/binary dichotomy in Python3. Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From v+python at g.nevcal.com Fri Aug 22 19:09:21 2014 From: v+python at g.nevcal.com (Glenn Linderman) Date: Fri, 22 Aug 2014 10:09:21 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140822165222.GA2290@phdru.name> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> Message-ID: <53F77941.3040700@g.nevcal.com> On 8/22/2014 9:52 AM, Oleg Broytman wrote: > On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman wrote: >> On 8/22/2014 8:51 AM, Oleg Broytman wrote: >>> What encoding does have a text file (an HTML, to be precise) with >>> text in utf-8, ads in cp1251 (ad blocks were included from different >>> files) and comments in koi8-r? >>> Well, I must admit the HTML was rather an exception, but having a >>> text file with some strange characters (binary strings, or paragraphs >>> in different encodings) is not that exceptional. >> That's not a text file. That's a binary file containing (hopefully >> delimited, and documented) sections of encoded text in different >> encodings. > Allow me to disagree. For me, this is a text file which I can (and > do) view with a pager, edit with a text editor, list on a console, > search with grep and so on. If it is not a text file by strict Python3 > standards then these standards are too strict for me. Either I find a > simple workaround in Python3 to work with such texts or find a different > tool. 
I cannot avoid such files because my reality is much more complex > than strict text/binary dichotomy in Python3. > > Oleg. I was not declaring your file not to be a "text file" from any definition obtained from Python3 documentation, just from a common sense definition of "text file". Looking at it from Python3, though, it is clear that when opening a file in "text" mode, an encoding may be specified or will be assumed. That is one encoding, applying to the whole file, not 3 encodings, with declarations on when to switch between them. So I think, in general, Python3 assumes or defines a definition of text file that matches my "common sense" definition. Also, if it is an HTML file, I doubt the browser will use multiple different encodings when interpreting it, so it is not clear that the file is of practical use for its intended purpose if it contains text in multiple different encodings, but is served using only a single encoding, unless there is javascript or some programming in the browser that reencodes the data. On the other hand, Python3 provides various facilities for working with such files. The first I'll mention is the one that follows from my description of what your file really is: Python3 allows opening files in binary mode, and then decoding various sections of it using whatever encoding you like, using the bytes.decode() operation on various sections of the file. Determination of which sections are in which encodings is beyond the scope of this description of the technique, and is application dependent. The second is to specify an error handler, that, like you, is trained to recognize the other encodings and convert them appropriately. I'm not aware that such an error handler has been or could be written, myself not having your training. The third is to specify the UTF-8 with the surrogate escape error handler. This allows non-UTF-8 codes to be loaded into memory. 
You, or algorithms as smart as you, could perhaps be developed to detect and manipulate the resulting "lone surrogate" codes in meaningful ways, or could simply allow them to ride along without interpretation, and be emitted as the original, into other files. There may be other techniques that I am not aware of. Glenn From phd at phdru.name Fri Aug 22 20:50:05 2014 From: phd at phdru.name (Oleg Broytman) Date: Fri, 22 Aug 2014 20:50:05 +0200 Subject: [Python-Dev] Bytes path support In-Reply-To: <53F77941.3040700@g.nevcal.com> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> <53F77941.3040700@g.nevcal.com> Message-ID: <20140822185005.GA2388@phdru.name> On Fri, Aug 22, 2014 at 10:09:21AM -0700, Glenn Linderman wrote: > On 8/22/2014 9:52 AM, Oleg Broytman wrote: > >On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman wrote: > >>On 8/22/2014 8:51 AM, Oleg Broytman wrote: > >>> What encoding does have a text file (an HTML, to be precise) with > >>>text in utf-8, ads in cp1251 (ad blocks were included from different > >>>files) and comments in koi8-r? > >>> Well, I must admit the HTML was rather an exception, but having a > >>>text file with some strange characters (binary strings, or paragraphs > >>>in different encodings) is not that exceptional. > >>That's not a text file. That's a binary file containing (hopefully > >>delimited, and documented) sections of encoded text in different > >>encodings. > > Allow me to disagree. For me, this is a text file which I can (and > >do) view with a pager, edit with a text editor, list on a console, > >search with grep and so on.
If it is not a text file by strict Python3 > >standards then these standards are too strict for me. Either I find a > >simple workaround in Python3 to work with such texts or find a different > >tool. I cannot avoid such files because my reality is much more complex > >than strict text/binary dichotomy in Python3. > > I was not declaring your file not to be a "text file" from any > definition obtained from Python3 documentation, just from a common > sense definition of "text file". And in my opinion those files are perfect text. The files consist of lines separated by EOL characters (not necessary EOL characters of my OS because it could be a text file produced in a different OS), lines consist of words and words of characters. > Looking at it from Python3, though, it is clear that when opening a > file in "text" mode, an encoding may be specified or will be > assumed. That is one encoding, applying to the whole file, not 3 > encodings, with declarations on when to switch between them. So I > think, in general, Python3 assumes or defines a definition of text > file that matches my "common sense" definition. I don't have problems with Python3 text. I have problems with Python3 trying to get rid of byte strings and treating bytes as strict non-text. > On the other hand, Python3 provides various facilities for working > with such files. > > The first I'll mention is the one that follows from my description > of what your file really is: Python3 allows opening files in binary > mode, and then decoding various sections of it using whatever > encoding you like, using the bytes.decode() operation on various > sections of the file. Determination of which sections are in which > encodings is beyond the scope of this description of the technique, > and is application dependent. This is perhaps the most promising approach. 
If I can open a text file in binary mode, iterate it line by line, split every line of non-ascii bytes with .split() and process them that'd satisfy my needs. But still there are dragons. If I read a filename from such file I read it as bytes, not str, so I can only use low-level APIs to manipulate with those filenames. Pity. Let see a perfectly normal situation I am quite often in. A person sent me a directory full of MP3 files. The transport doesn't matter; it could be FTP, or rsync, or a zip file sent by email, or bittorrent. What matters is that filenames and content are in alien encodings. Most often it's cp1251 (the encoding used in Russian Windows) but can be koi8 or utf8. There is a playlist among the files -- a text file that lists MP3 files, every file on a single line; usually with full paths ("C:\Audio\some.mp3"). Now I want to read filenames from the file and process the filenames (strip paths) and files (verify existing of files, or renumber the files or extract ID3 tags [Russian ID3 tags, whatever ID3 standard says, are also in cp1251 of utf-8 encoding]...whatever). I don't know the encoding of the playlist but I know it corresponds to the encoding of filenames so I can expect those files exist on my filesystem; they have strangely looking unreadable names but they exist. Just a small example of why I do want to process filenames from a text file in an alien encoding. Without knowing the encoding in advance. > The second is to specify an error handler, that, like you, is > trained to recognize the other encodings and convert them > appropriately. I'm not aware that such an error handler has been or > could be written, myself not having your training. > > The third is to specify the UTF-8 with the surrogate escape error > handler. This allows non-UTF-8 codes to be loaded into memory. 
You, > or algorithms as smart as you, could perhaps be developed to detect > and manipulate the resulting "lone surrogate" codes in meaningful > ways, or could simply allow them to ride along without > interpretation, and be emitted as the original, into other files. Yes, these are different workarounds. Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From v+python at g.nevcal.com Fri Aug 22 22:17:44 2014 From: v+python at g.nevcal.com (Glenn Linderman) Date: Fri, 22 Aug 2014 13:17:44 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140822185005.GA2388@phdru.name> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> <53F77941.3040700@g.nevcal.com> <20140822185005.GA2388@phdru.name> Message-ID: <53F7A568.5090605@g.nevcal.com> On 8/22/2014 11:50 AM, Oleg Broytman wrote: > On Fri, Aug 22, 2014 at 10:09:21AM -0700, Glenn Linderman wrote: >> On 8/22/2014 9:52 AM, Oleg Broytman wrote: >>> On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman wrote: >>>> On 8/22/2014 8:51 AM, Oleg Broytman wrote: >>>>> What encoding does have a text file (an HTML, to be precise) with >>>>> text in utf-8, ads in cp1251 (ad blocks were included from different >>>>> files) and comments in koi8-r? >>>>> Well, I must admit the HTML was rather an exception, but having a >>>>> text file with some strange characters (binary strings, or paragraphs >>>>> in different encodings) is not that exceptional. >>>> That's not a text file. That's a binary file containing (hopefully >>>> delimited, and documented) sections of encoded text in different >>>> encodings. >>> Allow me to disagree. 
For me, this is a text file which I can (and >>> do) view with a pager, edit with a text editor, list on a console, >>> search with grep and so on. If it is not a text file by strict Python3 >>> standards then these standards are too strict for me. Either I find a >>> simple workaround in Python3 to work with such texts or find a different >>> tool. I cannot avoid such files because my reality is much more complex >>> than strict text/binary dichotomy in Python3. >> I was not declaring your file not to be a "text file" from any >> definition obtained from Python3 documentation, just from a common >> sense definition of "text file". > And in my opinion those files are perfect text. The files consist of > lines separated by EOL characters (not necessary EOL characters of my OS > because it could be a text file produced in a different OS), lines > consist of words and words of characters. Until you know or can deduce the encoding of a file, it is binary. If it has multiple, different, embedded encodings of text, it is still binary. In my opinion. So these are just opinions, and naming conventions. If you call it text, you have a different definition of text file than I do. > >> Looking at it from Python3, though, it is clear that when opening a >> file in "text" mode, an encoding may be specified or will be >> assumed. That is one encoding, applying to the whole file, not 3 >> encodings, with declarations on when to switch between them. So I >> think, in general, Python3 assumes or defines a definition of text >> file that matches my "common sense" definition. > I don't have problems with Python3 text. I have problems with Python3 > trying to get rid of byte strings and treating bytes as strict non-text. Python3 is not trying to get rid of byte strings. But to some extent, it is wanting to treat bytes as non-text... bytes can be encoded text, but is not text until it is decoded. 
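[An illustrative aside, not part of the original message: the "not text until it is decoded" point is easy to demonstrate in Python 3 — the very same bytes become entirely different characters under different codecs. The byte values below are chosen purely for illustration.]

```python
# One pair of bytes, three different "texts", depending on the codec.
raw = b"\xc1\xc2"
assert raw.decode('koi8-r') == "аб"    # Cyrillic lowercase a, b
assert raw.decode('cp1251') == "БВ"    # Cyrillic capital B, V
assert raw.decode('latin-1') == "ÁÂ"   # Latin capitals with acute/circumflex
```

Nothing about the bytes themselves says which interpretation is "the text"; that only exists once a codec has been applied.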
There is some processing that can be done on encoded text, but it has to be done differently (in many cases) than processing done on (non-encoded) text. One difference is the interpretation of what character is what varies from encoding to encoding, so if the processing requires understanding the characters, then the character code must be known. On the other hand, if it suffices to detect blocks of opaque text delimited by a known set of delimiters codes (EOL: CR, LF, combinations thereof) then that can be done relatively easily on binary, as long as the encoding doesn't have data puns where a multibyte encoded character might contain the code for the delimiter as one of the bytes of the code for the character. >> On the other hand, Python3 provides various facilities for working >> with such files. >> >> The first I'll mention is the one that follows from my description >> of what your file really is: Python3 allows opening files in binary >> mode, and then decoding various sections of it using whatever >> encoding you like, using the bytes.decode() operation on various >> sections of the file. Determination of which sections are in which >> encodings is beyond the scope of this description of the technique, >> and is application dependent. > This is perhaps the most promising approach. If I can open a text > file in binary mode, iterate it line by line, split every line of > non-ascii bytes with .split() and process them that'd satisfy my needs. > But still there are dragons. If I read a filename from such file I > read it as bytes, not str, so I can only use low-level APIs to > manipulate with those filenames. Pity. If the file names are in an unknown encoding, both in the directory and in the encoded text in the file listing, then unless you can deduce the encoding, you would be limited to doing manipulations with file APIs that support bytes, the low-level ones, yes. If you can deduce the encoding, then you are freed from that limitation. 
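[A small sketch, not part of the original exchange: the bytes-path APIs being called "low-level" here follow a simple bytes-in/bytes-out convention, with os.fsencode()/os.fsdecode() bridging the two worlds. The directory and file name below are invented for the demo.]

```python
import os
import tempfile

# Bytes in -> bytes out: os functions given a bytes path return bytes
# names, so undecodable file names pass through without an encoding guess.
d = tempfile.mkdtemp()            # throwaway directory for the demo
db = os.fsencode(d)               # str path -> bytes path
with open(os.path.join(db, b'track01.mp3'), 'wb'):
    pass
entries = os.listdir(db)          # -> [b'track01.mp3'], as bytes
# os.fsdecode() recovers a str (via surrogateescape) when one is needed:
assert os.fsdecode(entries[0]) == 'track01.mp3'
```

The same convention applies to open(), os.stat(), os.rename() and friends, so a playlist read in binary mode can be split into bytes file names and fed to the os layer directly.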
> Let see a perfectly normal situation I am quite often in. A person > sent me a directory full of MP3 files. The transport doesn't matter; it > could be FTP, or rsync, or a zip file sent by email, or bittorrent. What > matters is that filenames and content are in alien encodings. Most often > it's cp1251 (the encoding used in Russian Windows) but can be koi8 or > utf8. There is a playlist among the files -- a text file that lists MP3 > files, every file on a single line; usually with full paths > ("C:\Audio\some.mp3"). > Now I want to read filenames from the file and process the filenames > (strip paths) and files (verify existing of files, or renumber the files > or extract ID3 tags [Russian ID3 tags, whatever ID3 standard says, are > also in cp1251 of utf-8 encoding]...whatever). "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is utf-8, but it is not both. Maybe you meant "or" instead of "of". > I don't know the encoding > of the playlist but I know it corresponds to the encoding of filenames > so I can expect those files exist on my filesystem; they have strangely > looking unreadable names but they exist. > Just a small example of why I do want to process filenames from a > text file in an alien encoding. Without knowing the encoding in advance. An interesting example, for sure. Life will be easier when everyone converts to Unicode and UTF-8. > >> The second is to specify an error handler, that, like you, is >> trained to recognize the other encodings and convert them >> appropriately. I'm not aware that such an error handler has been or >> could be written, myself not having your training. >> >> The third is to specify the UTF-8 with the surrogate escape error >> handler. This allows non-UTF-8 codes to be loaded into memory. 
You, >> or algorithms as smart as you, could perhaps be developed to detect >> and manipulate the resulting "lone surrogate" codes in meaningful >> ways, or could simply allow >> them to ride along without >> interpretation, and be emitted as the original, into other files. > Yes, these are different workarounds. > > Oleg. From chris.barker at noaa.gov Fri Aug 22 20:51:20 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 22 Aug 2014 11:51:20 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: <53F77941.3040700@g.nevcal.com> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> <53F77941.3040700@g.nevcal.com> Message-ID: On Fri, Aug 22, 2014 at 10:09 AM, Glenn Linderman wrote: > What encoding does have a text file (an HTML, to be precise) with > text in utf-8, ads in cp1251 (ad blocks were included from different > files) and comments in koi8-r? > Well, I must admit the HTML was rather an exception, but having a > text file with some strange characters (binary strings, or paragraphs > in different encodings) is not that exceptional. > > That's not a text file. That's a binary file containing (hopefully > delimited, and documented) sections of encoded text in different > encodings. > > Allow me to disagree. For me, this is a text file which I can (and > do) view with a pager, edit with a text editor, list on a console, > search with grep and so on. If it is not a text file by strict Python3 > standards then these standards are too strict for me. Either I find a > simple workaround in Python3 to work with such texts or find a different > tool.
I cannot avoid such files because my reality is much more complex > than strict text/binary dichotomy in Python3. > > First -- we're getting OT here -- this thread was about file and path names, not the contents of files. But I suppose I brought that in when I talked about writing file names to files... The first I'll mention is the one that follows from my description of what > your file really is: Python3 allows opening files in binary mode, and then > decoding various sections of it using whatever encoding you like, using the > bytes.decode() operation on various sections of the file. Determination of > which sections are in which encodings is beyond the scope of this > description of the technique, and is application dependent. > Right -- and you would have wanted to open such a file in binary mode with py2 as well, but in that case, you'd have the contents in a py2 string object, which has a few more convenient ways to work with text (at least ascii-compatible) than the py3 bytes object does. The third is to specify the UTF-8 with the surrogate escape error handler. > This allows non-UTF-8 codes to be loaded into memory. You, or algorithms as > smart as you, could perhaps be developed to detect and manipulate the > resulting "lone surrogate" codes in meaningful ways, or could simply allow > them to ride along without interpretation, and be emitted as the original, > into other files. > Just so I'm clear here -- if you write that back out, encoded as utf-8 -- you'll get the exact same binary blob out as came in? I wonder if this would make it hard to preserve byte boundaries, though. By the way, IIUC, you can also use the python latin-1 decoder -- anything latin-1 will come through correctly, anything not valid latin-1 will come in as garbage, but if you re-encode with latin-1 the original bytes will be preserved. I think this will also preserve a 1:1 relationship between character count and byte count, which could be handy.
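Both round-trips asked about here can be checked directly. A small sketch (the byte string is just an arbitrary invalid-UTF-8 example):

```python
# Arbitrary bytes that are not valid UTF-8.
raw = b'abc\xff\xfe'

# surrogateescape: undecodable bytes become lone surrogates...
s = raw.decode('utf-8', errors='surrogateescape')
# ...and re-encoding with the same handler restores the exact bytes.
assert s.encode('utf-8', errors='surrogateescape') == raw

# latin-1: every byte maps to exactly one code point, so the
# round-trip is also lossless, and len(text) == len(bytes).
t = raw.decode('latin-1')
assert t.encode('latin-1') == raw
assert len(t) == len(raw)
```

Note that re-encoding the surrogateescape'd string with a strict handler (plain `s.encode('utf-8')`) raises, which is what keeps the undecodable bytes from silently leaking into genuinely Unicode output.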
-Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From chris.barker at noaa.gov Fri Aug 22 20:53:01 2014 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 22 Aug 2014 11:53:01 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140822024229.GA8192@phdru.name> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> Message-ID: On Thu, Aug 21, 2014 at 7:42 PM, Oleg Broytman wrote: > On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal < > chris.barker at noaa.gov> wrote: > > This brings up the other key problem. If file names are (almost) > > arbitrary bytes, how do you write one to/read one from a text file > > with a particular encoding? ( or for that matter display it on a > > terminal) > > There is no such thing as an encoding of text files. So we just > write those bytes to the file So I write bytes that are encoded one way into a text file that's encoded another way, and expect to be able to read that later? You're kidding, right? Only if that's the only thing in the file -- usually not the case with my text files. or output them to the terminal. I often do > that. My filesystems are full of files with names and content in > at least 3 different encodings - koi8-r, utf-8 and cp1251. So I open a > terminal with koi8 or utf-8 locale and fonts and some file always look > weird. But however weird they are it's possible to work with them.
> Not for me (or many other users) -- terminals are sometimes set with ascii-only encoding, so non-ascii barfs -- or you get some weird control characters that mess up your terminal -- dumping arbitrary bytes to a terminal does not always "just work". > > And people still want to say posix isn't broken in this regard? > > Not at all! And broken or not broken it's what I (for many different > reasons) prefer to use for my desktops, servers, notebooks, routers and > smartphones, Sorry -- that's a Red Herring -- I agree, "broken" or "simple and consistent" is irrelevant, we all want Python to work as well as it can on such systems. The point is that if you are reading a file name from the system, and then passing it back to the system, then you can treat it as just bytes -- who cares? And if you add the byte value of 47 thing, then you can even do basic path manipulations. But once you want to do other things with your file name, then you need to know the encoding. And it is very, very common for users to need to do other things with filenames, and they almost always want them as text that they can read and understand. Python3 supports this case very well. But it does indeed make it hard to work with filenames when you don't know the encoding they are in. And apparently that's pretty common -- or common enough that it would be nice for Python to support it well. This trick is how -- we'd like the "just pass it around and do path manipulations" case to work with (almost) arbitrary bytes, but everything else to work naturally with text (unicode text). Which brings us to the "what APIs should accept bytes" question. I think that's been pretty much answered: All the low-level ones, so that protocol and library programmers can write code that works on systems with undefined filename encodings. But: casual users still need to do the normal things with file names and paths, and ideally those should work the same way on all systems. 
I think the way to do this is to abstract the path concept, like pathlib does. Back in the day, paths were "just strings", and that worked OK with py2 strings, because you could put arbitrary bytes in them. But the "py2 strings were perfect" folks seem to not acknowledge that while they are nice for matching the posix filename model, they were a pain in the neck when you needed to do something else like write them into a JSON file or something. From my personal experience, non-ascii filenames are much easier to deal with if I use unicode for filenames everywhere (py2). Somehow, I have yet to be bitten by mixed encoding in filenames. So will using surrogate-escape error handling with pathlib make all this just work? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov From rosuav at gmail.com Fri Aug 22 23:04:20 2014 From: rosuav at gmail.com (Chris Angelico) Date: Sat, 23 Aug 2014 07:04:20 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: <53F7A568.5090605@g.nevcal.com> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> <53F77941.3040700@g.nevcal.com> <20140822185005.GA2388@phdru.name> <53F7A568.5090605@g.nevcal.com> Message-ID: On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman wrote: > "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is > utf-8, but it is not both. Maybe you meant "or" instead of "of". I'd assume "or" meant there, rather than "of", it's a common typo.
Not sure why 1251, specifically, but it's not uncommon for boundary code to attempt a decode that consists of something like "attempt UTF-8 decode, and if that fails, attempt an eight-bit decode". For my MUD clients, that's pretty much required; one of the servers I frequent is completely bytes-oriented, so whatever encoding one client uses will be dutifully echoed to every other client. There are some that correctly use UTF-8, but others use whatever they feel like; and since those naughty clients are mainly on Windows, I can reasonably guess that they'll be using CP-1252. So that's what I do: UTF-8, fall-back on 1252. (It's also possible some clients will be using Latin-1, but 1252 is a superset of that.) But it's important to note that this is a method of handling junk. It's not a design intention; this is for a situation where I really want to cope with any byte stream and attempt to display it as text. And if I get something that's neither UTF-8 nor CP-1252, I will display it wrongly, and there's nothing can be done about that. ChrisA From phd at phdru.name Sat Aug 23 00:09:31 2014 From: phd at phdru.name (Oleg Broytman) Date: Sat, 23 Aug 2014 00:09:31 +0200 Subject: [Python-Dev] Bytes path support In-Reply-To: <53F7A568.5090605@g.nevcal.com> References: <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> <53F77941.3040700@g.nevcal.com> <20140822185005.GA2388@phdru.name> <53F7A568.5090605@g.nevcal.com> Message-ID: <20140822220931.GB2388@phdru.name> On Fri, Aug 22, 2014 at 01:17:44PM -0700, Glenn Linderman wrote: > >in cp1251 of utf-8 encoding > > "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or > it is utf-8, but it is not both. Maybe you meant "or" instead of > "of". But of course! Oleg. 
-- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From phd at phdru.name Sat Aug 23 00:21:18 2014 From: phd at phdru.name (Oleg Broytman) Date: Sat, 23 Aug 2014 00:21:18 +0200 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> Message-ID: <20140822222118.GC2388@phdru.name> On Fri, Aug 22, 2014 at 11:53:01AM -0700, Chris Barker wrote: > Back in the day, paths were "just strings", and that worked OK with > py2 strings, because you could put arbitrary bytes in them. But the "py2 > strings were perfect" folks seem to not acknowledge that while they are > nice for matching the posix filename model, they were a pain in the neck > when you needed to do something else like write them into a JSON file or > something. This is the core of the problem. Python2 favors the Unix model but Windows people pay the price. Python3 reverses that, and I'm still deciding whether I want to pay the new price. > So will using a surrogate-escape error handling with pathlib make all this > just work? I'm involved in developing and maintaining a few big commercial projects that will hardly be ported to Python3. So I'm stuck with Python2 for many years and I haven't tried Python3. Maybe I should try a small personal project, but certainly not this year. Maybe the next one... Oleg.
From phd at phdru.name Sat Aug 23 00:26:37 2014 From: phd at phdru.name (Oleg Broytman) Date: Sat, 23 Aug 2014 00:26:37 +0200 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> <53F77941.3040700@g.nevcal.com> <20140822185005.GA2388@phdru.name> <53F7A568.5090605@g.nevcal.com> Message-ID: <20140822222637.GD2388@phdru.name> On Sat, Aug 23, 2014 at 07:04:20AM +1000, Chris Angelico wrote: > On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman wrote: > > "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is > > utf-8, but it is not both. Maybe you meant "or" instead of "of". > > I'd assume "or" meant there, rather than "of", it's a common typo. > > Not sure why 1251, specifically This is the encoding of Russian Windows. Files and emails in Russia are mostly in cp1251 encoding; something like 60-70%, I think. The second popular encoding is cp866 (Russian DOS); it's used by Windows as OEM encoding. Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. 
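The "pick one most likely 8-bit encoding" fallback Chris Angelico describes can be sketched in a few lines. This is only an illustrative sketch: the default fallback here is cp1251 (matching the Russian content discussed above; Chris's MUD client uses cp1252 instead), and `errors='replace'` papers over the few bytes cp1251/cp1252 leave undefined.

```python
def display_text(data: bytes, fallback='cp1251') -> str:
    """Best-effort decode of junk input: try UTF-8 first, then
    fall back to one 'most likely' 8-bit encoding."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        # Wrong for anything that is neither UTF-8 nor the fallback,
        # but at this point there is no right answer.
        return data.decode(fallback, errors='replace')
```

As noted above, this is a way of coping with junk, not a design to aim for: input that is in some third encoding will still display wrongly.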
From rosuav at gmail.com Sat Aug 23 00:28:09 2014 From: rosuav at gmail.com (Chris Angelico) Date: Sat, 23 Aug 2014 08:28:09 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140822222637.GD2388@phdru.name> References: <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> <53F77941.3040700@g.nevcal.com> <20140822185005.GA2388@phdru.name> <53F7A568.5090605@g.nevcal.com> <20140822222637.GD2388@phdru.name> Message-ID: On Sat, Aug 23, 2014 at 8:26 AM, Oleg Broytman wrote: > On Sat, Aug 23, 2014 at 07:04:20AM +1000, Chris Angelico wrote: >> On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman wrote: >> > "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is >> > utf-8, but it is not both. Maybe you meant "or" instead of "of". >> >> I'd assume "or" meant there, rather than "of", it's a common typo. >> >> Not sure why 1251, specifically > > This is the encoding of Russian Windows. Files and emails in Russia > are mostly in cp1251 encoding; something like 60-70%, I think. The > second popular encoding is cp866 (Russian DOS); it's used by Windows as > OEM encoding. Yeah, that makes sense. In any case, you pick one "most likely" 8-bit encoding and go with it. ChrisA From rdmurray at bitdance.com Sat Aug 23 04:20:55 2014 From: rdmurray at bitdance.com (R. 
David Murray) Date: Fri, 22 Aug 2014 22:20:55 -0400 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140822222118.GC2388@phdru.name> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822222118.GC2388@phdru.name> Message-ID: <20140823022056.8CD74250E68@webabinitio.net> On Sat, 23 Aug 2014 00:21:18 +0200, Oleg Broytman wrote: > I'm involved in developing and maintaining a few big commercial > projects that will hardly be ported to Python3. So I'm stuck with > Python2 for many years and I haven't tried Python3. May be I should try > a small personal project, but certainly not this year. May be the next > one... Yes, you should try it. Really, it's not the monster you are constructing in your mind. The functions that read filenames and return them as text use surrogate escape to preserve the bytes, and the functions that accept filenames use surrogate escape to recover those bytes before passing them back to the OS. So posix binary filenames just work, as long as the only thing you depend on is being able to split and join them on the / character (and possibly the . character) and otherwise treat the names as black boxes...which is exactly the same situation you are in in python2. If you need to read filenames out of a file, you'll need to specify the surrogate escape error handler so that the bytes will be there to be recovered when you pass them to the file system functions, but it will work. Or, as discussed, you can treat them as binary and use the os level functions that accept binary input (which are exactly the ones you are used to using in python2). This includes os.path.split and os.path.join, which as noted are the only things you can depend on working correctly when you don't know the encoding of the filenames. 
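The round-trip described here -- text-returning APIs preserving raw bytes via surrogateescape, and the bytes-level os functions accepting them directly -- can be checked with a short sketch. This assumes a POSIX (e.g. Linux) filesystem that accepts arbitrary non-UTF-8 bytes in names; the file name itself is a made-up example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # An arbitrary file name that is not valid UTF-8.
    name = b'demo-\xdb\xdf.txt'
    path = os.path.join(os.fsencode(d), name)   # stay in bytes throughout
    with open(path, 'wb') as f:                  # bytes paths go straight to the OS
        f.write(b'x')

    # Text-returning APIs preserve the bytes as lone surrogates...
    listed = os.listdir(d)[0]
    # ...and os.fsencode() recovers the original bytes exactly.
    assert os.fsencode(listed) == name

    # The surrogate-escaped str also works when passed back to the OS.
    assert os.path.exists(os.path.join(d, listed))
```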
So, the way to look at this is that python3 is no worse[1] than python2 for handling posix binary filenames, and also provides additional features if you *do* know the correct encoding of the filenames. --David [1] modulo any remaining API bugs, which is exactly where this thread started: trying to figure out which APIs need to be able to handle binary paths and/or surrogate escaped paths so that posix filenames consistently work as well in python3 as they did in python2). From stephen at xemacs.org Sat Aug 23 10:02:25 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 23 Aug 2014 17:02:25 +0900 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> <53F77941.3040700@g.nevcal.com> Message-ID: <87ioljd46m.fsf@uwakimon.sk.tsukuba.ac.jp> Chris Barker writes: > > The third is to specify the UTF-8 with the surrogate escape error > > handler. This allows non-UTF-8 codes to be loaded into > > memory. Read as bytes and incrementally decode. If you hit an Exception, retry from that point. > Just so I'm clear here -- if you write that back out, encoded as > utf-8 -- you'll get the exact same binary blob out as came in? If and only if there are no changes to the content. > I wonder if this would make it hard to preserve byte boundaries, > though. I'm not sure what you mean by "byte boundaries". If you mean after concatenation of such objects, yes, the uninterpretable bytes will be encoded in such a way as to be identifiable as lone bytes; they won't be interpreted as Unicode characters. 
> By the way, IIUC correctly, you can also use the python latin-1 > decoder -- anything latin-1 will come through correctly, anything > not valid latin-1 will come in as garbage, but if you re-encode > with latin-1 the original bytes will be preserved. I think this > will also preserve a 1:1 relationship between character count and > byte count, which could be handy. Bad idea, especially for Oleg's use case -- you can't decode those by codec without reencoding to bytes first. No point in abandoning codecs just because there isn't one designed for his use case exactly. Just read as bytes and decode piecewise in one way or another. For Oleg's HTML case, there's a well-understood structure that can be used to determine retry points and a very few plausible coding systems, which can be fairly well distinguished by the range of bytes used and probably nearly perfectly with additional information from the structure and distribution of apparently decoded characters. From stephen at xemacs.org Sat Aug 23 10:20:40 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 23 Aug 2014 17:20:40 +0900 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> <53F77941.3040700@g.nevcal.com> <20140822185005.GA2388@phdru.name> <53F7A568.5090605@g.nevcal.com> Message-ID: <87ha13d3c7.fsf@uwakimon.sk.tsukuba.ac.jp> Chris Angelico writes: > Not sure why 1251, All of those codes have repertoires that are Cyrillic supersets, presumably Russian-language content, based on Oleg's top domain. > But it's important to note that this is a method of handling junk. > It's not a design intention; this is for a situation where I really > want to cope with any byte stream and attempt to display it as text. 
> And if I get something that's neither UTF-8 nor CP-1252, I will > display it wrongly, and there's nothing can be done about that. Of course there is. It just gets more heuristic the more numerous the potential encodings are. From stephen at xemacs.org Sat Aug 23 11:02:06 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 23 Aug 2014 18:02:06 +0900 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> Message-ID: <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> Chris Barker writes: > So I write bytes that are encoded one way into a text file that's encoded > another way, and expect to be abel to read that later? No, not you. Crap software does that. Your MUD server. Oleg's favorite web pages with ads, or more likely the ad servers. > Not for me (or many other users) -- terminals are sometimes set > with ascii-only encoding, So? That means you can't handle text files in general, only those restricted to ASCII. That's a completely different issue. > Python3 supports this case very well. But it does indeed make it > hard to work with filenames when you don't know the encoding they > are in. No, it doesn't. Reasonably handling "text streams" in unknown, possibly multiple, encodings is just hard. Python 3 has nothing to do with it, and Oleg should know that very well. It's true that code written in Python 2 to handle these issues needs to be ported to Python 3. Thing is, Oleg says "another tool" -- any non-Python-2 tool will need porting of his code too. > And apparently that's pretty common -- or common enough that it > would be nice for Python to support it well. This trick is how -- > we'd like the "just pass it around and do path manipulations" case > to work with (almost) arbitrary bytes,
No gloss, please. It's text, period. The internal Unicode encoding is *not exposed*, with a few (important) exceptions such as Han unification. > I think the way to do this is to abstract the path concept, like pathlib > does. You forgot to append the word "well". > From my personal experience, non-ascii filenames are much easier to > deal with if I use unicode for filenames everywhere (py2). Somehow, > I have yet to be bitten by mixed encoding in filenames. .gov domain? ASCII-only terminal settings? It's not "somehow", it's that you live a sheltered life. > So will using a surrogate-escape error handling with pathlib make > all this just work? Not answerable until you define "all this" more precisely. And that's the big problem with Oleg's complaint, too. It's not at all clear what he wants, except that all of his current code should continue to work in Python 3. Just like all of us. The question then is persuading him that it's worth moving to Python 3 despite the effort of porting Python-2-specific code. Maybe he can be persuaded, maybe not. Python 2 is a better than average language. From marko at pacujo.net Sat Aug 23 10:21:57 2014 From: marko at pacujo.net (Marko Rauhamaa) Date: Sat, 23 Aug 2014 11:21:57 +0300 Subject: [Python-Dev] Bytes path support In-Reply-To: <87ioljd46m.fsf@uwakimon.sk.tsukuba.ac.jp> (Stephen J. Turnbull's message of "Sat, 23 Aug 2014 17:02:25 +0900") References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> <53F77941.3040700@g.nevcal.com> <87ioljd46m.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87vbpj8vkq.fsf@elektro.pacujo.net> "Stephen J. Turnbull" : > Just read as bytes and decode piecewise in one way or another. 
For > Oleg's HTML case, there's a well-understood structure that can be used > to determine retry points

HTML and XML are interesting examples since their encoding is initially unknown:

    <?xml version="1.0" encoding="UTF-8"?>
                                  ^
                                  +--- Now I know it is UTF-8

    <?xml version="1.0" encoding="UTF-16"?>
                                  ^
                                  +--- Now I know it was UTF-16 all along!

Then we have:

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=ISO-8859-1

    <html>
     <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-16">

See how deep you have to parse the TCP stream before you realize the content encoding is UTF-16. Marko From rosuav at gmail.com Sat Aug 23 11:32:57 2014 From: rosuav at gmail.com (Chris Angelico) Date: Sat, 23 Aug 2014 19:32:57 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On Sat, Aug 23, 2014 at 7:02 PM, Stephen J. Turnbull wrote: > Chris Barker writes: > > > So I write bytes that are encoded one way into a text file that's encoded > > another way, and expect to be abel to read that later? > > No, not you. Crap software does that. Your MUD server. Oleg's > favorite web pages with ads, or more likely the ad servers. Just to clarify: Presumably you're referring to my previous post regarding my MUD client's heuristic handling of broken encodings. It's "my server" in the sense of the one that I'm connecting to, and not in the sense that I control it. I do also run a MUD server, and it guarantees that everything it sends is UTF-8. (Incidentally, that server has the exact same set of heuristics for coping with broken encodings from other clients. There's no escaping it.) Your point is absolutely right: mess like that is to cope with the fact that there's broken stuff out there.
ChrisA From marko at pacujo.net Sat Aug 23 11:46:34 2014 From: marko at pacujo.net (Marko Rauhamaa) Date: Sat, 23 Aug 2014 12:46:34 +0300 Subject: [Python-Dev] Bytes path support In-Reply-To: (Isaac Morland's message of "Sat, 23 Aug 2014 05:27:54 -0400 (EDT)") References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> <53F77941.3040700@g.nevcal.com> <87ioljd46m.fsf@uwakimon.sk.tsukuba.ac.jp> <87vbpj8vkq.fsf@elektro.pacujo.net> Message-ID: <87a96v8rnp.fsf@elektro.pacujo.net> Isaac Morland : >> HTTP/1.1 200 OK >> Content-Type: text/html; charset=ISO-8859-1 >> >> >> >> >> > > For HTML it's not quite so bad. According to the HTML 4 standard: > [...] > > The Content-Type header takes precedence over a element. I > thought I read once that the reason was to allow proxy servers to > transcode documents but I don't have a cite for that. Also, the > element "must only be used when the character encoding is organized > such that ASCII-valued bytes stand for ASCII characters" so the > initial UTF-16 example wouldn't be conformant in HTML. That's not how I read it: The META declaration must only be used when the character encoding is organized such that ASCII characters stand for themselves (at least until the META element is parsed). META declarations should appear as early as possible in the HEAD element. IOW, you must obey the HTTP character encoding until you have parsed a conflicting META content-type declaration. The author of the standard keeps a straight face and continues: For cases where neither the HTTP protocol nor the META element provides information about the character encoding of a document, HTML also provides the charset attribute on several elements. 
By combining these mechanisms, an author can greatly improve the chances that, when the user retrieves a resource, the user agent will recognize the character encoding. Marko From stephen at xemacs.org Sat Aug 23 12:14:47 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 23 Aug 2014 19:14:47 +0900 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140822222118.GC2388@phdru.name> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822222118.GC2388@phdru.name> Message-ID: <87egw7cy20.fsf@uwakimon.sk.tsukuba.ac.jp> Oleg Broytman writes: > This is the core of the problem. Python2 favors Unix model but > Windows people pays the price. Python3 reverses that This is certainly not true. What is true is that Python 3 makes no attempt to make it easy to write crappy software in the old Unix style, that breaks when unexpected character encoding are encountered. Python 3 is designed to make it easier to write reliable software, even if it will only ever be used on one platform. Nevertheless, it's still a reasonable language for writing byte-shoveling software, with the last piece in place as of the acceptance of PEP 461. As of that PEP, you can use regexps for tokenizing byte streams and %-formatting to conveniently produce them. If you want to treat them piecewise as character streams with different encodings, you have a large library of codecs, which provide an incremental decoder interface. While AFAIK no codec implements a decode-until-error mode, that's not all that much of a loss, as many encodings overlap. Eg, if you start decoding using a latin-1 codec, decoding the whole document will succeed, even if it switches to windows-1251 in the meantime. Oleg, I gather Russian is your native language. That's moderately complicated, I admit. 
But the Russians are a distant second to the Japanese in self-destructive proliferation of incompatible character coding standards and non-standard variants. After 24 years of dealing with the mess that is East Asian encodings (which is even bound up with the "religion" of Japanese exceptionalism -- some Japanese have argued that there is a spiritual superiority to Japanese JIS codes!), I cannot believe you are going to find a better environment for dealing with these issues than Python 3. From steve at pearwood.info Sat Aug 23 13:08:29 2014 From: steve at pearwood.info (Steven D'Aprano) Date: Sat, 23 Aug 2014 21:08:29 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> Message-ID: <20140823110828.GY25957@ando> On Fri, Aug 22, 2014 at 11:53:01AM -0700, Chris Barker wrote: > The point is that if you are reading a file name from the system, and then > passing it back to the system, then you can treat it as just bytes -- who > cares? And if you add the byte value of 47 thing, then you can even do > basic path manipulations. But once you want to do other things with your > file name, then you need to know the encoding. And it is very, very common > for users to need to do other things with filenames, and they almost always > want them as text that they can read and understand. > > Python3 supports this case very well. But it does indeed make it hard to > work with filenames when you don't know the encoding they are in. Just "not knowing" is not sufficient. 
In that case, you'll likely get a Unicode string containing mojibake: # I write a file name using UTF-8 on my system: filename = 'music by ????.txt'.encode('utf-8') # You try to use it assuming ISO-8859-7 (Greek) filename.decode('iso-8859-7') => 'music by ?\x9d??????.txt' which, even though it looks wrong, still lets you refer to the file (provided you then encode back to bytes with ISO-8859-7 again). This won't always be the case; sometimes the encoding you guess will be wrong. When I started this email, I originally began to say that the actual problem was with byte file names that cannot be decoded into Unicode using the system encoding (typically UTF-8 on Linux systems). But I've actually had difficulty demonstrating that it actually is a problem. I started with a byte sequence which is invalid UTF-8, namely: b'ZZ\xdb\xdf\xfa\xff' created a file with that name, and then tried listing it with os.listdir. Even in Python 3.1 it worked fine. I was able to list the directory and open the file, so I'm not entirely sure where the problem lies exactly. Can somebody demonstrate the failure mode? -- Steven From rdmurray at bitdance.com Sat Aug 23 15:41:22 2014 From: rdmurray at bitdance.com (R. David Murray) Date: Sat, 23 Aug 2014 09:41:22 -0400 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140823110828.GY25957@ando> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140823110828.GY25957@ando> Message-ID: <20140823134122.85F17250E6A@webabinitio.net> On Sat, 23 Aug 2014 21:08:29 +1000, Steven D'Aprano wrote: > When I started this email, I originally began to say that the actual > problem was with byte file names that cannot be decoded into Unicode > using the system encoding (typically UTF-8 on Linux systems). But I've > actually had difficulty demonstrating that it actually is a problem.
I > started with a byte sequence which is invalid UTF-8, namely: > > b'ZZ\xdb\xdf\xfa\xff' > > created a file with that name, and then tried listing it with > os.listdir. Even in Python 3.1 it worked fine. I was able to list the > directory and open the file, so I'm not entirely sure where the problem > lies exactly. Can somebody demonstrate the failure mode? The "failure" happens only when you try to cross from the domain of posix binary filenames into the domain of text streams (that is, a stream with a consistent encoding). If you stick with os interfaces that handle filenames, Python3 handles posix bytes filenames just fine (though there may be a few corner-case rough edges yet to be fixed, and the standard streams were one of them). The difficulty comes if you try to use a filename that contains undecodable bytes in a non-os-interface text context (such as writing it to a text file that you have declared to have a utf-8 encoding): there you will get an error...not completely unlike the old "your code works until your user uses unicode" problem we had in python2, but in this case only happening in a very narrow set of circumstances involving trying to translate between one domain (posix binary filenames) and another domain (io streams with a consistent declared encoding). This is not a common operation, but appears to be the one Oleg is concerned about. The old unicode-blowup errors would happen almost any time someone with a non-ascii language tried to use a program written by an ascii-only programmer (which was most of us). The same problem existed in python2 if your goal was to produce a stream with a consistent encoding, but now python3 treats that as an error. If you really want a stream with an inconsistent encoding, open it as binary and use the surrogate escape error handler to recover the bytes in the filenames. That is, *be explicit* about your intentions.
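A minimal sketch of that explicit round trip, reusing Steven's byte sequence from above:

```python
# Steven's byte sequence, invalid as UTF-8:
raw = b'ZZ\xdb\xdf\xfa\xff'

# surrogateescape smuggles the undecodable bytes through str as lone
# surrogates (U+DCDB, U+DCDF, U+DCFA, U+DCFF), so no information is lost:
name = raw.decode('utf-8', errors='surrogateescape')

# Re-encoding with the same error handler recovers the original bytes
# exactly; os.fsdecode()/os.fsencode() apply this policy for you on POSIX.
assert name.encode('utf-8', errors='surrogateescape') == raw
```

This is the same smuggling that os.listdir() uses, which is why Steven's directory listing "just worked".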
So yes, we've shifted a burden from those who want non-ascii text to work consistently to those who wanted inconsistently encoded text to "just work" (or rather *appear* to "just work"). The number of people who benefit from the improved text model *greatly* outweighs the number of people inconvenienced by the new strictness when the domain line (posix binary filenames to consistently encoded text stream) is crossed. And the result is more *valid* programs, and fewer unexpected errors overall, with no inconvenience unless that domain line is crossed, and even then the inconvenience is limited to the open call that creates the binary stream. --David From phd at phdru.name Sat Aug 23 17:15:52 2014 From: phd at phdru.name (Oleg Broytman) Date: Sat, 23 Aug 2014 17:15:52 +0200 Subject: [Python-Dev] Bytes path support In-Reply-To: <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <20140823151552.GA4264@phdru.name> On Sat, Aug 23, 2014 at 06:02:06PM +0900, "Stephen J. Turnbull" wrote: > And that's the big problem with Oleg's complaint, too. It's not at > all clear what he wants The first thing is I want to understand why people continue to refer to Unix was as "broken". Better yet, to persuade them it's not. Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN.
From phd at phdru.name Sat Aug 23 17:16:39 2014 From: phd at phdru.name (Oleg Broytman) Date: Sat, 23 Aug 2014 17:16:39 +0200 Subject: [Python-Dev] Bytes path support In-Reply-To: <87egw7cy20.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822222118.GC2388@phdru.name> <87egw7cy20.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <20140823151639.GB4264@phdru.name> On Sat, Aug 23, 2014 at 07:14:47PM +0900, "Stephen J. Turnbull" wrote: > I cannot believe you are going to find a better environment for > dealing with these issues than Python 3. Well, that may be. Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From ijmorlan at uwaterloo.ca Sat Aug 23 11:27:54 2014 From: ijmorlan at uwaterloo.ca (Isaac Morland) Date: Sat, 23 Aug 2014 05:27:54 -0400 (EDT) Subject: [Python-Dev] Bytes path support In-Reply-To: <87vbpj8vkq.fsf@elektro.pacujo.net> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> <53F77941.3040700@g.nevcal.com> <87ioljd46m.fsf@uwakimon.sk.tsukuba.ac.jp> <87vbpj8vkq.fsf@elektro.pacujo.net> Message-ID: On Sat, 23 Aug 2014, Marko Rauhamaa wrote: > "Stephen J. Turnbull" : > >> Just read as bytes and decode piecewise in one way or another. For >> Oleg's HTML case, there's a well-understood structure that can be used >> to determine retry points > > HTML and XML are interesting examples since their encoding is initially > unknown: > > > ^ > +--- Now I know it is UTF-8 > > > ^ > +--- Now I know it was UTF-16 > all along!
> > Then we have: > > > HTTP/1.1 200 OK > Content-Type: text/html; charset=ISO-8859-1 > > > > > > > See how deep you have to parse the TCP stream before you realize the > content encoding is UTF-16. For HTML it's not quite so bad. According to the HTML 4 standard: http://www.w3.org/TR/html4/charset.html The Content-Type header takes precedence over a META element. I thought I read once that the reason was to allow proxy servers to transcode documents but I don't have a cite for that. Also, the META element "must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters" so the initial UTF-16 example wouldn't be conformant in HTML. HTML 5 allows non-ASCII-compatible encodings as long as U+FEFF (byte order mark) is used: http://www.w3.org/TR/html-markup/syntax.html#encoding-declaration Not sure about XML. Of course this whole area is a bit of an "arms race" between programmers competing to get away with being as sloppy as possible and other programmers who have to deal with their mess. Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist From marko at pacujo.net Sat Aug 23 18:33:06 2014 From: marko at pacujo.net (Marko Rauhamaa) Date: Sat, 23 Aug 2014 19:33:06 +0300 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140823134122.85F17250E6A@webabinitio.net> (R. David Murray's message of "Sat, 23 Aug 2014 09:41:22 -0400") References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140823110828.GY25957@ando> <20140823134122.85F17250E6A@webabinitio.net> Message-ID: <87tx536u9p.fsf@elektro.pacujo.net> "R. David Murray" : > The same problem existed in python2 if your goal was to produce a stream > with a consistent encoding, but now python3 treats that as an error. I have a different interpretation of the situation: as a rule, use byte strings in Python3.
Text strings are a special corner case for applications that have to deal with human languages. If your application has to talk SMTP, use bytes. If your application has to do IPC, use bytes. If your application has to do file I/O, use bytes. If your application is a word processor or an IM client, you have text strings available. You might find, though, that barely any modern GUI application is satisfied with crude text strings. You will need weights, styles, sizes, emoticons, positions, directions, shadows, alignment etc etc so it may be that Python's text strings are only good enough for storing individual characters or short snippets. In sum, Python's text strings might have one sweet spot: Usenet clients. Marko From p.f.moore at gmail.com Sat Aug 23 19:40:37 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 23 Aug 2014 18:40:37 +0100 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140823151552.GA4264@phdru.name> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> <20140823151552.GA4264@phdru.name> Message-ID: On 23 August 2014 16:15, Oleg Broytman wrote: > On Sat, Aug 23, 2014 at 06:02:06PM +0900, "Stephen J. Turnbull" wrote: >> And that's the big problem with Oleg's complaint, too. It's not at >> all clear what he wants > > The first thing is I want to understand why people continue to refer > to Unix was as "broken". Better yet, to persuade them it's not. Generally, it seems to be mostly a reaction to the repeated claims that Python, or Windows, or whatever, is "broken". Unix advocates (not yourself) are prone to declaring anything *other* than the Unix model as "broken", so it's tempting to give them a taste of their own medicine. Sorry for that (to the extent that I was one of the people doing so). Rhetoric aside, none of Unix, Windows or Python are "broken". 
They just react in different ways to fundamentally difficult edge cases. But expecting Python (a cross-platform language) to prefer the Unix model is putting all the pain on non-Unix users of Python, which I don't feel is reasonable. Let's all compromise a little. Paul PS The key thing *I* think is a problem with the Unix behaviour is that it treats filenames as bytes rather than Unicode. People name files using *characters*. So every filename is semantically text, in the mind of the person who created it. Unix enforces a transformation to bytes, but does not retain the encoding of those bytes. So information about the original author's intent is lost. But that's a historical fact, baked into Unix at a low level. Whether that's "broken" or just "something to deal with" is not important to me. From phd at phdru.name Sat Aug 23 20:37:29 2014 From: phd at phdru.name (Oleg Broytman) Date: Sat, 23 Aug 2014 20:37:29 +0200 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> <20140823151552.GA4264@phdru.name> Message-ID: <20140823183729.GA7819@phdru.name> Hi! On Sat, Aug 23, 2014 at 06:40:37PM +0100, Paul Moore wrote: > On 23 August 2014 16:15, Oleg Broytman wrote: > > On Sat, Aug 23, 2014 at 06:02:06PM +0900, "Stephen J. Turnbull" wrote: > >> And that's the big problem with Oleg's complaint, too. It's not at > >> all clear what he wants > > > > The first thing is I want to understand why people continue to refer > > to Unix was as "broken". Better yet, to persuade them it's not. "Unix was" => "Unix way" > Generally, it seems to be mostly a reaction to the repeated claims > that Python, or Windows, or whatever, is "broken". Ah, if that's the only problem I certainly can live with that. My problem is that it *seems* this anti-Unix attitude infiltrates Python core development. 
I very much hope I'm wrong and it really isn't. > Unix advocates (not > yourself) are prone to declaring anything *other* than the Unix model > as "broken", so it's tempting to give them a taste of their own > medicine. Sorry for that (to the extent that I was one of the people > doing so). You didn't see me in my younger years. I surely was one of those Windows bashers. Please take my apology. > Rhetoric aside, none of Unix, Windows or Python are "broken". They > just react in different ways to fundamentally difficult edge cases. > > But expecting Python (a cross-platform language) to prefer the Unix > model is putting all the pain on non-Unix users of Python, which I > don't feel is reasonable. Let's all compromise a little. > > Paul > > PS The key thing *I* think is a problem with the Unix behaviour is > that it treats filenames as bytes rather than Unicode. People name > files using *characters*. So every filename is semantically text, in > the mind of the person who created it. Unix enforces a transformation > to bytes, but does not retain the encoding of those bytes. So > information about the original author's intent is lost. But that's a > historical fact, baked into Unix at a low level. Whether that's > "broken" or just "something to deal with" is not important to me. The problem is hardly specific to Unix. Despite Joel Spolsky's "There Ain't No Such Thing As Plain Text" people create text files all the time. Without specifying an encoding. And put filenames into those text files (audio playlists, like .m3u and .pls are just text files with pathnames). Unix takes the idea that everything is text and a stream of bytes to its extreme. Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. 
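The playlist case Oleg raises above can be handled in Python 3 today: with surrogateescape, pathnames whose encoding was never declared still round-trip through str. A sketch, with hypothetical file name and contents:

```python
import os
import tempfile

# A hypothetical playlist whose second entry is not valid UTF-8
# (0xe9 is e-acute in Latin-1), saved with no declared encoding:
raw = b'/music/song.mp3\n/music/caf\xe9.mp3\n'
playlist = os.path.join(tempfile.mkdtemp(), 'playlist.m3u')
with open(playlist, 'wb') as f:
    f.write(raw)

# Read it as text anyway; surrogateescape keeps the stray byte
# recoverable instead of raising UnicodeDecodeError:
with open(playlist, encoding='utf-8', errors='surrogateescape') as f:
    paths = [line.rstrip('\n') for line in f]

# Each entry can still be handed back to the OS byte-for-byte:
assert paths[1].encode('utf-8', errors='surrogateescape') == b'/music/caf\xe9.mp3'
```

The price, as discussed above, is that the surrogate-laden string cannot be written to a strictly encoded text stream without the same error handler.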
From p.f.moore at gmail.com Sat Aug 23 22:42:45 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 23 Aug 2014 21:42:45 +0100 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140823183729.GA7819@phdru.name> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> <20140823151552.GA4264@phdru.name> <20140823183729.GA7819@phdru.name> Message-ID: On 23 August 2014 19:37, Oleg Broytman wrote: > Unix takes the idea that everything is text and a stream of bytes to > its extreme. I don't really understand the idea of "text and a stream of bytes". The two are fundamentally different in my view. But I guess that's why we have to agree to differ - our perspectives are just very different. Paul From greg.ewing at canterbury.ac.nz Sun Aug 24 03:11:10 2014 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sun, 24 Aug 2014 13:11:10 +1200 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> <53F77941.3040700@g.nevcal.com> <87ioljd46m.fsf@uwakimon.sk.tsukuba.ac.jp> <87vbpj8vkq.fsf@elektro.pacujo.net> Message-ID: <53F93BAE.5050803@canterbury.ac.nz> Isaac Morland wrote: > In HTML 5 it allows non-ASCII-compatible encodings as long as U+FEFF > (byte order mark) is used: > > http://www.w3.org/TR/html-markup/syntax.html#encoding-declaration > > Not sure about XML. According to Appendix F here: http://www.w3.org/TR/xml/#sec-guessing an XML parser needs to be prepared to try all the encodings it supports until it finds one that works well enough to decode the XML declaration, then it can find out the exact encoding used. 
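The first stage of that Appendix F process can be sketched as a byte-signature check (a simplification: the real table also covers UCS-4, EBCDIC, and unusual byte orders):

```python
def sniff_xml_encoding(data: bytes) -> str:
    """Pick a provisional encoding, just good enough to read the XML
    declaration; the declaration then names the exact encoding."""
    if data.startswith(b'\xfe\xff'):
        return 'utf-16-be'          # byte order mark
    if data.startswith(b'\xff\xfe'):
        return 'utf-16-le'          # byte order mark
    if data.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'              # UTF-8 "BOM"
    if data.startswith(b'\x00<\x00?'):
        return 'utf-16-be'          # '<?' without a BOM
    if data.startswith(b'<\x00?\x00'):
        return 'utf-16-le'
    # '<?xml' in some ASCII-compatible encoding; decode the declaration
    # with this and then switch to whatever encoding= names.
    return 'utf-8'
```

Once the declaration is readable, the parser switches to the exact encoding it names, which is the two-step dance Greg describes.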
-- Greg From ncoghlan at gmail.com Sun Aug 24 05:27:55 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 24 Aug 2014 13:27:55 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140823183729.GA7819@phdru.name> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> <20140823151552.GA4264@phdru.name> <20140823183729.GA7819@phdru.name> Message-ID: On 24 August 2014 04:37, Oleg Broytman wrote: > On Sat, Aug 23, 2014 at 06:40:37PM +0100, Paul Moore wrote: >> Generally, it seems to be mostly a reaction to the repeated claims >> that Python, or Windows, or whatever, is "broken". > > Ah, if that's the only problem I certainly can live with that. My > problem is that it *seems* this anti-Unix attitude infiltrates Python > core development. I very much hope I'm wrong and it really isn't. The POSIX locale based approach to handling encodings is genuinely broken - it's almost as broken as code pages are on Windows. The fundamental flaw is that locales encourage *bilingual* computing: handling English plus one other language correctly. Given a global internet, bilingual computing *is a fundamentally broken approach*. We need multilingual computing (any human language, all the time), and that means Unicode. 
As some examples of where bilingual computing breaks down:

* My NFS client and server may have different locale settings
* My FTP client and server may have different locale settings
* My SSH client and server may have different locale settings
* I save a file locally and send it to someone with a different locale setting
* I attempt to access a Windows share from a Linux client (or vice-versa)
* I clone my POSIX hosted git or Mercurial repository on a Windows client
* I have to connect my Linux client to a Windows Active Directory domain (or vice-versa)
* I have to interoperate between native code and JVM code

The entire computing industry is currently struggling with this monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale encoding/code pages) -> multilingual (Unicode) transition. It's been going on for decades, and it's still going to be quite some time before we're done. The POSIX world is slowly clawing its way towards a multilingual model that actually works: UTF-8. Windows (including the CLR) and the JVM adopted a different multilingual model, but still one that actually works: UTF-16-LE. POSIX is hampered by legacy ASCII defaults in various subsystems (most notably the default locale) and the assumption that system metadata is "just bytes" (an assumption that breaks down as soon as you have to hand that metadata over to another machine that may have different locale settings). Windows is hampered by the fact they kept the old 8-bit APIs around for backwards compatibility purposes, so applications using those APIs are still only bilingual (at best) rather than multilingual.
JVM and CLR applications will at least handle the Basic Multilingual Plane (UCS-2) correctly, but may not correctly handle code points beyond the 16-bit boundary (this is the "Python narrow builds don't handle Unicode correctly" problem that was resolved for Python 3.3+ by PEP 393) Individual users (including some organisations) may have the luxury of saying "well, all my clients and all my servers are POSIX, so I don't care about interoperability with other platforms". As the providers of a cross-platform runtime environment, we don't have that luxury - we need to figure out how to get *all* the major platforms playing nice with each other, regardless of whether they chose UTF-8 or UTF-16-LE as the basis for their approach towards providing multilingual computing environments. Historically, that question of cross platform interoperability for open source software has been handled in a few different ways: * Don't really interoperate with anybody, reinvent all the wheels (the JVM way) * Emulate POSIX on Windows (the Cygwin/MinGW way) * Let the application developer figure it out (the Python 2 way) The first approach is inordinately expensive - it took the resources of Sun in its heyday to make it possible, and it effectively locks the JVM out of certain kinds of computing (e.g. it's hard to do array oriented programming in JVM languages, because the CPU and GPU vectorisation features aren't readily accessible). The second approach prevents the creation of truly native Windows applications, which makes it uncompelling as a way of attracting Windows users - it sends a clear signal that the project doesn't *really* care about supporting Windows as a platform, but instead only grudgingly accepts that there are Windows users out there that might like to use their software. The third approach is the one we tried for a long time with Python 2, and essentially found to be an "experts only" solution. 
Yes, you can *make* it work, but the runtime isn't set up so it works *by default*. The Unicode changes in Python 3 are a result of the Python core development team saying "it really shouldn't be this hard for application developers to get cross-platform interoperability between correctly configured systems when dealing solely with correctly encoded data and metadata". The idea of Python 3 is that applications should require additional complexity solely to deal with *incorrectly* configured systems and improperly encoded data and metadata (and, ideally, the detection of the need for such handling should be "Python 3 threw an exception" rather than "something further down the line detected corrupted data"). This is software rather than magic, though - these improvements only happen through people actually knuckling down and solving the related problems. When folks complain about Python 3's operating system interface handling causing problems in some situations? They're almost always referring to areas where we're still relying on the locale system on POSIX or the code page system on Windows. Both of those approaches are irredeemably broken - the answer is to stop relying on them, but appropriately updating the affected subsystems generally isn't a trivial task. A lot of the affected code runs before the interpreter is fully initialised, which makes it really hard to test, and a lot of it is incredibly convoluted due to various configuration options and platform specific details, which makes it incredibly hard to modify without breaking anything. One of those areas is the fact that we still use the old 8-bit APIs to interact with the Windows console. 
Those are just as broken in a multilingual world as the other Windows 8-bit APIs, so Drekin came up with a project to expose the Windows console as a UTF-16-LE stream that uses the 16-bit APIs instead: https://pypi.python.org/pypi/win_unicode_console I personally hope we'll be able to get the issues Drekin references there resolved for Python 3.5 - if other folks hope for the same thing, then one of the best ways to help that happen is to try out the win_unicode_console module and provide feedback on what does and doesn't work. Another was getting exceptions attempting to write OS data to sys.stdout when the locale settings had been scrubbed from the environment. For Python 3.5, we better tolerate that situation by setting "errors=surrogateescape" on sys.stdout when the environment claims "ascii" as a suitable encoding for talking to the operating system (this is our way of saying "we don't actually believe you, but also don't have the data we need to overrule you completely"). While I was going to wait for more feedback from Fedora folks before pushing the idea again, this thread also makes me think it would be worth our while to add more tools for dealing with surrogate escapes and latin-1 binary data smuggling just to help make those techniques more discoverable and accessible: http://bugs.python.org/issue18814#msg225791 These various discussions are also giving me plenty of motivation to get back to working on PEP 432 (the rewrite of the interpreter startup sequence) for Python 3.5. A lot of these things are just plain hard to change because of the complexity of the current startup code. Redesigning that to use a cleaner, multiphase startup sequence that gets the core interpreter running *before* configuring the operating system integration should give us several more options when it comes to dealing with some of these challenges. Regards, Nick. 
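The errors="surrogateescape" behaviour Nick describes for sys.stdout can be approximated today by rewrapping the stream's buffer (a sketch only, not the interpreter's actual startup logic):

```python
import io

def rewrap(stream, encoding):
    # Rewrap a text stream's underlying buffer so that OS data smuggled
    # into str via surrogateescape can be written back out instead of
    # raising UnicodeEncodeError.
    return io.TextIOWrapper(stream.buffer, encoding=encoding,
                            errors='surrogateescape', line_buffering=True)

# Demonstration with an in-memory stream standing in for sys.stdout:
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding='ascii')
out = rewrap(fake_stdout, 'ascii')
out.write('a\udcff\n')          # U+DCFF smuggles the byte 0xFF
out.flush()
assert out.buffer.getvalue() == b'a\xff\n'
```

In real code the first argument would be sys.stdout itself; the in-memory stream just keeps the sketch self-contained.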
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From guido at python.org Sun Aug 24 06:17:34 2014 From: guido at python.org (Guido van Rossum) Date: Sat, 23 Aug 2014 21:17:34 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> <20140823151552.GA4264@phdru.name> <20140823183729.GA7819@phdru.name> Message-ID: I declare this thread irreparably broken. Do not make any decisions in this thread. Tell me (in another thread) when it's time to decide and I will. On Sat, Aug 23, 2014 at 8:27 PM, Nick Coghlan wrote: > On 24 August 2014 04:37, Oleg Broytman wrote: > > On Sat, Aug 23, 2014 at 06:40:37PM +0100, Paul Moore < > p.f.moore at gmail.com> wrote: > >> Generally, it seems to be mostly a reaction to the repeated claims > >> that Python, or Windows, or whatever, is "broken". > > > > Ah, if that's the only problem I certainly can live with that. My > > problem is that it *seems* this anti-Unix attitude infiltrates Python > > core development. I very much hope I'm wrong and it really isn't. > > The POSIX locale based approach to handling encodings is genuinely > broken - it's almost as broken as code pages are on Windows. The > fundamental flaw is that locales encourage *bilingual* computing: > handling English plus one other language correctly. Given a global > internet, bilingual computing *is a fundamentally broken approach*. We > need multilingual computing (any human language, all the time), and > that means Unicode. 
> > As some examples of where bilingual computing breaks down: > > * My NFS client and server may have different locale settings > * My FTP client and server may have different locale settings > * My SSH client and server may have different locale settings > * I save a file locally and send it to someone with a different locale > setting > * I attempt to access a Windows share from a Linux client (or vice-versa) > * I clone my POSIX hosted git or Mercurial repository on a Windows client > * I have to connect my Linux client to a Windows Active Directory > domain (or vice-versa) > * I have to interoperate between native code and JVM code > > The entire computing industry is currently struggling with this > monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale > encoding/code pages) -> multilingual (Unicode) transition. It's been > going on for decades, and it's still going to be quite some time > before we're done. > > The POSIX world is slowly clawing its way towards a multilingual model > that actually works: UTF-8 > Windows (including the CLR) and the JVM adopted a different > multilingual model, but still one that actually works: UTF-16-LE > > POSIX is hampered by legacy ASCII defaults in various subsystems (most > notably the default locale) and the assumption that system metadata is > "just bytes" (an assumption that breaks down as soon as you have to > hand that metadata over to another machine that may have different > locale settings) > Windows is hampered by the fact they kept the old 8-bit APIs around > for backwards compatibility purposes, so applications using those APIs > are still only bilingual (at best) rather than multilingual. 
> JVM and CLR applications will at least handle the Basic Multilingual > Plane (UCS-2) correctly, but may not correctly handle code points > beyond the 16-bit boundary (this is the "Python narrow builds don't > handle Unicode correctly" problem that was resolved for Python 3.3+ by > PEP 393) > > Individual users (including some organisations) may have the luxury of > saying "well, all my clients and all my servers are POSIX, so I don't > care about interoperability with other platforms". As the providers of > a cross-platform runtime environment, we don't have that luxury - we > need to figure out how to get *all* the major platforms playing nice > with each other, regardless of whether they chose UTF-8 or UTF-16-LE > as the basis for their approach towards providing multilingual > computing environments. > > Historically, that question of cross platform interoperability for > open source software has been handled in a few different ways: > > * Don't really interoperate with anybody, reinvent all the wheels (the JVM > way) > * Emulate POSIX on Windows (the Cygwin/MinGW way) > * Let the application developer figure it out (the Python 2 way) > > The first approach is inordinately expensive - it took the resources > of Sun in its heyday to make it possible, and it effectively locks the > JVM out of certain kinds of computing (e.g. it's hard to do array > oriented programming in JVM languages, because the CPU and GPU > vectorisation features aren't readily accessible). > > The second approach prevents the creation of truly native Windows > applications, which makes it uncompelling as a way of attracting > Windows users - it sends a clear signal that the project doesn't > *really* care about supporting Windows as a platform, but instead only > grudgingly accepts that there are Windows users out there that might > like to use their software. > > The third approach is the one we tried for a long time with Python 2, > and essentially found to be an "experts only" solution. 
Yes, you can > *make* it work, but the runtime isn't set up so it works *by default*. > > The Unicode changes in Python 3 are a result of the Python core > development team saying "it really shouldn't be this hard for > application developers to get cross-platform interoperability between > correctly configured systems when dealing solely with correctly > encoded data and metadata". The idea of Python 3 is that applications > should require additional complexity solely to deal with *incorrectly* > configured systems and improperly encoded data and metadata (and, > ideally, the detection of the need for such handling should be "Python > 3 threw an exception" rather than "something further down the line > detected corrupted data"). > > This is software rather than magic, though - these improvements only > happen through people actually knuckling down and solving the related > problems. When folks complain about Python 3's operating system > interface handling causing problems in some situations? They're almost > always referring to areas where we're still relying on the locale > system on POSIX or the code page system on Windows. Both of those > approaches are irredeemably broken - the answer is to stop relying on > them, but appropriately updating the affected subsystems generally > isn't a trivial task. A lot of the affected code runs before the > interpreter is fully initialised, which makes it really hard to test, > and a lot of it is incredibly convoluted due to various configuration > options and platform specific details, which makes it incredibly hard > to modify without breaking anything. > > One of those areas is the fact that we still use the old 8-bit APIs to > interact with the Windows console. 
Those are just as broken in a > multilingual world as the other Windows 8-bit APIs, so Drekin came up > with a project to expose the Windows console as a UTF-16-LE stream > that uses the 16-bit APIs instead: > https://pypi.python.org/pypi/win_unicode_console > > I personally hope we'll be able to get the issues Drekin references > there resolved for Python 3.5 - if other folks hope for the same > thing, then one of the best ways to help that happen is to try out the > win_unicode_console module and provide feedback on what does and > doesn't work. > > Another was getting exceptions attempting to write OS data to > sys.stdout when the locale settings had been scrubbed from the > environment. For Python 3.5, we better tolerate that situation by > setting "errors=surrogateescape" on sys.stdout when the environment > claims "ascii" as a suitable encoding for talking to the operating > system (this is our way of saying "we don't actually believe you, but > also don't have the data we need to overrule you completely"). > > While I was going to wait for more feedback from Fedora folks before > pushing the idea again, this thread also makes me think it would be > worth our while to add more tools for dealing with surrogate escapes > and latin-1 binary data smuggling just to help make those techniques > more discoverable and accessible: > http://bugs.python.org/issue18814#msg225791 > > These various discussions are also giving me plenty of motivation to > get back to working on PEP 432 (the rewrite of the interpreter startup > sequence) for Python 3.5. A lot of these things are just plain hard to > change because of the complexity of the current startup code. > Redesigning that to use a cleaner, multiphase startup sequence that > gets the core interpreter running *before* configuring the operating > system integration should give us several more options when it comes > to dealing with some of these challenges. > > Regards, > Nick. 
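The surrogateescape handler mentioned above is a standard Python 3 error handler: each undecodable byte 0xXX is smuggled into the string as the lone surrogate U+DCXX, and encoding with the same handler recovers the original bytes exactly. A minimal sketch:

```python
# 'café.txt' encoded as latin-1; the 0xE9 byte is not valid UTF-8.
raw = b'caf\xe9.txt'

# surrogateescape smuggles the undecodable byte as U+DCE9 instead of failing.
name = raw.decode('utf-8', errors='surrogateescape')
assert name == 'caf\udce9.txt'

# Encoding with the same handler restores the original bytes exactly.
assert name.encode('utf-8', errors='surrogateescape') == raw

# A strict encode refuses the lone surrogate, so mishandled data tends to
# surface as an exception rather than as silently corrupted output.
try:
    name.encode('utf-8')
except UnicodeEncodeError:
    pass
```

This is the same round-trip that os.fsdecode() and os.fsencode() rely on for POSIX file names.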
> > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > https://mail.python.org/mailman/options/python-dev/guido%40python.org > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Sun Aug 24 06:44:36 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 24 Aug 2014 14:44:36 +1000 Subject: [Python-Dev] Bytes path related questions for Guido Message-ID: At Guido's request, splitting out two specific questions from Serhiy's thread where I believe we could do with an explicit "yes or no" from him. 1. Should we accept patches adding support for the direct use of bytes paths in lower level filesystem manipulation APIs? (i.e. everything that isn't pathlib) This was Serhiy's original question (due to some open issues [1,2]). I think the answer is yes, as we already do in some cases, and the "pathlib doesn't support binary paths" design decision is a high level platform independent API vs low level potentially platform dependent API one rather than being about disallowing the use of bytes paths in general. [1] http://bugs.python.org/issue19997 [2] http://bugs.python.org/issue20797 2. Should we add some additional helpers to the string module for dealing with surrogate escaped bytes and other techniques for smuggling arbitrary binary data as text? 
My proposal [3] is to add: * string.escaped_surrogates (constant with the 128 escaped code points) * string.clean(s): replaces surrogates with '\ufffd' or another specified code point * string.redecode(s, encoding): encodes a string back to bytes and then decodes it again using the specified encoding (the old encoding defaults to 'latin-1' to match the assumptions in WSGI) "s != string.clean(s)" would then serve as a check for "does this string contain any surrogate escaped bytes?" [3] http://bugs.python.org/issue18814#msg225791 Regards, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ncoghlan at gmail.com Sun Aug 24 15:04:31 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 24 Aug 2014 23:04:31 +1000 Subject: [Python-Dev] Bytes path related questions for Guido In-Reply-To: References: Message-ID: On 24 August 2014 14:44, Nick Coghlan wrote: > 2. Should we add some additional helpers to the string module for > dealing with surrogate escaped bytes and other techniques for > smuggling arbitrary binary data as text? > > My proposal [3] is to add: > > * string.escaped_surrogates (constant with the 128 escaped code points) > * string.clean(s): replaces surrogates with '\ufffd' or another > specified code point > * string.redecode(s, encoding): encodes a string back to bytes and > then decodes it again using the specified encoding (the old encoding > defaults to 'latin-1' to match the assumptions in WSGI) Serhiy & Ezio convinced me to scale this one back to a proposal for "codecs.clean_surrogate_escapes(s)", which replaces surrogates that may be produced by surrogateescape (that's what string.clean() above was supposed to be, but my description was not correct, and the name was too vague for that error to be obvious to the reader) "s != codecs.clean_surrogate_escapes(s)" would then become the check for "does this string contain any surrogate escaped bytes?" Regards, Nick. 
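For illustration, a sketch of what the proposed helpers could look like — these names and signatures come from the proposal under discussion in this thread and are not an existing stdlib API:

```python
# The 128 code points the surrogateescape handler can produce (U+DC80..U+DCFF).
ESCAPED_SURROGATES = ''.join(map(chr, range(0xDC80, 0xDD00)))

def clean_surrogate_escapes(s, replacement='\ufffd'):
    """Replace surrogates produced by surrogateescape with `replacement`."""
    return ''.join(replacement if '\udc80' <= c <= '\udcff' else c
                   for c in s)

def redecode(s, encoding, old_encoding='latin-1'):
    """Encode a string back to bytes, then decode with the real encoding."""
    return s.encode(old_encoding).decode(encoding)

# The proposed check for "does this string contain surrogate escaped bytes?"
smuggled = b'abc\xff'.decode('ascii', 'surrogateescape')   # 'abc\udcff'
assert smuggled != clean_surrogate_escapes(smuggled)
assert clean_surrogate_escapes(smuggled) == 'abc\ufffd'

# WSGI-style smuggling: UTF-8 bytes decoded as latin-1, then recovered.
assert redecode('caf\xc3\xa9', 'utf-8') == 'caf\xe9'   # 'café'
```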
--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia

From antoine at python.org Sun Aug 24 16:23:52 2014
From: antoine at python.org (Antoine Pitrou)
Date: Sun, 24 Aug 2014 10:23:52 -0400
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To:
References:
Message-ID:

Le 24/08/2014 09:04, Nick Coghlan a écrit :
> On 24 August 2014 14:44, Nick Coghlan wrote:
>> 2. Should we add some additional helpers to the string module for
>> dealing with surrogate escaped bytes and other techniques for
>> smuggling arbitrary binary data as text?
>>
>> My proposal [3] is to add:
>>
>> * string.escaped_surrogates (constant with the 128 escaped code points)
>> * string.clean(s): replaces surrogates with '\ufffd' or another
>> specified code point
>> * string.redecode(s, encoding): encodes a string back to bytes and
>> then decodes it again using the specified encoding (the old encoding
>> defaults to 'latin-1' to match the assumptions in WSGI)
>
>
> Serhiy & Ezio convinced me to scale this one back to a proposal for
> "codecs.clean_surrogate_escapes(s)", which replaces surrogates that
> may be produced by surrogateescape (that's what string.clean() above
> was supposed to be, but my description was not correct, and the name
> was too vague for that error to be obvious to the reader)

"clean" conveys the wrong meaning. It should use a scary word such as
"trap". "Cleaning" surrogates is unlikely to be the right procedure when
dealing with surrogates produced by undecodable byte sequences.

Regards

Antoine.
From ncoghlan at gmail.com Sun Aug 24 17:26:43 2014
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Mon, 25 Aug 2014 01:26:43 +1000
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To:
References:
Message-ID:

On 25 August 2014 00:23, Antoine Pitrou wrote:
> Le 24/08/2014 09:04, Nick Coghlan a écrit :
>> Serhiy & Ezio convinced me to scale this one back to a proposal for
>> "codecs.clean_surrogate_escapes(s)", which replaces surrogates that
>> may be produced by surrogateescape (that's what string.clean() above
>> was supposed to be, but my description was not correct, and the name
>> was too vague for that error to be obvious to the reader)
>
>
> "clean" conveys the wrong meaning. It should use a scary word such as
> "trap". "Cleaning" surrogates is unlikely to be the right procedure when
> dealing with surrogates produced by undecodable byte sequences.

"purge_surrogate_escapes" was the other term that occurred to me.

Either way, my use case is to filter them out when I *don't* want to
pass them along to other software, but would prefer the Unicode
replacement character to the ASCII question mark created by using the
"replace" filter when encoding.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia

From guido at python.org Sun Aug 24 19:55:25 2014
From: guido at python.org (Guido van Rossum)
Date: Sun, 24 Aug 2014 10:55:25 -0700
Subject: [Python-Dev] Bytes path related questions for Guido
In-Reply-To:
References:
Message-ID:

Yes on #1 -- making the low-level functions more usable for edge cases
by supporting bytes seems fine (as long as the support for strings,
where it exists, is not compromised). The status of pathlib is a little
unclear to me -- is there a plan to eventually support bytes or not?

For #2 I think you should probably just work with the others you have
mentioned.
On Sat, Aug 23, 2014 at 9:44 PM, Nick Coghlan wrote: > At Guido's request, splitting out two specific questions from Serhiy's > thread where I believe we could do with an explicit "yes or no" from > him. > > 1. Should we accept patches adding support for the direct use of bytes > paths in lower level filesystem manipulation APIs? (i.e. everything > that isn't pathlib) > > This was Serhiy's original question (due to some open issues [1,2]). I > think the answer is yes, as we already do in some cases, and the > "pathlib doesn't support binary paths" design decision is a high level > platform independent API vs low level potentially platform dependent > API one rather than being about disallowing the use of bytes paths in > general. > > [1] http://bugs.python.org/issue19997 > [2] http://bugs.python.org/issue20797 > > 2. Should we add some additional helpers to the string module for > dealing with surrogate escaped bytes and other techniques for > smuggling arbitrary binary data as text? > > My proposal [3] is to add: > > * string.escaped_surrogates (constant with the 128 escaped code points) > * string.clean(s): replaces surrogates with '\ufffd' or another > specified code point > * string.redecode(s, encoding): encodes a string back to bytes and > then decodes it again using the specified encoding (the old encoding > defaults to 'latin-1' to match the assumptions in WSGI) > > "s != string.clean(s)" would then serve as a check for "does this > string contain any surrogate escaped bytes?" > > [3] http://bugs.python.org/issue18814#msg225791 > > Regards, > Nick. 
> > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > https://mail.python.org/mailman/options/python-dev/guido%40python.org > -- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Mon Aug 25 01:19:19 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 25 Aug 2014 09:19:19 +1000 Subject: [Python-Dev] Bytes path related questions for Guido In-Reply-To: References: Message-ID: On 25 Aug 2014 03:55, "Guido van Rossum" wrote: > > Yes on #1 -- making the low-level functions more usable for edge cases by supporting bytes seems fine (as long as the support for strings, where it exists, is not compromised). Thanks! > The status of pathlib is a little unclear to me -- is there a plan to eventually support bytes or not? It's text only and Antoine plans to keep it that - the concatenation operations, etc, are really only safe if you decode first. > > For #2 I think you should probably just work with the others you have mentioned. Yes, that sounds like a good idea. There's been some good progress on the issue tracker, so I think we can thrash out some workable (and comprehensible!) utilities that will be useful in their own right while also serving as aids to understanding for the underlying mechanisms. Cheers, Nick. > > > On Sat, Aug 23, 2014 at 9:44 PM, Nick Coghlan wrote: >> >> At Guido's request, splitting out two specific questions from Serhiy's >> thread where I believe we could do with an explicit "yes or no" from >> him. >> >> 1. Should we accept patches adding support for the direct use of bytes >> paths in lower level filesystem manipulation APIs? (i.e. everything >> that isn't pathlib) >> >> This was Serhiy's original question (due to some open issues [1,2]). 
I >> think the answer is yes, as we already do in some cases, and the >> "pathlib doesn't support binary paths" design decision is a high level >> platform independent API vs low level potentially platform dependent >> API one rather than being about disallowing the use of bytes paths in >> general. >> >> [1] http://bugs.python.org/issue19997 >> [2] http://bugs.python.org/issue20797 >> >> 2. Should we add some additional helpers to the string module for >> dealing with surrogate escaped bytes and other techniques for >> smuggling arbitrary binary data as text? >> >> My proposal [3] is to add: >> >> * string.escaped_surrogates (constant with the 128 escaped code points) >> * string.clean(s): replaces surrogates with '\ufffd' or another >> specified code point >> * string.redecode(s, encoding): encodes a string back to bytes and >> then decodes it again using the specified encoding (the old encoding >> defaults to 'latin-1' to match the assumptions in WSGI) >> >> "s != string.clean(s)" would then serve as a check for "does this >> string contain any surrogate escaped bytes?" >> >> [3] http://bugs.python.org/issue18814#msg225791 >> >> Regards, >> Nick. >> >> -- >> Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia >> _______________________________________________ >> Python-Dev mailing list >> Python-Dev at python.org >> https://mail.python.org/mailman/listinfo/python-dev >> Unsubscribe: https://mail.python.org/mailman/options/python-dev/guido%40python.org > > > > > -- > --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From phd at phdru.name Mon Aug 25 12:15:31 2014 From: phd at phdru.name (Oleg Broytman) Date: Mon, 25 Aug 2014 12:15:31 +0200 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> <20140823151552.GA4264@phdru.name> <20140823183729.GA7819@phdru.name> Message-ID: <20140825101531.GA4482@phdru.name> Hi! Thank you very much, Nick, for long and detailed explanation! On Sun, Aug 24, 2014 at 01:27:55PM +1000, Nick Coghlan wrote: > On 24 August 2014 04:37, Oleg Broytman wrote: > > On Sat, Aug 23, 2014 at 06:40:37PM +0100, Paul Moore wrote: > >> Generally, it seems to be mostly a reaction to the repeated claims > >> that Python, or Windows, or whatever, is "broken". > > > > Ah, if that's the only problem I certainly can live with that. My > > problem is that it *seems* this anti-Unix attitude infiltrates Python > > core development. I very much hope I'm wrong and it really isn't. > > The POSIX locale based approach to handling encodings is genuinely > broken - it's almost as broken as code pages are on Windows. The > fundamental flaw is that locales encourage *bilingual* computing: > handling English plus one other language correctly. Given a global > internet, bilingual computing *is a fundamentally broken approach*. We > need multilingual computing (any human language, all the time), and > that means Unicode. 
> > As some examples of where bilingual computing breaks down: > > * My NFS client and server may have different locale settings > * My FTP client and server may have different locale settings > * My SSH client and server may have different locale settings > * I save a file locally and send it to someone with a different locale setting > * I attempt to access a Windows share from a Linux client (or vice-versa) > * I clone my POSIX hosted git or Mercurial repository on a Windows client > * I have to connect my Linux client to a Windows Active Directory > domain (or vice-versa) > * I have to interoperate between native code and JVM code > > The entire computing industry is currently struggling with this > monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale > encoding/code pages) -> multilingual (Unicode) transition. It's been > going on for decades, and it's still going to be quite some time > before we're done. > > The POSIX world is slowly clawing its way towards a multilingual model > that actually works: UTF-8 > Windows (including the CLR) and the JVM adopted a different > multilingual model, but still one that actually works: UTF-16-LE > > POSIX is hampered by legacy ASCII defaults in various subsystems (most > notably the default locale) and the assumption that system metadata is > "just bytes" (an assumption that breaks down as soon as you have to > hand that metadata over to another machine that may have different > locale settings) > Windows is hampered by the fact they kept the old 8-bit APIs around > for backwards compatibility purposes, so applications using those APIs > are still only bilingual (at best) rather than multilingual. 
> JVM and CLR applications will at least handle the Basic Multilingual > Plane (UCS-2) correctly, but may not correctly handle code points > beyond the 16-bit boundary (this is the "Python narrow builds don't > handle Unicode correctly" problem that was resolved for Python 3.3+ by > PEP 393) > > Individual users (including some organisations) may have the luxury of > saying "well, all my clients and all my servers are POSIX, so I don't > care about interoperability with other platforms". As the providers of > a cross-platform runtime environment, we don't have that luxury - we > need to figure out how to get *all* the major platforms playing nice > with each other, regardless of whether they chose UTF-8 or UTF-16-LE > as the basis for their approach towards providing multilingual > computing environments. > > Historically, that question of cross platform interoperability for > open source software has been handled in a few different ways: > > * Don't really interoperate with anybody, reinvent all the wheels (the JVM way) > * Emulate POSIX on Windows (the Cygwin/MinGW way) > * Let the application developer figure it out (the Python 2 way) > > The first approach is inordinately expensive - it took the resources > of Sun in its heyday to make it possible, and it effectively locks the > JVM out of certain kinds of computing (e.g. it's hard to do array > oriented programming in JVM languages, because the CPU and GPU > vectorisation features aren't readily accessible). > > The second approach prevents the creation of truly native Windows > applications, which makes it uncompelling as a way of attracting > Windows users - it sends a clear signal that the project doesn't > *really* care about supporting Windows as a platform, but instead only > grudgingly accepts that there are Windows users out there that might > like to use their software. > > The third approach is the one we tried for a long time with Python 2, > and essentially found to be an "experts only" solution. 
Yes, you can > *make* it work, but the runtime isn't set up so it works *by default*. > > The Unicode changes in Python 3 are a result of the Python core > development team saying "it really shouldn't be this hard for > application developers to get cross-platform interoperability between > correctly configured systems when dealing solely with correctly > encoded data and metadata". The idea of Python 3 is that applications > should require additional complexity solely to deal with *incorrectly* > configured systems and improperly encoded data and metadata (and, > ideally, the detection of the need for such handling should be "Python > 3 threw an exception" rather than "something further down the line > detected corrupted data"). > > This is software rather than magic, though - these improvements only > happen through people actually knuckling down and solving the related > problems. When folks complain about Python 3's operating system > interface handling causing problems in some situations? They're almost > always referring to areas where we're still relying on the locale > system on POSIX or the code page system on Windows. Both of those > approaches are irredeemably broken - the answer is to stop relying on > them, but appropriately updating the affected subsystems generally > isn't a trivial task. A lot of the affected code runs before the > interpreter is fully initialised, which makes it really hard to test, > and a lot of it is incredibly convoluted due to various configuration > options and platform specific details, which makes it incredibly hard > to modify without breaking anything. > > One of those areas is the fact that we still use the old 8-bit APIs to > interact with the Windows console. 
Those are just as broken in a > multilingual world as the other Windows 8-bit APIs, so Drekin came up > with a project to expose the Windows console as a UTF-16-LE stream > that uses the 16-bit APIs instead: > https://pypi.python.org/pypi/win_unicode_console > > I personally hope we'll be able to get the issues Drekin references > there resolved for Python 3.5 - if other folks hope for the same > thing, then one of the best ways to help that happen is to try out the > win_unicode_console module and provide feedback on what does and > doesn't work. > > Another was getting exceptions attempting to write OS data to > sys.stdout when the locale settings had been scrubbed from the > environment. For Python 3.5, we better tolerate that situation by > setting "errors=surrogateescape" on sys.stdout when the environment > claims "ascii" as a suitable encoding for talking to the operating > system (this is our way of saying "we don't actually believe you, but > also don't have the data we need to overrule you completely"). > > While I was going to wait for more feedback from Fedora folks before > pushing the idea again, this thread also makes me think it would be > worth our while to add more tools for dealing with surrogate escapes > and latin-1 binary data smuggling just to help make those techniques > more discoverable and accessible: > http://bugs.python.org/issue18814#msg225791 > > These various discussions are also giving me plenty of motivation to > get back to working on PEP 432 (the rewrite of the interpreter startup > sequence) for Python 3.5. A lot of these things are just plain hard to > change because of the complexity of the current startup code. > Redesigning that to use a cleaner, multiphase startup sequence that > gets the core interpreter running *before* configuring the operating > system integration should give us several more options when it comes > to dealing with some of these challenges. > > Regards, > Nick. 
> > -- > Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/phd%40phdru.name Oleg. -- Oleg Broytman http://phdru.name/ phd at phdru.name Programmers don't die, they just GOSUB without RETURN. From rdmurray at bitdance.com Mon Aug 25 16:32:22 2014 From: rdmurray at bitdance.com (R. David Murray) Date: Mon, 25 Aug 2014 10:32:22 -0400 Subject: [Python-Dev] Bytes path support In-Reply-To: <87tx536u9p.fsf@elektro.pacujo.net> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140823110828.GY25957@ando> <20140823134122.85F17250E6A@webabinitio.net> <87tx536u9p.fsf@elektro.pacujo.net> Message-ID: <20140825143222.C36BE250E23@webabinitio.net> On Sat, 23 Aug 2014 19:33:06 +0300, Marko Rauhamaa wrote: > "R. David Murray" : > > > The same problem existed in python2 if your goal was to produce a stream > > with a consistent encoding, but now python3 treats that as an error. > > I have a different interpretation of the situation: as a rule, use byte > strings in Python3. Text strings are a special corner case for > applications that have to deal with human languages. Clearly, then, you are writing unix (or perhaps posix)-only programs. Also, as has been discussed in this thread previously, any program that deals with filenames is dealing with human readable languages, even if posix itself treats the filenames as bytes. 
--David

From ijmorlan at uwaterloo.ca Mon Aug 25 18:46:46 2014
From: ijmorlan at uwaterloo.ca (Isaac Morland)
Date: Mon, 25 Aug 2014 12:46:46 -0400 (EDT)
Subject: [Python-Dev] Bytes path support
In-Reply-To: <87a96v8rnp.fsf@elektro.pacujo.net>
References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> <53F77941.3040700@g.nevcal.com> <87ioljd46m.fsf@uwakimon.sk.tsukuba.ac.jp> <87vbpj8vkq.fsf@elektro.pacujo.net> <87a96v8rnp.fsf@elektro.pacujo.net>
Message-ID:

On Sat, 23 Aug 2014, Marko Rauhamaa wrote:

> Isaac Morland :
>
>>> HTTP/1.1 200 OK
>>> Content-Type: text/html; charset=ISO-8859-1
>>>
>>> [HTML example stripped by the list archiver]
>>
>> For HTML it's not quite so bad. According to the HTML 4 standard:
>> [...]
>>
>> The Content-Type header takes precedence over a META element. I
>> thought I read once that the reason was to allow proxy servers to
>> transcode documents but I don't have a cite for that. Also, the META
>> element "must only be used when the character encoding is organized
>> such that ASCII-valued bytes stand for ASCII characters" so the
>> initial UTF-16 example wouldn't be conformant in HTML.
>
> That's not how I read it:
>
> The META declaration must only be used when the character encoding is
> organized such that ASCII characters stand for themselves (at least
> until the META element is parsed). META declarations should appear as
> early as possible in the HEAD element.
>
> ml#doc-char-set>
>
> IOW, you must obey the HTTP character encoding until you have parsed a
> conflicting META content-type declaration.
From the same document:

--------------------------------------------------------------------------
To sum up, conforming user agents must observe the following priorities
when determining a document's character encoding (from highest priority
to lowest):

An HTTP "charset" parameter in a "Content-Type" field.

A META declaration with "http-equiv" set to "Content-Type" and a value
set for "charset".

The charset attribute set on an element that designates an external
resource.
--------------------------------------------------------------------------

(In the original they are numbered)

This is a priority list - if the Content-Type header gives a charset, it
takes precedence, and all other sources for the encoding are ignored.
The "charset=" on an <a> element or similar is only used if it is the
only source for the encoding.

The "at least until the META element is parsed" bit allows for the use
of encodings which make use of shifting. So maybe they start out
ASCII-compatible, but after a particular shift byte is seen those bytes
now stand for Japanese Kanji characters until another shift byte is
seen. This is allowed by the specification, as long as none of the
non-ASCII-compatible stuff is seen before the META element.

> The author of the standard keeps a straight face and continues:

I like your way of putting this - "straight face" indeed. The third
option really is a hack to allow working around nonsensical situations
(and even the META tag is pretty questionable). All this complexity
because people can't be bothered to do things properly.

> For cases where neither the HTTP protocol nor the META element
> provides information about the character encoding of a document, HTML
> also provides the charset attribute on several elements. By combining
> these mechanisms, an author can greatly improve the chances that,
> when the user retrieves a resource, the user agent will recognize the
> character encoding.
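The quoted priority rules amount to a first-match lookup over the three possible sources; a sketch (the function name and signature are illustrative, not from any library):

```python
def document_encoding(http_charset=None, meta_charset=None,
                      element_charset=None):
    """Pick a charset per the HTML 4 priority list quoted above."""
    # 1. HTTP Content-Type "charset" parameter
    # 2. META http-equiv Content-Type declaration
    # 3. charset attribute on the element that referenced the resource
    for candidate in (http_charset, meta_charset, element_charset):
        if candidate:
            return candidate
    return None  # the spec leaves any further guessing to the user agent

# The Content-Type header wins even when a META declaration disagrees:
assert document_encoding('ISO-8859-1', 'UTF-16') == 'ISO-8859-1'
# The charset attribute is used only when it is the sole source:
assert document_encoding(element_charset='utf-8') == 'utf-8'
```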
Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist From stephen at xemacs.org Tue Aug 26 04:11:31 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Tue, 26 Aug 2014 11:11:31 +0900 Subject: [Python-Dev] Bytes path related questions for Guido In-Reply-To: References: Message-ID: <878umcc84s.fsf@uwakimon.sk.tsukuba.ac.jp> Nick Coghlan writes: > "purge_surrogate_escapes" was the other term that occurred to me. "purge" suggests removal, not replacement. That may be useful too. neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD') maybe? (Of course the remove argument is feature creep, so I'm only about +0.5 myself. And the name is long, but I can't think of any better synonyms for "make safe" in English right now). > Either way, my use case is to filter them out when I *don't* want to > pass them along to other software, but would prefer the Unicode > replacement character to the ASCII question mark created by using the > "replace" filter when encoding. I think it would be preferable to be unicodely correct here by default, since this is a str -> str function. From stephen at xemacs.org Tue Aug 26 04:25:19 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Tue, 26 Aug 2014 11:25:19 +0900 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140825143222.C36BE250E23@webabinitio.net> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140823110828.GY25957@ando> <20140823134122.85F17250E6A@webabinitio.net> <87tx536u9p.fsf@elektro.pacujo.net> <20140825143222.C36BE250E23@webabinitio.net> Message-ID: <877g1wc7hs.fsf@uwakimon.sk.tsukuba.ac.jp> R. David Murray writes: > Also, as has been discussed in this thread previously, any program that > deals with filenames is dealing with human readable languages, even > if posix itself treats the filenames as bytes. That's a bit extreme. 
I can name two interesting applications offhand: git's object database and the Coda filesystem's containers. It's true that for debugging purposes bytestrings representing largish numbers are readably encoded (in hexadecimal and decimal, respectively), but they're clearly not "human readable" in the sense you mean. Nevertheless, these are the applications that prove your rule. You don't need the power of pathlib to conveniently (for the programmer) and efficiently handle the file structures these programs use. os.path is plenty. From rdmurray at bitdance.com Tue Aug 26 04:41:31 2014 From: rdmurray at bitdance.com (R. David Murray) Date: Mon, 25 Aug 2014 22:41:31 -0400 Subject: [Python-Dev] Bytes path support In-Reply-To: <877g1wc7hs.fsf@uwakimon.sk.tsukuba.ac.jp> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140823110828.GY25957@ando> <20140823134122.85F17250E6A@webabinitio.net> <87tx536u9p.fsf@elektro.pacujo.net> <20140825143222.C36BE250E23@webabinitio.net> <877g1wc7hs.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <20140826024132.1CC56250E11@webabinitio.net> On Tue, 26 Aug 2014 11:25:19 +0900, "Stephen J. Turnbull" wrote: > R. David Murray writes: > > > Also, as has been discussed in this thread previously, any program that > > deals with filenames is dealing with human readable languages, even > > if posix itself treats the filenames as bytes. > > That's a bit extreme. I can name two interesting applications > offhand: git's object database and the Coda filesystem's containers. As soon as I hit send I realized there were a few counter examples :) So, replace "any" with "most". --David From stephen at xemacs.org Tue Aug 26 04:47:24 2014 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Tue, 26 Aug 2014 11:47:24 +0900 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> <53F77941.3040700@g.nevcal.com> <87ioljd46m.fsf@uwakimon.sk.tsukuba.ac.jp> <87vbpj8vkq.fsf@elektro.pacujo.net> <87a96v8rnp.fsf@elektro.pacujo.net> Message-ID: <8761hgc6gz.fsf@uwakimon.sk.tsukuba.ac.jp> Isaac Morland writes: > I like your way of putting this - "straight face" indeed. The third > option really is a hack to allow working around nonsensical situations > (and even the META tag is pretty questionable). All this complexity > because people can't be bothered to do things properly. At least in Japan and Russia, doing things "properly" in your sense in heterogeneous distributed systems is really hard, requiring use of rather fragile encoding detection heuristics that break at the slightest whiff of encodings that are unusual in the particular locale, and in Japan requiring equally fragile transcoding programs that break on vendor charset variations. The META "charset" attribute is useful in those contexts, and the "charset" attribute for external elements may have been useful in the past as well, although I've never needed it. I agree that an environment where "charset" attributes on META and other elements are needed kinda sucks, but the prerequisite for "doing things properly" is basically Unicode[1], and that just wasn't going to happen until at least the 1990s. To make the transition in less than several decades would have required a degree of monopoly in software production that I shudder to contemplate. Even today there are programmers around the world grumbling about having to deal with the Unicode coded character set.
Footnotes: [1] More precisely, a universal coded character set. TRON code or MULE code would have done (but yuck!) ISO 2022 won't do! From ncoghlan at gmail.com Tue Aug 26 09:32:51 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Tue, 26 Aug 2014 17:32:51 +1000 Subject: [Python-Dev] Fwd: Accepting PEP 440: Version Identification and Dependency Specification In-Reply-To: References: Message-ID: Antoine pointed out that it would still be a good idea to forward packaging PEP acceptance announcements to python-dev, even when the actual acceptance happens on distutils-sig. That makes sense to me, so here's last week's notice of the acceptance of PEP 440, the implementation independent versioning standard derived from pkg_resources, PEP 386, and ideas from both Linux distributions and other open source language communities. Regards, Nick. ---------- Forwarded message ---------- From: Nick Coghlan Date: 22 August 2014 22:34 Subject: Accepting PEP 440: Version Identification and Dependency Specification To: DistUtils mailing list I just pushed Donald's final round of edits in response to the feedback on the last PEP 440 thread, and as such I'm happy to announce that I am accepting PEP 440 as the recommended approach to identifying versions and specifying dependencies when distributing Python software. The PEP is available in the usual place at http://www.python.org/dev/peps/pep-0440/ It's been a long road to get to an implementation independent versioning standard that has a feasible migration path from the current pkg_resources defined de facto standard, and I'd like to thank a few folks: * Donald Stufft for his extensive work on PEP 440 itself, especially the proof of concept integration into pip * Vinay Sajip for his efforts in validating earlier versions of the PEP * Tarek Ziadé for starting us down the road to an implementation independent versioning standard with the initial creation of PEP 386 back in June 2009, more than five years ago! Regards, Nick.
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From martin at v.loewis.de Tue Aug 26 13:14:23 2014 From: martin at v.loewis.de (Martin v. Löwis) Date: Tue, 26 Aug 2014 13:14:23 +0200 Subject: [Python-Dev] Bytes path support In-Reply-To: <53F93BAE.5050803@canterbury.ac.nz> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <20140822151911.GS25957@ando> <20140822155104.GA28425@phdru.name> <53F771B9.1060203@g.nevcal.com> <20140822165222.GA2290@phdru.name> <53F77941.3040700@g.nevcal.com> <87ioljd46m.fsf@uwakimon.sk.tsukuba.ac.jp> <87vbpj8vkq.fsf@elektro.pacujo.net> <53F93BAE.5050803@canterbury.ac.nz> Message-ID: <53FC6C0F.7050906@v.loewis.de> On 24.08.14 03:11, Greg Ewing wrote: > Isaac Morland wrote: >> In HTML 5 it allows non-ASCII-compatible encodings as long as U+FEFF >> (byte order mark) is used: >> >> http://www.w3.org/TR/html-markup/syntax.html#encoding-declaration >> >> Not sure about XML. > > According to Appendix F here: > > http://www.w3.org/TR/xml/#sec-guessing > > an XML parser needs to be prepared to try all the encodings it > supports until it finds one that works well enough to decode > the XML declaration, then it can find out the exact encoding > used. That's not what this section says. Instead, it says that you need to auto-detect UCS-4, UTF-16, UTF-8 from the BOM, or guess them or EBCDIC from the encoding of '<?xml'. From python at mrabarnett.plus.com (MRAB) Date: Tue, 26 Aug 2014 Subject: [Python-Dev] Bytes path related questions for Guido In-Reply-To: References: <878umcc84s.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <53FC7007.2060502@mrabarnett.plus.com> On 2014-08-26 03:11, Stephen J. Turnbull wrote: > Nick Coghlan writes: > > > "purge_surrogate_escapes" was the other term that occurred to me. > > "purge" suggests removal, not replacement. That may be useful too.
> > neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD') > How about: replace_surrogate_escapes(s, replacement='\uFFFD') If you want them removed, just pass an empty string as the replacement. > maybe? (Of course the remove argument is feature creep, so I'm only > about +0.5 myself. And the name is long, but I can't think of any > better synonyms for "make safe" in English right now). > > > Either way, my use case is to filter them out when I *don't* want to > > pass them along to other software, but would prefer the Unicode > > replacement character to the ASCII question mark created by using the > > "replace" filter when encoding. > > I think it would be preferable to be unicodely correct here by > default, since this is a str -> str function. > From rdmurray at bitdance.com Tue Aug 26 15:11:31 2014 From: rdmurray at bitdance.com (R. David Murray) Date: Tue, 26 Aug 2014 09:11:31 -0400 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> <20140823151552.GA4264@phdru.name> <20140823183729.GA7819@phdru.name> Message-ID: <20140826131132.6FB45250E3E@webabinitio.net> On Sun, 24 Aug 2014 13:27:55 +1000, Nick Coghlan wrote: > As some examples of where bilingual computing breaks down: > > * My NFS client and server may have different locale settings > * My FTP client and server may have different locale settings > * My SSH client and server may have different locale settings > * I save a file locally and send it to someone with a different locale setting > * I attempt to access a Windows share from a Linux client (or vice-versa) > * I clone my POSIX hosted git or Mercurial repository on a Windows client > * I have to connect my Linux client to a Windows Active Directory > domain (or vice-versa) > * I have to interoperate between native code and JVM code > > The 
entire computing industry is currently struggling with this > monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale > encoding/code pages) -> multilingual (Unicode) transition. It's been > going on for decades, and it's still going to be quite some time > before we're done. > > The POSIX world is slowly clawing its way towards a multilingual model > that actually works: UTF-8 > Windows (including the CLR) and the JVM adopted a different > multilingual model, but still one that actually works: UTF-16-LE This kind of puts the "length" of the python2->python3 transition period in perspective, doesn't it? --David From p.f.moore at gmail.com Tue Aug 26 17:23:30 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Tue, 26 Aug 2014 16:23:30 +0100 Subject: [Python-Dev] Windows Unicode console support [Was: Bytes path support] Message-ID: On 24 August 2014 04:27, Nick Coghlan wrote: > One of those areas is the fact that we still use the old 8-bit APIs to > interact with the Windows console. Those are just as broken in a > multilingual world as the other Windows 8-bit APIs, so Drekin came up > with a project to expose the Windows console as a UTF-16-LE stream > that uses the 16-bit APIs instead: > https://pypi.python.org/pypi/win_unicode_console > > I personally hope we'll be able to get the issues Drekin references > there resolved for Python 3.5 - if other folks hope for the same > thing, then one of the best ways to help that happen is to try out the > win_unicode_console module and provide feedback on what does and > doesn't work. This looks very cool, and I plan on giving it a try. But I don't see any issues mentioned there (unless you mean the fact that it's not possible to hook into Python's interactive interpreter directly, but I don't see how that could be fixed in an external module). There are no open issues on the project's github tracker.
I'd love to see this go into 3.5, so any more specific suggestions as to what would be needed to move it forwards would be great. Paul From tjreedy at udel.edu Tue Aug 26 18:51:02 2014 From: tjreedy at udel.edu (Terry Reedy) Date: Tue, 26 Aug 2014 12:51:02 -0400 Subject: [Python-Dev] Bytes path support In-Reply-To: <20140826131132.6FB45250E3E@webabinitio.net> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> <20140823151552.GA4264@phdru.name> <20140823183729.GA7819@phdru.name> <20140826131132.6FB45250E3E@webabinitio.net> Message-ID: On 8/26/2014 9:11 AM, R. David Murray wrote: > On Sun, 24 Aug 2014 13:27:55 +1000, Nick Coghlan wrote: >> As some examples of where bilingual computing breaks down: >> >> * My NFS client and server may have different locale settings >> * My FTP client and server may have different locale settings >> * My SSH client and server may have different locale settings >> * I save a file locally and send it to someone with a different locale setting >> * I attempt to access a Windows share from a Linux client (or vice-versa) >> * I clone my POSIX hosted git or Mercurial repository on a Windows client >> * I have to connect my Linux client to a Windows Active Directory >> domain (or vice-versa) >> * I have to interoperate between native code and JVM code >> >> The entire computing industry is currently struggling with this >> monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale >> encoding/code pages) -> multilingual (Unicode) transition. It's been >> going on for decades, and it's still going to be quite some time >> before we're done. 
>> >> The POSIX world is slowly clawing its way towards a multilingual model >> that actually works: UTF-8 >> Windows (including the CLR) and the JVM adopted a different >> multilingual model, but still one that actually works: UTF-16-LE Nick, I think the first half of your post is one of the clearest expositions yet of 'why Python 3' (in particular, the str to unicode change). It is worthy of wider distribution and without much change, it would be a great blog post. > This kind of puts the "length" of the python2->python3 transition > period in perspective, doesn't it? -- Terry Jan Reedy From ncoghlan at gmail.com Wed Aug 27 00:52:32 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 27 Aug 2014 08:52:32 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> <20140823151552.GA4264@phdru.name> <20140823183729.GA7819@phdru.name> <20140826131132.6FB45250E3E@webabinitio.net> Message-ID: On 27 Aug 2014 02:52, "Terry Reedy" wrote: > > On 8/26/2014 9:11 AM, R. 
David Murray wrote: >> >> On Sun, 24 Aug 2014 13:27:55 +1000, Nick Coghlan wrote: >>> >>> As some examples of where bilingual computing breaks down: >>> >>> * My NFS client and server may have different locale settings >>> * My FTP client and server may have different locale settings >>> * My SSH client and server may have different locale settings >>> * I save a file locally and send it to someone with a different locale setting >>> * I attempt to access a Windows share from a Linux client (or vice-versa) >>> * I clone my POSIX hosted git or Mercurial repository on a Windows client >>> * I have to connect my Linux client to a Windows Active Directory >>> domain (or vice-versa) >>> * I have to interoperate between native code and JVM code >>> >>> The entire computing industry is currently struggling with this >>> monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale >>> encoding/code pages) -> multilingual (Unicode) transition. It's been >>> going on for decades, and it's still going to be quite some time >>> before we're done. >>> >>> The POSIX world is slowly clawing its way towards a multilingual model >>> that actually works: UTF-8 >>> Windows (including the CLR) and the JVM adopted a different >>> multilingual model, but still one that actually works: UTF-16-LE > > > Nick, I think the first half of your post is one of the clearest expositions yet of 'why Python 3' (in particular, the str to unicode change). It is worthy of wider distribution and without much change, it would be a great blog post. Indeed, I had the same idea - I had been assuming users already understood this context, which is almost certainly an invalid assumption. The blog post version is already mostly written, but I ran out of weekend. Will hopefully finish it up and post it some time in the next few days :) >> This kind of puts the "length" of the python2->python3 transition >> period in perspective, doesn't it? 
I realised in writing the post that ASCII is over 50 years old at this point, while Unicode as an official standard is more than 20. By the time this is done, we'll likely be talking 30+ years for Unicode to displace the confusing mess that is code pages and locale encodings :) Cheers, Nick. > > > -- > Terry Jan Reedy > > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From Nikolaus at rath.org Wed Aug 27 03:39:35 2014 From: Nikolaus at rath.org (Nikolaus Rath) Date: Tue, 26 Aug 2014 18:39:35 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: (Nick Coghlan's message of "Wed, 27 Aug 2014 08:52:32 +1000") References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> <20140823151552.GA4264@phdru.name> <20140823183729.GA7819@phdru.name> <20140826131132.6FB45250E3E@webabinitio.net> Message-ID: <8761heaey0.fsf@vostro.rath.org> Nick Coghlan writes: >>>> As some examples of where bilingual computing breaks down: >>>> >>>> * My NFS client and server may have different locale settings >>>> * My FTP client and server may have different locale settings >>>> * My SSH client and server may have different locale settings >>>> * I save a file locally and send it to someone with a different locale > setting >>>> * I attempt to access a Windows share from a Linux client (or > vice-versa) >>>> * I clone my POSIX hosted git or Mercurial repository on a Windows > client >>>> * I have to connect my Linux client to a Windows Active Directory >>>> domain (or vice-versa) >>>> * I have to interoperate between native code and JVM code >>>> >>>> The entire computing industry is 
currently struggling with this >>>> monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale >>>> encoding/code pages) -> multilingual (Unicode) transition. It's been >>>> going on for decades, and it's still going to be quite some time >>>> before we're done. >>>> >>>> The POSIX world is slowly clawing its way towards a multilingual model >>>> that actually works: UTF-8 >>>> Windows (including the CLR) and the JVM adopted a different >>>> multilingual model, but still one that actually works: UTF-16-LE >> >> >> Nick, I think the first half of your post is one of the clearest > expositions yet of 'why Python 3' (in particular, the str to unicode > change). It is worthy of wider distribution and without much change, it > would be a great blog post. > > Indeed, I had the same idea - I had been assuming users already understood > this context, which is almost certainly an invalid assumption. > > The blog post version is already mostly written, but I ran out of weekend. > Will hopefully finish it up and post it some time in the next few days > :) In that case, maybe it'd be nice to also explain why you use the term "bilingual" for codepage based encoding. At least to me, a codepage/locale is pretty monolingual, or alternatively covering a whole region (e.g. western europe). I figure with bilingual you mean ascii + something, but that's mostly a guess from my side. Best, -Nikolaus -- GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F "Time flies like an arrow, fruit flies like a Banana." From stephen at xemacs.org Wed Aug 27 04:52:46 2014 From: stephen at xemacs.org (Stephen J.
Turnbull) Date: Wed, 27 Aug 2014 11:52:46 +0900 Subject: [Python-Dev] Bytes path support In-Reply-To: <8761heaey0.fsf@vostro.rath.org> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> <20140823151552.GA4264@phdru.name> <20140823183729.GA7819@phdru.name> <20140826131132.6FB45250E3E@webabinitio.net> <8761heaey0.fsf@vostro.rath.org> Message-ID: <87iolebq4h.fsf@uwakimon.sk.tsukuba.ac.jp> Nikolaus Rath writes: > In that case, maybe it'd be nice to also explain why you use the > term "bilingual" for codepage based encoding. Modern computing systems are written in languages which are invariably based on syntax expressed using ASCII, and provide by default functionality for expressing dates etc suitable for rendering American English. Thus ASCII (ie, American English) is always an available language. Code pages provide facilities for rendering one or more languages sharing a common coded character set, but are unsuitable for rendering most of the rest of the world's dozens of language groups (grouping languages by common character set). Multilingual has come to mean "able to express (almost) any set of languages in a single text" (see, for example, Emacs's "HELLO" file), not just "more than two". So code pages are closer in spirit to "bilingual" (two of many) than to "multilingual" (all of many). It's messy, analogical terminology. But then, natural language is messy and analogical.
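[A small illustration of the code-page problem described above - my example, not part of the thread: under locale-based "bilingual" text handling, the same bytes decode to different text depending on which code page the reader happens to use.]

```python
# Illustration (not from the thread): one byte sequence, three readings.
data = "café".encode("utf-8")        # written on a UTF-8 system: b'caf\xc3\xa9'

print(data.decode("utf-8"))          # matching encoding round-trips: café
print(data.decode("latin-1"))        # Western-European code page: cafÃ©
print(data.decode("cp1251"))         # Cyrillic code page: cafГ©
```

Only metadata (or guessing) tells a reader which of the three was meant, which is exactly why mismatched locale settings between client and server produce mojibake.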
From ncoghlan at gmail.com Wed Aug 27 10:09:13 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 27 Aug 2014 18:09:13 +1000 Subject: [Python-Dev] Windows Unicode console support [Was: Bytes path support] In-Reply-To: References: Message-ID: On 27 August 2014 01:23, Paul Moore wrote: > On 24 August 2014 04:27, Nick Coghlan wrote: >> One of those areas is the fact that we still use the old 8-bit APIs to >> interact with the Windows console. Those are just as broken in a >> multilingual world as the other Windows 8-bit APIs, so Drekin came up >> with a project to expose the Windows console as a UTF-16-LE stream >> that uses the 16-bit APIs instead: >> https://pypi.python.org/pypi/win_unicode_console >> >> I personally hope we'll be able to get the issues Drekin references >> there resolved for Python 3.5 - if other folks hope for the same >> thing, then one of the best ways to help that happen is to try out the >> win_unicode_console module and provide feedback on what does and >> doesn't work. > > This looks very cool, and I plan on giving it a try. But I don't see > any issues mentioned there (unless you mean the fact that it's not > possible to hook into Python's interactive interpreter directly, but I > don't see how that could be fixed in an external module). There's no > open issues on the project's github tracker. There are two links to CPython issues from the project description: http://bugs.python.org/issue1602 http://bugs.python.org/issue17620 Part of the feedback on those was that as much as possible should be made available as a third party module before returning to the question of how to update CPython. If we can get additional confirmation that the module addresses the CLI integration issues, then we can take a closer look at switching CPython itself over. Cheers, Nick. 
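[For anyone wanting to try the module, a minimal usage sketch based on its PyPI description; `enable()` is the entry point the project documents, the `use_unicode_console` wrapper name is mine, and the platform guard keeps the snippet a no-op outside Windows.]

```python
import sys

def use_unicode_console():
    """Swap in UTF-16-aware console streams on Windows; no-op elsewhere."""
    if sys.platform != "win32":
        return False
    import win_unicode_console  # third-party: pip install win_unicode_console
    win_unicode_console.enable()  # replaces the sys.std* stream objects
    return True

if use_unicode_console():
    print("Unicode console streams enabled")
```

Running this before any console output is the kind of opt-in test that would generate the feedback Nick asks for above.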
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From p.f.moore at gmail.com Wed Aug 27 11:46:53 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Wed, 27 Aug 2014 10:46:53 +0100 Subject: [Python-Dev] Windows Unicode console support [Was: Bytes path support] In-Reply-To: References: Message-ID: On 27 August 2014 09:09, Nick Coghlan wrote: > There are two links to CPython issues from the project description: > > http://bugs.python.org/issue1602 > http://bugs.python.org/issue17620 > > Part of the feedback on those was that as much as possible should be > made available as a third party module before returning to the > question of how to update CPython. OK, ta. The only issues I'm seeing are that it doesn't play well with the interactive interpreter, which is a known problem but unfortunately makes it pretty hard for me to do any significant testing (nearly all of the stuff that I do which prints to the screen is in the REPL, or in IPython which has its own custom interpreter loop). If I come up with anything worth commenting on, I will do so (I assume that comments of the form "+1 me too!" are not needed ;-)) Paul From ncoghlan at gmail.com Wed Aug 27 14:16:35 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Wed, 27 Aug 2014 22:16:35 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> <20140823151552.GA4264@phdru.name> <20140823183729.GA7819@phdru.name> <20140826131132.6FB45250E3E@webabinitio.net> Message-ID: On 27 August 2014 08:52, Nick Coghlan wrote: > On 27 Aug 2014 02:52, "Terry Reedy" wrote: >> Nick, I think the first half of your post is one of the clearest >> expositions yet of 'why Python 3' (in particular, the str to unicode >> change). It is worthy of wider distribution and without much change, it >> would be a great blog post. 
> > Indeed, I had the same idea - I had been assuming users already understood > this context, which is almost certainly an invalid assumption. > > The blog post version is already mostly written, but I ran out of weekend. > Will hopefully finish it up and post it some time in the next few days :) Aaand, it's up: http://www.curiousefficiency.org/posts/2014/08/multilingual-programming.html Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ndbecker2 at gmail.com Wed Aug 27 14:58:56 2014 From: ndbecker2 at gmail.com (Neal Becker) Date: Wed, 27 Aug 2014 08:58:56 -0400 Subject: [Python-Dev] pip enhancement Message-ID: On systems where os-level packaging is available (e.g., fedora linux), it is not unusual to want a newer python package installed than available from the vendor. pip install --user can be used for this. But then there is the danger that these pip installed packages are not maintained. At least, pip should have the ability to alert the user to potential updates, pip update could list which packages need updating, and offer to perform the update. I think this would go a long way to helping with this problem. -- -- Those who don't understand recursion are doomed to repeat it From skip at pobox.com Wed Aug 27 15:21:24 2014 From: skip at pobox.com (Skip Montanaro) Date: Wed, 27 Aug 2014 08:21:24 -0500 Subject: [Python-Dev] pip enhancement In-Reply-To: References: Message-ID: On Wed, Aug 27, 2014 at 7:58 AM, Neal Becker wrote: > On systems where os-level packaging is available (e.g., fedora linux), it is not > unusual to want a newer python package installed than available from the vendor. > pip install --user can be used for this. How? I have exactly this problem with nose. We actually get it bundled (currently at ancient 1.1.2, trying to get to 1.3.4) with a bunch of other open source software from an outside packaging company, and even though I add the --user flag, it still complains that a version is already installed. 
When I add the --upgrade flag it tries to uninstall the global version. Skip From p.f.moore at gmail.com Wed Aug 27 15:24:42 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Wed, 27 Aug 2014 14:24:42 +0100 Subject: [Python-Dev] pip enhancement In-Reply-To: References: Message-ID: On 27 August 2014 13:58, Neal Becker wrote: > At least, pip should have the ability to alert the user to potential updates, > > pip update > > could list which packages need updating, and offer to perform the update. I > think this would go a long way to helping with this problem. Do you mean something like "pip list --outdated"? Paul From skip at pobox.com Wed Aug 27 15:46:01 2014 From: skip at pobox.com (Skip Montanaro) Date: Wed, 27 Aug 2014 08:46:01 -0500 Subject: [Python-Dev] pip enhancement In-Reply-To: References: Message-ID: On Wed, Aug 27, 2014 at 8:24 AM, Paul Moore wrote: > Do you mean something like "pip list --outdated"? I was unaware of that command, as we were stuck at pip 1.2.1. I just updated pip manually to 1.5.6. That is a very helpful command. It would be even better if it understood --user so it could restrict its view to user-installed stuff. Also, given that packages can be found in multiple places on a system, for me: * the OpenSuSE system packages * TWW-provided system-wide packages * our own system-wide packages in /opt/local * my private stuff in ~/.local it would be great if there was a way for it to tell me where on my system it found outdated package X.
The --verbose flag tells me all > sorts of other stuff I'm not really interested in, but not the > installed location of the outdated package. There's also packaged environments like conda. It would be nice if pip could distinguish between conda-managed packages and ones I installed myself. Really, though, this is what the PEP 376 "INSTALLER" file was intended for. As far as I know, though, it was never implemented (and you'd also need to persuade the Linux vendors, the conda people, etc, to use it as well if it were to be of any practical use). Agreed about reporting the installed location, though. Specific suggestions like this would be good things to add to the pip issue tracker. Paul From graffatcolmingov at gmail.com Wed Aug 27 16:04:17 2014 From: graffatcolmingov at gmail.com (Ian Cordasco) Date: Wed, 27 Aug 2014 09:04:17 -0500 Subject: [Python-Dev] pip enhancement In-Reply-To: References: Message-ID: On Wed, Aug 27, 2014 at 8:24 AM, Paul Moore wrote: > On 27 August 2014 13:58, Neal Becker wrote: >> At least, pip should have the ability to alert the user to potential updates, >> >> pip update >> >> could list which packages need updating, and offer to perform the update. I >> think this would go a long way to helping with this problem. > > Do you mean something like "pip list --outdated"? > Paul Also, isn't this discussion better suited for Distutils-SIG? From skip at pobox.com Wed Aug 27 17:36:34 2014 From: skip at pobox.com (Skip Montanaro) Date: Wed, 27 Aug 2014 10:36:34 -0500 Subject: [Python-Dev] pip enhancement In-Reply-To: References: Message-ID: On Wed, Aug 27, 2014 at 9:04 AM, Ian Cordasco wrote: > Also, isn't this discussion better suited for Distutils-SIG? I started up a thread there. I'd post an archive link, but it hasn't yet turned up in the distutils-sig archive. 
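[The location report asked for above can be scripted against the standard library; this sketch uses importlib.metadata, which arrived in Python versions later than the ones in this thread, and `installed_locations` is a made-up helper name, not a pip feature.]

```python
from importlib import metadata

def installed_locations():
    """Report (name, version, install directory) for each distribution found."""
    rows = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"] or "UNKNOWN"
        version = dist.version or "?"
        # locate_file("") resolves relative to the distribution's install dir,
        # so user-site and system-site packages show up with distinct paths
        where = str(dist.locate_file(""))
        rows.append((name, version, where))
    return sorted(rows)

for name, version, where in installed_locations():
    print(f"{name:30} {version:15} {where}")
```

Sorting the output by path instead of name would group the user-installed packages together, which is roughly the --user filtering Skip and Neal are asking pip for.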
Skip From ndbecker2 at gmail.com Wed Aug 27 15:36:13 2014 From: ndbecker2 at gmail.com (Neal Becker) Date: Wed, 27 Aug 2014 09:36:13 -0400 Subject: [Python-Dev] pip enhancement In-Reply-To: References: Message-ID: Wow, I didn't know that existed. Maybe needs to be more obvious. But not quite. It doesn't distinguish between locally installed files, and globally installed. Here, globally installed are maintained by the OS vendor packaging, while locally (user, not virtualenv) installed are managed by pip. Really what's needed is for pip --user to apply to all pip commands, and tell pip to ignore the system stuff. Running pip list --outdated runs a long time, and gives me a very long list of packages that are outdated, leaving me to still sort through which are --user (and I might want to update via pip) and which are global (and I can't really do anything about, other than filing a bug report requesting an update). On Wed, Aug 27, 2014 at 9:24 AM, Paul Moore wrote: > On 27 August 2014 13:58, Neal Becker wrote: > > At least, pip should have the ability to alert the user to potential > updates, > > > > pip update > > > > could list which packages need updating, and offer to perform the > update. I > > think this would go a long way to helping with this problem. > > Do you mean something like "pip list --outdated"? > Paul > -- *Those who don't understand recursion are doomed to repeat it* -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From v+python at g.nevcal.com Wed Aug 27 20:18:11 2014 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 27 Aug 2014 11:18:11 -0700 Subject: [Python-Dev] Bytes path support In-Reply-To: References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> <20140823151552.GA4264@phdru.name> <20140823183729.GA7819@phdru.name> <20140826131132.6FB45250E3E@webabinitio.net> Message-ID: <53FE20E3.2060700@g.nevcal.com> On 8/27/2014 5:16 AM, Nick Coghlan wrote: > On 27 August 2014 08:52, Nick Coghlan wrote: >> On 27 Aug 2014 02:52, "Terry Reedy" wrote: >>> Nick, I think the first half of your post is one of the clearest >>> expositions yet of 'why Python 3' (in particular, the str to unicode >>> change). It is worthy of wider distribution and without much change, it >>> would be a great blog post. >> Indeed, I had the same idea - I had been assuming users already understood >> this context, which is almost certainly an invalid assumption. >> >> The blog post version is already mostly written, but I ran out of weekend. >> Will hopefully finish it up and post it some time in the next few days :) > Aaand, it's up: > http://www.curiousefficiency.org/posts/2014/08/multilingual-programming.html > > Cheers, > Nick. > Indeed, I also enjoyed and found enlightening your response to this issue, including the broader historical context. I remember when Unicode was first published back in 1991, and it sounded interesting, but far removed from the reality of implementations of the day. I was intrigued by UTF-8 at the time, and even wrote an encoder and decoder for it for a software package that eventually never reached any real customers. Your blog post says: > > Choosing UTF-8 aims to treat formatting text for communication with > the user as "just a display issue". 
It's a low impact design that will > "just work" for a lot of software, but it comes at a price: > > * because encoding consistency checks are mostly avoided, data in > different encodings may be freely concatenated and passed on to > other applications. Such data is typically not usable by the > receiving application. > I don't believe this is a necessary result of using UTF-8. It is a possible result, and I guess some implementations are using it this way, but a proper language could still provide and/or require proper usage of UTF-8 data through its type system just as Python3 is doing with PEP 393. In fact, if it were not for the requirement to support passing character strings in other formats (UTF-16, UTF-32) to historical APIs (in CPython add-on packages) and the resulting practical performance considerations of converting to/from UTF-8 repeatedly when calling those APIs, Python3 could have evolved to using UTF-8 as its underlying data format, and obtained encoding consistency equal to what it has today. Of course, nothing can be "required" if the user chooses to continue operating in the encoded domain, and manipulate data using the necessary byte-oriented features of whatever language is in use. One of the choices of Python3 was to retain character indexing as an underlying arithmetic implementation citing algorithmic speed, but that is a seldom needed operation, and of limited general applicability when considering grapheme clusters. An iterator based approach can solve both problems, but would have been best introduced as part of Python3.0, although it may have made 2to3 harder, and may have made it less practical to implement six and other "run on both Py2 and Py3" type solutions, without introducing those same iterative solutions into Python 2.6 or 2.7. Such solutions could still be implemented as options. Even PEP 393 grudgingly supports some use of UTF-8 when requested by the user, as I understand it. 
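The indexing trade-off in question can be made concrete: with a UTF-8 internal representation, fetching the n-th code point means scanning for lead bytes, where a fixed-width representation indexes in constant time. A minimal sketch (the function name and shape are purely illustrative, not any proposed API):

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-8 `data` by scanning lead bytes.

    This is O(len(data)); indexing a fixed-width representation is O(1).
    """
    count = -1
    for i, byte in enumerate(data):
        if byte & 0xC0 != 0x80:        # lead byte: starts a new code point
            count += 1
            if count == n:
                end = i + 1
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1           # swallow continuation bytes
                return data[i:end].decode('utf-8')
    raise IndexError(n)
```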
Whether such an implementation would be better based on bytes or str is uncertain without further analysis, although type checking would probably be easier if based on str. A high-performance implementation would likely need to be implemented at least partly in C rather than CPython, although it could be prototyped in Python for proof of functionality. The iterators could obviously be implemented to work based on top of solutions such as PEP 393, by simply using indexing underneath, when fixed-width characters are available, and other techniques when UTF-8 is the only available format (rather than converting from UTF-8 to fixed-width characters because of calling the iterator). -------------- next part -------------- An HTML attachment was scrubbed... URL: From v+python at g.nevcal.com Wed Aug 27 20:21:06 2014 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 27 Aug 2014 11:21:06 -0700 Subject: [Python-Dev] Bytes path related questions for Guido In-Reply-To: <53FC7007.2060502@mrabarnett.plus.com> References: <878umcc84s.fsf@uwakimon.sk.tsukuba.ac.jp> <53FC7007.2060502@mrabarnett.plus.com> Message-ID: <53FE2192.7050206@g.nevcal.com> On 8/26/2014 4:31 AM, MRAB wrote: > On 2014-08-26 03:11, Stephen J. Turnbull wrote: >> Nick Coghlan writes: >> >> > "purge_surrogate_escapes" was the other term that occurred to me. >> >> "purge" suggests removal, not replacement. That may be useful too. >> >> neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD') >> > How about: > > replace_surrogate_escapes(s, replacement='\uFFFD') > > If you want them removed, just pass an empty string as the replacement. And further, replacement could be a vector of 128 characters, to do immediate transcoding, or a single character to do wholesale replacement with some gibberish character, or None to remove (or an empty string). -------------- next part -------------- An HTML attachment was scrubbed... 
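The "vector of 128 characters" idea can be sketched as follows. `replace_surrogate_escapes` is the hypothetical helper being discussed in this thread, not an existing stdlib function; surrogateescape maps undecodable bytes 0x80-0xFF to U+DC80-U+DCFF, so a 128-entry vector covers every possible escape:

```python
def replace_surrogate_escapes(s, replacement='\ufffd'):
    """Replace lone surrogates (U+DC80..U+DCFF) produced by surrogateescape.

    `replacement` may be a single string (wholesale replacement, or '' to
    remove), or a 128-entry sequence mapping each escaped byte to a character.
    """
    out = []
    for ch in s:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            if len(replacement) == 128:          # per-byte transcoding vector
                out.append(replacement[code - 0xDC80])
            else:                                # fixed replacement string
                out.append(replacement)
        else:
            out.append(ch)
    return ''.join(out)
```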
URL: From ncoghlan at gmail.com Thu Aug 28 01:54:31 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 28 Aug 2014 09:54:31 +1000 Subject: [Python-Dev] Bytes path support In-Reply-To: <53FE20E3.2060700@g.nevcal.com> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> <20140823151552.GA4264@phdru.name> <20140823183729.GA7819@phdru.name> <20140826131132.6FB45250E3E@webabinitio.net> <53FE20E3.2060700@g.nevcal.com> Message-ID: On 28 Aug 2014 04:20, "Glenn Linderman" wrote: > > On 8/27/2014 5:16 AM, Nick Coghlan wrote: >> >> On 27 August 2014 08:52, Nick Coghlan wrote: >>> >>> On 27 Aug 2014 02:52, "Terry Reedy" wrote: >>>> >>>> Nick, I think the first half of your post is one of the clearest >>>> expositions yet of 'why Python 3' (in particular, the str to unicode >>>> change). It is worthy of wider distribution and without much change, it >>>> would be a great blog post. >>> >>> Indeed, I had the same idea - I had been assuming users already understood >>> this context, which is almost certainly an invalid assumption. >>> >>> The blog post version is already mostly written, but I ran out of weekend. >>> Will hopefully finish it up and post it some time in the next few days :) >> >> Aaand, it's up: >> http://www.curiousefficiency.org/posts/2014/08/multilingual-programming.html >> >> Cheers, >> Nick. >> > > Indeed, I also enjoyed and found enlightening your response to this issue, including the broader historical context. I remember when Unicode was first published back in 1991, and it sounded interesting, but far removed from the reality of implementations of the day. I was intrigued by UTF-8 at the time, and even wrote an encoder and decoder for it for a software package that eventually never reached any real customers. 
> > Your blog post says: >> >> Choosing UTF-8 aims to treat formatting text for communication with the user as "just a display issue". It's a low impact design that will "just work" for a lot of software, but it comes at a price: >> >> because encoding consistency checks are mostly avoided, data in different encodings may be freely concatenated and passed on to other applications. Such data is typically not usable by the receiving application. > > > I don't believe this is a necessary result of using UTF-8. It is a possible result, and I guess some implementations are using it this way, but a proper language could still provide and/or require proper usage of UTF-8 data through its type system just as Python3 is doing with PEP 393. Yes, Go works that way, for example. I doubt it actually checks for valid UTF-8 at OS boundaries though - that would be a potentially expensive check, and as a network service centric language, Go can afford to place more constraints on the operating environment than we can. >In fact, if it were not for the requirement to support passing character strings in other formats (UTF-16, UTF-32) to historical APIs (in CPython add-on packages) and the resulting practical performance considerations of converting to/from UTF-8 repeatedly when calling those APIs, Python3 could have evolved to using UTF-8 as its underlying data format, and obtained equal encoding consistency as it has today. We already have string processing algorithms that work for fixed width encodings (and are known not to work for variable width encodings, hence the bugs in Unicode handling on the old narrow builds). It isn't that variable width encodings aren't a viable choice for programming language text modelling, it's that the assumption of a fixed width model is more deeply entrenched in CPython (and especially the C API) than the exact number of bits used per code point. 
> Of course, nothing can be "required" if the user chooses to continue operating in the encoded domain, and manipulate data using the necessary byte-oriented features of of whatever language is in use. > > One of the choices of Python3, was to retain character indexing as an underlying arithmetic implementation citing algorithmic speed, but that is a seldom needed operation, and of limited general applicability when considering grapheme clusters. The choice that was made was to say no to the question "Do we rewrite a Unicode type that we already know works from scratch?". The decisions about how to handle *text* were made way back before the PEP process even existed, and later captured as PEP 100. What changed in Python 3 was dropping the hybrid 8-bit str type with its locale dependent behaviour, and parcelling its responsibilities out to either the existing unicode type (renamed as str, as it was the default choice), or the new locale independent bytes type. > An iterator based approach can solve both problems, but would have been best introduced as part of Python3.0, although it may have made 2to3 harder, and may have made it less practical to implement six and other "run on both Py2 and Py3" type solutions harder, without introducing those same iterative solutions into Python 2.6 or 2.7. The option of fundamentally changing the text handling design was never on the table. The Python 2 unicode type works fine; it is the Python 2 str type that needed changing. > Such solutions could still be implemented as options. Even PEP 393 grudgingly supports some use of UTF-8 when requested by the user, as I understand it. Not quite. PEP 393 heavily favours and optimises UTF-8, trading memory for speed by implicitly caching the UTF-8 representation: the support isn't begrudged, it's enthusiastic. We just don't use it for the text processing algorithms, because those assume a fixed width encoding. 
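Nick's description of PEP 393 choosing a width per string is easy to observe from Python. The exact byte counts are CPython implementation details, so only the ordering is meaningful here:

```python
import sys

# PEP 393 stores each str in the narrowest fixed width (1, 2, or 4 bytes per
# code point) that can hold its largest character, preserving O(1) indexing.
latin  = '\xe9' * 100          # fits in 1 byte per code point
bmp    = '\u20ac' * 100        # EURO SIGN: needs 2 bytes per code point
astral = '\U0001d11e' * 100    # MUSICAL SYMBOL G CLEF: needs 4 bytes

sizes = [sys.getsizeof(s) for s in (latin, bmp, astral)]
```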
> Whether such an implementation would be better based on bytes or str is uncertain without further analysis, although type checking would probably be easier if based on str. A high-performance implementation would likely need to be implemented at least partly in C rather than CPython, although it could be prototyped in Python for proof of functionality. The iterators could obviously be implemented to work based on top of solutions such as PEP 393, by simply using indexing underneath, when fixed-width characters are available, and other techniques when UTF-8 is the only available format (rather than converting from UTF-8 to fixed-width characters because of calling the iterator). For the cost of rewriting every single string manipulation algorithm in CPython to avoid relying on C array access, the only thing you would save over PEP 393 is a bit of memory - we already store the UTF-8 representation when appropriate. There's simply not a sufficient payoff to justify the cost. Cheers, Nick. > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stephen at xemacs.org Thu Aug 28 03:08:43 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Thu, 28 Aug 2014 10:08:43 +0900 Subject: [Python-Dev] Bytes path related questions for Guido In-Reply-To: <53FE2192.7050206@g.nevcal.com> References: <878umcc84s.fsf@uwakimon.sk.tsukuba.ac.jp> <53FC7007.2060502@mrabarnett.plus.com> <53FE2192.7050206@g.nevcal.com> Message-ID: <87fvghbeuc.fsf@uwakimon.sk.tsukuba.ac.jp> Glenn Linderman writes: > On 8/26/2014 4:31 AM, MRAB wrote: > > On 2014-08-26 03:11, Stephen J. 
Turnbull wrote: > >> Nick Coghlan writes: > > How about: > > > > replace_surrogate_escapes(s, replacement='\uFFFD') > > > > If you want them removed, just pass an empty string as the > > replacement. That seems better to me (I had too much C for breakfast, I think). > And further, replacement could be a vector of 128 characters, to do > immediate transcoding, Using what encoding? If you knew that much, why didn't you use (write, if necessary) an appropriate codec? I can't envision this being useful. OTOH, I could see using replace_surrogate_escapes(s, replacement='&#xFFFD;') in HTML. (Actually, probably not; if it makes sense to use Unicode features you're probably using Unicode as the external encoding, so a character entity is silly. But there might be contexts with useful multicharacter replacements.) > or a single character to do wholesale replacement with some > gibberish character, or None to remove (or an empty string). Not None; that means default (which should be the Unicode standard REPLACEMENT CHARACTER U+FFFD). Steve From stephen at xemacs.org Thu Aug 28 04:04:01 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Thu, 28 Aug 2014 11:04:01 +0900 Subject: [Python-Dev] Bytes path support In-Reply-To: <53FE20E3.2060700@g.nevcal.com> References: <20140821222721.GA13888@cskk.homeip.net> <53F68802.7080906@g.nevcal.com> <5124983344373446869@unknownmsgid> <20140822024229.GA8192@phdru.name> <87fvgnd1f5.fsf@uwakimon.sk.tsukuba.ac.jp> <20140823151552.GA4264@phdru.name> <20140823183729.GA7819@phdru.name> <20140826131132.6FB45250E3E@webabinitio.net> <53FE20E3.2060700@g.nevcal.com> Message-ID: <87egw1bca6.fsf@uwakimon.sk.tsukuba.ac.jp> Glenn Linderman writes: > On 8/27/2014 5:16 AM, Nick Coghlan wrote: > > Choosing UTF-8 aims to treat formatting text for communication with > > the user as "just a display issue". 
It's a low impact design that will > > "just work" for a lot of software, but it comes at a price: > > > > * because encoding consistency checks are mostly avoided, data in > > different encodings may be freely concatenated and passed on to > > other applications. Such data is typically not usable by the > > receiving application. > > I don't believe this is a necessary result of using UTF-8. No, it's not, but if you're going to do the same kind of checks that are necessary for transcoding UTF-8 to abstract Unicode, there's no benefit to using UTF-8 internally, and you lose a lot. The only operations that you can do efficiently are concatenation and iteration. I've worked with a UTF-8-like internal encoding for 20 years now -- it's a huge cost. > Python3 could have evolved to using UTF-8 as its underlying data > format, and obtained equal encoding consistency as it has today. Thank heaven it didn't! > One of the choices of Python3, was to retain character indexing as an > underlying arithmetic implementation citing algorithmic speed, but that > is a seldom needed operation, That simply isn't true. The negative effects of algorithmic slowness in Emacsen are visible both as annoying user delays, and as excessive developer concentration on optimizing a fundamentally insufficient data structure. > and of limited general applicability when considering grapheme > clusters. An iterator based approach can solve both problems, On the contrary, grapheme clusters are the relatively rare use case in textual computing, at least currently, that can be optimized for when necessary. There's no problem with creating iterators from arrays, but making an iterator behave like a array ... well, that involves creating the array. > Such solutions could still be implemented as options. Sure, but the problems to be solved in that implementation are not due to Python 3's internal representation. A lot of painstaking (and possibly hard?) work remains to be done. 
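As a sketch of "creating iterators from arrays": the generator below walks a str and groups combining marks with their base character. It is illustrative only; real grapheme cluster boundaries are defined by UAX #29 and involve much more than combining marks:

```python
import unicodedata

def graphemes(s):
    """Yield crude grapheme clusters: a base character plus combining marks.

    Illustrative only: full cluster boundaries are specified by UAX #29.
    """
    cluster = ''
    for ch in s:
        # A non-combining character starts a new cluster.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ''
        cluster += ch
    if cluster:
        yield cluster
```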
> A high-performance implementation would likely need to be > implemented at least partly in C rather than CPython, That's how Emacs did it, and (a) over the decades it has involved an inordinate amount of effort compared to rewriting the text-handling functions for an array, (b) is fragile, and (c) performance sucks in practice. Unicode, not UTF-8, is the central component of the solution. The various UTFs are application-specific implementations of Unicode. UTF-8 is an excellent solution for text streams, such as disk files and network communication. Fixed-width representations (ISO-8859-1, UCS-2, UTF-32, PEP-393) are useful for applications of large buffers that need O(1) "random" access, and can trivially be iterated for stream applications. Steve From v+python at g.nevcal.com Thu Aug 28 06:56:50 2014 From: v+python at g.nevcal.com (Glenn Linderman) Date: Wed, 27 Aug 2014 21:56:50 -0700 Subject: [Python-Dev] Bytes path related questions for Guido In-Reply-To: <87fvghbeuc.fsf@uwakimon.sk.tsukuba.ac.jp> References: <878umcc84s.fsf@uwakimon.sk.tsukuba.ac.jp> <53FC7007.2060502@mrabarnett.plus.com> <53FE2192.7050206@g.nevcal.com> <87fvghbeuc.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <53FEB692.7000207@g.nevcal.com> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote: > Glenn Linderman writes: > > On 8/26/2014 4:31 AM, MRAB wrote: > > > On 2014-08-26 03:11, Stephen J. Turnbull wrote: > > >> Nick Coghlan writes: > > > > How about: > > > > > > replace_surrogate_escapes(s, replacement='\uFFFD') > > > > > > If you want them removed, just pass an empty string as the > > > replacement. > > That seems better to me (I had too much C for breakfast, I think). > > > And further, replacement could be a vector of 128 characters, to do > > immediate transcoding, > > Using what encoding? The vector would contain the transcoding. Each lone surrogate would map to a character in the vector. > If you knew that much, why didn't you use > (write, if necessary) an appropriate codec? 
I can't envision this > being useful. If the data format describes its encoding, possibly containing data from several encodings in various spots, then perhaps it is best read as binary, and processed as binary until those definitions are found. But an alternative would be to read with surrogate escapes, and then when the encoding is determined, to transcode the data. Previously, a proposal was made to reverse the surrogate escapes to the original bytes, and then apply the (now known) appropriate codec. There are not appropriate codecs that can convert directly from surrogate escapes to the desired end result. This technique could be used instead, for single-byte, non-escaped encodings. On the other hand, writing specialty codecs for the purpose would be more general. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stephen at xemacs.org Thu Aug 28 08:30:44 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Thu, 28 Aug 2014 15:30:44 +0900 Subject: [Python-Dev] Bytes path related questions for Guido In-Reply-To: <53FEB692.7000207@g.nevcal.com> References: <878umcc84s.fsf@uwakimon.sk.tsukuba.ac.jp> <53FC7007.2060502@mrabarnett.plus.com> <53FE2192.7050206@g.nevcal.com> <87fvghbeuc.fsf@uwakimon.sk.tsukuba.ac.jp> <53FEB692.7000207@g.nevcal.com> Message-ID: <87bnr5azxn.fsf@uwakimon.sk.tsukuba.ac.jp> Glenn Linderman writes: > On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote: > > Glenn Linderman writes: > > > And further, replacement could be a vector of 128 characters, to do > > > immediate transcoding, > > > > Using what encoding? > > The vector would contain the transcoding. Each lone surrogate would map > to a character in the vector. Yes, that's obvious. The question is where do you get the vector? > > If you knew that much, why didn't you use (write, if necessary) > > an appropriate codec? I can't envision this being useful. 
> > If the data format describes its encoding, possibly containing data from > several encodings in various spots, then perhaps it is best read as > binary, and processed as binary until those definitions are found. Exactly. That's precisely why bytes have a .decode method. > But an alternative would be to read with surrogate escapes, and > then when the encoding is determined, to transcode the data. Not every one-line expression needs to be in the stdlib: data[start:end] = data[start:end].encode('utf-8', errors='surrogateescape').decode('DTRT-now') Note that you *do* need to know start and end, because of the possibility of "several encodings", where once you apply this technique to the whole text, you can't recover the surrogates when you get the encoding wrong. > Previously, a proposal was made to reverse the surrogate escapes to > the original bytes, and then apply the (now known) appropriate > codec. Sure. And in fact I do this kind of thing all the time in Emacs, using the decode(encode(slice)) approach. The only times in 25 years of working with the insanity of digitized Japanese I've had a use for anything other than that is when I don't have a round-tripping codec. In that case I have to preserve the bytes or suffer lossy conversion anyway, regardless of the method used to reconvert. But surrogateescape is necessarily round-tripping (maybe with a few exceptions in Chinese and a very small number in other languages, but those failures are due to Unicode, not to surrogateescape). > There are not appropriate codecs that can convert directly from > surrogate escapes to the desired end result. And there currently cannot be. codecs are bytes<->str, not str->str. > This technique could be used instead, for single-byte, non-escaped > encodings. That's pure theory, not a use case. We have codecs for all the encodings with significant numbers of users, and writing a new one simply isn't that hard. 
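The decode(encode(slice)) round trip described here is easy to check in current Python 3; 'latin-1' below stands in for whatever encoding is eventually discovered:

```python
raw = b'caf\xe9'                                    # latin-1 bytes; invalid as UTF-8
smuggled = raw.decode('utf-8', 'surrogateescape')   # 0xE9 is smuggled as U+DCE9

# The escape round-trips losslessly back to the original bytes...
assert smuggled.encode('utf-8', 'surrogateescape') == raw

# ...so once the real encoding is known, just re-decode with it.
fixed = smuggled.encode('utf-8', 'surrogateescape').decode('latin-1')
```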
Steve From python at mrabarnett.plus.com Thu Aug 28 09:30:39 2014 From: python at mrabarnett.plus.com (MRAB) Date: Thu, 28 Aug 2014 08:30:39 +0100 Subject: [Python-Dev] Bytes path related questions for Guido In-Reply-To: <53FEB692.7000207@g.nevcal.com> References: <878umcc84s.fsf@uwakimon.sk.tsukuba.ac.jp> <53FC7007.2060502@mrabarnett.plus.com> <53FE2192.7050206@g.nevcal.com> <87fvghbeuc.fsf@uwakimon.sk.tsukuba.ac.jp> <53FEB692.7000207@g.nevcal.com> Message-ID: <53FEDA9F.6090201@mrabarnett.plus.com> On 2014-08-28 05:56, Glenn Linderman wrote: > On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote: >> Glenn Linderman writes: >> > On 8/26/2014 4:31 AM, MRAB wrote: >> > > On 2014-08-26 03:11, Stephen J. Turnbull wrote: >> > >> Nick Coghlan writes: >> >> > > How about: >> > > >> > > replace_surrogate_escapes(s, replacement='\uFFFD') >> > > >> > > If you want them removed, just pass an empty string as the >> > > replacement. >> >> That seems better to me (I had too much C for breakfast, I think). >> >> > And further, replacement could be a vector of 128 characters, to do >> > immediate transcoding, >> >> Using what encoding? > > The vector would contain the transcoding. Each lone surrogate would map > to a character in the vector. > >> If you knew that much, why didn't you use >> (write, if necessary) an appropriate codec? I can't envision this >> being useful. > > If the data format describes its encoding, possibly containing data from > several encodings in various spots, then perhaps it is best read as > binary, and processed as binary until those definitions are found. > > But an alternative would be to read with surrogate escapes, and then > when the encoding is determined, to transcode the data. Previously, a > proposal was made to reverse the surrogate escapes to the original > bytes, and then apply the (now known) appropriate codec. There are not > appropriate codecs that can convert directly from surrogate escapes to > the desired end result. 
This technique could be used instead, for > single-byte, non-escaped encodings. On the other hand, writing specialty > codecs for the purpose would be more general. > There'll be a surrogate escape if a byte couldn't be decoded, but just because a byte could be decoded, it doesn't mean that it's correct. If you picked the wrong encoding, the other codepoints could be wrong too. From ncoghlan at gmail.com Thu Aug 28 14:26:16 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Thu, 28 Aug 2014 22:26:16 +1000 Subject: [Python-Dev] Cleaning up surrogate escaped strings (was Bytes path related questions for Guido) In-Reply-To: References: Message-ID: On 26 Aug 2014 21:34, "MRAB" wrote: > > On 2014-08-26 03:11, Stephen J. Turnbull wrote: >> >> Nick Coghlan writes: >> >> > "purge_surrogate_escapes" was the other term that occurred to me. >> >> "purge" suggests removal, not replacement. That may be useful too. >> >> neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD') >> > How about: > > replace_surrogate_escapes(s, replacement='\uFFFD') > > If you want them removed, just pass an empty string as the replacement. The current proposal on the issue tracker is to instead take advantage of the existing error handlers: def convert_surrogateescape(data, errors='replace'): return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors) That code is short, but semantically dense - it took a few iterations to come up with that version. (Added bonus: once you're alerted to the possibility, it's trivial to write your own version for existing Python 3 versions. The standard name just makes it easier to look up when you come across it in a piece of code, and provides the option of optimising it later if it ever seems worth the extra work) I also filed a separate RFE to make backslashreplace usable on input, since that allows the option of separating the replacement operation from the encoding operation. Cheers, Nick. 
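Nick's proposed helper is runnable as-is on existing Python 3 versions and composes with the standard error handlers:

```python
def convert_surrogateescape(data, errors='replace'):
    # Re-encode to recover the smuggled bytes, then decode again, letting a
    # standard error handler deal with whatever is still undecodable.
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

s = b'ok\xff'.decode('ascii', 'surrogateescape')    # 0xFF arrives as U+DCFF
```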
-------------- next part -------------- An HTML attachment was scrubbed... URL: From p.f.moore at gmail.com Thu Aug 28 15:22:55 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Thu, 28 Aug 2014 14:22:55 +0100 Subject: [Python-Dev] Windows Unicode console support [Was: Bytes path support] In-Reply-To: References: Message-ID: On 27 August 2014 10:46, Paul Moore wrote: > If I come up with anything worth commenting on, I will do so (I assume > that comments of the form "+1 me too!" are not needed ;-)) Nevertheless, here's a "Me, too". I've just been writing some PyPI interrogation scripts, and it's absolutely awful having to deal with random encoding errors in the output. Being able to just print *anything* is a HUGE benefit. This is how sys.stdout should behave - presumably the Unix guys are now all rolling their eyes and saying "but it does - just use a proper OS" :-) Enlightened-ly y'rs, Paul From v+python at g.nevcal.com Thu Aug 28 19:15:40 2014 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 28 Aug 2014 10:15:40 -0700 Subject: [Python-Dev] Bytes path related questions for Guido In-Reply-To: <53FEDA9F.6090201@mrabarnett.plus.com> References: <878umcc84s.fsf@uwakimon.sk.tsukuba.ac.jp> <53FC7007.2060502@mrabarnett.plus.com> <53FE2192.7050206@g.nevcal.com> <87fvghbeuc.fsf@uwakimon.sk.tsukuba.ac.jp> <53FEB692.7000207@g.nevcal.com> <53FEDA9F.6090201@mrabarnett.plus.com> Message-ID: <53FF63BC.3020603@g.nevcal.com> On 8/28/2014 12:30 AM, MRAB wrote: > On 2014-08-28 05:56, Glenn Linderman wrote: >> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote: >>> Glenn Linderman writes: >>> > On 8/26/2014 4:31 AM, MRAB wrote: >>> > > On 2014-08-26 03:11, Stephen J. Turnbull wrote: >>> > >> Nick Coghlan writes: >>> >>> > > How about: >>> > > >>> > > replace_surrogate_escapes(s, replacement='\uFFFD') >>> > > >>> > > If you want them removed, just pass an empty string as the >>> > > replacement. >>> >>> That seems better to me (I had too much C for breakfast, I think). 
>>> >>> > And further, replacement could be a vector of 128 characters, to do >>> > immediate transcoding, >>> >>> Using what encoding? >> >> The vector would contain the transcoding. Each lone surrogate would map >> to a character in the vector. >> >>> If you knew that much, why didn't you use >>> (write, if necessary) an appropriate codec? I can't envision this >>> being useful. >> >> If the data format describes its encoding, possibly containing data from >> several encodings in various spots, then perhaps it is best read as >> binary, and processed as binary until those definitions are found. >> >> But an alternative would be to read with surrogate escapes, and then >> when the encoding is determined, to transcode the data. Previously, a >> proposal was made to reverse the surrogate escapes to the original >> bytes, and then apply the (now known) appropriate codec. There are not >> appropriate codecs that can convert directly from surrogate escapes to >> the desired end result. This technique could be used instead, for >> single-byte, non-escaped encodings. On the other hand, writing specialty >> codecs for the purpose would be more general. >> > There'll be a surrogate escape if a byte couldn't be decoded, but just > because a byte could be decoded, it doesn't mean that it's correct. > > If you picked the wrong encoding, the other codepoints could be wrong > too. Aha! Thanks for pointing out the flaw in my reasoning. But that means it is also pretty useless to "replace_surrogate_escapes" at all, because it only cleans out the non-decodable characters, not the incorrectly decoded characters. -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdmurray at bitdance.com Thu Aug 28 19:41:03 2014 From: rdmurray at bitdance.com (R. 
David Murray) Date: Thu, 28 Aug 2014 13:41:03 -0400 Subject: [Python-Dev] Bytes path related questions for Guido In-Reply-To: <53FF63BC.3020603@g.nevcal.com> References: <878umcc84s.fsf@uwakimon.sk.tsukuba.ac.jp> <53FC7007.2060502@mrabarnett.plus.com> <53FE2192.7050206@g.nevcal.com> <87fvghbeuc.fsf@uwakimon.sk.tsukuba.ac.jp> <53FEB692.7000207@g.nevcal.com> <53FEDA9F.6090201@mrabarnett.plus.com> <53FF63BC.3020603@g.nevcal.com> Message-ID: <20140828174104.56783250E01@webabinitio.net> On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman wrote: > On 8/28/2014 12:30 AM, MRAB wrote: > > On 2014-08-28 05:56, Glenn Linderman wrote: > >> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote: > >>> Glenn Linderman writes: > >>> > On 8/26/2014 4:31 AM, MRAB wrote: > >>> > > On 2014-08-26 03:11, Stephen J. Turnbull wrote: > >>> > >> Nick Coghlan writes: > >>> > >>> > > How about: > >>> > > > >>> > > replace_surrogate_escapes(s, replacement='\uFFFD') > >>> > > > >>> > > If you want them removed, just pass an empty string as the > >>> > > replacement. > >>> > >>> That seems better to me (I had too much C for breakfast, I think). > >>> > >>> > And further, replacement could be a vector of 128 characters, to do > >>> > immediate transcoding, > >>> > >>> Using what encoding? > >> > >> The vector would contain the transcoding. Each lone surrogate would map > >> to a character in the vector. > >> > >>> If you knew that much, why didn't you use > >>> (write, if necessary) an appropriate codec? I can't envision this > >>> being useful. > >> > >> If the data format describes its encoding, possibly containing data from > >> several encodings in various spots, then perhaps it is best read as > >> binary, and processed as binary until those definitions are found. > >> > >> But an alternative would be to read with surrogate escapes, and then > >> when the encoding is determined, to transcode the data. 
Previously, a > >> proposal was made to reverse the surrogate escapes to the original > >> bytes, and then apply the (now known) appropriate codec. There are not > >> appropriate codecs that can convert directly from surrogate escapes to > >> the desired end result. This technique could be used instead, for > >> single-byte, non-escaped encodings. On the other hand, writing specialty > >> codecs for the purpose would be more general. > >> > > There'll be a surrogate escape if a byte couldn't be decoded, but just > > because a byte could be decoded, it doesn't mean that it's correct. > > > > If you picked the wrong encoding, the other codepoints could be wrong > > too. > > Aha! Thanks for pointing out the flaw in my reasoning. But that means it > is also pretty useless to "replace_surrogate_escapes" at all, because it > only cleans out the non-decodable characters, not the incorrectly > decoded characters. Well, replace would still be useful for ASCII+surrogateescape. Also for cases where the data stream is *supposed* to be in a given encoding, but contains undecodable bytes. Showing the stuff that incorrectly decodes as whatever it decodes to is generally what you want in that case. --David From v+python at g.nevcal.com Thu Aug 28 19:54:44 2014 From: v+python at g.nevcal.com (Glenn Linderman) Date: Thu, 28 Aug 2014 10:54:44 -0700 Subject: [Python-Dev] Bytes path related questions for Guido In-Reply-To: <20140828174104.56783250E01@webabinitio.net> References: <878umcc84s.fsf@uwakimon.sk.tsukuba.ac.jp> <53FC7007.2060502@mrabarnett.plus.com> <53FE2192.7050206@g.nevcal.com> <87fvghbeuc.fsf@uwakimon.sk.tsukuba.ac.jp> <53FEB692.7000207@g.nevcal.com> <53FEDA9F.6090201@mrabarnett.plus.com> <53FF63BC.3020603@g.nevcal.com> <20140828174104.56783250E01@webabinitio.net> Message-ID: <53FF6CE4.8010305@g.nevcal.com> On 8/28/2014 10:41 AM, R. 
David Murray wrote: > On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman wrote: >> On 8/28/2014 12:30 AM, MRAB wrote: >>> On 2014-08-28 05:56, Glenn Linderman wrote: >>>> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote: >>>>> Glenn Linderman writes: >>>>> > On 8/26/2014 4:31 AM, MRAB wrote: >>>>> > > On 2014-08-26 03:11, Stephen J. Turnbull wrote: >>>>> > >> Nick Coghlan writes: >>>>> >>>>> > > How about: >>>>> > > >>>>> > > replace_surrogate_escapes(s, replacement='\uFFFD') >>>>> > > >>>>> > > If you want them removed, just pass an empty string as the >>>>> > > replacement. >>>>> >>>>> That seems better to me (I had too much C for breakfast, I think). >>>>> >>>>> > And further, replacement could be a vector of 128 characters, to do >>>>> > immediate transcoding, >>>>> >>>>> Using what encoding? >>>> The vector would contain the transcoding. Each lone surrogate would map >>>> to a character in the vector. >>>> >>>>> If you knew that much, why didn't you use >>>>> (write, if necessary) an appropriate codec? I can't envision this >>>>> being useful. >>>> If the data format describes its encoding, possibly containing data from >>>> several encodings in various spots, then perhaps it is best read as >>>> binary, and processed as binary until those definitions are found. >>>> >>>> But an alternative would be to read with surrogate escapes, and then >>>> when the encoding is determined, to transcode the data. Previously, a >>>> proposal was made to reverse the surrogate escapes to the original >>>> bytes, and then apply the (now known) appropriate codec. There are not >>>> appropriate codecs that can convert directly from surrogate escapes to >>>> the desired end result. This technique could be used instead, for >>>> single-byte, non-escaped encodings. On the other hand, writing specialty >>>> codecs for the purpose would be more general. 
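The read-with-surrogateescape-then-transcode workflow described above can be sketched as follows. This is an illustrative example, not code from the thread; 'latin-1' stands in for whatever encoding the data format later declares:

```python
# Read data of unknown encoding with surrogateescape, then transcode once
# the real encoding is discovered. 'latin-1' is an assumed example here.
raw = b'caf\xe9 au lait'                      # actually latin-1, unknown at read time
s = raw.decode('ascii', 'surrogateescape')    # undecodable byte smuggled as U+DCE9
assert s == 'caf\udce9 au lait'

# Later, once the real encoding is known, round-trip through the bytes:
fixed = s.encode('ascii', 'surrogateescape').decode('latin-1')
assert fixed == 'caf\xe9 au lait'             # i.e. 'café au lait'
```

This is the "reverse the surrogate escapes to the original bytes, then apply the now-known codec" technique the thread refers to.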
>>>> >>> There'll be a surrogate escape if a byte couldn't be decoded, but just >>> because a byte could be decoded, it doesn't mean that it's correct. >>> >>> If you picked the wrong encoding, the other codepoints could be wrong >>> too. >> Aha! Thanks for pointing out the flaw in my reasoning. But that means it >> is also pretty useless to "replace_surrogate_escapes" at all, because it >> only cleans out the non-decodable characters, not the incorrectly >> decoded characters. > Well, replace would still be useful for ASCII+surrogateescape. How? > Also for > cases where the data stream is *supposed* to be in a given encoding, but > contains undecodable bytes. Showing the stuff that incorrectly decodes > as whatever it decodes to is generally what you want in that case. Sure, people can learn to recognize mojibake for what it is, and maybe even learn to recognize it for what it was intended to be, in limited domains. But suppressing/replacing the surrogates doesn't help with that... would it not be better to replace the surrogates with an escape sequence that shows the original, undecodable, byte value? Like \xNN ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdmurray at bitdance.com Thu Aug 28 20:43:51 2014 From: rdmurray at bitdance.com (R. David Murray) Date: Thu, 28 Aug 2014 14:43:51 -0400 Subject: [Python-Dev] Bytes path related questions for Guido In-Reply-To: <53FF6CE4.8010305@g.nevcal.com> References: <878umcc84s.fsf@uwakimon.sk.tsukuba.ac.jp> <53FC7007.2060502@mrabarnett.plus.com> <53FE2192.7050206@g.nevcal.com> <87fvghbeuc.fsf@uwakimon.sk.tsukuba.ac.jp> <53FEB692.7000207@g.nevcal.com> <53FEDA9F.6090201@mrabarnett.plus.com> <53FF63BC.3020603@g.nevcal.com> <20140828174104.56783250E01@webabinitio.net> <53FF6CE4.8010305@g.nevcal.com> Message-ID: <20140828184352.6DC2B250DC1@webabinitio.net> On Thu, 28 Aug 2014 10:54:44 -0700, Glenn Linderman wrote: > On 8/28/2014 10:41 AM, R. 
David Murray wrote:
> On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman wrote:
> >> On 8/28/2014 12:30 AM, MRAB wrote:
> >>> There'll be a surrogate escape if a byte couldn't be decoded, but just
> >>> because a byte could be decoded, it doesn't mean that it's correct.
> >>>
> >>> If you picked the wrong encoding, the other codepoints could be wrong
> >>> too.
> >> Aha! Thanks for pointing out the flaw in my reasoning. But that means it
> >> is also pretty useless to "replace_surrogate_escapes" at all, because it
> >> only cleans out the non-decodable characters, not the incorrectly
> >> decoded characters.
> > Well, replace would still be useful for ASCII+surrogateescape.
>
> How?

Because there "can't" be any incorrectly decoded bytes in the ASCII part, so all undecodable bytes turning into 'unrecognized character' glyphs is useful. "can't" is in quotes because of course if you decode random binary data as ASCII+surrogate escape you could get a mess just like any other encoding, so this is really a "more *likely* to be useful" version of my second point, because "real" ASCII with some junk bytes mixed in is much more likely to be encountered in the wild than, say, utf-8 with some junk bytes mixed in (although this is probably changing as use of utf-8 becomes more widespread, so this point applies to utf-8 as well).

> > Also for
> > cases where the data stream is *supposed* to be in a given encoding, but
> > contains undecodable bytes. Showing the stuff that incorrectly decodes
> > as whatever it decodes to is generally what you want in that case.
>
> Sure, people can learn to recognize mojibake for what it is, and maybe
> even learn to recognize it for what it was intended to be, in limited
> domains. But suppressing/replacing the surrogates doesn't help with

Well, it does if the alternative is not being able to display the string to the user at all.
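David's point about ASCII+surrogateescape can be illustrated with a small sketch (my example, not from the thread):

```python
# With ASCII + surrogateescape, anything that decoded normally really is
# ASCII, so the only "suspect" characters are the escaped junk bytes.
junk = b'log line \xfe end'.decode('ascii', 'surrogateescape')
assert junk == 'log line \udcfe end'

# A lossy but displayable form: only the smuggled junk byte is replaced.
assert junk.encode('ascii', 'replace').decode('ascii') == 'log line ? end'

# The original bytes remain recoverable from the unmodified string:
assert junk.encode('ascii', 'surrogateescape') == b'log line \xfe end'
```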
And yeah, people being able to recognize mojibake in specific problem domains is what I'm talking about...not perhaps a great use case, but it is a use case. > that... would it not be better to replace the surrogates with an escape > sequence that shows the original, undecodable, byte value? Like \xNN ? Yeah, that idea has been floated as well, and I think it would indeed be more useful than the 'unknown character' glyph. I've also seen fonts that display the hex code inside a box character when the code point is unknown, which would be cool...but that can hardly be part of unicode, can it? :) --David From stephen at xemacs.org Fri Aug 29 02:32:58 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 29 Aug 2014 09:32:58 +0900 Subject: [Python-Dev] Cleaning up surrogate escaped strings (was Bytes path related questions for Guido) In-Reply-To: References: Message-ID: <87a96ob0ed.fsf@uwakimon.sk.tsukuba.ac.jp> Nick Coghlan writes: > The current proposal on the issue tracker is to instead take advantage of > the existing error handlers: > > def convert_surrogateescape(data, errors='replace'): > return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors) > > That code is short, but semantically dense And it doesn't implement your original suggestion of replacement with '?' (and another possibility for history buffs is 0x1A, ASCII SUB). At least, AFAICT from the docs there's no way to specify the replacement character; decoding always uses U+FFFD. (If I knew how to do that, I would have suggested this.) > (Added bonus: once you're alerted to the possibility, it's trivial > to write your own version for existing Python 3 versions. I'm not sure that's true. At least, to me that code was obvious -- I got the exact definition (except for the function name) on the first try -- but I ruled it out because it didn't implement your suggestion of replacement with '?', even as an option. 
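The ``convert_surrogateescape`` helper quoted earlier in this thread behaves as follows; the function body is the one from the tracker proposal, while the sample bytes are my own illustration:

```python
def convert_surrogateescape(data, errors='replace'):
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

# 0xE9 is not valid UTF-8 here, so surrogateescape smuggles it in as U+DCE9:
s = b'ok \xe9 ko'.decode('utf-8', 'surrogateescape')
assert s == 'ok \udce9 ko'

# Default behaviour: the escaped byte becomes U+FFFD (not a chosen character,
# which is exactly the limitation Stephen points out above):
assert convert_surrogateescape(s) == 'ok \ufffd ko'

# With 'ignore', the escaped byte is simply dropped:
assert convert_surrogateescape(s, 'ignore') == 'ok  ko'
```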
OTOH, I think a lot of the resistance to codec-based solutions is the misconception that en/decoding streams is expensive, or the misconception that Python's internal representation of text as an array of code points (rather than an array of "characters" or "grapheme clusters") is somehow insufficient for text processing. Steve From stephen at xemacs.org Fri Aug 29 02:41:03 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 29 Aug 2014 09:41:03 +0900 Subject: [Python-Dev] surrogatepass - she's a witch, burn 'er! [was: Cleaning up ...] In-Reply-To: References: Message-ID: <878um8b00w.fsf@uwakimon.sk.tsukuba.ac.jp> In the process of booking up for my other post in this thread, I noticed the 'surrogatepass' handler. Is there a real use case for the 'surrogatepass' error handler? It seems like a horrible break in the abstraction. IMHO, if there's a need, the application should handle this. Python shouldn't provide it on encoding as the resulting streams are not Unicode conformant, nor on decoding UTF-16, as conversion of surrogate pairs is a requirement of all Unicode versions since about 1995. Steve From ncoghlan at gmail.com Fri Aug 29 06:55:39 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Fri, 29 Aug 2014 14:55:39 +1000 Subject: [Python-Dev] Cleaning up surrogate escaped strings (was Bytes path related questions for Guido) In-Reply-To: <87a96ob0ed.fsf@uwakimon.sk.tsukuba.ac.jp> References: <87a96ob0ed.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: On 29 August 2014 10:32, Stephen J. Turnbull wrote: > Nick Coghlan writes: > > > The current proposal on the issue tracker is to instead take advantage of > > the existing error handlers: > > > > def convert_surrogateescape(data, errors='replace'): > > return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors) > > > > That code is short, but semantically dense > > And it doesn't implement your original suggestion of replacement with > '?' 
(and another possibility for history buffs is 0x1A, ASCII SUB). At > least, AFAICT from the docs there's no way to specify the replacement > character; decoding always uses U+FFFD. (If I knew how to do that, I > would have suggested this.) If that actually matters in a given context, I can do an ordinary string replacement later. I couldn't think of a case where it actually mattered though - if "must be ASCII" was a requirement, then backslashreplace was a suitable alternative that lost less information (hence the RFE to make that also usable on input). > > (Added bonus: once you're alerted to the possibility, it's trivial > > to write your own version for existing Python 3 versions. > > I'm not sure that's true. At least, to me that code was obvious -- I > got the exact definition (except for the function name) on the first > try -- but I ruled it out because it didn't implement your suggestion > of replacement with '?', even as an option. Yeah, part of the tracker discussion involved me realising that part wasn't a necessary requirement - the key is being able to get rid of the surrogates, or replace them with something readily identifiable, and less about being able to control exactly what they get replaced by. > OTOH, I think a lot of the resistance to codec-based solutions is the > misconception that en/decoding streams is expensive, or the > misconception that Python's internal representation of text as an > array of code points (rather than an array of "characters" or > "grapheme clusters") is somehow insufficient for text processing. We don't actually have any technical deep dives into how Python 3's text handling works readily available online, so there's a lot of speculation and misinformation floating around. 
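The "backslashreplace usable on input" RFE mentioned above is issue 22286; it was later implemented in Python 3.5. On older versions the same effect can be approximated by hand. A sketch (the helper name is mine, not from the thread):

```python
import re

def backslashreplace_decode(data, encoding='utf-8'):
    # Hypothetical portable stand-in for Python 3.5+'s
    # data.decode(encoding, 'backslashreplace').
    s = data.decode(encoding, 'surrogateescape')
    # Map each smuggled surrogate (U+DC80..U+DCFF) back to a \xNN escape:
    return re.sub('[\udc80-\udcff]',
                  lambda m: '\\x%02x' % (ord(m.group()) - 0xdc00), s)

assert backslashreplace_decode(b'a\xffb') == 'a\\xffb'

# On Python 3.5+ the error handler does this directly:
assert b'a\xffb'.decode('utf-8', 'backslashreplace') == 'a\\xffb'
```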
My recent article gives the high level context, but it really needs to be paired up with a piece (or pieces) that go deep into the details of codec optimisation, the UTF-8 caching, how it integrates with the UTF-16-LE Windows APIs, how the internal storage structure is determined at allocation time, how it maintains compatibility with the legacy C extension APIs, etc. The only current widely distributed articles on those topics are written from a perspective that assumes we don't know anything about Unicode, and are just making things unnecessarily complicated (rather than solving hard cross platform compatibility and text processing performance problems). That perspective is incorrect, but "trust me, they're wrong" doesn't work very well with people that are already angry. Text manipulation is one of the most sophisticated subsystems in the interpreter, though, so it's hard to know where to start on such a series (and easy to get intimidated by the sheer magnitude of the work involved in doing it right). Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From mal at egenix.com Fri Aug 29 09:48:50 2014 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 29 Aug 2014 09:48:50 +0200 Subject: [Python-Dev] surrogatepass - she's a witch, burn 'er! [was: Cleaning up ...] In-Reply-To: <878um8b00w.fsf@uwakimon.sk.tsukuba.ac.jp> References: <878um8b00w.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <54003062.1040201@egenix.com> On 29.08.2014 02:41, Stephen J. Turnbull wrote: > In the process of booking up for my other post in this thread, I > noticed the 'surrogatepass' handler. > > Is there a real use case for the 'surrogatepass' error handler? It > seems like a horrible break in the abstraction. IMHO, if there's a > need, the application should handle this. 
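For readers unfamiliar with the handler being questioned here, a minimal sketch of what 'surrogatepass' permits in CPython 3 (my example):

```python
# 'surrogatepass' lets lone surrogates (valid code points, but forbidden
# in conformant UTF-8) round-trip through the UTF codecs.
lone = '\ud800'
data = lone.encode('utf-8', 'surrogatepass')
assert data == b'\xed\xa0\x80'                      # CESU-8-style byte sequence
assert data.decode('utf-8', 'surrogatepass') == lone

# The strict codec rejects the very same bytes:
try:
    data.decode('utf-8')
except UnicodeDecodeError:
    pass
else:
    raise AssertionError('strict UTF-8 accepted a lone surrogate')
```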
Python shouldn't provide > it on encoding as the resulting streams are not Unicode conformant, > nor on decoding UTF-16, as conversion of surrogate pairs is a > requirement of all Unicode versions since about 1995. This error handler allows applications to reactivate the Python 2 style behavior of the UTF codecs in Python 3, which allow reading lone surrogates on input. Since Python allows working with lone surrogates in Unicode (they are valid code points) and we're using UTF-8 for marshal, we needed a way to make sure that Python 3 also optionally supports working with lone surrogates in such UTF-8 streams (nowadays called CESU-8: http://en.wikipedia.org/wiki/CESU-8). See http://bugs.python.org/issue3672 http://bugs.python.org/issue12892 for discussions. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 29 2014) >>> Python Projects, Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2014-08-27: Released eGenix PyRun 2.0.1 ... http://egenix.com/go62 2014-09-19: PyCon UK 2014, Coventry, UK ... 21 days to go 2014-09-27: PyDDF Sprint 2014 ... 29 days to go eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. 
Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From walter at livinglogic.de Fri Aug 29 12:09:54 2014 From: walter at livinglogic.de (Walter =?utf-8?q?D=C3=B6rwald?=) Date: Fri, 29 Aug 2014 12:09:54 +0200 Subject: [Python-Dev] Bytes path related questions for Guido In-Reply-To: <53FF6CE4.8010305@g.nevcal.com> References: <878umcc84s.fsf@uwakimon.sk.tsukuba.ac.jp> <53FC7007.2060502@mrabarnett.plus.com> <53FE2192.7050206@g.nevcal.com> <87fvghbeuc.fsf@uwakimon.sk.tsukuba.ac.jp> <53FEB692.7000207@g.nevcal.com> <53FEDA9F.6090201@mrabarnett.plus.com> <53FF63BC.3020603@g.nevcal.com> <20140828174104.56783250E01@webabinitio.net> <53FF6CE4.8010305@g.nevcal.com> Message-ID: <7BDC7953-5D22-460B-96CF-977921EF9652@livinglogic.de> On 28 Aug 2014, at 19:54, Glenn Linderman wrote: > On 8/28/2014 10:41 AM, R. David Murray wrote: >> On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman >> wrote: >> [...] >> Also for >> cases where the data stream is *supposed* to be in a given encoding, >> but >> contains undecodable bytes. Showing the stuff that incorrectly >> decodes >> as whatever it decodes to is generally what you want in that case. > Sure, people can learn to recognize mojibake for what it is, and maybe > even learn to recognize it for what it was intended to be, in limited > domains. But suppressing/replacing the surrogates doesn't help with > that... would it not be better to replace the surrogates with an > escape sequence that shows the original, undecodable, byte value? > Like \xNN ? For that we could extend the "backslashreplace" codec error callback, so that it can be used for decoding too, not just for encoding. I.e. b"a\xffb".decode("utf-8", "backslashreplace") would return "a\\xffb" Servus, Walter From mal at egenix.com Fri Aug 29 14:18:34 2014 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 29 Aug 2014 14:18:34 +0200 Subject: [Python-Dev] surrogatepass - she's a witch, burn 'er! [was: Cleaning up ...] 
In-Reply-To: References: <878um8b00w.fsf@uwakimon.sk.tsukuba.ac.jp> <54003062.1040201@egenix.com>
Message-ID: <54006F9A.6090704@egenix.com>

On 29.08.2014 13:22, Isaac Morland wrote:
> On Fri, 29 Aug 2014, M.-A. Lemburg wrote:
>
>> On 29.08.2014 02:41, Stephen J. Turnbull wrote:
>> Since Python allows working with lone surrogates in Unicode (they
>> are valid code points) and we're using UTF-8 for marshal, we needed
>> a way to make sure that Python 3 also optionally supports working
>> with lone surrogates in such UTF-8 streams (nowadays called CESU-8:
>> http://en.wikipedia.org/wiki/CESU-8).
>
> If I want that wouldn't I specify "cesu-8" as the encoding?
>
> i.e., instead of .decode ('utf-8') I would use .decode ('cesu-8'). Right now, trying this I get
> that cesu-8 is an unknown encoding but that could be changed without affecting the behaviour of the
> utf-8 codec.

Why write a new codec that's almost identical to the utf-8 codec, if you can get the same functionality by explicitly using a special error handler? From a maintenance POV that does not sound like a good approach.

> It seems to me that .decode ('utf-8') should decode exactly and only valid utf-8, including the
> non-use of surrogate pairs as an intermediate encoding step.

It does in Python 3.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Aug 29 2014)
>>> Python Projects, Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
________________________________________________________________________
2014-08-27: Released eGenix PyRun 2.0.1 ... http://egenix.com/go62
2014-09-19: PyCon UK 2014, Coventry, UK ... 21 days to go
2014-09-27: PyDDF Sprint 2014 ... 29 days to go

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math.
Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From ijmorlan at uwaterloo.ca Fri Aug 29 13:22:10 2014 From: ijmorlan at uwaterloo.ca (Isaac Morland) Date: Fri, 29 Aug 2014 07:22:10 -0400 (EDT) Subject: [Python-Dev] surrogatepass - she's a witch, burn 'er! [was: Cleaning up ...] In-Reply-To: <54003062.1040201@egenix.com> References: <878um8b00w.fsf@uwakimon.sk.tsukuba.ac.jp> <54003062.1040201@egenix.com> Message-ID: On Fri, 29 Aug 2014, M.-A. Lemburg wrote: > On 29.08.2014 02:41, Stephen J. Turnbull wrote: > Since Python allows working with lone surrogates in Unicode (they > are valid code points) and we're using UTF-8 for marshal, we needed > a way to make sure that Python 3 also optionally supports working > with lone surrogates in such UTF-8 streams (nowadays called CESU-8: > http://en.wikipedia.org/wiki/CESU-8). If I want that wouldn't I specify "cesu-8" as the encoding? i.e., instead of .decode ('utf-8') I would use .decode ('cesu-8'). Right now, trying this I get that cesu-8 is an unknown encoding but that could be changed without affecting the behaviour of the utf-8 codec. It seems to me that .decode ('utf-8') should decode exactly and only valid utf-8, including the non-use of surrogate pairs as an intermediate encoding step. Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist From status at bugs.python.org Fri Aug 29 18:08:07 2014 From: status at bugs.python.org (Python tracker) Date: Fri, 29 Aug 2014 18:08:07 +0200 (CEST) Subject: [Python-Dev] Summary of Python tracker Issues Message-ID: <20140829160807.6D4825640B@psf.upfronthosting.co.za> ACTIVITY SUMMARY (2014-08-22 - 2014-08-29) Python tracker at http://bugs.python.org/ To view or respond to any of the issues listed below, click on the issue. Do NOT respond to this message. 
Issues counts and deltas: open 4638 (+17) closed 29431 (+32) total 34069 (+49) Open issues with patches: 2193 Issues opened (41) ================== #17095: Modules/Setup *shared* support broken http://bugs.python.org/issue17095 reopened by haypo #22200: Remove distutils checks for Python version http://bugs.python.org/issue22200 reopened by Arfrever #22232: str.splitlines splitting on non-\r\n characters http://bugs.python.org/issue22232 reopened by terry.reedy #22252: ssl blocking IO errors http://bugs.python.org/issue22252 opened by h.venev #22253: ConfigParser does not handle files without sections http://bugs.python.org/issue22253 opened by kernc #22255: Multiprocessing freeze_support raises RuntimeError http://bugs.python.org/issue22255 opened by Michael.McAuliffe #22256: pyvenv should display a progress indicator while creating an e http://bugs.python.org/issue22256 opened by ncoghlan #22257: PEP 432: Redesign the interpreter startup sequence http://bugs.python.org/issue22257 opened by ncoghlan #22258: set_inheritable(): ioctl(FIOCLEX) is available but fails with http://bugs.python.org/issue22258 opened by igor.pashev #22260: Rearrange tkinter tests, use test discovery http://bugs.python.org/issue22260 opened by zach.ware #22261: Document how to use Concurrent Build when using MsBuild http://bugs.python.org/issue22261 opened by sbspider #22263: Add a resource for CLI tests http://bugs.python.org/issue22263 opened by serhiy.storchaka #22264: Add wsgiref.util helpers for dealing with "WSGI strings" http://bugs.python.org/issue22264 opened by ncoghlan #22268: mrohasattr and mrogetattr http://bugs.python.org/issue22268 opened by Gregory.Salvan #22269: Resolve distutils option conflicts with priorities http://bugs.python.org/issue22269 opened by minrk #22270: cache version selection for documentation http://bugs.python.org/issue22270 opened by thejj #22271: Deprecate PyUnicode_AsUnicode(): emit a DeprecationWarning http://bugs.python.org/issue22271 opened by haypo 
#22273: abort when passing certain structs by value using ctypes http://bugs.python.org/issue22273 opened by weeble #22274: subprocess.Popen(stderr=STDOUT) fails to redirect subprocess s http://bugs.python.org/issue22274 opened by akira #22275: asyncio: enhance documentation of OS support http://bugs.python.org/issue22275 opened by haypo #22276: pathlib glob issues http://bugs.python.org/issue22276 opened by joca.bt #22277: webbrowser.py add parameters to suppress output on stdout and http://bugs.python.org/issue22277 opened by CristianCantoro #22278: urljoin duplicate slashes http://bugs.python.org/issue22278 opened by demian.brecht #22279: read() vs read1() in asyncio.StreamReader documentation http://bugs.python.org/issue22279 opened by oconnor663 #22281: ProcessPoolExecutor/ThreadPoolExecutor should provide introspe http://bugs.python.org/issue22281 opened by dan.oreilly #22282: ipaddress module accepts octal formatted IPv4 addresses in IPv http://bugs.python.org/issue22282 opened by xZise #22283: "AMD64 FreeBSD 9.0 3.x" fails to build the _decimal module: #e http://bugs.python.org/issue22283 opened by haypo #22284: decimal module contains less symbols when the _decimal module http://bugs.python.org/issue22284 opened by haypo #22285: The Modules/ directory should not be added to sys.path http://bugs.python.org/issue22285 opened by haypo #22286: Allow backslashreplace error handler to be used on input http://bugs.python.org/issue22286 opened by ncoghlan #22289: support.transient_internet() doesn't catch timeout on FTP test http://bugs.python.org/issue22289 opened by haypo #22290: "AMD64 OpenIndiana 3.x" buildbot: assertion failed in PyObject http://bugs.python.org/issue22290 opened by haypo #22292: pickle whichmodule RuntimeError http://bugs.python.org/issue22292 opened by attilio.dinisio #22293: unittest.mock: use slots in MagicMock to reduce memory footpri http://bugs.python.org/issue22293 opened by james-w #22294: 2to3 consuming_calls: len, min, max, zip, 
map, reduce, filter http://bugs.python.org/issue22294 opened by eddygeek #22295: Clarify available commands for package installation http://bugs.python.org/issue22295 opened by ncoghlan #22296: cookielib uses time.time(), making incorrect checks of expirat http://bugs.python.org/issue22296 opened by regu0004 #22297: 2.7 json encoding broken for enums http://bugs.python.org/issue22297 opened by eddygeek #22298: Lib/warnings.py _show_warning does not protect against being c http://bugs.python.org/issue22298 opened by Julius.Lehmann-Richter #22299: resolve() on Windows makes some pathological paths unusable http://bugs.python.org/issue22299 opened by Kevin.Norris #22300: PEP 446 What's New Updates for 2.7.9 http://bugs.python.org/issue22300 opened by ncoghlan Most recent 15 issues with no replies (15) ========================================== #22300: PEP 446 What's New Updates for 2.7.9 http://bugs.python.org/issue22300 #22298: Lib/warnings.py _show_warning does not protect against being c http://bugs.python.org/issue22298 #22297: 2.7 json encoding broken for enums http://bugs.python.org/issue22297 #22296: cookielib uses time.time(), making incorrect checks of expirat http://bugs.python.org/issue22296 #22294: 2to3 consuming_calls: len, min, max, zip, map, reduce, filter http://bugs.python.org/issue22294 #22289: support.transient_internet() doesn't catch timeout on FTP test http://bugs.python.org/issue22289 #22286: Allow backslashreplace error handler to be used on input http://bugs.python.org/issue22286 #22278: urljoin duplicate slashes http://bugs.python.org/issue22278 #22275: asyncio: enhance documentation of OS support http://bugs.python.org/issue22275 #22271: Deprecate PyUnicode_AsUnicode(): emit a DeprecationWarning http://bugs.python.org/issue22271 #22268: mrohasattr and mrogetattr http://bugs.python.org/issue22268 #22255: Multiprocessing freeze_support raises RuntimeError http://bugs.python.org/issue22255 #22251: Various markup errors in documentation 
http://bugs.python.org/issue22251 #22249: Possibly incorrect example is given for socket.getaddrinfo() http://bugs.python.org/issue22249 #22246: add strptime(s, '%s') http://bugs.python.org/issue22246 Most recent 15 issues waiting for review (15) ============================================= #22300: PEP 446 What's New Updates for 2.7.9 http://bugs.python.org/issue22300 #22294: 2to3 consuming_calls: len, min, max, zip, map, reduce, filter http://bugs.python.org/issue22294 #22292: pickle whichmodule RuntimeError http://bugs.python.org/issue22292 #22289: support.transient_internet() doesn't catch timeout on FTP test http://bugs.python.org/issue22289 #22285: The Modules/ directory should not be added to sys.path http://bugs.python.org/issue22285 #22282: ipaddress module accepts octal formatted IPv4 addresses in IPv http://bugs.python.org/issue22282 #22281: ProcessPoolExecutor/ThreadPoolExecutor should provide introspe http://bugs.python.org/issue22281 #22278: urljoin duplicate slashes http://bugs.python.org/issue22278 #22277: webbrowser.py add parameters to suppress output on stdout and http://bugs.python.org/issue22277 #22275: asyncio: enhance documentation of OS support http://bugs.python.org/issue22275 #22274: subprocess.Popen(stderr=STDOUT) fails to redirect subprocess s http://bugs.python.org/issue22274 #22269: Resolve distutils option conflicts with priorities http://bugs.python.org/issue22269 #22268: mrohasattr and mrogetattr http://bugs.python.org/issue22268 #22261: Document how to use Concurrent Build when using MsBuild http://bugs.python.org/issue22261 #22260: Rearrange tkinter tests, use test discovery http://bugs.python.org/issue22260 Top 10 most discussed issues (10) ================================= #18814: Add tools for "cleaning" surrogate escaped strings http://bugs.python.org/issue18814 15 msgs #22232: str.splitlines splitting on non-\r\n characters http://bugs.python.org/issue22232 13 msgs #22264: Add wsgiref.util helpers for dealing with "WSGI 
strings" http://bugs.python.org/issue22264 10 msgs #22194: access to cdecimal / libmpdec API http://bugs.python.org/issue22194 9 msgs #22277: webbrowser.py add parameters to suppress output on stdout and http://bugs.python.org/issue22277 9 msgs #22240: argparse support for "python -m module" in help http://bugs.python.org/issue22240 8 msgs #22261: Document how to use Concurrent Build when using MsBuild http://bugs.python.org/issue22261 8 msgs #22279: read() vs read1() in asyncio.StreamReader documentation http://bugs.python.org/issue22279 7 msgs #22285: The Modules/ directory should not be added to sys.path http://bugs.python.org/issue22285 7 msgs #21720: "TypeError: Item in ``from list'' not a string" message http://bugs.python.org/issue21720 6 msgs Issues closed (31) ================== #2527: Pass a namespace to timeit http://bugs.python.org/issue2527 closed by pitrou #6550: asyncore incorrect failure when connection is refused and usin http://bugs.python.org/issue6550 closed by haypo #11267: asyncore does not check for POLLERR and POLLHUP if neither rea http://bugs.python.org/issue11267 closed by haypo #16808: inspect.stack() should return list of named tuples http://bugs.python.org/issue16808 closed by pitrou #18530: posixpath.ismount performs extra lstat calls http://bugs.python.org/issue18530 closed by alex #19447: py_compile.compile raises if a file has bad encoding http://bugs.python.org/issue19447 closed by berker.peksag #20745: test_statistics fails in refleak mode http://bugs.python.org/issue20745 closed by zach.ware #20996: Backport TLS 1.1 and 1.2 support for ssl_version http://bugs.python.org/issue20996 closed by alex #21305: PEP 466: update os.urandom http://bugs.python.org/issue21305 closed by python-dev #22034: posixpath.join() and bytearray http://bugs.python.org/issue22034 closed by serhiy.storchaka #22042: signal.set_wakeup_fd(fd): raise an exception if the fd is in b http://bugs.python.org/issue22042 closed by haypo #22059: incorrect type 
conversion from str to bytes in asynchat module http://bugs.python.org/issue22059 closed by r.david.murray #22090: Decimal and float formatting treat '%' differently for infinit http://bugs.python.org/issue22090 closed by skrah #22182: distutils.file_util.move_file unpacks wrongly an exception http://bugs.python.org/issue22182 closed by berker.peksag #22199: 2.7 sysconfig._get_makefile_filename should be sysconfig.get_m http://bugs.python.org/issue22199 closed by ned.deily #22236: Do not use _default_root in Tkinter tests http://bugs.python.org/issue22236 closed by serhiy.storchaka #22239: asyncio: nested event loop http://bugs.python.org/issue22239 closed by gvanrossum #22243: Documentation on try statement incorrectly implies target of e http://bugs.python.org/issue22243 closed by terry.reedy #22244: load_verify_locations fails to handle unicode paths on Python http://bugs.python.org/issue22244 closed by python-dev #22250: unittest lowercase methods http://bugs.python.org/issue22250 closed by ezio.melotti #22254: match object generated by re.finditer cannot call groups() on http://bugs.python.org/issue22254 closed by leiju #22259: fdopen of directory causes segmentation fault http://bugs.python.org/issue22259 closed by python-dev #22262: Python External Libraries are stored in directory above where http://bugs.python.org/issue22262 closed by zach.ware #22265: fix reliance on refcounting in test_itertools http://bugs.python.org/issue22265 closed by python-dev #22266: fix reliance on refcounting in tarfile.gzopen http://bugs.python.org/issue22266 closed by python-dev #22267: fix reliance on refcounting in test_weakref http://bugs.python.org/issue22267 closed by python-dev #22272: sqlite3 memory leaks in cursor.execute http://bugs.python.org/issue22272 closed by haypo #22280: _decimal: successful import despite build failure http://bugs.python.org/issue22280 closed by skrah #22287: Use clock_gettime() in pytime.c http://bugs.python.org/issue22287 closed by haypo 
#22288: Incorrect Call grammar in documentation http://bugs.python.org/issue22288 closed by mjpieters
#22291: Typo in docs - Lib/random http://bugs.python.org/issue22291 closed by r.david.murray

From alex.gaynor at gmail.com Fri Aug 29 21:47:16 2014
From: alex.gaynor at gmail.com (Alex Gaynor)
Date: Fri, 29 Aug 2014 19:47:16 +0000 (UTC)
Subject: [Python-Dev] PEP 476: Enabling certificate validation by default!
Message-ID: 

Hi all,

I've just submitted PEP 476, on enabling certificate validation by default for
HTTPS clients in Python. Please have a look and let me know what you think.

PEP text follows.

Alex

---

PEP: 476
Title: Enabling certificate verification by default for stdlib http clients
Version: $Revision$
Last-Modified: $Date$
Author: Alex Gaynor
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 28-August-2014

Abstract
========

Currently when a standard library http client (the ``urllib`` and ``http``
modules) encounters an ``https://`` URL it will wrap the network HTTP traffic
in a TLS stream, as is necessary to communicate with such a server. However,
during the TLS handshake it will not actually check that the server's X509
certificate is signed by a CA in any trust root, nor will it verify that the
Common Name (or Subject Alternate Name) on the presented certificate matches
the requested host.

The failure to do these checks means that anyone with a privileged network
position is able to trivially execute a man in the middle attack against a
Python application using either of these HTTP clients, and change traffic at
will.

This PEP proposes to enable verification of X509 certificate signatures, as
well as hostname verification for Python's HTTP clients by default, subject
to opt-out on a per-call basis.

Rationale
=========

The "S" in "HTTPS" stands for secure. When Python's users type "HTTPS" they
are expecting a secure connection, and Python should adhere to a reasonable
standard of care in delivering this.
Currently we are failing at this, and in doing so, APIs which appear simple
are misleading users. When asked, many Python users state that they were not
aware that Python failed to perform these validations, and are shocked.

The popularity of ``requests`` (which enables these checks by default)
demonstrates that these checks are not overly burdensome in any way, and the
fact that it is widely recommended as a major security improvement over the
standard library clients demonstrates that many expect a higher standard for
"security by default" from their tools.

The failure of various applications to note Python's negligence in this
matter is a source of *regular* CVE assignment [#]_ [#]_ [#]_ [#]_ [#]_ [#]_
[#]_ [#]_ [#]_ [#]_ [#]_.

.. [#] https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-4340
.. [#] https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-3533
.. [#] https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-5822
.. [#] https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-5825
.. [#] https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-1909
.. [#] https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-2037
.. [#] https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-2073
.. [#] https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-2191
.. [#] https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-4111
.. [#] https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-6396
.. [#] https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-6444

Technical Details
=================

Python would use the system provided certificate database on all platforms.
Failure to locate such a database would be an error, and users would need to
explicitly specify a location to fix it.

This can be achieved by simply replacing the use of
``ssl._create_stdlib_context`` with ``ssl.create_default_context`` in
``http.client``.

Trust database
--------------

This PEP proposes using the system-provided certificate database.
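[Editor's note: a minimal sketch of the one-line change the Technical Details section describes, using only the stdlib ``ssl`` module. The context returned by ``ssl.create_default_context()`` (added in Python 3.4) already has both checks the PEP wants switched on.]

```python
import ssl

# What the PEP's proposed replacement buys: http.client would build its
# connections from ssl.create_default_context() instead of the internal
# unverified context.  The default context enables X509 chain
# verification and hostname checking, and loads the system trust store.
ctx = ssl.create_default_context()

assert ctx.verify_mode == ssl.CERT_REQUIRED  # chain verification on
assert ctx.check_hostname                    # CN/subjectAltName matching on
```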
Previous discussions have suggested bundling Mozilla's certificate database
and using that by default. This was decided against for several reasons:

* Using the platform trust database imposes a lower maintenance burden on
  the Python developers -- shipping our own trust database would require
  doing a release every time a certificate was revoked.
* Linux vendors, and other downstreams, would unbundle the Mozilla
  certificates, resulting in a more fragmented set of behaviors.
* Using the platform stores makes it easier to handle situations such as
  corporate internal CAs.

Backwards compatibility
-----------------------

This change will have the appearance of causing some HTTPS connections to
"break", because they will now raise an Exception during handshake. This is
misleading, however: in fact these connections are presently failing
silently, and an HTTPS URL indicates an expectation of confidentiality and
authentication. The fact that Python does not actually verify that the
user's request has been made securely is a bug, further: "Errors should
never pass silently."

Nevertheless, users who have a need to access servers with self-signed or
incorrect certificates would be able to do so by providing a context with
custom trust roots or which disables validation (documentation should
strongly recommend the former where possible). Users will also be able to
add necessary certificates to system trust stores in order to trust them
globally.

Twisted's 14.0 release made this same change, and it has been met with
almost no opposition.

Other protocols
===============

This PEP only proposes requiring this level of validation for HTTP clients,
not for other protocols such as SMTP. This is because while a high
percentage of HTTPS servers have correct certificates, as a result of the
validation performed by browsers, for other protocols self-signed or
otherwise incorrect certificates are far more common.
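[Editor's note: the two opt-outs the Backwards compatibility section mentions already exist as ``SSLContext`` operations. A sketch of both escape hatches; the ``cafile`` path is a placeholder, not a real file.]

```python
import ssl

# Escape hatch 1 (the one the PEP says documentation should recommend):
# keep full verification, but trust your own self-signed root.
pinned = ssl.create_default_context()
# The path is a placeholder for a real PEM bundle:
# pinned.load_verify_locations(cafile="/path/to/selfsigned.pem")

# Escape hatch 2: reproduce the old silent behaviour for a single call
# site.  check_hostname must be disabled before verify_mode is set to
# CERT_NONE, otherwise the context rejects the combination.
unverified = ssl.create_default_context()
unverified.check_hostname = False
unverified.verify_mode = ssl.CERT_NONE
```

Either context can then be passed to ``http.client`` / ``urllib.request`` via their ``context`` parameters.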
Note that for SMTP at least, this appears to be changing and should be
reviewed for a potential similar PEP in the future:

* https://www.facebook.com/notes/protect-the-graph/the-current-state-of-smtp-starttls-deployment/1453015901605223
* https://www.facebook.com/notes/protect-the-graph/massive-growth-in-smtp-starttls-deployment/1491049534468526

Python Versions
===============

This PEP proposes making these changes to the ``default`` (Python 3) branch.
I strongly believe these changes also belong in Python 2, but doing them in
a patch-release isn't reasonable, and there is strong opposition to doing a
2.8 release.

Copyright
=========

This document has been placed into the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8

From mal at egenix.com Fri Aug 29 22:00:00 2014
From: mal at egenix.com (M.-A. Lemburg)
Date: Fri, 29 Aug 2014 22:00:00 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by default!
In-Reply-To: 
References: 
Message-ID: <5400DBC0.1020700@egenix.com>

On 29.08.2014 21:47, Alex Gaynor wrote:
> Hi all,
>
> I've just submitted PEP 476, on enabling certificate validation by default for
> HTTPS clients in Python. Please have a look and let me know what you think.
>
> PEP text follows.

Thanks for the PEP. I think this is generally a good idea, but some
important parts are missing from the PEP:

* transition plan:

  I think starting with warnings in Python 3.5 and going for exceptions
  in 3.6 would make a good transition.

  Going straight for exceptions in 3.5 is not in line with our normal
  procedures for backwards incompatible changes.

* configuration:

  It would be good to be able to switch this on or off without having to
  change the code, e.g. via a command line switch and environment
  variable; perhaps even controlling whether or not to raise an
  exception or warning.
* choice of trusted certificate:

  Instead of hard wiring using the system CA roots into Python it would
  be good to just make this default and permit the user to point Python
  to a different set of CA roots.

  This would enable using self signed certs more easily. Since these are
  often used for tests, demos and education, I think it's important to
  allow having more control of the trusted certs.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Aug 29 2014)
>>> Python Projects, Consulting and Support ...   http://www.egenix.com/
>>> mxODBC.Zope/Plone.Database.Adapter ...       http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________
2014-08-27: Released eGenix PyRun 2.0.1 ...       http://egenix.com/go62
2014-09-19: PyCon UK 2014, Coventry, UK ...                21 days to go
2014-09-27: PyDDF Sprint 2014 ...                          29 days to go

eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/

From dreid at dreid.org Fri Aug 29 21:56:59 2014
From: dreid at dreid.org (David Reid)
Date: Fri, 29 Aug 2014 19:56:59 +0000 (UTC)
Subject: [Python-Dev] PEP 476: Enabling certificate validation by default!
References: 
Message-ID: 

Alex Gaynor <alex.gaynor at gmail.com> writes:
>
> Hi all,
>
> I've just submitted PEP 476, on enabling certificate validation by default for
> HTTPS clients in Python. Please have a look and let me know what you think.

Yes please. The two most common answers I get to "Why did you switch to
Go?" are "Concurrency" and "The stdlib HTTP client verifies TLS by
default."

In a work related survey of webhook providers I found that only ~7% of
HTTPS URLs would be affected by a change like this.
-David

From ethan at stoneleaf.us Fri Aug 29 22:07:00 2014
From: ethan at stoneleaf.us (Ethan Furman)
Date: Fri, 29 Aug 2014 13:07:00 -0700
Subject: [Python-Dev] PEP 476: Enabling certificate validation by default!
In-Reply-To: <5400DBC0.1020700@egenix.com>
References: <5400DBC0.1020700@egenix.com>
Message-ID: <5400DD64.4050308@stoneleaf.us>

On 08/29/2014 01:00 PM, M.-A. Lemburg wrote:
> On 29.08.2014 21:47, Alex Gaynor wrote:
>>
>> I've just submitted PEP 476, on enabling certificate validation by default for
>> HTTPS clients in Python. Please have a look and let me know what you think.
>
> Thanks for the PEP. I think this is generally a good idea,
> but some important parts are missing from the PEP:
>
> * transition plan:
>
>   I think starting with warnings in Python 3.5 and going
>   for exceptions in 3.6 would make a good transition
>
>   Going straight for exceptions in 3.5 is not in line with
>   our normal procedures for backwards incompatible changes.
>
> * configuration:
>
>   It would be good to be able to switch this on or off
>   without having to change the code, e.g. via a command
>   line switch and environment variable; perhaps even
>   controlling whether or not to raise an exception or
>   warning.
>
> * choice of trusted certificate:
>
>   Instead of hard wiring using the system CA roots into
>   Python it would be good to just make this default and
>   permit the user to point Python to a different set of
>   CA roots.
>
>   This would enable using self signed certs more easily.
>   Since these are often used for tests, demos and education,
>   I think it's important to allow having more control of
>   the trusted certs.

+1 for PEP with above changes.

-- 
~Ethan~

From donald at stufft.io Fri Aug 29 22:10:03 2014
From: donald at stufft.io (Donald Stufft)
Date: Fri, 29 Aug 2014 16:10:03 -0400
Subject: [Python-Dev] PEP 476: Enabling certificate validation by default!
In-Reply-To: <5400DBC0.1020700@egenix.com>
References: <5400DBC0.1020700@egenix.com>
Message-ID: 

> On Aug 29, 2014, at 4:00 PM, "M.-A. Lemburg" wrote:
>
> * choice of trusted certificate:
>
>   Instead of hard wiring using the system CA roots into
>   Python it would be good to just make this default and
>   permit the user to point Python to a different set of
>   CA roots.
>
>   This would enable using self signed certs more easily.
>   Since these are often used for tests, demos and education,
>   I think it's important to allow having more control of
>   the trusted certs.

If I recall, OpenSSL already allows this to be configured via envvar and the
Python API already allows it to be configured via API.

From donald at stufft.io Fri Aug 29 23:11:35 2014
From: donald at stufft.io (Donald Stufft)
Date: Fri, 29 Aug 2014 17:11:35 -0400
Subject: [Python-Dev] PEP 476: Enabling certificate validation by default!
In-Reply-To: <5400DBC0.1020700@egenix.com>
References: <5400DBC0.1020700@egenix.com>
Message-ID: 

Sorry, I was on my phone and didn't get to fully reply to this.

> On Aug 29, 2014, at 4:00 PM, M.-A. Lemburg wrote:
>
> On 29.08.2014 21:47, Alex Gaynor wrote:
>> Hi all,
>>
>> I've just submitted PEP 476, on enabling certificate validation by default for
>> HTTPS clients in Python. Please have a look and let me know what you think.
>>
>> PEP text follows.
>
> Thanks for the PEP. I think this is generally a good idea,
> but some important parts are missing from the PEP:
>
> * transition plan:
>
>   I think starting with warnings in Python 3.5 and going
>   for exceptions in 3.6 would make a good transition
>
>   Going straight for exceptions in 3.5 is not in line with
>   our normal procedures for backwards incompatible changes.

As far as a transition plan, I think that this is an important enough thing
to have an accelerated process. If we need to provide a warning then let's
add it to the next 3.4, otherwise it's going to be 2.5+ years until we stop
being unsafe by default.

Another problem with this is that I don't think it's actually possible to
do. Python itself isn't validating the TLS certificates, OpenSSL is doing
that. To my knowledge OpenSSL doesn't have a way to say "please validate
these certificates and if they don't validate go ahead and keep going and
just let me get a warning from it". It's a 3 way switch: no validation,
validation if a certificate is provided, and validation always.

Now that's strictly for the "verify the certificate chain" portion; the
hostname verification is done entirely on our end and we could do something
there... but I'm not sure it makes sense to do so if we can't do it for
invalid certificates too.

> * configuration:
>
>   It would be good to be able to switch this on or off
>   without having to change the code, e.g. via a command
>   line switch and environment variable; perhaps even
>   controlling whether or not to raise an exception or
>   warning.

I'm on the fence about this. If someone provides a certificate that we can
validate against (which can be done without touching the code) then the only
thing that really can't be "fixed" without touching the code is if someone
has a certificate that is otherwise invalid (expired, not yet valid, wrong
hostname, etc). I'd say if I was voting on this particular thing I'd be -0:
I'd rather it didn't exist but I wouldn't cry too much if it did.

> * choice of trusted certificate:
>
>   Instead of hard wiring using the system CA roots into
>   Python it would be good to just make this default and
>   permit the user to point Python to a different set of
>   CA roots.
>
>   This would enable using self signed certs more easily.
>   Since these are often used for tests, demos and education,
>   I think it's important to allow having more control of
>   the trusted certs.

Like my other email said, the Python API has everything needed to easily
specify your own CA roots and/or disable the validations.
The OpenSSL library also allows you to specify either a directory or a file
to change the root certificates without code changes. The only real problems
with the APIs are that the default is bad and an unrelated thing where you
can't pass in an in memory certificate.

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

From rdmurray at bitdance.com Fri Aug 29 23:42:34 2014
From: rdmurray at bitdance.com (R. David Murray)
Date: Fri, 29 Aug 2014 17:42:34 -0400
Subject: [Python-Dev] PEP 476: Enabling certificate validation by default!
In-Reply-To: 
References: <5400DBC0.1020700@egenix.com>
Message-ID: <20140829214235.0EE2B250E1A@webabinitio.net>

On Fri, 29 Aug 2014 17:11:35 -0400, Donald Stufft wrote:
> Sorry, I was on my phone and didn't get to fully reply to this.
> On Aug 29, 2014, at 4:00 PM, M.-A. Lemburg wrote:
> >
> > * configuration:
> >
> >   It would be good to be able to switch this on or off
> >   without having to change the code, e.g. via a command
> >   line switch and environment variable; perhaps even
> >   controlling whether or not to raise an exception or
> >   warning.
>
> I'm on the fence about this. If someone provides a certificate
> that we can validate against (which can be done without
> touching the code) then the only thing that really can't be
> "fixed" without touching the code is if someone has a certificate
> that is otherwise invalid (expired, not yet valid, wrong hostname,
> etc). I'd say if I was voting on this particular thing I'd be -0: I'd
> rather it didn't exist but I wouldn't cry too much if it did.

Especially if you want an accelerated change, there must be a way to
*easily* get back to the previous behavior, or we are going to catch a
lot of flack.
There may be only 7% of public certs that are problematic, but I'd be
willing to bet you that there are more not-really-public ones that are
critical to day to day operations *somewhere* :)

wget and curl have 'ignore validation' as a command line flag for a reason.

--David

From solipsis at pitrou.net Fri Aug 29 23:55:40 2014
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Fri, 29 Aug 2014 23:55:40 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by default!
References: <5400DBC0.1020700@egenix.com>
Message-ID: <20140829235540.0f73b1d0@fsol>

On Fri, 29 Aug 2014 17:11:35 -0400 Donald Stufft wrote:
>
> Another problem with this is that I don't think it's actually
> possible to do. Python itself isn't validating the TLS certificates,
> OpenSSL is doing that. To my knowledge OpenSSL doesn't
> have a way to say "please validate these certificates and if
> they don't validate go ahead and keep going and just let me
> get a warning from it".

Actually, there may be a solution. In client mode, OpenSSL always verifies
the server cert chain and stores the verification result in the SSL
structure. It will then only report an error if the verify mode is not
SSL_VERIFY_NONE. (see ssl3_get_server_certificate() in s3_clnt.c)

The verification result should then be readable using
SSL_get_verify_result(), even with SSL_VERIFY_NONE. (note this is only from
reading the source code and needs verifying)

Then we could have the following transition phase:

- define a new CERT_WARN value for SSLContext.verify_mode
- use that value as the default in the HTTP stack (people who want the old
  silent default will have to set verify_mode explicitly to VERIFY_NONE)
- with CERT_WARN, SSL_VERIFY_NONE is passed to OpenSSL and Python manually
  calls SSL_get_verify_result() after a handshake; if there was a
  verification error, a warning is printed out

And in the following version we switch the HTTP default to CERT_REQUIRED.

Regards

Antoine.
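[Editor's note: ``CERT_WARN`` never existed in the ``ssl`` module; this is a hypothetical sketch of the warn-instead-of-raise dispatch Antoine proposes, modelled at the application layer. The sentinel and function names are invented for illustration; actual chain verification happens inside OpenSSL and is not modelled here.]

```python
import ssl
import warnings

CERT_WARN = object()  # hypothetical: the proposed SSLContext.verify_mode value

def handle_verify_failure(error, mode):
    """Dispatch a TLS verification failure according to a verify mode.

    CERT_NONE keeps today's silent behaviour, the hypothetical
    CERT_WARN emits a warning and continues (the transition phase),
    and anything else raises -- the end state PEP 476 wants.
    """
    if mode == ssl.CERT_NONE:
        return                      # silently ignore, as today
    if mode is CERT_WARN:
        warnings.warn("TLS verification failed: %s" % error)
        return                      # warn but let the connection proceed
    raise ssl.SSLError(error)       # CERT_REQUIRED-style hard failure
```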
From mal at egenix.com Fri Aug 29 23:58:29 2014
From: mal at egenix.com (M.-A. Lemburg)
Date: Fri, 29 Aug 2014 23:58:29 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by default!
In-Reply-To: 
References: <5400DBC0.1020700@egenix.com>
Message-ID: <5400F785.4090708@egenix.com>

On 29.08.2014 23:11, Donald Stufft wrote:
>
> Sorry, I was on my phone and didn't get to fully reply to this.
>
>> On Aug 29, 2014, at 4:00 PM, M.-A. Lemburg wrote:
>>
>> On 29.08.2014 21:47, Alex Gaynor wrote:
>>> Hi all,
>>>
>>> I've just submitted PEP 476, on enabling certificate validation by default for
>>> HTTPS clients in Python. Please have a look and let me know what you think.
>>>
>>> PEP text follows.
>>
>> Thanks for the PEP. I think this is generally a good idea,
>> but some important parts are missing from the PEP:
>>
>> * transition plan:
>>
>>   I think starting with warnings in Python 3.5 and going
>>   for exceptions in 3.6 would make a good transition
>>
>>   Going straight for exceptions in 3.5 is not in line with
>>   our normal procedures for backwards incompatible changes.
>
> As far as a transition plan, I think that this is an important
> enough thing to have an accelerated process. If we need
> to provide a warning then let's add it to the next 3.4, otherwise
> it's going to be 2.5+ years until we stop being unsafe by
> default.

Fine with me; we're still early in the Python 3.4 patch level releases.

> Another problem with this is that I don't think it's actually
> possible to do. Python itself isn't validating the TLS certificates,
> OpenSSL is doing that. To my knowledge OpenSSL doesn't
> have a way to say "please validate these certificates and if
> they don't validate go ahead and keep going and just let me
> get a warning from it". It's a 3 way switch: no validation, validation
> if a certificate is provided, and validation always.
>
> Now that's strictly for the "verify the certificate chain" portion;
> the hostname verification is done entirely on our end and we
> could do something there... but I'm not sure it makes sense
> to do so if we can't do it for invalid certificates too.

OpenSSL provides a callback for certificate validation, so it is possible
to issue a warning and continue with accepting the certificate.

>> * configuration:
>>
>>   It would be good to be able to switch this on or off
>>   without having to change the code, e.g. via a command
>>   line switch and environment variable; perhaps even
>>   controlling whether or not to raise an exception or
>>   warning.
>
> I'm on the fence about this. If someone provides a certificate
> that we can validate against (which can be done without
> touching the code) then the only thing that really can't be
> "fixed" without touching the code is if someone has a certificate
> that is otherwise invalid (expired, not yet valid, wrong hostname,
> etc). I'd say if I was voting on this particular thing I'd be -0: I'd
> rather it didn't exist but I wouldn't cry too much if it did.

If you're testing code or trying out some new stuff, you don't want to get
a valid cert first, but instead go ahead with a self signed one. That's the
use case.

>> * choice of trusted certificate:
>>
>>   Instead of hard wiring using the system CA roots into
>>   Python it would be good to just make this default and
>>   permit the user to point Python to a different set of
>>   CA roots.
>>
>>   This would enable using self signed certs more easily.
>>   Since these are often used for tests, demos and education,
>>   I think it's important to allow having more control of
>>   the trusted certs.
>
> Like my other email said, the Python API has everything needed
> to easily specify your own CA roots and/or disable the validations.
> The OpenSSL library also allows you to specify either a directory
> or a file to change the root certificates without code changes.
> The only real problems with the APIs are that the default is bad and
> an unrelated thing where you can't pass in an in memory certificate.

Are you sure that's possible? Python doesn't load the openssl.cnf file and
the SSL_CERT_FILE, SSL_CERT_DIR env vars only work for the openssl command
line binary, AFAIK.

In any case, Python will have to tap into the OS CA root provider using
special code and this code could easily be made to check other dirs/files
as well.

The point is that it should be possible to change this default at the
Python level, without needing application code changes.

-- 
Marc-Andre Lemburg
eGenix.com

From solipsis at pitrou.net Fri Aug 29 23:57:41 2014
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Fri, 29 Aug 2014 23:57:41 +0200
Subject: [Python-Dev] PEP 476: Enabling certificate validation by default!
References: <5400DBC0.1020700@egenix.com> <20140829214235.0EE2B250E1A@webabinitio.net>
Message-ID: <20140829235741.3bf75d30@fsol>

On Fri, 29 Aug 2014 17:42:34 -0400 "R. David Murray" wrote:
>
> Especially if you want an accelerated change, there must be a way to
> *easily* get back to the previous behavior, or we are going to catch a
> lot of flack.
> There may be only 7% of public certs that are problematic,
> but I'd be willing to bet you that there are more not-really-public ones
> that are critical to day to day operations *somewhere* :)

Actually, by construction, there are certs which will always fail
verification, for example because they are embedded in telco equipments
which don't have a predefined hostname or IP address. (I have encountered
some of those)

Regards

Antoine.

From donald at stufft.io Sat Aug 30 00:00:50 2014
From: donald at stufft.io (Donald Stufft)
Date: Fri, 29 Aug 2014 18:00:50 -0400
Subject: [Python-Dev] PEP 476: Enabling certificate validation by default!
In-Reply-To: <20140829214235.0EE2B250E1A@webabinitio.net>
References: <5400DBC0.1020700@egenix.com> <20140829214235.0EE2B250E1A@webabinitio.net>
Message-ID: 

> On Aug 29, 2014, at 5:42 PM, R. David Murray wrote:
>
> On Fri, 29 Aug 2014 17:11:35 -0400, Donald Stufft wrote:
>> Sorry, I was on my phone and didn't get to fully reply to this.
>>> On Aug 29, 2014, at 4:00 PM, M.-A. Lemburg wrote:
>>>
>>> * configuration:
>>>
>>>   It would be good to be able to switch this on or off
>>>   without having to change the code, e.g. via a command
>>>   line switch and environment variable; perhaps even
>>>   controlling whether or not to raise an exception or
>>>   warning.
>>
>> I'm on the fence about this. If someone provides a certificate
>> that we can validate against (which can be done without
>> touching the code) then the only thing that really can't be
>> "fixed" without touching the code is if someone has a certificate
>> that is otherwise invalid (expired, not yet valid, wrong hostname,
>> etc). I'd say if I was voting on this particular thing I'd be -0: I'd
>> rather it didn't exist but I wouldn't cry too much if it did.
>
> Especially if you want an accelerated change, there must be a way to
> *easily* get back to the previous behavior, or we are going to catch a
> lot of flack. There may be only 7% of public certs that are problematic,
> but I'd be willing to bet you that there are more not-really-public ones
> that are critical to day to day operations *somewhere* :)
>
> wget and curl have 'ignore validation' as a command line flag for a reason.

Right, that's why I'm on the fence :)

On one hand, it's going to break things for some people (arguably they are
already broken, just silently so, but we'll leave that argument aside) and a
way to get back the old behavior is good. There are already ways within the
Python code itself, so that's covered. From outside of the Python code there
are ways if the certificate is untrusted but otherwise valid, which are
pretty easy to do. The major "gap" is when you have an actual invalid
certificate due to expiration or hostname or some other such thing.

On the other hand, Python is not wget/curl, and the people who are most
likely to be the target for a "I can't change the code but I need to get the
old behavior back" are people who are likely to not be invoking Python
itself but running something written in Python. IOW they might be executing
"foobar", not "python -m foobar".

Like I said though, I'm personally fine either way so don't take this as
being against that particular change!

From donald at stufft.io Sat Aug 30 00:08:19 2014
From: donald at stufft.io (Donald Stufft)
Date: Fri, 29 Aug 2014 18:08:19 -0400
Subject: [Python-Dev] PEP 476: Enabling certificate validation by default!
In-Reply-To: <5400F785.4090708@egenix.com>
References: <5400DBC0.1020700@egenix.com> <5400F785.4090708@egenix.com>
Message-ID: 

> On Aug 29, 2014, at 5:58 PM, M.-A.
Lemburg wrote: >>> >>> On 29.08.2014 21:47, Alex Gaynor wrote: >>>> Hi all, >>>> >>>> I've just submitted PEP 476, on enabling certificate validation by default for >>>> HTTPS clients in Python. Please have a look and let me know what you think. >>>> >>>> PEP text follows. >>> >>> Thanks for the PEP. I think this is generally a good idea, >>> but some important parts are missing from the PEP: >>> >>> * transition plan: >>> >>> I think starting with warnings in Python 3.5 and going >>> for exceptions in 3.6 would make a good transition >>> >>> Going straight for exceptions in 3.5 is not in line with >>> our normal procedures for backwards incompatible changes. >> >> As far as a transition plan, I think that this is an important >> enough thing to have an accelerated process. If we need >> to provide a warning than let?s add it to the next 3.4 otherwise >> it?s going to be 2.5+ years until we stop being unsafe by >> default. > > Fine with me; we're still early in the Python 3.4 > patch level releases. > >> Another problem with this is that I don?t think it?s actually >> possible to do. Python itself isn?t validating the TLS certificates, >> OpenSSL is doing that. To my knowledge OpenSSL doesn?t >> have a way to say ?please validate these certificates and if >> they don?t validate go ahead and keep going and just let me >> get a warning from it?. It?s a 3 way switch, no validation, validation >> if a certificate is provided, and validation always. >> >> Now that?s strictly for the ?verify the certificate chain? portion, >> the hostname verification is done entirely on our end and we >> could do something there? but I?m not sure it makes sense >> to do so if we can?t do it for invalid certificates too. > > OpenSSL provides a callback for certificate validation, > so it is possible to issue a warning and continue with > accepting the certificate. Ah right, I forgot about that. I was thinking in terms of CERT_NONE, CERT_OPTIONAL, CERT_REQUIRED. 
I think it?s fine to add a warning if possible to Python 3.4, I just couldn?t think off the top of my head a way of doing it. > >>> * configuration: >>> >>> It would be good to be able to switch this on or off >>> without having to change the code, e.g. via a command >>> line switch and environment variable; perhaps even >>> controlling whether or not to raise an exception or >>> warning. >> >> I?m on the fence about this, if someone provides a certificate >> that we can validate against (which can be done without >> touching the code) then the only thing that really can?t be >> ?fixed? without touching the code is if someone has a certificate >> that is otherwise invalid (expired, not yet valid, wrong hostname, >> etc). I?d say if I was voting on this particular thing I?d be -0, I?d >> rather it didn?t exist but I wouldn?t cry too much if it did. > > If you're testing code or trying out some new stuff, you > don't want to get a valid cert first, but instead go ahead > with a self signed one. That's the use case. > >>> * choice of trusted certificate: >>> >>> Instead of hard wiring using the system CA roots into >>> Python it would be good to just make this default and >>> permit the user to point Python to a different set of >>> CA roots. >>> >>> This would enable using self signed certs more easily. >>> Since these are often used for tests, demos and education, >>> I think it's important to allow having more control of >>> the trusted certs. >> >> >> Like my other email said, the Python API has everything needed >> to easily specify your own CA roots and/or disable the validations. >> The OpenSSL library also allows you to specify either a directory >> or a file to change the root certificates without code changes. The >> only real problems with the APIs are that the default is bad and >> an unrelated thing where you can?t pass in an in memory certificate. > > Are you sure that's possible ? 
Python doesn't load the > openssl.cnf file and the SSL_CERT_FILE, SSL_CERT_DIR env > vars only work for the openssl command line binary, AFAIK. I'm not 100% sure on that. I know they are not limited to the command line binary, as Ruby uses those environment variables in the way I described above, and I do not believe that Ruby has done anything special to enable the use of those variables. It's possible we're doing something differently that bypasses those variables, though. If that is the case then yes, let's add it, ideally doing whatever is needed to make OpenSSL respect those variables, or else respecting them ourselves. > > In any case, Python will have to tap into the OS CA root > provider using special code and this code could easily be > made to check other dirs/files as well. > > The point is that it should be possible to change this default > at the Python level, without needing application code changes. Ok, I'm not opposed to it FWIW. Just saying I'm pretty sure those things already exist in the form of environment variables and at the Python-level APIs. Not sure what else there is: global state for the "default"? A CLI flag? --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA From solipsis at pitrou.net Sat Aug 30 00:22:54 2014 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 30 Aug 2014 00:22:54 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! References: <5400DBC0.1020700@egenix.com> <5400F785.4090708@egenix.com> Message-ID: <20140830002254.26351339@fsol> On Fri, 29 Aug 2014 18:08:19 -0400 Donald Stufft wrote: > > > > Are you sure that's possible? Python doesn't load the > > openssl.cnf file and the SSL_CERT_FILE, SSL_CERT_DIR env > > vars only work for the openssl command line binary, AFAIK. > > I'm not 100% sure on that.
I know they are not limited to the command > line binary as ruby uses those environment variables in the way I > described above. SSL_CERT_DIR and SSL_CERT_FILE are used, if set, when SSLContext.load_verify_locations() is called. Actually, come to think of it, this allows us to write a better test for that method. Patch welcome! Regards Antoine. From rdmurray at bitdance.com Sat Aug 30 00:57:35 2014 From: rdmurray at bitdance.com (R. David Murray) Date: Fri, 29 Aug 2014 18:57:35 -0400 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: <5400DBC0.1020700@egenix.com> <20140829214235.0EE2B250E1A@webabinitio.net> Message-ID: <20140829225736.44BCE250E1A@webabinitio.net> On Fri, 29 Aug 2014 18:00:50 -0400, Donald Stufft wrote: > > On Aug 29, 2014, at 5:42 PM, R. David Murray wrote: > > Especially if you want an accelerated change, there must be a way to > > *easily* get back to the previous behavior, or we are going to catch a > > lot of flack. There may be only 7% of public certs that are problematic, > > but I'd be willing to bet you that there are more not-really-public ones > > that are critical to day to day operations *somewhere* :) > > > > wget and curl have 'ignore validation' as a command line flag for a reason. > > > > Right, that's why I'm on the fence :) > > On one hand, it's going to break things for some people (arguably they are > already broken, just silently so, but we'll leave that argument aside) and a > way to get back the old behavior is good. There are already ways within > the Python code itself, so that's covered. From outside of the Python code > there are ways if the certificate is untrusted but otherwise valid which are > pretty easy to do. The major "gap" is when you have an actual invalid > certificate due to expiration or hostname or some other such thing.
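The SSL_CERT_DIR / SSL_CERT_FILE point above can be checked from Python itself, since the stdlib exposes the names and default locations the linked OpenSSL uses; a small sketch (the printed names are the usual OpenSSL defaults, not guaranteed on every build):

```python
import ssl

# Report where this OpenSSL build looks for its default trust root,
# plus the environment variable names that can override those locations.
paths = ssl.get_default_verify_paths()
print(paths.openssl_cafile_env)   # typically "SSL_CERT_FILE"
print(paths.openssl_capath_env)   # typically "SSL_CERT_DIR"
print(paths.cafile, paths.capath)
```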
> > On the other hand Python is not wget/curl, and the people who are most > likely to be the target for a "I can't change the code but I need to get the > old behavior back" are people who are likely to not be invoking Python > itself but using something written in Python which happens to be using > Python. IOW they might be executing "foobar", not "python -m foobar". Right, so an environment variable is better than a command line switch, for Python. > Like I said though, I'm personally fine either way, so don't take this as > being against that particular change! Ack. --David From greg.ewing at canterbury.ac.nz Sat Aug 30 01:37:18 2014 From: greg.ewing at canterbury.ac.nz (Greg Ewing) Date: Sat, 30 Aug 2014 11:37:18 +1200 Subject: [Python-Dev] surrogatepass - she's a witch, burn 'er! [was: Cleaning up ...] In-Reply-To: <54003062.1040201@egenix.com> References: <878um8b00w.fsf@uwakimon.sk.tsukuba.ac.jp> <54003062.1040201@egenix.com> Message-ID: <54010EAE.4010302@canterbury.ac.nz> M.-A. Lemburg wrote: > we needed > a way to make sure that Python 3 also optionally supports working > with lone surrogates in such UTF-8 streams (nowadays called CESU-8: > http://en.wikipedia.org/wiki/CESU-8). I don't think CESU-8 is the same thing. According to the wiki page, CESU-8 *requires* all code points above 0xffff to be split into surrogate pairs before encoding. It also doesn't say that lone surrogates are valid -- it doesn't mention lone surrogates at all, only pairs. Neither does the linked technical report. The technical report also says that CESU-8 forbids any UTF-8 sequences of more than three bytes, so it's definitely not "UTF-8 plus lone surrogates". -- Greg From alex.gaynor at gmail.com Sat Aug 30 04:44:12 2014 From: alex.gaynor at gmail.com (Alex Gaynor) Date: Sat, 30 Aug 2014 02:44:12 +0000 (UTC) Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! References: Message-ID: Thanks for the rapid feedback everyone!
I want to summarize the action items and discussion points that have come up so far: To add to the PEP: * Emit a warning in 3.4.next for cases that would raise an Exception in 3.5 * Clearly state that the existing OpenSSL environment variables will be respected for setting the trust root Discussion points: * Disabling verification entirely externally to the program, through a CLI flag or environment variable. I'm pretty down on this idea; the problem you hit is that it's a pretty blunt instrument to swing, and it's almost impossible to imagine it not hitting things it shouldn't; it's far too likely to be used in applications that make two sets of outbound connections: 1) to some internal service which you want to disable verification on, and 2) some external service which needs strong validation. A global flag causes the latter to fail silently when subjected to a MITM attack, and that's exactly what we're trying to avoid. It also makes things much harder for library authors: I write an API client for some API, and make TLS connections to it. I want those to be verified by default. I can't even rely on the httplib defaults, because someone might disable them from the outside. Cheers, Alex From stephen at xemacs.org Sat Aug 30 06:21:56 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sat, 30 Aug 2014 13:21:56 +0900 Subject: [Python-Dev] surrogatepass - she's a witch, burn 'er! In-Reply-To: <54010EAE.4010302@canterbury.ac.nz> References: <878um8b00w.fsf@uwakimon.sk.tsukuba.ac.jp> <54003062.1040201@egenix.com> <54010EAE.4010302@canterbury.ac.nz> Message-ID: <871trybo9n.fsf@uwakimon.sk.tsukuba.ac.jp> Greg Ewing writes: > M.-A. Lemburg wrote: > > we needed > > a way to make sure that Python 3 also optionally supports working > > with lone surrogates in such UTF-8 streams (nowadays called CESU-8: > > http://en.wikipedia.org/wiki/CESU-8). Besides what Greg says, CESU-8 is a UTF, and therefore encodes valid Unicode.
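As background for the handler comparison in this sub-thread, a concrete sketch of how CPython treats the same lone-surrogate bytes under each of the two error handlers being debated:

```python
# A lone surrogate (U+DC00) encoded UTF-8-style as three bytes.
raw = b"abc\xed\xb0\x80"

# surrogatepass decodes the three-byte sequence to the surrogate itself:
assert raw.decode("utf-8", "surrogatepass") == "abc\udc00"

# surrogateescape instead maps each undecodable byte 0xNN to U+DCNN:
assert raw.decode("utf-8", "surrogateescape") == "abc\udced\udcb0\udc80"

# Each handler round-trips its own output back to the original bytes.
assert "abc\udc00".encode("utf-8", "surrogatepass") == raw
```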
Speaking imprecisely, CESU-8 is UTF-16 with variable-width code units (ie, each 16-bit code point is represented using the UTF-8 variable-width representation).[1] I think you are thinking of Markus Kuhn's utf-8b (which I believe is exactly what is implemented by the surrogateescape handler). As far as the goal of "working with lone surrogates in such UTF-8 streams", the surrogateescape handler already permits that, and does so consistently across streams in the sense that lone surrogates in the UTF-8 stream cannot be mixed with garbage bytes decoded by surrogateescape in another stream, which produces an unencodable mess. I still don't see a justification for the surrogatepass handler. What applications are producing (not merely passing through) UTF-8-encoded surrogates these days? Footnotes: [1] For the curious, it's imprecise because in Unicode code units are fixed-width by definition. From mal at egenix.com Sat Aug 30 12:03:06 2014 From: mal at egenix.com (M.-A. Lemburg) Date: Sat, 30 Aug 2014 12:03:06 +0200 Subject: [Python-Dev] surrogatepass - she's a witch, burn 'er! [was: Cleaning up ...] In-Reply-To: <54010EAE.4010302@canterbury.ac.nz> References: <878um8b00w.fsf@uwakimon.sk.tsukuba.ac.jp> <54003062.1040201@egenix.com> <54010EAE.4010302@canterbury.ac.nz> Message-ID: <5401A15A.1010902@egenix.com> On 30.08.2014 01:37, Greg Ewing wrote: > M.-A. Lemburg wrote: >> we needed >> a way to make sure that Python 3 also optionally supports working >> with lone surrogates in such UTF-8 streams (nowadays called CESU-8: >> http://en.wikipedia.org/wiki/CESU-8). > > I don't think CESU-8 is the same thing. According to the wiki > page, CESU-8 *requires* all code points above 0xffff to be split > into surrogate pairs before encoding. It also doesn't say that > lone surrogates are valid -- it doesn't mention lone surrogates > at all, only pairs. Neither does the linked technical report. 
> > The technical report also says that CESU-8 forbids any UTF-8 > sequences of more than three bytes, so it's definitely not > "UTF-8 plus lone surrogates". You're right, it's not the same as UTF-8 plus lone surrogates. CESU-8 does encode surrogates as individual code points using the UTF-8 encoding, which is what probably caused it to be mentioned in discussions when talking about having UTF-8 streams do the same for lone surrogates. So let's call the encoding UTF-8-py so that everyone knows what we're talking about :-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 30 2014) >>> Python Projects, Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ 2014-08-27: Released eGenix PyRun 2.0.1 ... http://egenix.com/go62 2014-09-19: PyCon UK 2014, Coventry, UK ... 20 days to go 2014-09-27: PyDDF Sprint 2014 ... 28 days to go eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ From mal at egenix.com Sat Aug 30 12:19:11 2014 From: mal at egenix.com (M.-A. Lemburg) Date: Sat, 30 Aug 2014 12:19:11 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: Message-ID: <5401A51F.8050408@egenix.com> On 30.08.2014 04:44, Alex Gaynor wrote: > Thanks for the rapid feedback everyone! 
> > I want to summarize the action items and discussion points that have come up so > far: > > To add to the PEP: > > * Emit a warning in 3.4.next for cases that would raise a Exception in 3.5 > * Clearly state that the existing OpenSSL environment variables will be > respected for setting the trust root I'd also suggest to compile Python with OPENSSL_LOAD_CONF, since that causes OpenSSL to read the global openssl.cnf file for additional configuration. > Discussion points: > > * Disabling verification entirely externally to the program, through a CLI flag > or environment variable. I'm pretty down on this idea, the problem you hit is > that it's a pretty blunt instrument to swing, and it's almost impossible to > imagine it not hitting things it shouldn't; it's far too likely to be used in > applications that make two sets of outbound connections: 1) to some internal > service which you want to disable verification on, and 2) some external > service which needs strong validation. A global flag causes the latter to > fail silently when subjected to a MITM attack, and that's exactly what we're > trying to avoid. It also makes things much harder for library authors: I > write an API client for some API, and make TLS connections to it. I want > those to be verified by default. I can't even rely on the httplib defaults, > because someone might disable them from the outside. The reasoning here is the same as for hash randomization. There are cases where you want to test your application using self-signed certificates which don't validate against the system CA root list. In those cases, you do know what you're doing. The test would fail otherwise and the reason is not a bug in your code, it's just the fact that the environment you're running it in is a test environment. 
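One way a test environment can opt in to a self-signed certificate without disabling validation wholesale is to extend the trust root explicitly; a sketch only, where `TEST_CA_FILE` is a hypothetical knob set by the test harness, not an existing Python or OpenSSL variable:

```python
import os
import ssl

# Start from the validating default; only when the test harness asks
# for it, add a self-signed certificate to the trusted roots.
ctx = ssl.create_default_context()

cafile = os.environ.get("TEST_CA_FILE")  # hypothetical test-run knob
if cafile:
    ctx.load_verify_locations(cafile=cafile)

# Validation stays enabled either way; the test cert simply validates.
assert ctx.verify_mode == ssl.CERT_REQUIRED
```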
Ideally, all applications should give you this choice, but this is unlikely to happen, so it's good to be able to change the Python default, since with the proposed change, most applications will probably continue to use the Python defaults as they do now. -- Marc-Andre Lemburg From solipsis at pitrou.net Sat Aug 30 12:40:26 2014 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 30 Aug 2014 12:40:26 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! References: <5401A51F.8050408@egenix.com> Message-ID: <20140830124026.6dd1d92b@fsol> On Sat, 30 Aug 2014 12:19:11 +0200 "M.-A. Lemburg" wrote: > > To add to the PEP: > > > > * Emit a warning in 3.4.next for cases that would raise an Exception in 3.5 > > * Clearly state that the existing OpenSSL environment variables will be > > respected for setting the trust root > > I'd also suggest to compile Python with OPENSSL_LOAD_CONF, since that > causes OpenSSL to read the global openssl.cnf file for additional > configuration. Python links against OpenSSL as a shared library, not statically. It's unlikely that setting a compile constant inside Python would affect OpenSSL at all.
> > Discussion points: > > > > * Disabling verification entirely externally to the program, through a CLI flag > > or environment variable. I'm pretty down on this idea, the problem you hit is > > that it's a pretty blunt instrument to swing, and it's almost impossible to > > imagine it not hitting things it shouldn't; it's far too likely to be used in > > applications that make two sets of outbound connections: 1) to some internal > > service which you want to disable verification on, and 2) some external > > service which needs strong validation. A global flag causes the latter to > > fail silently when subjected to a MITM attack, and that's exactly what we're > > trying to avoid. It also makes things much harder for library authors: I > > write an API client for some API, and make TLS connections to it. I want > > those to be verified by default. I can't even rely on the httplib defaults, > > because someone might disable them from the outside. > > The reasoning here is the same as for hash randomization. There > are cases where you want to test your application using self-signed > certificates which don't validate against the system CA root list. That use case should be served with the SSL_CERT_DIR and SSL_CERT_FILE env vars (or, better, by specific settings *inside* the application). I'm against multiplying environment variables, as it makes it more difficult to assess the actual security of a setting. The danger of an ill-secure setting is much more severe than with hash randomization. Regards Antoine. From mal at egenix.com Sat Aug 30 12:46:47 2014 From: mal at egenix.com (M.-A. Lemburg) Date: Sat, 30 Aug 2014 12:46:47 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: <20140830124026.6dd1d92b@fsol> References: <5401A51F.8050408@egenix.com> <20140830124026.6dd1d92b@fsol> Message-ID: <5401AB97.4080702@egenix.com> On 30.08.2014 12:40, Antoine Pitrou wrote: > On Sat, 30 Aug 2014 12:19:11 +0200 > "M.-A. 
Lemburg" wrote: >>> To add to the PEP: >>> >>> * Emit a warning in 3.4.next for cases that would raise an Exception in 3.5 >>> * Clearly state that the existing OpenSSL environment variables will be >>> respected for setting the trust root >> >> I'd also suggest to compile Python with OPENSSL_LOAD_CONF, since that >> causes OpenSSL to read the global openssl.cnf file for additional >> configuration. > > Python links against OpenSSL as a shared library, not statically. It's > unlikely that setting a compile constant inside Python would affect > OpenSSL at all. The change is to the OpenSSL API, not the OpenSSL lib. By setting the variable you enable a few special calls to the config loader functions in OpenSSL when calling its initializer: https://www.openssl.org/docs/crypto/OPENSSL_config.html >>> Discussion points: >>> >>> * Disabling verification entirely externally to the program, through a CLI flag >>> or environment variable. I'm pretty down on this idea, the problem you hit is >>> that it's a pretty blunt instrument to swing, and it's almost impossible to >>> imagine it not hitting things it shouldn't; it's far too likely to be used in >>> applications that make two sets of outbound connections: 1) to some internal >>> service which you want to disable verification on, and 2) some external >>> service which needs strong validation. A global flag causes the latter to >>> fail silently when subjected to a MITM attack, and that's exactly what we're >>> trying to avoid. It also makes things much harder for library authors: I >>> write an API client for some API, and make TLS connections to it. I want >>> those to be verified by default. I can't even rely on the httplib defaults, >>> because someone might disable them from the outside. >> >> The reasoning here is the same as for hash randomization. There >> are cases where you want to test your application using self-signed >> certificates which don't validate against the system CA root list.
> > That use case should be served with the SSL_CERT_DIR and SSL_CERT_FILE > env vars (or, better, by specific settings *inside* the application). > > I'm against multiplying environment variables, as it makes it more > difficult to assess the actual security of a setting. The danger of an > ill-secure setting is much more severe than with hash randomization. You have a point there. So how about just a python run-time switch and no env var ? -- Marc-Andre Lemburg From p.f.moore at gmail.com Sat Aug 30 12:48:55 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 30 Aug 2014 11:48:55 +0100 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: Message-ID: 30 August 2014 03:44, Alex Gaynor wrote: > Discussion points: > > * Disabling verification entirely externally to the program, through a CLI flag > or environment variable. I'm pretty down on this idea, the problem you hit is > that it's a pretty blunt instrument to swing, and it's almost impossible to > imagine it not hitting things it shouldn't As a data point, I use --no-check-certificates extensively, in wget, curl and some Python programs which have it, like youtube-dl.
The reason I do so is typically because the programs do not use the Windows certificate store, and configuring a second certificate store on a per-program basis is too much of a pain to be worth it (per-program because the hacks such programs use to get round the fact that Windows has no central location like /etc are inconsistent). The key question for me is therefore, does Python's ssl support use the Windows store directly these days? I checked the docs and couldn't find anything explicitly stating this (but all the terminology is foreign to me, so I may have missed it). If it does, programs like youtube-dl will start to "just work" and I won't have the need for a "switch off everything" flag. If a new Python 3.5 installation on a Windows machine will enforce https cert checking and yet will not check the system store (or, I guess, come with an embedded store, but aren't there maintenance issues with doing that?) then I believe a global "don't check" flag will be needed, as not all programs offer a "don't check certificates" mode. And naive users like me may not even know how to code the behaviour for such an option - and the tone of the debate here leads me to believe that it'll be hard for developers to get unbiased advice on how to switch off checking, so it'll end up being patchily implemented. Paul From solipsis at pitrou.net Sat Aug 30 12:55:54 2014 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sat, 30 Aug 2014 12:55:54 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: <5401AB97.4080702@egenix.com> References: <5401A51F.8050408@egenix.com> <20140830124026.6dd1d92b@fsol> <5401AB97.4080702@egenix.com> Message-ID: <20140830125554.0a6c06e0@fsol> On Sat, 30 Aug 2014 12:46:47 +0200 "M.-A. Lemburg" wrote: > The change is to the OpenSSL API, not the OpenSSL lib.
By setting > the variable you enable a few special calls to the config loader > functions in OpenSSL when calling the initializer it: > > https://www.openssl.org/docs/crypto/OPENSSL_config.html Ah, ok. Do you have experience with openssl.cnf? Apparently, it is meant for offline tools such as certificate generation, I am not sure how it could impact certification validation. > > That use case should be served with the SSL_CERT_DIR and SSL_CERT_FILE > > env vars (or, better, by specific settings *inside* the application). > > > > I'm against multiplying environment variables, as it makes it more > > difficult to assess the actual security of a setting. The danger of an > > ill-secure setting is much more severe than with hash randomization. > > You have a point there. So how about just a python run-time switch > and no env var ? Well, why not, but does it have a value over letting the code properly configure their SSLContext? Regards Antoine. From mal at egenix.com Sat Aug 30 14:03:57 2014 From: mal at egenix.com (M.-A. Lemburg) Date: Sat, 30 Aug 2014 14:03:57 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: <20140830125554.0a6c06e0@fsol> References: <5401A51F.8050408@egenix.com> <20140830124026.6dd1d92b@fsol> <5401AB97.4080702@egenix.com> <20140830125554.0a6c06e0@fsol> Message-ID: <5401BDAD.2000203@egenix.com> On 30.08.2014 12:55, Antoine Pitrou wrote: > On Sat, 30 Aug 2014 12:46:47 +0200 > "M.-A. Lemburg" wrote: >> The change is to the OpenSSL API, not the OpenSSL lib. By setting >> the variable you enable a few special calls to the config loader >> functions in OpenSSL when calling the initializer it: >> >> https://www.openssl.org/docs/crypto/OPENSSL_config.html > > Ah, ok. Do you have experience with openssl.cnf? Apparently, it is > meant for offline tools such as certificate generation, I am not sure > how it could impact certification validation. 
I'm still exploring this: the OpenSSL documentation is, well, less than complete on these things, so searching mailing lists and reading source code appears to be the only reasonable way to figure out what is possible and what not. The openssl.cnf config file is indeed mostly used by the various openssl subcommands (e.g. req and ca), but it can also be used to configure engines, and my hope is that configuration of e.g. default certificate stores also becomes possible. One of the engines can tap into the Windows certificate store, for example. >>> That use case should be served with the SSL_CERT_DIR and SSL_CERT_FILE >>> env vars (or, better, by specific settings *inside* the application). >>> >>> I'm against multiplying environment variables, as it makes it more >>> difficult to assess the actual security of a setting. The danger of an >>> ill-secure setting is much more severe than with hash randomization. >> >> You have a point there. So how about just a python run-time switch >> and no env var ? > > Well, why not, but does it have a value over letting the code properly > configure their SSLContext? Yes, because when Python changes the default to be validating and more secure, application developers will do the same as they do now: simply use the defaults ;-) -- Marc-Andre Lemburg From rdmurray at bitdance.com Sat Aug 30 15:32:32 2014 From: rdmurray at bitdance.com (R. David Murray) Date: Sat, 30 Aug 2014 09:32:32 -0400 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: <5401BDAD.2000203@egenix.com> References: <5401A51F.8050408@egenix.com> <20140830124026.6dd1d92b@fsol> <5401AB97.4080702@egenix.com> <20140830125554.0a6c06e0@fsol> <5401BDAD.2000203@egenix.com> Message-ID: <20140830133233.1A3FD250DFD@webabinitio.net> On Sat, 30 Aug 2014 14:03:57 +0200, "M.-A. Lemburg" wrote: > On 30.08.2014 12:55, Antoine Pitrou wrote: > > On Sat, 30 Aug 2014 12:46:47 +0200 > > "M.-A. Lemburg" wrote: > >>> That use case should be served with the SSL_CERT_DIR and SSL_CERT_FILE > >>> env vars (or, better, by specific settings *inside* the application). > >>> > >>> I'm against multiplying environment variables, as it makes it more > >>> difficult to assess the actual security of a setting. The danger of an > >>> ill-secure setting is much more severe than with hash randomization. > >> > >> You have a point there. So how about just a python run-time switch > >> and no env var ? > > > > Well, why not, but does it have a value over letting the code properly > > configure their SSLContext? > > Yes, because when Python changes the default to be validating > and more secure, application developers will do the same as > they do now: simply use the defaults ;-) But neither of those addresses the articulated use case: someone *using* a program implemented in python that does not itself provide a way to disable the new default security (because it is *new*). Only an environment variable will do that. Since the environment variable is opt-in, I think the "consenting adults" argument applies to Alex's demurral about "multiple connections". It could still emit the warnings.
--David From mal at egenix.com Sat Aug 30 16:20:22 2014 From: mal at egenix.com (M.-A. Lemburg) Date: Sat, 30 Aug 2014 16:20:22 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: <20140830133233.1A3FD250DFD@webabinitio.net> References: <5401A51F.8050408@egenix.com> <20140830124026.6dd1d92b@fsol> <5401AB97.4080702@egenix.com> <20140830125554.0a6c06e0@fsol> <5401BDAD.2000203@egenix.com> <20140830133233.1A3FD250DFD@webabinitio.net> Message-ID: <5401DDA6.3000904@egenix.com> On 30.08.2014 15:32, R. David Murray wrote: > On Sat, 30 Aug 2014 14:03:57 +0200, "M.-A. Lemburg" wrote: >> On 30.08.2014 12:55, Antoine Pitrou wrote: >>> On Sat, 30 Aug 2014 12:46:47 +0200 >>> "M.-A. Lemburg" wrote: >>>>> That use case should be served with the SSL_CERT_DIR and SSL_CERT_FILE >>>>> env vars (or, better, by specific settings *inside* the application). >>>>> >>>>> I'm against multiplying environment variables, as it makes it more >>>>> difficult to assess the actual security of a setting. The danger of an >>>>> ill-secure setting is much more severe than with hash randomization. >>>> >>>> You have a point there. So how about just a python run-time switch >>>> and no env var ? >>> >>> Well, why not, but does it have a value over letting the code properly >>> configure their SSLContext? >> >> Yes, because when Python changes the default to be validating >> and more secure, application developers will do the same as >> they do now: simply use the defaults ;-) > > But neither of those addresses the articulated use case: someone *using* > a program implemented in python that does not itself provide a way to > disable the new default security (because it is *new*). Only an > environment variable will do that. > > Since the environment variable is opt-in, I think the "consenting > adults" argument applies to Alex's demure about "multiple connections". > It could still emit the warnings. That would be a possibility as well, yes. 
I'd just like to see a way to say: I know what I'm doing and I'm not in the mood to configure my own CA list, so please go ahead and accept whatever certs you find -- much like what --no-check-certificate does for wget. -- Marc-Andre Lemburg From Steve.Dower at microsoft.com Sat Aug 30 16:24:05 2014 From: Steve.Dower at microsoft.com (Steve Dower) Date: Sat, 30 Aug 2014 14:24:05 +0000 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: <20140830133233.1A3FD250DFD@webabinitio.net> References: <5401A51F.8050408@egenix.com> <20140830124026.6dd1d92b@fsol> <5401AB97.4080702@egenix.com> <20140830125554.0a6c06e0@fsol> <5401BDAD.2000203@egenix.com>,<20140830133233.1A3FD250DFD@webabinitio.net> Message-ID: <62d2362898df4dcab65207cb29449905@DM2PR0301MB0734.namprd03.prod.outlook.com> This sounds great, but the disable switch worries me if it's an ENVVAR=1 kind of deal. Those switches have a tendency on Windows of becoming "well known tricks" and they get set globally and permanently, often by application installers or sysadmins (PYTHONPATH suffers the exact same problem). It sounds like the likely approach is a certificate name, which is fine, provided there's no option for "accept everything".
I just wanted to get an early vote in against a boolean switch. Cheers, Steve Top-posted from my Windows Phone ________________________________ From: R. David Murray Sent: ?8/?30/?2014 6:33 To: python-dev at python.org Subject: Re: [Python-Dev] PEP 476: Enabling certificate validation by default! On Sat, 30 Aug 2014 14:03:57 +0200, "M.-A. Lemburg" wrote: > On 30.08.2014 12:55, Antoine Pitrou wrote: > > On Sat, 30 Aug 2014 12:46:47 +0200 > > "M.-A. Lemburg" wrote: > >>> That use case should be served with the SSL_CERT_DIR and SSL_CERT_FILE > >>> env vars (or, better, by specific settings *inside* the application). > >>> > >>> I'm against multiplying environment variables, as it makes it more > >>> difficult to assess the actual security of a setting. The danger of an > >>> ill-secure setting is much more severe than with hash randomization. > >> > >> You have a point there. So how about just a python run-time switch > >> and no env var ? > > > > Well, why not, but does it have a value over letting the code properly > > configure their SSLContext? > > Yes, because when Python changes the default to be validating > and more secure, application developers will do the same as > they do now: simply use the defaults ;-) But neither of those addresses the articulated use case: someone *using* a program implemented in python that does not itself provide a way to disable the new default security (because it is *new*). Only an environment variable will do that. Since the environment variable is opt-in, I think the "consenting adults" argument applies to Alex's demure about "multiple connections". It could still emit the warnings. --David _______________________________________________ Python-Dev mailing list Python-Dev at python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/steve.dower%40microsoft.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From alex.gaynor at gmail.com Sat Aug 30 17:22:01 2014 From: alex.gaynor at gmail.com (Alex Gaynor) Date: Sat, 30 Aug 2014 15:22:01 +0000 (UTC) Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! References: Message-ID: The Windows certificate store is used by ``load_default_certs``: * https://github.com/python/cpython/blob/master/Lib/ssl.py#L379-L381 * https://docs.python.org/3.4/library/ssl.html#ssl.enum_certificates Cheers, Alex From p.f.moore at gmail.com Sat Aug 30 17:36:23 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Sat, 30 Aug 2014 16:36:23 +0100 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: Message-ID: On 30 August 2014 16:22, Alex Gaynor wrote: > The Windows certificate store is used by ``load_default_certs`` Cool, in which case this sounds like a good plan. 
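The behaviour Alex points at can be checked interactively; a short sketch (assuming Python 3.4+, where ``create_default_context()`` calls ``load_default_certs()`` when no explicit CA material is supplied):

```python
import ssl

# create_default_context() loads the platform trust roots via
# load_default_certs(); on Windows that goes through
# ssl.enum_certificates(), elsewhere it falls back to OpenSSL's
# default verify paths.
ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)

# cert_store_stats() reports how many certificates (including CA
# certificates) ended up in the context's store.
print(ctx.cert_store_stats())
```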
I have no particular > opinion on whether there should be a global Python-level "don't check > certificates" option, but I would suggest that the docs include a > section explaining how a user can implement a > "--no-check-certificates" flag in their program if they want to (with > appropriate warnings as to the risks, of course!). Better to explain > how to do it properly than to say "you shouldn't do that" and have > developers implement awkward or incorrect hacks in spite of the > advice. Will there be a way to specify a particular CA certificate (as in "wget --ca-certificate")? Will there be a way to specify a particular CA certificate directory (as in "wget --ca-directory")? Marko From barry at python.org Sat Aug 30 18:42:12 2014 From: barry at python.org (Barry Warsaw) Date: Sat, 30 Aug 2014 09:42:12 -0700 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: <5401A51F.8050408@egenix.com> References: <5401A51F.8050408@egenix.com> Message-ID: <20140830094212.5e9c13b5@anarchist> On Aug 30, 2014, at 12:19 PM, M.-A. Lemburg wrote: >The reasoning here is the same as for hash randomization. There >are cases where you want to test your application using self-signed >certificates which don't validate against the system CA root list. > >In those cases, you do know what you're doing. The test would fail >otherwise and the reason is not a bug in your code, it's just >the fact that the environment you're running it in is a test >environment. Exactly. I have test cases where I have to load up a self-signed cert via .load_cert_chain() and in the good-path tests, I expect to make successful https connections. I also have test cases that expect to fail when: * I load bogus self-signed certs * I have an http server masquerading as an https server * I load an expired self-signed cert It certainly makes sense for the default to be the most secure, but other use cases must be preserved. 
Cheers, -Barry From christian at python.org Sat Aug 30 19:21:41 2014 From: christian at python.org (Christian Heimes) Date: Sat, 30 Aug 2014 19:21:41 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: Message-ID: <54020825.9030700@python.org> On 30.08.2014 17:22, Alex Gaynor wrote: > The Windows certificate store is used by ``load_default_certs``: > > * https://github.com/python/cpython/blob/master/Lib/ssl.py#L379-L381 > * https://docs.python.org/3.4/library/ssl.html#ssl.enum_certificates The Windows part of load_default_certs() has one major flaw: it can only load certificates that are already in Windows's cert store. However Windows comes only with a small set of default certs and downloads more certs on demand. In order to trigger a download Python or OpenSSL would have to use the Windows API to verify root certificates. Christian From martin at v.loewis.de Sat Aug 30 22:03:20 2014 From: martin at v.loewis.de (martin at v.loewis.de) Date: Sat, 30 Aug 2014 22:03:20 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: <54020825.9030700@python.org> References: <54020825.9030700@python.org> Message-ID: <20140830220320.Horde.w4hf4ACWV_cJyYetCLrv8g8@webmail.df.eu> Zitat von Christian Heimes : > On 30.08.2014 17:22, Alex Gaynor wrote: >> The Windows certificate store is used by ``load_default_certs``: >> >> * https://github.com/python/cpython/blob/master/Lib/ssl.py#L379-L381 >> * https://docs.python.org/3.4/library/ssl.html#ssl.enum_certificates > > The Windows part of load_default_certs() has one major flaw: it can only > load certificates that are already in Windows's cert store. However > Windows comes only with a small set of default certs and downloads more > certs on demand. In order to trigger a download Python or OpenSSL would > have to use the Windows API to verify root certificates. It's better than you think. 
Vista+ has a weekly prefetching procedure that should assure that virtually all root certificates are available: http://support.microsoft.com/kb/931125/en-us BTW, it's patented: http://www.google.de/patents/US6816900 Regards, Martin From ncoghlan at gmail.com Sun Aug 31 01:26:30 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 31 Aug 2014 09:26:30 +1000 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: <5400DD64.4050308@stoneleaf.us> References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> Message-ID: On 30 Aug 2014 06:08, "Ethan Furman" wrote: > > On 08/29/2014 01:00 PM, M.-A. Lemburg wrote: >> >> On 29.08.2014 21:47, Alex Gaynor wrote: >>> >>> >>> I've just submitted PEP 476, on enabling certificate validation by default for >>> HTTPS clients in Python. Please have a look and let me know what you think. >> >> >> Thanks for the PEP. I think this is generally a good idea, >> but some important parts are missing from the PEP: >> >> * transition plan: >> >> I think starting with warnings in Python 3.5 and going >> for exceptions in 3.6 would make a good transition >> >> Going straight for exceptions in 3.5 is not in line with >> our normal procedures for backwards incompatible changes. >> >> * configuration: >> >> It would be good to be able to switch this on or off >> without having to change the code, e.g. via a command >> line switch and environment variable; perhaps even >> controlling whether or not to raise an exception or >> warning. >> >> * choice of trusted certificate: >> >> Instead of hard wiring using the system CA roots into >> Python it would be good to just make this default and >> permit the user to point Python to a different set of >> CA roots. >> >> This would enable using self signed certs more easily. >> Since these are often used for tests, demos and education, >> I think it's important to allow having more control of >> the trusted certs. > > > +1 for PEP with above changes. 
Ditto from me. In relation to changing the Python CLI API to offer some of the wget/curl style command line options, I like the idea of providing recipes in the docs for implementing them at the application layer, but postponing making the *default* behaviour configurable that way. Longer term, I'd like to actually have a per-runtime configuration file for some of these things that also integrated with the pyvenv support, but that requires untangling the current startup code first (and there are only so many hours in the day). Regards, Nick. > > -- > ~Ethan~ > > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From solipsis at pitrou.net Sun Aug 31 03:25:25 2014 From: solipsis at pitrou.net (Antoine Pitrou) Date: Sun, 31 Aug 2014 03:25:25 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> Message-ID: <20140831032525.19b7e48c@fsol> On Sun, 31 Aug 2014 09:26:30 +1000 Nick Coghlan wrote: > >> > >> * configuration: > >> > >> It would be good to be able to switch this on or off > >> without having to change the code, e.g. via a command > >> line switch and environment variable; perhaps even > >> controlling whether or not to raise an exception or > >> warning. > >> > >> * choice of trusted certificate: > >> > >> Instead of hard wiring using the system CA roots into > >> Python it would be good to just make this default and > >> permit the user to point Python to a different set of > >> CA roots. > >> > >> This would enable using self signed certs more easily. 
> >> Since these are often used for tests, demos and education, > >> I think it's important to allow having more control of > >> the trusted certs. > > > > > > +1 for PEP with above changes. > > Ditto from me. > > In relation to changing the Python CLI API to offer some of the wget/curl > style command line options, I like the idea of providing recipes in the > docs for implementing them at the application layer, but postponing making > the *default* behaviour configurable that way. I'm against any additional environment variables and command-line options. It will only complicate and obscure the security parameters of certificate validation. The existing knobs have already been mentioned in this thread, I won't mention them here again. Regards Antoine. From rdmurray at bitdance.com Sun Aug 31 04:21:49 2014 From: rdmurray at bitdance.com (R. David Murray) Date: Sat, 30 Aug 2014 22:21:49 -0400 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: <20140831032525.19b7e48c@fsol> References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> Message-ID: <20140831022149.F0493250E30@webabinitio.net> On Sun, 31 Aug 2014 03:25:25 +0200, Antoine Pitrou wrote: > On Sun, 31 Aug 2014 09:26:30 +1000 > Nick Coghlan wrote: > > >> > > >> * configuration: > > >> > > >> It would be good to be able to switch this on or off > > >> without having to change the code, e.g. via a command > > >> line switch and environment variable; perhaps even > > >> controlling whether or not to raise an exception or > > >> warning. > > >> > > >> * choice of trusted certificate: > > >> > > >> Instead of hard wiring using the system CA roots into > > >> Python it would be good to just make this default and > > >> permit the user to point Python to a different set of > > >> CA roots. > > >> > > >> This would enable using self signed certs more easily. 
> > >> Since these are often used for tests, demos and education, > > >> I think it's important to allow having more control of > > >> the trusted certs. > > > > > > > > > +1 for PEP with above changes. > > > > Ditto from me. > > > > In relation to changing the Python CLI API to offer some of the wget/curl > > style command line options, I like the idea of providing recipes in the > > docs for implementing them at the application layer, but postponing making > > the *default* behaviour configurable that way. > > I'm against any additional environment variables and command-line > options. It will only complicate and obscure the security parameters of > certificate validation. > > The existing knobs have already been mentioned in this thread, I won't > mention them here again. Do those knobs allow one to instruct urllib to accept an invalid certificate without changing the program code? --David From stephen at xemacs.org Sun Aug 31 07:53:17 2014 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Sun, 31 Aug 2014 14:53:17 +0900 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: <20140830220320.Horde.w4hf4ACWV_cJyYetCLrv8g8@webmail.df.eu> References: <54020825.9030700@python.org> <20140830220320.Horde.w4hf4ACWV_cJyYetCLrv8g8@webmail.df.eu> Message-ID: <87sikd9pde.fsf@uwakimon.sk.tsukuba.ac.jp> martin at v.loewis.de writes: > BTW, it's patented: > > http://www.google.de/patents/US6816900 Damn them. I hope they never get a look at my crontab. From ncoghlan at gmail.com Sun Aug 31 08:09:26 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 31 Aug 2014 16:09:26 +1000 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: <20140831022149.F0493250E30@webabinitio.net> References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> Message-ID: On 31 August 2014 12:21, R. 
David Murray wrote: > On Sun, 31 Aug 2014 03:25:25 +0200, Antoine Pitrou wrote: >> On Sun, 31 Aug 2014 09:26:30 +1000 >> Nick Coghlan wrote: >> > In relation to changing the Python CLI API to offer some of the wget/curl >> > style command line options, I like the idea of providing recipes in the >> > docs for implementing them at the application layer, but postponing making >> > the *default* behaviour configurable that way. >> >> I'm against any additional environment variables and command-line >> options. It will only complicate and obscure the security parameters of >> certificate validation. As Antoine says here, I'm also opposed to adding more Python specific configuration options. However, I think there may be something worthwhile we can do that's closer to the way browsers work, and has the significant benefit of being implementable as a PyPI module first (more on that in a separate reply). >> The existing knobs have already been mentioned in this thread, I won't >> mention them here again. > > Do those knobs allow one to instruct urllib to accept an invalid > certificate without changing the program code? Only if you add the specific certificate concerned to the certificate store that Python is using (which PEP 476 currently suggests will be the platform wide certificate store). Whether or not that is an adequate solution is the point currently in dispute. My view is that the core problem/concern we need to address here is how we manage the migration away from a network communication model that trusts the network by default. That transition will happen regardless of whether or not we adapt Python as a platform - the challenge for us is how we can address it in a way that minimises the impact on existing users, while still ensuring future users are protected by default. This would be relatively easy if we only had to worry about the public internet (since we're followers rather than leaders in that environment), but we don't. 
Python made the leap into enterprise environments long ago, so we not only need to cope with corporate intranets, we need to cope with corporate intranets that aren't necessarily being well managed. That's what makes this a harder problem for us than it is for a new language like Go that was created by a public internet utility, specifically for use over the public internet - they didn't *have* an installed base to manage, they could just build a language specifically tailored for the task of running network services on Linux, without needing to account for any other use cases. The reason our existing installed base creates a problem is because corporate network security has historically focused on "perimeter defence": carving out a trusted island behind the corporate firewall where users and other computer systems could be "safely" assumed not to be malicious. As an industry, we have learned through harsh experience that *this model doesn't work*. You can't trust the network, period. A corporate intranet is *less* dangerous than the public internet, but you still can't trust it. This "don't trust the network" ethos is also reinforced by the broad shift to "utility computing" where more and more companies are running distributed networks, where some of their systems are actually running on vendor provided servers. The "network perimeter" is evaporating, as corporate "intranets" start to look a lot more like recreations of the internet in miniature, with the only difference being the existence of more formal contractual relationships than typically exist between internet peers. Unfortunately, far too many organisations (especially those outside the tech industry) still trust in perimeter defence for their internal network security, and hence tolerate the use of unsecured connections, or skipping certificate validation internally. 
This is actually a really terrible idea, but it's still incredibly common due to the general failure of the technology industry to take usability issues seriously when we design security systems - doing the wrong "unsafe" thing is genuinely easier than doing things right. We have enough evidence now to be able to say (as Alex does in PEP 476) that it has been comprehensively demonstrated that "opt-in security" really just means "security failures are common and silent by default". We've seen it with C buffer overflow vulnerabilities, we've seen it with plain text communication links, we've seen it with SSL certificate validation - the vast majority of users and developers will just run with the default behaviour of the platform or application they're using, even if those defaults have serious problems. As the saying goes, "you can't document your way out of a usability problem" - unsecured connections, and connections that are vulnerable to a man-in-the-middle attack, appear to work for all functional purposes, they're just vulnerable to monitoring and subversion. It turns out "opt-out security with a global off switch" isn't actually much better when it comes to changing *existing* behaviours, as people just turn the new security features off and continue on as they were, rather than figuring out what dangers the new security system is trying to warn them about and encourage them to pre-emptively address them. Offering that kind of flag may sometimes be a necessary transition phase (or we wouldn't have things like "setenforce 0" for SELinux) but it should be considered an absolute last resort. In the specific case of network security, we need to take responsibility as an industry for the endemic failure of the networking infrastructure to provide robust end user security and privacy, and figure out how to get to a state where encrypted and authenticated network connections are as easy to use as unsecured ones. 
I see Alex's PEP (along with the preceding work on the SSL module that makes it feasible) as a significant step in that direction. At the same time, we need to account for the fact that most existing organisations still trust in perimeter defence for their internal network security, and hence tolerate (or even actively encourage) the use of unsecured connections, or skipping certificate validation, internally. This is actually a really terrible idea, but it's still incredibly common due to the general failure of the technology industry to take usability issues seriously when we design security systems (at least until recently) - doing the wrong "unsafe" thing is genuinely easier than doing things right. We can, and should, tackle this as a design problem, and ensure PEP 476 covers this scenario adequately. We also need to make sure we do it in a way that avoids placing any significant additional burdens on teams that may already be trying to explain what "long term maintenance" means, and why the flow of free feature releases for the Python 2 series stopped. This message is already rather long, however, so I'll go into more technical details in a separate reply to David's question. Regards, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From donald at stufft.io Sun Aug 31 08:16:55 2014 From: donald at stufft.io (Donald Stufft) Date: Sun, 31 Aug 2014 02:16:55 -0400 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! 
In-Reply-To: References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> Message-ID: > On Aug 31, 2014, at 2:09 AM, Nick Coghlan wrote: > > At the same time, we need to account for the fact that most existing > organisations still trust in perimeter defence for their internal > network security, and hence tolerate (or even actively encourage) the > use of unsecured connections, or skipping certificate validation, > internally. This is actually a really terrible idea, but it's still > incredibly common due to the general failure of the technology > industry to take usability issues seriously when we design security > systems (at least until recently) - doing the wrong "unsafe" thing is > genuinely easier than doing things right. > Just a quick clarification in order to be a little clearer, this change will (obviously) only affect those who trust perimeter security *and* decided to install an invalid certificate instead of just using HTTP. I'm not saying that this doesn't happen, just being specific (I'm not actually sure why they would install a TLS certificate at all if they are trusting perimeter security, but I'm sure folks do). --- Donald Stufft PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Sun Aug 31 08:24:43 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 31 Aug 2014 16:24:43 +1000 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: <20140831022149.F0493250E30@webabinitio.net> References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> Message-ID: On 31 August 2014 12:21, R. David Murray wrote: > Do those knobs allow one to instruct urllib to accept an invalid > certificate without changing the program code? 
My first reply ended up being a context dump of the challenges created by legacy corporate intranets that may not be immediately obvious to folks that spend most of their time working on or with the public internet. I decided to split these more technical details out to a new reply for the benefit of folks that already know all that history :) To answer David's specific question, the existing knobs at the OpenSSL level (SSL_CERT_DIR and SSL_CERT_FILE ) let people add an internal CA, opt out of the default CA system, and trust *specific* self-signed certs. What they don't allow is a global "trust any cert" setting - exceptions need to be added at the individual cert level or at the CA level, or the application needs to offer an option to not do cert validation at all. That "trust anything" option at the platform level is the setting that is a really bad idea - if an organisation thinks it needs that (because they have a lot of self-signed certs, but aren't verifying their HTTPS connections to those servers), then what they really need is an internal CA, where their systems just need to be set up to trust the internal CA in addition to the platform CA certs. With Alex's proposal, organisations that are already running an internal CA should be just fine - Python 3.5 will see the CA cert in the platform cert store and accept certs signed by it as valid. (Note: the Python 3.4 warning should take this into account, which could be a problem since we don't currently do validity checks against the platform store by default. 
The PEP needs to cover the mechanics of that in more detail, as I think it means we'll need to make *some* changes to the default configuration even in Python 3.4 to get accurate validity data back from OpenSSL) However, we also need to accept that there's a reason browser vendors still offer "click through insecurity" for sites with self-signed certificates, and tools like wget/curl offer the option to say "don't check the certificate": these are necessary compromises to make SSL based network connections actually work on many current corporate intranets. It is corporate environments that also make it desirable to be able to address this potential problem at a *user* level, since many Python users in a large organisations are actually running Python entirely out of their home directories, rather than as a system installation (they may not even have admin access to their own systems). My suggestion at this point is that we take a leaf from both browser vendors and the design of SSH: make it easy to *add* a specific self-signed cert to the set a *particular user* trusts by default (preferably *only* for a particular host, to limit the power of such certs). "python -m ssl" doesn't currently do anything interesting, so it could be used to provide an API for managing that user level certificate store. A Python-specific user level cert store is something that could be developed as a PyPI library for Python 2.7.9+ and 3.4+ (Is cert management considered in scope for cryptography.io? If so, that could be a good home). So while I agree with the intent of PEP 476, and like the suggested end state, I'm back to thinking that the transition plan for existing corporate users needs more work before it can be accepted. This is especially true since it becomes another barrier to migrating from Python 2.7 to Python 3.5+ (a warning in Python 3.4 doesn't help with that aspect, although a new -3 warning might). 
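The "trust this one self-signed cert" pattern described above is already expressible per-application with today's API; a hedged illustration (the certificate file name is hypothetical, standing in for whatever a user-level store would manage):

```python
import ssl

# Trust exactly one known self-signed certificate for a test server,
# while leaving full validation switched on -- the per-host analogue
# of SSH's known_hosts file, done at the application level.
ctx = ssl.create_default_context()

# Hypothetical PEM file containing the server's self-signed cert:
# ctx.load_verify_locations(cafile="testserver-selfsigned.pem")

# Validation stays enabled; only an extra trust anchor is added.
assert ctx.verify_mode == ssl.CERT_REQUIRED
```

A user-level store would essentially automate maintaining that extra CA file and loading it into every default context.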
A third party module that offers a user level certificate store, and a gevent.monkey style way of opting in to this behaviour for existing Python versions would be one way to provide a more compelling transition plan. Regards, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From ncoghlan at gmail.com Sun Aug 31 08:45:42 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Sun, 31 Aug 2014 16:45:42 +1000 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> Message-ID: On 31 August 2014 16:16, Donald Stufft wrote: > > On Aug 31, 2014, at 2:09 AM, Nick Coghlan wrote: > > At the same time, we need to account for the fact that most existing > organisations still trust in perimeter defence for their internal > network security, and hence tolerate (or even actively encourage) the > use of unsecured connections, or skipping certificate validation, > internally. This is actually a really terrible idea, but it's still > incredibly common due to the general failure of the technology > industry to take usability issues seriously when we design security > systems (at least until recently) - doing the wrong "unsafe" thing is > genuinely easier than doing things right. > > > Just a quick clarification in order to be a little clearer, this change will > (obviously) only effect those who trust perimeter security *and* decided to > install an invalid certificate instead of just using HTTP. I'm not saying > that > this doesn't happen, just being specific (I'm not actually sure why they > would > install a TLS certificate at all if they are trusting perimeter security, > but > I'm sure folks do). 
It's the end result when a company wide edict to use HTTPS isn't backed up by the necessary documentation and training on how to get a properly signed cert from your internal CA (or, even better, when such an edict comes down without setting up an internal CA first). Folks hit the internet instead, find instructions on creating a self-signed cert, install that, and tell their users to ignore the security warning and accept the cert. Historically, Python clients have "just worked" in environments that required a click-through on the browser side, since you had to opt in to checking the certificates properly. Self-signed certificates can also be really handy for doing local testing - you're not really aiming to authenticate the connection in that case, you're just aiming to test that the secure connection machinery is all working properly. (As far as the "what about requests?" question goes - that's in a similar situation to Go, where being new allows it to choose different defaults, and folks for whom those defaults don't work just won't use it. There's also the fact that most corporate Python users are unlikely to know that PyPI exists, let alone that it contains a module called "requests" that does SSL certificate validation by default. Those of us in the corporate world that interact directly with upstream are still the exception rather than the rule) Cheers, Nick. -- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia From cory at lukasa.co.uk Sun Aug 31 12:42:12 2014 From: cory at lukasa.co.uk (Cory Benfield) Date: Sun, 31 Aug 2014 11:42:12 +0100 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! 
In-Reply-To: References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> Message-ID: On 31 August 2014 07:45, Nick Coghlan wrote: > There's also the fact that most corporate Python users are > unlikely to know that PyPI exists, let alone that it contains a module > called "requests" that does SSL certificate validation by default. > Those of us in the corporate world that interact directly with > upstream are still the exception rather than the rule) I think this point deserves just a little bit more emphasis. This is why any solution that begins with 'use PyPI' is insufficient. I've worked on requests for 3 years now and most of my colleagues have never heard of it, and it's not because I don't talk about it (I talk about it all the time!). When building internal tools, corporate environments frequently restrict themselves to the standard library. This is because it's hard enough to get adoption of a tool when it requires a new language runtime, let alone if you have to get people ramped up on package distribution as well! I have enough trouble getting people to upgrade Python versions at work: trying to get them up to speed on pip and PyPI is worse. It is no longer tenable in the long term for Python to trust the network: you're right in this regard Nick. In the past, on this very list, I've been bullish about fixing up Python's network security position. I was an aggressive supporter of PEP 466 (and there are some corners of PEP 466 that I think didn't go far enough). However, I'm with you here: we should do this once and do it right. Corporate users *will* bump into it, and they will look to the docs to fix it. That fix needs to be easy and painless. A user-level cert store is a good start, and if cryptography.io aren't interested in it I might take a look at implementing it under the certifi.io umbrella instead. 
Cory From christian at python.org Sun Aug 31 13:18:28 2014 From: christian at python.org (Christian Heimes) Date: Sun, 31 Aug 2014 13:18:28 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: <20140830002254.26351339@fsol> References: <5400DBC0.1020700@egenix.com> <5400F785.4090708@egenix.com> <20140830002254.26351339@fsol> Message-ID: <54030484.8080203@python.org> On 30.08.2014 00:22, Antoine Pitrou wrote: > SSL_CERT_DIR and SSL_CERT_FILE are used, if set, when > SSLContext.load_verify_locations() is called. > > Actually, come to think of it, this allows us to write a better > test for that method. Patch welcome! The environment vars are used only when SSLContext.set_default_verify_paths() is called. load_verify_locations() loads certificates from a given file, directory or memory but it doesn't look at the env vars. create_default_context() calls SSLContext.load_default_certs() when neither cafile, capath nor cadata is given as an argument. SSLContext.load_default_certs() then calls SSLContext.set_default_verify_paths(). However, there is a catch: SSLContext.set_default_verify_paths() is not called on Windows. In retrospect, it was a bad decision by me to omit the call. http://hg.python.org/cpython/file/164a17eca081/Lib/ssl.py#l376 Christian PS: SSL_CERT_DIR and SSL_CERT_FILE are the default names. It's possible to change the names in OpenSSL. ssl.get_default_verify_paths() returns the names and paths to the default verify locations.
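[Editor's note: the call chain Christian describes can be checked interactively. A minimal sketch; as his PS notes, the environment variable names reported here can differ in custom OpenSSL builds:

```python
import ssl

# The compiled-in default locations, plus the names of the environment
# variables that set_default_verify_paths() will honour.
paths = ssl.get_default_verify_paths()
print(paths.openssl_cafile_env, paths.openssl_cafile)
print(paths.openssl_capath_env, paths.openssl_capath)

# With no cafile/capath/cadata argument, create_default_context() calls
# load_default_certs(), which in turn calls set_default_verify_paths()
# (except on Windows). That is the only code path that consults
# SSL_CERT_FILE and SSL_CERT_DIR.
ctx = ssl.create_default_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # validation is on by default

# load_verify_locations() takes explicit locations and never looks at
# the environment variables:
# ctx.load_verify_locations(cafile="/path/to/internal-ca.pem")
```

Passing an explicit cafile/capath/cadata to create_default_context() bypasses the environment variables entirely.]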
From victor.stinner at gmail.com Sun Aug 31 14:44:35 2014 From: victor.stinner at gmail.com (Victor Stinner) Date: Sun, 31 Aug 2014 14:44:35 +0200 Subject: [Python-Dev] RFC: PEP 475, Retry system calls failing with EINTR Message-ID: HTML version: http://legacy.python.org/dev/peps/pep-0475/ PEP: 475 Title: Retry system calls failing with EINTR Version: $Revision$ Last-Modified: $Date$ Author: Charles-François Natali , Victor Stinner Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 29-July-2014 Python-Version: 3.5 Abstract ======== Retry system calls failing with the ``EINTR`` error and recompute the timeout if needed. Rationale ========= Interrupted system calls ------------------------ On POSIX systems, signals are common. Your program must be prepared to handle them. Examples of signals: * The most common signal is ``SIGINT``, the signal sent when CTRL+c is pressed. By default, Python raises a ``KeyboardInterrupt`` exception when this signal is received. * When running subprocesses, the ``SIGCHLD`` signal is sent when a child process exits. * Resizing the terminal sends the ``SIGWINCH`` signal to the applications running in the terminal. * Putting the application in the background (ex: press CTRL-z and then type the ``bg`` command) sends the ``SIGCONT`` signal. Writing a signal handler is difficult: only "async-signal safe" functions can be called. For example, ``printf()`` and ``malloc()`` are not async-signal safe. When a signal is sent to a process calling a system call, the system call can fail with the ``EINTR`` error to give the program an opportunity to handle the signal without the restriction on signal-safe functions. Depending on the platform, the system call and the ``SA_RESTART`` flag, the system call may or may not fail with ``EINTR``. If the signal handler was set with the ``SA_RESTART`` flag set, the kernel retries the system call instead of failing with ``EINTR``.
For example, ``read()`` is retried, whereas ``select()`` is not retried. The Python function ``signal.signal()`` clears the ``SA_RESTART`` flag when setting the signal handler: all system calls should fail with ``EINTR`` in Python. The problem is that handling ``EINTR`` should be done for all system calls. The problem is similar to handling errors in the C language, which does not have exceptions: you must check all function returns to check for errors, and usually duplicate the code checking for errors. Python does not have this issue: it uses exceptions to notify errors. Current status -------------- Currently in Python, the code to handle the ``InterruptedError`` exception (``EINTR`` error) is duplicated on a case-by-case basis. Only a few Python modules handle this exception, and fixes usually took several years to cover a whole module. Example of code retrying ``file.read()`` on ``InterruptedError``:: while True: try: data = file.read(size) break except InterruptedError: continue List of Python modules of the standard library which handle ``InterruptedError``: * ``asyncio`` * ``asyncore`` * ``io``, ``_pyio`` * ``multiprocessing`` * ``selectors`` * ``socket`` * ``socketserver`` * ``subprocess`` Other programming languages like Perl, Java and Go already retry system calls failing with ``EINTR``. Use Case 1: Don't Bother With Signals ------------------------------------- In most cases, you don't want to be interrupted by signals and you don't expect to get ``InterruptedError`` exceptions. For example, do you really want to write such complex code for a "Hello World" example? :: while True: try: print("Hello World") break except InterruptedError: continue ``InterruptedError`` can happen in unexpected places. For example, ``os.close()`` and ``FileIO.close()`` can raise ``InterruptedError``: see the article `close() and EINTR `_. The `Python issues related to EINTR`_ section below gives examples of bugs caused by ``EINTR``.
The expectation is that Python hides the ``InterruptedError``: retry system calls failing with the ``EINTR`` error. Use Case 2: Be notified of signals as soon as possible ------------------------------------------------------ Sometimes, you expect some signals and you want to handle them as soon as possible. For example, you may want to quit a program immediately using the ``CTRL+c`` keyboard shortcut. Some signals are not interesting and should not interrupt the application. There are two options to only interrupt an application on some signals: * Raise an exception in the signal handler, like ``KeyboardInterrupt`` for ``SIGINT`` * Use an I/O multiplexing function like ``select()`` with the Python signal "wakeup" file descriptor: see the function ``signal.set_wakeup_fd()``. Proposition =========== If a system call fails with ``EINTR``, Python must call signal handlers: call ``PyErr_CheckSignals()``. If a signal handler raises an exception, the Python function fails with the exception. Otherwise, the system call is retried. If the system call takes a timeout parameter, the timeout is recomputed. Modified functions ------------------ Example of functions that need to be modified: * ``os.read()``, ``io.FileIO.read()``, ``io.FileIO.readinto()`` * ``os.write()``, ``io.FileIO.write()`` * ``os.waitpid()`` * ``socket.accept()`` * ``socket.connect()`` * ``socket.recv()``, ``socket.recv_into()`` * ``socket.recvfrom()`` * ``socket.send()`` * ``socket.sendto()`` * ``time.sleep()`` * ``select.select()`` * ``select.poll()`` * ``select.epoll.poll()`` * ``select.devpoll.poll()`` * ``select.kqueue.control()`` * ``selectors.SelectSelector.select()`` and other selector classes Note: The ``selectors`` module already retries on ``InterruptedError``, but it doesn't recompute the timeout yet. Backward Compatibility ====================== Applications relying on the fact that system calls are interrupted with ``InterruptedError`` will hang.
The authors of this PEP don't think that such applications exist. If such applications exist, they are not portable and are subject to race conditions (deadlock if the signal comes before the system call). These applications must be fixed to handle signals differently, to have a reliable behaviour on all platforms and all Python versions. For example, use a signal handler which raises an exception, or use a wakeup file descriptor. For applications using event loops, ``signal.set_wakeup_fd()`` is the recommended option to handle signals. The signal handler writes signal numbers into the file descriptor and the event loop is awakened to read them. The event loop can handle these signals without the restriction of signal handlers. Appendix ======== Wakeup file descriptor ---------------------- Since Python 3.3, ``signal.set_wakeup_fd()`` writes the signal number into the file descriptor, whereas it only wrote a null byte before. This makes it possible to handle different signals using the wakeup file descriptor. Linux has a ``signalfd()`` function which provides more information on each signal. For example, it's possible to know the pid and uid of the process that sent the signal. This function is not exposed in Python yet (see the `issue 12304 `_). On Unix, the ``asyncio`` module uses the wakeup file descriptor to wake up its event loop. Multithreading -------------- A C signal handler can be called from any thread, but the Python signal handler should only be called in the main thread. Python has a ``PyErr_SetInterrupt()`` function which calls the ``SIGINT`` signal handler to interrupt the Python main thread. Signals on Windows ------------------ Control events ^^^^^^^^^^^^^^ Windows uses "control events": * ``CTRL_BREAK_EVENT``: Break (``SIGBREAK``) * ``CTRL_CLOSE_EVENT``: Close event * ``CTRL_C_EVENT``: CTRL+C (``SIGINT``) * ``CTRL_LOGOFF_EVENT``: Logoff * ``CTRL_SHUTDOWN_EVENT``: Shutdown The `SetConsoleCtrlHandler() function `_ can be used to install a control handler.
The ``CTRL_C_EVENT`` and ``CTRL_BREAK_EVENT`` events can be sent to a process using the `GenerateConsoleCtrlEvent() function `_. This function is exposed in Python as ``os.kill()``. Signals ^^^^^^^ The following signals are supported on Windows: * ``SIGABRT`` * ``SIGBREAK`` (``CTRL_BREAK_EVENT``): signal only available on Windows * ``SIGFPE`` * ``SIGILL`` * ``SIGINT`` (``CTRL_C_EVENT``) * ``SIGSEGV`` * ``SIGTERM`` SIGINT ^^^^^^ The default Python signal handler for ``SIGINT`` sets a Windows event object: ``sigint_event``. ``time.sleep()`` is implemented with ``WaitForSingleObjectEx()``; it waits for the ``sigint_event`` object using the ``time.sleep()`` parameter as the timeout, so the sleep can be interrupted by ``SIGINT``. ``_winapi.WaitForMultipleObjects()`` automatically adds ``sigint_event`` to the list of watched handles, so it can also be interrupted. ``PyOS_StdioReadline()`` also uses ``sigint_event``, when ``fgets()`` fails, to check if Ctrl-C or Ctrl-Z was pressed. Links ----- Misc ^^^^ * `glibc manual: Primitives Interrupted by Signals `_ * `Bug #119097 for perl5: print returning EINTR in 5.14 `_. Python issues related to EINTR ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The main issue is: `handle EINTR in the stdlib `_.
Open issues: * `Add a new signal.set_wakeup_socket() function `_ * `signal.set_wakeup_fd(fd): set the fd to non-blocking mode `_ * `Use a monotonic clock to compute timeouts `_ * `sys.stdout.write on OS X is not EINTR safe `_ * `platform.uname() not EINTR safe `_ * `asyncore does not handle EINTR in recv, send, connect, accept, `_ * `socket.create_connection() doesn't handle EINTR properly `_ Closed issues: * `Interrupted system calls are not retried `_ * `Solaris: EINTR exception in select/socket calls in telnetlib `_ * `subprocess: Popen.communicate() doesn't handle EINTR in some cases `_ * `multiprocessing.util._eintr_retry doen't recalculate timeouts `_ * `file readline, readlines & readall methods can lose data on EINTR `_ * `multiprocessing BaseManager serve_client() does not check EINTR on recv `_ * `selectors behaviour on EINTR undocumented `_ * `asyncio: limit EINTR occurrences with SA_RESTART `_ * `smtplib.py socket.create_connection() also doesn't handle EINTR properly `_ * `Faulty RESTART/EINTR handling in Parser/myreadline.c `_ * `test_httpservers intermittent failure, test_post and EINTR `_ * `os.spawnv(P_WAIT, ...) on Linux doesn't handle EINTR `_ * `asyncore fails when EINTR happens in pol `_ * `file.write and file.read don't handle EINTR `_ * `socket.readline() interface doesn't handle EINTR properly `_ * `subprocess is not EINTR-safe `_ * `SocketServer doesn't handle syscall interruption `_ * `subprocess deadlock when read() is interrupted `_ * `time.sleep(1): call PyErr_CheckSignals() if the sleep was interrupted `_ * `siginterrupt with flag=False is reset when signal received `_ * `need siginterrupt() on Linux - impossible to do timeouts `_ * `[Windows] Can not interrupt time.sleep() `_ Python issues related to signals ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Open issues: * `signal.default_int_handler should set signal number on the raised exception `_ * `expose signalfd(2) in the signal module `_ * `missing return in win32_kill? 
`_ * `Interrupts are lost during readline PyOS_InputHook processing `_ * `cannot catch KeyboardInterrupt when using curses getkey() `_ * `Deferred KeyboardInterrupt in interactive mode `_ Closed issues: * `sys.interrupt_main() `_ Copyright ========= This document has been placed in the public domain. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End: From rdmurray at bitdance.com Sun Aug 31 16:16:27 2014 From: rdmurray at bitdance.com (R. David Murray) Date: Sun, 31 Aug 2014 10:16:27 -0400 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> Message-ID: <20140831141628.A6D59250E29@webabinitio.net> On Sun, 31 Aug 2014 16:45:42 +1000, Nick Coghlan wrote: > On 31 August 2014 16:16, Donald Stufft wrote: > > > > On Aug 31, 2014, at 2:09 AM, Nick Coghlan wrote: > > > > At the same time, we need to account for the fact that most existing > > organisations still trust in perimeter defence for their internal > > network security, and hence tolerate (or even actively encourage) the > > use of unsecured connections, or skipping certificate validation, > > internally. This is actually a really terrible idea, but it's still > > incredibly common due to the general failure of the technology > > industry to take usability issues seriously when we design security > > systems (at least until recently) - doing the wrong "unsafe" thing is > > genuinely easier than doing things right. > > > > > > Just a quick clarification in order to be a little clearer, this change will > > (obviously) only effect those who trust perimeter security *and* decided to > > install an invalid certificate instead of just using HTTP. 
I'm not saying > > that > > this doesn't happen, just being specific (I'm not actually sure why they > > would > > install a TLS certificate at all if they are trusting perimeter security, > > but > > I'm sure folks do). > > It's the end result when a company wide edict to use HTTPS isn't > backed up by the necessary documentation and training on how to get a > properly signed cert from your internal CA (or, even better, when such > an edict comes down without setting up an internal CA first). Folks > hit the internet instead, find instructions on creating a self-signed > cert, install that, and tell their users to ignore the security > warning and accept the cert. Historically, Python clients have "just > worked" in environments that required a click-through on the browser > side, since you had to opt in to checking the certificates properly. > > Self-signed certificates can also be really handy for doing local > testing - you're not really aiming to authenticate the connection in > that case, you're just aiming to test that the secure connection > machinery is all working properly. Self-signed certificates are not crazy in an internal corporate environment even when properly playing the defense in depth game. Once you've acked the cert the first time, you will be warned if it changes (like an ssh host key). Sure, as Nick says the corp could set up an internal signing authority and make sure everyone has their CA...and they *should*...but realistically, that is probably relatively rare at the moment, because it is not particularly easy to accomplish (distributing the CA everywhere it needs to go is still a Hard Problem, though it has gotten a lot better). Given the reality of human nature, even when the documentation accompanying the HTTPS initiative is good, there will *still* be someone who hasn't followed the internal rules, yet you really need to talk to the piece of infrastructure they are maintaining. 
At least that one is a short-term problem (for some definition of "short" that may be several months long), but it does exist. In addition, as has been mentioned before, self-signed certs are often embedded in *devices* from vendors (I'm looking at you, Cisco). This is another area where security consciousness has gotten better (the cert exists) but isn't good yet (the cert is self-signed and replacing it isn't trivial when it is even possible; and, because the self-signed cert happens by default....it gets left in place). And in the case of those embedded certs, the cert can wind up *invalid* (expired) as well as self-signed. (This last item is where my concern about being able to talk to invalid certs comes from.) And yes, I have encountered all of this in the wild. --David From stefan at bytereef.org Sun Aug 31 16:51:24 2014 From: stefan at bytereef.org (Stefan Krah) Date: Sun, 31 Aug 2014 16:51:24 +0200 Subject: [Python-Dev] [libmpdec] mpdecimal-2.4.1 released Message-ID: <20140831145124.GA13716@sleipnir.bytereef.org> Hi, I've released mpdecimal-2.4.1: http://www.bytereef.org/mpdecimal/changelog.html da74d3cfab559971a4fbd4fb506e1b4498636eb77d0fd09e44f8e546d18ac068 mpdecimal-2.4.1.tar.gz Starting with Python 3.4.2, this version should be used for an external libmpdec. Stefan Krah From marko at pacujo.net Sun Aug 31 17:19:32 2014 From: marko at pacujo.net (Marko Rauhamaa) Date: Sun, 31 Aug 2014 18:19:32 +0300 Subject: [Python-Dev] RFC: PEP 475, Retry system calls failing with EINTR In-Reply-To: (Victor Stinner's message of "Sun, 31 Aug 2014 14:44:35 +0200") References: Message-ID: <87lhq4sn3v.fsf@elektro.pacujo.net> Victor Stinner : > Proposition > =========== > > If a system call fails with ``EINTR``, Python must call signal > handlers: call ``PyErr_CheckSignals()``. If a signal handler raises > an exception, the Python function fails with the exception. > Otherwise, the system call is retried.
If the system call takes a > timeout parameter, the timeout is recomputed. Signals are tricky and easy to get wrong, to be sure, but I think it is dangerous for Python to unconditionally commandeer signal handling. If the proposition is accepted, there should be a way to opt out. Marko From christian at python.org Sun Aug 31 18:27:48 2014 From: christian at python.org (Christian Heimes) Date: Sun, 31 Aug 2014 18:27:48 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: <20140831141628.A6D59250E29@webabinitio.net> References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> <20140831141628.A6D59250E29@webabinitio.net> Message-ID: <54034D04.7000304@python.org> On 31.08.2014 16:16, R. David Murray wrote: > Self-signed certificates are not crazy in an internal corporate > environment even when properly playing the defense in depth game. Once > you've acked the cert the first time, you will be warned if it changes > (like an ssh host key). Sure, as Nick says the corp could set up an > internal signing authority and make sure everyone has their CA...and > they *should*...but realistically, that is probably relatively rare at > the moment, because it is not particularly easy to accomplish > (distributing the CA everywhere it needs to go is still a Hard Problem, > though it has gotten a lot better). It's very simple to trust a self-signed certificate: just download it and stuff it into the trust store. That's all. A self-signed certificate acts as its own root CA (so to speak). But there is a downside, too. The certificate is trusted for any and all connections. Python's SSL module has no way to trust a specific certificate for a host.
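[Editor's note: a sketch of what trusting a specific self-signed certificate looks like at the ssl-module level, without touching any system-wide store. The PEM path is hypothetical:

```python
import os
import ssl

# Hypothetical path to the peer's self-signed certificate, in PEM format.
SELF_SIGNED_PEM = "internal-service.pem"

def context_trusting(pem_path):
    # A self-signed certificate acts as its own root CA, so loading it
    # as the sole trusted CA file is enough for OpenSSL to verify the
    # peer presenting that exact certificate.
    return ssl.create_default_context(cafile=pem_path)

# As Christian notes, the trust is context-wide rather than per-host:
# any handshake made through this context accepts a chain ending in
# this certificate, whatever host it connects to.
if os.path.exists(SELF_SIGNED_PEM):
    ctx = context_trusting(SELF_SIGNED_PEM)
```

Hostname checking still applies, so the certificate's subject must match the name passed as server_hostname when wrapping the socket.]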
Christian From p.f.moore at gmail.com Sun Aug 31 19:03:30 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Sun, 31 Aug 2014 18:03:30 +0100 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: <54034D04.7000304@python.org> References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> <20140831141628.A6D59250E29@webabinitio.net> <54034D04.7000304@python.org> Message-ID: On 31 August 2014 17:27, Christian Heimes wrote: > It's very simple to trust a self-signed certificate: just download it > and stuff it into the trust store. "Stuff it into the trust store" is the hard bit, though. I have honestly no idea how to do that. Or if it's temporary (which it likely is) how to manage it - delete it when I no longer need it, list what junk I've added over time, etc. And equally relevantly, no idea how to do that in a way that won't clash with my company's policies... Paul From christian at python.org Sun Aug 31 19:23:53 2014 From: christian at python.org (Christian Heimes) Date: Sun, 31 Aug 2014 19:23:53 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> Message-ID: <54035A29.9020108@python.org> On 31.08.2014 08:24, Nick Coghlan wrote: > To answer David's specific question, the existing knobs at the OpenSSL > level (SSL_CERT_DIR and SSL_CERT_FILE ) let people add an internal CA, > opt out of the default CA system, and trust *specific* self-signed > certs. This works only on Unix platforms iff SSL_CERT_DIR and SSL_CERT_FILE are both set to a non-empty string that points to non-existing files or something like /dev/null. On Windows my enhancement will always cause the system trust store to kick in. 
There is currently no way to disable the Windows system store for ssl.create_default_context() and ssl._create_stdlib_context() with the functions' default arguments. On Mac OS X the situation is even more obscure. Apple's OpenSSL binaries are using Apple's Trust Evaluation Agent. You have to set OPENSSL_X509_TEA_DISABLE=1 in order to prevent the agent from adding trusted certs from the OS X keychain. Hynek Schlawack did a deep dive into it. https://hynek.me/articles/apple-openssl-verification-surprises/ > A Python-specific user level cert store is something that could be > developed as a PyPI library for Python 2.7.9+ and 3.4+ (Is cert > management considered in scope for cryptography.io? If so, that could > be a good home). Python's SSL module is lacking some functionality in order to implement a fully functional cert store. * no verify hook to verify each certificate in the chain like https://www.openssl.org/docs/ssl/SSL_CTX_set_cert_verify_callback.html http://linux.die.net/man/3/x509_store_ctx_set_verify_cb /api/ssl.html#OpenSSL.SSL.Context.set_verify * no way to get the full cert chain including the root certificate. * no API to get the subject public key information (SPKI). The SPKI hash can be used to identify a certificate. For example it's used in Google's CRLSet. http://dev.chromium.org/Home/chromium-security/crlsets * the cert validation exception could use some additional information. There are probably some more things missing. An X509 object would help, too.
In-Reply-To: References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> <20140831141628.A6D59250E29@webabinitio.net> <54034D04.7000304@python.org> Message-ID: Le 31/08/2014 19:03, Paul Moore a écrit : > On 31 August 2014 17:27, Christian Heimes wrote: >> It's very simple to trust a self-signed certificate: just download it >> and stuff it into the trust store. > > "Stuff it into the trust store" is the hard bit, though. I have > honestly no idea how to do that. You certainly shouldn't do so. If an application has special needs that require trusting a self-signed certificate, then it should expose a configuration setting to let users specify the cert's location. Stuffing self-signed certs into the system trust store is really a measure of last resort. There's another case which isn't solved by this, though, which is when a cert is invalid. The common situation is that it has expired (renewing certs is a PITA and therefore expired certs are more common than they should be). In this case, there is no way to whitelist it: you have to disable certificate checking altogether. This can be exposed by the application as a configuration option if necessary, as well. Regards Antoine. From p.f.moore at gmail.com Sun Aug 31 20:28:58 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Sun, 31 Aug 2014 19:28:58 +0100 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> <20140831141628.A6D59250E29@webabinitio.net> <54034D04.7000304@python.org> Message-ID: On 31 August 2014 18:29, Antoine Pitrou wrote: > If an application has special needs that require trusting a self-signed > certificate, then it should expose a configuration setting to let users > specify the cert's location.
I can't see how that would be something the application would know. For example, pip allows me to specify an "alternate cert bundle" but not a single additional cert. So IIUC, I can't use my local index that serves https using a self-signed cert. I'd find it hard to argue that it's pip's responsibility to think of that use case - pretty much any program that interacts with a web service *might* need to interact with a self-signed dummy version, if only under test conditions. Or did you mean that Python should provide such a setting that would cover all applications written in Python? Paul From antoine at python.org Sun Aug 31 20:37:50 2014 From: antoine at python.org (Antoine Pitrou) Date: Sun, 31 Aug 2014 20:37:50 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> <20140831141628.A6D59250E29@webabinitio.net> <54034D04.7000304@python.org> Message-ID: Le 31/08/2014 20:28, Paul Moore a écrit : > > I can't see how that would be something the application would know. > For example, pip allows me to specify an "alternate cert bundle" but > not a single additional cert. So IIUC, I can't use my local index that > serves https using a self-signed cert. I'd find it hard to argue that > it's pip's responsibility to think of that use case - pretty much any > program that interacts with a web service *might* need to interact > with a self-signed dummy version, if only under test conditions. Well, it's certainly pip's responsibility more than Python's. What would Python do? Provide a setting that would blindly add a cert for all uses of httplib? pip knows about the use cases here, Python doesn't. (perhaps you want to serve your local index using http, though) Regards Antoine.
From p.f.moore at gmail.com Sun Aug 31 21:12:28 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Sun, 31 Aug 2014 20:12:28 +0100 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> <20140831141628.A6D59250E29@webabinitio.net> <54034D04.7000304@python.org> Message-ID: On 31 August 2014 19:37, Antoine Pitrou wrote: > Well, it's certainly pip's responsibility more than Python's. What would > Python do? Provide a setting that would blindly add a cert for all uses of > httplib? That's more or less my point, pip doesn't have that much better idea than Python. I was talking about putting the cert in my local cert store, so that *I* can decide, and applications don't need to take special care to allow me to handle this case. You said that doing so was bad, but I don't see why. It seems to me that you're saying that I should raise a feature request for pip instead, which seems unreasonable. Am I missing something? Paul From antoine at python.org Sun Aug 31 22:15:10 2014 From: antoine at python.org (Antoine Pitrou) Date: Sun, 31 Aug 2014 22:15:10 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> <20140831141628.A6D59250E29@webabinitio.net> <54034D04.7000304@python.org> Message-ID: Le 31/08/2014 21:12, Paul Moore a écrit : > On 31 August 2014 19:37, Antoine Pitrou wrote: >> Well, it's certainly pip's responsibility more than Python's. What would >> Python do? Provide a setting that would blindly add a cert for all uses of >> httplib? > > That's more or less my point, pip doesn't have that much better idea > than Python.
I was talking about putting the cert in my local cert > store, so that *I* can decide, and applications don't need to take > special care to allow me to handle this case. What do you call your local cert store? If you mean the system cert store, then that will affect all users. Regards Antoine. From christian at python.org Sun Aug 31 22:16:22 2014 From: christian at python.org (Christian Heimes) Date: Sun, 31 Aug 2014 22:16:22 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> <20140831141628.A6D59250E29@webabinitio.net> <54034D04.7000304@python.org> Message-ID: <54038296.7000207@python.org> On 31.08.2014 19:29, Antoine Pitrou wrote: > You certainly shouldn't do so. If an application has special needs that > require trusting a self-signed certificate, then it should expose a > configuration setting to let users specify the cert's location. Stuffing > self-signed certs into the system trust store is really a measure of > last resort. Correct! I merely wanted to state that OpenSSL can verify a self-signed certificate easily. The certificate 'just' has to be added to the SSLContext's store of trusted root certs. Somebody has to figure out how Python can accomplish the task. > There's another case which isn't solved by this, though, which is when a > cert is invalid. The common situation is that it has expired > (renewing certs is a PITA and therefore expired certs are more common > than they should be). In this case, there is no way to > whitelist it: you have to disable certificate checking altogether. This > can be exposed by the application as a configuration option if necessary, > as well. It's possible to ignore errors with a verify callback.
OpenSSL's wiki has an example for the expired certs http://wiki.openssl.org/index.php/Manual:X509_STORE_CTX_set_verify_cb%283%29#EXAMPLES Christian From p.f.moore at gmail.com Sun Aug 31 22:30:28 2014 From: p.f.moore at gmail.com (Paul Moore) Date: Sun, 31 Aug 2014 21:30:28 +0100 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> <20140831141628.A6D59250E29@webabinitio.net> <54034D04.7000304@python.org> Message-ID: On 31 August 2014 21:15, Antoine Pitrou wrote: > What do you call your local cert store? I was referring to Christian's comment > It's very simple to trust a self-signed certificate: just download it and stuff it into the trust store. From his recent response, I guess he meant the system store, and he agrees that this is a bad option. OK, that's fair, but: a) Is there really no OS-level personal trust store? I'm thinking of Windows here for my own personal use, but the same question applies elsewhere. b) I doubt my confusion over Christian's response is atypical. Based on what he said, if we hadn't had the subsequent discussion, I would probably have found a way to add a cert to "the store" without understanding the implications. While it's not Python's job to educate users, it would be a shame if its default behaviour led people to make ill-informed decisions. Maybe an SSL HOWTO would be a useful addition to the docs, if anyone feels motivated to write one. Regardless, thanks for the education! Paul From victor.stinner at gmail.com Sun Aug 31 22:59:16 2014 From: victor.stinner at gmail.com (Victor Stinner) Date: Sun, 31 Aug 2014 22:59:16 +0200 Subject: [Python-Dev] RFC: PEP 475, Retry system calls failing with EINTR In-Reply-To: <87lhq4sn3v.fsf@elektro.pacujo.net> References: <87lhq4sn3v.fsf@elektro.pacujo.net> Message-ID: Hi, Sorry but I don't understand your remark.
What is your problem with retrying syscalls on EINTR? Can you please elaborate? What do you mean by "get wrong"? Victor On Sunday 31 August 2014, Marko Rauhamaa wrote: > Victor Stinner : > > > Proposition > > =========== > > > > If a system call fails with ``EINTR``, Python must call signal > > handlers: call ``PyErr_CheckSignals()``. If a signal handler raises > > an exception, the Python function fails with the exception. > > Otherwise, the system call is retried. If the system call takes a > > timeout parameter, the timeout is recomputed. > > Signals are tricky and easy to get wrong, to be sure, but I think it is > dangerous for Python to unconditionally commandeer signal handling. If > the proposition is accepted, there should be a way to opt out. > > > Marko > -------------- next part -------------- An HTML attachment was scrubbed... URL: From marko at pacujo.net Sun Aug 31 23:19:15 2014 From: marko at pacujo.net (Marko Rauhamaa) Date: Mon, 01 Sep 2014 00:19:15 +0300 Subject: [Python-Dev] RFC: PEP 475, Retry system calls failing with EINTR In-Reply-To: (Victor Stinner's message of "Sun, 31 Aug 2014 22:59:16 +0200") References: <87lhq4sn3v.fsf@elektro.pacujo.net> Message-ID: <874mwss6gc.fsf@elektro.pacujo.net> Victor Stinner : > Sorry but I don't understand your remark. What is your problem with > retrying syscalls on EINTR? The application will often want the EINTR return (exception) instead of having the function resume on its own. > Can you please elaborate? What do you mean by "get wrong"? Proper handling of signals is difficult and at times even impossible. For example it is impossible to wake up reliably from the select(2) system call when a signal is generated (which is why Linux now has pselect).
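At the Python level, the proposition quoted above amounts to a retry loop like the following. This is a hypothetical sketch only (`read_retry` is not a real API; the actual PEP 475 changes live in the C implementation of each function, where signal handlers have already run by the time `InterruptedError` would reach Python code):

```python
import os

def read_retry(fd, n):
    # Hypothetical sketch of the proposition: when os.read() is
    # interrupted by a signal (EINTR), Python has already called the
    # signal handlers; if none of them raised, simply retry the call.
    while True:
        try:
            return os.read(fd, n)
        except InterruptedError:  # subclass of OSError, errno == EINTR
            continue
```

For system calls that take a timeout, the proposition additionally recomputes the remaining timeout before retrying; and a signal handler that raises (like the default SIGINT handler raising KeyboardInterrupt) still interrupts the call.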
Marko From ethan at stoneleaf.us Sun Aug 31 23:38:04 2014 From: ethan at stoneleaf.us (Ethan Furman) Date: Sun, 31 Aug 2014 14:38:04 -0700 Subject: [Python-Dev] RFC: PEP 475, Retry system calls failing with EINTR In-Reply-To: <874mwss6gc.fsf@elektro.pacujo.net> References: <87lhq4sn3v.fsf@elektro.pacujo.net> <874mwss6gc.fsf@elektro.pacujo.net> Message-ID: <540395BC.50701@stoneleaf.us> On 08/31/2014 02:19 PM, Marko Rauhamaa wrote: > Victor Stinner : > >> Sorry but I don't understand your remark. What is your problem with >> retrying syscalls on EINTR? > > The application will often want the EINTR return (exception) instead of > having the function resume on its own. Examples? As an ignorant person in this area, I do not know why I would ever want to have EINTR raised instead of just getting the results of, say, my read() call. -- ~Ethan~ From victor.stinner at gmail.com Sun Aug 31 23:38:38 2014 From: victor.stinner at gmail.com (Victor Stinner) Date: Sun, 31 Aug 2014 23:38:38 +0200 Subject: [Python-Dev] RFC: PEP 475, Retry system calls failing with EINTR In-Reply-To: <874mwss6gc.fsf@elektro.pacujo.net> References: <87lhq4sn3v.fsf@elektro.pacujo.net> <874mwss6gc.fsf@elektro.pacujo.net> Message-ID: On Sunday 31 August 2014, Marko Rauhamaa wrote: > Victor Stinner : > > > Sorry but I don't understand your remark. What is your problem with > > retrying syscalls on EINTR? > > The application will often want the EINTR return (exception) instead of > having the function resume on its own. This case is described as use case #2 in the PEP, so it is supported. As written in the PEP, if you want to be notified of the signal, set a signal handler which raises an exception. For example, the default signal handler for SIGINT raises KeyboardInterrupt. > > Can you please elaborate? What do you mean by "get wrong"? > > Proper handling of signals is difficult and at times even impossible.
> For example it is impossible to wake up reliably from the select(2) > system call when a signal is generated (which is why Linux now has > pselect). In my experience, using signal.set_wakeup_fd() works well with select(), even on Windows. The PEP promotes this. It is even thread safe. I don't know what issues signals have with select() on its own (without a file descriptor used to wake it up). Python now exposes signal.pthread_sigmask(); I don't know if it helps. In my experience, signals don't play well with multithreading. On FreeBSD, the signal is sent to a "random" thread, so you must have the same signal mask on all threads if you want to rely on them. But I don't get your point. How does this PEP make the situation worse? Victor -------------- next part -------------- An HTML attachment was scrubbed... URL: From ncoghlan at gmail.com Sun Aug 31 23:41:21 2014 From: ncoghlan at gmail.com (Nick Coghlan) Date: Mon, 1 Sep 2014 07:41:21 +1000 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> <20140831141628.A6D59250E29@webabinitio.net> <54034D04.7000304@python.org> Message-ID: On 1 Sep 2014 06:32, "Paul Moore" wrote: > > On 31 August 2014 21:15, Antoine Pitrou wrote: > > What do you call your local cert store? > > I was referring to Christian's comment > > It's very simple to trust a self-signed certificate: just download it and stuff it into the trust store. > > From his recent response, I guess he meant the system store, and he > agrees that this is a bad option. > > OK, that's fair, but: > > a) Is there really no OS-level personal trust store? I'm thinking of > Windows here for my own personal use, but the same question applies > elsewhere. > b) I doubt my confusion over Christian's response is atypical.
Based > on what he said, if we hadn't had the subsequent discussion, I would > probably have found a way to add a cert to "the store" without > understanding the implications. While it's not Python's job to educate > users, it would be a shame if its default behaviour led people to make > ill-informed decisions. Right, this is why I came to the conclusion we need to follow the browser vendors' lead here and support a per-user, Python-specific supplementary certificate cache before we can start validating certs by default at the *Python* level. There are still too many failure modes for cert management on private networks for us to safely ignore the use case of needing to force connections to services with invalid certs. We don't need to *solve* that problem here today - we can push it back to Alex (and anyone else interested) as a building block to investigate providing as part of cryptography.io or certi.fi, with a view to making a standard library version of that (along with any SSL module updates) part of PEP 476. In the meantime, we can update the security considerations for the ssl module to make it clearer that the defaults are set up for trusted networks and that using it safely on the public internet may mean you're better off with a third-party library like requests or Twisted. (I'll start another thread shortly that is highly relevant to that topic.) Regards, Nick. > > Maybe an SSL HOWTO would be a useful addition to the docs, if anyone > feels motivated to write one. > > Regardless, thanks for the education! > > Paul > _______________________________________________ > Python-Dev mailing list > Python-Dev at python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com -------------- next part -------------- An HTML attachment was scrubbed...
URL: From christian at python.org Sun Aug 31 23:43:05 2014 From: christian at python.org (Christian Heimes) Date: Sun, 31 Aug 2014 23:43:05 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> Message-ID: <540396E9.2020800@python.org> On 31.08.2014 08:09, Nick Coghlan wrote: > As Antoine says here, I'm also opposed to adding more Python specific > configuration options. However, I think there may be something > worthwhile we can do that's closer to the way browsers work, and has > the significant benefit of being implementable as a PyPI module first > (more on that in a separate reply). I'm on your and Antoine's side and strictly against any additional environment variables or command line arguments. That would make the whole validation process even more complex and harder to understand. There might be a better way to give people and companies the option to tune the SSL module to their needs. Python already has a customization hook for the site module called sitecustomize. How about another module named sslcustomize? Such a module could be used to tune the ssl module to the needs of users, e.g. configure a different default context, add certificates to a default context etc. Companies could install it in a system-global directory on their servers. Users could put it in their own user site directory, and each virtual env can even have one sslcustomize of its own. It's fully backward compatible, doesn't add any flags, and developers have the full power of Python for configuration and customization. Christian From antoine at python.org Sun Aug 31 23:53:14 2014 From: antoine at python.org (Antoine Pitrou) Date: Sun, 31 Aug 2014 23:53:14 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default!
In-Reply-To: References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> <20140831141628.A6D59250E29@webabinitio.net> <54034D04.7000304@python.org> Message-ID: On 31/08/2014 23:41, Nick Coghlan wrote: > Right, this is why I came to the conclusion we need to follow the browser > vendors' lead here and support a per-user, Python-specific supplementary > certificate cache before we can start validating certs by default at the > *Python* level. There are still too many failure modes for cert management > on private networks for us to safely ignore the use case of needing to > force connections to services with invalid certs. We are not ignoring that use case. The proper solution is simply to disable cert validation in the application code (or, for more sophisticated needs, provide an application configuration setting for cert validation). > In the meantime, we can update the security considerations for the ssl > module to make it clearer that the defaults are set up for trusted networks > and that using it safely on the public internet may mean you're better off > with a third-party library like requests or Twisted. No, you simply have to select the proper validation settings. Regards, Antoine. From christian at python.org Sun Aug 31 23:59:10 2014 From: christian at python.org (Christian Heimes) Date: Sun, 31 Aug 2014 23:59:10 +0200 Subject: [Python-Dev] PEP 476: Enabling certificate validation by default! In-Reply-To: References: <5400DBC0.1020700@egenix.com> <5400DD64.4050308@stoneleaf.us> <20140831032525.19b7e48c@fsol> <20140831022149.F0493250E30@webabinitio.net> <20140831141628.A6D59250E29@webabinitio.net> <54034D04.7000304@python.org> Message-ID: <54039AAE.4060002@python.org> On 31.08.2014 22:30, Paul Moore wrote: > On 31 August 2014 21:15, Antoine Pitrou wrote: >> What do you call your local cert store?
> > I was referring to Christian's comment >> It's very simple to trust a self-signed certificate: just download it and stuff it into the trust store. I was referring to the trust store of the SSLContext object and not to any kind of cert store of the operating system. Sorry for the confusion. > a) Is there really no OS-level personal trust store? I'm thinking of > Windows here for my own personal use, but the same question applies > elsewhere. Windows and OSX have superior cert stores compared to Linux and BSD. They have means for user and system-wide cert stores and trust settings. Linux just has one central directory or file with all trusted certs. My KDE has some options to disable certs, but I don't know how to make use of the configuration. Even worse: Linux distros don't differentiate between purposes. On Windows a user can trust a certificate for S/MIME but not for server auth or client auth. Ubuntu just puts all certificates in one directory, but that's wrong. :( https://bugs.launchpad.net/ubuntu/+source/ca-certificates/+bug/1207004 Christian
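Christian's distinction (the trust store of the SSLContext object, as opposed to the operating system's cert store) and Antoine's earlier point that validation is an application-level setting can both be sketched with the ssl module. This is a minimal sketch, and the cafile path in the final comment is hypothetical:

```python
import ssl

# ssl.create_default_context() (Python 3.4+) returns a context that
# validates certificates and checks hostnames.
strict = ssl.create_default_context()
assert strict.verify_mode == ssl.CERT_REQUIRED

# Disabling validation is a per-application decision made on the
# context itself; hostname checking must be disabled before the
# verify mode can be relaxed.
relaxed = ssl.create_default_context()
relaxed.check_hostname = False
relaxed.verify_mode = ssl.CERT_NONE

# Trusting a self-signed certificate means loading it into this
# context's own trust store, not into the OS store, e.g.:
# strict.load_verify_locations(cafile="my-server.pem")  # hypothetical path
```

Because the trust decision lives on the context, it applies only to connections made with that context, which avoids the system-store pitfalls discussed above.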